Conversation
The first commit is best reviewed with whitespace off.
jclulow
left a comment
With my apologies for the delays, I have taken an architectural level look at the backend parts of this change, and made some comments.
I have run out of time at this exact minute, but I'll come back separately and look at the additions to the bmat command. Those feel like they represent a Committed interface, but the backend storage stuff at least represents cache objects that we can always drop and recreate as needed.
One of the commits, _create wrappers for sending control messages in the agent, feels essentially separate, but also seems uncontroversial. Could we break that out into a separate PR so that we can get that in first? The GitHub review UI is even worse than I remember for large multi-commit changes, and it would help to keep separate things in separate PRs I think.
server/src/api/worker.rs
Outdated
/*
 * According to the documentation for the request, "The processing of a
 * CompleteMultipartUpload request could take several minutes to finalize".
 * We don't want to block the CI job until the request succeeds. Rather,
 * complete the upload in a background task, and immediately return.
 */
let c = c.clone();
let log = log.clone();
tokio::spawn(async move {
    info!(log, "started async task to complete cache upload {upload_id}");
    if let Err(err) = complete_cache_upload(c, upload, etags).await {
        error!(log, "failed to complete cache upload {upload_id}: {err}");
    } else {
        info!(log, "completed cache upload {upload_id}");
    }
});
I would rather we do wait for the commit to occur. We already have a similar construct in the existing file upload mechanism where the agent waits for commit to occur.
If we don't wait, we are potentially building up an unbounded amount of unobservable work in the background here without any back pressure.
It's also not clear to me what happens if we return success from the complete request before we've actually done the work, and then the buildomat server is restarted for some reason?
I would rather we do wait for the commit to occur. We already have a similar construct in the existing file upload mechanism where the agent waits for commit to occur.
Hmm, my understanding is that for output uploads we do need that consistency (as job dependencies do need to download them as inputs), but for caches there isn't really any case I can see where that consistency is needed. Dependent jobs shouldn't rely on caches at all (but rather on outputs), and for other jobs it doesn't matter whether the acknowledgement happens sync or async; the cache is going to be available to them at the same time either way.
Is there any case I'm missing where we would benefit from consistency? If not, it's kinda wasteful to delay the CI build until the write is acknowledged by S3, and I'd rather avoid it.
If we don't wait, we are potentially building up an unbounded amount of unobservable work in the background here without any back pressure.
It's also not clear to me what happens if we return success from the complete request before we've actually done the work, and then the buildomat server is restarted for some reason?
That's a fair point! I will move this from a one-off background task into a queue of cache uploads to complete (that a persistent background worker processes), so that we can control how many uploads are processed at the same time and we can restart them if the server gets killed in the meantime.
To address the second and third paragraphs of this: when the agent finishes uploading and the endpoint is called, the server sets the new time_finish column in cache_pending_upload to the current timestamp and returns.
There is then a new persistent background task that looks for finished uploads and completes them one at a time. This both solves the back pressure problem and ensures that only one multipart upload is completed at a time, handling the case of two caches with the same key being uploaded simultaneously.
I also added some logic to handle the case where Buildomat crashes between completing the S3 multipart upload and making changes to the database.
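The single-consumer worker described above could look something like this sketch, with the database and S3 calls stubbed out. All names here are illustrative, not the actual buildomat code; the queue stands in for cache_pending_upload rows whose time_finish is set.

```rust
use std::collections::VecDeque;

// Hypothetical stand-in for a finished-but-uncommitted upload row.
struct PendingCacheUpload {
    upload_id: u64,
    // the real row would also carry owner, cache key, s3_upload_id, etags...
}

// Stand-in for the S3 CompleteMultipartUpload call.
fn complete_multipart_upload(_u: &PendingCacheUpload) -> Result<(), String> {
    Ok(())
}

// Completing uploads one at a time bounds the amount of in-flight S3 work
// (back pressure) and serialises two uploads racing on the same cache key.
fn completion_worker(queue: &mut VecDeque<PendingCacheUpload>) -> Vec<u64> {
    let mut done = Vec::new();
    while let Some(u) = queue.pop_front() {
        match complete_multipart_upload(&u) {
            Ok(()) => done.push(u.upload_id),
            Err(_) => {
                // Leave it for the next poll tick; because the rows persist
                // in the database, a server restart also picks them up again.
                queue.push_front(u);
                break;
            }
        }
    }
    done
}
```

Because the pending rows are durable, restart recovery falls out of the same polling loop rather than needing any special-case code.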
server/src/api/worker.rs
Outdated
let upload_id =
    c.db.record_pending_cache_upload(
        j.owner,
        &path.name,
        w.id,
        size_bytes,
        &s3_upload_id,
        part_number,
    )
    .or_500()?;
We ought to do this registration prior to speaking with S3 at all, to convert the user-provided name into a ULID that represents the upload and use that ULID to create the multi-part upload. That way, we'll always have a record of every multi-part upload that we may start (assuming it doesn't fail and we don't crash) and won't accidentally leak any.
We should also probably sanitise the user-provided path name, and cap the user-provided cache object size, while doing so.
We ought to do this registration prior to speaking with S3 at all, to convert the user-provided name into a ULID that represents the upload and use that ULID to create the multi-part upload. That way, we'll always have a record of every multi-part upload that we may start (assuming it doesn't fail and we don't crash) and won't accidentally leak any.
Hmm, what would be the purpose of that database row if it doesn't contain the S3 upload ID? Note that even if we crash somewhere between calling S3 and storing the entry in the database we can still clean up afterwards, as the ListMultipartUploads S3 API would surface the orphaned upload.
We should also probably sanitise the user-provided path name, and cap the user-provided cache object size, while doing so.
Yes! Quotas were also something I was planning to defer until after the MVP implementation to reduce the review load, but I can implement a basic version of them now.
To be clear, I just mean setting a maximum size on a single object, even if we don't do other quota stuff yet.
Added a configurable cap for individual cache objects and validation for the cache key names.
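A validation routine along these lines keeps user-provided keys safe to embed in S3 object names. This is a sketch only: the length limit and allowed character set below are assumptions for illustration, not the rules the PR actually adopts.

```rust
// Sketch of cache key validation; MAX_LEN and the allowed character set are
// assumptions, not the PR's actual rules.
fn validate_cache_key(name: &str) -> Result<(), String> {
    const MAX_LEN: usize = 128;
    if name.is_empty() || name.len() > MAX_LEN {
        return Err(format!("cache key must be 1..={MAX_LEN} bytes"));
    }
    let ok = name
        .chars()
        .all(|c| c.is_ascii_alphanumeric() || matches!(c, '-' | '_' | '.' | '/'));
    if !ok {
        return Err("cache key contains invalid characters".into());
    }
    // Reject anything that could escape a key prefix when used as a path.
    if name.starts_with('/') || name.contains("..") {
        return Err("cache key must not be absolute or contain \"..\"".into());
    }
    Ok(())
}
```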
Done! #79
Addressed most of the feedback in the force push I did; there are still some open things where I'd love your thoughts on my rationale.
I noticed some slow compression and extraction in actual omicron builds that somehow I'm not seeing in my test builds; I'll investigate.
jclulow
left a comment
Thanks for the changes so far! I've done another pass and it's looking generally pretty good!
At a high level, I have an important question about the bmat interface (which is committed) for caching, with particular respect to how we expect people to hold it in their jobs.
Often in jobs people have used the bash errexit and pipefail options, so if the bmat cache family of commands exits with a non-zero status, that will presumably terminate the job over what is, by virtue of being a cache, an optional operation: one that exists not for correctness but just to improve performance. Do we want jobs to fail if bmat cache fails in an unexpected way, or should we always exit 0, so that people don't end up eventually just doing bmat cache .. || true everywhere regardless of whether the operation did anything (or whether the buildomat in question has an available S3 backend, say)?
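One way to implement the "always exit 0" behaviour raised above is to report cache failures on stderr but swallow them at the exit-status level. A rough sketch, using hypothetical function names rather than bmat's real CLI plumbing:

```rust
// Hypothetical sketch: map cache errors to a zero exit status so that jobs
// running under `set -o errexit -o pipefail` don't die on an optional
// caching step. Names are illustrative, not bmat's actual code.
fn run_cache_restore(key: &str) -> Result<bool, String> {
    // ...would talk to the agent; stubbed: empty key simulates "no backend".
    if key.is_empty() {
        return Err("no cache backend available".into());
    }
    Ok(false) // cache miss
}

fn cache_restore_exit_code(key: &str) -> i32 {
    match run_cache_restore(key) {
        Ok(true) => eprintln!("cache {key}: restored"),
        Ok(false) => eprintln!("cache {key}: miss, continuing without it"),
        Err(e) => eprintln!("cache {key}: warning: {e} (ignored)"),
    }
    0 // success either way, so errexit jobs don't need `|| true`
}
```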
c.db.record_pending_cache_upload(
    cache_id,
    j.owner,
    &path.name,
    w.id,
    size_bytes,
    &s3_upload_id,
    part_number,
)
.or_500()?;
It's not ideal that we can potentially create detritus within S3 in the form of dangling multipart uploads if we are interrupted prior to this database write, but I guess it is what it is for now.
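The dangling-upload concern could be mitigated by a periodic sweep comparing ListMultipartUploads output against the database, along these lines. This is a sketch with hypothetical names, not code from the PR:

```rust
use std::collections::HashSet;

// Sketch of an orphan sweep: any in-progress multipart upload that S3
// reports (via ListMultipartUploads) with no matching cache_pending_upload
// row was presumably left behind by a crash, and can be aborted with
// AbortMultipartUpload. Names are illustrative, not buildomat's code.
fn orphaned_uploads<'a>(
    s3_upload_ids: &'a [String],
    db_upload_ids: &HashSet<String>,
) -> Vec<&'a str> {
    s3_upload_ids
        .iter()
        .filter(|id| !db_upload_ids.contains(id.as_str()))
        .map(|id| id.as_str())
        .collect()
}
```

In practice the sweep would also want a grace period, so it doesn't abort an upload that was created just before its database row was written.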
At a high level, I have an important question about the bmat interface (which is committed) for caching, with particular respect to how we expect people to hold it in their jobs.
Often in jobs people have used the bash errexit and pipefail options, so if the bmat cache family of commands exits with a non-zero status, that will presumably terminate the job over what is, by virtue of being a cache, an optional operation: one that exists not for correctness but just to improve performance. Do we want jobs to fail if bmat cache fails in an unexpected way, or should we always exit 0, so that people don't end up eventually just doing bmat cache .. || true everywhere regardless of whether the operation did anything (or whether the buildomat in question has an available S3 backend, say)?
I think that makes sense; I'll sleep on it to see if I can think of a reason to error out on failure, but you are right that in most cases we should just ignore the error.
Another thing to consider is how to handle retries if requests for a cache fail: I think we do want to do a limited number of retries to account for spurious failures, but the code as written now would keep retrying forever (I noticed that when a job was running while I stopped my local buildomat server).
Another thing that comes to mind is the design of … If we had a way in the buildomat agent to queue a command to run when a job finishes (bikeshedding, …
This PR adds caching support in Buildomat, driven by the `bmat` CLI.
The core building blocks of this implementation are the `bmat cache save` and `bmat cache restore` commands. These commands are fairly low level: they require the user to explicitly provide the cache key, and `bmat cache save` also requires an exhaustive list of files to cache to be provided via stdin.
On top of this building block, the `bmat cache rust save` and `bmat cache rust restore` commands wrap the low level primitives and automatically determine the cache key and the list of files to cache. This allows us to enforce best practices (like caching only third-party dependencies out of the `target/` directory) and reduces the effort to adopt caching.
Caches are stored in S3-compatible object storage: when a worker requests a cache from the Buildomat server, the server generates a pre-signed download URL and returns it to the client. Similarly, when a worker wants to save a cache, the server returns the pre-signed upload URLs the agent has to upload the file to. Multipart uploads are used to increase concurrency during upload.
I chose to return pre-signed URLs to the worker rather than using the chunked upload facility in Buildomat for two reasons: (a) Buildomat chunked uploads are tied to the lifetime of a job, while caches will outlast jobs, and (b) streaming large uploads and downloads through the Buildomat server adds unnecessary latency and overhead.
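To illustrate the multipart side: the cache archive is split into byte ranges, each part gets its own pre-signed URL, and the parts can then be uploaded concurrently. A sketch of the range computation (illustrative only, not the PR's code; S3 requires every part except the last to be at least 5 MiB):

```rust
// Split an object of `size` bytes into (start, end) byte ranges of at most
// `part_size` bytes each. S3 requires all parts but the last to be at least
// 5 MiB. Illustrative sketch, not the PR's actual code.
const MIN_PART_SIZE: u64 = 5 * 1024 * 1024;

fn part_ranges(size: u64, part_size: u64) -> Vec<(u64, u64)> {
    assert!(part_size >= MIN_PART_SIZE);
    let mut parts = Vec::new();
    let mut off = 0;
    while off < size {
        let end = (off + part_size).min(size);
        parts.push((off, end));
        off = end;
    }
    parts
}
```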
On the agent side, I chose to execute the whole logic in the `bmat` CLI rather than the agent daemon, with the agent daemon's only responsibility being proxying API requests between `bmat` and the server. This allowed me to output more useful logs than would have been practical when implementing the logic in the agent daemon.
There are three things missing in this PR that I'm going to tackle in followups:
Fixes #32