Conversation
The first commit is best reviewed with whitespace off.
jclulow
left a comment
With my apologies for the delays, I have taken an architectural level look at the backend parts of this change, and made some comments.
I have run out of time at this exact minute, but I'll come back separately and look at the additions to the bmat command. Those feel like they represent a Committed interface, but the backend storage stuff at least represents cache objects that we can always drop and recreate as needed.
One of the commits, _create wrappers for sending control messages in the agent, feels essentially separate, but also seems uncontroversial. Could we break that out into a separate PR so that we can get that in first? The GitHub review UI is even worse than I remember for large multi-commit changes, and it would help to keep separate things in separate PRs I think.
server/src/api/worker.rs
Outdated
/*
 * According to the documentation for the request, "The processing of a
 * CompleteMultipartUpload request could take several minutes to finalize".
 * We don't want to block the CI job until the request succeeds. Rather,
 * complete the upload in a background task, and immediately return.
 */
let c = c.clone();
let log = log.clone();
tokio::spawn(async move {
    info!(log, "started async task to complete cache upload {upload_id}");
    if let Err(err) = complete_cache_upload(c, upload, etags).await {
        error!(log, "failed to complete cache upload {upload_id}: {err}");
    } else {
        info!(log, "completed cache upload {upload_id}");
    }
});
I would rather we do wait for the commit to occur. We already have a similar construct in the existing file upload mechanism where the agent waits for commit to occur.
If we don't wait, we are potentially building up an unbounded amount of unobservable work in the background here without any back pressure.
It's also not clear to me what happens if we return success from the complete request before we've actually done the work, and then the buildomat server is restarted for some reason?
I would rather we do wait for the commit to occur. We already have a similar construct in the existing file upload mechanism where the agent waits for commit to occur.
Hmm, my understanding is that for output uploads we do need that consistency (as job dependencies do need to download them as inputs), but for caches there isn't really any case I can see where that consistency is needed. Dependent jobs shouldn't rely on caches at all (but rather on outputs), and for other jobs it doesn't matter whether the acknowledgement happens sync or async; the cache is going to be available to them at the same time either way.
Is there any case I'm missing where we would benefit from consistency? If not, it's kinda wasteful to delay the CI build until the write is acknowledged by S3, and I'd rather avoid it.
If we don't wait, we are potentially building up an unbounded amount of unobservable work in the background here without any back pressure.
It's also not clear to me what happens if we return success from the complete request before we've actually done the work, and then the buildomat server is restarted for some reason?
That's a fair point! I will move this from a one-off background task into a queue of cache uploads to complete (that a persistent background worker processes), so that we can control how many uploads are processed at the same time and we can restart them if the server gets killed in the meantime.
To address the second and third paragraphs of this: when the agent finishes uploading and the endpoint is called, the server sets the new time_finish column in cache_pending_upload to the current timestamp and returns.
There is then a new persistent background task that looks for finished uploads and completes them one at a time. This both solves the back pressure problem and ensures that only one multipart upload is completed at a time, handling the case of two caches with the same key being uploaded simultaneously.
I also added some logic to handle the case where Buildomat crashes between completing the S3 multipart upload and making changes to the database.
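The single-consumer worker described above could look something like this sketch, with the database and S3 calls stubbed out. All names here are illustrative, not the actual buildomat code; the queue stands in for cache_pending_upload rows whose time_finish is set.

```rust
use std::collections::VecDeque;

// Hypothetical stand-in for a finished-but-uncommitted upload row.
struct PendingCacheUpload {
    upload_id: u64,
    // the real row would also carry owner, cache key, s3_upload_id, etags...
}

// Stand-in for the S3 CompleteMultipartUpload call.
fn complete_multipart_upload(_u: &PendingCacheUpload) -> Result<(), String> {
    Ok(())
}

// Completing uploads one at a time bounds the amount of in-flight S3 work
// (back pressure) and serialises two uploads racing on the same cache key.
fn completion_worker(queue: &mut VecDeque<PendingCacheUpload>) -> Vec<u64> {
    let mut done = Vec::new();
    while let Some(u) = queue.pop_front() {
        match complete_multipart_upload(&u) {
            Ok(()) => done.push(u.upload_id),
            Err(_) => {
                // Leave it for the next poll tick; because the rows persist
                // in the database, a server restart also picks them up again.
                queue.push_front(u);
                break;
            }
        }
    }
    done
}
```

Because the pending rows are durable, restart recovery falls out of the same polling loop rather than needing any special-case code.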
server/src/api/worker.rs
Outdated
let upload_id =
    c.db.record_pending_cache_upload(
        j.owner,
        &path.name,
        w.id,
        size_bytes,
        &s3_upload_id,
        part_number,
    )
    .or_500()?;
We ought to do this registration prior to speaking with S3 at all, to convert the user-provided name into a ULID that represents the upload and use that ULID to create the multi-part upload. That way, we'll always have a record of every multi-part upload that we may start (assuming it doesn't fail and we don't crash) and won't accidentally leak any.
We should also probably sanitise the user-provided path name, and cap the user-provided cache object size, while doing so.
We ought to do this registration prior to speaking with S3 at all, to convert the user-provided name into a ULID that represents the upload and use that ULID to create the multi-part upload. That way, we'll always have a record of every multi-part upload that we may start (assuming it doesn't fail and we don't crash) and won't accidentally leak any.
Hmm, what would be the purpose of that database row if it doesn't contain the S3 upload ID? Note that even if we crash somewhere between calling S3 and storing the entry in the database we can still clean up afterwards, as the ListMultipartUploads S3 API would surface the orphaned upload.
We should also probably sanitise the user-provided path name, and cap the user-provided cache object size, while doing so.
Yes! Quotas were also something I was planning to defer until after the MVP implementation to reduce the review load, but I can implement a basic version of them now.
To be clear, I just mean setting a maximum size on a single object, even if we don't do other quota stuff yet.
Added a configurable cap for individual cache objects and validation for the cache key names.
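A validation routine along these lines keeps user-provided keys safe to embed in S3 object names. This is a sketch only: the length limit and allowed character set below are assumptions for illustration, not the rules the PR actually adopts.

```rust
// Sketch of cache key validation; MAX_LEN and the allowed character set are
// assumptions, not the PR's actual rules.
fn validate_cache_key(name: &str) -> Result<(), String> {
    const MAX_LEN: usize = 128;
    if name.is_empty() || name.len() > MAX_LEN {
        return Err(format!("cache key must be 1..={MAX_LEN} bytes"));
    }
    let ok = name
        .chars()
        .all(|c| c.is_ascii_alphanumeric() || matches!(c, '-' | '_' | '.' | '/'));
    if !ok {
        return Err("cache key contains invalid characters".into());
    }
    // Reject anything that could escape a key prefix when used as a path.
    if name.starts_with('/') || name.contains("..") {
        return Err("cache key must not be absolute or contain \"..\"".into());
    }
    Ok(())
}
```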
Done! #79
Addressed most of the feedback in the force push I did; there are still some open things where I'd love your thoughts on my rationale.
I noticed some slow compression and extraction in actual omicron builds that somehow I'm not seeing in my test builds; I'll investigate.
jclulow
left a comment
Thanks for the changes so far! I've done another pass and it's looking generally pretty good!
At a high level, I have an important question about the bmat interface (which is committed) for caching, with particular respect to how we expect people to hold it in their jobs.
Often in jobs people have used the bash errexit and pipefail options, so if the bmat cache family of commands exits with a non-zero status, that will presumably terminate the job over what is, by virtue of being a cache, an optional operation: one that exists not for correctness but just to improve performance. Do we want jobs to fail if bmat cache fails in an unexpected way, or should we always exit 0, so that people don't end up eventually just doing bmat cache .. || true everywhere regardless of whether the operation did anything (or whether the buildomat in question has an available S3 backend, say)?
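One way to implement the "always exit 0" behaviour raised above is to report cache failures on stderr but swallow them at the exit-status level. A rough sketch, using hypothetical function names rather than bmat's real CLI plumbing:

```rust
// Hypothetical sketch: map cache errors to a zero exit status so that jobs
// running under `set -o errexit -o pipefail` don't die on an optional
// caching step. Names are illustrative, not bmat's actual code.
fn run_cache_restore(key: &str) -> Result<bool, String> {
    // ...would talk to the agent; stubbed: empty key simulates "no backend".
    if key.is_empty() {
        return Err("no cache backend available".into());
    }
    Ok(false) // cache miss
}

fn cache_restore_exit_code(key: &str) -> i32 {
    match run_cache_restore(key) {
        Ok(true) => eprintln!("cache {key}: restored"),
        Ok(false) => eprintln!("cache {key}: miss, continuing without it"),
        Err(e) => eprintln!("cache {key}: warning: {e} (ignored)"),
    }
    0 // success either way, so errexit jobs don't need `|| true`
}
```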
c.db.record_pending_cache_upload(
    cache_id,
    j.owner,
    &path.name,
    w.id,
    size_bytes,
    &s3_upload_id,
    part_number,
)
.or_500()?;
It's not ideal that we can potentially create detritus within S3 in the form of dangling multipart uploads if we are interrupted prior to this database write, but I guess it is what it is for now.
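The dangling-upload concern could be mitigated by a periodic sweep comparing ListMultipartUploads output against the database, along these lines. This is a sketch with hypothetical names, not code from the PR:

```rust
use std::collections::HashSet;

// Sketch of an orphan sweep: any in-progress multipart upload that S3
// reports (via ListMultipartUploads) with no matching cache_pending_upload
// row was presumably left behind by a crash, and can be aborted with
// AbortMultipartUpload. Names are illustrative, not buildomat's code.
fn orphaned_uploads<'a>(
    s3_upload_ids: &'a [String],
    db_upload_ids: &HashSet<String>,
) -> Vec<&'a str> {
    s3_upload_ids
        .iter()
        .filter(|id| !db_upload_ids.contains(id.as_str()))
        .map(|id| id.as_str())
        .collect()
}
```

In practice the sweep would also want a grace period, so it doesn't abort an upload that was created just before its database row was written.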
At a high level, I have an important question about the bmat interface (which is committed) for caching, with particular respect to how we expect people to hold it in their jobs.
Often in jobs people have used the bash errexit and pipefail options, so if the bmat cache family of commands exits with a non-zero status, that will presumably terminate the job over what is, by virtue of being a cache, an optional operation: one that exists not for correctness but just to improve performance. Do we want jobs to fail if bmat cache fails in an unexpected way, or should we always exit 0, so that people don't end up eventually just doing bmat cache .. || true everywhere regardless of whether the operation did anything (or whether the buildomat in question has an available S3 backend, say)?
I think that makes sense; I'll sleep on it to see if I can think of a reason to error out on failure, but you are right that in most cases we should just ignore the error.
Another thing to consider is how to handle retries if requests for a cache fail: I think we do want to do a limited number of retries to account for spurious failures, but the code as written now would keep retrying forever (I noticed that when a job was running while I stopped my local buildomat server).
Another thing that comes to mind is the design of … If we had a way in the buildomat agent to queue a command to run when a job finishes (bikeshedding, …
This PR adds caching support in Buildomat, driven by the `bmat` CLI.
The core building blocks of this implementation are the `bmat cache save` and `bmat cache restore` commands. These commands are fairly low level: they require the user to explicitly provide the cache key, and `bmat cache save` also requires an exhaustive list of files to cache to be provided via stdin.
On top of this building block, the `bmat cache rust save` and `bmat cache rust restore` commands wrap the low level primitives and automatically determine the cache key and the list of files to cache. This allows us to enforce best practices (like caching only third-party dependencies out of the `target/` directory) and reduces the effort to adopt caching.
Caches are stored in S3-compatible object storage: when a worker requests a cache from the Buildomat server, the server generates a pre-signed download URL and returns it to the client. Similarly, when a worker wants to save a cache, the server returns the pre-signed upload URLs the agent has to upload the file to. Multipart uploads are used to increase concurrency during upload.
I chose to return pre-signed URLs to the worker rather than using the chunked upload facility in Buildomat for two reasons: (a) Buildomat chunked uploads are tied to the lifetime of a job, while caches will outlast jobs, and (b) streaming large uploads and downloads through the Buildomat server adds unnecessary latency and overhead.
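To illustrate the multipart side: the cache archive is split into byte ranges, each part gets its own pre-signed URL, and the parts can then be uploaded concurrently. A sketch of the range computation (illustrative only, not the PR's code; S3 requires every part except the last to be at least 5 MiB):

```rust
// Split an object of `size` bytes into (start, end) byte ranges of at most
// `part_size` bytes each. S3 requires all parts but the last to be at least
// 5 MiB. Illustrative sketch, not the PR's actual code.
const MIN_PART_SIZE: u64 = 5 * 1024 * 1024;

fn part_ranges(size: u64, part_size: u64) -> Vec<(u64, u64)> {
    assert!(part_size >= MIN_PART_SIZE);
    let mut parts = Vec::new();
    let mut off = 0;
    while off < size {
        let end = (off + part_size).min(size);
        parts.push((off, end));
        off = end;
    }
    parts
}
```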
On the agent side, I chose to execute the whole logic in the `bmat` CLI rather than the agent daemon, with the agent daemon's only responsibility being proxying API requests between `bmat` and the server. This allowed me to output more useful logs than would have been practical when implementing the logic in the agent daemon.
There are three things missing in this PR that I'm going to tackle in followups:
Fixes #32