fix: check for existing buildkitd before mounting sticky disk #71
Open
Conversation
When setup-docker-builder is invoked twice in the same job (e.g. via a composite action called twice), the second invocation was calling `setupStickyDisk()` before detecting the already-running buildkitd. This caused a new sticky disk to be mounted on top of `/var/lib/buildkit` while buildkitd was still running with in-memory metadata referencing snapshot directories from the original disk. The subsequent build then failed with:

`ERROR: failed to solve: failed to read dockerfile: failed to walk: resolve: lstat /var/lib/buildkit/runc-overlayfs/snapshots/snapshots/N: no such file or directory`

Fix: move the buildkitd process check to the very beginning of `startBlacksmithBuilder()`, before any sticky disk setup. If buildkitd is already running, log an informational message and return immediately so the fallback path reuses the existing configured builder (from the first invocation) without corrupting its overlayfs snapshot state.
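The reordering described above can be sketched as follows. This is a minimal illustration, not the action's actual code: `isBuildkitdRunning()` and the shape of `startBlacksmithBuilder()` are assumptions based on the PR description, and the process check here uses a simple `pgrep` probe.

```typescript
import { execSync } from "node:child_process";

// Returns true if a buildkitd process is already running on the host.
// (Hypothetical helper; the real action may detect this differently.)
function isBuildkitdRunning(): boolean {
  try {
    // pgrep exits non-zero when no process matches, which makes execSync throw.
    execSync("pgrep -x buildkitd", { stdio: "ignore" });
    return true;
  } catch {
    return false;
  }
}

// Sketch of the fixed startup sequence: the process check runs FIRST,
// before any sticky disk work can remount over /var/lib/buildkit.
async function startBlacksmithBuilder(): Promise<{ addr: string | null }> {
  if (isBuildkitdRunning()) {
    console.info("buildkitd already running; reusing existing builder");
    return { addr: null }; // early return: sticky disk is never touched
  }
  // ...setupStickyDisk() and buildkitd startup would happen here...
  return { addr: "unix:///run/buildkit/buildkitd.sock" };
}
```

The key property is that the early return happens before `setupStickyDisk()` is ever reached, so a second invocation cannot mount a fresh disk underneath a live daemon.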
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Why?
====
When setup-docker-builder is called twice in one job, the second instance's post-action cleanup kills the buildkitd process that the first instance started. GitHub Actions runs post steps in reverse order, so the second cleanup runs first and tears down buildkitd before the first instance's builds can export their build records. This causes "connection refused" errors in the first build's post step and "buildkitd has crashed" warnings in the first setup's cleanup.

How?
====
- Inspected the failing job logs from the ee-monorepo CI run to trace the exact sequence: Post Setup (2nd) kills buildkitd, then Post Build (1st) gets connection refused.
- Confirmed that `startBlacksmithBuilder` returns `addr: null` when it detects existing buildkitd, which means `buildkitdAddr` is never saved to state for the second instance.
- Used `buildkitdAddr` presence in state as the ownership signal: only the instance that started buildkitd should shut it down.
- Extracted the shutdown logic into `maybeShutdownBuildkitd()` with early returns to flatten the deeply nested if/try/catch structure, and deduplicated the crash-log printing into `logBuildkitdCrashLogs()`.
When you rebuild I will test this again!
Contributor
Author
@chadxz - sorry for the delay, you should be good to go here!
Based on #65 by @chadxz, with dist rebuilt.
Note
Medium Risk
Changes builder initialization control flow to early-exit when `buildkitd` is already running, which affects how repeated action invocations behave and could impact workflows that relied on re-initialization side effects. Scope is localized to setup sequencing and logging, with no new external interfaces.

Overview

Prevents repeated `setup-docker-builder` invocations in the same job from remounting the sticky disk over an already-running `buildkitd`. `startBlacksmithBuilder()` now checks for an existing `buildkitd` process before calling `setupStickyDisk()`; if found, it logs and returns early so the action reuses the already-configured builder instead of corrupting BuildKit's on-disk snapshot state.

Written by Cursor Bugbot for commit 730dff6.