fix: check for existing buildkitd before mounting sticky disk #71

Open
pbardea wants to merge 4 commits into main from fix-double-setup-error

Conversation

@pbardea pbardea commented Mar 16, 2026

When setup-docker-builder is invoked twice in the same job (e.g. via a composite action called twice), the second invocation was calling setupStickyDisk() before detecting the already-running buildkitd. This caused a new sticky disk to be mounted on top of /var/lib/buildkit while buildkitd was still running with in-memory metadata referencing snapshot directories from the original disk. The subsequent build then failed with:

ERROR: failed to solve: failed to read dockerfile: failed to walk:
resolve: lstat /var/lib/buildkit/runc-overlayfs/snapshots/snapshots/N:
no such file or directory

Fix: move the buildkitd process check to the very beginning of startBlacksmithBuilder(), before any sticky disk setup. If buildkitd is already running, log an informational message and return immediately so the fallback path reuses the existing configured builder (from the first invocation) without corrupting its overlayfs snapshot state.
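A minimal sketch of the reordered check, in TypeScript. The helper names (`isProcessRunning`, the `setupStickyDisk` stub) and the returned socket address are illustrative assumptions, not the action's actual code:

```typescript
import { execSync } from "child_process";

// Assumed helper: detect a running process by exact name via pgrep.
export function isProcessRunning(name: string): boolean {
  try {
    execSync(`pgrep -x ${name}`, { stdio: "ignore" });
    return true;
  } catch {
    return false; // pgrep exits non-zero when no process matches
  }
}

// Stub standing in for the real sticky disk mount logic.
async function setupStickyDisk(): Promise<void> {
  /* mount sticky disk at /var/lib/buildkit (elided) */
}

export async function startBlacksmithBuilder(): Promise<{ addr: string | null }> {
  // The fix: check BEFORE any sticky disk setup. Mounting a fresh disk
  // over /var/lib/buildkit while buildkitd holds in-memory references to
  // the old snapshot directories corrupts the overlayfs snapshot state.
  if (isProcessRunning("buildkitd")) {
    console.info("buildkitd already running; reusing existing builder");
    return { addr: null };
  }
  await setupStickyDisk();
  // ... start buildkitd against the freshly mounted disk ...
  return { addr: "unix:///run/buildkit/buildkitd.sock" };
}
```

The key property is ordering: the process check happens before any mount, so a second invocation can never disturb the first builder's disk.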

Based on #65 by @chadxz, with dist rebuilt.


Note

Medium Risk
Changes builder initialization control flow to early-exit when buildkitd is already running, which affects how repeated action invocations behave and could impact workflows that relied on re-initialization side effects. Scope is localized to setup sequencing and logging, with no new external interfaces.

Overview
Prevents repeated setup-docker-builder invocations in the same job from remounting the sticky disk over an already-running buildkitd.

startBlacksmithBuilder() now checks for an existing buildkitd process before calling setupStickyDisk(); if found, it logs and returns early so the action reuses the already-configured builder instead of corrupting BuildKit’s on-disk snapshot state.

Written by Cursor Bugbot for commit 730dff6.

chadxz and others added 2 commits March 10, 2026 17:26
Why?
====

When setup-docker-builder is called twice in one job, the
second instance's post-action cleanup kills the buildkitd
process that the first instance started. GitHub Actions runs
post steps in reverse order, so the second cleanup runs first
and tears down buildkitd before the first instance's builds
can export their build records. This causes "connection
refused" errors in the first build's post step and "buildkitd
has crashed" warnings in the first setup's cleanup.

How?
====

- Inspected the failing job logs from the ee-monorepo CI run
  to trace the exact sequence: Post Setup (2nd) kills
  buildkitd, then Post Build (1st) gets connection refused.
- Confirmed that `startBlacksmithBuilder` returns
  `addr: null` when it detects existing buildkitd, which
  means `buildkitdAddr` is never saved to state for the
  second instance.
- Used `buildkitdAddr` presence in state as the ownership
  signal — only the instance that started buildkitd should
  shut it down.
- Extracted the shutdown logic into `maybeShutdownBuildkitd()`
  with early returns to flatten the deeply nested if/try/catch
  structure, and deduplicated the crash-log printing into
  `logBuildkitdCrashLogs()`.
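The ownership check and the flattened shutdown path described above might look like this sketch (TypeScript); the state shape and the function signatures are assumptions based on the commit description, not the repository's actual code:

```typescript
type ActionState = Record<string, string>;

// buildkitdAddr is only saved to state by the instance that actually
// started buildkitd, so its presence is the ownership signal.
export function ownsBuildkitd(state: ActionState): boolean {
  return Boolean(state["buildkitdAddr"]);
}

function logBuildkitdCrashLogs(err: unknown): void {
  // Deduplicated crash-log printing (details elided in this sketch).
  console.error("buildkitd shutdown failed:", err);
}

// Early returns replace the deeply nested if/try/catch structure.
export function maybeShutdownBuildkitd(
  state: ActionState,
  shutdown: () => void
): boolean {
  if (!ownsBuildkitd(state)) {
    return false; // another instance started buildkitd; leave it running
  }
  try {
    shutdown();
    return true;
  } catch (err) {
    logBuildkitdCrashLogs(err);
    return false;
  }
}
```

Because GitHub Actions runs post steps in reverse order, this check is what stops the second instance's cleanup from tearing down a buildkitd it never started.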

chadxz commented Mar 18, 2026

When you rebuild I will test this again!


pbardea commented Mar 19, 2026

@chadxz - sorry for the delay; you should be good to go here!
