Add buildctl du diagnostics around prune #75

Open
ConnorMul wants to merge 2 commits into useblacksmith:main from ConnorMul:debug/layer-cache-prune
@ConnorMul
Summary

  • Adds buildctl du --verbose output logging before and after the pruneBuildkitCache() call in the post-action cleanup
  • Exports BUILDKIT_DAEMON_ADDR from setup_builder.ts so it can be used in main.ts
  • This helps diagnose whether buildctl prune --keep-duration 168h --all is removing freshly built layers due to nil LastUsedAt on cache-miss-created entries
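
The before/after logging described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `RunCommand`, `pruneWithDiagnostics`, and the address value are hypothetical stand-ins, and the command runner is injected so the flow can be exercised without a running buildkitd.

```typescript
// Hypothetical signature for whatever runs a shell command and returns its stdout.
type RunCommand = (cmd: string) => Promise<string>;

// Placeholder value (assumption). The PR only states that BUILDKIT_DAEMON_ADDR
// is exported from setup_builder.ts.
const BUILDKIT_DAEMON_ADDR = "tcp://127.0.0.1:1234";

// Logs a verbose disk-usage snapshot, runs the prune step, then logs a second
// snapshot so the two can be compared record by record.
async function pruneWithDiagnostics(
  run: RunCommand,
  prune: () => Promise<void>,
  log: (msg: string) => void,
): Promise<void> {
  const duCmd = `sudo buildctl --addr ${BUILDKIT_DAEMON_ADDR} du --verbose`;
  log(`BEFORE prune:\n${await run(duCmd)}`);
  await prune();
  log(`AFTER prune:\n${await run(duCmd)}`);
}
```

Records that appear in the "BEFORE prune" snapshot but not in the "AFTER prune" snapshot are the ones the prune step removed.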

Context

A customer (hoverinc/ug-engine) reported Docker layer cache misses despite only changing a single comment. Investigation suggests the --all flag in the prune command may treat newly created layers (with nil LastUsedAt) as "never used" and prune them immediately.
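
To make the hypothesis concrete, here is a toy model of the two ways a keep-duration filter could treat a missing last-used timestamp. It is not BuildKit's implementation: `CacheRecord` and both predicates are invented for illustration, and which variant matches BuildKit's real behavior is exactly what the diagnostics are meant to establish.

```typescript
// Illustrative record shape (assumption), not BuildKit's actual type.
interface CacheRecord {
  id: string;
  createdAt: Date;
  lastUsedAt: Date | null; // null models a record created on a cache miss and never reused
}

const KEEP_DURATION_MS = 168 * 60 * 60 * 1000; // 168h, as in the prune command

// Variant 1: a missing lastUsedAt is treated as "infinitely old",
// so a freshly built record is immediately eligible for pruning.
function prunableIfNilIsOld(record: CacheRecord, now: Date): boolean {
  const lastUsed = record.lastUsedAt ? record.lastUsedAt.getTime() : 0;
  return now.getTime() - lastUsed > KEEP_DURATION_MS;
}

// Variant 2: a missing lastUsedAt falls back to createdAt,
// so a freshly built record survives until it ages out.
function prunableWithCreatedAtFallback(record: CacheRecord, now: Date): boolean {
  const reference = (record.lastUsedAt ?? record.createdAt).getTime();
  return now.getTime() - reference > KEEP_DURATION_MS;
}
```

Under variant 1, a layer built one minute ago with no recorded use would be pruned. Under variant 2 it would be kept. The reported cache misses are consistent with variant 1.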

Test plan

  • CI builds dist successfully
  • Run docker-layer-cache-repro workflow in test-workflows with this branch
  • Compare "BEFORE prune" vs "AFTER prune" output to confirm root cause

🤖 Generated with Claude Code

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 3 potential issues.

let beforeStats = "";
try {
const { stdout } = await execAsync(`cat ${statPath}`);
beforeStats = stdout.trim();

Integrity checks lost hard timeout protection

High Severity

checkBoltDbIntegrity now uses bare execAsync for filesystem and bbolt commands, so stalled I/O can block forever. The previous execWithTimeout/ExecTimeoutError path that allowed skipping on hangs was removed, which can freeze post-action cleanup.
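
For reference, a hard timeout of the kind this comment describes could look roughly like the sketch below. The repository's actual `execWithTimeout` is not shown in this thread, so this is a reconstruction under assumptions: `withTimeout` is a hypothetical helper, and `ExecTimeoutError` here only borrows the name mentioned above.

```typescript
// Stand-in error type (assumption), named after the one referenced in the review comment.
class ExecTimeoutError extends Error {
  constructor(label: string, timeoutMs: number) {
    super(`${label} timed out after ${timeoutMs}ms`);
    this.name = "ExecTimeoutError";
  }
}

// Rejects with ExecTimeoutError if `work` has not settled within `timeoutMs`.
// Note: this only stops waiting. It does not kill an underlying child process.
function withTimeout<T>(work: Promise<T>, timeoutMs: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new ExecTimeoutError(label, timeoutMs)), timeoutMs);
  });
  // Clear the timer either way so it does not keep the event loop alive.
  return Promise.race([work, timeout]).finally(() => clearTimeout(timer));
}
```

A caller such as `checkBoltDbIntegrity` could wrap each filesystem or bbolt command in this helper, catch the timeout error, and skip the check instead of blocking post-action cleanup.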

const { stdout: duBeforeSummary } = await execAsync(
`sudo buildctl --addr ${BUILDKIT_DAEMON_ADDR} du 2>&1 | tail -5`,
);
core.info(`Cache summary before prune: ${duBeforeSummary}`);

New buildctl du diagnostics can hang cleanup

Medium Severity

The new pre/post-prune buildctl du diagnostics run without timeout limits. If buildkitd is unresponsive, these calls can block indefinitely and prevent prune, shutdown, unmount, and sticky-disk commit from completing.
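
One way to keep diagnostics from blocking cleanup is to treat them as best effort: bound each call with a process-level timeout and downgrade any failure to a warning. The sketch below assumes Node's `execFile` with its `timeout` option. `bestEffortOutput` is a hypothetical helper, not part of this PR.

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

// Runs a diagnostic command and returns its trimmed stdout, or null if the
// command fails or exceeds timeoutMs. Node kills the child when the timeout
// fires, and the failure is reported through `warn` instead of being thrown.
async function bestEffortOutput(
  file: string,
  args: string[],
  timeoutMs: number,
  warn: (msg: string) => void,
): Promise<string | null> {
  try {
    const { stdout } = await execFileAsync(file, args, {
      timeout: timeoutMs,
      encoding: "utf8",
    });
    return stdout.trim();
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    warn(`diagnostic "${file}" skipped: ${reason}`);
    return null;
  }
}
```

With this shape, a call such as `bestEffortOutput("sudo", ["buildctl", "--addr", addr, "du"], 30_000, core.warning)` returns `null` on a hang, and prune, shutdown, unmount, and sticky-disk commit still run. Using `execFile` rather than a shell pipeline means the timeout signal reaches the command itself, at the cost of doing any `tail`-style trimming in TypeScript.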

const { stdout: dbFiles } = await execAsync(
"find /var/lib/buildkit -name '*.db' 2>/dev/null || true",
30_000,
"find db files",

Documented max-parallelism input no longer applied

Medium Severity

The public max-parallelism input is still declared in action.yml, but parsing and override logic were removed. User-provided values are now silently ignored and parallelism is always derived from CPU count.
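
A sketch of what honoring the input could look like is given below. The parsing logic that was removed is not shown in this thread, so `resolveParallelism` is a hypothetical helper. It assumes the raw value comes from something like `core.getInput("max-parallelism")`, which returns an empty string when the input is unset, and that the CPU count is supplied by the caller.

```typescript
// Resolves the effective parallelism: an explicit positive integer wins,
// an empty input falls back to the CPU-derived default, and anything else
// is rejected loudly instead of being silently ignored.
function resolveParallelism(rawInput: string, cpuCount: number): number {
  const trimmed = rawInput.trim();
  if (trimmed === "") {
    return cpuCount;
  }
  const parsed = Number(trimmed);
  if (!Number.isInteger(parsed) || parsed < 1) {
    throw new Error(`max-parallelism must be a positive integer, got "${rawInput}"`);
  }
  return parsed;
}
```

Failing on an invalid value is a design choice in this sketch. Falling back to the CPU count with a warning would also avoid the silent-ignore behavior the comment flags.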
