fix(plugin-backfill): add retry backoff and continue processing after chunk failure #80

Merged
KeKs0r merged 2 commits into main from marc/backfill-retry-backoff on Mar 1, 2026

@KeKs0r KeKs0r commented Mar 1, 2026

Summary

  • Add exponential backoff between retries (configurable via defaults.retryDelayMs, default 1000ms) to recover from transient ClickHouse Cloud merge conflicts
  • Continue processing remaining chunks after one fails permanently instead of stopping the entire backfill run
  • Make resume automatically retry failed chunks without requiring --replay-failed flag
  • Refactor executeChunk to eliminate duplicated simulation/real execution branches
  • Update backfill plugin documentation with new option and simplified recovery workflow
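
The resume behavior in the third bullet can be sketched as a simple selection rule. This is a minimal illustration, not the plugin's code: the `Chunk` type, the status names, and `selectChunksForResume` are all hypothetical, since the PR description does not show the real chunk model.

```typescript
// Hypothetical chunk model; the plugin's actual types are not shown in this PR.
type ChunkStatus = "pending" | "done" | "failed";

interface Chunk {
  id: string;
  status: ChunkStatus;
}

// On resume, every chunk that is not done is eligible to run again,
// so failed chunks are retried without an extra flag.
function selectChunksForResume(chunks: Chunk[]): Chunk[] {
  return chunks.filter((chunk) => chunk.status !== "done");
}

const plan: Chunk[] = [
  { id: "chunk-1", status: "done" },
  { id: "chunk-2", status: "failed" },
  { id: "chunk-3", status: "pending" },
];

// chunk-2 (failed) and chunk-3 (pending) are selected; chunk-1 is skipped.
const resumeIds = selectChunksForResume(plan).map((chunk) => chunk.id);
console.log(resumeIds);
```

Treating "failed" the same as "pending" on resume is what removes the need for a separate flag in the common recovery flow.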

Test plan

  • All 52 existing backfill plugin tests pass (11 of them runtime tests)
  • TypeCheck passes with no errors
  • Verification successful (turbo typecheck lint test build)
  • New tests added for continue-past-failure behavior and resume-without-replay-failed

🤖 Generated with Claude Code

KeKs0r and others added 2 commits March 1, 2026 21:42
… chunk failure

Three fixes to backfill runtime:
1. Add exponential backoff between retries (configurable via defaults.retryDelayMs, default 1000ms)
   - Backoff formula: baseDelay * 2^(attempt-1), so retries happen at 1s, 2s, 4s, etc.
   - Allows transient ClickHouse Cloud merge conflicts to resolve before re-attempting

2. Don't stop backfill on first permanent chunk failure - continue processing remaining chunks
   - Failed chunks are marked failed and the rest of the plan continues
   - Final run status shows both done and failed chunk counts
   - Users can retry failed chunks with resume command

3. Resume always retries failed chunks - users shouldn't need --replay-failed
   - Simplifies the common recovery flow after transient errors
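
The backoff formula from item 1 can be sketched as follows. Only `retryDelayMs`, its 1000ms default, and the formula `baseDelay * 2^(attempt-1)` come from the commit message. The names `backoffDelayMs`, `withRetry`, `maxRetries`, and the injectable `sleep` parameter are assumptions made for illustration, and the real retry limit and wiring in the plugin may differ.

```typescript
// Delay before retry number `attempt` (1-based): base, 2x base, 4x base, ...
function backoffDelayMs(retryDelayMs: number, attempt: number): number {
  return retryDelayMs * Math.pow(2, attempt - 1);
}

// Hypothetical retry wrapper: one initial try plus up to `maxRetries` retries,
// sleeping for the backoff delay before each retry.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxRetries: number,
  retryDelayMs: number = 1000,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    if (attempt > 0) {
      // Give a transient error time to clear before trying again.
      await sleep(backoffDelayMs(retryDelayMs, attempt));
    }
    try {
      return await fn();
    } catch (err) {
      lastError = err;
    }
  }
  // All attempts failed: surface the last error so the caller can mark the chunk failed.
  throw lastError;
}

// With the default base delay, the first three retries wait 1s, 2s, and 4s.
console.log([1, 2, 3].map((n) => backoffDelayMs(1000, n)));
```

Passing `sleep` as a parameter is a design choice for this sketch: tests can substitute a no-op and avoid real waits.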

Also refactor executeChunk to merge duplicated simulation/real execution branches.

Docs: add retryDelayMs option to backfill docs, update resume behavior description,
simplify failed chunk recovery example (no longer needs --replay-failed).

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

The backfill run now processes chunk 3 even when chunk 2 fails,
so chunkCounts.done is 2 instead of 1.
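
A minimal sketch of why the count changes, assuming a three-chunk plan in which only chunk 2 fails. `runPlan`, the `Chunk` type, and the shape of the returned counts are hypothetical stand-ins. Only the name `chunkCounts.done` and the continue-past-failure behavior come from the commit messages.

```typescript
// Hypothetical chunk model for illustration only.
type ChunkStatus = "pending" | "done" | "failed";

interface Chunk {
  id: number;
  status: ChunkStatus;
}

// Run every chunk that is not done; a failure marks the chunk and moves on
// instead of aborting the rest of the plan.
function runPlan(
  chunks: Chunk[],
  execute: (chunk: Chunk) => void,
): { done: number; failed: number } {
  for (const chunk of chunks) {
    if (chunk.status === "done") continue;
    try {
      execute(chunk);
      chunk.status = "done";
    } catch (_err) {
      chunk.status = "failed";
    }
  }
  return {
    done: chunks.filter((c) => c.status === "done").length,
    failed: chunks.filter((c) => c.status === "failed").length,
  };
}

const plan: Chunk[] = [1, 2, 3].map((id): Chunk => ({ id, status: "pending" }));

// Chunk 2 fails permanently; chunks 1 and 3 still complete.
const chunkCounts = runPlan(plan, (chunk) => {
  if (chunk.id === 2) throw new Error("simulated permanent failure");
});
console.log(chunkCounts); // done: 2, failed: 1
```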

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@KeKs0r KeKs0r force-pushed the marc/backfill-retry-backoff branch from f4b3afc to 1daa9c7 March 1, 2026 13:44
@KeKs0r KeKs0r merged commit ebca417 into main Mar 1, 2026
2 checks passed
@KeKs0r KeKs0r deleted the marc/backfill-retry-backoff branch March 1, 2026 13:48