caching: staged debt-aware runtime vlog rewrite maintenance#842
Conversation
Pull request overview
This PR advances “Maintenance V2” for the caching DB’s runtime value-log (vlog) rewrite maintenance by (a) draining persisted rewrite debt more effectively during checkpoint-kick passes and (b) preventing low-signal generic rewrites via an efficacy floor.
Changes:
- Add generic rewrite efficacy floors via MinSegmentStaleRatio (with size-aware fallback) and MinSegmentStaleBytes for planning/execution.
- Add bounded multi-segment debt-drain behavior for checkpoint-kick rewrite resume queues (capped by vlogGenerationRewriteDebtDrainMaxSegments and budget).
- Extend scheduler tests to cover multi-segment debt drain and new planning option propagation.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| TreeDB/caching/db.go | Adds rewrite efficacy floors and a debt-drain multi-segment scheduler path, including checkpoint-kick wiring. |
| TreeDB/caching/vlog_generation_scheduler_test.go | Adds assertions for new plan options and a new test covering multi-segment debt drain behavior. |
    if queueLen < maxSegments {
        maxSegments = queueLen
    }
    if queueLen > maxSegments {
        maxSegments = queueLen
    }
    {FileID: 11, BytesLive: 64, BytesTotal: 128, BytesStale: 64, StaleRatio: 0.2},
    {FileID: 22, BytesLive: 64, BytesTotal: 128, BytesStale: 64, StaleRatio: 0.7},
    {FileID: 33, BytesLive: 64, BytesTotal: 128, BytesStale: 64, StaleRatio: 0.9},
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ab72ba687e
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
    minStaleRatio := db.vlogGenerationRewriteMinStaleRatioForGenericPass(totalBytes)
    plan, err := planner.ValueLogRewritePlan(ctx, backenddb.ValueLogRewriteOnlineOptions{
        MaxSourceSegments:    0,
        MaxSourceBytes:       maxSourceBytes,
        MinSegmentStaleRatio: minStaleRatio,
Avoid gating total-bytes/churn rewrites on per-segment stale ratio
This new MinSegmentStaleRatio is applied to the generic total-bytes/churn planner, not just the stale-ratio trigger. Once totalBytes >= 512 MiB, the helper returns valueLogRewriteTriggerRatioPPM (20% by default), and ValueLogRewritePlan then drops every segment whose individual stale ratio is below that floor (TreeDB/db/vlog_rewrite.go:847-848). On workloads where stale bytes are spread across many segments (for example, lots of updates leaving each segment ~5-15% stale), shouldRunVlogGenerationRewrite still fires for total-bytes/churn, but the plan comes back empty and lines 11728-11729 disable rewrite entirely, so the total-bytes trigger no longer reclaims space and the vlog can keep growing.
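The failure mode described above can be made concrete with a hedged sketch (simplified, hypothetical names; the real planner lives in TreeDB/db/vlog_rewrite.go): a per-segment stale-ratio floor drops every segment individually, so many mildly-stale segments yield an empty plan even when aggregate stale bytes are large.

```go
package main

import "fmt"

// segment mirrors the planner's per-segment stats (hypothetical simplified form).
type segment struct {
	FileID     uint64
	BytesStale int64
	StaleRatio float64
}

// planSegments sketches the floor flagged in the review: segments below
// minStaleRatio are dropped regardless of aggregate stale bytes.
func planSegments(segs []segment, minStaleRatio float64) []uint64 {
	var picked []uint64
	for _, s := range segs {
		if s.StaleRatio < minStaleRatio {
			continue
		}
		picked = append(picked, s.FileID)
	}
	return picked
}

func main() {
	// Many segments, each ~10% stale: plenty of reclaimable bytes in aggregate.
	segs := make([]segment, 32)
	for i := range segs {
		segs[i] = segment{FileID: uint64(i + 1), BytesStale: 64 << 20, StaleRatio: 0.10}
	}
	// With a 20% per-segment floor the plan comes back empty, so the
	// total-bytes trigger fires but reclaims nothing.
	fmt.Println(len(planSegments(segs, 0.20)))
}
```

Under these assumptions the total-bytes trigger keeps firing while the plan stays empty, which matches the “vlog can keep growing” outcome in the comment.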
    ledger, _ := db.currentVlogGenerationRewriteLedger()
    if len(ledger) > 0 {
        processedRewriteIDs = vlogGenerationRewriteLedgerChunk(ledger, rewriteMaxSegments, budgetTokens)
    } else {
        processedRewriteIDs = vlogGenerationRewriteQueueChunk(rewriteQueue, rewriteMaxSegments)
Keep queue-only rewrite debt budgeted during checkpoint drain
Checkpoint-kick debt drain can now raise rewriteMaxSegments above 1, but this fallback still slices queue-only debt by count only. We still persist queue-only state at lines 11822-11823 when a planner returns SourceFileIDs without SelectedSegments, and vlogGenerationRewriteQueueChunk (TreeDB/caching/vlog_generation_state.go:257-264) ignores budgetTokens entirely. In that supported resume path, a single checkpoint kick can rewrite up to 8 full segments even when the token bucket only has budget for one segment, which defeats the runtime budget cap and can reintroduce the large maintenance bursts this change is trying to avoid.
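A minimal sketch of the fix this comment asks for, with hypothetical names (the real helper is vlogGenerationRewriteQueueChunk in TreeDB/caching/vlog_generation_state.go): slice queue-only debt by both segment count and token budget, while always admitting at least one segment so debt still drains.

```go
package main

import "fmt"

// queueChunk is a hypothetical budget-aware variant of the queue-only chunk
// helper: it caps the chunk by a segment count AND a byte budget, always
// admitting at least one segment so queued debt keeps draining.
func queueChunk(queue []uint64, liveBytes map[uint64]int64, maxSegments int, budgetTokens int64) []uint64 {
	var out []uint64
	var spent int64
	for _, id := range queue {
		if len(out) >= maxSegments {
			break
		}
		cost := liveBytes[id]
		if len(out) > 0 && spent+cost > budgetTokens {
			break // budget exhausted; leave the rest for the next pass
		}
		out = append(out, id)
		spent += cost
	}
	return out
}

func main() {
	queue := []uint64{11, 22, 33}
	live := map[uint64]int64{11: 512 << 20, 22: 512 << 20, 33: 512 << 20}
	// A 512 MiB token budget admits only one 512 MiB segment, not eight.
	fmt.Println(len(queueChunk(queue, live, 8, 512<<20)))
}
```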
💡 Codex Review
Reviewed commit: d84a1656e1
TreeDB/caching/db.go
Outdated
| if stats.BytesBefore > 0 && stats.BytesAfter >= stats.BytesBefore { | ||
| db.vlogGenerationRewriteIneffectiveRuns.Add(1) | ||
| db.vlogGenerationRewriteIneffectiveBytesIn.Add(uint64(stats.BytesBefore)) | ||
| db.vlogGenerationRewriteIneffectiveBytesOut.Add(uint64(stats.BytesAfter)) |
Defer ineffective-run accounting until after zombie GC
In normal cached maintenance runs that actually rewrite data, this branch will still classify the pass as “ineffective”. ValueLogRewriteOnline only marks old source segments zombie (TreeDB/internal/valuelog/manager.go:1191) and then sets BytesAfter from valueLogSegmentStats, which counts every on-disk value-log file before the follow-up GC deletes those zombies (TreeDB/db/vlog_rewrite.go:1200-1205, 2301-2314). As a result, any successful rewrite that creates replacement segments will usually satisfy BytesAfter >= BytesBefore, so these new counters will report ineffective runs and whole-DB byte totals even when the subsequent ValueLogGC in this function reclaims the space immediately.
Pull request overview
This PR advances the caching DB’s “Maintenance V2” runtime value-log rewrite strategy by improving how persisted rewrite debt is drained (especially on checkpoint-kick passes) and by adding efficacy floors to avoid low-signal rewrites that can amplify bytes and hurt wall time.
Changes:
- Add debt-drain support for checkpoint-kick rewrite resumes, including bounded multi-segment selection.
- Add “generic pass” rewrite efficacy floors (MinSegmentStaleRatio + MinSegmentStaleBytes) and track ineffective rewrites via new Stats counters.
- Expand scheduler test coverage for debt-drain behavior, efficacy floors, and bypass-quiet planner behavior.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| TreeDB/caching/db.go | Adds debt-drain scheduling, new planner/exec knobs, new ineffective-run metrics, and introduces a bypass-quiet maintenance context. |
| TreeDB/caching/vlog_generation_scheduler_test.go | Adds tests validating multi-segment debt drain, ineffective-run telemetry, and quiet-window bypass behavior. |
Comments suppressed due to low confidence (1)
TreeDB/caching/db.go:11631
- In the stale-ratio trigger planning pass, MinSegmentStaleBytes is still hard-coded as 1, even though vlogGenerationRewriteMinSegmentStaleBytes was introduced alongside the new generic-pass knobs. Using the constant here would keep the planner behavior consistent and avoid future drift if the floor changes.
plan, err := planner.ValueLogRewritePlan(ctx, backenddb.ValueLogRewriteOnlineOptions{
MaxSourceSegments: 0,
MaxSourceBytes: maxSourceBytes,
MinSegmentStaleRatio: minStaleRatio,
MinSegmentStaleBytes: 1,
})
    if timeout > 0 {
        return context.WithTimeout(context.Background(), timeout)
    }
    return context.WithCancel(context.Background())
    if queueLen < maxSegments {
        maxSegments = queueLen
    }
    if queueLen > maxSegments {
        maxSegments = queueLen
    }
    if err != nil {
        t.Fatalf("open backend: %v", err)
    }
    t.Cleanup(func() { _ = backend.Close() })
💡 Codex Review
Reviewed commit: 801a18f660
    func (db *DB) vlogGenerationMaintenanceContext(timeout time.Duration, opts vlogGenerationMaintenanceOptions) (context.Context, context.CancelFunc) {
        // Checkpoint-kick maintenance is an explicit caller opt-in to run outside the
        // quiet-window gate. Keep this context timeout-bounded, but do not
        // self-cancel on immediate foreground activity resumes.
        if opts.bypassQuiet {
Preserve close-triggered cancellation in bypassQuiet planners
When bypassQuiet is set this helper switches to context.WithTimeout(context.Background(), ...), so the planner no longer observes db.closeCh. The checkpoint-kick path runs maybeRunVlogGenerationMaintenanceWithOptions(... bypassQuiet: true ...) in a goroutine that is counted in db.wg, and Close() waits on that waitgroup before teardown (TreeDB/caching/db.go:12097-12120, 12916-12919). On a large tree, closing during a checkpoint-kick rewrite plan will now block until the full 30s planner timeout instead of canceling promptly.
    perSegmentBudget := db.valueLogGenerationWarmTarget
    if perSegmentBudget <= 0 {
        perSegmentBudget = defaultVlogGenerationWarmTargetBytes
    }
    if perSegmentBudget <= 0 {
Budget queue-only debt drains by actual segment size
Fresh evidence relative to the earlier queue-only budget comment: this helper now budgets legacy/persisted queue-only debt using valueLogGenerationWarmTarget, not the queued segments' real live bytes. That still underestimates cold or oversized source segments: with the default 256 MiB warm / 512 MiB cold targets, a 512 MiB token bucket authorizes two queued cold segments because byBudget = budgetTokens / warmTarget, and the queue fallback at TreeDB/caching/db.go:11837-11842 then slices only by count. Because tokens are only debited after the rewrite finishes (TreeDB/caching/db.go:11959-11976), one checkpoint-kick pass can still overspend the runtime budget by 2x or more on supported queue-only resume state loaded from vlog_generation_state.json.
💡 Codex Review
Reviewed commit: 8463c62153
    if !quiet && !opts.bypassQuiet {
        shouldRewrite = false
        reason = vlogGenerationReasonNone
Preserve rewrite progress on hot periodic passes without checkpoints
In TreeDB/caching/db.go I checked that checkpoint-kicks only happen from Checkpoint() (maybeKickVlogGenerationMaintenanceAfterCheckpoint) and that StartAutoCheckpoint() is opt-in, so many deployments can run long stretches with no checkpoints at all. This new guard clears shouldRewrite on every hot maybeRunVlogGenerationMaintenance(true) pass, which means sustained-write WAL-on workloads now lose their only rewrite path under load: total-bytes/churn triggers and any persisted rewrite-resume queue will stop making progress until the foreground finally goes quiet. That can let rewrite debt and retained bytes grow indefinitely in otherwise supported configurations.
Pull request overview
This PR advances the caching DB’s generational value-log maintenance (“Maintenance V2”) to better drain persisted rewrite debt during checkpoint-kick passes and to avoid low-signal/low-efficacy rewrites during generic maintenance runs.
Changes:
- Add rewrite-plan outcome tracking + cancel backoff to reduce immediate re-planning after cancellations.
- Add bounded multi-segment “debt drain” execution for persisted rewrite resume queues (checkpoint-kick).
- Add generic rewrite efficacy floors via MinSegmentStaleRatio/MinSegmentStaleBytes, plus tests covering the new scheduling behavior and stats.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| TreeDB/caching/db.go | Adds planning outcome stats/backoff, debt-drain segment selection, and generic rewrite efficacy floors; threads new options through checkpoint-kick maintenance. |
| TreeDB/caching/vlog_generation_scheduler_test.go | Extends scheduler tests to cover multi-segment debt drain, plan outcome stats, cancel backoff, and ineffective rewrite tracking. |
    func (db *DB) vlogGenerationMaintenanceContext(timeout time.Duration, opts vlogGenerationMaintenanceOptions) (context.Context, context.CancelFunc) {
        // Checkpoint-kick maintenance is an explicit caller opt-in to run outside the
        // quiet-window gate. Keep this context timeout-bounded, but do not
        // self-cancel on immediate foreground activity resumes.
        if opts.bypassQuiet {
            if timeout > 0 {
                return context.WithTimeout(context.Background(), timeout)
            }
            return context.WithCancel(context.Background())
        }
        return db.foregroundMaintenanceContext(timeout)
    }
    if queueLen < maxSegments {
        maxSegments = queueLen
    }
    if queueLen > maxSegments {
        maxSegments = queueLen
    }
    if maxSegments > vlogGenerationRewriteDebtDrainMaxSegments {
        maxSegments = vlogGenerationRewriteDebtDrainMaxSegments
    }
    if maxSegments < 1 {
        maxSegments = 1
    }
    func (b *blockingRewritePlannerBackend) recordedRewriteCalls() int {
        b.mu.Lock()
        defer b.mu.Unlock()
        return b.rewriteCalls
    }
    {FileID: 11, BytesLive: 64, BytesTotal: 128, BytesStale: 64, StaleRatio: 0.2},
    {FileID: 22, BytesLive: 64, BytesTotal: 128, BytesStale: 64, StaleRatio: 0.7},
    {FileID: 33, BytesLive: 64, BytesTotal: 128, BytesStale: 64, StaleRatio: 0.9},
💡 Codex Review
Reviewed commit: f398e0967d
TreeDB/caching/db.go (outdated)
    planBackoff := len(rewriteQueue) == 0 && db.vlogGenerationRewritePlanBackoffActive(now)
    if planBackoff {
        shouldRewrite = false
        reason = vlogGenerationReasonNone
Exempt checkpoint-kick rewrites from plan-cancel backoff
On write-heavy workloads where a periodic planner run is canceled by renewed foreground activity, observeVlogGenerationRewritePlanOutcome records the cancel time and this branch suppresses all new rewrite planning for 5 seconds even when maybeKickVlogGenerationMaintenanceAfterCheckpoint later calls maintenance with bypassQuiet: true. Because checkpoint kicks are also rate-limited, one canceled background plan can consume the next checkpoint-triggered maintenance window and leave rewrite debt untouched until a later checkpoint, which undermines the new debt-drain path under sustained ingest.
💡 Codex Review
Reviewed commit: a29ba223e3
TreeDB/caching/db.go (outdated)
    rewriteCancelBackoff := len(rewriteQueue) > 0 && db.vlogGenerationRewriteCancelBackoffActive(now)
    if rewriteCancelBackoff {
        shouldRewrite = false
        reason = vlogGenerationReasonNone
    }
Skip rewrite-cancel backoff for checkpoint debt drains
On write-heavy WAL-on workloads, ValueLogRewriteOnline can still be canceled by resumed foreground activity because the executor later in maybeRunVlogGenerationMaintenanceWithOptions still uses foregroundMaintenanceContext. Once that happens, this new branch suppresses every queued rewrite for 20 seconds, including the checkpoint-kick maybeKickVlogGenerationMaintenanceAfterCheckpoint(... rewriteDebtDrain: true ...) path. A single foreground-canceled chunk can therefore burn several checkpoint windows and leave persisted rewrite debt idle, which undermines the new debt-drain behavior this change is adding.
Pull request overview
This PR advances the caching DB’s “Maintenance V2” generational value-log rewrite scheduler to better drain persisted rewrite debt (especially on checkpoint-kick passes) while avoiding low-efficacy rewrites during generic maintenance passes.
Changes:
- Added pruning of persisted rewrite ledger/queue entries with non-positive live bytes, and ensured resume selection skips zero-live segments.
- Added bounded multi-segment “debt drain” execution mode for checkpoint-kick rewrite resumes.
- Added additional scheduler observability (plan outcomes, cancel backoffs, ineffective rewrites, queue prune stats) and expanded test coverage around these behaviors.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| TreeDB/caching/vlog_generation_state.go | Adds persisted-ledger pruning for non-positive live bytes and filters zero-live segments from ledger chunk selection. |
| TreeDB/caching/vlog_generation_scheduler_test.go | Adds tests covering pruning, debt-drain multi-segment resumes, cancel backoff behavior, and new stats tracking. |
| TreeDB/caching/db.go | Implements debt-drain segment selection limits, efficacy floors for generic plans/exec, prune integration, cancel backoff tracking, and new Stats() counters. |
    if queueLen < maxSegments {
        maxSegments = queueLen
    }
    if queueLen > maxSegments {
        maxSegments = queueLen
    }
    if seg.FileID == 0 {
        continue
    }
    if seg.BytesLive <= 0 {
        continue
    }
💡 Codex Review
Reviewed commit: eaae4f61e9
TreeDB/caching/db.go (outdated)
    if db == nil || db.valueLogGenerationPolicy != uint8(backenddb.ValueLogGenerationHotWarmCold) {
        return
    }
    quiet := db.foregroundActivityQuietFor(time.Now(), vlogGenerationMaintenanceQuietWindow, vlogForegroundReadQuietWindow)
Recompute quiet-window state after retained-prune waits
quiet is sampled here before waitForRetainedValueLogPrune()/checkpoint serialization, but the later guards at TreeDB/caching/db.go:11616-11617 and 11682-11694 reuse that stale value. On a periodic runGC pass that entered while quiet and then blocks behind an active retained prune, foreground reads/writes that resume during the wait are invisible, so the pass can still reach rewrite planning/execution. Because the rewrite contexts are created only after the wait, they also will not cancel for activity that already happened, which reintroduces hot-path rewrites on the very passes this change is trying to keep quiet-window-bound.
    func (db *DB) vlogGenerationRewriteMinStaleRatioForGenericPass(totalBytes int64) float64 {
        if totalBytes < vlogGenerationRewriteEfficacyMinTotalBytes {
            return 0
        }
Keep a nonzero efficacy floor for small churn-triggered rewrites
This helper returns 0 until the value log reaches 512 MiB, and maybeRunVlogGenerationMaintenance then feeds that into the generic rewrite planner/executor. On deployments that use ValueLogRewriteTriggerChurnPerSec with a sub-512 MiB DB, a churn-triggered pass therefore runs with no stale-ratio floor and only MinSegmentStaleBytes=1, so a segment with a single overwritten value can be selected for rewrite. That defeats the new “generic passes should be efficacy-aware” goal and can still amplify bytes/wall time on small but write-heavy datasets.
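A hedged sketch of the suggested remedy (all constants illustrative, helper name hypothetical): keep a small nonzero per-segment floor below the size threshold rather than returning 0, so churn-triggered passes on small DBs cannot select a segment with a single overwritten value.

```go
package main

import "fmt"

// minStaleRatioForGenericPass sketches the review's suggestion: below the
// 512 MiB threshold, fall back to a small nonzero floor instead of 0.
// The smallDBFloor value is an assumption, not from the PR.
func minStaleRatioForGenericPass(totalBytes int64) float64 {
	const (
		minTotalBytes = 512 << 20 // size threshold described in the review
		fullFloor     = 0.20      // 20% default trigger ratio from the review
		smallDBFloor  = 0.05      // hypothetical floor for sub-threshold DBs
	)
	if totalBytes < minTotalBytes {
		return smallDBFloor
	}
	return fullFloor
}

func main() {
	fmt.Println(minStaleRatioForGenericPass(64 << 20)) // small DB: nonzero floor
	fmt.Println(minStaleRatioForGenericPass(1 << 30))  // large DB: default ratio
}
```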
Progress update on What changed
Validation
run_celestia (debug maint)
Next step
💡 Codex Review
Reviewed commit: 37f1422ce3
TreeDB/caching/db.go (outdated)
    if len(processedRewriteIDs) > 0 && stats.BytesAfter >= stats.BytesBefore {
        growth := int64(stats.BytesAfter) - int64(stats.BytesBefore)
        if growth >= vlogGenerationRewriteIneffectiveGrowthMinBytes {
            dropRewriteDebt = true
Avoid dropping queued rewrite debt on pre-GC byte growth
ValueLogRewriteStats.BytesAfter is computed before the follow-up ValueLogGC removes the zombie source segments (TreeDB/db/vlog_rewrite.go:1189-1205), so a normal successful queued rewrite often reports BytesAfter > BytesBefore by roughly the copied live bytes. With the new material_growth check here, any resumed chunk larger than 4 MiB will clear the rest of vlog_generation_state.json at lines 12082-12089 even though the later GC in this same function reclaims the old segments. On multi-segment debt-drain/rewrite-resume runs, that means one successful chunk can abandon all remaining planned segments until a full replan happens.
TreeDB/caching/db.go (outdated)
    } else if growth == 0 && stats.RecordsCopied == 0 {
        // A queued rewrite that copies nothing and does not reduce bytes is
        // effectively a no-op resume loop; drop remaining debt so the
        // scheduler can re-plan from current state instead of retrying.
        dropRewriteDebt = true
Preserve later queued segments when a resumed source ID vanished
In the supported queue-only resume path (no SelectedSegments ledger), vlogGenerationRewriteQueueChunk can hand the rewriter a source ID that has already disappeared. selectRewriteSourceSegments silently ignores missing IDs, and ValueLogRewriteOnline returns a no-op when none remain (TreeDB/db/vlog_rewrite.go:793-801, 945-963). This new no_progress branch then clears the entire remaining queue at lines 12082-12089, so one stale head entry can discard later queued segments that still exist and should be resumed.
Update on head: implemented two scheduler hardening fixes from review.
Tests added:
Validation run:
Runtime note:
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
    func (db *DB) vlogGenerationMaintenanceContext(timeout time.Duration, opts vlogGenerationMaintenanceOptions) (context.Context, context.CancelFunc) {
        // Checkpoint-kick maintenance is an explicit caller opt-in to run outside the
        // quiet-window gate. Keep this context timeout-bounded, but do not
        // self-cancel on immediate foreground activity resumes.
        if opts.bypassQuiet {
            if timeout > 0 {
                return context.WithTimeout(context.Background(), timeout)
            }
            return context.WithCancel(context.Background())
        }
        return db.foregroundMaintenanceContext(timeout)
    }

    func (db *DB) vlogGenerationRewritePlanContext(timeout time.Duration, opts vlogGenerationMaintenanceOptions) (context.Context, context.CancelFunc) {
        // Keep planner calls quiet-window-gated, but tolerate short foreground
        // activity resumes so sub-second plan scans can complete.
        if opts.bypassQuiet {
            if timeout > 0 {
                return context.WithTimeout(context.Background(), timeout)
            }
            return context.WithCancel(context.Background())
        }
        return db.foregroundMaintenanceContextWithResumeGrace(timeout, vlogGenerationRewritePlanResumeGrace)
    }
In bypass-quiet mode (checkpoint-kick), the returned contexts are based on context.Background() and don’t get canceled when the DB is closing (closeCh). Since Close() waits on db.wg after closing closeCh, a long-running plan/rewrite/GC kicked in bypass mode can delay shutdown until the timeout elapses. Consider deriving these contexts from a cancelable base that is canceled on db.closeCh (while still bypassing foreground-activity cancel semantics).
    var ctx context.Context
    var cancel context.CancelFunc
    if hadRewriteQueue && len(processedRewriteIDs) > 0 {
        ctx, cancel = context.WithTimeout(context.Background(), vlogGenerationRewriteBoundedExecTimeout)
Queued/debt-drain rewrite execution uses context.WithTimeout(context.Background(), vlogGenerationRewriteBoundedExecTimeout), which won’t be canceled on db.closeCh. Because Close() waits on db.wg, a bounded queued rewrite can still hold shutdown for up to the full timeout. Suggest using a context that ignores foreground-activity resumes but still cancels promptly on DB close (e.g., tie cancellation to closeCh).
Suggested change:
-    ctx, cancel = context.WithTimeout(context.Background(), vlogGenerationRewriteBoundedExecTimeout)
+    ctx, cancel = db.foregroundMaintenanceContext(vlogGenerationRewriteBoundedExecTimeout)
💡 Codex Review
Reviewed commit: 905e687414
    func (db *DB) schedulePendingVlogGenerationCheckpointKick() {
        if db == nil || db.closing.Load() {
            return
        }
        if !db.vlogGenerationCheckpointKickPending.CompareAndSwap(true, false) {
Respect disabled checkpoint-kick policy in deferred retries
This deferred retry helper bypasses the same admission checks that maybeKickVlogGenerationMaintenanceAfterCheckpoint() uses for checkpoint kicks. In particular, it never checks TREEDB_DISABLE_VLOG_GENERATION_CHECKPOINT_KICK (or testSkipVlogCheckpointKick), so a queued rewrite that returns context.Canceled in maybeRunVlogGenerationMaintenanceWithOptions will still set vlogGenerationCheckpointKickPending and eventually re-enter maintenance with skipCheckpoint:false. In deployments that disable checkpoint kicks to avoid extra checkpoint stalls, a canceled resume chunk will now force the very checkpoint-kick maintenance path the operator asked to suppress.
💡 Codex Review
Reviewed commit: 95f4d278fa
        ctx, cancel = context.WithTimeout(context.Background(), vlogGenerationRewriteBoundedExecTimeout)
    } else {
        ctx, cancel = db.foregroundMaintenanceContext(2 * time.Minute)
There was a problem hiding this comment.
Wire bypass-quiet context into checkpoint-kick execution
The new vlogGenerationMaintenanceContext added above is never used, so checkpoint-kick maintenance still runs the first freshly planned rewrite chunk here—and GC still does the same at db.go:12350—under foregroundMaintenanceContext. Any read/write that resumes after the checkpoint cancels those operations immediately even when opts.bypassQuiet was set specifically to let checkpoint-triggered maintenance run outside the quiet window. On sustained-ingest workloads this leaves fresh checkpoint kicks unable to execute their first rewrite chunk, and GC-only kicks have no retry path at all.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
```go
if queueLen < maxSegments {
	maxSegments = queueLen
}
if queueLen > maxSegments {
	maxSegments = queueLen
}
```
In vlogGenerationRewriteMaxSegmentsForRun, the queueLen comparison logic is redundant and effectively forces maxSegments to become queueLen for any queueLen != vlogGenerationRewriteResumeMaxSegments. That makes the initial resume cap (vlogGenerationRewriteResumeMaxSegments) irrelevant in debt-drain mode and is confusing to read/maintain. Consider replacing the two comparisons with a single, clearer assignment (e.g., cap directly to min(queueLen, vlogGenerationRewriteDebtDrainMaxSegments)) or otherwise expressing the intended policy explicitly.
Suggested change (replace both comparisons):
```go
// In debt-drain mode, cap by the current queue length and the per-run maximum.
maxSegments = queueLen
```
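Independent of the exact wording, the policy reads most clearly as an explicit min-style clamp. A minimal sketch with hypothetical names:

```go
package main

import "fmt"

// clampDebtDrainSegments caps a debt-drain run by both the current queue
// length and a per-run maximum, replacing two order-dependent comparisons
// with one explicit clamp.
func clampDebtDrainSegments(queueLen, perRunMax int) int {
	if queueLen < perRunMax {
		return queueLen
	}
	return perRunMax
}

func main() {
	fmt.Println(clampDebtDrainSegments(2, 8))  // short queue: bounded by queue length
	fmt.Println(clampDebtDrainSegments(32, 8)) // long queue: bounded by per-run max
}
```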
```go
// During checkpoint-kick debt drain, allow a bounded multi-segment rewrite
// selection so debt can converge faster than one-segment-per-pass.
```
The comment on vlogGenerationRewriteDebtDrainMaxSegments says checkpoint-kick debt drain allows a bounded multi-segment rewrite selection, but vlogGenerationRewriteMaxSegmentsForRun currently returns 1 segment for checkpoint-kick runs (opts.bypassQuiet && !opts.skipCheckpoint). Either the comment should be adjusted to match the single-segment checkpoint-kick behavior, or the selection policy should be updated if multi-segment checkpoint-kick drain is actually intended.
Suggested change:
```go
// Upper bound on the number of segments that may be selected in a single
// checkpoint-kick debt-drain rewrite pass. The current selection policy
// still runs with a single segment per pass; this limit exists to cap any
// future multi-segment checkpoint-kick behavior.
```
💡 Codex Review
Reviewed commit: 41976c452a
TreeDB/caching/db.go
Outdated
```go
db.maybeRunVlogGenerationMaintenanceWithOptions(true, vlogGenerationMaintenanceOptions{
	bypassQuiet:           true,
	skipRetainedPruneWait: true,
	// This path retries a checkpoint-triggered kick that collided with an
	// already-active maintenance pass.
```
Wait for active checkpoints before retrying a pending kick
Unlike maybeKickVlogGenerationMaintenanceAfterCheckpoint() in the same file, this deferred retry path does not wait on checkpointCond before re-entering maintenance. If an auto/user checkpoint is already running, maybeRunVlogGenerationMaintenanceWithOptions immediately bails out at db.go:11736-11738, and because line 11644 has already cleared vlogGenerationCheckpointKickPending, the queued retry is lost instead of waiting for the checkpoint to finish. In that interleaving, a foreground-canceled rewrite resume can miss its only checkpoint-kick follow-up until some later checkpoint or quiet periodic pass happens.
Update on
Validation:
Maint behavior note:
I’ll next target the
💡 Codex Review
Reviewed commit: 975af29196
TreeDB/caching/db.go
Outdated
```go
// A new collision re-queued pending=true. Consume that token and retry.
if !db.vlogGenerationCheckpointKickPending.CompareAndSwap(true, false) {
	return
}
time.Sleep(time.Duration(attempt+1) * vlogGenerationPendingKickRetryStep)
```
Keep pending checkpoint kicks armed after retry budget expires
Fresh evidence relative to the earlier active-collision handling: this retry loop still drops a checkpoint-triggered pass whenever the conflicting maintenance run lasts longer than the 4 immediate retries here (about 50ms total). maybeRunVlogGenerationMaintenanceWithOptions re-sets vlogGenerationCheckpointKickPending on each collision, but these lines immediately consume that bit again before sleeping; once the loop exits, no state remains to retry after the active planner/rewrite/GC pass finishes. Since those passes can run for seconds (30s planner/GC timeouts and 2m rewrite timeout in the same file), a normal long maintenance run can still silently discard the queued checkpoint kick until some later checkpoint or quiet periodic pass happens.
This reverts commit 975af29.
Follow-up update: I tested and then reverted. What happened with
Branch head is now
Local validation after revert is clean (
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
```go
// If a queued rewrite is pending, do not run it while the bucket is
// empty; that defeats the whole point of a bounded executor.
if budgetTokens <= 0 && len(rewriteQueue) > 0 {
	db.vlogGenerationSchedulerState.Store(vlogGenerationSchedulerIdle)
	return nil
```
In the queued-rewrite path, this early return when budgetTokens <= 0 skips executing a rewrite, but vlogGenerationLastRewriteUnixNano has already been updated earlier in the same closure. That can throttle the next eligible queued rewrite attempt and makes treedb.cache.vlog_generation.rewrite.last_unix_nano look like a rewrite ran when it didn’t. Consider moving the vlogGenerationLastRewriteUnixNano.Store(...) update to just before the actual ValueLogRewriteOnline call (or only when a rewrite is about to run).
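The suggested ordering can be sketched as: record the last-rewrite timestamp only when a rewrite actually executes. Type and field names below are illustrative stand-ins for the DB's internals:

```go
package main

import (
	"fmt"
	"time"
)

type scheduler struct {
	lastRewriteUnixNano int64
	budgetTokens        int
}

// runQueuedRewrite skips without touching the last-rewrite timestamp when
// the budget bucket is empty, so the metric never claims a rewrite ran.
func (s *scheduler) runQueuedRewrite(queueLen int, rewrite func()) bool {
	if s.budgetTokens <= 0 && queueLen > 0 {
		return false // budget empty: skip, timestamp untouched
	}
	s.lastRewriteUnixNano = time.Now().UnixNano() // set just before real work
	rewrite()
	return true
}

func main() {
	s := &scheduler{budgetTokens: 0}
	ran := s.runQueuedRewrite(3, func() {})
	fmt.Println(ran, s.lastRewriteUnixNano == 0) // skipped pass leaves timestamp zero
}
```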
```go
db.debugVlogMaintf("rewrite_err reason=%s err=%v dur_ms=%.3f", vlogGenerationReasonString(reason), err, float64(time.Since(rewriteStart).Microseconds())/1000)
if errors.Is(err, context.Canceled) {
	db.observeVlogGenerationRewriteCanceled()
	if len(processedRewriteIDs) > 0 {
		// A canceled rewrite that already selected a queued chunk should
		// immediately queue a checkpoint-kick retry. The retry executes
		// as resumable debt with bounded non-cancel semantics.
		db.vlogGenerationCheckpointKickPending.Store(true)
	}
```
Rewrite cancellation/backoff handling only checks errors.Is(err, context.Canceled). Since some rewrite executions use context.WithTimeout (e.g., the bounded queued exec timeout), timeouts will surface as context.DeadlineExceeded and currently won’t be counted as canceled/backed-off (and will instead be treated as a hard error). If the intent is to treat timeouts similarly to foreground-cancel, consider checking context.DeadlineExceeded here as well (or reusing isVlogGenerationPlannerCanceled).
💡 Codex Review
Reviewed commit: 1715486dbc
```go
if trigger := db.valueLogRewriteTriggerBytes; trigger > 0 {
	retained, bytes := db.valueLogRetainedStats()
	if bytes < trigger && retained < 2 {
```
Let checkpoint kicks start stale-ratio/churn rewrites below byte floor
Fresh evidence from this change: maybeRunVlogGenerationMaintenanceWithOptions now clears shouldRewrite on non-quiet periodic passes (TreeDB/caching/db.go:11789-11791), so checkpoint kicks are the only rewrite path under sustained foreground activity. This fast-path still skips every kick whenever bytes < valueLogRewriteTriggerBytes && retained < 2, even though shouldRunVlogGenerationRewrite in the same file can request rewrite from stale-ratio or churn alone (db.go:12612-12619). On write-heavy WAL-on workloads whose vlog is still below the total-bytes threshold, checkpoint-triggered maintenance will therefore never start rewrite debt until the DB goes quiet or grows past the byte floor.
```go
if gcStats.BytesDeleted > 0 {
	gcBytesDeleted = int64(gcStats.BytesDeleted)
	effectiveBytesAfter -= gcBytesDeleted
```
Don't offset harmful rewrites with unrelated GC deletions
This subtracts the entire post-rewrite ValueLogGC deletion total from stats.BytesAfter, but ValueLogGC is a whole-DB sweep (TreeDB/db/vlog_gc.go:40-46) and BytesDeleted includes any unrelated unreachable segments it happens to reclaim. If a queued rewrite grows its own source chunk while the same GC pass also deletes older garbage from previous work, effectiveBytesAfter can look smaller than effectiveBytesBefore, so the new material_growth/ineffective-run logic never fires and the remaining debt queue is kept even though this rewrite cycle was harmful.
💡 Codex Review
Reviewed commit: f78133892c
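One repair shape is to subtract only GC deletions attributable to the rewrite's own source segments. A minimal sketch of that accounting, with illustrative segment IDs and names (the real per-segment attribution API is not part of this PR):

```go
package main

import "fmt"

// effectiveBytesAfterRewrite subtracts only GC deletions attributable to
// the rewrite's own source segments, so unrelated whole-DB GC reclaim
// cannot mask a rewrite that grew its chunk.
func effectiveBytesAfterRewrite(bytesAfter int64, gcDeletedBySegment map[uint64]int64, rewrittenSegments []uint64) int64 {
	for _, id := range rewrittenSegments {
		bytesAfter -= gcDeletedBySegment[id]
	}
	return bytesAfter
}

func main() {
	gc := map[uint64]int64{
		11: 100, // garbage this rewrite produced
		99: 500, // unrelated older garbage swept by the same GC pass
	}
	// Whole-total subtraction would report 1000-600=400; attributable-only
	// accounting reports 1000-100=900 and keeps the growth visible.
	fmt.Println(effectiveBytesAfterRewrite(1000, gc, []uint64{11}))
}
```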
```go
if len(rewriteQueue) == 0 {
	if trigger := db.valueLogRewriteTriggerBytes; trigger > 0 {
		retained, bytes := db.valueLogRetainedStats()
		if bytes < trigger && retained < 2 {
```
Skip checkpoint-kick retries when rewrite budget is empty
When rewrite budgeting is enabled and the token bucket is depleted, maybeRunVlogGenerationMaintenanceWithOptions exits at TreeDB/caching/db.go:12134-12140 before rewriting any queued or freshly planned work. This checkpoint-kick fast-path never checks that budget first, so a DB with pending rewrite debt (or one merely above the byte floor) will still launch another checkpoint-triggered maintenance pass and re-enter Checkpoint() with zero chance of making progress. On deployments that checkpoint frequently with a tight rewrite budget, that turns every checkpoint into an extra checkpoint barrier for no benefit until tokens refill.
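A budget pre-check on the fast path could look like the following sketch. All names and thresholds are illustrative stand-ins for the DB's fields:

```go
package main

import "fmt"

// shouldKickAfterCheckpoint adds a budget check to the checkpoint-kick
// fast path: with rewrite budgeting enabled and no tokens left, another
// checkpoint-triggered pass cannot make progress, so it is skipped.
func shouldKickAfterCheckpoint(budgetEnabled bool, budgetTokens, retainedBytes, triggerBytes int64, retainedSegments int) bool {
	if budgetEnabled && budgetTokens <= 0 {
		return false // no budget: a kick is a pure checkpoint barrier
	}
	if triggerBytes > 0 && retainedBytes < triggerBytes && retainedSegments < 2 {
		return false // below the byte floor with a tiny retained set
	}
	return true
}

func main() {
	fmt.Println(shouldKickAfterCheckpoint(true, 0, 1<<30, 1<<20, 8))   // debt present, no tokens
	fmt.Println(shouldKickAfterCheckpoint(true, 100, 1<<30, 1<<20, 8)) // tokens available
}
```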
Update on
Results on this head:
Compared with the kept baseline on this branch (
Important forensic note: on both runs, application-db maintenance did not execute an online rewrite at all (
Pushed
Summary:
Validation on this head:
This is a keep from my side: it fixes the specific late queued-debt mismatch that remained after the earlier generic planner floor change.
Pull request overview
Copilot reviewed 5 out of 6 changed files in this pull request and generated 3 comments.
```go
	}
}

func TestVlogGenerationRewriteMinStaleRatioForGenericPass_UsesHigherConfiguredRatio(t *testing.T) {
```
This test name is misleading: it says it "UsesHigherConfiguredRatio", but the configured ratio here (0.80) is lower than the expected floor (vlogGenerationRewriteGenericMinSegmentStaleRatio == 0.85). Consider renaming the test to reflect that the generic quality floor overrides a lower configured ratio.
Suggested change:
```go
func TestVlogGenerationRewriteMinStaleRatioForGenericPass_UsesQualityFloorOverLowerConfiguredRatio(t *testing.T) {
```
```go
if queueLen < maxSegments {
	maxSegments = queueLen
}
if queueLen > maxSegments {
	maxSegments = queueLen
}
```
In vlogGenerationRewriteMaxSegmentsForRun(), the if queueLen > maxSegments { maxSegments = queueLen } block is redundant because maxSegments is either unchanged (vlogGenerationRewriteResumeMaxSegments) or already set to queueLen by the preceding if queueLen < maxSegments branch. Consider simplifying this to a single clamp (e.g., maxSegments = min(queueLen, vlogGenerationRewriteResumeMaxSegments) before the debt-drain cap) to make the intent clearer and avoid confusing future edits.
Suggested change (replace both comparisons):
```go
maxSegments = queueLen
```
```go
func (db *DB) vlogGenerationMaintenanceContext(timeout time.Duration, opts vlogGenerationMaintenanceOptions) (context.Context, context.CancelFunc) {
	// Checkpoint-kick maintenance is an explicit caller opt-in to run outside the
	// quiet-window gate. Keep this context timeout-bounded, but do not
	// self-cancel on immediate foreground activity resumes.
	if opts.bypassQuiet {
		if timeout > 0 {
			return context.WithTimeout(context.Background(), timeout)
		}
		return context.WithCancel(context.Background())
	}
	return db.foregroundMaintenanceContext(timeout)
}
```
vlogGenerationMaintenanceContext() is introduced here but appears unused (no call sites in TreeDB/caching). If it’s not intended for near-term use, consider removing it to keep the maintenance context behavior centralized in the actually-used helpers (foregroundMaintenanceContext / vlogGenerationRewritePlanContext).
Suggested change: remove the unused `vlogGenerationMaintenanceContext` helper entirely.
💡 Codex Review
Reviewed commit: 672378881e
```go
// Checkpoint-kick retries should keep each debt-drain run small to reduce
// write amplification when foreground ingest is still active.
if opts.bypassQuiet && !opts.skipCheckpoint {
	return 1
```
Let checkpoint debt-drain runs select multiple segments
This guard forces vlogGenerationRewriteMaxSegmentsForRun to return 1 for every rewriteDebtDrain caller in this commit: the only call sites that set rewriteDebtDrain=true (TreeDB/caching/db.go:11823, 11889, 11975, 12017, 12033, 13147) also pass bypassQuiet=true and skipCheckpoint=false. That means the new checkpoint-kick “multi-segment debt drain” path never actually rewrites more than one queued segment per pass, so large persisted queues still drain at the old one-segment rate.
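The shape of the fix the comment points at is to branch on the debt-drain flag before the checkpoint-kick guard. A sketch with illustrative option fields and an assumed per-run cap of 4:

```go
package main

import "fmt"

type runOptions struct {
	bypassQuiet      bool
	skipCheckpoint   bool
	rewriteDebtDrain bool
}

const debtDrainMaxSegments = 4

// maxSegmentsForRun lets explicit debt-drain runs select multiple
// segments (capped by queue length and the per-run maximum), while
// non-drain checkpoint kicks stay single-segment.
func maxSegmentsForRun(opts runOptions, queueLen int) int {
	if opts.rewriteDebtDrain {
		if queueLen < debtDrainMaxSegments {
			return queueLen
		}
		return debtDrainMaxSegments
	}
	if opts.bypassQuiet && !opts.skipCheckpoint {
		// Non-drain checkpoint kicks stay small to bound write amplification.
		return 1
	}
	return 1
}

func main() {
	drain := runOptions{bypassQuiet: true, rewriteDebtDrain: true}
	kick := runOptions{bypassQuiet: true}
	fmt.Println(maxSegmentsForRun(drain, 10), maxSegmentsForRun(kick, 10))
}
```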
```go
if hasExecutableRewriteQueue {
	shouldRewrite = true
	reason = vlogGenerationReasonRewriteResume
```
Preserve generic efficacy floor on queued rewrite debt
Overwriting reason with vlogGenerationReasonRewriteResume here loses whether the persisted queue originally came from the total-bytes or churn triggers. The later quality filter at TreeDB/caching/db.go:12693 calls vlogGenerationRewriteMinStaleRatioForQueuedDebt, and that helper only applies the stricter generic floor for vlogGenerationReasonTotalBytes/vlogGenerationReasonChurn (TreeDB/caching/db.go:13320-13326). As a result, only the first chunk of a generic rewrite is gated by the new 0.85 efficacy floor; resumed chunks fall back to the looser 0.50 floor and can keep rewriting much less stale segments.
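One way to preserve the floor across resumes is to persist the original trigger reason with each queue entry and gate on that rather than on the resume reason. A sketch — the reason constants and the 0.85/0.50 floors mirror the values discussed above, but the types are illustrative:

```go
package main

import "fmt"

type rewriteReason int

const (
	reasonTotalBytes rewriteReason = iota
	reasonChurn
	reasonStaleRatio
)

// queuedDebt carries the reason that originally enqueued the segment, so
// resumed chunks are gated by the same floor as the first chunk.
type queuedDebt struct {
	fileID     uint64
	origReason rewriteReason
}

func minStaleRatioForQueuedDebt(d queuedDebt) float64 {
	switch d.origReason {
	case reasonTotalBytes, reasonChurn:
		return 0.85 // generic passes keep the strict floor across resumes
	default:
		return 0.50
	}
}

func main() {
	fmt.Println(minStaleRatioForQueuedDebt(queuedDebt{fileID: 11, origReason: reasonChurn}))
	fmt.Println(minStaleRatioForQueuedDebt(queuedDebt{fileID: 22, origReason: reasonStaleRatio}))
}
```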
```go
} else if shouldDeferVlogGenerationRewritePlanForAge(plan, planOpts.MinSegmentAge) {
	db.setVlogGenerationRewriteAgeBlockedUntil(now.Add(plan.AgeBlockedMinRemainingAge))
```
Schedule age-blocked retries from plan completion time
This retry deadline is anchored to the pass's earlier now value instead of to when ValueLogRewritePlan finishes; the same pattern appears again at TreeDB/caching/db.go:12578. On large trees or slower disks, planning can take seconds, so a result like AgeBlockedMinRemainingAge=10s can be converted into an immediate retry even though the youngest candidate still has about 10 seconds left before it reaches MinSegmentAge, causing repeated premature full-tree replans.
Follow-on decomposition result is now tracked in stacked PR #843. Summary:
Most important new evidence is deterministic, not blind runtime sampling:
So #843 is the promoted candidate path. Further iteration should happen there rather than on the old mixed head.
Summary
This PR takes a staged Maintenance V2 step for runtime value-log debt:
Changes
- `MinSegmentStaleRatio` for generic passes (with size-aware fallback).
- `MinSegmentStaleBytes` floor.

Tests

- `go test ./TreeDB/caching -count=1`
- `go test ./TreeDB/db ./TreeDB/internal/valuelog ./TreeDB/tree -count=1`
- `go vet ./...`

run_celestia evidence (branch head)
Code: `ab72ba68`

fast

- `/home/mikers/.celestia-app-mainnet-treedb-20260318154110`
- `duration_seconds=268`
- `max_rss_kb=7764772`
- `end_app_bytes=4749147116`
- `du -sb application.db` = 4749147116
- vlog-gc -rw: 4749150764
- vlog-rewrite -rw: 2494985619

wal_on_fast

- `/home/mikers/.celestia-app-mainnet-treedb-20260318154549`
- `duration_seconds=316`
- `max_rss_kb=8770156`
- `end_app_bytes=5021754982`
- `du -sb application.db` = 5021754982
- vlog-gc -rw: 5021754574
- vlog-rewrite -rw: 2698181953

Notes on rejected follow-up (not in this PR)
A subsequent local experiment that used non-foreground-cancel contexts for full debt-drain planning+execution caused major wall/end-size regressions in fast (`duration_seconds` up to 440-781s, `end_app_bytes` ~6.0-6.2GB) and was reverted locally.