decomp: isolate layer-a plus layer-c runtime rewrite maintenance #843
snissn wants to merge 55 commits into pr/vlog-maint-v2-runtime-debt from
Conversation
Pull request overview
This PR promotes the decomposed “A + C” runtime value-log rewrite maintenance path: it improves correctness of rewrite planning for explicit SourceFileIDs by estimating live bytes when stale filters are applied (layer A), and changes checkpoint-kick rewrite resume behavior to process a single queued segment then restage remaining debt for later confirmation (layer C). It also adds tests/bench coverage around the new behavior and introduces additional plan telemetry for “age-blocked” segments.
Changes:
- Add age-blocked selection stats to ValueLogRewritePlan, and estimate live bytes for explicit SourceFileIDs when stale filters require it.
- Persist and schedule "rewrite stage pending" state so checkpoint-kick resume rewrites process one segment and restage remaining debt for later confirmation.
- Add/extend tests and a benchmark to validate restaging and retained-source cleanup behavior, including a real-backend checkpoint-kick test.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| TreeDB/db/vlog_rewrite.go | Adds age-blocked plan fields, selection stats plumbing, and conditional live-byte estimation for SourceFileIDs with stale filters. |
| TreeDB/db/vlog_rewrite_test.go | Updates selection tests for new return signature/stats; adds tests covering SourceFileIDs + stale filters + live estimation. |
| TreeDB/caching/db.go | Adds retained-source cleanup after rewrite+GC and integrates restaging behavior into checkpoint-kick maintenance execution. |
| TreeDB/caching/vlog_generation_state.go | Persists stage-pending/observed timestamps; adds restage helper and ledger stabilization helpers. |
| TreeDB/caching/vlog_generation_state_test.go | Updates state loader tests for new stage fields. |
| TreeDB/caching/vlog_retained_manager_drift_test.go | Adds a test ensuring processed retained rewrite sources are cleaned up when backend references are gone. |
| TreeDB/caching/vlog_generation_scheduler_test.go | Adds a real-backend test for checkpoint-kick restaging and a benchmark harness for drain window behavior. |
```go
for i := 0; i < b.N; i++ {
	db, recorder, initialQueue := seedRealBackendRewriteQueueScenario(b, 25*time.Millisecond)
	b.StopTimer()
	var (
		totalStart = time.Now()
		maxKick    time.Duration
```
In BenchmarkVlogGenerationRewriteQueueDrainCheckpointKick, the expensive setup (seedRealBackendRewriteQueueScenario) currently runs while the benchmark timer is active, so the reported timings and metrics include DB creation and workload-seeding time. Also, b.ReportMetric is called inside the per-iteration loop; only the last call per unit is retained in the final BenchmarkResult, so earlier iterations are effectively discarded. Consider stopping the timer before seeding, starting it immediately before the drain loop (and setting totalStart after StartTimer), and reporting metrics once per benchmark run (or restructure so b.N controls the drain work rather than repeating the full setup under the timer).
```go
func mergeConfirmedVlogGenerationRewriteLedgerSegments(prev, planned []backenddb.ValueLogRewritePlanSegment) []backenddb.ValueLogRewritePlanSegment {
	if len(prev) == 0 || len(planned) == 0 {
		return nil
	}
	stable := stableVlogGenerationRewriteLedgerSegments(prev, planned)
	if len(stable) == 0 {
		return nil
	}
	if len(stable) == len(planned) {
		return stable
	}
	seen := make(map[uint32]struct{}, len(stable))
	out := make([]backenddb.ValueLogRewritePlanSegment, 0, len(planned))
	for _, seg := range stable {
		if seg.FileID == 0 || seg.BytesLive <= 0 {
			continue
		}
		seen[seg.FileID] = struct{}{}
		out = append(out, seg)
	}
	for _, seg := range planned {
		if seg.FileID == 0 || seg.BytesLive <= 0 {
			continue
		}
		if _, ok := seen[seg.FileID]; ok {
			continue
		}
		out = append(out, seg)
	}
	return out
}
```
mergeConfirmedVlogGenerationRewriteLedgerSegments is introduced here but does not appear to be referenced anywhere in the repository. If it’s not part of the intended behavior in this PR, consider removing it to avoid carrying unused code (or add the call site if it is required).
💡 Codex Review
gomap/TreeDB/db/vlog_rewrite.go
Lines 950 to 954 in 0b16b54
When SourceFileIDs is combined with stale filters, rewritePlanNeedsLiveEstimate() now routes the explicit-ID case through selectRewriteSourceSegments, and this block starts enforcing MaxSourceSegments/MaxSourceBytes even though ValueLogRewriteOnlineOptions says those caps apply only when SourceFileIDs is empty (TreeDB/db/vlog_rewrite.go:102-106). Any caller that narrows a preconfigured sparse-plan options struct to a fixed set of file IDs will now silently rewrite only a subset of the requested IDs, leaving the rest of that explicit rewrite set untouched.
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
```go
if len(db.vlogGenerationRewriteLedger) == 0 || len(db.vlogGenerationRewriteQueue) == 0 {
	return len(db.vlogGenerationRewriteQueue), false, nil
}
```
Restage queue-only debt even when no ledger was persisted
This early return means the new restage path only works for queues backed by SelectedSegments. The scheduler still creates queue-only debt via setVlogGenerationRewriteQueue(rewritePlan.SourceFileIDs) in TreeDB/caching/db.go:12943-12945 whenever a planner returns SourceFileIDs without SelectedSegments—which is still allowed by ValueLogRewritePlan (SelectedSegments is documented as present only "when" live estimates were produced in TreeDB/db/vlog_rewrite.go:56-60). In that case a checkpoint-coupled resume never sets RewriteStagePending, so the remaining tail is not deferred and this commit’s hot-window bound disappears for those planners or reopened queue-only states.
Added follow-on commit
What changed:
New focused validation:
Local gate:
Checkpoint-kick benchmark after this change:
So the narrow refresh did not reopen the hot checkpoint-kick drain behavior.
Sanity:
Caveat:
💡 Codex Review
Lines 12773 to 12776 in 8858d34
If a staged confirmation pass narrows the debt set (for example, the planner now returns only file 22 out of a staged [11,22]), this only updates the local plan. The execution path later still resumes from the persisted rewriteQueue/vlogGenerationRewriteLedger, so vlogGenerationRewriteLedgerChunk can pick and rewrite a segment that the confirmation planner just dropped. In that case the new two-phase “observe, then confirm” flow does not actually constrain which staged segments are executed.
Lines 12778 to 12780 in 8858d34
When confirmed is empty here, the code only changes reason and leaves RewriteStagePending plus the old observedAt intact. After that, generic periodic/checkpoint passes keep returning early because staged work is still pending, while scheduleDueVlogGenerationDeferredMaintenance immediately requeues another stage-confirm wake because that timestamp is already overdue. A segment that stops qualifying during the confirmation delay can therefore trap the scheduler in a replan loop and block later GC/rewrite work.
TreeDB/caching/db.go
```go
if hasPlanner && db.shouldRefreshQueuedVlogGenerationRewriteLedger(opts, stagePending, rewriteQueue, rewriteLedger) {
	queueMinStaleRatio := db.vlogGenerationRewriteMinStaleRatioForQueuedDebt(totalBytes, vlogGenerationReasonRewriteResume)
	refreshedPlan, refreshErr := db.refreshQueuedVlogGenerationRewritePlan(planner, rewriteQueue, queueMinStaleRatio, opts)
```
Gate queued-ledger refresh behind the quiet-window check
This refresh planner runs before the later !quiet && !opts.bypassQuiet guard in maybeRunVlogGenerationMaintenanceWithOptions. In WAL-off/relaxed mode, a periodic pass with queued rewrite debt can now call ValueLogRewritePlan during hot ingest, and after this commit that planner may rescan live bytes even for explicit SourceFileIDs. That defeats the quiet-window contract and reintroduces expensive tree scans on the foreground path even when the pass would have skipped rewrite work entirely.
Pushed. This adds a frozen-copy maintenance replay path and folds in two runtime fixes validated against a real copied Celestia DB.
Local gates on this head:
Most important frozen-copy result on a real staged Celestia DB copy:
This is the first frozen-copy replay where staged confirm on a real copied Celestia DB consumed exactly one queued segment, GC deleted it, restaged the remaining tail, and returned to idle instead of chain-draining through repeated
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.
```go
// Checkpoint-coupled resumable work should consume one chunk, then
// re-confirm the remaining tail later instead of draining multiple
// queued segments back-to-back in the same hot ingest window.
if hadRewriteQueue && !opts.skipCheckpoint {
```
This restage trigger is currently if hadRewriteQueue && !opts.skipCheckpoint, which will also fire on periodic maintenance passes (default opts.skipCheckpoint=false) whenever there is remaining queued debt. That seems broader than the intent described in the nearby comment ("checkpoint-coupled") and the PR description (restage after checkpoint-kick). Consider tightening the condition (e.g., require opts.bypassQuiet / checkpoint-kick context) so periodic resume behavior doesn’t unexpectedly transition into staged-confirm draining.
```diff
-if hadRewriteQueue && !opts.skipCheckpoint {
+if hadRewriteQueue && !opts.skipCheckpoint && opts.bypassQuiet {
```
```go
db, recorder, initialQueue := seedRealBackendRewriteQueueScenario(b, 25*time.Millisecond)
b.StopTimer()
```
Benchmark setup is included in the timed portion and the helper uses tb.Cleanup, so each iteration schedules DB/backend closes only at benchmark end. This can keep many DBs open during the run (potential FD/memory exhaustion) and also skews the benchmark timing. Stop the timer before calling seedRealBackendRewriteQueueScenario and explicitly close/cleanup the returned DB/backend per iteration (e.g., return a cleanup func instead of using tb.Cleanup in the helper when running under *testing.B).
```diff
-db, recorder, initialQueue := seedRealBackendRewriteQueueScenario(b, 25*time.Millisecond)
-b.StopTimer()
+b.StopTimer()
+db, recorder, initialQueue := seedRealBackendRewriteQueueScenario(b, 25*time.Millisecond)
```
```go
case len(seedPlan.SelectedSegments) > 0:
	stageObservedAt := int64(0)
	if *seedStagePending {
		stageObservedAt = time.Now().Add(-*seedStageObservedAgo).UnixNano()
```
-seed-stage-observed-ago is described as a duration "in the past", but negative values will produce an observedAt timestamp in the future, delaying stage confirmation in a confusing way. Consider validating that seedStageObservedAgo >= 0 (or clamping to 0) before computing stageObservedAt.
```diff
-stageObservedAt = time.Now().Add(-*seedStageObservedAgo).UnixNano()
+observedAgo := *seedStageObservedAgo
+if observedAgo < 0 {
+	observedAgo = 0
+}
+stageObservedAt = time.Now().Add(-observedAgo).UnixNano()
```
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 52041e4152
```go
if !db.valueLogRetained(path) {
	continue
}
if db.cleanupOrphanedRetainedValueLog(path) {
```
Respect backend snapshot pins before deleting processed segments
When a caller keeps a cached AcquireSnapshot()/iterator open across rewrite maintenance, the backend snapshot still expects the old value-log file to remain readable until its pinned ValueLogSet is released. This new path bypasses the backend manager's zombie/pin handling and deletes the file directly via cleanupOrphanedRetainedValueLog, using the cache's separate valueLogReader; as a result, reads from the older snapshot can start failing with missing value-log segments even though the snapshot is still live.
Pushed. This narrows stage-confirm further: when there is already a persisted staged rewrite ledger, the due stage-confirm path now uses that ledger directly instead of issuing a hot-path explicit-
Why this matters:
Local gates:
Frozen-copy delta on the same staged Celestia DB copy:
Most important trace line on the new head:
So this change preserves the good staged-tail behavior from
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 32578e987d
```go
dropped := len(rewriteQueue)
if err := db.setVlogGenerationRewriteLedger(nil); err != nil {
```
Preserve queue-only rewrite plans during refresh
This prune path treats any refresh result with SelectedSegments == 0 as “nothing remains”, but ValueLogRewritePlan still allows SourceFileIDs without SelectedSegments (TreeDB/db/vlog_rewrite.go:54-60). If a periodic/reopen refresh gets that queue-only shape from the planner, the code clears the persisted resume queue here even though the planner is still returning actionable source IDs, so the remaining rewrite debt is silently lost until some later full replan recreates it.
```go
var rewriteLedger []backenddb.ValueLogRewritePlanSegment
if len(rewriteQueue) > 1 {
	rewriteLedger, _ = db.currentVlogGenerationRewriteLedger()
}
```
Load the staged ledger for one-segment confirmations
The stale-ratio path can now persist a stage-pending ledger with exactly one queued file, but this code only loads the persisted ledger when len(rewriteQueue) > 1. On the next due rewrite_stage_confirm pass, stageConfirmUsePersistedLedger therefore stays false for one-entry stages, so any transient ValueLogRewritePlan timeout/error leaves hasExecutableRewriteQueue false and the pass does nothing even though the exact segment to rewrite is already on disk. Multi-segment confirms avoid that planner dependency; single-segment confirms do not.
Pushed.
What changed:
Local gate:
Frozen-copy replay result on the completed
This explains the earlier WAL frozen replay failure mode: the persisted queued debt on reopen could already be stale/deleted, and we were spending a maintenance opportunity on nonexistent source IDs. This patch makes that case cheap.
Pull request overview
Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.
```go
func (db *DB) cleanupProcessedRetainedRewriteSources(reason uint32, processedRewriteIDs []uint32) {
	if db == nil || len(processedRewriteIDs) == 0 || !db.valueLogEnabled() {
		return
	}
	liveIDs, err := db.collectValueLogLiveIDsUntil(db.lastForegroundWriteUnixNano.Load())
	if err != nil {
		db.debugVlogMaintf(
			"rewrite_retained_cleanup_scan_err reason=%s source_ids=%d err=%v",
			vlogGenerationReasonString(reason),
			len(processedRewriteIDs),
			err,
		)
		return
```
cleanupProcessedRetainedRewriteSources() calls collectValueLogLiveIDsUntil(), which performs a full pointer-projection scan of the tree (and leaf-ref scan) to rebuild live segment IDs. This runs after every online rewrite pass (and after GC), so it can add an O(size-of-DB) scan to a maintenance step that’s otherwise intended to be bounded. Consider avoiding the full scan here (e.g., rely on rewrite outcomes + backend segment presence/zombie state, or defer retained cleanup to existing periodic retained pruning where the scan is already expected), or at least gate the scan behind a debug/env toggle or a cheap heuristic (only when the retained set actually contains the processed IDs).
```go
db, recorder, initialQueue := seedRealBackendRewriteQueueScenario(b, 25*time.Millisecond)
b.StopTimer()
```
BenchmarkVlogGenerationRewriteQueueDrainCheckpointKick includes seedRealBackendRewriteQueueScenario() in the timed portion (it’s called before b.StopTimer()). This makes the benchmark’s ns/op dominated by setup (DB creation + synthetic workload) rather than the checkpoint-kick drain behavior the metrics are trying to capture. Move the seeding/setup under b.StopTimer()/b.StartTimer() (or use b.ResetTimer() after setup) so the benchmark timing reflects only the drain loop.
```diff
-db, recorder, initialQueue := seedRealBackendRewriteQueueScenario(b, 25*time.Millisecond)
-b.StopTimer()
+b.StopTimer()
+db, recorder, initialQueue := seedRealBackendRewriteQueueScenario(b, 25*time.Millisecond)
```
```go
beforeState, err := db.DebugValueLogGenerationState()
if err != nil {
	fatalf("debug state before run: %v", err)
}
beforeStats := filterVlogGenerationStats(db.Stats())

report := valueLogMaintOnceReport{
	Dir:         dir,
	Mode:        *mode,
	BeforeState: beforeState,
	BeforeStats: beforeStats,
}
```
runVlogMaintOnce records report.BeforeState/BeforeStats before applying -clear-state or -seed-from-plan. When either flag is used, the printed/JSON "before_state" no longer reflects the actual state that the maintenance pass ran with, which can make offline forensics misleading. Consider capturing the initial state separately (e.g., InitialState) or recomputing BeforeState/BeforeStats after seeding/clearing and before DebugRunValueLogGenerationMaintenanceOnce().
💡 Codex Review
Lines 12339 to 12343 in d6d0a50
TREEDB_DISABLE_VLOG_GENERATION_DEFERRED now disables the timer-based stage/age schedulers, but this exit path ignores the same flag and can immediately enqueue rewrite_stage_confirm_exit or rewrite_age_blocked_exit after any acquired pass. That means treemap vlog-maint-once -disable-auto-deferred can still run extra maintenance in the background on copied DBs with overdue staged/age-blocked debt, defeating the new "one pass" offline-analysis mode.
```go
	}
	db.debugVlogMaintf("gc_after_rewrite_done reason=%s dur_ms=%.3f", vlogGenerationReasonString(reason), float64(time.Since(gcStart).Microseconds())/1000)
}
db.cleanupProcessedRetainedRewriteSources(reason, processedRewriteIDs)
```
Avoid full live-ID scans after each queued rewrite
When processedRewriteIDs is non-empty, this new call synchronously runs cleanupProcessedRetainedRewriteSources, and that helper immediately calls collectValueLogLiveIDsUntil to rescan the entire tree for live value-log IDs. On large DBs with checkpoint-kick or resumed rewrite debt, every single-segment rewrite now pays an extra full pointer scan before the maintenance pass returns, so the new restage logic no longer keeps the hot-window latency bounded to one rewritten segment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c869e0a08d
```go
if hasPlanner && (db.shouldRefreshQueuedVlogGenerationRewriteLedger(opts, stagePending, rewriteQueue, rewriteLedger) || stageConfirmRefresh) {
	queueMinStaleRatio := db.vlogGenerationRewriteMinStaleRatioForQueuedDebt(totalBytes, vlogGenerationReasonRewriteResume)
	refreshedPlan, refreshErr := db.refreshQueuedVlogGenerationRewritePlan(planner, rewriteQueue, queueMinStaleRatio, opts)
```
Replan staged stale-ratio confirms with the trigger ratio
When a stale-ratio pass stages exactly one segment, the later rewrite_stage_confirm refresh goes through this path and replans it with vlogGenerationRewriteMinStaleRatioForQueuedDebt(..., vlogGenerationReasonRewriteResume), which is 50% by default. That is stricter than the stale-ratio trigger that created the stage (default 20%), so any staged segment in the 20%-50% band is filtered out here even though it was intentionally staged. Because the stageConfirmRefresh empty-selection branch keeps the staged queue and skips the sparse-plan fallback, that debt can get stuck pending forever instead of ever being rewritten.
```go
for i := 0; i < k+1; i++ {
	cur := binary.LittleEndian.Uint32(prefix[off : off+4])
```
Validate grouped RIDs before fast-decoding compressed frames
This new ReadAppend fast path skips over the grouped RID table and only parses offsets, so it no longer rejects frames whose RID entries are structurally invalid. The previous fallback (ReadAtWithDict) validated each RID, which means a compressed grouped record with zeroed/corrupt RID bytes but otherwise readable payload now returns data through ReadAppend/ReadUnsafeAppendBatch instead of ErrCorrupt (including when the bad frame still has a matching CRC, or when checksum verification is disabled). That weakens value-log integrity on iterator and batch-read paths.
Pull request overview
Copilot reviewed 37 out of 37 changed files in this pull request and generated 2 comments.
```go
// AcquireSealedLazyMmapBudget raises this manager's sealed lazy-mmap limits
// until the returned release func is called. It only ever increases the global
// defaults for this manager instance; ordinary readers keep using the existing
// defaults unless a caller explicitly acquires a higher maintenance budget.
func (m *Manager) AcquireSealedLazyMmapBudget(minSegments int, minBytes int64) func() {
	if m == nil {
		return func() {}
	}
	if minSegments < 0 {
		minSegments = 0
	}
	if minBytes < 0 {
		minBytes = 0
	}
	m.mu.Lock()
	m.sealedLazyMmapBudgetDepth++
	if minSegments > m.sealedLazyMmapBudgetSegments {
		m.sealedLazyMmapBudgetSegments = minSegments
	}
	if minBytes > m.sealedLazyMmapBudgetBytes {
		m.sealedLazyMmapBudgetBytes = minBytes
	}
	m.mu.Unlock()
	return func() {
		m.mu.Lock()
		if m.sealedLazyMmapBudgetDepth > 0 {
			m.sealedLazyMmapBudgetDepth--
			if m.sealedLazyMmapBudgetDepth == 0 {
				m.sealedLazyMmapBudgetSegments = 0
				m.sealedLazyMmapBudgetBytes = 0
			}
		}
		m.mu.Unlock()
	}
}
```
AcquireSealedLazyMmapBudget uses a single depth counter and keeps the maximum requested segments/bytes until all acquisitions are released. With nested acquisitions, releasing the inner budget will not restore the previous (smaller) budget while the outer acquisition is still active, even though the comment says the override lasts "until the returned release func is called".
If nested acquisition is possible, consider tracking prior values per-acquisition in the returned closure (stack/restore pattern), so each release restores the manager budget to the max of remaining active budgets rather than only resetting when depth reaches 0.
```go
if canceledErr == nil && restrictSource && haveCachedReferencedBefore {
	referencedAfter = rewriteCleanupDerivedReferencedSegments(cachedReferencedBefore, sourceIDs, newValueIDs)
} else {
	runRewriteCleanupReferencedScanHook()
	referencedAfter, err = db.referencedValueLogSegments(context.Background())
	if err != nil {
		return stats, err
	}
```
rewriteCleanupDerivedReferencedSegments derives referencedAfter by cloning the cached referenced set and unconditionally deleting all sourceIDs. This is only safe if the online rewrite guarantees it rewrote all references to those segments. Since ValueLogRewriteOnline only scans the user tree (and only rewrites leaf-ref pages separately), segments in sourceIDs could still be referenced elsewhere (e.g., the system tree or other reachability sources), and would then be incorrectly omitted from referencedAfter, potentially causing them to be marked zombie and deleted.
Consider falling back to referencedValueLogSegments (full scan) for this optimization, or augment the derivation with a targeted verification pass that checks for remaining references to sourceIDs outside the rewritten keyspace before deleting them from the referenced set.
```diff
-if canceledErr == nil && restrictSource && haveCachedReferencedBefore {
-	referencedAfter = rewriteCleanupDerivedReferencedSegments(cachedReferencedBefore, sourceIDs, newValueIDs)
-} else {
-	runRewriteCleanupReferencedScanHook()
-	referencedAfter, err = db.referencedValueLogSegments(context.Background())
-	if err != nil {
-		return stats, err
-	}
+runRewriteCleanupReferencedScanHook()
+referencedAfter, err = db.referencedValueLogSegments(context.Background())
+if err != nil {
+	return stats, err
```
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8236397431
```go
lane, _ := valuelog.DecodeFileID(segment.fileID)
current, ok := newestByLane[lane]
if !ok || segment.seq > current.seq {
	newestByLane[lane] = segment
```
Skip empty newest segments when deriving rewrite RID start
If a lane's highest-seq segment exists but has not received any records yet, this optimization picks that empty file and ignores the previous sealed segment that still holds the lane's highest RID. That state is reachable in normal cached mode because rotateValueLogWriterMuHeld creates and registers/promotes a new segment before the first append (TreeDB/caching/db.go:16282-16301). In a direct DB.ValueLogRewriteOnline call where ReserveRIDs is nil, nextRewriteRIDStart can therefore return a too-low start RID and the rewrite will emit duplicate RIDs; reopen/recovery later fails on the duplicate check in TreeDB/db/wal_recovery.go:193-197.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b05a8b4564
```go
if db.cached != nil {
	if err := db.Checkpoint(); err != nil {
```
Keep the plan-only API from checkpointing cached DBs
DB.ValueLogRewritePlan is now called from the new vlog-maint-exit-loop reseed/probe path, but this wrapper unconditionally runs Checkpoint() whenever the DB is cached. In cached mode Checkpoint() is not a read-only preparation step: caching.DB.Checkpoint() can flush memtables and then call maybeKickVlogGenerationMaintenanceAfterCheckpoint, so a caller that only wants to inspect candidate segments can mutate the DB and even start rewrite/GC before the explicit maintenance pass runs. That makes the new offline/debug commands nondeterministic and also gives the new public planning API surprising side effects.
Pull request overview
Copilot reviewed 43 out of 43 changed files in this pull request and generated 1 comment.
	fatalf("cpuprofile: %v", err)
}
profFile = f
runtimepprof.StartCPUProfile(profFile)
runtime/pprof.StartCPUProfile returns an error, but this call ignores it. Please handle the error (and close the file on failure) so the command doesn’t silently run without profiling or leak the file descriptor when profiling can’t be started.
Suggested change:
- runtimepprof.StartCPUProfile(profFile)
+ if err := runtimepprof.StartCPUProfile(profFile); err != nil {
+ 	_ = profFile.Close()
+ 	fatalf("cpuprofile: %v", err)
+ }
💡 Codex Review
Reviewed commit: 64711a9b3b
	}
	recs[i] = valuelog.Record{RID: rid, Value: leafPages[start+i]}
}
ptrs, _, err := w.w.AppendFrameWithStatsInto(0, nil, recs, dst)
Don't batch leaf pages into multi-record grouped frames
page.EncodeLeafRef only preserves FileID and Offset; page/leaf_ref.go reconstructs every leaf ref with grouped sub-index 0 under the explicit K=1 assumption. This new path writes up to 8 leaf pages in one AppendFrameWithStatsInto call, so internal/bulk/builder.go:194-203 can now encode nonzero-subindex pointers as leaf refs during offline rewrite. On any DB using IndexOuterLeavesInValueLog with more than one rewritten leaf page per batch, the 2nd+ child refs decode back to the first leaf in that frame after reopen, corrupting tree navigation and lookups.
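The aliasing failure mode can be illustrated with a toy model. This is a sketch, not the TreeDB encoder: `ptr`, `appendFrame`, and `appendLeafPagesK1` are hypothetical, and the point is only that a leaf ref which preserves `(fileID, offset)` but drops the grouped sub-index stays unambiguous only when each leaf page gets its own K=1 frame.

```go
package main

import "fmt"

// ptr models a value-log pointer; subIdx is what a leaf ref loses when it
// only preserves (fileID, offset).
type ptr struct {
	fileID uint32
	offset uint32
	subIdx int
}

// appendFrame models one grouped append: all n records share the frame's
// offset and differ only by sub-index.
func appendFrame(fileID, offset uint32, n int) []ptr {
	out := make([]ptr, n)
	for i := range out {
		out[i] = ptr{fileID: fileID, offset: offset, subIdx: i}
	}
	return out
}

// appendLeafPagesK1 issues one single-record frame per leaf page, so every
// resulting pointer is distinguishable by (fileID, offset) alone, which is
// all a leaf ref can carry.
func appendLeafPagesK1(fileID, start, frameBytes uint32, n int) []ptr {
	out := make([]ptr, 0, n)
	off := start
	for i := 0; i < n; i++ {
		out = append(out, appendFrame(fileID, off, 1)...)
		off += frameBytes
	}
	return out
}

func main() {
	grouped := appendFrame(7, 0, 3)
	// Dropping subIdx makes records 1 and 2 alias record 0.
	fmt.Println(grouped[1].offset == grouped[0].offset) // true: aliasing

	perPage := appendLeafPagesK1(7, 0, 4096, 3)
	fmt.Println(perPage[1].offset == perPage[0].offset) // false: distinct refs
}
```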
💡 Codex Review
Reviewed commit: 7f484cef31
		Value: leafPages[i],
	}
}
return l.db.appendValueLog(lane, 0, nil, records, journalDurabilityNone, true)
Do not group multiple outer-leaf pages into one frame
When IndexOuterLeavesInValueLog is enabled and a split/merge or bulk build persists 2+ outer leaf pages at once, this funnels them through a single appendValueLog call. That produces grouped pointers, but page.EncodeLeafRef/DecodeLeafRef in TreeDB/page/leaf_ref.go only preserve (FileID, Offset) and explicitly assume leaf pages are stored as K=1, so every pointer after the first collapses to the same leaf-ref ID as sub-index 0. In practice, zipper.mergeLeaf and the bulk builder can now write sibling/root child refs that already alias the wrong leaf in memory and always decode to the wrong page after reopen.
Pull request overview
Copilot reviewed 52 out of 52 changed files in this pull request and generated 1 comment.
var (
	leafPage []byte
	err      error
)
if toer, ok := c.leafReader.(unsafeToReader); ok {
	leafPage, _, err = toer.ReadUnsafeTo(ptr, c.leafScratch[:0])
	if err == nil && len(leafPage) > 0 && len(leafPage) <= cap(c.leafScratch) {
		c.leafScratch = leafPage[:0]
	}
In the leaf-ref rewrite path, leafScratch starts nil, so calling ReadUnsafeTo(ptr, c.leafScratch[:0]) won’t provide reusable capacity and the cap(c.leafScratch) check prevents retaining the returned buffer. Consider preallocating c.leafScratch to make([]byte, 0, page.PageSize) (or growing when cap < page.PageSize) before the first ReadUnsafeTo call so the scratch reuse optimization actually takes effect.
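The suggested preallocation can be sketched as below; `pageSize` stands in for `page.PageSize` and `growScratch` is a hypothetical helper, not the real code path:

```go
package main

import "fmt"

const pageSize = 4096 // illustrative stand-in for page.PageSize

// growScratch ensures a reusable scratch buffer has at least pageSize
// capacity before the first read, so a "retain returned buffer when
// len(buf) <= cap(scratch)" check can actually succeed instead of always
// failing against a nil scratch.
func growScratch(scratch []byte) []byte {
	if cap(scratch) < pageSize {
		return make([]byte, 0, pageSize)
	}
	return scratch[:0]
}

func main() {
	var leafScratch []byte // starts nil, as in the review comment
	leafScratch = growScratch(leafScratch)
	fmt.Println(cap(leafScratch) >= pageSize) // true: reuse can now kick in
}
```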
💡 Codex Review
gomap/TreeDB/db/vlog_rewrite.go, lines 1081 to 1087 (commit d020d3c)
When IndexOuterLeavesInValueLog is enabled, leaf refs can only encode offsets up to page.LeafRefMaxOffset, and the offline path later in this file already clamps rewrite segment size for that reason. The online path here still only applies the packed-pointer cap, so a caller that sets MaxSegmentBytes above ~512 MiB can fill a rewrite segment past the LeafRef offset limit and then fail in rewriteLeafRefsOnline when page.EncodeLeafRef sees an unrepresentable offset. This only shows up on DBs with outer leaves in the value log, but in that configuration it turns a valid online rewrite request into a mid-pass failure.
for _, id := range candidateFileIDs {
	f := files[id]
	if _, ok := active[id]; ok {
		continue
Don't discard explicit active source IDs
With SourceFileIDs plus stale filters, rewritePlanNeedsLiveEstimate now routes explicit IDs through this sparse-selection loop. This unconditional active check silently drops any requested segment that is still marked active, so ValueLogRewritePlan/ValueLogRewriteOnline can return an empty selection even when the caller explicitly named that file ID and it satisfies the stale-byte/ratio filter. That regression is specific to the new filtered-SourceFileIDs path; before this change, explicit IDs were honored as requested.
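A selection shape that honors explicit IDs could look like the sketch below. The names (`selectStale`, the map arguments) are illustrative, not the TreeDB selection code; the point is that the active-skip applies only to auto-discovered candidates:

```go
package main

import "fmt"

// selectStale filters candidate segment IDs against the active set, but
// keeps IDs the caller named explicitly (SourceFileIDs-style input); only
// auto-discovered candidates are dropped for being active.
func selectStale(candidates []uint32, active, explicit map[uint32]bool) []uint32 {
	out := make([]uint32, 0, len(candidates))
	for _, id := range candidates {
		if active[id] && !explicit[id] {
			continue // auto-discovered and still receiving writes: skip
		}
		out = append(out, id)
	}
	return out
}

func main() {
	active := map[uint32]bool{3: true}
	explicit := map[uint32]bool{3: true}
	// Segment 3 is active but explicitly requested, so it is kept.
	fmt.Println(selectStale([]uint32{1, 3}, active, explicit)) // [1 3]
}
```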
💡 Codex Review
Reviewed commit: 25da6d2fe7
lane := &l.db.lanes[l.laneID]
rid := l.db.nextRID.Add(1)
- ptr, retainPath, err := l.db.appendValueLogOneInternal(lane, 0, nil, rid, leafPage, journalDurabilityNone, false)
+ ptr, retainPath, err := l.db.appendValueLogOneInternal(lane, 0, nil, rid, leafPage, journalDurabilityNone, false, true)
Write single outer-leaf pages as grouped records
When IndexOuterLeavesInValueLog is enabled and compression falls back to the non-block path (for example ValueLogCompression=Off), this AppendLeafPage call still goes through appendValueLogOneInternal, which writes a plain record instead of a grouped K=1 frame. After this change page.DecodeLeafRef always reconstructs grouped pointers, so the empty/single-leaf builder paths in TreeDB/internal/bulk/builder.go:61 and :305 now persist leaf refs that reopen as corrupt because leaf-page reads try to parse a grouped frame that was never written.
if opts.IndexOuterLeavesInValueLog {
	leafRefMaxSegmentBytes := int64(page.LeafRefMaxOffset) - 4
	if valueLogMaxSegmentBytes == 0 || leafRefMaxSegmentBytes < valueLogMaxSegmentBytes {
		valueLogMaxSegmentBytes = leafRefMaxSegmentBytes
Clamp rewrite segment bytes to the new LeafRef offset limit
These new LeafRefMaxOffset clamps only protect normal cached writes. Online rewrite still uses opts.MaxSegmentBytes verbatim in TreeDB/db/vlog_rewrite.go:1077-1089, and cached generational maintenance feeds that field from db.valueLogGenerationWarmTarget at TreeDB/caching/db.go:13257-13261. In deployments with IndexOuterLeavesInValueLog and a warm/max rewrite segment target above the new ~512 MiB leaf-ref ceiling, a rewrite can now fail mid-run once a copied leaf page lands past the 29-bit offset limit, because page.EncodeLeafRef rejects the rewritten pointer.
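Mirroring that clamp on the online path could be sketched as follows. The constant and function names here are illustrative (`leafRefMaxOffset` stands in for `page.LeafRefMaxOffset`, assumed from the 29-bit limit the review describes), not the actual fix:

```go
package main

import "fmt"

// leafRefMaxOffset models the 29-bit leaf-ref offset ceiling (~512 MiB)
// described in the review; the real constant is page.LeafRefMaxOffset.
const leafRefMaxOffset = int64(1) << 29

// clampRewriteSegmentBytes applies the same ceiling the offline path uses
// to an online rewrite request: when outer leaves live in the value log,
// the segment target is capped so every copied leaf page stays encodable
// as a leaf ref instead of failing mid-pass.
func clampRewriteSegmentBytes(maxSegmentBytes int64, outerLeavesInVlog bool) int64 {
	if !outerLeavesInVlog {
		return maxSegmentBytes
	}
	limit := leafRefMaxOffset - 4
	if maxSegmentBytes == 0 || maxSegmentBytes > limit {
		return limit
	}
	return maxSegmentBytes
}

func main() {
	fmt.Println(clampRewriteSegmentBytes(1<<30, true) == (1<<29)-4) // true: capped
	fmt.Println(clampRewriteSegmentBytes(1<<20, true) == 1<<20)     // true: under limit
}
```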
Tracking the next sprint off a single authority issue: #844

This captures:
- the current leaf-frame fix status
- the remaining ordinary-value write-shape problem
- the roadmap
- the final validation matrix (, , post-run data-folder analysis, post-offline-rewrite analysis, and unified-bench gates).
Pull request overview
Copilot reviewed 58 out of 58 changed files in this pull request and generated 1 comment.
if z.outerLeavesInValueLog && z.leafPageBatchLog != nil {
	data := append([]byte(nil), target.Data()...)
	pendingLeafPersists = append(pendingLeafPersists, pendingLeafPersist{
		data:     data,
		root:     target == builder,
		splitIdx: pendingSplitIdx,
	})
	if target == builder {
		rootPersisted = true
	}
	if len(pendingLeafPersists) >= 8 {
		return 0, flushPendingLeafPersists()
	}
	return 0, nil
}
In mergeLeaf's batched outer-leaf persist path, the split leaf builder/page buffers are never released back to their pools. Unlike the non-batched path (which releases pooled builders/pages after persisting when target != builder), the batched branch copies target.Data() into pendingLeafPersists but then returns without releasing target when it's a pooled split leaf. This will accumulate unreleased pooled builders/outerLeafBuildPages across splits and can materially increase memory usage during large merges. After copying the page data, release pooled resources for non-root targets (and clear the split's pooled flags) similarly to the non-batched persist path.
💡 Codex Review
Reviewed commit: 45e774698f
if len(pendingLeafPersists) >= 8 {
	return 0, flushPendingLeafPersists()
}
return 0, nil
Release pooled split pages on batched leaf persistence
In the new batched outer-leaf path, persistTarget returns from this branch before reaching the cleanup block that releases target and targetOuterLeafData back to their pools. On split-heavy merges when outerLeavesInValueLog and leafPageBatchLog are enabled, each persisted non-root split leaf leaves behind pooled builder/scratch allocations, so memory use grows with split count and pooling effectiveness is lost.
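The copy-then-release discipline the comment asks for can be sketched with a toy pool; `bufPool` and `persistBatched` are illustrative stand-ins for the zipper's pooled builders, not the real types:

```go
package main

import "fmt"

// bufPool tracks outstanding pooled page buffers so the leak described in
// the review is observable.
type bufPool struct{ outstanding int }

func (p *bufPool) get() []byte  { p.outstanding++; return make([]byte, 0, 4096) }
func (p *bufPool) put(_ []byte) { p.outstanding-- }

// persistBatched queues an owned copy of the page for a deferred flush and
// immediately releases the pooled buffer for non-root targets, mirroring
// the cleanup the non-batched persist path already performs.
func persistBatched(p *bufPool, page []byte, isRoot bool, pending *[][]byte) {
	*pending = append(*pending, append([]byte(nil), page...)) // queued copy owns the bytes
	if !isRoot {
		p.put(page) // release now instead of accumulating across splits
	}
}

func main() {
	var p bufPool
	var pending [][]byte
	for i := 0; i < 4; i++ {
		persistBatched(&p, p.get(), false, &pending)
	}
	fmt.Println(p.outstanding, len(pending)) // 0 4: nothing leaked, all queued
}
```

Because the queued entry holds its own copy of the bytes, returning the pooled buffer right away is safe even though the flush happens later.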
if len(in) == 0 {
	return nil
Preserve empty referenced sets in ref-cache cloning
Returning nil for an empty referenced-ID set collapses a valid "known empty" scan result into the same sentinel used for "no cached value." Downstream cache lookups check for non-nil refs, so successful scans that find zero live segments become cache misses and trigger repeated full reachability rescans in rewrite/GC maintenance, increasing latency on fully-rewritten or near-empty datasets.
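A clone that preserves the nil/empty distinction can be sketched as below; `cloneRefs` is a hypothetical helper showing the shape, not the actual ref-cache code:

```go
package main

import "fmt"

// cloneRefs copies a referenced-segment-ID set while preserving the
// difference between "no cached scan result" (nil) and "scan found zero
// live segments" (empty, non-nil), so downstream non-nil checks still
// treat a known-empty result as a cache hit.
func cloneRefs(in map[uint32]struct{}) map[uint32]struct{} {
	if in == nil {
		return nil
	}
	out := make(map[uint32]struct{}, len(in)) // len 0 still yields non-nil
	for id := range in {
		out[id] = struct{}{}
	}
	return out
}

func main() {
	fmt.Println(cloneRefs(nil) == nil)                   // true: a miss stays a miss
	fmt.Println(cloneRefs(map[uint32]struct{}{}) != nil) // true: empty stays a hit
}
```

The fix hinges on checking `in == nil` rather than `len(in) == 0`, since both nil and empty maps have length zero in Go.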
Summary
This PR promotes the decomposed
`A + C` candidate path on top of #842 and explicitly leaves planner-refresh `B` out of the merge path.
- `A`: retained processed-source cleanup after online rewrite + backend explicit `SourceFileIDs` live-byte estimation
- `C`: checkpoint-coupled queued rewrite restage after one resumed chunk
- `B`: excluded for now; it improved debt slightly but regressed wall time materially in exact-height `fast`
Why this exists
The mixed local maintenance head had one real problem: it bundled multiple ideas, so debt wins and wall regressions were not attributable.
This branch is the cleaned candidate path after decomposition:
- `A` is a keep now
- `C` is a keep now
- `B` is preserved for redesign, not merge
What changed
Layer A
- `SourceFileIDs` rewrite planning estimates live/stale bytes correctly
Layer C
- checkpoint-kick resume processes one queued segment, then restages the remaining debt for later confirmation
Deterministic evidence
Real-backend test proof
`TestVlogGenerationRewriteQueue_CheckpointKickRestagesRemainingRealBackendDebt` now proves on real backend source IDs that:
Checkpoint-kick latency harness
Added benchmark:
`BenchmarkVlogGenerationRewriteQueueDrainCheckpointKick`
One-shot comparison:
A + C
`initial_queue=4`, `max_kick_ms=68.14`, `rewrite_calls=1`, `remaining_queue=3`, `stage_pending=1`, `total_window_ms=68.22`
A only
`initial_queue=4`, `max_kick_ms=84.29`, `rewrite_calls=4`, `remaining_queue=0`, `stage_pending=0`, `total_window_ms=302.1`
Interpretation:
`C` is not speculative; it bounds a single checkpoint-kick window to one rewrite and restages the remaining tail instead of draining four rewrites through the same hot window.
Local validation
- `GOWORK=off go test ./TreeDB/db -count=1 -timeout=10m`
- `GOWORK=off go test ./TreeDB/caching -count=1`
- `go vet ./...`
`run_celestia` evidence
Layer A exact-height `fast`
- `/home/mikers/.celestia-app-mainnet-treedb-20260320062122`
- `duration_seconds=294`, `max_rss_kb=11078624`, `end_app_bytes=5375716247`, `vlog-rewrite:2564040376`
A + C exact-height `fast`
- `/home/mikers/.celestia-app-mainnet-treedb-20260320080727`
- `duration_seconds=315`, `max_rss_kb=11627000`, `end_app_bytes=5108730308`, `vlog-gc:4796275739`, `vlog-rewrite:2480106610`, `du -sb application.db`: `2519003499`
Important caveat:
- `run_celestia` runs on `A + C` did not actually hit online rewrite/restage in `node.log`
- `C` is now strongly proven by deterministic backend-backed tests and benchmark evidence, but still needs runtime attribution in a sample that actually exercises queued resume activity
Status
This PR is the promoted candidate path.
Further work should iterate here, not on the old mixed forensic branch.
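As context for the promoted `C` behavior, the single-kick restage policy can be sketched as below. This is an illustrative model, not the TreeDB implementation: the queue of segment IDs and `kickOnce` are hypothetical, chosen only to mirror the benchmark numbers above (`initial_queue=4` producing `rewrite_calls=1`, `remaining_queue=3`).

```go
package main

import "fmt"

// kickOnce models the layer-C policy: a checkpoint kick rewrites exactly
// one queued segment and restages the remaining debt for a later
// confirmation pass, instead of draining the whole queue through one hot
// checkpoint window.
func kickOnce(queue []uint32) (process, restaged []uint32) {
	if len(queue) == 0 {
		return nil, nil
	}
	return queue[:1], queue[1:]
}

func main() {
	queue := []uint32{10, 11, 12, 13} // initial_queue=4
	process, restaged := kickOnce(queue)
	fmt.Println(len(process), len(restaged)) // 1 3: one rewrite, three restaged
}
```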