
decomp: isolate layer-a plus layer-c runtime rewrite maintenance #843

Open
snissn wants to merge 55 commits into pr/vlog-maint-v2-runtime-debt from pr/vlog-maint-a-plus-c

Conversation


@snissn snissn commented Mar 20, 2026

Summary

This PR promotes the decomposed A + C candidate path on top of #842 and explicitly leaves planner-refresh B out of the merge path.

  • A: retained processed-source cleanup after online rewrite + backend explicit SourceFileIDs live-byte estimation
  • C: checkpoint-coupled queued rewrite restage after one resumed chunk
  • B: excluded for now; it improved debt slightly but regressed wall time materially in exact-height fast

Why this exists

The mixed local maintenance head had one real problem: it bundled multiple ideas, so debt wins and wall-time regressions could not be attributed to individual changes.

This branch is the cleaned candidate path after decomposition:

  • A is a keep now
  • C is a keep now
  • B is preserved for redesign, not merge

What changed

Layer A

  • retained processed rewrite sources are cleaned up after online rewrite + GC
  • backend explicit SourceFileIDs rewrite planning estimates live/stale bytes correctly

Layer C

  • checkpoint-kick rewrite resume processes one queued segment
  • remaining queued debt is restaged instead of draining through the same hot window
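The layer-C shape above can be sketched as a small function: a checkpoint-coupled resume consumes exactly one queued segment and defers the rest as staged debt rather than draining the whole queue in one hot window. All names here are illustrative, not the repo's actual API.

```go
package main

import "fmt"

// restageAfterOneChunk consumes the head of the queued rewrite debt and
// restages the tail for a later confirmation pass (hypothetical helper,
// mirroring the behavior described in the PR summary).
func restageAfterOneChunk(queue []uint32) (consumed uint32, restaged []uint32, stagePending bool) {
	if len(queue) == 0 {
		return 0, nil, false
	}
	consumed = queue[0]
	restaged = append([]uint32(nil), queue[1:]...)
	return consumed, restaged, len(restaged) > 0
}

func main() {
	consumed, rest, pending := restageAfterOneChunk([]uint32{11, 22, 33, 44})
	fmt.Println(consumed, rest, pending) // 11 [22 33 44] true
}
```

This is the behavior the benchmark below measures: one rewrite call per checkpoint-kick window, with stage_pending set for the remaining tail.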

Deterministic evidence

Real-backend test proof

TestVlogGenerationRewriteQueue_CheckpointKickRestagesRemainingRealBackendDebt
now proves on real backend source IDs that:

  • one checkpoint-coupled queued rewrite runs
  • remaining queue is restaged
  • the consumed source was retained at rewrite start
  • after rewrite it is no longer retained and is dead in the backend

Checkpoint-kick latency harness

Added benchmark:

  • BenchmarkVlogGenerationRewriteQueueDrainCheckpointKick

One-shot comparison:

A + C

  • initial_queue=4
  • max_kick_ms=68.14
  • rewrite_calls=1
  • remaining_queue=3
  • stage_pending=1
  • total_window_ms=68.22

A only

  • initial_queue=4
  • max_kick_ms=84.29
  • rewrite_calls=4
  • remaining_queue=0
  • stage_pending=0
  • total_window_ms=302.1

Interpretation:

  • C is not speculative; it bounds a single checkpoint-kick window to one rewrite and restages the remaining tail instead of draining four rewrites through the same hot window.

Local validation

  • GOWORK=off go test ./TreeDB/db -count=1 -timeout=10m
  • GOWORK=off go test ./TreeDB/caching -count=1
  • go vet ./...

run_celestia evidence

Layer A exact-height fast

  • home: /home/mikers/.celestia-app-mainnet-treedb-20260320062122
  • duration_seconds=294
  • max_rss_kb=11078624
  • end_app_bytes=5375716247
  • post vlog-rewrite: 2564040376

A + C exact-height fast

  • home: /home/mikers/.celestia-app-mainnet-treedb-20260320080727
  • duration_seconds=315
  • max_rss_kb=11627000
  • end_app_bytes=5108730308
  • post vlog-gc: 4796275739
  • post vlog-rewrite: 2480106610
  • final du -sb application.db: 2519003499

Important caveat:

  • the sampled run_celestia runs on A + C did not actually hit online rewrite/restage in node.log
  • so C is now strongly proven by deterministic backend-backed tests and benchmark evidence, but still needs runtime attribution in a sample that actually exercises queued resume activity

Status

This PR is the promoted candidate path.
Further work should iterate here, not on the old mixed forensic branch.


Copilot AI left a comment


Pull request overview

This PR promotes the decomposed “A + C” runtime value-log rewrite maintenance path: it improves correctness of rewrite planning for explicit SourceFileIDs by estimating live bytes when stale filters are applied (layer A), and changes checkpoint-kick rewrite resume behavior to process a single queued segment then restage remaining debt for later confirmation (layer C). It also adds tests/bench coverage around the new behavior and introduces additional plan telemetry for “age-blocked” segments.

Changes:

  • Add age-blocked selection stats to ValueLogRewritePlan, and estimate live bytes for explicit SourceFileIDs when stale filters require it.
  • Persist and schedule “rewrite stage pending” state so checkpoint-kick resume rewrites process one segment and restage remaining debt for later confirmation.
  • Add/extend tests and a benchmark to validate restaging and retained-source cleanup behavior, including a real-backend checkpoint-kick test.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
TreeDB/db/vlog_rewrite.go Adds age-blocked plan fields, selection stats plumbing, and conditional live-byte estimation for SourceFileIDs with stale filters.
TreeDB/db/vlog_rewrite_test.go Updates selection tests for new return signature/stats; adds tests covering SourceFileIDs + stale filters + live estimation.
TreeDB/caching/db.go Adds retained-source cleanup after rewrite+GC and integrates restaging behavior into checkpoint-kick maintenance execution.
TreeDB/caching/vlog_generation_state.go Persists stage-pending/observed timestamps; adds restage helper and ledger stabilization helpers.
TreeDB/caching/vlog_generation_state_test.go Updates state loader tests for new stage fields.
TreeDB/caching/vlog_retained_manager_drift_test.go Adds a test ensuring processed retained rewrite sources are cleaned up when backend references are gone.
TreeDB/caching/vlog_generation_scheduler_test.go Adds a real-backend test for checkpoint-kick restaging and a benchmark harness for drain window behavior.


Comment on lines +2301 to +2306
for i := 0; i < b.N; i++ {
	db, recorder, initialQueue := seedRealBackendRewriteQueueScenario(b, 25*time.Millisecond)
	b.StopTimer()
	var (
		totalStart = time.Now()
		maxKick    time.Duration

Copilot AI Mar 20, 2026


In BenchmarkVlogGenerationRewriteQueueDrainCheckpointKick, the expensive setup (seedRealBackendRewriteQueueScenario) currently runs while the benchmark timer is active, so the benchmark and the reported metrics include DB creation/workload seeding time. Also, b.ReportMetric is called inside the per-iteration loop; only the last call per unit is retained in the final BenchmarkResult, so earlier iterations are effectively discarded. Consider stopping the timer before seeding, starting it immediately before the drain loop (set totalStart after StartTimer), and reporting metrics once per benchmark run (or restructure so b.N controls the drain work rather than repeating full setup under the timer).

Comment on lines +491 to +521
func mergeConfirmedVlogGenerationRewriteLedgerSegments(prev, planned []backenddb.ValueLogRewritePlanSegment) []backenddb.ValueLogRewritePlanSegment {
	if len(prev) == 0 || len(planned) == 0 {
		return nil
	}
	stable := stableVlogGenerationRewriteLedgerSegments(prev, planned)
	if len(stable) == 0 {
		return nil
	}
	if len(stable) == len(planned) {
		return stable
	}
	seen := make(map[uint32]struct{}, len(stable))
	out := make([]backenddb.ValueLogRewritePlanSegment, 0, len(planned))
	for _, seg := range stable {
		if seg.FileID == 0 || seg.BytesLive <= 0 {
			continue
		}
		seen[seg.FileID] = struct{}{}
		out = append(out, seg)
	}
	for _, seg := range planned {
		if seg.FileID == 0 || seg.BytesLive <= 0 {
			continue
		}
		if _, ok := seen[seg.FileID]; ok {
			continue
		}
		out = append(out, seg)
	}
	return out
}

Copilot AI Mar 20, 2026


mergeConfirmedVlogGenerationRewriteLedgerSegments is introduced here but does not appear to be referenced anywhere in the repository. If it’s not part of the intended behavior in this PR, consider removing it to avoid carrying unused code (or add the call site if it is required).

Suggested change: delete mergeConfirmedVlogGenerationRewriteLedgerSegments (the full function shown above).


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

if maxSourceSegments > 0 && len(selected) >= maxSourceSegments {
	break
}
if maxSourceBytes > 0 {
	next := selectedBytes + candidate.liveBytes

P2 Badge Ignore max-source caps when SourceFileIDs are explicit

When SourceFileIDs is combined with stale filters, rewritePlanNeedsLiveEstimate() now routes the explicit-ID case through selectRewriteSourceSegments, and this block starts enforcing MaxSourceSegments/MaxSourceBytes even though ValueLogRewriteOnlineOptions says those caps apply only when SourceFileIDs is empty (TreeDB/db/vlog_rewrite.go:102-106). Any caller that narrows a preconfigured sparse-plan options struct to a fixed set of file IDs will now silently rewrite only a subset of the requested IDs, leaving the rest of that explicit rewrite set untouched.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +603 to +605
if len(db.vlogGenerationRewriteLedger) == 0 || len(db.vlogGenerationRewriteQueue) == 0 {
	return len(db.vlogGenerationRewriteQueue), false, nil
}

P2 Badge Restage queue-only debt even when no ledger was persisted

This early return means the new restage path only works for queues backed by SelectedSegments. The scheduler still creates queue-only debt via setVlogGenerationRewriteQueue(rewritePlan.SourceFileIDs) in TreeDB/caching/db.go:12943-12945 whenever a planner returns SourceFileIDs without SelectedSegments—which is still allowed by ValueLogRewritePlan (SelectedSegments is documented as present only "when" live estimates were produced in TreeDB/db/vlog_rewrite.go:56-60). In that case a checkpoint-coupled resume never sets RewriteStagePending, so the remaining tail is not deferred and this commit’s hot-window bound disappears for those planners or reopened queue-only states.



snissn commented Mar 20, 2026

Added follow-on commit 8858d340 on top of #843:

  • caching: narrow queued rewrite refresh to cool paths

What changed:

  • salvages the useful part of prior B without reintroducing hot-path planner scans
  • queued-ledger refresh is now best-effort and narrow:
    • only multi-segment queued debt
    • only non-bypass / cool paths
    • not checkpoint-kick resume
    • not singleton resume
  • checkpoint-kick resume keeps the A + C latency-shaping behavior unchanged
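The narrowed gating above reduces to a small predicate: refresh queued-ledger economics only for multi-segment debt on cool, non-bypass paths, and never on checkpoint-kick or singleton resume. The signature below is a hypothetical sketch, not the repo's shouldRefreshQueuedVlogGenerationRewriteLedger.

```go
package main

import "fmt"

// shouldRefreshQueuedLedger models the narrow refresh gate described in
// commit 8858d340: best-effort, and only where a planner scan is cheap
// relative to the work it saves (names and shape are illustrative).
func shouldRefreshQueuedLedger(queueLen int, bypassQuiet, checkpointKick bool) bool {
	return queueLen > 1 && !bypassQuiet && !checkpointKick
}

func main() {
	fmt.Println(shouldRefreshQueuedLedger(4, false, false)) // cool multi-segment path: refresh
	fmt.Println(shouldRefreshQueuedLedger(4, false, true))  // checkpoint-kick resume: skip
	fmt.Println(shouldRefreshQueuedLedger(1, false, false)) // singleton resume: skip
}
```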

New focused validation:

  • TestVlogGenerationRewriteQueue_RefreshesCurrentEconomicsBeforePeriodicResume
  • TestVlogGenerationRewriteQueue_RefreshClearsPeriodicQueueWhenNoSegmentsRemain
  • TestVlogGenerationRewriteQueue_CheckpointKickResumeDoesNotRefreshQueue
  • existing TestVlogGenerationRewriteQueue_ResumesWithoutReplanning still passes

Local gate:

  • GOWORK=off go test ./TreeDB/caching -count=1
  • GOWORK=off go vet ./...

Checkpoint-kick benchmark after this change:

  • initial_queue=4
  • rewrite_calls=1
  • max_kick_ms=62.95
  • remaining_queue=3
  • stage_pending=1
  • total_window_ms=63.05

So the narrow refresh did not reopen the hot checkpoint-kick drain behavior.

Sanity fast run from this head:

  • home: /home/mikers/.celestia-app-mainnet-treedb-20260320084514
  • duration_seconds=292
  • max_rss_kb=10329548
  • end_app_bytes=4970583650
  • du -sb application.db=4971729868

Caveat:

  • this run again did not hit online rewrite/refresh in node.log, so it is only a sanity sample, not a runtime attribution run for the new narrow refresh path.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

gomap/TreeDB/caching/db.go

Lines 12773 to 12776 in 8858d34

confirmed := stableVlogGenerationRewriteLedgerSegments(stagedLedger, plan.SelectedSegments)
if len(confirmed) > 0 {
	plan = filterVlogGenerationRewritePlanToSegments(plan, confirmed)
	shouldRewrite = true

P1 Badge Persist the reconfirmed staged ledger before resume

If a staged confirmation pass narrows the debt set (for example, the planner now returns only file 22 out of a staged [11,22]), this only updates the local plan. The execution path later still resumes from the persisted rewriteQueue/vlogGenerationRewriteLedger, so vlogGenerationRewriteLedgerChunk can pick and rewrite a segment that the confirmation planner just dropped. In that case the new two-phase “observe, then confirm” flow does not actually constrain which staged segments are executed.


gomap/TreeDB/caching/db.go

Lines 12778 to 12780 in 8858d34

} else {
	reason = vlogGenerationReasonStaleRatio
}

P1 Badge Clear RewriteStagePending when confirmation no longer matches

When confirmed is empty here, the code only changes reason and leaves RewriteStagePending plus the old observedAt intact. After that, generic periodic/checkpoint passes keep returning early because staged work is still pending, while scheduleDueVlogGenerationDeferredMaintenance immediately requeues another stage-confirm wake because that timestamp is already overdue. A segment that stops qualifying during the confirmation delay can therefore trap the scheduler in a replan loop and block later GC/rewrite work.


Comment on lines +12577 to +12579
if hasPlanner && db.shouldRefreshQueuedVlogGenerationRewriteLedger(opts, stagePending, rewriteQueue, rewriteLedger) {
	queueMinStaleRatio := db.vlogGenerationRewriteMinStaleRatioForQueuedDebt(totalBytes, vlogGenerationReasonRewriteResume)
	refreshedPlan, refreshErr := db.refreshQueuedVlogGenerationRewritePlan(planner, rewriteQueue, queueMinStaleRatio, opts)

P2 Badge Gate queued-ledger refresh behind the quiet-window check

This refresh planner runs before the later !quiet && !opts.bypassQuiet guard in maybeRunVlogGenerationMaintenanceWithOptions. In WAL-off/relaxed mode, a periodic pass with queued rewrite debt can now call ValueLogRewritePlan during hot ingest, and after this commit that planner may rescan live bytes even for explicit SourceFileIDs. That defeats the quiet-window contract and reintroduces expensive tree scans on the foreground path even when the pass would have skipped rewrite work entirely.


Copilot AI review requested due to automatic review settings March 20, 2026 20:00

snissn commented Mar 20, 2026

Pushed 52041e41 to pr/vlog-maint-a-plus-c.

This adds a frozen-copy maintenance replay path and folds in two runtime fixes validated against a real copied Celestia application.db:

  • new offline tooling:
    • treemap vlog-maint-once
    • debug wrappers/state helpers for cached value-log generation maintenance
    • -disable-auto-deferred, -rewrite-plan-timeout, and -wait-idle
  • stage-confirm semantics hardened:
    • staged debt is no longer cleared before a resumed rewrite actually executes
    • stage-confirm canceled refresh no longer falls through to an extra sparse stale-ratio replan
  • executor cadence hardened:
    • remaining queued debt is restaged at the end of the maintenance pass, so the confirm timer starts after rewrite+GC tail instead of in the middle of the pass
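The cadence hardening above amounts to moving the stage-confirm timestamp to the end of the pass: the confirm delay then measures time after the rewrite+GC tail, not time since mid-pass. A minimal sketch, with hypothetical names:

```go
package main

import (
	"fmt"
	"time"
)

// finishMaintenancePass records the restage point after the rewrite+GC
// tail has completed, so the confirm timer starts at pass end
// (illustrative shape; the real code persists these fields in the
// value-log generation state).
func finishMaintenancePass(remaining []uint32) (stagePending bool, observedAtNS int64) {
	// ...one queued chunk rewritten, GC run, tail computed above...
	if len(remaining) > 0 {
		return true, time.Now().UnixNano()
	}
	return false, 0
}

func main() {
	pending, observed := finishMaintenancePass([]uint32{22, 33, 44})
	fmt.Println(pending, observed > 0) // true true
}
```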

Local gates on this head:

  • GOWORK=off go test ./TreeDB/caching ./TreeDB/cmd/treemap -count=1
  • GOWORK=off go test ./TreeDB -run TestDebugValueLogGenerationStateRoundTrip -count=1
  • GOWORK=off go vet ./...

Most important frozen-copy result on a real staged Celestia DB copy:

  • source state: 5 queued staged source IDs
  • command used:
    • TREEDB_DEBUG_VLOG_MAINT=1 TREEDB_DEBUG_VLOG_MAINT_BUDGET=400 go run ./TreeDB/cmd/treemap vlog-maint-once <copied application.db> -rw -json -mode stage-confirm -disable-auto-deferred -rewrite-plan-timeout=2m -wait-idle=3m
  • result after the pass settled:
    • queue shrank 5 -> 4
    • RewriteStagePending=true remained set
    • RewriteStageObservedAtNS advanced to the new restage point
    • gc.deleted_segments=1, gc.deleted_bytes=268435848
    • rewrite.runs=1
    • checkpoint_kick.pending=false
    • scheduler returned to idle

This is the first frozen-copy replay where staged confirm on a real copied Celestia DB consumed exactly one queued segment, GC deleted it, restaged the remaining tail, and returned to idle instead of chain-draining through repeated rewrite_stage_confirm_exit passes.


Copilot AI left a comment


Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.



// Checkpoint-coupled resumable work should consume one chunk, then
// re-confirm the remaining tail later instead of draining multiple
// queued segments back-to-back in the same hot ingest window.
if hadRewriteQueue && !opts.skipCheckpoint {

Copilot AI Mar 20, 2026


This restage trigger is currently if hadRewriteQueue && !opts.skipCheckpoint, which will also fire on periodic maintenance passes (default opts.skipCheckpoint=false) whenever there is remaining queued debt. That seems broader than the intent described in the nearby comment ("checkpoint-coupled") and the PR description (restage after checkpoint-kick). Consider tightening the condition (e.g., require opts.bypassQuiet / checkpoint-kick context) so periodic resume behavior doesn’t unexpectedly transition into staged-confirm draining.

Suggested change:
- if hadRewriteQueue && !opts.skipCheckpoint {
+ if hadRewriteQueue && !opts.skipCheckpoint && opts.bypassQuiet {

Comment on lines +2569 to +2570
db, recorder, initialQueue := seedRealBackendRewriteQueueScenario(b, 25*time.Millisecond)
b.StopTimer()

Copilot AI Mar 20, 2026


Benchmark setup is included in the timed portion and the helper uses tb.Cleanup, so each iteration schedules DB/backend closes only at benchmark end. This can keep many DBs open during the run (potential FD/memory exhaustion) and also skews the benchmark timing. Stop the timer before calling seedRealBackendRewriteQueueScenario and explicitly close/cleanup the returned DB/backend per iteration (e.g., return a cleanup func instead of using tb.Cleanup in the helper when running under *testing.B).

Suggested change:
- db, recorder, initialQueue := seedRealBackendRewriteQueueScenario(b, 25*time.Millisecond)
- b.StopTimer()
+ b.StopTimer()
+ db, recorder, initialQueue := seedRealBackendRewriteQueueScenario(b, 25*time.Millisecond)

case len(seedPlan.SelectedSegments) > 0:
	stageObservedAt := int64(0)
	if *seedStagePending {
		stageObservedAt = time.Now().Add(-*seedStageObservedAgo).UnixNano()

Copilot AI Mar 20, 2026


-seed-stage-observed-ago is described as a duration "in the past", but negative values will produce an observedAt timestamp in the future, delaying stage confirmation in a confusing way. Consider validating that seedStageObservedAgo >= 0 (or clamping to 0) before computing stageObservedAt.

Suggested change:
- stageObservedAt = time.Now().Add(-*seedStageObservedAgo).UnixNano()
+ observedAgo := *seedStageObservedAgo
+ if observedAgo < 0 {
+     observedAgo = 0
+ }
+ stageObservedAt = time.Now().Add(-observedAgo).UnixNano()


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 52041e4152


Comment on lines +3972 to +3975
if !db.valueLogRetained(path) {
	continue
}
if db.cleanupOrphanedRetainedValueLog(path) {

P1 Badge Respect backend snapshot pins before deleting processed segments

When a caller keeps a cached AcquireSnapshot()/iterator open across rewrite maintenance, the backend snapshot still expects the old value-log file to remain readable until its pinned ValueLogSet is released. This new path bypasses the backend manager's zombie/pin handling and deletes the file directly via cleanupOrphanedRetainedValueLog, using the cache's separate valueLogReader; as a result, reads from the older snapshot can start failing with missing value-log segments even though the snapshot is still live.



snissn commented Mar 20, 2026

Pushed 32578e98 to pr/vlog-maint-a-plus-c.

This narrows stage-confirm further: when there is already a persisted staged rewrite ledger, the due stage-confirm path now uses that ledger directly instead of issuing a hot-path explicit-SourceFileIDs replan.

Why this matters:

  • the frozen-copy replay from the prior commit showed the remaining cost driver was the first explicit-ID confirmation plan on a real copied Celestia application.db
  • that plan was taking about 36s cold before the actual rewrite/GC work even started
  • queued resume already sorts/prunes the persisted ledger by stale ratio, so we can preserve the economic ranking without paying for a fresh full-tree live-byte scan on the stage-confirm wake
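The key observation above is that the persisted ledger already carries enough data to preserve the economic ranking: sorting staged segments by stale ratio avoids the ~36s fresh live-byte scan. A sketch under assumed field names (the real segments live in backenddb.ValueLogRewritePlanSegment):

```go
package main

import (
	"fmt"
	"sort"
)

// stagedSegment is an illustrative stand-in for a persisted ledger entry.
type stagedSegment struct {
	FileID     uint32
	BytesLive  int64
	BytesTotal int64
}

// rankPersistedLedger orders already-staged segments by stale ratio,
// stalest first, without any fresh full-tree live-byte scan.
func rankPersistedLedger(ledger []stagedSegment) []stagedSegment {
	out := append([]stagedSegment(nil), ledger...)
	stale := func(s stagedSegment) float64 {
		if s.BytesTotal == 0 {
			return 0
		}
		return float64(s.BytesTotal-s.BytesLive) / float64(s.BytesTotal)
	}
	sort.Slice(out, func(i, j int) bool { return stale(out[i]) > stale(out[j]) })
	return out
}

func main() {
	ledger := []stagedSegment{
		{FileID: 11, BytesLive: 900, BytesTotal: 1000}, // 10% stale
		{FileID: 22, BytesLive: 100, BytesTotal: 1000}, // 90% stale
	}
	for _, s := range rankPersistedLedger(ledger) {
		fmt.Println(s.FileID) // 22 then 11
	}
}
```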

Local gates:

  • GOWORK=off go test ./TreeDB/caching ./TreeDB/cmd/treemap -count=1
  • GOWORK=off go test ./TreeDB -run TestDebugValueLogGenerationStateRoundTrip -count=1
  • GOWORK=off go vet ./...

Frozen-copy delta on the same staged Celestia DB copy:

  • prior head (52041e41) with -rewrite-plan-timeout=2m -wait-idle=3m
    • first maintenance acquire to final idle: about 115s
    • included rewrite_queue_refresh ... dur_ms=35963
  • new head (32578e98) with -wait-idle=2m
    • first maintenance acquire to final idle: about 70.6s
    • no rewrite_queue_refresh
    • consumed one queued segment, GC deleted it, restaged remaining queue 5 -> 4, returned to idle with checkpoint_kick.pending=false

Most important trace line on the new head:

  • rewrite_stage_confirm_use_persisted queued=5 ledger=5

So this change preserves the good staged-tail behavior from C, but removes the remaining hot-path planner stall that the frozen-copy tool exposed.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 32578e987d


Comment on lines +12659 to +12660
dropped := len(rewriteQueue)
if err := db.setVlogGenerationRewriteLedger(nil); err != nil {

P2 Badge Preserve queue-only rewrite plans during refresh

This prune path treats any refresh result with SelectedSegments == 0 as “nothing remains”, but ValueLogRewritePlan still allows SourceFileIDs without SelectedSegments (TreeDB/db/vlog_rewrite.go:54-60). If a periodic/reopen refresh gets that queue-only shape from the planner, the code clears the persisted resume queue here even though the planner is still returning actionable source IDs, so the remaining rewrite debt is silently lost until some later full replan recreates it.


Comment on lines +12607 to +12610
var rewriteLedger []backenddb.ValueLogRewritePlanSegment
if len(rewriteQueue) > 1 {
	rewriteLedger, _ = db.currentVlogGenerationRewriteLedger()
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Load the staged ledger for one-segment confirmations

The stale-ratio path can now persist a stage-pending ledger with exactly one queued file, but this code only loads the persisted ledger when len(rewriteQueue) > 1. On the next due rewrite_stage_confirm pass, stageConfirmUsePersistedLedger therefore stays false for one-entry stages, so any transient ValueLogRewritePlan timeout/error leaves hasExecutableRewriteQueue false and the pass does nothing even though the exact segment to rewrite is already on disk. Multi-segment confirms avoid that planner dependency; single-segment confirms do not.


Copilot AI review requested due to automatic review settings March 20, 2026 20:50

snissn commented Mar 20, 2026

Pushed d6d0a505 (caching: prune missing queued vlog rewrite segments).

What changed:

  • Added cheap backend segment-presence capability (ValueLogHasSegment) and use it to prune persisted queued rewrite ledger entries that no longer exist before rewrite_resume spends a maintenance pass.
  • Added scheduler coverage for the due stage-confirm path when one persisted source ID is missing.
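The pruning step above can be sketched as a filter over the persisted queue, with the backend's ValueLogHasSegment capability stood in by a hypothetical hasSegment callback:

```go
package main

import "fmt"

// pruneMissingQueued drops queued source IDs whose backing segment no
// longer exists, so a maintenance pass is not spent on stale debt
// (illustrative shape of the d6d0a505 change).
func pruneMissingQueued(queue []uint32, hasSegment func(uint32) bool) (remaining []uint32, dropped int) {
	for _, id := range queue {
		if hasSegment(id) {
			remaining = append(remaining, id)
		} else {
			dropped++
		}
	}
	return remaining, dropped
}

func main() {
	// Mirrors the frozen-copy replay: one of the two staged IDs is gone.
	present := map[uint32]bool{2147483669: true}
	remaining, dropped := pruneMissingQueued(
		[]uint32{2147483668, 2147483669},
		func(id uint32) bool { return present[id] },
	)
	fmt.Println(remaining, dropped) // [2147483669] 1
}
```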

Local gate:

  • GOWORK=off go test ./TreeDB/caching -run 'TestVlogGenerationRewriteQueue_StageConfirmPrunesMissingPersistedSegments|TestVlogGenerationRewriteQueue_StageConfirmRefreshesStagedLedgerEconomics|TestVlogGenerationRewriteQueue_StageConfirmCanceledRefreshDoesNotFallbackToSparsePlan' -count=1
  • GOWORK=off go test ./TreeDB/db ./TreeDB/caching ./TreeDB/cmd/treemap -count=1
  • GOWORK=off go vet ./...

Frozen-copy replay result on the completed wal_on_fast DB (/home/mikers/.celestia-app-mainnet-treedb-20260320101754/data/application.db):

  • persisted state before replay had 2 staged queued source IDs: 2147483668, 2147483669
  • exact replay command on a reflinked copy: TREEDB_DEBUG_VLOG_MAINT=1 TREEDB_DEBUG_VLOG_MAINT_BUDGET=400 GOWORK=off go run ./TreeDB/cmd/treemap vlog-maint-once <copy>/application.db -rw -json -mode stage-confirm -disable-auto-deferred -wait-idle=2m
  • new behavior:
    • rewrite_queue_missing_prune dropped=2 remaining=0 stage_pending=false
    • no rewrite ran (rewrite.runs=0)
    • queue/state cleared instead of burning a no-op stage-confirm pass
  • the follow-on checkpoint pass scanned and found only stale_bytes=4066800 across the whole DB, which is well below the online rewrite threshold.

This explains the earlier WAL frozen replay failure mode: the persisted queued debt on reopen could already be stale/deleted, and we were spending a maintenance opportunity on nonexistent source IDs. This patch makes that case cheap.


Copilot AI left a comment


Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.



Comment on lines +3930 to +3942
func (db *DB) cleanupProcessedRetainedRewriteSources(reason uint32, processedRewriteIDs []uint32) {
	if db == nil || len(processedRewriteIDs) == 0 || !db.valueLogEnabled() {
		return
	}
	liveIDs, err := db.collectValueLogLiveIDsUntil(db.lastForegroundWriteUnixNano.Load())
	if err != nil {
		db.debugVlogMaintf(
			"rewrite_retained_cleanup_scan_err reason=%s source_ids=%d err=%v",
			vlogGenerationReasonString(reason),
			len(processedRewriteIDs),
			err,
		)
		return

Copilot AI Mar 20, 2026


cleanupProcessedRetainedRewriteSources() calls collectValueLogLiveIDsUntil(), which performs a full pointer-projection scan of the tree (and leaf-ref scan) to rebuild live segment IDs. This runs after every online rewrite pass (and after GC), so it can add an O(size-of-DB) scan to a maintenance step that’s otherwise intended to be bounded. Consider avoiding the full scan here (e.g., rely on rewrite outcomes + backend segment presence/zombie state, or defer retained cleanup to existing periodic retained pruning where the scan is already expected), or at least gate the scan behind a debug/env toggle or a cheap heuristic (only when the retained set actually contains the processed IDs).

Comment on lines +2645 to +2646
db, recorder, initialQueue := seedRealBackendRewriteQueueScenario(b, 25*time.Millisecond)
b.StopTimer()

Copilot AI Mar 20, 2026


BenchmarkVlogGenerationRewriteQueueDrainCheckpointKick includes seedRealBackendRewriteQueueScenario() in the timed portion (it’s called before b.StopTimer()). This makes the benchmark’s ns/op dominated by setup (DB creation + synthetic workload) rather than the checkpoint-kick drain behavior the metrics are trying to capture. Move the seeding/setup under b.StopTimer()/b.StartTimer() (or use b.ResetTimer() after setup) so the benchmark timing reflects only the drain loop.

Suggested change:
-db, recorder, initialQueue := seedRealBackendRewriteQueueScenario(b, 25*time.Millisecond)
-b.StopTimer()
+b.StopTimer()
+db, recorder, initialQueue := seedRealBackendRewriteQueueScenario(b, 25*time.Millisecond)

Comment on lines +85 to +96
beforeState, err := db.DebugValueLogGenerationState()
if err != nil {
fatalf("debug state before run: %v", err)
}
beforeStats := filterVlogGenerationStats(db.Stats())

report := valueLogMaintOnceReport{
Dir: dir,
Mode: *mode,
BeforeState: beforeState,
BeforeStats: beforeStats,
}

Copilot AI Mar 20, 2026


runVlogMaintOnce records report.BeforeState/BeforeStats before applying -clear-state or -seed-from-plan. When either flag is used, the printed/JSON "before_state" no longer reflects the actual state that the maintenance pass ran with, which can make offline forensics misleading. Consider capturing the initial state separately (e.g., InitialState) or recomputing BeforeState/BeforeStats after seeding/clearing and before DebugRunValueLogGenerationMaintenanceOnce().


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

gomap/TreeDB/caching/db.go

Lines 12339 to 12343 in d6d0a50

func (db *DB) scheduleDueVlogGenerationDeferredMaintenance() {
if db == nil || db.closing.Load() {
return
}
now := time.Now()

P2 Badge Respect deferred-disable env when rescheduling due wakes

TREEDB_DISABLE_VLOG_GENERATION_DEFERRED now disables the timer-based stage/age schedulers, but this exit path ignores the same flag and can immediately enqueue rewrite_stage_confirm_exit or rewrite_age_blocked_exit after any acquired pass. That means treemap vlog-maint-once -disable-auto-deferred can still run extra maintenance in the background on copied DBs with overdue staged/age-blocked debt, defeating the new "one pass" offline-analysis mode.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

}
db.debugVlogMaintf("gc_after_rewrite_done reason=%s dur_ms=%.3f", vlogGenerationReasonString(reason), float64(time.Since(gcStart).Microseconds())/1000)
}
db.cleanupProcessedRetainedRewriteSources(reason, processedRewriteIDs)

P2 Badge Avoid full live-ID scans after each queued rewrite

When processedRewriteIDs is non-empty, this new call synchronously runs cleanupProcessedRetainedRewriteSources, and that helper immediately calls collectValueLogLiveIDsUntil to rescan the entire tree for live value-log IDs. On large DBs with checkpoint-kick or resumed rewrite debt, every single-segment rewrite now pays an extra full pointer scan before the maintenance pass returns, so the new restage logic no longer keeps the hot-window latency bounded to one rewritten segment.



@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c869e0a08d


Comment on lines +12682 to +12684
if hasPlanner && (db.shouldRefreshQueuedVlogGenerationRewriteLedger(opts, stagePending, rewriteQueue, rewriteLedger) || stageConfirmRefresh) {
queueMinStaleRatio := db.vlogGenerationRewriteMinStaleRatioForQueuedDebt(totalBytes, vlogGenerationReasonRewriteResume)
refreshedPlan, refreshErr := db.refreshQueuedVlogGenerationRewritePlan(planner, rewriteQueue, queueMinStaleRatio, opts)

P1 Badge Replan staged stale-ratio confirms with the trigger ratio

When a stale-ratio pass stages exactly one segment, the later rewrite_stage_confirm refresh goes through this path and replans it with vlogGenerationRewriteMinStaleRatioForQueuedDebt(..., vlogGenerationReasonRewriteResume), which is 50% by default. That is stricter than the stale-ratio trigger that created the stage (default 20%), so any staged segment in the 20%-50% band is filtered out here even though it was intentionally staged. Because the stageConfirmRefresh empty-selection branch keeps the staged queue and skips the sparse-plan fallback, that debt can get stuck pending forever instead of ever being rewritten.

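The threshold mismatch described above reduces to a simple numeric check, using the default ratios quoted in the comment (20% stale-ratio trigger, 50% queued-debt replan floor):

```go
package main

import "fmt"

func main() {
	// Default thresholds quoted in the comment.
	const triggerRatio, queuedDebtRatio = 0.20, 0.50
	staleRatio := 0.30 // a segment in the 20%-50% band

	staged := staleRatio >= triggerRatio       // staged by the stale-ratio trigger pass
	replanned := staleRatio >= queuedDebtRatio // dropped by the confirm-refresh replan
	fmt.Println(staged, replanned)             // staged but never replanned: debt stuck pending
}
```

Any segment whose stale ratio lands between the two defaults is staged and then perpetually filtered out, which is the stuck-pending behavior the comment flags.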

Comment on lines 609 to 610
for i := 0; i < k+1; i++ {
cur := binary.LittleEndian.Uint32(prefix[off : off+4])

P2 Badge Validate grouped RIDs before fast-decoding compressed frames

This new ReadAppend fast path skips over the grouped RID table and only parses offsets, so it no longer rejects frames whose RID entries are structurally invalid. The previous fallback (ReadAtWithDict) validated each RID, which means a compressed grouped record with zeroed/corrupt RID bytes but otherwise readable payload now returns data through ReadAppend/ReadUnsafeAppendBatch instead of ErrCorrupt (including when the bad frame still has a matching CRC, or when checksum verification is disabled). That weakens value-log integrity on iterator and batch-read paths.


Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 37 out of 37 changed files in this pull request and generated 2 comments.



Comment on lines +1100 to +1134
// AcquireSealedLazyMmapBudget raises this manager's sealed lazy-mmap limits
// until the returned release func is called. It only ever increases the global
// defaults for this manager instance; ordinary readers keep using the existing
// defaults unless a caller explicitly acquires a higher maintenance budget.
func (m *Manager) AcquireSealedLazyMmapBudget(minSegments int, minBytes int64) func() {
if m == nil {
return func() {}
}
if minSegments < 0 {
minSegments = 0
}
if minBytes < 0 {
minBytes = 0
}
m.mu.Lock()
m.sealedLazyMmapBudgetDepth++
if minSegments > m.sealedLazyMmapBudgetSegments {
m.sealedLazyMmapBudgetSegments = minSegments
}
if minBytes > m.sealedLazyMmapBudgetBytes {
m.sealedLazyMmapBudgetBytes = minBytes
}
m.mu.Unlock()
return func() {
m.mu.Lock()
if m.sealedLazyMmapBudgetDepth > 0 {
m.sealedLazyMmapBudgetDepth--
if m.sealedLazyMmapBudgetDepth == 0 {
m.sealedLazyMmapBudgetSegments = 0
m.sealedLazyMmapBudgetBytes = 0
}
}
m.mu.Unlock()
}
}

Copilot AI Mar 23, 2026


AcquireSealedLazyMmapBudget uses a single depth counter and keeps the maximum requested segments/bytes until all acquisitions are released. With nested acquisitions, releasing the inner budget will not restore the previous (smaller) budget while the outer acquisition is still active, even though the comment says the override lasts "until the returned release func is called".

If nested acquisition is possible, consider tracking prior values per-acquisition in the returned closure (stack/restore pattern), so each release restores the manager budget to the max of remaining active budgets rather than only resetting when depth reaches 0.
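The stack/restore pattern the comment recommends can be sketched with a toy manager. This is an illustration under the assumption that releases nest in LIFO order (as closure-scoped `defer release()` usage would guarantee); each acquisition remembers the budget it replaced rather than relying on a shared depth counter.

```go
package main

import "fmt"

// manager is a toy stand-in for the sealed lazy-mmap budget manager.
type manager struct{ segments int }

// acquire raises the budget and returns a release func that restores the
// budget that was in effect before this acquisition (stack/restore pattern,
// assuming LIFO release order).
func (m *manager) acquire(minSegments int) (release func()) {
	prev := m.segments
	if minSegments > m.segments {
		m.segments = minSegments
	}
	return func() { m.segments = prev }
}

func main() {
	m := &manager{}
	outer := m.acquire(10)
	inner := m.acquire(4)   // smaller nested request
	inner()                 // releasing the inner budget...
	fmt.Println(m.segments) // ...keeps the outer budget of 10 intact
	outer()
	fmt.Println(m.segments) // outer release restores the default of 0
}
```

With a single depth counter, the inner release above would have left the widened budget in place until every acquisition released; the per-acquisition `prev` avoids that.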

Comment on lines +1245 to +1252
if canceledErr == nil && restrictSource && haveCachedReferencedBefore {
referencedAfter = rewriteCleanupDerivedReferencedSegments(cachedReferencedBefore, sourceIDs, newValueIDs)
} else {
runRewriteCleanupReferencedScanHook()
referencedAfter, err = db.referencedValueLogSegments(context.Background())
if err != nil {
return stats, err
}

Copilot AI Mar 23, 2026


rewriteCleanupDerivedReferencedSegments derives referencedAfter by cloning the cached referenced set and unconditionally deleting all sourceIDs. This is only safe if the online rewrite guarantees it rewrote all references to those segments. Since ValueLogRewriteOnline only scans the user tree (and only rewrites leaf-ref pages separately), segments in sourceIDs could still be referenced elsewhere (e.g., the system tree or other reachability sources), and would then be incorrectly omitted from referencedAfter, potentially causing them to be marked zombie and deleted.

Consider falling back to referencedValueLogSegments (full scan) for this optimization, or augment the derivation with a targeted verification pass that checks for remaining references to sourceIDs outside the rewritten keyspace before deleting them from the referenced set.

Suggested change:
-if canceledErr == nil && restrictSource && haveCachedReferencedBefore {
-	referencedAfter = rewriteCleanupDerivedReferencedSegments(cachedReferencedBefore, sourceIDs, newValueIDs)
-} else {
-	runRewriteCleanupReferencedScanHook()
-	referencedAfter, err = db.referencedValueLogSegments(context.Background())
-	if err != nil {
-		return stats, err
-	}
+runRewriteCleanupReferencedScanHook()
+referencedAfter, err = db.referencedValueLogSegments(context.Background())
+if err != nil {
+	return stats, err

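The derivation under discussion can be sketched as follows (assumed semantics, not the repository's implementation): clone the cached referenced set, drop the rewritten sources, add the segments that now hold the rewritten values. The sketch makes the hazard concrete: the `delete` is only safe when every reference to the source segment was rewritten.

```go
package main

import "fmt"

// derivedReferenced sketches the cached-set optimization: clone the
// before-set, remove rewritten sources, add the new value segments.
// The unconditional delete is the step the review questions: it is wrong
// if anything outside the rewritten keyspace still points at a source ID.
func derivedReferenced(before map[uint32]struct{}, sourceIDs, newIDs []uint32) map[uint32]struct{} {
	out := make(map[uint32]struct{}, len(before)+len(newIDs))
	for id := range before {
		out[id] = struct{}{}
	}
	for _, id := range sourceIDs {
		delete(out, id) // unsafe if e.g. the system tree still references id
	}
	for _, id := range newIDs {
		out[id] = struct{}{}
	}
	return out
}

func main() {
	before := map[uint32]struct{}{1: {}, 2: {}}
	after := derivedReferenced(before, []uint32{1}, []uint32{9})
	_, has1 := after[1]
	_, has9 := after[9]
	fmt.Println(has1, has9) // source dropped, new segment added
}
```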

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8236397431


Comment on lines +1641 to +1644
lane, _ := valuelog.DecodeFileID(segment.fileID)
current, ok := newestByLane[lane]
if !ok || segment.seq > current.seq {
newestByLane[lane] = segment

P1 Badge Skip empty newest segments when deriving rewrite RID start

If a lane's highest-seq segment exists but has not received any records yet, this optimization picks that empty file and ignores the previous sealed segment that still holds the lane's highest RID. That state is reachable in normal cached mode because rotateValueLogWriterMuHeld creates and registers/promotes a new segment before the first append (TreeDB/caching/db.go:16282-16301). In a direct DB.ValueLogRewriteOnline call where ReserveRIDs is nil, nextRewriteRIDStart can therefore return a too-low start RID and the rewrite will emit duplicate RIDs; reopen/recovery later fails on the duplicate check in TreeDB/db/wal_recovery.go:193-197.

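The guard this implies can be shown with a minimal model: when picking a lane's newest segment as the RID source, skip segments that hold no records yet, otherwise a freshly rotated empty file shadows the sealed segment that actually carries the lane's highest RID. Field names here (`seq`, `records`, `maxRID`) are hypothetical simplifications.

```go
package main

import "fmt"

// segment is a toy model of a lane's value-log segment.
type segment struct {
	seq     uint64
	records int
	maxRID  uint64
}

// newestWithRecords picks the highest-seq segment that has received at
// least one record; an empty just-rotated segment carries no RID state.
func newestWithRecords(segs []segment) (segment, bool) {
	var best segment
	found := false
	for _, s := range segs {
		if s.records == 0 {
			continue // freshly rotated, empty: no RID information
		}
		if !found || s.seq > best.seq {
			best, found = s, true
		}
	}
	return best, found
}

func main() {
	segs := []segment{{seq: 5, records: 100, maxRID: 4100}, {seq: 6, records: 0}}
	best, _ := newestWithRecords(segs)
	fmt.Println(best.maxRID) // the sealed segment's max RID, not the empty newest one
}
```

Without the `records == 0` skip, the model would pick seq 6 and report a max RID of 0, which is the too-low start RID / duplicate-RID failure the comment describes.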


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b05a8b4564


Comment on lines +115 to +116
if db.cached != nil {
if err := db.Checkpoint(); err != nil {

P2 Badge Keep the plan-only API from checkpointing cached DBs

DB.ValueLogRewritePlan is now called from the new vlog-maint-exit-loop reseed/probe path, but this wrapper unconditionally runs Checkpoint() whenever the DB is cached. In cached mode Checkpoint() is not a read-only preparation step: caching.DB.Checkpoint() can flush memtables and then call maybeKickVlogGenerationMaintenanceAfterCheckpoint, so a caller that only wants to inspect candidate segments can mutate the DB and even start rewrite/GC before the explicit maintenance pass runs. That makes the new offline/debug commands nondeterministic and also gives the new public planning API surprising side effects.


Copilot AI review requested due to automatic review settings March 23, 2026 18:49
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 43 out of 43 changed files in this pull request and generated 1 comment.



fatalf("cpuprofile: %v", err)
}
profFile = f
runtimepprof.StartCPUProfile(profFile)

Copilot AI Mar 23, 2026


runtime/pprof.StartCPUProfile returns an error, but this call ignores it. Please handle the error (and close the file on failure) so the command doesn’t silently run without profiling or leak the file descriptor when profiling can’t be started.

Suggested change:
-runtimepprof.StartCPUProfile(profFile)
+if err := runtimepprof.StartCPUProfile(profFile); err != nil {
+	_ = profFile.Close()
+	fatalf("cpuprofile: %v", err)
+}


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 64711a9b3b


}
recs[i] = valuelog.Record{RID: rid, Value: leafPages[start+i]}
}
ptrs, _, err := w.w.AppendFrameWithStatsInto(0, nil, recs, dst)

P1 Badge Don't batch leaf pages into multi-record grouped frames

page.EncodeLeafRef only preserves FileID and Offset; page/leaf_ref.go reconstructs every leaf ref with grouped sub-index 0 under the explicit K=1 assumption. This new path writes up to 8 leaf pages in one AppendFrameWithStatsInto call, so internal/bulk/builder.go:194-203 can now encode nonzero-subindex pointers as leaf refs during offline rewrite. On any DB using IndexOuterLeavesInValueLog with more than one rewritten leaf page per batch, the 2nd+ child refs decode back to the first leaf in that frame after reopen, corrupting tree navigation and lookups.



@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7f484cef31


Value: leafPages[i],
}
}
return l.db.appendValueLog(lane, 0, nil, records, journalDurabilityNone, true)

P1 Badge Do not group multiple outer-leaf pages into one frame

When IndexOuterLeavesInValueLog is enabled and a split/merge or bulk build persists 2+ outer leaf pages at once, this funnels them through a single appendValueLog call. That produces grouped pointers, but page.EncodeLeafRef/DecodeLeafRef in TreeDB/page/leaf_ref.go only preserve (FileID, Offset) and explicitly assume leaf pages are stored as K=1, so every pointer after the first collapses to the same leaf-ref ID as sub-index 0. In practice, zipper.mergeLeaf and the bulk builder can now write sibling/root child refs that already alias the wrong leaf in memory and always decode to the wrong page after reopen.


Copilot AI review requested due to automatic review settings March 23, 2026 19:19
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 52 out of 52 changed files in this pull request and generated 1 comment.



Comment on lines +1402 to +1410
var (
leafPage []byte
err error
)
if toer, ok := c.leafReader.(unsafeToReader); ok {
leafPage, _, err = toer.ReadUnsafeTo(ptr, c.leafScratch[:0])
if err == nil && len(leafPage) > 0 && len(leafPage) <= cap(c.leafScratch) {
c.leafScratch = leafPage[:0]
}

Copilot AI Mar 23, 2026


In the leaf-ref rewrite path, leafScratch starts nil, so calling ReadUnsafeTo(ptr, c.leafScratch[:0]) won’t provide reusable capacity and the cap(c.leafScratch) check prevents retaining the returned buffer. Consider preallocating c.leafScratch to make([]byte, 0, page.PageSize) (or growing when cap < page.PageSize) before the first ReadUnsafeTo call so the scratch reuse optimization actually takes effect.

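The preallocation suggested above can be sketched as a small helper. `pageSize` stands in for the real page.PageSize constant; the point is that a nil scratch slice has zero capacity, so the reuse check can never succeed until the buffer is given real capacity.

```go
package main

import "fmt"

const pageSize = 4096 // stand-in for page.PageSize

// ensureScratch gives the scratch buffer reusable capacity before the
// first read, so the len <= cap retention check can actually take effect.
func ensureScratch(scratch []byte) []byte {
	if cap(scratch) < pageSize {
		return make([]byte, 0, pageSize)
	}
	return scratch[:0]
}

func main() {
	var leafScratch []byte // starts nil: cap 0, so reuse never kicks in
	leafScratch = ensureScratch(leafScratch)
	fmt.Println(cap(leafScratch) >= pageSize) // reusable capacity now available
}
```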

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

if db.indexPackedValuePtr {
// Packed on-disk pointers store Offset as u32. Ensure rewritten segments
// rotate so newly written pointers remain representable.
const packedMax = int64(^uint32(0)) - 4
if maxBytes > packedMax {
maxBytes = packedMax
}

P2 Badge Honor leaf-ref offset cap in online rewrite

When IndexOuterLeavesInValueLog is enabled, leaf refs can only encode offsets up to page.LeafRefMaxOffset, and the offline path later in this file already clamps rewrite segment size for that reason. The online path here still only applies the packed-pointer cap, so a caller that sets MaxSegmentBytes above ~512 MiB can fill a rewrite segment past the LeafRef offset limit and then fail in rewriteLeafRefsOnline when page.EncodeLeafRef sees an unrepresentable offset. This only shows up on DBs with outer leaves in the value log, but in that configuration it turns a valid online rewrite request into a mid-pass failure.


Comment on lines +886 to 889
for _, id := range candidateFileIDs {
f := files[id]
if _, ok := active[id]; ok {
continue

P2 Badge Don't discard explicit active source IDs

With SourceFileIDs plus stale filters, rewritePlanNeedsLiveEstimate now routes explicit IDs through this sparse-selection loop. This unconditional active check silently drops any requested segment that is still marked active, so ValueLogRewritePlan/ValueLogRewriteOnline can return an empty selection even when the caller explicitly named that file ID and it satisfies the stale-byte/ratio filter. That regression is specific to the new filtered-SourceFileIDs path; before this change, explicit IDs were honored as requested.



@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 25da6d2fe7


lane := &l.db.lanes[l.laneID]
rid := l.db.nextRID.Add(1)
-ptr, retainPath, err := l.db.appendValueLogOneInternal(lane, 0, nil, rid, leafPage, journalDurabilityNone, false)
+ptr, retainPath, err := l.db.appendValueLogOneInternal(lane, 0, nil, rid, leafPage, journalDurabilityNone, false, true)

P1 Badge Write single outer-leaf pages as grouped records

When IndexOuterLeavesInValueLog is enabled and compression falls back to the non-block path (for example ValueLogCompression=Off), this AppendLeafPage call still goes through appendValueLogOneInternal, which writes a plain record instead of a grouped K=1 frame. After this change page.DecodeLeafRef always reconstructs grouped pointers, so the empty/single-leaf builder paths in TreeDB/internal/bulk/builder.go:61 and :305 now persist leaf refs that reopen as corrupt because leaf-page reads try to parse a grouped frame that was never written.


Comment on lines +523 to +526
if opts.IndexOuterLeavesInValueLog {
leafRefMaxSegmentBytes := int64(page.LeafRefMaxOffset) - 4
if valueLogMaxSegmentBytes == 0 || leafRefMaxSegmentBytes < valueLogMaxSegmentBytes {
valueLogMaxSegmentBytes = leafRefMaxSegmentBytes
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Clamp rewrite segment bytes to the new LeafRef offset limit

These new LeafRefMaxOffset clamps only protect normal cached writes. Online rewrite still uses opts.MaxSegmentBytes verbatim in TreeDB/db/vlog_rewrite.go:1077-1089, and cached generational maintenance feeds that field from db.valueLogGenerationWarmTarget at TreeDB/caching/db.go:13257-13261. In deployments with IndexOuterLeavesInValueLog and a warm/max rewrite segment target above the new ~512 MiB leaf-ref ceiling, a rewrite can now fail mid-run once a copied leaf page lands past the 29-bit offset limit, because page.EncodeLeafRef rejects the rewritten pointer.


@snissn
Owner Author

snissn commented Mar 23, 2026

Tracking the next sprint off a single authority issue: #844

This captures:

  • the current leaf-frame fix status
  • the remaining ordinary-value write-shape problem
  • the roadmap
  • the final validation matrix (fast, wal_on_fast, post-run data-folder analysis, post-offline-rewrite analysis, and unified-bench gates).

Copilot AI review requested due to automatic review settings March 24, 2026 05:31
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 58 out of 58 changed files in this pull request and generated 1 comment.



Comment on lines +1292 to +1306
if z.outerLeavesInValueLog && z.leafPageBatchLog != nil {
data := append([]byte(nil), target.Data()...)
pendingLeafPersists = append(pendingLeafPersists, pendingLeafPersist{
data: data,
root: target == builder,
splitIdx: pendingSplitIdx,
})
if target == builder {
rootPersisted = true
}
if len(pendingLeafPersists) >= 8 {
return 0, flushPendingLeafPersists()
}
return 0, nil
}

Copilot AI Mar 24, 2026


In mergeLeaf's batched outer-leaf persist path, the split leaf builder/page buffers are never released back to their pools. Unlike the non-batched path (which releases pooled builders/pages after persisting when target != builder), the batched branch copies target.Data() into pendingLeafPersists but then returns without releasing target when it's a pooled split leaf. This will accumulate unreleased pooled builders/outerLeafBuildPages across splits and can materially increase memory usage during large merges. After copying the page data, release pooled resources for non-root targets (and clear the split's pooled flags) similarly to the non-batched persist path.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 45e774698f


Comment on lines +1302 to +1305
if len(pendingLeafPersists) >= 8 {
return 0, flushPendingLeafPersists()
}
return 0, nil

P2 Badge Release pooled split pages on batched leaf persistence

In the new batched outer-leaf path, persistTarget returns from this branch before reaching the cleanup block that releases target and targetOuterLeafData back to their pools. On split-heavy merges when outerLeavesInValueLog and leafPageBatchLog are enabled, each persisted non-root split leaf leaves behind pooled builder/scratch allocations, so memory use grows with split count and pooling effectiveness is lost.


Comment on lines +339 to +340
if len(in) == 0 {
return nil

P2 Badge Preserve empty referenced sets in ref-cache cloning

Returning nil for an empty referenced-ID set collapses a valid "known empty" scan result into the same sentinel used for "no cached value." Downstream cache lookups check for non-nil refs, so successful scans that find zero live segments become cache misses and trigger repeated full reachability rescans in rewrite/GC maintenance, increasing latency on fully-rewritten or near-empty datasets.

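The nil-vs-empty distinction the comment asks for can be sketched in a clone helper. This is an illustration, not the repository's code: the convention shown is that a non-nil empty map means "scanned and found nothing" (cacheable), while nil stays the "no cached value" sentinel.

```go
package main

import "fmt"

// cloneRefs preserves the nil/empty distinction: a successful scan that
// found zero live segments clones to a non-nil empty map ("known empty"),
// while nil remains reserved for "no cached result".
func cloneRefs(in map[uint32]struct{}) map[uint32]struct{} {
	if in == nil {
		return nil // no cached result: caller must rescan
	}
	out := make(map[uint32]struct{}, len(in)) // len 0 still yields a valid non-nil set
	for id := range in {
		out[id] = struct{}{}
	}
	return out
}

func main() {
	fmt.Println(cloneRefs(nil) == nil)                   // cache-miss sentinel preserved
	fmt.Println(cloneRefs(map[uint32]struct{}{}) == nil) // known-empty result stays cached
}
```

Collapsing both cases to nil is what turns every fully-rewritten dataset into a perpetual cache miss and repeated full reachability rescans.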
