Skip to content

caching: gate runtime vlog maintenance by sync phase#845

Open
snissn wants to merge 16 commits intopr/vlog-maint-v2-runtime-debtfrom
pr/vlog-maint-v2-phasehook-base
Open

caching: gate runtime vlog maintenance by sync phase#845
snissn wants to merge 16 commits intopr/vlog-maint-v2-runtime-debtfrom
pr/vlog-maint-v2-phasehook-base

Conversation

@snissn
Copy link
Copy Markdown
Owner

@snissn snissn commented Mar 24, 2026

Summary

  • add explicit TreeDB maintenance phases (restore, catchup, steady)
  • bridge existing treedbtrace phase events into TreeDB maintenance phase changes
  • keep this as the clean active sprint base for real run_celestia validation

Why

Corrected real-node validation against gomap-gemini shows this minimal fix set is sufficient for both fast and wal_on_fast, while larger local runtime experiments are unnecessary and worse.

Validation

  • corrected wal_on_fast completed cleanly on the clean phase-hook base in ~297s
  • corrected fast completed cleanly on the clean phase-hook base in ~300s
  • both stayed inside the expected wall-time band for the sprint

Copilot AI review requested due to automatic review settings March 24, 2026 17:56
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces an explicit maintenance “phase” concept (restore/catchup/steady) and uses it to gate background value-log (vlog) generation maintenance, with a trace bridge so existing treedbtrace phase events can drive TreeDB’s maintenance phase during real-node runs.

Changes:

  • Add MaintenancePhase API to TreeDB (re-exported from caching DB) and expose it in stats.
  • Bridge treedbtrace phases into treedb.DB.SetMaintenancePhase() via a global phase bus tied to wrapped DB lifetimes.
  • Gate periodic + checkpoint-kick vlog generation maintenance by maintenance phase, and expand rewrite planning/state to support staged confirmation and age-blocked reporting.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
kvstore/adapters/treedbtrace/treedbtrace.go Registers/unregisters wrapped *treedb.DB with a phase bus so trace phase changes propagate into maintenance phase.
kvstore/adapters/treedbtrace/trace.go Adds the phase bus, phase normalization/mapping, and CurrentTracePhase().
kvstore/adapters/treedbtrace/phase_bridge_test.go Verifies trace-phase propagation, apply-on-wrap, and unregister-on-close behavior.
TreeDB/public.go Re-exports MaintenancePhase and adds DB.MaintenancePhase() / DB.SetMaintenancePhase().
TreeDB/maintenance_phase_test.go Confirms TreeDB forwards maintenance phase to cached DB and exposes it via stats.
TreeDB/db/vlog_rewrite.go Adds age-blocked selection stats to rewrite plans and refactors sparse source selection / live-estimate decisioning.
TreeDB/caching/db.go Implements phase state, stats export, phase-gated periodic maintenance, and phase-gated checkpoint kick.
TreeDB/caching/vlog_generation_state.go Persists staged-rewrite confirmation state and adds helper paths for staging/restaging.
TreeDB/caching/vlog_generation_state_test.go Updates loader tests for new stage fields returned by the loader.
TreeDB/caching/vlog_generation_scheduler_test.go Updates WAL-on checkpoint-kick expectations and adds tests for phase-gated maintenance behavior.
TreeDB/caching/vlog_generation_checkpoint_kick_walon_test.go Adds a focused WAL-on checkpoint-kick skip test in caching package.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +11381 to +11390
func (db *DB) maybeRunPeriodicVlogGenerationMaintenance(runGC bool) bool {
if db == nil {
return false
}
if db.suppressBackgroundVlogGenerationForMaintenancePhase() {
db.debugVlogMaintf("periodic_skip reason=maintenance_phase phase=%s run_gc=%t", maintenancePhaseString(uint32(db.MaintenancePhase())), runGC)
return false
}
db.maybeRunVlogGenerationMaintenance(runGC)
return true
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maintenance-phase gating is currently applied to the periodic loop and checkpoint-kick, but deferred/background paths (e.g., stage-confirm/age-blocked wakes via startVlogGenerationDeferredMaintenance and scheduleVlogGenerationRewriteStageConfirmation) still call maybeRunVlogGenerationMaintenanceWithOptions without checking MaintenancePhase(). That means restore/catchup can still run generation maintenance despite the new phase gate. Consider moving the phase check into maybeRunVlogGenerationMaintenanceWithOptions (or into the deferred wake/scheduler entrypoints) so all background triggers are consistently suppressed when phase != steady.

Copilot uses AI. Check for mistakes.
Comment on lines +3345 to +3349
time.Sleep(2 * schedulerTestWait(t))
_, rewriteCalls := recorder.recordedRewrite()
_, planCalls := recorder.recordedPlan()
if rewriteCalls != 0 || planCalls != 0 {
t.Fatalf("checkpoint kick unexpectedly ran under WAL-on: planCalls=%d rewriteCalls=%d", planCalls, rewriteCalls)
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test uses a fixed sleep to assert that no checkpoint-kick ran. Negative assertions via time.Sleep can be flaky and can miss late-arriving goroutines. If possible, prefer checking a deterministic signal (e.g., vlogGenerationCheckpointKickRuns/Active remain 0, or have the recorder expose a channel that would fire on any plan/rewrite call and select with a timeout).

Suggested change
time.Sleep(2 * schedulerTestWait(t))
_, rewriteCalls := recorder.recordedRewrite()
_, planCalls := recorder.recordedPlan()
if rewriteCalls != 0 || planCalls != 0 {
t.Fatalf("checkpoint kick unexpectedly ran under WAL-on: planCalls=%d rewriteCalls=%d", planCalls, rewriteCalls)
// Monitor for any unexpected plan/rewrite activity after the checkpoint.
done := make(chan struct{})
go func() {
for {
_, rewriteCalls := recorder.recordedRewrite()
_, planCalls := recorder.recordedPlan()
if rewriteCalls != 0 || planCalls != 0 {
close(done)
return
}
time.Sleep(10 * time.Millisecond)
}
}()
select {
case <-done:
// Re-read to report current counts.
_, rewriteCalls := recorder.recordedRewrite()
_, planCalls := recorder.recordedPlan()
t.Fatalf("checkpoint kick unexpectedly ran under WAL-on: planCalls=%d rewriteCalls=%d", planCalls, rewriteCalls)
case <-time.After(2 * schedulerTestWait(t)):
// Success: no checkpoint kick occurred within the wait window.

Copilot uses AI. Check for mistakes.
Comment on lines 46 to 52
err := d.inner.Close()
if d.phase != nil {
phaseBus.unregister(d.phase)
}
if t := getTrace(); t != nil {
t.closeDB()
}
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Close(), the adapter unregisters from phaseBus only after calling d.inner.Close(). That leaves a window where SetTracePhase can still call SetMaintenancePhase on a DB that is closing/closed (and if Close() returns an error but leaves the DB usable, it gets unregistered anyway). Consider unregistering from phaseBus before closing the inner DB (and optionally re-registering on close failure if that’s meaningful).

Suggested change
err := d.inner.Close()
if d.phase != nil {
phaseBus.unregister(d.phase)
}
if t := getTrace(); t != nil {
t.closeDB()
}
if d.phase != nil {
phaseBus.unregister(d.phase)
}
if t := getTrace(); t != nil {
t.closeDB()
}
err := d.inner.Close()

Copilot uses AI. Check for mistakes.

func selectRewriteSourceSegments(opts ValueLogRewriteOnlineOptions, files map[uint32]*valuelog.File, active map[uint32]struct{}, liveByID map[uint32]int64) map[uint32]struct{} {
if len(opts.SourceFileIDs) > 0 {
func selectRewriteSourceSegments(opts ValueLogRewriteOnlineOptions, files map[uint32]*valuelog.File, active map[uint32]struct{}, liveByID map[uint32]int64) (map[uint32]struct{}, rewriteSourceSelectionStats) {
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

selectRewriteSourceSegments now returns two values, but there are existing call sites in the repo (including TreeDB/db/vlog_rewrite_test.go) that still expect a single return value. This will break compilation unless all callers are updated to handle the added rewriteSourceSelectionStats result (even if they ignore it).

Suggested change
func selectRewriteSourceSegments(opts ValueLogRewriteOnlineOptions, files map[uint32]*valuelog.File, active map[uint32]struct{}, liveByID map[uint32]int64) (map[uint32]struct{}, rewriteSourceSelectionStats) {
func selectRewriteSourceSegments(opts ValueLogRewriteOnlineOptions, files map[uint32]*valuelog.File, active map[uint32]struct{}, liveByID map[uint32]int64) map[uint32]struct{} {
selected, _ := selectRewriteSourceSegmentsWithStats(opts, files, active, liveByID)
return selected
}
func selectRewriteSourceSegmentsWithStats(opts ValueLogRewriteOnlineOptions, files map[uint32]*valuelog.File, active map[uint32]struct{}, liveByID map[uint32]int64) (map[uint32]struct{}, rewriteSourceSelectionStats) {

Copilot uses AI. Check for mistakes.
Comment on lines +828 to +836
if len(opts.SourceFileIDs) > 0 && liveByID == nil {
selected := make(map[uint32]struct{}, len(opts.SourceFileIDs))
for _, id := range opts.SourceFileIDs {
if _, ok := files[id]; !ok {
continue
}
selected[id] = struct{}{}
}
return selected
return selected, stats
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When SourceFileIDs are provided and liveByID is nil, this early return ignores the provided active set (and also bypasses the protectedIDs/minSegmentAge filtering below). That makes it possible for explicitly requested segments to include the currently-active segment(s), which the rest of the selection logic explicitly excludes. Suggest applying the same active/protected/min-age checks for SourceFileIDs selection, and removing the later (currently unreachable) in-loop branch that checks the same condition.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b67d6d8cdc

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +13171 to +13173
if db.suppressBackgroundVlogGenerationForMaintenancePhase() {
db.debugVlogMaintf("checkpoint_kick_skip reason=maintenance_phase phase=%s", maintenancePhaseString(uint32(db.MaintenancePhase())))
return
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Clear pending checkpoint retries in non-steady phases

This early return skips new checkpoint kicks during restore/catchup, but it leaves any previously-set vlogGenerationCheckpointKickPending bit untouched. If that bit was set before the phase transition (for example via a maintenance collision), schedulePendingVlogGenerationCheckpointKick can still launch retry maintenance passes in non-steady phases, so background rewrite/GC work still runs while the phase gate is supposed to suppress it.

Useful? React with 👍 / 👎.

Comment on lines +312 to +315
if len(opts.SourceFileIDs) == 0 {
return true
}
return opts.MinSegmentStaleRatio > 0 || opts.MinSegmentStaleBytes > 0
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep plan and execution filters consistent for explicit sources

ValueLogRewritePlan now treats SourceFileIDs + MinSegmentStale* as requiring live-byte estimation, so planning can filter out explicit IDs by stale thresholds. ValueLogRewriteOnline still does not estimate live bytes when SourceFileIDs is non-empty, so execution can still rewrite all explicit IDs. This creates plan/execute divergence for the same options and can cause rewrites that the plan said would be skipped.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9f4c306770

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +867 to +871
for _, id := range opts.SourceFileIDs {
if _, ok := files[id]; !ok {
continue
}
candidateFileIDs = append(candidateFileIDs, id)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Deduplicate explicit source IDs before budgeted selection

The new explicit-source branch now appends every SourceFileIDs entry directly into candidateFileIDs and runs through the shared byte-budget loop, so duplicate IDs are processed multiple times. Because selection accounting adds candidate.liveBytes per candidate, duplicates can consume MaxSourceBytes repeatedly and cause later distinct IDs to be skipped (for example, [1,1,2] can select only {1} under a tight byte cap). This regresses explicit-ID rewrites by silently omitting requested segments when duplicates are present.

Useful? React with 👍 / 👎.

Comment on lines +149 to +153
b.mu.Lock()
b.dbs[db]++
b.mu.Unlock()
db.SetMaintenancePhase(maintenancePhaseForTrace(b.current()))
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Serialize phase registration with phase updates

register records the DB in b.dbs, unlocks, and only then applies b.current() to the DB. If SetTracePhase runs in that window, set() can already push the new phase to this DB and register can immediately overwrite it with a stale phase value. In that race, a freshly wrapped DB may stay in steady during restore/catchup until another phase change arrives, which defeats the maintenance-phase gate for that instance.

Useful? React with 👍 / 👎.

Copilot AI review requested due to automatic review settings March 24, 2026 19:37
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

minStaleBytes := opts.MinSegmentStaleBytes
maxSourceSegments := opts.MaxSourceSegments
maxSourceBytes := opts.MaxSourceBytes
minSegmentAge := opts.MinSegmentAge
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In selectRewriteSourceSegmentsWithStats, MaxSourceSegments/MaxSourceBytes are applied even when opts.SourceFileIDs is non-empty (explicit source selection). The option docs above state these caps apply only when SourceFileIDs is empty, so this changes API semantics and can unexpectedly drop explicitly requested segments. Consider ignoring maxSourceSegments/maxSourceBytes when explicitSources is true (or update the option documentation and call sites accordingly).

Suggested change
minSegmentAge := opts.MinSegmentAge
minSegmentAge := opts.MinSegmentAge
explicitSources := len(opts.SourceFileIDs) > 0
if explicitSources {
// When explicit source file IDs are provided, ignore global caps so that
// we do not drop any explicitly requested segments. The option
// documentation states that MaxSourceSegments/MaxSourceBytes only apply
// when SourceFileIDs is empty.
maxSourceSegments = 0
maxSourceBytes = 0
}

Copilot uses AI. Check for mistakes.
Comment on lines 899 to +913
if minSegmentAge > 0 && f.Path != "" {
if info, err := os.Stat(f.Path); err == nil {
if age := now.Sub(info.ModTime()); age < minSegmentAge {
liveBytes := liveByID[id]
if liveBytes < 0 {
liveBytes = 0
}
if liveBytes > size {
liveBytes = size
}
staleBytes := size - liveBytes
stats.ageBlockedSegments++
stats.ageBlockedBytesTotal += size
stats.ageBlockedBytesLive += liveBytes
stats.ageBlockedBytesStale += staleBytes
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AgeBlockedBytesLive/AgeBlockedBytesStale are computed from liveByID even when liveByID is nil (no live-byte estimation). In that case liveByID[id] reads as 0, which will record the entire segment as stale and can make the plan’s AgeBlocked* stats misleading. Consider treating live bytes as unknown here (e.g., skip live/stale accounting) or default to liveBytes=size when no estimate is available, consistent with the explicitSources/liveByID==nil candidate path below.

Copilot uses AI. Check for mistakes.
Copilot AI review requested due to automatic review settings March 24, 2026 19:47
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

if maxSourceSegments > 0 && len(selected) >= maxSourceSegments {
break
}
if maxSourceBytes > 0 {

P2 Badge Skip sparse caps when explicit source IDs are supplied

ValueLogRewriteOnlineOptions documents that MaxSourceSegments/MaxSourceBytes apply only when SourceFileIDs is empty, but after this refactor explicit IDs now flow through the shared selection loop and hit these cap checks. That means callers who provide SourceFileIDs (for a precise rewrite set) can now silently lose requested segments when a non-zero cap is present, e.g. from a reused options struct, producing incomplete rewrites that contradict the API contract.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 899 to +914
if minSegmentAge > 0 && f.Path != "" {
if info, err := os.Stat(f.Path); err == nil {
if age := now.Sub(info.ModTime()); age < minSegmentAge {
liveBytes := liveByID[id]
if liveBytes < 0 {
liveBytes = 0
}
if liveBytes > size {
liveBytes = size
}
staleBytes := size - liveBytes
stats.ageBlockedSegments++
stats.ageBlockedBytesTotal += size
stats.ageBlockedBytesLive += liveBytes
stats.ageBlockedBytesStale += staleBytes
remaining := minSegmentAge - age
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When MinSegmentAge excludes a segment, the AgeBlockedBytesLive/BytesStale stats are computed from liveByID[id]. If liveByID is nil (e.g., explicit SourceFileIDs without stale filters), the nil-map lookup returns 0, so age-blocked segments are reported as 0 live / 100% stale. Consider only populating the live/stale breakdown when liveByID != nil, or treating liveBytes as size (unknown) when no estimate is available, so the plan stats don’t mislead callers.

Copilot uses AI. Check for mistakes.
Comment on lines +12179 to +12185
db.vlogGenerationCheckpointKickPending.Store(false)
if opts.debugSource != "" {
db.debugVlogMaintf(
"maintenance_skip reason=maintenance_phase source=%s phase=%s checkpoint_pending=%t deferred_pending=%t",
opts.debugSource,
maintenancePhaseString(uint32(db.MaintenancePhase())),
db.vlogGenerationCheckpointKickPending.Load(),
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the maintenance-phase suppression branch, checkpoint_pending is cleared before logging, so the debug line will always print checkpoint_pending=false even if it was true on entry. Capture the previous pending value before clearing (or log that it was cleared) to keep the debug output accurate for diagnosing why a kick was dropped.

Suggested change
db.vlogGenerationCheckpointKickPending.Store(false)
if opts.debugSource != "" {
db.debugVlogMaintf(
"maintenance_skip reason=maintenance_phase source=%s phase=%s checkpoint_pending=%t deferred_pending=%t",
opts.debugSource,
maintenancePhaseString(uint32(db.MaintenancePhase())),
db.vlogGenerationCheckpointKickPending.Load(),
prevCheckpointPending := db.vlogGenerationCheckpointKickPending.Swap(false)
if opts.debugSource != "" {
db.debugVlogMaintf(
"maintenance_skip reason=maintenance_phase source=%s phase=%s checkpoint_pending=%t deferred_pending=%t",
opts.debugSource,
maintenancePhaseString(uint32(db.MaintenancePhase())),
prevCheckpointPending,

Copilot uses AI. Check for mistakes.
Comment on lines +3427 to 3432
time.Sleep(2 * schedulerTestWait(t))
_, rewriteCalls := recorder.recordedRewrite()
_, planCalls := recorder.recordedPlan()
if rewriteCalls != 0 || planCalls != 0 {
t.Fatalf("checkpoint kick unexpectedly ran under WAL-on: planCalls=%d rewriteCalls=%d", planCalls, rewriteCalls)
}
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test now uses a fixed sleep to assert the checkpoint kick doesn’t run under WAL-on. Since the WAL-on guard should prevent scheduling the kick at all, you can make this faster and less timing-dependent by asserting kick state/counters (e.g., vlogGenerationCheckpointKickRuns/Active/Pending and recorder calls) immediately after Checkpoint, instead of sleeping.

Copilot uses AI. Check for mistakes.
Copilot AI review requested due to automatic review settings March 24, 2026 19:59
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

TreeDB/caching/vlog_generation_state.go:184

  • The PR adds persisted stage fields (RewriteStagePending / RewriteStageObserved), but the updated tests only assert the default zero-values on load. Adding a round-trip test that calls saveValueLogGenerationRewriteState(..., stagePending=true, stageObservedAt!=0) and then reloads to assert the fields persist would better protect this new behavior from regressions.
func saveValueLogGenerationRewriteState(path string, ids []uint32, ledger []backenddb.ValueLogRewritePlanSegment, penalties map[uint32]valueLogGenerationRewritePenalty, stagePending bool, stageObservedAt int64) error {
	raw, err := loadValueLogGenerationState(path)
	if err != nil {
		return err
	}

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 13 to +14
inner kvstore.DB
phase *treedb.DB
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new field name phase is misleading because it holds a *treedb.DB (not the current phase value). Renaming it to something like phaseDB, treedbDB, or maintenanceDB would make the lifecycle/register/unregister intent much clearer.

Copilot uses AI. Check for mistakes.
Comment on lines +12178 to +12190
if db.suppressBackgroundVlogGenerationForMaintenancePhase() {
db.vlogGenerationCheckpointKickPending.Store(false)
if opts.debugSource != "" {
db.debugVlogMaintf(
"maintenance_skip reason=maintenance_phase source=%s phase=%s checkpoint_pending=%t deferred_pending=%t",
opts.debugSource,
maintenancePhaseString(uint32(db.MaintenancePhase())),
db.vlogGenerationCheckpointKickPending.Load(),
db.vlogGenerationDeferredMaintenancePending.Load(),
)
}
return
}
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This debug log always prints checkpoint_pending=false because it loads the value after forcibly clearing it. If the intent is to help diagnose why work was deferred/skipped, capture the prior pending value before Store(false) and log that instead.

Copilot uses AI. Check for mistakes.
Comment on lines +229 to 232
raw.RewriteStagePending = stagePending
raw.RewriteStageObserved = stageObservedAt
return saveValueLogGenerationState(path, raw)
}
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR adds persisted stage fields (RewriteStagePending / RewriteStageObserved), but the updated tests only assert the default zero-values on load. Adding a round-trip test that calls saveValueLogGenerationRewriteState(..., stagePending=true, stageObservedAt!=0) and then reloads to assert the fields persist would better protect this new behavior from regressions.

Copilot uses AI. Check for mistakes.
Copilot AI review requested due to automatic review settings March 24, 2026 20:12
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0bf23e5b92

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +315 to +317
if len(opts.SourceFileIDs) == 0 {
return opts.MinSegmentStaleRatio > 0 || opts.MinSegmentStaleBytes > 0 || opts.MaxSourceSegments > 0 || opts.MaxSourceBytes > 0
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Require live-byte estimation for MinSegmentAge-only selection

When callers set only MinSegmentAge (no stale/byte caps and no explicit SourceFileIDs), this helper returns false, so ValueLogRewriteOnline/ValueLogRewritePlan skip estimateValueLogLiveBytesBySegment. In selectRewriteSourceSegmentsWithStats, non-explicit candidates then read liveByID[id] as zero and are all filtered out (liveBytes == 0), causing the rewrite pass to become a no-op instead of selecting old-enough segments. This regresses the new age-only sparse-selection mode and can silently disable maintenance for callers relying on MinSegmentAge as their only selector.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 909 to +921
if age := now.Sub(info.ModTime()); age < minSegmentAge {
liveBytes := liveByID[id]
if liveBytes < 0 {
liveBytes = 0
}
if liveBytes > size {
liveBytes = size
}
staleBytes := size - liveBytes
stats.ageBlockedSegments++
stats.ageBlockedBytesTotal += size
stats.ageBlockedBytesLive += liveBytes
stats.ageBlockedBytesStale += staleBytes
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When liveByID is nil (e.g., MinSegmentAge-only selection, or explicit SourceFileIDs without stale filters), liveByID[id] reads as 0 and makes the AgeBlocked live/stale stats look like the segment is 100% stale. If these plan stats are surfaced/used operationally, this will be misleading. Consider handling liveByID == nil explicitly here (e.g., record only total bytes, or pessimistically/optimistically set liveBytes=size), and document the chosen semantics.

Copilot uses AI. Check for mistakes.
Comment on lines +12177 to +12186
db.vlogGenerationCheckpointKickPending.Store(false)
if opts.debugSource != "" {
db.debugVlogMaintf(
"maintenance_skip reason=maintenance_phase source=%s phase=%s checkpoint_pending=%t deferred_pending=%t",
opts.debugSource,
maintenancePhaseString(uint32(db.MaintenancePhase())),
db.vlogGenerationCheckpointKickPending.Load(),
db.vlogGenerationDeferredMaintenancePending.Load(),
)
}
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This debug log always prints checkpoint_pending=false because the flag is cleared immediately before Load(). If the intent is to aid diagnosis, capture the prior value (and optionally deferred-pending) before clearing, log that, then apply the state change. That keeps logs accurate and makes phase gating easier to reason about.

Suggested change
db.vlogGenerationCheckpointKickPending.Store(false)
if opts.debugSource != "" {
db.debugVlogMaintf(
"maintenance_skip reason=maintenance_phase source=%s phase=%s checkpoint_pending=%t deferred_pending=%t",
opts.debugSource,
maintenancePhaseString(uint32(db.MaintenancePhase())),
db.vlogGenerationCheckpointKickPending.Load(),
db.vlogGenerationDeferredMaintenancePending.Load(),
)
}
// Capture prior pending flags for accurate diagnostics before clearing.
priorCheckpointPending := db.vlogGenerationCheckpointKickPending.Load()
priorDeferredPending := db.vlogGenerationDeferredMaintenancePending.Load()
if opts.debugSource != "" {
db.debugVlogMaintf(
"maintenance_skip reason=maintenance_phase source=%s phase=%s checkpoint_pending=%t deferred_pending=%t",
opts.debugSource,
maintenancePhaseString(uint32(db.MaintenancePhase())),
priorCheckpointPending,
priorDeferredPending,
)
}
db.vlogGenerationCheckpointKickPending.Store(false)

Copilot uses AI. Check for mistakes.
Comment on lines +3427 to +3431
time.Sleep(2 * schedulerTestWait(t))
_, rewriteCalls := recorder.recordedRewrite()
_, planCalls := recorder.recordedPlan()
if rewriteCalls != 0 || planCalls != 0 {
t.Fatalf("checkpoint kick unexpectedly ran under WAL-on: planCalls=%d rewriteCalls=%d", planCalls, rewriteCalls)
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Relying on a fixed Sleep to assert that something did not happen can be flaky and can also miss delayed async work (e.g., a kick firing just after the sleep). A more robust pattern here is to poll until a deadline and fail immediately if calls become non-zero, and/or assert the kick state flags (pending/active) are cleared in WAL-on mode to prove no delayed goroutine will fire.

Suggested change
time.Sleep(2 * schedulerTestWait(t))
_, rewriteCalls := recorder.recordedRewrite()
_, planCalls := recorder.recordedPlan()
if rewriteCalls != 0 || planCalls != 0 {
t.Fatalf("checkpoint kick unexpectedly ran under WAL-on: planCalls=%d rewriteCalls=%d", planCalls, rewriteCalls)
deadline := time.Now().Add(2 * schedulerTestWait(t))
for time.Now().Before(deadline) {
_, rewriteCalls := recorder.recordedRewrite()
_, planCalls := recorder.recordedPlan()
if rewriteCalls != 0 || planCalls != 0 {
t.Fatalf("checkpoint kick unexpectedly ran under WAL-on: planCalls=%d rewriteCalls=%d", planCalls, rewriteCalls)
}
time.Sleep(schedulerTestWait(t) / 10)

Copilot uses AI. Check for mistakes.
Comment on lines +293 to +296
const (
keys = 50_000
batchSize = 2_000
)
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test adds a fairly large workload (50k keys + churn) and runs it for two profiles, which can significantly increase CI wall time and/or cause timeouts in slower environments. Consider gating with testing.Short() (skip/reduce size when short), or lowering the key count while keeping the same invariants, so the regression signal remains without a large runtime cost.

Suggested change
const (
keys = 50_000
batchSize = 2_000
)
keys := 50_000
batchSize := 2_000
if testing.Short() {
keys = 5_000
}

Copilot uses AI. Check for mistakes.
Comment on lines 12 to 15
type DB struct {
inner kvstore.DB
phase *treedb.DB
}
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new field name phase is ambiguous (it holds a *treedb.DB, not the phase value). Renaming to something like phaseDB, treedbDB, or maintenancePhaseDB would make it clearer what’s being registered/unregistered with the phase bus.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

gomap/TreeDB/caching/db.go

Lines 12521 to 12525 in 0efea40

} else if shouldDeferVlogGenerationRewritePlanForAge(plan, planOpts.MinSegmentAge) {
db.setVlogGenerationRewriteAgeBlockedUntil(now.Add(plan.AgeBlockedMinRemainingAge))
db.debugVlogMaintf(
"rewrite_plan stale_ratio_trigger age_blocked segments=%d stale_bytes=%d retry_after_ms=%d min_age_ms=%d",
plan.AgeBlockedSegments,

P1 Badge Clear or restage staged debt when confirm plan is empty

When a stage-confirmation pass runs and ValueLogRewritePlan returns no SourceFileIDs, this branch only updates age-blocked state and leaves vlogGenerationRewriteStagePending/...ObservedUnixNano untouched. Because vlogGenerationRewriteStageConfirmDue() then remains true indefinitely, scheduleDueVlogGenerationDeferredMaintenance() keeps re-queuing rewrite_stage_confirm_exit immediately, so the deferred worker can repeatedly re-enter planning without waiting for new budget/data. This can create a tight maintenance retry loop (high CPU / repeated tree scans) whenever a previously staged ledger can no longer be confirmed.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +13145 to +13149
db.schedulePendingVlogGenerationCheckpointKick()
} else {
db.vlogGenerationCheckpointKickPending.Store(false)
}
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Abort active maintenance when leaving steady phase

This phase transition handler only records the new phase and clears pending checkpoint kicks, but it does not stop an already-running maintenance pass. Since the non-steady phase guard is checked only before acquiring vlogGenerationMaintenanceActive, a rewrite/GC that started in steady will continue through restore/catchup if the phase flips mid-run. In that race, background maintenance still executes during phases that are meant to suppress it.

Useful? React with 👍 / 👎.

Copilot AI review requested due to automatic review settings March 24, 2026 20:48
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1a744c4049

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +11839 to +11840
if !db.vlogGenerationRewriteAgeBlockedWakeRunning.CompareAndSwap(false, true) {
return
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Re-arm age-block wake when deadline is shortened

This early return means a new setVlogGenerationRewriteAgeBlockedUntil call cannot reschedule the wake timer while a previous wake goroutine is already sleeping. If the new call sets an earlier deadline (for example after the candidate set changes and AgeBlockedMinRemainingAge drops), maintenance will still wait for the old, later timer before retrying, which can delay rewrite/GC far past the intended age gate and leave rewrite debt stalled longer than necessary.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +11839 to +11864
if !db.vlogGenerationRewriteAgeBlockedWakeRunning.CompareAndSwap(false, true) {
return
}
db.wg.Add(1)
go func(expectedUntil int64, delay time.Duration) {
go func() {
defer db.wg.Done()
if delay > 0 {
timer := time.NewTimer(delay)
select {
case <-timer.C:
case <-db.closeCh:
if !timer.Stop() {
<-timer.C
defer db.vlogGenerationRewriteAgeBlockedWakeRunning.Store(false)
for !db.closing.Load() {
expectedUntil := db.vlogGenerationRewriteAgeBlockedUntilNS.Load()
if expectedUntil <= 0 {
return
}
deadline := time.Unix(0, expectedUntil)
delay := time.Until(deadline)
if delay > 0 {
timer := time.NewTimer(delay)
select {
case <-timer.C:
case <-db.closeCh:
if !timer.Stop() {
<-timer.C
}
return
}
timer.Stop()
}
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The age-blocked wake goroutine isn’t notified when setVlogGenerationRewriteAgeBlockedUntil updates the deadline while a wake is already running. If the new deadline is earlier than the timer currently being waited on, the goroutine will still sleep until the old (later) timer fires, delaying the retry beyond the intended min-age boundary. Consider adding a wake-up signal/resettable timer (or allowing a new goroutine when the deadline is shortened) so earlier deadlines take effect immediately.

Copilot uses AI. Check for mistakes.
Comment on lines +154 to +178
b := db.NewBatch()
rng := rand.New(rand.NewSource(1))
const uniqueKeys = adaptiveMinWrites + 1200
value := make([]byte, 8<<10) // 8KiB
keys := make([][16]byte, uniqueKeys)

for i := 0; i < uniqueKeys; i++ {
var key [16]byte
binary.BigEndian.PutUint64(key[8:16], uint64(i))
// First write introduces a new key; second write is an overwrite.
if err := db.Set(key[:], value); err != nil {
t.Fatalf("Set(first %d): %v", i, err)
binary.BigEndian.PutUint64(keys[i][0:8], rng.Uint64())
binary.BigEndian.PutUint64(keys[i][8:16], uint64(i))
if err := b.Set(keys[i][:], value); err != nil {
t.Fatalf("Batch.Set(first %d): %v", i, err)
}
if err := db.Set(key[:], value); err != nil {
t.Fatalf("Set(overwrite %d): %v", i, err)
}
for i := range uniqueKeys {
j := rng.Intn(uniqueKeys)
keys[i], keys[j] = keys[j], keys[i]
}
for i := 0; i < uniqueKeys; i++ {
if err := b.Set(keys[i][:], value); err != nil {
t.Fatalf("Batch.Set(overwrite %d): %v", i, err)
}
}
if err := b.Write(); err != nil {
t.Fatalf("Batch.Write: %v", err)
}
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test creates a caching.Batch via db.NewBatch() but never closes it. Batch.Close() returns pooled slices/arenas back to the DB; skipping it can inflate memory usage across the test suite and may affect later tests. Add a defer/explicit Close after Write (and handle its error) similar to other batch-using tests.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e2c713da02

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +12193 to +12196
if db.suppressBackgroundVlogGenerationForMaintenancePhase() {
if opts.debugSource != "" {
db.debugVlogMaintf(
"maintenance_skip reason=maintenance_phase source=%s phase=%s checkpoint_pending=%t deferred_pending=%t",
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve checkpoint-pending retries when phase gate skips work

When maintenance is suppressed in restore/catchup, this early return drops out before any retry state is re-armed. In the checkpoint_pending path, schedulePendingVlogGenerationCheckpointKick has already flipped vlogGenerationCheckpointKickPending from true to false before launching retries, so a suppressed attempt here exits runVlogGenerationMaintenanceRetries with the pending bit cleared and no follow-up scheduled. If this happens while a phase transition is in progress, the queued checkpoint retry is effectively lost and won't run after returning to steady unless another checkpoint event re-queues it.

Useful? React with 👍 / 👎.

Copilot AI review requested due to automatic review settings March 24, 2026 22:05
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 45 to +49
func (d *DB) Close() error {
err := d.inner.Close()
if d.phase != nil {
phaseBus.unregister(d.phase)
}
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Close() unregisters the wrapped *treedb.DB from phaseBus only after calling d.inner.Close(). Since SetTracePhase() can concurrently iterate registered DBs and call db.SetMaintenancePhase(), this ordering allows the phase bridge to invoke SetMaintenancePhase during/after treedb.DB.Close() (which nils db.cached), creating a potential data race and post-close calls. Unregister the DB from phaseBus before closing the inner DB (ideally with a defer) so no further phase updates can target a closing instance.

Copilot uses AI. Check for mistakes.
Comment on lines +155 to +171
b := db.NewBatch()
rng := rand.New(rand.NewSource(1))
const uniqueKeys = adaptiveMinWrites + 1200
value := make([]byte, 8<<10) // 8KiB
keys := make([][16]byte, uniqueKeys)

for i := 0; i < uniqueKeys; i++ {
var key [16]byte
binary.BigEndian.PutUint64(key[8:16], uint64(i))
// First write introduces a new key; second write is an overwrite.
if err := db.Set(key[:], value); err != nil {
t.Fatalf("Set(first %d): %v", i, err)
binary.BigEndian.PutUint64(keys[i][0:8], rng.Uint64())
binary.BigEndian.PutUint64(keys[i][8:16], uint64(i))
if err := b.Set(keys[i][:], value); err != nil {
t.Fatalf("Batch.Set(first %d): %v", i, err)
}
if err := db.Set(key[:], value); err != nil {
t.Fatalf("Set(overwrite %d): %v", i, err)
}
for i := range uniqueKeys {
j := rng.Intn(uniqueKeys)
keys[i], keys[j] = keys[j], keys[i]
}
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test won’t compile: for i := range uniqueKeys attempts to range over an int constant. Use an index loop (for i := 0; i < uniqueKeys; i++ { ... }) or range over a slice. Also, b := db.NewBatch() isn’t closed in this test; add defer b.Close() (or close after Write()) to match other batch-using tests.

Copilot uses AI. Check for mistakes.
Comment on lines +936 to +948
if explicitSources && liveByID == nil {
candidates = append(candidates, rewriteSourceSegment{
fileID: id,
liveBytes: size,
})
continue
}
if liveByID == nil {
candidates = append(candidates, rewriteSourceSegment{
fileID: id,
liveBytes: size,
})
continue
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The explicitSources && liveByID == nil branch duplicates the immediately following if liveByID == nil branch (both append the same segment and continue). This is dead/redundant logic; remove the explicitSources-specific branch to simplify the selection loop.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants