treedb: improve live vlog maintenance observability #856

Open
snissn wants to merge 37 commits into main from pr/live-vacuum-observability

Conversation

Owner

@snissn snissn commented Mar 28, 2026

Summary

  • export treedb.cache.vlog_generation.* metrics through expvar selection
  • add rewrite-plan byte counters (total/live/stale) to maintenance stats
  • add cumulative rewrite reclaimed-bytes counter to make progress visible under live rewrite+gc
  • cover new metric exposure with expvar and scheduler unit tests

Testing

  • go test ./TreeDB/caching -count=1
  • go test ./TreeDB/caching -run 'TestSelectTreeDBExpvarStatsFiltersAndCoerces|TestObserveVlogGenerationRewritePlanOutcome_SelectedTracksBytes|TestObserveVlogGenerationRewritePlanOutcome_SelectedTracksSegmentFallbackBytes' -count=1

Copilot AI review requested due to automatic review settings March 28, 2026 03:22
Contributor

Copilot AI left a comment

Pull request overview

This PR improves TreeDB live value-log (vlog) maintenance observability by exposing vlog-generation metrics via expvar and adding additional byte-level counters to make rewrite planning and rewrite+GC progress visible.

Changes:

  • Export treedb.cache.vlog_generation.* metrics through the expvar stat-selection filter.
  • Add rewrite-plan selected-byte counters (total/live/stale) and a cumulative rewrite “reclaimed bytes” counter.
  • Add unit tests covering new expvar exposure and rewrite-plan byte accounting paths.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| TreeDB/caching/db.go | Adds new vlog-generation stats counters (plan-selected bytes + reclaimed bytes) and emits them via Stats(). |
| TreeDB/caching/expvar_stats.go | Extends expvar filtering to include treedb.cache.vlog_generation.*. |
| TreeDB/caching/expvar_stats_test.go | Verifies expvar selection/coercion includes a representative vlog_generation metric. |
| TreeDB/caching/vlog_generation_scheduler_test.go | Adds unit tests for rewrite-plan outcome observation and byte counter tracking (including fallback path). |

Comment on lines 141 to 143
strings.HasPrefix(k, "treedb.cache.vlog_dict.") ||
strings.HasPrefix(k, "treedb.cache.vlog_generation.") ||
strings.HasPrefix(k, "treedb.cache.vlog_payload_kind.") ||

Copilot AI Mar 28, 2026

The selection comment says expvar exports only process-wide metric families plus a few cache/backend families (mmap/decode/batch-arena), but this function now also exports the full treedb.cache.vlog_generation.* family. Please update the comment to reflect the expanded scope so future readers don’t miss that maintenance stats are intentionally exposed via expvar.
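
For readers unfamiliar with the allowlist being discussed, a minimal sketch of prefix-based stat selection follows. It is not the code in TreeDB/caching/expvar_stats.go: the names `exportedPrefixes` and `selectExpvarStats`, the map type, and the sample keys are assumptions for illustration. Only the three prefix strings are taken from the diff excerpt above.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// exportedPrefixes is a hypothetical allowlist. The prefix strings appear in
// the diff excerpt; the slice and its name are illustrative only.
var exportedPrefixes = []string{
	"treedb.cache.vlog_dict.",
	"treedb.cache.vlog_generation.",
	"treedb.cache.vlog_payload_kind.",
}

// selectExpvarStats (hypothetical name) keeps only the keys that start with
// one of the allowlisted prefixes.
func selectExpvarStats(all map[string]string) map[string]string {
	out := make(map[string]string, len(all))
	for k, v := range all {
		for _, p := range exportedPrefixes {
			if strings.HasPrefix(k, p) {
				out[k] = v
				break
			}
		}
	}
	return out
}

func main() {
	stats := map[string]string{
		"treedb.cache.vlog_generation.rewrite.runs": "1",
		"treedb.cache.hypothetical_private.value":   "42", // not allowlisted
	}
	selected := selectExpvarStats(stats)
	keys := make([]string, 0, len(selected))
	for k := range selected {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Println(k, "=", selected[k])
	}
}
```

Keeping the prefixes in one slice would also give the explanatory comment a single place to stay in sync with, which is the concern raised here.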

Owner Author

snissn commented Mar 28, 2026

Follow-up slice pushed in 9ce5339a.

What changed

  • Added live maintenance gate/rejection counters:
    • treedb.cache.vlog_generation.maintenance.{attempts,acquired,collisions}
    • treedb.cache.vlog_generation.maintenance.skip.*
    • treedb.cache.vlog_generation.maintenance.passes.{noop,with_rewrite,with_gc}
  • Added vacuum skip counters:
    • treedb.cache.vlog_generation.vacuum.{skipped_disabled,skipped_rewrite_bytes,skipped_cooldown}
  • Added scheduler tests for the new counters.
  • Added run notes and reproduction in worklog/2026-03-27.md.

run_celestia validation (fast profile)

Command:
LOCAL_GOMAP_DIR=/home/mikers/dev/snissn/gomap-phasehook-active TREEDB_OPEN_PROFILE=fast STOP_AT_LOCAL_HEIGHT=500 FREEZE_REMOTE_HEIGHT_AT_START=1 NO_PROGRESS_WARN_SECONDS=120 NO_PROGRESS_FAIL_SECONDS=1800 HEAP_CAPTURE_RSS_DELTA_KB=1 CAPTURE_HEAP_ON_MAX_RSS=1 ~/run_celestia.sh

Run home:
/home/mikers/.celestia-app-mainnet-treedb-20260327173138

Latest diagnostics snapshot sampled:
sync/diagnostics/pprof-heap-max-rss-7070580k-20260327173541.treedb_vars.json

Key counters:

  • maintenance.attempts=637
  • maintenance.acquired=56
  • maintenance.collisions=581
  • maintenance.passes.noop=53
  • maintenance.passes.with_gc=2
  • maintenance.passes.with_rewrite=0
  • maintenance.skip.quiet_window=26
  • gc.runs=2
  • rewrite.runs=0
  • vacuum.runs=0

Takeaway: in this early state-sync window, rewrite/vacuum did not trigger, and the dominant behavior was high active-pass contention. We now have explicit counters to drive the next tuning step.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9ce5339a4e

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines 13107 to +13110
db.scheduleDueVlogGenerationDeferredMaintenance()
db.schedulePendingVlogGenerationCheckpointKick()
rewriteRan := db.vlogGenerationRewriteRuns.Load() > rewriteRunsBefore
gcRan := db.vlogGenerationGCRuns.Load() > gcRunsBefore

P2 Badge Count pass outcomes before queuing follow-up maintenance

The new pass-outcome counters are computed after scheduleDueVlogGenerationDeferredMaintenance()/schedulePendingVlogGenerationCheckpointKick() run in the defer path, so a just-scheduled retry goroutine can increment vlogGenerationRewriteRuns/vlogGenerationGCRuns first and make this pass look like it performed rewrite/GC when it did not. This appears under contention (fast retry wakeups on another core) and skews the new observability metrics (maintenance.passes.*) by overcounting rewrite/gc and undercounting noop for the current pass.

Copilot AI review requested due to automatic review settings March 28, 2026 03:49
Owner Author

snissn commented Mar 28, 2026

Pushed another follow-up in bf11ec9d to reduce live maintenance collision churn.

Code change

  • runVlogGenerationMaintenanceRetries now coalesces retries while maintenanceActive=true if retry intent is already pending:
    • checkpoint-kick path: when checkpoint_kick.pending=true, wait/backoff instead of re-entering maybeRun... and colliding.
    • deferred path: same behavior when deferred_pending=true.
  • maybeRunPeriodicVlogGenerationMaintenance now exits early while maintenanceActive=true.
  • Added test: TestRunVlogGenerationMaintenanceRetries_CoalescesPendingCollisionRetries.

Validation

  • go test ./TreeDB/caching -count=1

A/B run_celestia comparison

Both runs used:
LOCAL_GOMAP_DIR=/home/mikers/dev/snissn/gomap-phasehook-active TREEDB_OPEN_PROFILE=fast STOP_AT_LOCAL_HEIGHT=500 FREEZE_REMOTE_HEIGHT_AT_START=1 NO_PROGRESS_WARN_SECONDS=120 NO_PROGRESS_FAIL_SECONDS=1800 HEAP_CAPTURE_RSS_DELTA_KB=1 CAPTURE_HEAP_ON_MAX_RSS=1 ~/run_celestia.sh

Baseline run home:
/home/mikers/.celestia-app-mainnet-treedb-20260327173138
(snapshot pprof-heap-max-rss-7070580k-20260327173541.treedb_vars.json)

New run home:
/home/mikers/.celestia-app-mainnet-treedb-20260327174335
(snapshot pprof-heap-max-rss-6605092k-20260327174744.treedb_vars.json)

Key counter deltas:

  • maintenance.attempts: 637 -> 124
  • maintenance.acquired: 56 -> 119
  • maintenance.collisions: 581 -> 5
  • maintenance.passes.with_rewrite: 0 -> 1
  • maintenance.passes.with_gc: 2 -> 1
  • rewrite.runs: 0 -> 1
  • vacuum.runs: 0 -> 1

This looks like a material reduction in scheduler contention, and maintenance made actual rewrite/vacuum progress in the same early-state-sync lab window.

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.


Comment on lines 13136 to +13152
db.vlogGenerationMaintenanceActive.Store(false)
// If a deferred confirmation/age wake became due while this pass held the
// scheduler active, requeue it immediately on exit instead of relying on
// the original retry goroutine to still be alive.
db.scheduleDueVlogGenerationDeferredMaintenance()
db.schedulePendingVlogGenerationCheckpointKick()
rewriteRan := db.vlogGenerationRewriteRuns.Load() > rewriteRunsBefore
gcRan := db.vlogGenerationGCRuns.Load() > gcRunsBefore
if rewriteRan {
	db.vlogGenerationMaintenancePassWithRewrite.Add(1)
}
if gcRan {
	db.vlogGenerationMaintenancePassWithGC.Add(1)
}
if !rewriteRan && !gcRan {
	db.vlogGenerationMaintenancePassNoop.Add(1)
}

Copilot AI Mar 28, 2026

The maintenance pass outcome counters (passes.noop/with_rewrite/with_gc) can be mis-attributed due to the defer order: maintenanceActive is cleared and new maintenance may be scheduled before computing rewriteRan/gcRan from the global counters. If a deferred/checkpoint-kick pass starts quickly after maintenanceActive.Store(false), rewriteRan/gcRan may reflect the next pass instead of the one being released. Consider computing/incrementing the pass outcome counters before clearing vlogGenerationMaintenanceActive and before scheduling any follow-up maintenance (or track rewrite/gc execution via local booleans set during this pass).
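
The "local booleans" alternative mentioned above could look roughly like the sketch below. It is an illustration under assumptions, not the PR's code: `scheduler`, `passResult`, and `runPass` are hypothetical names, and the real pass does far more work. The point is only the ordering: the outcome is attributed from pass-local state before the active flag is cleared and before any follow-up is scheduled.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// scheduler is a hypothetical stand-in for the maintenance scheduler state.
type scheduler struct {
	active          atomic.Bool
	passNoop        atomic.Uint64
	passWithRewrite atomic.Uint64
	passWithGC      atomic.Uint64
}

// passResult is written only by the pass that owns it, so a concurrent or
// follow-up pass cannot change what this pass reports.
type passResult struct {
	rewriteRan bool
	gcRan      bool
}

// runPass runs work if no other pass is active and reports whether it ran.
func (s *scheduler) runPass(work func(*passResult)) bool {
	if !s.active.CompareAndSwap(false, true) {
		return false // collision: another pass holds the gate
	}
	var res passResult
	defer func() {
		// Attribute the outcome while this pass still holds the gate.
		if res.rewriteRan {
			s.passWithRewrite.Add(1)
		}
		if res.gcRan {
			s.passWithGC.Add(1)
		}
		if !res.rewriteRan && !res.gcRan {
			s.passNoop.Add(1)
		}
		s.active.Store(false)
		// Follow-up scheduling would go here, after the counters are final.
	}()
	work(&res)
	return true
}

func main() {
	s := &scheduler{}
	s.runPass(func(r *passResult) { r.gcRan = true }) // a pass that ran GC
	s.runPass(func(r *passResult) {})                 // a no-op pass
	fmt.Println("noop:", s.passNoop.Load(),
		"with_rewrite:", s.passWithRewrite.Load(),
		"with_gc:", s.passWithGC.Load())
}
```

Compared with diffing the global run counters, this avoids any dependence on what other goroutines increment between release and accounting.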

Owner Author

snissn commented Mar 28, 2026

Pushed c83e2d61 with one more low-risk scheduler optimization.

Code change

  • Added periodic preflight in maybeRunPeriodicVlogGenerationMaintenance:
    • when runGC=false and foreground is hot, periodic loop now skips entering maintenance unless a deferred/checkpoint wake is pending.
    • intent: reduce high-frequency no-op maintenance acquisitions.
  • Added test: TestVlogGenerationMaintenance_PeriodicPreflightSkipsHotNoPending.

Validation

  • go test ./TreeDB/caching -count=1

Third comparison run (same lab command)

Run home:
/home/mikers/.celestia-app-mainnet-treedb-20260327175340

Snapshot:
pprof-heap-max-rss-6560568k-20260327175747.treedb_vars.json

Counters:

  • maintenance.attempts=46
  • maintenance.acquired=41
  • maintenance.collisions=5
  • maintenance.passes.noop=38
  • maintenance.passes.with_rewrite=1
  • maintenance.passes.with_gc=1
  • maintenance.skip.quiet_window=0
  • rewrite.runs=1
  • vacuum.runs=1

Trend across runs:

  • baseline: attempts=637, collisions=581, rewrite=0, vacuum=0
  • retry-coalesced: attempts=124, collisions=5, rewrite=1, vacuum=1
  • preflight: attempts=46, collisions=5, rewrite=1, vacuum=1

So the latest change primarily removes maintenance churn/noops while preserving rewrite/vacuum progress.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c83e2d6164

Comment on lines +12205 to +12208
!db.vlogGenerationCheckpointKickPending.Load() &&
!db.vlogGenerationDeferredMaintenancePending.Load() &&
!db.vlogGenerationDeferredMaintenanceDue(now) {
return false

P2 Badge Surface deferred-due check errors before preflight skip

This new hot-foreground preflight can return early without entering maybeRunVlogGenerationMaintenance, but the !db.vlogGenerationDeferredMaintenanceDue(now) term internally treats rewrite-stage read errors as a plain false result. In sustained non-quiet workloads with no pending flags, backend errors from the deferred-due probe are now silently ignored, so scheduler error state and notifyError are never triggered until the node eventually becomes quiet. That regresses operational visibility compared with the previous path, which always entered maintenance and reported queue/stage load failures immediately.
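
One possible shape for a preflight that does not swallow probe errors is sketched below. Everything here is assumed for illustration: `shouldSkipPeriodicPass`, its parameters, and `errStageRead` are hypothetical, and the real deferred-due check may not be able to return an error without further refactoring. The sketch only shows the idea of returning the error instead of folding it into `false`.

```go
package main

import (
	"errors"
	"fmt"
)

// errStageRead is a hypothetical error standing in for a rewrite-stage read failure.
var errStageRead = errors.New("rewrite-stage read failed")

// shouldSkipPeriodicPass (hypothetical) reports whether the periodic loop may
// skip entering maintenance. A probe error is returned to the caller and
// never treated as "not due".
func shouldSkipPeriodicPass(
	foregroundHot, runGC, kickPending, deferredPending bool,
	deferredDue func() (bool, error),
) (skip bool, err error) {
	if !foregroundHot || runGC || kickPending || deferredPending {
		return false, nil // enter maintenance as before
	}
	due, err := deferredDue()
	if err != nil {
		// Do not skip: let the caller record scheduler error state.
		return false, err
	}
	return !due, nil
}

func main() {
	skip, err := shouldSkipPeriodicPass(true, false, false, false,
		func() (bool, error) { return false, errStageRead })
	fmt.Println("skip:", skip, "err:", err)
}
```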

Copilot AI review requested due to automatic review settings March 28, 2026 04:17
Owner Author

snissn commented Mar 28, 2026

Instrumentation-first update pushed to pr/live-vacuum-observability (31fbb0a5).

What this adds

  • Maintenance/rewrite/GC/vacuum timing counters exposed in Stats():
    • treedb.cache.vlog_generation.maintenance.pass.{total,max,avg}_ms
    • treedb.cache.vlog_generation.rewrite.plan.{total,max,avg}_ms
    • treedb.cache.vlog_generation.rewrite.exec.{total,max,avg}_ms
    • treedb.cache.vlog_generation.gc.exec.{total,max,avg}_ms
    • treedb.cache.vlog_generation.vacuum.exec.{total,max,avg}_ms
  • Rewrite debt/backlog observability:
    • treedb.cache.vlog_generation.rewrite.ledger_segments
    • treedb.cache.vlog_generation.rewrite.ledger_bytes_{total,live,stale}
    • treedb.cache.vlog_generation.rewrite.ledger_stale_ratio_ppm
    • treedb.cache.vlog_generation.rewrite.stage_{pending,observed_unix_nano}
    • treedb.cache.vlog_generation.rewrite.penalties_active
    • treedb.cache.vlog_generation.rewrite.age_blocked_{until_unix_nano,remaining_ms}
  • Rewrite budget runtime visibility:
    • treedb.cache.vlog_generation.rewrite_budget.tokens_{bytes,cap_bytes}
    • treedb.cache.vlog_generation.rewrite_budget.tokens_utilization_pct
    • treedb.cache.vlog_generation.rewrite_budget.consumed_bytes_total
  • Token-consume path now tracks consumed bytes in vlogGenerationConsumeRewriteBudgetBytes.
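
As a reading aid for the `ledger_stale_ratio_ppm` gauge listed above, here is a minimal sketch of a parts-per-million ratio over the stale and total ledger bytes. The formula (stale / total × 1,000,000, clamped to [0, 1,000,000]) and the function name `staleRatioPPM` are assumptions; the PR excerpt names the metric but does not show how it is computed.

```go
package main

import (
	"fmt"
	"math/bits"
)

// staleRatioPPM (hypothetical) returns stale/total in parts per million.
// It uses 128-bit intermediate math so large byte counts cannot overflow.
func staleRatioPPM(staleBytes, totalBytes int64) uint32 {
	if totalBytes <= 0 || staleBytes <= 0 {
		return 0
	}
	if staleBytes >= totalBytes {
		return 1_000_000
	}
	// stale < total here, so the quotient is below 1,000,000 and the
	// high word is below the divisor, which bits.Div64 requires.
	hi, lo := bits.Mul64(uint64(staleBytes), 1_000_000)
	q, _ := bits.Div64(hi, lo, uint64(totalBytes))
	return uint32(q)
}

func main() {
	// Illustrative values only: 250 stale bytes out of 1000 total.
	fmt.Println(staleRatioPPM(250, 1000))
}
```

An integer ppm value keeps the gauge exact and avoids float formatting in the stats map, which is presumably why the metric is expressed this way.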

Test coverage

  • Added TestVlogGenerationStats_ReportRewriteBacklogAndDurations in TreeDB/caching/vlog_generation_scheduler_test.go.
  • Validated with:
    • go test ./TreeDB/caching -run TestVlogGenerationStats_ReportRewriteBacklogAndDurations -count=1
    • go test ./TreeDB/caching -count=1

Next step: run run_celestia again and pull these new stats from .treedb_vars.json to quantify where we are bottlenecked (budget-limited, age-blocked, or collision/phase-gated) before changing rewrite behavior.

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.


Comment on lines 139 to 143
strings.HasPrefix(k, "treedb.cache.vlog_payload_split.") ||
strings.HasPrefix(k, "treedb.cache.vlog_auto.") ||
strings.HasPrefix(k, "treedb.cache.vlog_dict.") ||
strings.HasPrefix(k, "treedb.cache.vlog_generation.") ||
strings.HasPrefix(k, "treedb.cache.vlog_payload_kind.") ||

Copilot AI Mar 28, 2026

The inline comment above this prefix list says only certain cache/backend families are exported via expvar, but treedb.cache.vlog_generation.* is now included. Please update the comment to mention vlog_generation (or reword it so it stays accurate as this allowlist evolves).

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 31fbb0a53e

stats, err := rewriter.ValueLogRewriteOnline(ctx, rewriteOpts)
cancel()
rewriteDur := time.Since(rewriteStart)
db.observeVlogGenerationRewriteExecDuration(rewriteDur)

P2 Badge Count rewrite exec duration only for successful runs

observeVlogGenerationRewriteExecDuration is called before the rewrite error check, so canceled/failed rewrites contribute to treedb.cache.vlog_generation.rewrite.exec.{total,max}_ms, but rewrite.runs (the denominator for rewrite.exec.avg_ms in Stats()) is incremented only on success. Under repeated failures this yields misleading observability (for example non-zero total/max with avg_ms=0.000, or inflated averages), which can skew maintenance tuning and experiment interpretation.
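
A minimal sketch of a timer whose total, max, and average share one denominator is shown below. It is illustrative only: `execTimer`, `observe`, `avgMs`, `timedRun`, and `errFailed` are hypothetical names, and the PR's own counters are separate fields on the DB struct. Here a duration is recorded only when the operation succeeds, so the run count and the totals always move together.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
	"time"
)

// errFailed is a hypothetical failure used by the example.
var errFailed = errors.New("operation failed")

// execTimer (hypothetical) keeps run count, total, and max in step.
type execTimer struct {
	runs       atomic.Int64
	totalNanos atomic.Int64
	maxNanos   atomic.Int64
}

// observe records one successful run of duration d.
func (t *execTimer) observe(d time.Duration) {
	t.runs.Add(1)
	t.totalNanos.Add(int64(d))
	for { // lock-free max update
		cur := t.maxNanos.Load()
		if int64(d) <= cur || t.maxNanos.CompareAndSwap(cur, int64(d)) {
			return
		}
	}
}

// avgMs returns the mean duration in milliseconds, or 0 with no runs.
func (t *execTimer) avgMs() float64 {
	n := t.runs.Load()
	if n == 0 {
		return 0
	}
	return float64(t.totalNanos.Load()) / float64(n) / 1e6
}

// timedRun records the duration only if op succeeds.
func timedRun(t *execTimer, op func() error) error {
	start := time.Now()
	if err := op(); err != nil {
		return err
	}
	t.observe(time.Since(start))
	return nil
}

func main() {
	var t execTimer
	t.observe(10 * time.Millisecond)
	t.observe(30 * time.Millisecond)
	_ = timedRun(&t, func() error { return errFailed }) // not recorded
	fmt.Printf("runs=%d avg_ms=%.3f\n", t.runs.Load(), t.avgMs())
}
```

If failure latency is also worth seeing, a second timer dedicated to failed attempts would keep both averages meaningful without mixing denominators.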

Owner Author

snissn commented Mar 28, 2026

Follow-up pushed: 7cc50d6d (treedb: stop maintenance retry collision amplification).

Why

In a longer run_celestia window (STOP_AT_LOCAL_HEIGHT=2000), application-db metrics showed retry amplification under load:

  • maintenance.attempts=333, acquired=29, collisions=304
  • rewrite.plan_runs=3 but plan_selected=0, rewrite.runs=0
  • checkpoint_kick.pending=true persisted while attempts/collisions climbed

This indicated retry loops were still attempting while maintenanceActive=true in some pending-state transitions.

Change

  • runVlogGenerationMaintenanceRetries now always backs off while maintenanceActive=true (until release/deadline), rather than conditionally retrying based on pending flags.
  • Added regression coverage in TestRunVlogGenerationMaintenanceRetries_CoalescesPendingCollisionRetries for checkpoint path with checkpoint_pending=false + maintenanceActive=true.

Validation

  • go test ./TreeDB/caching -run TestRunVlogGenerationMaintenanceRetries_CoalescesPendingCollisionRetries -count=1
  • go test ./TreeDB/caching -count=1

A/B run_celestia outcome (application.db instance snapshots)

Pre-fix bad snapshot (20260327183236) vs post-fix (20260327184427):

  • maintenance.attempts: 333 -> 38
  • maintenance.acquired: 29 -> 38
  • maintenance.collisions: 304 -> 0
  • rewrite.plan_selected: 0 -> 2
  • rewrite.runs: 0 -> 1
  • vacuum.runs: 0 -> 1
  • rewrite_budget.consumed_bytes_total: 0 -> 33073906

Also logged in worklog/2026-03-27.md with run homes and command context.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7cc50d6de3

Comment on lines +13053 to +13055
if db.vlogGenerationMaintenanceActive.Load() {
if time.Now().After(deadline) {
return

P1 Badge Preserve checkpoint retry intent while pass is active

When runVlogGenerationMaintenanceRetries is used for checkpoint kicks (stopWhenAcquired=false) and another maintenance pass remains active longer than vlogGenerationCheckpointKickRetryWindow (5s), this new branch only sleeps and then returns without ever invoking maybeRunVlogGenerationMaintenanceWithOptions. That means no collision path runs to re-set vlogGenerationCheckpointKickPending, so the active pass’s deferred schedulePendingVlogGenerationCheckpointKick() sees no pending work and the checkpoint-triggered follow-up maintenance can be dropped entirely.

Comment on lines 14113 to 14116
gcStats, err := gcer.ValueLogGC(ctx, gcOpts)
cancel()
db.observeVlogGenerationGCExecDuration(time.Since(gcStart))
if err != nil {

P2 Badge Record GC exec duration only for successful runs

observeVlogGenerationGCExecDuration is called before checking err, but vlogGenerationGCRuns (used as the denominator for gc.exec.avg_ms in Stats()) is incremented only on success. Failed or canceled GC calls therefore increase gc.exec.total_ms/max_ms without increasing gc.runs, producing inconsistent and inflated observability during error periods.

Comment on lines +14321 to 14323
db.observeVlogGenerationVacuumExecDuration(time.Since(vacuumStart))
if err != nil {
db.vlogGenerationVacuumFailures.Add(1)

P2 Badge Record vacuum exec duration only for successful runs

The vacuum execution duration is observed before the error check, while vlogGenerationVacuumRuns (the denominator for vacuum.exec.avg_ms) is incremented only after success. If vacuum fails, vacuum.exec.total_ms/max_ms still grow but vacuum.runs does not, so the new average metrics can report misleading values (including non-zero totals with zero average).

Copilot AI review requested due to automatic review settings March 28, 2026 05:30
Owner Author

snissn commented Mar 28, 2026

Follow-up update pushed: 00355572 (treedb: add stage-gate and rewrite segment counters).

What was added:

  • New maintenance split counters:
    • treedb.cache.vlog_generation.maintenance.skip.stage_gate_not_due
    • treedb.cache.vlog_generation.maintenance.skip.stage_gate_due_reserved
  • New rewrite selection/execution counters:
    • treedb.cache.vlog_generation.rewrite.plan_selected_segments_total
    • treedb.cache.vlog_generation.rewrite.exec.source_segments_total
  • Test coverage extended in TestVlogGenerationStats_ReportRewriteBacklogAndDurations.

Validation:

  • go test ./TreeDB/caching -count=1

Live run (baseline fast profile, no trigger override):

  • Command:
    • LOCAL_GOMAP_DIR=/home/mikers/dev/snissn/gomap-phasehook-active TREEDB_OPEN_PROFILE=fast FREEZE_REMOTE_HEIGHT_AT_START=1 STOP_AT_LOCAL_HEIGHT=2000 ~/run_celestia.sh
  • Home:
    • /home/mikers/.celestia-app-mainnet-treedb-20260327192126
  • Final snapshot:
    • pprof-heap-max-rss-final-7983364k-20260327192636.treedb_vars.json
  • Key counters:
    • maintenance.attempts=38 acquired=38 collisions=0
    • rewrite.plan_runs=2 plan_selected=2
    • rewrite.plan_selected_segments_total=3
    • rewrite.runs=1 rewrite.exec.source_segments_total=1
    • rewrite.bytes_in=33073442 rewrite.reclaimed_bytes=0
    • maintenance.skip.stage_gate=7
    • maintenance.skip.stage_gate_not_due=7
    • maintenance.skip.stage_gate_due_reserved=0

Interpretation:

  • Planner selected more rewrite debt than was executed in-run (3 selected vs 1 executed).
  • Stage gating was entirely the "not due yet" confirmation wait path in this run.

Offline headroom on same home:

  • pre: du -sb application.db = 4707839386
  • treemap vlog-rewrite -rw:
    • segments_before=20 segments_after=16 bytes_before=4637168004 bytes_after=2039183405 records=964467
  • post: du -sb application.db = 2077350273

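So live maintenance still reclaims far less than an offline rewrite can: the offline pass shrank application.db from 4707839386 to 2077350273 bytes (about 56% smaller), and these counters now make that bottleneck explicit.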

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.


Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 15 out of 16 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

TreeDB/db/vlog_gc.go:270

  • ValueLogGC() returns early on opts.DryRun before computing SegmentsPending/BytesPending (and the observed-source pending fields). If these fields are intended to represent "eligible - deleted" backlog, then in DryRun (where deleted=0) they likely should be populated as pending=eligible, or the struct/docs should explicitly state that *Pending is only set for non-dry-run GC runs.
		if opts.DryRun {
			continue
		}
		if err := vm.MarkZombie(id); err != nil {
			return stats, err
		}
		candidates[id] = candidate{path: f.Path, size: size, observed: observed}
	}

	if opts.DryRun {
		if set != nil {
			_ = vm.Release(set)
		}
		db.persistValueLogRefTrackerBestEffort()
		return stats, nil

Comment on lines +838 to +861
if v, ok := envInt(envVlogMaxRetainedBytes); ok {
	opts.ValueLog.MaxRetainedBytes = int64(v)
}
if v, ok := envInt(envVlogMaxRetainedBytesHard); ok {
	opts.ValueLog.MaxRetainedBytesHard = int64(v)
}
if v, ok := envInt(envVlogRewriteBudgetBytesPerSec); ok {
	opts.ValueLog.Generational.RewriteBudgetBytesPerSec = int64(v)
}
if v, ok := envInt(envVlogRewriteBudgetRecordsPerSec); ok {
	opts.ValueLog.Generational.RewriteBudgetRecordsPerSec = v
}
if v, ok := envInt(envVlogRewriteTriggerTotalBytes); ok {
	opts.ValueLog.Generational.RewriteTriggerTotalBytes = int64(v)
}
if v, ok := envInt(envVlogRewriteTriggerStaleRatioPPM); ok {
	if v < 0 {
		v = 0
	}
	opts.ValueLog.Generational.RewriteTriggerStaleRatioPPM = uint32(v)
}
if v, ok := envInt(envVlogRewriteTriggerChurnPerSec); ok {
	opts.ValueLog.Generational.RewriteTriggerChurnPerSec = int64(v)
}

Copilot AI Mar 28, 2026

applyEnvMaintenanceOverrides() parses several env vars documented as int64 (e.g. TREEDB_VLOG_MAX_RETAINED_BYTES*, TREEDB_VLOG_REWRITE_*_BYTES_PER_SEC) via envInt(), which uses strconv.Atoi into an int. This can overflow / reject valid large int64 values (and is inconsistent with the int64 doc comments). Consider adding an envInt64 helper (strconv.ParseInt) and using it for int64-backed options (and keeping envInt for true int fields like *_RECORDS_PER_SEC).
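
A minimal sketch of the suggested `envInt64` helper follows. The helper name comes from the review comment above; its exact behavior here (trimming whitespace, treating empty or malformed values as unset) and the split into `parseInt64` are assumptions, since the existing `envInt` is not shown. The overflow concern applies on platforms where `int` is 32 bits, because `strconv.Atoi` returns an `int`.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseInt64 (hypothetical) parses a base-10 int64 and reports success.
func parseInt64(raw string) (int64, bool) {
	raw = strings.TrimSpace(raw)
	if raw == "" {
		return 0, false
	}
	v, err := strconv.ParseInt(raw, 10, 64)
	if err != nil {
		return 0, false
	}
	return v, true
}

// envInt64 reads an int64 from the environment. Unset, empty, or malformed
// values report ok=false (assumed to match how envInt treats bad input).
func envInt64(name string) (int64, bool) {
	raw, ok := os.LookupEnv(name)
	if !ok {
		return 0, false
	}
	return parseInt64(raw)
}

func main() {
	// 8 GiB does not fit in a 32-bit int.
	_ = os.Setenv("TREEDB_VLOG_MAX_RETAINED_BYTES", "8589934592")
	if v, ok := envInt64("TREEDB_VLOG_MAX_RETAINED_BYTES"); ok {
		fmt.Println("max retained bytes:", v)
	}
}
```

With such a helper, the `int64(v)` conversions at the call sites for int64-backed options would no longer be needed, while `envInt` could stay in place for true `int` fields.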

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 431d323f84

return nil
}
}
now := time.Now()
db.vlogGenerationLastGCUnixNano.Store(now.UnixNano())
ctx, cancel := db.foregroundMaintenanceContext(30 * time.Second)
gcOpts := backenddb.ValueLogGCOptions{ProtectedPaths: db.valueLogProtectedPaths()}
ctx, cancel := db.vlogGenerationMaintenanceContext(30*time.Second, opts)

P2 Badge Preserve close cancellation in bypass-quiet GC maintenance

This GC call now uses vlogGenerationMaintenanceContext(30*time.Second, opts), and in checkpoint-kick/deferred paths (opts.bypassQuiet == true) that helper returns a timeout-only context that is not canceled by db.closeCh. If Close() happens while ValueLogGC is blocked, the maintenance goroutine can keep running until timeout/backend return, which can delay shutdown by up to 30s; this is a regression from the previous close-aware foregroundMaintenanceContext behavior on this path.
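
A context that keeps the timeout but still reacts to shutdown could be built along the lines below. This is a sketch under assumptions, not the PR's helper: `closeAwareContext` is a hypothetical name, and `closeCh` stands in for the `db.closeCh` mentioned above. Whether the bypass-quiet path should also ignore other foreground signals is left open here.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// closeAwareContext (hypothetical) returns a context that ends on timeout,
// on cancel, or when closeCh is closed, whichever happens first.
func closeAwareContext(closeCh <-chan struct{}, timeout time.Duration) (context.Context, context.CancelFunc) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	go func() {
		select {
		case <-closeCh:
			cancel() // shutdown requested: stop the maintenance call early
		case <-ctx.Done():
			// timeout or caller cancel: nothing left to watch
		}
	}()
	return ctx, cancel
}

func main() {
	closeCh := make(chan struct{})
	ctx, cancel := closeAwareContext(closeCh, 30*time.Second)
	defer cancel()

	close(closeCh) // simulate Close() while GC would be running
	<-ctx.Done()
	fmt.Println("context ended:", ctx.Err())
}
```

The watcher goroutine exits as soon as the context is done, so calling `cancel()` after the GC call returns is enough to avoid leaking it.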

Copilot AI review requested due to automatic review settings March 28, 2026 15:55
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 15 out of 16 changed files in this pull request and generated 2 comments.


Comment on lines +25 to +27
if value is None or math.isnan(value):
    return "n/a"
n = float(value)

Copilot AI Mar 28, 2026

human_bytes calls math.isnan(value) before coercing to float. Many call sites pass ints (e.g., from counters), and math.isnan(int) raises TypeError, causing the script to crash. Coerce to float first (or guard with isinstance(value, float) / try-except) before calling math.isnan.

Suggested change
-    if value is None or math.isnan(value):
-        return "n/a"
-    n = float(value)
+    if value is None:
+        return "n/a"
+    try:
+        n = float(value)
+    except (TypeError, ValueError):
+        return "n/a"
+    if math.isnan(n):
+        return "n/a"

Comment on lines 4451 to +4468
	removed = true
	if size > 0 {
		out.RemovedSegments++
		out.RemovedBytes += size
	}
	if candidate.observed {
		out.ObservedSourceRemovedSegments++
		if size > 0 {
			out.ObservedSourceRemovedBytes += size
		}
	}
	continue
}
if db.cleanupMissingRetainedValueLog(path) {
	if size > 0 {
		out.RemovedSegments++
		out.RemovedBytes += size
	}

Copilot AI Mar 28, 2026

Same issue as earlier: when cleaning up orphaned/missing retained logs, RemovedSegments/RemovedBytes are only updated when size > 0, which can undercount removals if the size wasn't tracked. Increment the segment counter regardless of size knowledge.
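
The accounting change requested here can be sketched as below. `gcOutcome` and `recordRemoval` are hypothetical names and the field types are assumed; only the field names `RemovedSegments` and `RemovedBytes` come from the diff excerpt. The segment count goes up for every removal, and bytes are added only when the size is known.

```go
package main

import "fmt"

// gcOutcome is a hypothetical stand-in for the GC stats struct.
type gcOutcome struct {
	RemovedSegments int
	RemovedBytes    int64
}

// recordRemoval counts the segment unconditionally and adds bytes only when
// a size was tracked for it.
func (o *gcOutcome) recordRemoval(size int64) {
	o.RemovedSegments++
	if size > 0 {
		o.RemovedBytes += size
	}
}

func main() {
	var out gcOutcome
	out.recordRemoval(4096) // size known
	out.recordRemoval(0)    // size not tracked: still one removed segment
	fmt.Printf("segments=%d bytes=%d\n", out.RemovedSegments, out.RemovedBytes)
}
```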

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e7dd2a3448

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

gcOpts := db.valueLogGCOptions(false)
if forceObservedSourceGC {
    gcOpts.ObservedSourceFileIDs = append([]uint32(nil), observedSourceGCIDs...)
    db.vlogGenerationObservedGCRuns.Add(1)

P2 Badge Increment observed GC run counter only after successful GC

The new observed-source GC run metric is incremented before ValueLogGC returns, so failed or canceled GC attempts are counted as completed runs. This happens when forceObservedSourceGC is true and gcer.ValueLogGC returns an error; the code re-queues IDs on error, but treedb.cache.vlog_generation.observed_gc.runs has already been increased, which makes run-count telemetry inaccurate during failure periods.
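
A minimal sketch of the ordering this comment asks for, assuming a GC callback that returns an error. The `observedGC` type, its `attempts` counter, the `gcFunc` signature, and `errCanceled` are hypothetical names for illustration; only the idea of bumping the run counter after `ValueLogGC` succeeds comes from the review.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// errCanceled is a hypothetical error used to simulate a failed GC attempt.
var errCanceled = errors.New("gc canceled")

// gcFunc stands in for gcer.ValueLogGC; the real signature is not shown in the review.
type gcFunc func(ids []uint32) error

// observedGC is a hypothetical holder for the counters discussed above.
type observedGC struct {
	attempts atomic.Uint64 // hypothetical: every forced observed-source GC attempt
	runs     atomic.Uint64 // completed runs only
}

// run counts the attempt up front, but counts a run only after gc succeeds.
func (o *observedGC) run(gc gcFunc, ids []uint32) error {
	o.attempts.Add(1)
	if err := gc(ids); err != nil {
		return err // the caller re-queues ids; runs is left untouched
	}
	o.runs.Add(1)
	return nil
}

func main() {
	var o observedGC
	_ = o.run(func([]uint32) error { return errCanceled }, []uint32{7})
	_ = o.run(func([]uint32) error { return nil }, []uint32{7})
	fmt.Println(o.attempts.Load(), o.runs.Load()) // prints "2 1"
}
```

Keeping a separate attempts counter is one option if failure-period visibility is still wanted; it is not part of the PR as described.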

Copilot AI review requested due to automatic review settings March 28, 2026 17:04
Copilot AI left a comment

Pull request overview

Copilot reviewed 17 out of 18 changed files in this pull request and generated 1 comment.


Comment on lines +1303 to +1307
if f.Path != "" {
    if info, err := os.Stat(f.Path); err == nil && info.Size() > 0 {
        return uint64(info.Size())
    }
}
Copilot AI Mar 28, 2026

ZombieStats is called from Stats/expvar paths and currently may call os.Stat(f.Path) via valueLogFileSizeBestEffort while holding m.mu.RLock(). Doing filesystem I/O under the manager lock can introduce latency and block writers/maintenance. Consider avoiding os.Stat here (e.g., rely on cached fileSize/mmap length only) or snapshot zombie file paths/sizes under the lock and perform any best-effort stat calls after unlocking (optionally caching results back into fileSize).

Suggested change
-if f.Path != "" {
-    if info, err := os.Stat(f.Path); err == nil && info.Size() > 0 {
-        return uint64(info.Size())
-    }
-}

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e7ef33865a

Comment on lines +4409 to +4411
if size > 0 {
    out.RemovedSegments++
    out.RemovedBytes += size

P2 Badge Count removed prune segments even when size is unknown

This branch only increments RemovedSegments when size > 0, but valueLogClosedSegmentSize returns 0 when a retained path has no tracked closed-size entry (for example after reopen or metadata loss). In that case the segment is still pruned by cleanupMissingRetainedValueLog, yet treedb.cache.vlog_retained_prune.removed_segments is not incremented, and observed-source removed counts can exceed global removed counts. This introduces inconsistent retained-prune telemetry for operators and automation.
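
One way to keep the segment count consistent is to separate the two updates, sketched below. The `pruneStats` type and `recordRemoval` helper are hypothetical, and the integer types are assumptions; only the field names `RemovedSegments` and `RemovedBytes` come from the diff.

```go
package main

import "fmt"

// pruneStats mirrors only the two counters named in the review; types are assumed.
type pruneStats struct {
	RemovedSegments int64
	RemovedBytes    int64
}

// recordRemoval (hypothetical helper) always counts the segment and adds bytes
// only when a size is known, so an untracked size cannot hide a removal.
func (s *pruneStats) recordRemoval(size int64) {
	s.RemovedSegments++
	if size > 0 {
		s.RemovedBytes += size
	}
}

func main() {
	var s pruneStats
	s.recordRemoval(4096) // size tracked
	s.recordRemoval(0)    // size unknown, e.g. after reopen
	fmt.Println(s.RemovedSegments, s.RemovedBytes) // prints "2 4096"
}
```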

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a9e6fc3a9a

Comment on lines +323 to +324
passes_total = m["maintenance_noop"] + m["maintenance_with_rewrite"] + m["maintenance_with_gc"]
m["maintenance_passes_total"] = passes_total

P2 Badge Derive maintenance pass totals from mutually exclusive counters

maintenance_with_rewrite and maintenance_with_gc are not mutually exclusive (a single pass can increment both in maybeRunVlogGenerationMaintenanceWithOptions), so summing noop + with_rewrite + with_gc overcounts total passes whenever rewrite and GC run together. This inflates maintenance_passes_total and skews the rewrite/gc share percentages used by this analyzer, which can mislead capacity interpretation during mixed rewrite+gc workloads.
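
The overcount can be illustrated with inclusion-exclusion. The sketch below is in Go rather than the analyzer's Python, and it assumes a counter for passes that ran both rewrite and GC; `WithRewriteAndGC` is a hypothetical field, since the review does not say whether such a counter exists.

```go
package main

import "fmt"

// passCounters holds per-pass counters; WithRewriteAndGC is hypothetical.
type passCounters struct {
	Noop             uint64
	WithRewrite      uint64
	WithGC           uint64
	WithRewriteAndGC uint64 // passes counted in both WithRewrite and WithGC
}

// naiveTotal reproduces the sum flagged in the review.
func (c passCounters) naiveTotal() uint64 {
	return c.Noop + c.WithRewrite + c.WithGC
}

// total subtracts the overlap so a mixed pass is counted once.
func (c passCounters) total() uint64 {
	return c.Noop + c.WithRewrite + c.WithGC - c.WithRewriteAndGC
}

func main() {
	c := passCounters{Noop: 5, WithRewrite: 3, WithGC: 4, WithRewriteAndGC: 2}
	fmt.Println(c.naiveTotal(), c.total()) // prints "12 10"
}
```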

Comment on lines +320 to +322
m["maintenance_skip"] = skip_map
m["maintenance_skip_total"] = sum(skip_map.values())


P2 Badge Exclude umbrella stage_gate from skip-total aggregation

maintenance_skip_total is computed as a raw sum of all skip counters, but stage_gate is an umbrella counter that is incremented together with stage_gate_not_due or stage_gate_due_reserved for the same skip event. This double-counts stage-gate skips and overstates total skip pressure, which can lead operators to overestimate scheduler contention in diagnostics.
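
A sketch of the adjusted aggregation, written in Go rather than the analyzer's Python. It assumes every `stage_gate` increment is paired with exactly one of the two sub-reasons named above, so dropping the umbrella key is enough; `skipTotal` and the `other_reason` key are hypothetical.

```go
package main

import "fmt"

// skipTotal sums skip counters while leaving out the umbrella stage_gate key,
// whose events are already represented by its sub-reason counters.
func skipTotal(skips map[string]uint64) uint64 {
	var total uint64
	for reason, n := range skips {
		if reason == "stage_gate" {
			continue
		}
		total += n
	}
	return total
}

func main() {
	skips := map[string]uint64{
		"stage_gate":              5,
		"stage_gate_not_due":      3,
		"stage_gate_due_reserved": 2,
		"other_reason":            4, // hypothetical non-stage-gate skip
	}
	fmt.Println(skipTotal(skips)) // prints "9"; the raw sum would be 14
}
```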

Copilot AI review requested due to automatic review settings March 28, 2026 18:52
Copilot AI left a comment

Pull request overview

Copilot reviewed 17 out of 18 changed files in this pull request and generated 3 comments.


Comment on lines +1317 to +1333
m.mu.RLock()
for _, f := range m.files {
    if f == nil || !f.IsZombie.Load() {
        continue
    }
    segments++
    size := valueLogFileSizeBestEffort(f)
    bytes += size
    if f.RefCount.Load() > 0 {
        pinnedSegments++
        pinnedBytes += size
        continue
    }
    unpinnedSegments++
    unpinnedBytes += size
}
m.mu.RUnlock()
Copilot AI Mar 28, 2026

ZombieStats() holds m.mu.RLock() while calling valueLogFileSizeBestEffort(), which may perform os.Stat(f.Path). Doing filesystem I/O under the manager lock can block writers/maintenance and increase tail latency. Consider snapshotting the zombie *File pointers (or their Path + cached size fields) under the lock, then compute sizes (including any os.Stat) after releasing the lock.
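
The snapshot-then-stat pattern suggested here might look like the sketch below. The `manager`, `file`, and `zombieSnapshot` types and the `zombieBytes` method are simplified stand-ins invented for illustration; the real `Manager` and `File` types have more fields (such as the atomic `IsZombie` and `RefCount`) that are omitted.

```go
package main

import (
	"fmt"
	"os"
	"sync"
)

// file is a simplified stand-in for the value-log file type.
type file struct {
	path       string
	cachedSize uint64 // 0 means the size was never recorded
	zombie     bool
}

type manager struct {
	mu    sync.RWMutex
	files map[uint32]*file
}

// zombieSnapshot carries only what is needed after the lock is released.
type zombieSnapshot struct {
	path string
	size uint64
}

// zombieBytes copies path and cached size under the read lock, then performs
// any best-effort os.Stat calls after unlocking so I/O never holds the lock.
func (m *manager) zombieBytes() (segments int, total uint64) {
	m.mu.RLock()
	snap := make([]zombieSnapshot, 0, len(m.files))
	for _, f := range m.files {
		if f == nil || !f.zombie {
			continue
		}
		snap = append(snap, zombieSnapshot{path: f.path, size: f.cachedSize})
	}
	m.mu.RUnlock()

	for _, z := range snap {
		size := z.size
		if size == 0 && z.path != "" {
			if info, err := os.Stat(z.path); err == nil && info.Size() > 0 {
				size = uint64(info.Size())
			}
		}
		segments++
		total += size
	}
	return segments, total
}

func main() {
	m := &manager{files: map[uint32]*file{
		1: {cachedSize: 100, zombie: true},
		2: {zombie: true},   // size unknown and no path to stat
		3: {cachedSize: 50}, // live segment, ignored
	}}
	segments, total := m.zombieBytes()
	fmt.Println(segments, total) // prints "2 100"
}
```

The trade-off is that a file can be deleted between the snapshot and the stat call; the sketch treats that as size unknown rather than an error.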

Comment on lines 139 to 146
strings.HasPrefix(k, "treedb.cache.vlog_payload_split.") ||
strings.HasPrefix(k, "treedb.cache.vlog_auto.") ||
strings.HasPrefix(k, "treedb.cache.vlog_dict.") ||
strings.HasPrefix(k, "treedb.cache.vlog_generation.") ||
strings.HasPrefix(k, "treedb.cache.vlog_retained_prune.") ||
strings.HasPrefix(k, "treedb.cache.vlog_payload_kind.") ||
strings.HasPrefix(k, "treedb.cache.vlog_outer_leaf_codec.") ||
strings.HasPrefix(k, "treedb.cache.batch_arena.") {
Copilot AI Mar 28, 2026

The expvar selector now exports treedb.cache.vlog_generation.* and treedb.cache.vlog_retained_prune.*, but the capacity analyzer/runbook also relies on treedb.cache.vlog_zombie.* keys. Since that prefix is not allowlisted here, *.debug_vars.json snapshots won't include zombie inventory metrics and the report will silently show zeros. Add strings.HasPrefix(k, "treedb.cache.vlog_zombie.") to the allowlist (and consider a small selector test for one zombie key).

Comment on lines 692 to +719
 // Value-log dictionary compression knobs (cached mode).
 //
 // Enabling dict compression requires:
 // - ValueLog compression mode that allows dicts (auto/dict), and
 // - Dict training enabled (TrainBytes > 0), and
 // - Side stores enabled (dictdb), and
 // - Split value log enabled (value pointers used).
-envVlogDictEnable = "TREEDB_VLOG_DICT_ENABLE" // bool
-envVlogDictTrainBytes = "TREEDB_VLOG_DICT_TRAIN_BYTES" // int
-envVlogDictBytes = "TREEDB_VLOG_DICT_BYTES" // int
-envVlogDictMinRecords = "TREEDB_VLOG_DICT_MIN_RECORDS" // int
-envVlogDictMaxRecordBytes = "TREEDB_VLOG_DICT_MAX_RECORD_BYTES" // int
-envVlogDictSampleStride = "TREEDB_VLOG_DICT_SAMPLE_STRIDE" // int
-envVlogDictDedupWindow = "TREEDB_VLOG_DICT_DEDUP_WINDOW" // int
-envVlogDictTrainLevel = "TREEDB_VLOG_DICT_TRAIN_LEVEL" // int
-envVlogDictMaxK = "TREEDB_VLOG_DICT_MAX_K" // int
-envVlogDictClassMode = "TREEDB_VLOG_DICT_CLASS_MODE" // single|split_outer_leaf
-envVlogDictZstdLevel = "TREEDB_VLOG_DICT_ZSTD_LEVEL" // fastest|default|better|best|int
-envVlogDictEntropy = "TREEDB_VLOG_DICT_ENTROPY" // bool
-envVlogDictAdaptiveRatio = "TREEDB_VLOG_DICT_ADAPTIVE_RATIO" // float64
-envVlogDictMinPayloadSavings = "TREEDB_VLOG_DICT_MIN_PAYLOAD_SAVINGS_RATIO" // float64
+envVlogDictEnable = "TREEDB_VLOG_DICT_ENABLE" // bool
+envVlogDictTrainBytes = "TREEDB_VLOG_DICT_TRAIN_BYTES" // int
+envVlogDictBytes = "TREEDB_VLOG_DICT_BYTES" // int
+envVlogDictMinRecords = "TREEDB_VLOG_DICT_MIN_RECORDS" // int
+envVlogDictMaxRecordBytes = "TREEDB_VLOG_DICT_MAX_RECORD_BYTES" // int
+envVlogDictSampleStride = "TREEDB_VLOG_DICT_SAMPLE_STRIDE" // int
+envVlogDictDedupWindow = "TREEDB_VLOG_DICT_DEDUP_WINDOW" // int
+envVlogDictTrainLevel = "TREEDB_VLOG_DICT_TRAIN_LEVEL" // int
+envVlogDictMaxK = "TREEDB_VLOG_DICT_MAX_K" // int
+envVlogDictClassMode = "TREEDB_VLOG_DICT_CLASS_MODE" // single|split_outer_leaf
+envVlogDictZstdLevel = "TREEDB_VLOG_DICT_ZSTD_LEVEL" // fastest|default|better|best|int
+envVlogDictEntropy = "TREEDB_VLOG_DICT_ENTROPY" // bool
+envVlogDictAdaptiveRatio = "TREEDB_VLOG_DICT_ADAPTIVE_RATIO" // float64
+envVlogDictMinPayloadSavings = "TREEDB_VLOG_DICT_MIN_PAYLOAD_SAVINGS_RATIO" // float64
+envVlogMaxRetainedBytes = "TREEDB_VLOG_MAX_RETAINED_BYTES" // int64
+envVlogMaxRetainedBytesHard = "TREEDB_VLOG_MAX_RETAINED_BYTES_HARD" // int64
+envVlogRewriteBudgetBytesPerSec = "TREEDB_VLOG_REWRITE_BUDGET_BYTES_PER_SEC" // int64
+envVlogRewriteBudgetRecordsPerSec = "TREEDB_VLOG_REWRITE_BUDGET_RECORDS_PER_SEC" // int
+envVlogRewriteTriggerTotalBytes = "TREEDB_VLOG_REWRITE_TRIGGER_TOTAL_BYTES" // int64
+envVlogRewriteTriggerStaleRatioPPM = "TREEDB_VLOG_REWRITE_TRIGGER_STALE_RATIO_PPM" // uint32
+envVlogRewriteTriggerChurnPerSec = "TREEDB_VLOG_REWRITE_TRIGGER_CHURN_PER_SEC" // int64
Copilot AI Mar 28, 2026

The new TREEDB_VLOG_* retained/rewrite env vars are grouped under the “Value-log dictionary compression knobs” header, but they’re unrelated to dict compression. This makes the comment above the block misleading and makes the env list harder to scan. Consider moving these constants to a separate section (e.g., “Value-log maintenance knobs”) or updating the header comment to reflect the expanded scope.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8e9a018a23

Comment on lines +142 to +143
strings.HasPrefix(k, "treedb.cache.vlog_generation.") ||
strings.HasPrefix(k, "treedb.cache.vlog_retained_prune.") ||

P2 Badge Include vlog_zombie keys in expvar whitelist

The commit adds treedb.cache.vlog_zombie.* counters in DB.Stats() (used by the new maintenance analyzer), but selectTreeDBExpvarStats still whitelists only vlog_generation and vlog_retained_prune families here. In environments that consume diagnostics via this expvar filter (the default run-celestia path), zombie metrics are dropped and the analyzer’s zombie inventory fields read as zero, masking pinned-zombie buildup and skewing reclaim diagnostics.
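
A table-driven allowlist makes a missing prefix easier to spot than a long `||` chain. The sketch below is an assumption about how the check could be structured, not the shape of `selectTreeDBExpvarStats`; `exportedPrefixes`, `isExported`, and the sample key `treedb.cache.vlog_zombie.pinned_bytes` are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// exportedPrefixes lists the metric families discussed in this thread,
// including the vlog_zombie family the review asks to add.
var exportedPrefixes = []string{
	"treedb.cache.vlog_generation.",
	"treedb.cache.vlog_retained_prune.",
	"treedb.cache.vlog_zombie.",
}

// isExported reports whether key belongs to an allowlisted family.
func isExported(key string) bool {
	for _, p := range exportedPrefixes {
		if strings.HasPrefix(key, p) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isExported("treedb.cache.vlog_zombie.pinned_bytes")) // hypothetical key; prints "true"
	fmt.Println(isExported("treedb.cache.unlisted.key"))             // prints "false"
}
```

A small selector test that asserts one zombie key passes the filter would catch the same omission in future additions.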

Copilot AI review requested due to automatic review settings March 28, 2026 21:01
Copilot AI left a comment

Pull request overview

Copilot reviewed 17 out of 18 changed files in this pull request and generated 1 comment.


Comment on lines +1317 to +1334
m.mu.RLock()
for _, f := range m.files {
    if f == nil || !f.IsZombie.Load() {
        continue
    }
    segments++
    size := valueLogFileSizeBestEffort(f)
    bytes += size
    if f.RefCount.Load() > 0 {
        pinnedSegments++
        pinnedBytes += size
        continue
    }
    unpinnedSegments++
    unpinnedBytes += size
}
m.mu.RUnlock()
return segments, bytes, pinnedSegments, pinnedBytes, unpinnedSegments, unpinnedBytes
Copilot AI Mar 28, 2026

ZombieStats holds Manager.mu.RLock while calling valueLogFileSizeBestEffort, which can fall back to os.Stat on f.Path. Doing filesystem I/O under the manager lock can introduce latency/lock contention if fileSize/mmapData are unset (e.g., newly discovered or partially initialized files). Consider collecting the zombie file pointers/paths under the lock, then computing best-effort sizes outside the lock (or restricting to cached size only) to keep stats reporting non-blocking.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: db46ff7081

Comment on lines +128 to 131
protectedAll := mergeUniqueNonEmptyPaths(opts.ProtectedPaths, opts.ProtectedInUsePaths, opts.ProtectedRetainedPaths)
if len(protectedAll) > 0 {
    if recent := recentValueLogIDsForProtectedPaths(set, valueLogKeepRecentSegmentsPerLane, protectedAll); len(recent) > 0 {
        keptIDs = recent

P1 Badge Preserve active IDs when narrowing GC lanes

keptIDs starts as currentValueLogIDs(set) but is then replaced with recent when any protected-path set is provided. With the new ProtectedInUsePaths/ProtectedRetainedPaths API, callers can validly pass only one class of paths; in that case recentValueLogIDsForProtectedPaths keeps only those lanes, so current writable segments in other lanes can be treated as eligible and marked zombie/deleted if currently unreferenced. This can prune active segments outside the protected lane subset; merge recent into keptIDs instead of overwriting it.
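
The merge suggested above amounts to a set union instead of an assignment. In this sketch the ID sets are modeled as `map[uint32]struct{}`, which is an assumption about the representation, and `mergeIDs` is a hypothetical helper.

```go
package main

import "fmt"

// mergeIDs returns the union of base and extra without mutating either input.
func mergeIDs(base, extra map[uint32]struct{}) map[uint32]struct{} {
	out := make(map[uint32]struct{}, len(base)+len(extra))
	for id := range base {
		out[id] = struct{}{}
	}
	for id := range extra {
		out[id] = struct{}{}
	}
	return out
}

func main() {
	// current: writable segments across all lanes.
	current := map[uint32]struct{}{10: {}, 20: {}}
	// recent: segments kept for the protected lane subset only.
	recent := map[uint32]struct{}{9: {}, 10: {}}

	kept := mergeIDs(current, recent)
	_, has20 := kept[20]
	fmt.Println(len(kept), has20) // prints "3 true"
}
```

With plain assignment, ID 20 would drop out of the kept set because its lane is not in the protected subset; the union keeps it.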
