
Follow up fast-maintenance chain review debt after merge #762

@snissn


Summary

Post-merge follow-up issue for the fast-maintenance chain.

The main correctness chain has landed. Review-debt cleanup and CI-runtime follow-ups have also landed. This issue now tracks the remaining runtime maintenance and RSS work.

The work has shifted from small, stacked fixes to a larger maintenance-design effort. The current best merged baseline is #767.

Completed

Completed by #763:

  • Remove dead duplicate fallback == nil checks in TreeDB/caching/leafref_live_ids.go.
  • Disable the real background generation loop in direct scheduler tests instead of racing explicit maybeRunVlogGenerationMaintenance(...) calls.
  • Revisit schedulerTestWait minimum/default values and raise the floor for slower CI.
  • Update .github/scripts/summarize_go_test_json.py to include package-level skip events.

Completed by #764:

  • Review waitForRetainedValueLogPrune() inside maybeRunVlogGenerationMaintenance().
  • Keep the current serialization with waitForRetainedValueLogPrune(); no behavior change needed.

Completed by #765:

  • Preserve / handle snap.Close() errors instead of suppressing them.
  • Simplify / reduce the rewrite-plan test hook registry production surface.
  • Revisit cached rewrite live-bytes cache-hit behavior.

Completed by #766:

  • Reduce TreeDB/caching CI runtime by cutting wall-clock-heavy maintenance/reopen test cost.

Completed by #767:

  • Add resumable maintenance queue / checkpoint-triggered rewrite+GC entry for DisableWAL.
  • Improve runtime debt behavior enough to get a real partial win over the pre-#767 baseline.

Completed by #784 as a narrow assertion-driven fix:

  • Stop idle split-vlog lane checkpoint rotation from churning untouched lanes.
  • Materially reduce pathological l1/l2 tiny-file explosion.

Current Best Merged Baseline

Treat merged #767 as the current baseline for future work.

Representative best merged #767 runs:

| case | profile | duration_s | max_rss_kb | max_rss_gib | end_app_bytes | end_app_gib |
| --- | --- | --- | --- | --- | --- | --- |
| treedb_fast | fast | 250 | 20737692 | 19.31 | 9506969532 | 8.85 |
| treedb_wal_on_fast | wal_on_fast | 293 | 18804252 | 17.51 | 5314266168 | 4.95 |

Run homes:

  • fast: /home/mikers/.celestia-app-mainnet-treedb-20260309214001
  • wal_on_fast: /home/mikers/.celestia-app-mainnet-treedb-20260309214518

This is the baseline to compare against unless a newer merged change supersedes it.

Runtime Follow-Up: Celestia Evidence And Targets

The main correctness state is good enough to separate merge from perf work:

  • Current merged line completes end-to-end run_celestia on both:
    • treedb fast
    • treedb wal_on_fast
  • The remaining work is now perf / maintenance debt, not the earlier correctness blocker.

Earlier head Celestia runs (pre-#767 context)

| case | profile | duration_s | max_rss_kb | max_rss_gib | end_app_bytes | end_app_gib | du_bytes | du_gib |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| treedb_fast | fast | 305 | 20955128 | 19.51 | 14942028077 | 13.92 | 17204622099 | 16.02 |
| treedb_wal_on_fast | wal_on_fast | 321 | 17424392 | 16.23 | 11253062646 | 10.48 | 11253062646 | 10.48 |

Run homes:

  • fast: /home/mikers/.celestia-app-mainnet-treedb-20260309141847
  • wal_on_fast: /home/mikers/.celestia-app-mainnet-treedb-20260309141303

Manual maintenance on those earlier finished runs

Using a freshly built treemap from the head line, the following was run on each finished application.db:

  1. treemap vlog-gc <db> -rw
  2. treemap vlog-rewrite <db> -rw

Results:

| case | before_du | after_gc_du | after_rewrite_du | total_reclaimed |
| --- | --- | --- | --- | --- |
| treedb_fast | 17204622099 | 7336856119 | 2468094302 | 14736527797 |
| treedb_wal_on_fast | 11253062646 | 5209791027 | 2576638928 | 8676423718 |

Approximate GiB:

  • fast: 16.02 GiB -> 6.83 GiB -> 2.30 GiB
  • wal_on_fast: 10.48 GiB -> 4.85 GiB -> 2.40 GiB

Interpretation:

  • Both profiles converge to roughly the same final floor after maintenance: about 2.3-2.4 GiB.
  • That means the large runtime disk gap is primarily maintenance debt / stale vlog accumulation, not irreducible live data.
  • This remains the strongest signal for the next perf sprint.

What Has Been Tried Since #767

These are the important post-#767 experiments. Future agents should not rediscover them blindly.

Rejected small-tuning follow-ups

Rejected because they preserved correctness but regressed wall time, RSS, or runtime debt:

  • #774: persisted GC queue follow-up
  • #776: byte-budgeted rewrite queue phase-1 variant
  • #780: checkpoint-kick GC gating change
  • several smaller scheduler/cadence/chunk-size tweaks around rewrite refill cadence, maxSegments=2, MinSegmentAge, and GC/rewrite gating

Common pattern:

  • correctness stayed good
  • maintained floor stayed reasonable
  • runtime accumulation, wall time, or RSS got worse

Conclusion:

  • the remaining problem is not another small scheduler constant
  • the remaining problem is algorithmic

#782 / #783 style phase-1 scaffolding

These branches produced useful architectural scaffolding and tests, but did not produce a keepable runtime win. They should be treated as design exploration, not as the current baseline.

Current Assertion-Driven Smell Investigation

The next active focus became the pathological directory shape, especially unreasonable l0 accumulation.

Pathological run shape that triggered this direction

Observed on:

  • /home/mikers/.celestia-app-mainnet-treedb-20260310092036/data/application.db/maindb/wal

Main smell:

  • l0 total around 16.84 GB
  • hundreds of l0 segments
  • hundreds of tiny l1 and l2 segments
  • one l255

This is now considered a correctness-style runtime assertion problem, not just a tuning smell.

Important finding from the lane-churn fix

#784 proved one real contributor:

  • idle checkpoint rotation of untouched split-vlog lanes

Comparison showed:

  • l1 files: 596 -> 1
  • l2 files: 596 -> 1
  • tiny <1MiB files on l1: 595 -> 0
  • tiny <1MiB files on l2: 595 -> 0

So:

  • lane churn on l1/l2 was a real bug and was materially reduced
  • but the main remaining smell is still l0

Current main smell

Even after the #784 lane-churn fix, the remaining unreasonable condition is:

  • l0 still extremely large
  • representative observed shape:
    • 357 l0 segments
    • about 16.84 GB in l0

This is the next main target.

Strong Current Inferences

These are the main conclusions a future agent should start from instead of rediscovering:

  1. The earlier Celestia correctness blocker was real and is fixed on the merged line.
    • The decisive fix was flushing unsynced fast importer value-log batches before later synced metadata could expose unreadable pointers.
    • Relevant merged PR: #758.
  2. The remaining problem is not basic correctness.
    • The merged line completes end-to-end for both fast and wal_on_fast.
    • The work is now about memory and maintenance effectiveness.
  3. The runtime disk gap is still mostly maintenance debt.
    • Post-maintenance floors remain much lower than runtime end-state.
    • Small scheduler tweaks do not solve this robustly.
  4. The remaining problem is no longer a missing constant.
    • The small knobs have been exercised enough.
    • The next useful step is a deeper maintenance algorithm change.
  5. Assertion-driven refinement of the subsystem generating pathological directory structure is the right tactic.
    • Do not only assert queue state.
    • Assert filesystem shape and lane behavior.

Current Large-Change Direction

The next phase should be a deeper maintenance algorithm change:

Maintenance V2 target architecture

Build a unified incremental maintenance engine with:

  1. debt ledger
  2. incremental scanner
  3. bounded executor
  4. controller

Phase 1 target

The immediate next implementation target is:

  • persistent debt ledger
  • debt-aware rewrite selection
  • bounded rewrite executor
  • clean resume across reopen

Why this is the right scope

  • resumable is not enough as the final destination
  • incremental is the target architecture
  • but implementation should still proceed in coherent phases, not a giant rewrite

Concrete Workstreams For Additional Agents

These are independent enough to parallelize.

Workstream A: l0 pathology / lane routing assertions

Goal:

  • identify and constrain the subsystem generating unreasonable l0 growth

Likely suspects already identified:

  • outer-leaf value-log appender hard-pinned to one lane
  • small-count but huge-value pointer paths collapsing to one lane because fan-out is record-count based
  • startup/open rotation and all-lane rotation paths

Good test targets:

  1. Open does not rotate untouched split-vlog lanes
  2. memtable/WAL rotation does not churn untouched lanes
  3. a small number of very large pointer-backed values should not all collapse to one lane just because record count is below threshold
  4. repeated maintenance passes should not create synthetic tiny-file explosions on non-active lanes
  5. l0 segment growth stays bounded under representative synthetic workloads

Workstream B: debt ledger / rewrite selection phase 1

Goal:

  • replace simple queue-of-file-IDs behavior with a real persistent debt-aware ledger

Needed pieces:

  1. persistent per-segment state
  2. score / priority for rewrite candidate selection
  3. cooldown / retry semantics
  4. bounded rewrite executor using bytes/time/segment budget
  5. reopen/resume persistence tests

Workstream C: telemetry-driven analysis

Goal:

  • use current maintenance telemetry to determine where runtime debt is coming from on real run_celestia

Use the instrumentation landed around checkpoint kicks / queue stats / dry-run stats to answer:

  • are rewrite kicks firing enough?
  • is GC dry-run showing useful eligibility but not enough follow-through?
  • is rewrite debt mostly accumulating in l0 because selection is wrong or because lane routing is wrong?

Workstream D: future incremental GC phase (not immediate)

Goal:

  • after rewrite phase-1 is working, add resumable/incremental GC progress

Not the immediate next code change, but should remain the next architectural phase after rewrite selection/execution improves.

Recommended Validation Rules

Every candidate change in this area should be validated with:

  1. exact-head run_celestia on fast
  2. wal_on_fast sanity run when behavior may affect shared maintenance paths
  3. compare:
    • duration_seconds
    • max_rss_kb
    • end_app_bytes
    • du -sb application.db
  4. run:
    • treemap vlog-gc -rw
    • treemap vlog-rewrite -rw
  5. compare maintained floor against runtime end-state

Do not keep a branch unless it improves the right combination of:

  • correctness preserved
  • runtime debt improved
  • wall time not materially worse
  • RSS not materially worse

Current Success Criteria

The next accepted phase should aim for:

  1. fast still completes correctly
  2. wal_on_fast still completes correctly
  3. runtime du and end_app_bytes move materially closer to the post-maintenance floor
  4. peak RSS does not regress, ideally improves
  5. filesystem shape becomes saner:
    • fewer pathological l0 segments
    • bounded tiny-fragment churn on non-active lanes

Suggested Starting Point For A Future Agent

  1. Start from merged #767, not an old scratch branch.
  2. Treat #784 as proof that assertion-driven directory-shape fixes can expose real bugs.
  3. Do not spend another cycle on micro-tuning scheduler constants.
  4. Start with Workstream A plus Workstream B:
    • assertion-driven l0/lane-routing tests
    • debt-ledger rewrite phase-1 implementation
  5. Use real Celestia runs as the acceptance gate.
