Summary
Post-merge follow-up issue for the fast-maintenance chain.
The main correctness chain has landed. Review-debt cleanup and CI-runtime follow-ups have also landed. This issue now tracks the remaining runtime maintenance and RSS work.
The work has shifted from small, stacked fixes to a larger maintenance-design effort. The current best merged baseline is #767.
Completed
Completed by #763:
- Remove dead duplicate `fallback == nil` checks in `TreeDB/caching/leafref_live_ids.go`.
- Disable the real background generation loop in direct scheduler tests instead of racing explicit `maybeRunVlogGenerationMaintenance(...)` calls.
- Revisit `schedulerTestWait` minimum/default values enough to raise the floor for slower CI.
- Update `.github/scripts/summarize_go_test_json.py` to include package-level `skip` events.
Completed by #764:
- Review `waitForRetainedValueLogPrune()` inside `maybeRunVlogGenerationMaintenance()`.
- Keep the current serialization with `waitForRetainedValueLogPrune()`; no behavior change needed.
Completed by #765:
- Preserve / handle `snap.Close()` errors instead of suppressing them.
- Simplify / reduce the rewrite-plan test hook registry production surface.
- Revisit cached rewrite live-bytes cache-hit behavior.
Completed by #766:
- Reduce `TreeDB/caching` CI runtime by cutting wall-clock-heavy maintenance/reopen test cost.
Completed by #767:
- Add resumable maintenance queue / checkpoint-triggered rewrite+GC entry for `DisableWAL`.
- Improve runtime debt behavior enough to get a real partial win over the pre-#767 baseline.
Completed by #784 as a narrow assertion-driven fix:
- Stop idle split-vlog lane checkpoint rotation from churning untouched lanes.
- Materially reduce pathological `l1`/`l2` tiny-file explosion.
Current Best Merged Baseline
Treat merged #767 as the current baseline for future work.
Representative best merged #767 runs:
| case | profile | duration_s | max_rss_kb | max_rss_gib | end_app_bytes | end_app_gib |
|---|---|---|---|---|---|---|
| treedb_fast | fast | 250 | 20737692 | 19.31 | 9506969532 | 8.85 |
| treedb_wal_on_fast | wal_on_fast | 293 | 18804252 | 17.51 | 5314266168 | 4.95 |
Run homes:
- fast: `/home/mikers/.celestia-app-mainnet-treedb-20260309214001`
- wal_on_fast: `/home/mikers/.celestia-app-mainnet-treedb-20260309214518`
This is the baseline to compare against unless a newer merged change supersedes it.
Runtime Follow-Up: Celestia Evidence And Targets
The main correctness state is good enough to separate merge from perf work:
- Current merged line completes `run_celestia` end-to-end on both `treedb fast` and `treedb wal_on_fast`.
- The remaining work is now perf / maintenance debt, not the earlier correctness blocker.
Earlier head Celestia runs (pre-#767 context)
| case | profile | duration_s | max_rss_kb | max_rss_gib | end_app_bytes | end_app_gib | du_bytes | du_gib |
|---|---|---|---|---|---|---|---|---|
| treedb_fast | fast | 305 | 20955128 | 19.51 | 14942028077 | 13.92 | 17204622099 | 16.02 |
| treedb_wal_on_fast | wal_on_fast | 321 | 17424392 | 16.23 | 11253062646 | 10.48 | 11253062646 | 10.48 |
Run homes:
- fast: `/home/mikers/.celestia-app-mainnet-treedb-20260309141847`
- wal_on_fast: `/home/mikers/.celestia-app-mainnet-treedb-20260309141303`
Manual maintenance on those earlier finished runs
Using a freshly built `treemap` from the head line, the following was run on each finished `application.db`:
- `treemap vlog-gc <db> -rw`
- `treemap vlog-rewrite <db> -rw`
Results:
| case | before_du | after_gc_du | after_rewrite_du | total_reclaimed |
|---|---|---|---|---|
| treedb_fast | 17204622099 | 7336856119 | 2468094302 | 14736527797 |
| treedb_wal_on_fast | 11253062646 | 5209791027 | 2576638928 | 8676423718 |
Approximate GiB:
- fast: `16.02 GiB -> 6.83 GiB -> 2.30 GiB`
- wal_on_fast: `10.48 GiB -> 4.85 GiB -> 2.40 GiB`
Interpretation:
- Both profiles converge to roughly the same final floor after maintenance: about 2.3-2.4 GiB.
- That means the large runtime disk gap is primarily maintenance debt / stale vlog accumulation, not irreducible live data.
- This remains the strongest signal for the next perf sprint.
What Has Been Tried Since #767
These are the important post-#767 experiments. Future agents should not rediscover them blindly.
Rejected small-tuning follow-ups
Rejected because they preserved correctness but regressed wall time, RSS, or runtime debt:
- #774: persisted GC queue follow-up
- #776: byte-budgeted rewrite queue phase-1 variant
- #780: checkpoint-kick GC gating change
- several smaller scheduler/cadence/chunk-size tweaks around rewrite refill cadence, `maxSegments=2`, `MinSegmentAge`, and GC/rewrite gating
Common pattern:
- correctness stayed good
- maintained floor stayed reasonable
- runtime accumulation, wall time, or RSS got worse
Conclusion:
- the remaining problem is not another small scheduler constant
- the remaining problem is algorithmic
#782 / #783 style phase-1 scaffolding
These branches produced useful architectural scaffolding and tests, but did not produce a keepable runtime win. They should be treated as design exploration, not as the current baseline.
Current Assertion-Driven Smell Investigation
The next active focus became the pathological directory shape, especially unreasonable `l0` accumulation.
Pathological run shape that triggered this direction
Observed on:
`/home/mikers/.celestia-app-mainnet-treedb-20260310092036/data/application.db/maindb/wal`
Main smell:
- `l0` total around `16.84 GB`
- hundreds of `l0` segments
- hundreds of tiny `l1` and `l2` segments
- one `l255`
This is now considered a correctness-style runtime assertion problem, not just a tuning smell.
Important finding from the lane-churn fix
#784 proved one real contributor:
- idle checkpoint rotation of untouched split-vlog lanes
Comparison showed:
- `l1` files: `596 -> 1`
- `l2` files: `596 -> 1`
- tiny `<1MiB` files on `l1`: `595 -> 0`
- tiny `<1MiB` files on `l2`: `595 -> 0`
So:
- lane churn on `l1`/`l2` was a real bug and was materially reduced
- but the main remaining smell is still `l0`
Current main smell
Even after the #784 lane-churn fix, the remaining unreasonable condition is:
- `l0` still extremely large
- representative observed shape: `357` `l0` segments, about `16.84 GB` in `l0`
This is the next main target.
Strong Current Inferences
These are the main conclusions a future agent should start from instead of rediscovering:
- The earlier Celestia correctness blocker was real and is fixed on the merged line.
- The decisive fix was flushing unsynced fast importer value-log batches before later synced metadata could expose unreadable pointers.
- Relevant merged PR: #758.
- The remaining problem is not basic correctness.
- The merged line completes end-to-end for both `fast` and `wal_on_fast`.
- The work is now about memory and maintenance effectiveness.
- The runtime disk gap is still mostly maintenance debt.
- Post-maintenance floors remain much lower than runtime end-state.
- Small scheduler tweaks do not solve this robustly.
- The remaining problem is no longer a missing constant.
- The small knobs have been exercised enough.
- The next useful step is a deeper maintenance algorithm change.
- Assertion-driven refinement of the subsystem generating pathological directory structure is the right tactic.
- Do not only assert queue state.
- Assert filesystem shape and lane behavior.
Current Large-Change Direction
The next phase should be a deeper maintenance algorithm change:
Maintenance V2 target architecture
Build a unified incremental maintenance engine with:
- debt ledger
- incremental scanner
- bounded executor
- controller
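The four components above could be wired together along boundaries like the following sketch. Every name here is hypothetical (not the project's actual API), and the toy in-memory ledger stands in for the persistent one that phase 1 targets:

```go
package main

import "sort"

// DebtLedger tracks per-segment reclaimable debt; the real one would be
// persistent across reopen.
type DebtLedger interface {
	Record(segID uint64, deadBytes int64)
	Candidates(max int) []uint64 // highest-debt segments first
}

// mapLedger is a toy in-memory implementation for illustration only.
type mapLedger struct{ debt map[uint64]int64 }

var _ DebtLedger = (*mapLedger)(nil)

func newMapLedger() *mapLedger { return &mapLedger{debt: map[uint64]int64{}} }

func (l *mapLedger) Record(id uint64, dead int64) { l.debt[id] += dead }

// Candidates returns up to max segment IDs, highest recorded debt first.
func (l *mapLedger) Candidates(max int) []uint64 {
	ids := make([]uint64, 0, len(l.debt))
	for id := range l.debt {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool { return l.debt[ids[i]] > l.debt[ids[j]] })
	if len(ids) > max {
		ids = ids[:max]
	}
	return ids
}

func main() {
	led := newMapLedger()
	led.Record(1, 100)
	led.Record(2, 900)
	led.Record(3, 400)
	_ = led.Candidates(2) // -> [2, 3]: highest-debt segments selected first
}
```

The incremental scanner would feed `Record`, the bounded executor would consume `Candidates`, and the controller would decide when each runs.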
Phase 1 target
The immediate next implementation target is:
- persistent debt ledger
- debt-aware rewrite selection
- bounded rewrite executor
- clean resume across reopen
Why this is the right scope
- `resumable` is not enough as the final destination
- `incremental` is the target architecture
- but implementation should still proceed in coherent phases, not a giant rewrite
Concrete Workstreams For Additional Agents
These are independent enough to parallelize.
Workstream A: l0 pathology / lane routing assertions
Goal:
- identify and constrain the subsystem generating unreasonable `l0` growth
Likely suspects already identified:
- outer-leaf value-log appender hard-pinned to one lane
- small-count but huge-value pointer paths collapsing to one lane because fan-out is record-count based
- startup/open rotation and all-lane rotation paths
Good test targets:
- `Open` does not rotate untouched split-vlog lanes
- memtable/WAL rotation does not churn untouched lanes
- a small number of very large pointer-backed values should not all collapse to one lane just because record count is below threshold
- repeated maintenance opportunities should not create unbounded synthetic tiny-file explosions on non-active lanes
- `l0` segment growth stays bounded under representative synthetic workloads
Workstream B: debt ledger / rewrite selection phase 1
Goal:
- replace simple queue-of-file-IDs behavior with a real persistent debt-aware ledger
Needed pieces:
- persistent per-segment state
- score / priority for rewrite candidate selection
- cooldown / retry semantics
- bounded rewrite executor using bytes/time/segment budget
- reopen/resume persistence tests
Workstream C: telemetry-driven analysis
Goal:
- use current maintenance telemetry to determine where runtime debt is coming from on real `run_celestia`
Use the instrumentation landed around checkpoint kicks / queue stats / dry-run stats to answer:
- are rewrite kicks firing enough?
- is GC dry-run showing useful eligibility but not enough follow-through?
- is rewrite debt mostly accumulating in `l0` because selection is wrong or because lane routing is wrong?
Workstream D: future incremental GC phase (not immediate)
Goal:
- after rewrite phase-1 is working, add resumable/incremental GC progress
Not the immediate next code change, but should remain the next architectural phase after rewrite selection/execution improves.
Recommended Validation Rules
Every candidate change in this area should be validated with:
- exact-head `run_celestia` on `fast`
- `wal_on_fast` sanity run when behavior may affect shared maintenance paths
- compare: `duration_seconds`, `max_rss_kb`, `end_app_bytes`, `du -sb application.db`
- run: `treemap vlog-gc -rw`, `treemap vlog-rewrite -rw`
- compare maintained floor against runtime end-state
Do not keep a branch unless it improves the right combination of:
- correctness preserved
- runtime debt improved
- wall time not materially worse
- RSS not materially worse
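The keep/reject rule can be encoded mechanically so every candidate branch is judged the same way. The ~5% tolerance used here for wall time and RSS is an assumed threshold, not a project rule:

```go
package main

// runMetrics holds the comparison metrics listed in the validation rules.
type runMetrics struct {
	DurationS   float64 // duration_seconds
	MaxRSSKB    int64   // max_rss_kb
	EndAppBytes int64   // end_app_bytes
	Correct     bool    // run completed correctly end-to-end
}

// keepBranch applies the acceptance combination: correctness preserved,
// runtime debt improved, wall time and RSS not materially worse.
func keepBranch(base, cand runMetrics) bool {
	const tol = 1.05 // assumed: allow up to ~5% regression on time / RSS
	return cand.Correct &&
		cand.EndAppBytes < base.EndAppBytes &&
		cand.DurationS <= base.DurationS*tol &&
		float64(cand.MaxRSSKB) <= float64(base.MaxRSSKB)*tol
}

func main() {
	// Baseline figures taken from the #767 treedb_fast row above.
	base := runMetrics{DurationS: 250, MaxRSSKB: 20737692, EndAppBytes: 9506969532, Correct: true}
	cand := runMetrics{DurationS: 255, MaxRSSKB: 20500000, EndAppBytes: 7000000000, Correct: true}
	_ = keepBranch(base, cand) // -> true: debt improved, time/RSS within tolerance
}
```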
Current Success Criteria
The next accepted phase should aim for:
- `fast` still completes correctly
- `wal_on_fast` still completes correctly
- runtime `du` and `end_app_bytes` move materially closer to the post-maintenance floor
- peak RSS does not regress, ideally improves
- filesystem shape becomes saner:
  - fewer pathological `l0` segments
  - bounded tiny-fragment churn on non-active lanes
Suggested Starting Point For A Future Agent
- Start from merged #767, not an old scratch branch.
- Treat #784 as proof that assertion-driven directory-shape fixes can expose real bugs.
- Do not spend another cycle on micro-tuning scheduler constants.
- Start with Workstream A plus Workstream B:
  - assertion-driven `l0`/lane-routing tests
  - debt-ledger rewrite phase-1 implementation
- Use real Celestia runs as the acceptance gate.