
Follow up fast-maintenance chain review debt after merge #762

@snissn


Summary

Post-merge follow-up issue for the fast-maintenance chain.

The main correctness chain has landed. Review-debt cleanup and CI-runtime follow-ups have also landed. This issue now tracks the remaining runtime maintenance and RSS work.

The work has shifted from small, stacked fixes to a larger maintenance-design effort. The current best merged baseline is #767.

Completed

Completed by #763:

  • Remove dead duplicate fallback == nil checks in TreeDB/caching/leafref_live_ids.go.
  • Disable the real background generation loop in direct scheduler tests instead of racing explicit maybeRunVlogGenerationMaintenance(...) calls.
  • Revisit schedulerTestWait minimum/default values and raise the floor for slower CI.
  • Update .github/scripts/summarize_go_test_json.py to include package-level skip events.

Completed by #764:

  • Review waitForRetainedValueLogPrune() inside maybeRunVlogGenerationMaintenance().
  • Keep the current serialization with waitForRetainedValueLogPrune(); no behavior change needed.

Completed by #765:

  • Preserve / handle snap.Close() errors instead of suppressing them.
  • Simplify / reduce the rewrite-plan test hook registry production surface.
  • Revisit cached rewrite live-bytes cache-hit behavior.

Completed by #766:

  • Reduce TreeDB/caching CI runtime by cutting wall-clock-heavy maintenance/reopen test cost.

Completed by #767:

  • Add resumable maintenance queue / checkpoint-triggered rewrite+GC entry for DisableWAL.
  • Improve runtime debt behavior enough to get a real partial win over the pre-#767 baseline.

Completed by #784 as a narrow assertion-driven fix:

  • Stop idle split-vlog lane checkpoint rotation from churning untouched lanes.
  • Materially reduce pathological l1/l2 tiny-file explosion.

Current Best Merged Baseline

Treat merged #767 as the current baseline for future work.

Representative best merged #767 runs:

| case | profile | duration_s | max_rss_kb | max_rss_gib | end_app_bytes | end_app_gib |
| --- | --- | --- | --- | --- | --- | --- |
| treedb_fast | fast | 250 | 20737692 | 19.31 | 9506969532 | 8.85 |
| treedb_wal_on_fast | wal_on_fast | 293 | 18804252 | 17.51 | 5314266168 | 4.95 |

Run homes:

  • fast: /home/mikers/.celestia-app-mainnet-treedb-20260309214001
  • wal_on_fast: /home/mikers/.celestia-app-mainnet-treedb-20260309214518

This is the baseline to compare against unless a newer merged change supersedes it.

Runtime Follow-Up: Celestia Evidence And Targets

The main correctness state is good enough to separate merge from perf work:

  • Current merged line completes end-to-end run_celestia on both:
    • treedb fast
    • treedb wal_on_fast
  • The remaining work is now perf / maintenance debt, not the earlier correctness blocker.

Earlier head Celestia runs (pre-#767 context)

| case | profile | duration_s | max_rss_kb | max_rss_gib | end_app_bytes | end_app_gib | du_bytes | du_gib |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| treedb_fast | fast | 305 | 20955128 | 19.51 | 14942028077 | 13.92 | 17204622099 | 16.02 |
| treedb_wal_on_fast | wal_on_fast | 321 | 17424392 | 16.23 | 11253062646 | 10.48 | 11253062646 | 10.48 |

Run homes:

  • fast: /home/mikers/.celestia-app-mainnet-treedb-20260309141847
  • wal_on_fast: /home/mikers/.celestia-app-mainnet-treedb-20260309141303

Manual maintenance on those earlier finished runs

Using a freshly built treemap from the head line, the following was run on each finished application.db:

  1. treemap vlog-gc <db> -rw
  2. treemap vlog-rewrite <db> -rw

Results:

| case | before_du | after_gc_du | after_rewrite_du | total_reclaimed |
| --- | --- | --- | --- | --- |
| treedb_fast | 17204622099 | 7336856119 | 2468094302 | 14736527797 |
| treedb_wal_on_fast | 11253062646 | 5209791027 | 2576638928 | 8676423718 |

Approximate GiB:

  • fast: 16.02 GiB -> 6.83 GiB -> 2.30 GiB
  • wal_on_fast: 10.48 GiB -> 4.85 GiB -> 2.40 GiB

Interpretation:

  • Both profiles converge to roughly the same final floor after maintenance: about 2.3-2.4 GiB.
  • That means the large runtime disk gap is primarily maintenance debt / stale vlog accumulation, not irreducible live data.
  • This remains the strongest signal for the next perf sprint.

What Has Been Tried Since #767

These are the important post-#767 experiments. Future agents should not rediscover them blindly.

Rejected small-tuning follow-ups

Rejected because they preserved correctness but regressed wall time, RSS, or runtime debt:

  • #774: persisted GC queue follow-up
  • #776: byte-budgeted rewrite queue phase-1 variant
  • #780: checkpoint-kick GC gating change
  • several smaller scheduler/cadence/chunk-size tweaks around rewrite refill cadence, maxSegments=2, MinSegmentAge, and GC/rewrite gating

Common pattern:

  • correctness stayed good
  • maintained floor stayed reasonable
  • runtime accumulation, wall time, or RSS got worse

Conclusion:

  • the remaining problem is not another small scheduler constant
  • the remaining problem is algorithmic

#782 / #783 style phase-1 scaffolding

These branches produced useful architectural scaffolding and tests, but did not produce a keepable runtime win. They should be treated as design exploration, not as the current baseline.

Current Assertion-Driven Smell Investigation

The next active focus became the pathological directory shape, especially unreasonable l0 accumulation.

Pathological run shape that triggered this direction

Observed on:

  • /home/mikers/.celestia-app-mainnet-treedb-20260310092036/data/application.db/maindb/wal

Main smell:

  • l0 total around 16.84 GB
  • hundreds of l0 segments
  • hundreds of tiny l1 and l2 segments
  • one l255

This is now considered a correctness-style runtime assertion problem, not just a tuning smell.

Important finding from the lane-churn fix

#784 proved one real contributor:

  • idle checkpoint rotation of untouched split-vlog lanes

Comparison showed:

  • l1 files: 596 -> 1
  • l2 files: 596 -> 1
  • tiny <1MiB files on l1: 595 -> 0
  • tiny <1MiB files on l2: 595 -> 0

So:

  • lane churn on l1/l2 was a real bug and was materially reduced
  • but the main remaining smell is still l0

Current main smell

Even after the #784 lane-churn fix, the remaining unreasonable condition is:

  • l0 still extremely large
  • representative observed shape:
    • 357 l0 segments
    • about 16.84 GB in l0

This is the next main target.

Strong Current Inferences

These are the main conclusions a future agent should start from instead of rediscovering:

  1. The earlier Celestia correctness blocker was real and is fixed on the merged line.
    • The decisive fix was flushing unsynced fast importer value-log batches before later synced metadata could expose unreadable pointers.
    • Relevant merged PR: #758.
  2. The remaining problem is not basic correctness.
    • The merged line completes end-to-end for both fast and wal_on_fast.
    • The work is now about memory and maintenance effectiveness.
  3. The runtime disk gap is still mostly maintenance debt.
    • Post-maintenance floors remain much lower than runtime end-state.
    • Small scheduler tweaks do not solve this robustly.
  4. The remaining problem is no longer a missing constant.
    • The small knobs have been exercised enough.
    • The next useful step is a deeper maintenance algorithm change.
  5. Assertion-driven refinement of the subsystem generating pathological directory structure is the right tactic.
    • Do not only assert queue state.
    • Assert filesystem shape and lane behavior.

Current Large-Change Direction

The next phase should be a deeper maintenance algorithm change:

Maintenance V2 target architecture

Build a unified incremental maintenance engine with:

  1. debt ledger
  2. incremental scanner
  3. bounded executor
  4. controller

Phase 1 target

The immediate next implementation target is:

  • persistent debt ledger
  • debt-aware rewrite selection
  • bounded rewrite executor
  • clean resume across reopen

Why this is the right scope

  • resumable is not enough as the final destination
  • incremental is the target architecture
  • but implementation should still proceed in coherent phases, not a giant rewrite

Concrete Workstreams For Additional Agents

These are independent enough to parallelize.

Workstream A: l0 pathology / lane routing assertions

Goal:

  • identify and constrain the subsystem generating unreasonable l0 growth

Likely suspects already identified:

  • outer-leaf value-log appender hard-pinned to one lane
  • small-count but huge-value pointer paths collapsing to one lane because fan-out is record-count based
  • startup/open rotation and all-lane rotation paths

Good test targets:

  1. Open does not rotate untouched split-vlog lanes
  2. memtable/WAL rotation does not churn untouched lanes
  3. a small number of very large pointer-backed values should not all collapse to one lane just because record count is below threshold
  4. repeated maintenance passes should not create synthetic tiny-file explosions on non-active lanes
  5. l0 segment growth stays bounded under representative synthetic workloads

Workstream B: debt ledger / rewrite selection phase 1

Goal:

  • replace simple queue-of-file-IDs behavior with a real persistent debt-aware ledger

Needed pieces:

  1. persistent per-segment state
  2. score / priority for rewrite candidate selection
  3. cooldown / retry semantics
  4. bounded rewrite executor using bytes/time/segment budget
  5. reopen/resume persistence tests

Workstream C: telemetry-driven analysis

Goal:

  • use current maintenance telemetry to determine where runtime debt is coming from on real run_celestia

Use the instrumentation landed around checkpoint kicks / queue stats / dry-run stats to answer:

  • are rewrite kicks firing enough?
  • is GC dry-run showing useful eligibility but not enough follow-through?
  • is rewrite debt mostly accumulating in l0 because selection is wrong or because lane routing is wrong?

Workstream D: future incremental GC phase (not immediate)

Goal:

  • after rewrite phase-1 is working, add resumable/incremental GC progress

Not the immediate next code change, but should remain the next architectural phase after rewrite selection/execution improves.

Recommended Validation Rules

Every candidate change in this area should be validated with:

  1. exact-head run_celestia on fast
  2. wal_on_fast sanity run when behavior may affect shared maintenance paths
  3. compare:
    • duration_seconds
    • max_rss_kb
    • end_app_bytes
    • du -sb application.db
  4. run:
    • treemap vlog-gc -rw
    • treemap vlog-rewrite -rw
  5. compare maintained floor against runtime end-state

Do not keep a branch unless it improves the right combination of:

  • correctness preserved
  • runtime debt improved
  • wall time not materially worse
  • RSS not materially worse

Current Success Criteria

The next accepted phase should aim for:

  1. fast still completes correctly
  2. wal_on_fast still completes correctly
  3. runtime du and end_app_bytes move materially closer to the post-maintenance floor
  4. peak RSS does not regress, ideally improves
  5. filesystem shape becomes saner:
    • fewer pathological l0 segments
    • bounded tiny-fragment churn on non-active lanes

Suggested Starting Point For A Future Agent

  1. Start from merged #767, not an old scratch branch.
  2. Treat #784 as proof that assertion-driven directory-shape fixes can expose real bugs.
  3. Do not spend another cycle on micro-tuning scheduler constants.
  4. Start with Workstream A plus Workstream B:
    • assertion-driven l0/lane-routing tests
    • debt-ledger rewrite phase-1 implementation
  5. Use real Celestia runs as the acceptance gate.
