valuelog: remove small decode-scratch boxing churn#833

Open
snissn wants to merge 7 commits into pr/vlog-sealed-lazy-mmap-stat-churn from pr/valuelog-decode-scratch-holder-pool

Conversation

@snissn
Owner

@snissn snissn commented Mar 15, 2026

Summary

  • remove small decode-scratch []byte interface boxing churn by replacing direct sync.Pool []byte puts with a typed holder pool
  • keep the large-scratch bounded pool behavior unchanged
  • add a small-buffer allocation regression test (TestDecodeScratchPool_ReusesSmallBuffer_NoAllocsAfterWarmup)
  • keep the ReadAtWithDictTo header path on a fixed-size pointer (*[HeaderSize]byte) to avoid slice escape in the hot path

Why

Allocation profiles (alloc_objects and alloc_space) for random-read workloads were still dominated by two sites:

  • ReadAtWithDictTo
  • putDecodeScratch

The putDecodeScratch side was largely interface-boxing churn from sync.Pool.Put([]byte) on the hot path.

Validation

  • go test ./TreeDB/internal/valuelog -count=1
  • go vet ./TreeDB/internal/valuelog
  • GOWORK=off make unified-bench
  • time ./bin/unified-bench -dbs treedb -profile fast -keys 500000 -progress=false -treedb-index-outer-leaves-in-vlog=true -valsize=100 -treedb-force-value-pointers=false -profile-dir=/home/mikers/tmp/perf-1773576977 -checkpoint-between-tests

Baseline comparison run (same branch baseline before this PR): /home/mikers/tmp/perf-1773575465

Throughput (TreeDB)

  • random_read: 234,557 -> 236,109 (+0.66%)
  • random_read_parallel: 1,667,356 -> 1,692,842 (+1.53%)
  • random_read_parallel_acquire_snapshot: 1,436,009 -> 1,512,065 (+5.30%)
  • random_read_batch: 1,186,844 -> 1,200,989 (+1.19%)

Allocation reduction (random_read_parallel)

  • alloc_objects total: 11,895,488 -> 5,755,107 (-51.62%)
  • alloc_space total: 285.11MB -> 142.59MB (-49.99%)

Allocation reduction (random_read)

  • alloc_objects total: 967,946 -> 642,231 (-33.65%)
  • alloc_space total: 35,931.20kB -> 26,671.85kB (-25.77%)

The putDecodeScratch hotspot drops out of random-read top allocators; ReadAtWithDictTo remains the dominant read-path allocator.

Copilot AI review requested due to automatic review settings March 15, 2026 12:18
Contributor

Copilot AI left a comment


Pull request overview

Eliminates sync.Pool interface-boxing allocations on the hot decode-scratch path by wrapping []byte in a typed *decodeScratchHolder struct. Also changes decodeRecordTo to accept *[HeaderSize]byte to avoid slice escape.

Changes:

  • Replace direct sync.Pool of []byte with a typed *decodeScratchHolder pool to avoid interface-boxing allocations
  • Change decodeRecordTo signature from header []byte to header *[HeaderSize]byte to prevent slice escape on the hot read path
  • Add TestDecodeScratchPool_ReusesSmallBuffer_NoAllocsAfterWarmup regression test

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

Files reviewed:

  • TreeDB/internal/valuelog/reader.go: typed holder pool for small scratch buffers; fixed-size array pointer for header; min cap hint
  • TreeDB/internal/valuelog/decode_scratch_pool_test.go: allocation regression test for small buffer reuse


@snissn
Owner Author

snissn commented Mar 15, 2026

Follow-up pushed: 8574f62a (valuelog: pool ReadAtWithDictTo header scratch)

Reason:

  • alloc_objects line attribution showed ~1 alloc/read at ReadAtWithDictTo local header scratch (var header [HeaderSize]byte).
  • Added typed header scratch pool (*[HeaderSize]byte) and reused it in the hot path.

Validation run:

  • /home/mikers/tmp/perf-1773577316
  • baseline: /home/mikers/tmp/perf-1773575465

Throughput deltas (TreeDB):

  • random_read: 234,557 -> 234,154 (-0.17%)
  • random_read_parallel: 1,667,356 -> 1,687,869 (+1.23%)
  • random_read_parallel_acquire_snapshot: 1,436,009 -> 1,489,359 (+3.72%)
  • random_read_batch: 1,186,844 -> 1,174,142 (-1.07%)

Alloc deltas:

  • random_read_parallel alloc_objects: 11,895,488 -> 22,719 (-99.81%)
  • random_read_parallel alloc_space: 285.11MB -> 14,729.67kB (-94.95%)
  • random_read alloc_objects: 967,946 -> 6,630 (-99.32%)
  • random_read alloc_space: 35,931.20kB -> 9,765.48kB (-72.82%)

Current random-read alloc profiles are now dominated by profiling/runtime artifacts (runtime/pprof, compress/flate from profiler snapshots), not valuelog read-path allocation churn.

@snissn
Owner Author

snissn commented Mar 15, 2026

Additional validation on this PR head (8574f62a) with serial run_celestia runs:

fast

  • /home/mikers/.celestia-app-mainnet-treedb-20260315022429
  • duration_seconds=304
  • max_rss_kb=11574400
  • end_app_bytes=5151373599
  • du_bytes=5151373599

wal_on_fast

  • /home/mikers/.celestia-app-mainnet-treedb-20260315023001
  • duration_seconds=396
  • max_rss_kb=13097288
  • end_app_bytes=5277709748
  • du_bytes=5277709748
  • one transient stuck diagnostic captured at 02:34:42 (statesync_discover); the run recovered and completed.

Diagnostics artifacts are present in each run’s sync/diagnostics/ directory (heap snapshots + smaps rollups).

Copilot AI review requested due to automatic review settings March 15, 2026 12:40
@snissn
Owner Author

snissn commented Mar 15, 2026

Pushed 98e2d451 to address the failing Windows checks seen on this stack:

  • TreeDB/internal/valuelog/manager_test.go
    • Skip sealed-lazy-mmap tests on Windows (matches the existing mmap test policy in the valuelog test suite).
  • TreeDB/caching/batch_arena_budget_test.go
    • Relax the drain-target assertion to allow one class-sized residual (target + classCap), since sync.Pool accounting visibility is lossy and drain is class-granular.

Local validation:

  • go test ./TreeDB/caching -run TestDrainBatchArenaPoolToTargetBytes -count=1
  • go test ./TreeDB/internal/valuelog -run 'TestReadUnsafe_SealedLazyMmapBudgetFallsBackToReadAt|TestReadUnsafe_SealedLazyMmapByteBudgetFallsBackToReadAt|TestReadUnsafe_SealedMappedOutOfRangeRemapsToKnownFileSize' -count=1
  • go test ./TreeDB/internal/valuelog -count=1
  • go vet ./TreeDB/internal/valuelog ./TreeDB/caching

Contributor

Copilot AI left a comment


Pull request overview

This PR eliminates sync.Pool interface-boxing overhead for small decode scratch buffers in the valuelog read path by introducing a typed decodeScratchHolder wrapper pool pattern. It also pools the fixed-size header array used in ReadAtWithDictTo and relaxes a flaky drain-budget test assertion.

Changes:

  • Replace direct sync.Pool []byte usage with a typed *decodeScratchHolder two-pool pattern to avoid interface boxing allocations on the hot read path
  • Pool the [HeaderSize]byte header in ReadAtWithDictTo and change decodeRecordTo to accept *[HeaderSize]byte to avoid slice escape
  • Add Windows skip guards for mmap tests and relax TestDrainBatchArenaPoolToTargetBytes tolerance

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

Files reviewed:

  • TreeDB/internal/valuelog/reader.go: typed holder pool for decode scratch, header scratch pool, decodeRecordTo signature change
  • TreeDB/internal/valuelog/decode_scratch_pool_test.go: new zero-alloc regression test for small buffer reuse
  • TreeDB/internal/valuelog/manager_test.go: Windows skip guards for mmap tests
  • TreeDB/caching/batch_arena_budget_test.go: relax drain assertion to tolerate one class of residual


Contributor

Copilot AI left a comment


Pull request overview

This PR reduces allocation churn in the internal/valuelog random-read hot path by eliminating []byte interface-boxing in the small decode-scratch sync.Pool and by keeping the record header on a fixed-size pointer to avoid slice escapes.

Changes:

  • Replace small decode-scratch pooling from sync.Pool of []byte to a typed holder approach to avoid interface boxing allocations.
  • Introduce a pooled *[HeaderSize]byte header scratch path and update decodeRecordTo to accept a fixed-size header pointer.
  • Add an allocation regression test for small scratch reuse; relax a batch-arena pool accounting test expectation to be resilient across sync.Pool implementations/visibility.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Files reviewed:

  • TreeDB/internal/valuelog/reader.go: reworks decode scratch pooling to avoid []byte boxing and pools fixed-size header scratch; updates decodeRecordTo signature accordingly
  • TreeDB/internal/valuelog/manager_test.go: skips mmap-dependent tests on Windows where mmap is not supported
  • TreeDB/internal/valuelog/decode_scratch_pool_test.go: adds a small-buffer allocation regression test using testing.AllocsPerRun
  • TreeDB/caching/batch_arena_budget_test.go: allows one class-sized residual in drain accounting to reduce platform/runtime sensitivity


Copilot AI review requested due to automatic review settings March 18, 2026 00:00
Contributor

Copilot AI left a comment


Pull request overview

This PR reduces allocation churn in the valuelog read/decode hot paths by avoiding []byte interface boxing in sync.Pool and by pooling a fixed-size header scratch buffer.

Changes:

  • Replace small decode-scratch []byte pooling with a typed holder-based pooling strategy to avoid interface boxing allocations.
  • Pool *[HeaderSize]byte for ReadAtWithDictTo header reads to avoid slice escape/allocation in the hot path.
  • Add an allocation regression test for small-buffer decode scratch reuse; adjust mmap-related tests to skip on Windows; loosen a batch-arena budget assertion to account for sync.Pool variability.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Files reviewed:

  • TreeDB/internal/valuelog/reader.go: introduces holder-based scratch pool and header scratch pooling; updates decode API to accept *[HeaderSize]byte
  • TreeDB/internal/valuelog/manager_test.go: skips mmap budget tests on Windows where mmap isn't supported
  • TreeDB/internal/valuelog/decode_scratch_pool_test.go: adds an allocation regression test for small decode scratch pool reuse
  • TreeDB/caching/batch_arena_budget_test.go: relaxes post-drain accounting assertion to tolerate class-sized residuals

