Skip to content

treedb: add template lab tooling and class-aware rewrite observability#854

Open
snissn wants to merge 3 commits intofollowon/vlog-dict-class-split-experimentfrom
followon/template-lab-plan-exec
Open

treedb: add template lab tooling and class-aware rewrite observability#854
snissn wants to merge 3 commits intofollowon/vlog-dict-class-split-experimentfrom
followon/template-lab-plan-exec

Conversation

@snissn
Copy link
Copy Markdown
Owner

@snissn snissn commented Mar 27, 2026

Summary

This PR executes the next phase of the compression investigation plan:

  1. PR1 scope: add rewrite observability for template outcomes by payload class and add corpus extraction tooling.
  2. PR2 scope: add a microbench/sweep harness for template experiments on extracted corpora.
  3. PR3 scope (flagged prototype): add an outer-leaf pre-transform experiment path in the lab harness.

What changed

  • Added class-aware template counters/reason stats to rewrite stats:
    • pointer_value vs outer_leaf attempted/kept/input/output bytes and reason maps.
  • Wired template store/config into rewrite path (TemplateStore in value-log options, side-store wiring).
  • Extended treemap vlog-rewrite output with class-specific template lines.
  • Added template_corpus_extract command:
    • emits pointer_values.bin, outer_leaf_pages.bin, manifest.json.
  • Added template_lab command:
    • sweep over min-savings/fingerprint-k/max-fetch,
    • emits JSON/Markdown reports,
    • supports off|header_v1 outer-leaf pre-transform,
    • supports -disable-mask-templates for anchor-only runs.
  • Added template_seed_train command + README for seeding templatedb from existing app dirs.
  • Added benchmark note updates with reproducible commands/results.

Validation

  • go test ./TreeDB/caching -run 'TestValueLogDictClassRangesForRecords_SplitOuterLeaf|TestValueLogDictClassRangesForRecords_SingleMode' -count=1
  • go test ./TreeDB/db -run 'TestRewriteWriter_TemplatePrepassEncodesBeforeDict|TestRewriteWriter_TemplateOnlyDisablesDictCompression' -count=1
  • go test ./TreeDB/cmd/treemap ./TreeDB/cmd/template_corpus_extract ./TreeDB/cmd/template_lab ./TreeDB/cmd/template_seed_train -count=1
  • go test ./TreeDB -run 'TestVacuumIndexOffline_WithTemplateFrames_WiresTemplateLookup' -count=1

Experiment run (full corpus)

Corpus: /tmp/template_corpus_run_20260327_120046

Sweep artifacts:

  • /tmp/template_lab_run_off_20260327.json
  • /tmp/template_lab_run_header_v1_20260327.json

Observed:

  • Pointer corpus: template kept 20092/42802, encoded bytes improved by ~0.3904%.
  • Outer-leaf corpus: 0 keeps across tested configs.
  • header_v1 pre-transform: no measurable outer-leaf improvement in this pass.

Repro details are documented in docs/benchmarks/CELESTIA_VLOG_COMPRESSION_NOTES_2026-03-26.md.

Copy link
Copy Markdown
Owner Author

snissn commented Mar 27, 2026

Tracking ticket with full experiment details and exact reproduction commands is now filed at: #855

Key outcome from this run:

  • Pointer corpus: ~0.39% encoded-byte reduction (20092/42802 keeps)
  • Outer-leaf corpus: 0 keeps across tested configs, including header_v1 pre-transform prototype

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0f8896fcf8

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +591 to +593
if mode == "mixed" {
prefixLimit = pProbe
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep prefix probe limit independent from pointer override

In mixed mode, the prefix probe branch assigns prefixLimit = pProbe, so -pointer-probe unexpectedly changes how many prefix records are probed. The CLI help says -pointer-probe only overrides pointer probing, so this silently skews probe_attempted/probe_kept and can under/over-sample prefix results whenever -probe and -pointer-probe differ.

Useful? React with 👍 / 👎.

Comment on lines +271 to +273
pointerWriter, err := newCorpusWriter(pointerPath, *overwrite)
if err != nil {
log.Fatalf("open pointer corpus: %v", err)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Honor skip flags before opening corpus output files

The command creates pointer_values.bin and outer_leaf_pages.bin before checking -skip-pointer / -skip-outer-leaf. This means a skipped section can still fail due to an existing file when -overwrite=false, and with overwrite enabled it can truncate a file the user intended to keep. Partial reruns should avoid opening (and thus touching) outputs for skipped datasets.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR advances the template-compression investigation by wiring template store/config into the value-log rewrite path, adding class-aware rewrite observability, and introducing lab tooling to extract corpora and run template sweeps.

Changes:

  • Add template rewrite stats (attempted/kept/input/output + per-class reason counters) and expose them in treemap vlog-rewrite.
  • Wire TemplateStore into rewrite options and side-store lookup plumbing.
  • Add new lab commands: template_corpus_extract, template_lab, and template_seed_train (+ docs/benchmark notes) to support corpus extraction, sweeps, and store seeding.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
docs/benchmarks/CELESTIA_VLOG_COMPRESSION_NOTES_2026-03-26.md Adds reproducible commands/results for the template lab sweep.
TreeDB/vlog_rewrite.go Public treedb rewrite stats extended with template/class counters and reason maps.
TreeDB/db/vlog_rewrite.go Implements template-aware rewrite writer, applies template compression during offline rewrite, and plumbs stats/reason snapshots.
TreeDB/db/db.go Adds ValueLogOptions.TemplateStore to support template encoding (e.g., rewrite prepass).
TreeDB/side_store_lookups.go Wires templatedb lookups and conditionally sets TemplateStore when template mode is enabled.
TreeDB/cmd/treemap/main.go Extends vlog-rewrite command with template flags, templatedb wiring, and class-aware output lines.
TreeDB/db/vlog_rewrite_test.go Adds rewrite writer tests validating template prepass ordering and template-only dict bypass.
TreeDB/caching/db.go Adds per-record dict-class range splitting and appends mixed-class batches via multiple dict IDs.
TreeDB/caching/vlog_dict_classifier_test.go Adds tests for class-range splitting behavior (split outer-leaf vs single mode).
TreeDB/cmd/template_corpus_extract/main.go New command to extract pointer/outer-leaf corpora + manifest.
TreeDB/cmd/template_corpus_extract/README.md Documents corpus extraction output format and usage.
TreeDB/cmd/template_lab/main.go New sweep harness for template configs and optional outer-leaf pretransform experiments.
TreeDB/cmd/template_lab/README.md Documents lab harness input format, flags, and notes.
TreeDB/cmd/template_seed_train/main.go New command to seed/warm templatedb from an existing app dir.
TreeDB/cmd/template_seed_train/README.md Documents seeding usage and behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +224 to +281
func (s *gzipRecordSizer) WriteRecord(payload []byte) error {
if s == nil || s.gw == nil {
return errors.New("nil gzip sizer")
}
if len(payload) > int(^uint32(0)) {
return fmt.Errorf("record too large: %d", len(payload))
}
var hdr [4]byte
binary.LittleEndian.PutUint32(hdr[:], uint32(len(payload)))
if _, err := s.gw.Write(hdr[:]); err != nil {
return err
}
if len(payload) == 0 {
return nil
}
_, err := s.gw.Write(payload)
return err
}

func (s *gzipRecordSizer) Close() (int64, error) {
if s == nil || s.gw == nil {
return 0, nil
}
if err := s.gw.Close(); err != nil {
return 0, err
}
return s.cw.n, nil
}

func readCorpusFile(path string, maxRecords int) ([][]byte, error) {
f, err := os.Open(path)
if err != nil {
return nil, err
}
defer f.Close()
r := bufio.NewReaderSize(f, 1<<20)
out := make([][]byte, 0, 1024)
for {
if maxRecords > 0 && len(out) >= maxRecords {
break
}
var hdr [4]byte
if _, err := io.ReadFull(r, hdr[:]); err != nil {
if errors.Is(err, io.EOF) {
break
}
if errors.Is(err, io.ErrUnexpectedEOF) {
return nil, fmt.Errorf("truncated record header in %s", path)
}
return nil, err
}
n := binary.LittleEndian.Uint32(hdr[:])
buf := make([]byte, int(n))
if n > 0 {
if _, err := io.ReadFull(r, buf); err != nil {
return nil, err
}
}
Copy link

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

template_lab has the same 32-bit overflow issue in WriteRecord: len(payload) > int(^uint32(0)) will wrap on 32-bit platforms and reject all records. Also, readCorpusFile converts the on-disk uint32 length to int without checking n fits in int, which can overflow/panic on 32-bit or when reading corrupted inputs. Prefer uint64 bounds checks before converting/allocating.

Copilot uses AI. Check for mistakes.
Comment on lines +429 to +466
templateDir := filepath.Join(*appDir, "templatedb")
tmplDB, err := treedbdb.Open(treedbdb.Options{Dir: templateDir, DisableBackgroundPrune: true})
if err != nil {
log.Fatalf("open templatedb: %v", err)
}
defer tmplDB.Close()
store := templatedb.New(templateBackendKV{db: tmplDB}, templatedb.Config{})

mainDir := *appDir
if info, statErr := os.Stat(filepath.Join(*appDir, "maindb")); statErr == nil && info.IsDir() {
mainDir = filepath.Join(*appDir, "maindb")
}
backendOpts := treedbdb.Options{
Dir: mainDir,
ReadOnly: true,
DisableBackgroundPrune: true,
}
dictDir := filepath.Join(*appDir, "dictdb")
dictIndex := filepath.Join(dictDir, "index.db")
if info, statErr := os.Stat(dictIndex); statErr == nil && !info.IsDir() {
dictOpts := treedbdb.Options{
Dir: dictDir,
ReadOnly: true,
DisableBackgroundPrune: true,
IndexOuterLeavesInValueLog: false,
}
dictBackend, err := treedbdb.Open(dictOpts)
if err != nil {
log.Fatalf("open dictdb: %v", err)
}
defer dictBackend.Close()
dictStore := dictdb.New(dictBackend)
backendOpts.ValueLog.DictLookup = func(dictID uint64) ([]byte, error) {
return dictStore.GetDictBytes(context.Background(), dictID)
}
}

backend, err := treedbdb.Open(backendOpts)
Copy link

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This command opens templatedb, maindb, and dictdb via treedbdb.Open without applying the persisted format config (e.g., via treedbdb.LoadFormatConfig(...).ApplyToOptions(...)). Other tooling (e.g. treemap) does this to ensure options like chunk size / index format match on-disk settings. Without it, the tool can fail to open or misinterpret a DB created with non-default format config.

Copilot uses AI. Check for mistakes.
Comment on lines +271 to +306
pointerWriter, err := newCorpusWriter(pointerPath, *overwrite)
if err != nil {
log.Fatalf("open pointer corpus: %v", err)
}
defer func() {
_ = pointerWriter.Close()
}()
outerWriter, err := newCorpusWriter(outerLeafPath, *overwrite)
if err != nil {
log.Fatalf("open outer-leaf corpus: %v", err)
}
defer func() {
_ = outerWriter.Close()
}()

pointerScanned := 0
if !*skipPointer {
pointerScanned, err = extractPointerCorpus(backend, pointerWriter, *pointerLimit, *pointerStride)
if err != nil {
log.Fatalf("extract pointer corpus: %v", err)
}
}
if err := pointerWriter.Close(); err != nil {
log.Fatalf("close pointer corpus: %v", err)
}

outerLeafScanned := 0
if !*skipOuterLeaf {
outerLeafScanned, err = extractOuterLeafCorpus(backend, outerWriter, *outerLeafLimit, *outerLeafStride)
if err != nil {
log.Fatalf("extract outer-leaf corpus: %v", err)
}
}
if err := outerWriter.Close(); err != nil {
log.Fatalf("close outer-leaf corpus: %v", err)
}
Copy link

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

corpusWriter instances are both deferred-closed and then explicitly closed later. Since (*corpusWriter).Close is not idempotent (it doesn’t nil out w.buf / w.file after closing), the second close will attempt to flush/close an already-closed file and may return errors (currently ignored in the deferred path). Prefer a single close path (either defer only, or explicit close only), or make Close idempotent by clearing fields after successful close.

Copilot uses AI. Check for mistakes.
Comment on lines 149 to +152
store := templatedb.New(templateBackendKV{db: templateBackend}, templatedb.Config{})
if opts.ValueLog.TemplateMode != template.TemplateOff && opts.ValueLog.TemplateStore == nil {
opts.ValueLog.TemplateStore = store
}
Copy link

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wireSideStoreLookups may set opts.ValueLog.TemplateStore to a templatedb.Store backed by a DB handle opened read-only (see templateOpts.ReadOnly = true earlier in this block). The template engine’s training/publish pipeline calls Store.PutTemplateDef, which will fail with ErrReadOnly, preventing templates from being published and potentially wasting CPU on repeated publish attempts. Consider either opening templatedb read-write when TemplateMode != TemplateOff (and the main opts are not read-only), or wrapping the store with a read-only implementation that disables training/publishing for rewrite-only experiments.

Copilot uses AI. Check for mistakes.
Comment on lines +2205 to +2220
func (db *DB) appendValueLogForRecords(l *lane, records []valuelog.Record, durability journalDurability) ([]page.ValuePtr, error) {
if len(records) == 0 {
return nil, nil
}
ranges := db.valueLogDictClassRangesForRecords(records)
if len(ranges) == 0 {
return nil, nil
}
resolveDictID := func(class vlogDictClass, cache *[vlogDictClassCount]uint64, cached *[vlogDictClassCount]bool) (uint64, error) {
if db == nil || db.dictStore == nil {
return 0, nil
}
classIdx := int(class)
if classIdx < 0 || classIdx >= vlogDictClassCount {
classIdx = int(vlogDictClassSingleValue)
class = vlogDictClassSingleValue
Copy link

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

appendValueLogForRecords always splits the record batch into class ranges, even when db.dictStore == nil (in which case resolveDictID returns 0 for every class). That turns into multiple appendValueLog(..., dictID=0, ...) calls and repeated lane locking/rotation checks with no functional benefit. Consider a fast-path: if db == nil || db.dictStore == nil (or if all resolved dictIDs are 0), call appendValueLog once with the full records slice.

Copilot uses AI. Check for mistakes.
if w == nil || w.buf == nil {
return errors.New("nil corpus writer")
}
if len(payload) > int(^uint32(0)) {
Copy link

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The size check len(payload) > int(^uint32(0)) can overflow on 32-bit platforms because converting ^uint32(0) to int wraps to -1, making the condition always true. Use a uint-based comparison instead (e.g., compare uint64(len(payload)) to uint64(^uint32(0))) and similarly guard any uint32int conversions.

Suggested change
if len(payload) > int(^uint32(0)) {
if uint64(len(payload)) > uint64(^uint32(0)) {

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Owner Author

snissn commented Mar 27, 2026

Follow-up update pushed in 94055dff:

  • Fixed template mask sparse-encode correctness bug in TreeDB/template/match.go (removed in-place mutation of input values).
  • Added regression test: TestEncodeMaskTemplateSparse_DoesNotMutateInput.
  • Switched template_lab default to mask-enabled mode (-disable-mask-templates=false).
  • Re-ran full corpus sweeps with mask mode enabled:
    • /tmp/template_lab_run_mask_on_off_20260327.json
    • /tmp/template_lab_run_mask_on_header_v1_20260327.json

Key result (pointer corpus):

  • Best row in off run: tmpl_ms4_fk8_fetch16
  • Encoded bytes 132,634,951 vs raw 880,146,915 (~84.93% reduction)
  • Encoded gzip 19,144,812 vs raw gzip 428,482,010

Outer-leaf remained unchanged (0 keeps across tested configs), including header_v1 pre-transform.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 94055dff59

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

decoded = payload
}
decodeNS += time.Since(decodeStart).Nanoseconds()
recovered, err := reverseOuterLeafPretransform(decoded, cfg.OuterLeafPretransform)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Restrict pretransform reversal to outer-leaf datasets

runTemplateLab always calls reverseOuterLeafPretransform whenever -outer-leaf-pretransform=header_v1, but transformValue only applies that transform for outer_leaf datasets. On pointer runs this can fail with pretransform: payload too short for valid small values (or wrongly reinterpret payloads that happen to start with the magic prefix), which makes pointer/both sweeps brittle under a flag that is supposed to be outer-leaf-only.

Useful? React with 👍 / 👎.

Comment on lines +537 to +538
if mode == "mixed" {
l = pLimit
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep prefix training limit independent from pointer override

In mixed mode, prefix training is capped by pLimit, so -pointer-limit also shrinks the prefix training pass. That contradicts the flag contract (-pointer-limit is documented as a pointer-source override) and can silently under-train prefix data, skewing seed behavior and summary counters when users tune pointer limits independently.

Useful? React with 👍 / 👎.

Copilot AI review requested due to automatic review settings March 27, 2026 22:53
Copy link
Copy Markdown
Owner Author

snissn commented Mar 27, 2026

Follow-up pushed in 3c6fbfd2:

What was added

  • New outer-leaf pretransform mode in template_lab:
    • header_dir_delta_v1
    • Carries header side-bytes (PageID+Checksum) and delta-normalizes columnar-prefix directory metadata (key/val/prefix dirs) in-place.
  • New round-trip tests in TreeDB/cmd/template_lab/main_test.go to prove lossless reverse transform.
  • New template_lab tuning flags for training/cold-search investigation:
    • -template-train-sample-stride
    • -template-synthesize-every
    • -template-min-anchor-freq
    • -template-min-presence-ratio
    • -template-min-publish-savings
    • -template-min-publish-ratio
    • -template-cold-search-after
    • -template-cold-search-probe-every

Experiments

Corpus: /tmp/template_corpus_run_20260327_120046

  1. Baseline header_dir_delta_v1 outer-leaf sweep (20k pages):
  • /tmp/template_lab_outer_header_dir_delta_20k.json
  • Keeps: 0 across all rows.
  1. Aggressive synthesis/routing sweep (20k pages):
  • /tmp/template_lab_outer_header_dir_delta_20k_aggr.json
  • Keeps: 0 across all rows.
  • Reasons: tmpl_no_candidates=40000, templates_published=0.
  1. Coarser fingerprint check (k=1) still produced 0 keeps.

Conclusion

Outer-leaf pages remain non-templateable in this path even with directory normalization + aggressive template training knobs. This further supports pivoting outer-leaf optimization toward non-template mechanisms (block/dict/codec path), while template remains high-impact for pointer-value payloads.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3c6fbfd238

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +441 to +445
backendOpts := treedbdb.Options{
Dir: mainDir,
ReadOnly: true,
DisableBackgroundPrune: true,
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Wire template lookup before scanning backend values

The seeding tool opens templatedb but never sets backendOpts.ValueLog.TemplateLookup for the main backend, so reads depend on raw value-log payloads only. On any database that already contains template-encoded values, iterator materialization (ValueCopy) can hit missing-template decode errors and fail the training/probe pass instead of reading real values. This path should pass a template lookup (and decode options) from the opened template store before treedbdb.Open(backendOpts).

Useful? React with 👍 / 👎.

Comment on lines +780 to +784
if manifest.Pointer.File != "" {
*pointerFile = filepath.Join(*corpusDir, manifest.Pointer.File)
}
if manifest.OuterLeaf.File != "" {
*outerLeafFile = filepath.Join(*corpusDir, manifest.OuterLeaf.File)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve explicit dataset file flags when manifest exists

When -corpus-dir is set and manifest.json is present, the code unconditionally overwrites -pointer-file / -outer-leaf-file with manifest paths. This silently ignores explicit CLI file selections and can run experiments against the wrong corpus files, skewing reported results. Manifest-derived paths should be used only as fallbacks when the corresponding flag was not provided.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 6 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +203 to +208
if len(payload) < len(magic)+12 {
return nil, false, errors.New("pretransform: payload too short")
}
if !bytes.Equal(payload[:len(magic)], magic) {
return payload, false, nil
}
Copy link

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reverseHeaderSideTransform returns an error for untransformed payloads that are simply shorter than len(magic)+12, because it checks length before confirming the magic prefix. This makes the reverse path overly strict and can fail round-trips if shorter payloads are ever passed through (even when not transformed). Reorder the checks so you first verify whether the payload has the magic prefix (or at least has len(magic) bytes) and only enforce the +12 header requirement when the magic matches.

Suggested change
if len(payload) < len(magic)+12 {
return nil, false, errors.New("pretransform: payload too short")
}
if !bytes.Equal(payload[:len(magic)], magic) {
return payload, false, nil
}
// If the payload is shorter than the magic, it cannot be transformed; pass through.
if len(payload) < len(magic) {
return payload, false, nil
}
// If the magic prefix does not match, treat as untransformed and pass through.
if !bytes.Equal(payload[:len(magic)], magic) {
return payload, false, nil
}
// At this point the payload is transformed; enforce the full header length requirement.
if len(payload) < len(magic)+12 {
return nil, false, errors.New("pretransform: payload too short")
}

Copilot uses AI. Check for mistakes.
Comment on lines +703 to +718
templateDir := filepath.Join(rootDir, "templatedb")
templateIndexPath := filepath.Join(templateDir, "index.db")
if _, err := os.Stat(templateIndexPath); err == nil {
templateReadOnly := templateMode == template.TemplateOff
templateOpts := treedbdb.Options{Dir: templateDir, ReadOnly: templateReadOnly}
applyPersistedFormatConfig(templateDir, &templateOpts)
templateOpts.DisableBackgroundPrune = true
templateOpts.ValueLog.Compression = treedbdb.ValueLogCompressionOff
templateOpts.ValueLog.TemplateMode = template.TemplateOff
templateOpts.ValueLog.TemplateLookup = nil
templateOpts.ValueLog.TemplateStore = nil

templateBackend, err := treedbdb.Open(templateOpts)
if err != nil {
fatalf("templatedb open: %v", err)
}
Copy link

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

treemap vlog-rewrite opens templatedb directly, but treedb.ValueLogRewriteOffline (TreeDB-level) also calls wireSideStoreLookups, which (per this PR) opens/wires side stores including templatedb. This duplication can cause double-open behavior and potential lock/handle contention (especially when template-mode is not off and this path opens templatedb read-write). Consider removing the explicit templatedb open/wiring in treemap and relying on wireSideStoreLookups to populate TemplateLookup/TemplateStore, or explicitly disabling side-store wiring when treemap takes ownership of templatedb wiring.

Copilot uses AI. Check for mistakes.
Comment on lines +429 to +433
templateDir := filepath.Join(*appDir, "templatedb")
tmplDB, err := treedbdb.Open(treedbdb.Options{Dir: templateDir, DisableBackgroundPrune: true})
if err != nil {
log.Fatalf("open templatedb: %v", err)
}
Copy link

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This command opens existing on-disk DBs (templatedb, and the main backend) without first applying the persisted format config (contrast with treemap’s applyPersistedFormatConfig). For safety/consistency, load and apply the format config for each opened DB directory (or explicitly set IgnoreFormatConfig when intended) so the tool interprets on-disk data using the persisted knobs.

Copilot uses AI. Check for mistakes.
Comment on lines +441 to +445
backendOpts := treedbdb.Options{
Dir: mainDir,
ReadOnly: true,
DisableBackgroundPrune: true,
}
Copy link

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This command opens existing on-disk DBs (templatedb, and the main backend) without first applying the persisted format config (contrast with treemap’s applyPersistedFormatConfig). For safety/consistency, load and apply the format config for each opened DB directory (or explicitly set IgnoreFormatConfig when intended) so the tool interprets on-disk data using the persisted knobs.

Copilot uses AI. Check for mistakes.
return 0, errors.New("value-log reader unavailable")
}

seen := make(map[outerLeafKey]struct{}, 1<<20)
Copy link

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Preallocating seen to 1<<20 entries can cause a large up-front memory spike even for small runs (e.g., low -outer-leaf-limit). Consider sizing the initial capacity based on the effective limit (e.g., min(limit, 1<<20)), or starting smaller and letting the map grow to reduce peak memory footprint for typical invocations.

Suggested change
seen := make(map[outerLeafKey]struct{}, 1<<20)
initialCap := 1 << 10
if limit > 0 {
if limit < (1 << 20) {
initialCap = limit
} else {
initialCap = 1 << 20
}
}
seen := make(map[outerLeafKey]struct{}, initialCap)

Copilot uses AI. Check for mistakes.

func newGzipRecordSizer() *gzipRecordSizer {
cw := &countingWriter{}
gw, _ := gzip.NewWriterLevel(cw, gzip.BestSpeed)
Copy link

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gzip.NewWriterLevel returns an error (even if currently unlikely with gzip.BestSpeed). Ignoring it makes failures harder to diagnose if the level ever becomes configurable or changed. Handle the error (e.g., fall back to gzip.NewWriter or return an error from newGzipRecordSizer) so the harness fails explicitly when gzip initialization fails.

Suggested change
gw, _ := gzip.NewWriterLevel(cw, gzip.BestSpeed)
gw, err := gzip.NewWriterLevel(cw, gzip.BestSpeed)
if err != nil {
log.Printf("failed to create gzip writer with level %d: %v; falling back to default level", gzip.BestSpeed, err)
gw = gzip.NewWriter(cw)
}

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants