fix(storage): inflated block size estimation caused by shared string buffers #19657

Merged
zhyass merged 5 commits into databendlabs:main from zhyass:feat_stream
Apr 2, 2026

zhyass commented Apr 2, 2026

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

This PR fixes block size estimation for variable-length data in two places:

  1. DataBlock::estimate_block_size() now uses the GC-aware memory size path, so block statistics reflect the effective column footprint after compaction instead of inflated shared-buffer memory.
  2. estimated_scalar_repeat_size() is corrected for const variable-length values, so repeated String / Array / Map scalars are no longer underestimated compared with materialized columns.
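
The first point can be illustrated with a minimal sketch, assuming a simplified view-backed string column in which each value is a slice into a shared, reference-counted buffer. The types `View` and `ViewColumn` and the methods `memory_size_shared` and `memory_size_after_gc` are hypothetical names for illustration, not Databend's actual API, and per-view metadata is ignored.

```rust
use std::collections::HashSet;
use std::sync::Arc;

/// One string value: a slice into a shared, reference-counted buffer.
struct View {
    buffer: Arc<Vec<u8>>,
    offset: usize,
    len: usize,
}

impl View {
    /// The bytes this view actually references.
    fn bytes(&self) -> &[u8] {
        &self.buffer[self.offset..self.offset + self.len]
    }
}

/// Hypothetical view-backed string column (not Databend's real type).
struct ViewColumn {
    views: Vec<View>,
}

impl ViewColumn {
    /// Non-GC path: charge every distinct backing buffer in full,
    /// even when the views reference only a few bytes of it.
    fn memory_size_shared(&self) -> usize {
        let mut seen = HashSet::new();
        let mut total = 0;
        for v in &self.views {
            if seen.insert(Arc::as_ptr(&v.buffer)) {
                total += v.buffer.len();
            }
        }
        total
    }

    /// GC-aware path: charge only the bytes the views reference,
    /// i.e. what would remain after compacting the buffers.
    fn memory_size_after_gc(&self) -> usize {
        self.views.iter().map(|v| v.bytes().len()).sum()
    }
}

/// Two 10-byte views into a 1,000,000-byte buffer, as might be left
/// behind after a selective filter or a slice of a larger column.
fn sample_column() -> ViewColumn {
    let buffer = Arc::new(vec![b'x'; 1_000_000]);
    ViewColumn {
        views: vec![
            View { buffer: Arc::clone(&buffer), offset: 0, len: 10 },
            View { buffer: Arc::clone(&buffer), offset: 500, len: 10 },
        ],
    }
}

fn main() {
    let col = sample_column();
    println!("shared-buffer size: {}", col.memory_size_shared());
    println!("size after gc:      {}", col.memory_size_after_gc());
}
```

In this toy model the shared-buffer path reports the full buffer while the GC-aware path reports only the referenced bytes, which is the kind of gap the summary describes for view-backed columns.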

It addresses two issues:

  1. estimated_scalar_repeat_size() could significantly underestimate const variable-length values.
  2. estimate_block_size() used a non-GC memory size path, which could overestimate block size for view-backed columns.
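
The const-scalar issue can be sketched in the same spirit. The snippet below assumes a fixed per-row bookkeeping cost (`PER_ROW_OVERHEAD`, an illustrative value) and uses hypothetical helper names; it is not the actual `estimated_scalar_repeat_size()` code. It only shows the property the fix aims for: the estimate for a string scalar repeated `n` times should match the size of the equivalent materialized column. The `payload_once_size` variant is one illustrative way an estimate can come out too low and is not claimed to be the original bug.

```rust
/// Assumed per-row bookkeeping cost in bytes (illustrative value only).
const PER_ROW_OVERHEAD: usize = 16;

/// Size of a fully materialized string column: every row pays for
/// its own payload plus the per-row overhead.
fn materialized_size(rows: &[String]) -> usize {
    rows.iter().map(|s| s.len() + PER_ROW_OVERHEAD).sum()
}

/// Estimate for a const string scalar repeated `n` times that agrees
/// with the materialized column.
fn repeat_size(value: &str, n: usize) -> usize {
    n * (value.len() + PER_ROW_OVERHEAD)
}

/// An underestimating variant: the payload is charged once instead
/// of once per row.
fn payload_once_size(value: &str, n: usize) -> usize {
    value.len() + n * PER_ROW_OVERHEAD
}

fn main() {
    let value = "x".repeat(500);
    let n = 1000;
    let column = vec![value.clone(); n];

    println!("materialized:  {}", materialized_size(&column));
    println!("repeat size:   {}", repeat_size(&value, n));
    println!("payload once:  {}", payload_once_size(&value, n));
}
```

With a 500-byte value and 1000 rows, the per-row estimate equals the materialized size, while the payload-once variant is many times smaller.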

Together, these changes make block_size and bytes_uncompressed more stable across insert, recluster, and readback.

After fix

root@localhost:8000/default/default> create or replace table t (
  id int,
  k int,
  s string
) cluster by(k);

root@localhost:8000/default/default> insert into t
select
  number as id,
  number % 10 as k,
  repeat('x', 500000) as s
from numbers(2000);
╭─────────────────────────╮
│ number of rows inserted │
│          UInt64         │
├─────────────────────────┤
│                    2000 │
╰─────────────────────────╯
2000 rows written in 2.576 sec. Processed 2 thousand rows, 953.72 MiB (776.4 rows/s, 370.23 MiB/s)

root@localhost:8000/default/default> insert into t
select
  2000 + number as id,
  number % 10 as k,
  concat('small_', number::string) as s
from numbers(200);
╭─────────────────────────╮
│ number of rows inserted │
│          UInt64         │
├─────────────────────────┤
│                     200 │
╰─────────────────────────╯
200 rows written in 0.154 sec. Processed 200 rows, 6.41 KiB (1.3 thousand rows/s, 41.63 KiB/s)

root@localhost:8000/default/default> alter table t recluster final;
2200 rows written in 4.720 sec. Processed 2.2 thousand rows, 953.73 MiB (466.1 rows/s, 202.06 MiB/s)

root@localhost:8000/default/default> select snapshot_id, block_count, row_count,bytes_uncompressed,bytes_compressed from fuse_snapshot('default','t') limit 2;

╭────────────────────────────────────────────────────────────────────────────────────────────────────╮
│            snapshot_id           │ block_count │ row_count │ bytes_uncompressed │ bytes_compressed │
│              String              │    UInt64   │   UInt64  │       UInt64       │      UInt64      │
├──────────────────────────────────┼─────────────┼───────────┼────────────────────┼──────────────────┤
│ 019d4d39b6a57f9aa6feb8f1d9b9519b │           9 │      2200 │         1000055336 │            19265 │
│ 019d4d399a887e52b7cb4b1a3358a21a │          10 │      2200 │         1000055330 │            13501 │
╰────────────────────────────────────────────────────────────────────────────────────────────────────╯
2 rows read in 0.055 sec. Processed 2 rows, 539 B (36.36 rows/s, 9.57 KiB/s)
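
As a rough sanity check on the session above, the raw string payload of the two inserts can be computed directly from the SQL. This is a back-of-the-envelope sketch with hypothetical helper names; it counts only string bytes and ignores the integer columns and any per-row overhead, which presumably account for the small difference from the 953.72 MiB reported by the first insert.

```rust
/// Payload bytes of the first insert: `rows` strings of `len` bytes each,
/// matching `repeat('x', 500000)` over `numbers(2000)`.
fn large_payload_bytes(rows: u64, len: u64) -> u64 {
    rows * len
}

/// Payload bytes of `concat('small_', number::string)` for number in 0..rows.
fn small_payload_bytes(rows: u64) -> u64 {
    (0..rows).map(|n| format!("small_{n}").len() as u64).sum()
}

/// Convert bytes to MiB (1 MiB = 1024 * 1024 bytes).
fn to_mib(bytes: u64) -> f64 {
    bytes as f64 / (1024.0 * 1024.0)
}

fn main() {
    let large = large_payload_bytes(2000, 500_000);
    let small = small_payload_bytes(200);
    println!("large strings: {large} bytes ({:.2} MiB)", to_mib(large));
    println!("small strings: {small} bytes");
}
```

The large strings alone come to 1,000,000,000 bytes, about 953.67 MiB, so an uncompressed size close to that figure is what a stable estimate should report for this table.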

Fixes #19658

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):

@github-actions github-actions bot added the pr-bugfix label (this PR patches a bug in codebase) Apr 2, 2026
@zhyass zhyass requested review from dantengsky and sundy-li April 2, 2026 04:32

zhyass commented Apr 2, 2026

@codex review

@zhyass zhyass requested a review from youngsofun April 2, 2026 05:03

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5921fdbfc6

@zhyass zhyass marked this pull request as draft April 2, 2026 05:37
@zhyass zhyass marked this pull request as ready for review April 2, 2026 05:37
@zhyass zhyass marked this pull request as draft April 2, 2026 09:58
@zhyass zhyass marked this pull request as ready for review April 2, 2026 10:10
@zhyass zhyass added the ci-benchmark label (Benchmark: run all test) Apr 2, 2026

github-actions bot commented Apr 2, 2026

Docker Image for PR

  • tag: pr-19657-06b8de6-1775130843

note: this image tag is only available for internal use.

@zhyass zhyass merged commit 761f4f9 into databendlabs:main Apr 2, 2026
197 of 200 checks passed

Development

Successfully merging this pull request may close these issues.

bug: inflated block size estimation by string

3 participants