
Core: Add JMH benchmarks for Variants#15629

Draft
steveloughran wants to merge 10 commits intoapache:mainfrom
steveloughran:pr/benchmark-variant
Conversation

@steveloughran steveloughran commented Mar 13, 2026

Fixes #15628

core: VariantSerializationBenchmark

Separate benchmarks for

  • serializing a prebuilt object
  • deserializing

Variables are:

  • depth: [shallow, nested, deep-nested]
  • percentage of fields shredded: [0, 33, 67, 100]

The benchmarks don't show any surprises, which is good.
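The shape of that parameter space is easy to picture with a small recursive generator. The following is a hedged sketch in plain Java; the class and method names (`VariantDataSketch`, `buildNested`) are illustrative and not the PR's actual code:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of how parameterized nested test data for such a
// benchmark could be generated; not taken from the PR itself.
public class VariantDataSketch {

    // Build a map with `fields` scalar entries; every level except the
    // deepest carries one extra "nested" entry holding the next level,
    // for `depth` levels in total.
    static Map<String, Object> buildNested(int depth, int fields) {
        Map<String, Object> node = new LinkedHashMap<>();
        for (int i = 0; i < fields; i++) {
            node.put("field_" + i, i);
        }
        if (depth > 1) {
            node.put("nested", buildNested(depth - 1, fields));
        }
        return node;
    }
}
```

In the real benchmark the resulting structure would presumably be converted to a Variant before the serialize/deserialize timings, with JMH `@Param` values driving depth and shredding percentage.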

spark-4.1: IcebergSourceVariantReadBenchmark

Generate Avro, unshredded Parquet, and shredded Parquet tables with the same variant data, then compare performance of basic filter and project operations against both the normal columns and the variant fields.

Key findings:

  • Although they have the smallest file size, Parquet files with shredded variants perform significantly worse than unshredded files when working with the variant structs.
  • Avro is fastest for the variant data, though because every operation must read the entire file, operations on the other columns are (as expected) slower.
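As a rough illustration, the access patterns the read benchmark exercises correspond to Spark SQL along these lines (table and column names here are hypothetical; `variant_get` is Spark's built-in variant path extraction):

```sql
-- Project a scalar column only: the variant column can be skipped entirely.
SELECT id FROM variant_table;

-- Extract a single field from the variant column; with shredded Parquet this
-- could in principle read just the typed sub-column, but the current read
-- path reconstructs the whole variant first.
SELECT variant_get(payload, '$.value', 'string') FROM variant_table;

-- Filter on a variant field: no file-level skipping when the values are
-- uniformly distributed across files.
SELECT count(*) FROM variant_table
WHERE variant_get(payload, '$.category', 'int') = 0;
```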

@github-actions github-actions bot added the core label Mar 13, 2026
@steveloughran steveloughran changed the title Add JMH benchmarks for Variants Core: Add JMH benchmarks for Variants Mar 13, 2026
Fixes apache#15628

Separate benchmarks for
* building
* serializing a prebuilt object
* deserializing

Variables are:
 - fields: [1000, 10000]
 - depth: [shallow, nested]
 - percentage of fields shredded: [0, 33, 67, 100]

Note: the current benchmark does NOT fork the JVM, as this allows for fast iterative development.
A final merge should switch to fork(1).
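In JMH terms that switch is a one-annotation change. A fragment only, with illustrative class and method names:

```java
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Fork;

// @Fork(0) runs benchmarks inside the harness JVM: fast to iterate on, but
// results can be polluted by harness state. @Fork(1) isolates each trial in
// a fresh forked JVM, which is what a merged benchmark should use.
@Fork(1)
public class VariantSerializationBenchmark {
    @Benchmark
    public void serializePrebuiltObject() {
        // ... benchmark body elided ...
    }
}
```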

The benchmarks don't show any surprises, which is good.
Vectorized Parquet read is disabled.

Results (1M rows, 10 files, vectorization disabled)

  ┌──────────────────────┬───────────────────┬─────────────────┬─────────────┐
  │      Benchmark       │ Unshredded (s/op) │ Shredded (s/op) │    Ratio    │
  ├──────────────────────┼───────────────────┼─────────────────┼─────────────┤
  │ Full read            │   0.969 ±0.152    │  1.819 ±0.737   │ 1.9x slower │
  ├──────────────────────┼───────────────────┼─────────────────┼─────────────┤
  │ Projection (id only) │   0.223 ±0.038    │  0.273 ±0.121   │ 1.2x slower │
  ├──────────────────────┼───────────────────┼─────────────────┼─────────────┤
  │ Filter (category=0)  │   0.864 ±0.413    │  1.574 ±0.164   │ 1.8x slower │
  ├──────────────────────┼───────────────────┼─────────────────┼─────────────┤
  │ variant_get($.value) │   1.351 ±0.070    │  2.415 ±0.192   │ 1.8x slower │
  └──────────────────────┴───────────────────┴─────────────────┴─────────────┘

  Analysis

  1. Projection now works - selecting just id (skipping variant) is ~4x faster than full read, confirming the variant column is the bottleneck.
  2. Shredded is consistently ~1.8-1.9x slower for all operations reading variant data. The shredded reader must reconstruct the variant object from multiple Parquet columns (metadata +
  value + typed_value per field), which currently costs more than reading a single binary blob.
  3. Projection gap is small (0.223 vs 0.273s) — when skipping the variant column entirely, the shredded table is only slightly slower due to marginally more metadata/schema overhead.
  4. variant_get doesn't exploit shredding — extracting a single field from shredded data (2.415s) is slower than from unshredded (1.351s), meaning the reader isn't short-circuiting to
  read just the typed Parquet column.
  5. Filter provides no file-skipping — category values 0-9 are uniformly distributed across all files, so every file must be read regardless. Filter times are close to full read times
  minus some row-level filtering benefit.

  The key takeaway: the current read path doesn't yet take advantage of shredding optimizations (column pruning within variants, predicate pushdown to typed columns). These benchmarks
  provide a baseline to measure improvements as those optimizations are implemented.
* Made ParquetVariantUtil public @VisibleForTesting
* includes SQL query for filtering on a variant field

Dev setup results imply shredding is slower all round, at least with the test data
* deep nesting generates deeply nested structures
* benchmark cost of construction alone

Shows that there's a penalty for deep objects, inevitably
due to the hashtable.
But anything else is really complex, and deep nesting is niche.
Shows least compression, best performance.
Now tuning specs for all queries to only return the row ID, avoiding reconstruction costs (should boost Parquet performance all round).
Selecting only the ID column restores general Parquet
performance
Queries and data to emphasise value of columnar data formats

* variant to add a string based on category (so 20 values)
* always filter on ID before count
* benchmark names tuned to look better in reports
@steveloughran steveloughran reopened this Mar 24, 2026
