Core: Add JMH benchmarks for Variants #15629
Draft
steveloughran wants to merge 10 commits into apache:main from
Conversation
Fixes apache#15628

Separate benchmarks for
* building
* serializing a prebuilt object
* deserializing

Variables are:
- fields: [1000, 10000]
- depth: [shallow, nested]
- percentage of fields shredded: [0, 33, 67, 100]

Note: the current benchmark does NOT fork the JVM, as that allows for fast iterative development. A final merge should switch to fork(1).

The benchmarks don't show any surprises, which is good.
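The parameter axes above (field count, nesting depth) can be sketched as a data generator. This is an illustrative sketch only; `buildObject` and the field names are hypothetical, not identifiers from the PR:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the data-generation step the benchmarks parameterize
// over. Names here are illustrative, not the PR's actual code.
public class VariantDataSketch {

    // Build a map with `fields` top-level entries; when depth > 1, each value
    // is itself a nested map, mimicking the [shallow, nested] axis.
    static Map<String, Object> buildObject(int fields, int depth) {
        Map<String, Object> obj = new HashMap<>();
        for (int i = 0; i < fields; i++) {
            obj.put("field_" + i,
                depth <= 1 ? "value_" + i : buildObject(1, depth - 1));
        }
        return obj;
    }

    public static void main(String[] args) {
        // The PR benchmarks field counts of 1000 and 10000.
        Map<String, Object> shallow = buildObject(1000, 1);
        Map<String, Object> nested = buildObject(1000, 4);
        System.out.println(shallow.size());
        System.out.println(nested.size());
    }
}
```

A real JMH variant of this would hang the two axes off `@Param` fields and build the object in a `@Setup` method so construction cost can be measured separately from serialization.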
Vectorized parquet read disabled.

Results (1M rows, 10 files, vectorization disabled):

| Benchmark            | Unshredded (s/op) | Shredded (s/op) | Ratio       |
|----------------------|-------------------|-----------------|-------------|
| Full read            | 0.969 ±0.152      | 1.819 ±0.737    | 1.9x slower |
| Projection (id only) | 0.223 ±0.038      | 0.273 ±0.121    | 1.2x slower |
| Filter (category=0)  | 0.864 ±0.413      | 1.574 ±0.164    | 1.8x slower |
| variant_get($.value) | 1.351 ±0.070      | 2.415 ±0.192    | 1.8x slower |

Analysis

1. Projection now works: selecting just id (skipping variant) is ~4x faster than full read, confirming the variant column is the bottleneck.
2. Shredded is consistently ~1.8-1.9x slower for all operations reading variant data. The shredded reader must reconstruct the variant object from multiple Parquet columns (metadata + value + typed_value per field), which currently costs more than reading a single binary blob.
3. Projection gap is small (0.223 vs 0.273s): when skipping the variant column entirely, the shredded table is only slightly slower, due to marginally more metadata/schema overhead.
4. variant_get doesn't exploit shredding: extracting a single field from shredded data (2.415s) is slower than from unshredded (1.351s), meaning the reader isn't short-circuiting to read just the typed Parquet column.
5. Filter provides no file-skipping: category values 0-9 are uniformly distributed across all files, so every file must be read regardless. Filter times are close to full read times minus some row-level filtering benefit.
The key takeaway: the current read path doesn't yet take advantage of shredding optimizations (column pruning within variants, predicate pushdown to typed columns). These benchmarks provide a baseline to measure improvements as those optimizations are implemented.
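The reconstruction cost described in point 2 above can be illustrated in toy form. This is a hedged sketch, not Iceberg's reader: an unshredded variant arrives as one blob per row, while a shredded variant must be reassembled from one typed column per field, allocating a container per row:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch (not Iceberg code) of the structural cost difference
// between unshredded and shredded variant reads.
public class ShreddedReadSketch {

    // Unshredded: the row's whole variant is a single binary value.
    static byte[] readUnshredded(byte[][] blobColumn, int row) {
        return blobColumn[row];
    }

    // Shredded: the reader rebuilds the object field by field from the
    // per-field typed columns, allocating a container per row. This is the
    // reconstruction overhead the benchmarks currently measure.
    static Map<String, Object> readShredded(
            String[] fieldNames, Object[][] typedColumns, int row) {
        Map<String, Object> variant = new HashMap<>();
        for (int f = 0; f < fieldNames.length; f++) {
            variant.put(fieldNames[f], typedColumns[f][row]);
        }
        return variant;
    }

    public static void main(String[] args) {
        String[] names = {"a", "b"};
        Object[][] columns = {{1, 2}, {3, 4}};
        System.out.println(readShredded(names, columns, 0));
    }
}
```

The flip side, which the takeaway points at, is that this same layout lets an optimized reader answer `variant_get` from a single typed column without rebuilding the object at all.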
* Made ParquetVariantUtil public @VisibleForTesting
* Includes SQL query for filtering on a variant field

Dev-setup runs imply shredding is slower all round, at least with test data.
* Deep nesting generates deeply nested structures
* Benchmark cost of construction alone

Shows that there's a penalty for deep objects, inevitably due to the hashtable. But anything else is really complex, and deep nesting is a niche case.
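The hashtable penalty for deep objects can be seen in miniature: every extra nesting level adds one more map allocation on construction and one more hash lookup per access. A minimal illustration, not the benchmark code:

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch of the per-level hashtable cost that deep variant objects pay.
public class DeepNestingSketch {

    // Wrap a value in `depth` levels of single-entry maps: one HashMap
    // allocation (and hashing of the key) per level.
    static Object nest(Object value, int depth) {
        Object current = value;
        for (int i = 0; i < depth; i++) {
            Map<String, Object> level = new HashMap<>();
            level.put("child", current);
            current = level;
        }
        return current;
    }

    // Reaching the leaf costs one hash lookup per level, so access time
    // grows linearly with depth even when the total field count is fixed.
    @SuppressWarnings("unchecked")
    static Object leaf(Object nested, int depth) {
        Object current = nested;
        for (int i = 0; i < depth; i++) {
            current = ((Map<String, Object>) current).get("child");
        }
        return current;
    }

    public static void main(String[] args) {
        System.out.println(leaf(nest("x", 5), 5));
    }
}
```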
Shows least compression, best performance. Now tuning specs for all queries to only return row ID, avoiding reconstruction costs (should boost Parquet performance all round).
Selecting only the ID column restores general Parquet performance.
Queries and data to emphasise the value of columnar data formats:
* variant to add a string based on category (so 20 values)
* always filter on ID before count
* benchmark names tuned to look better in reports
Fixes #15628
core:VariantSerializationBenchmark
Separate benchmarks for
* building
* serializing a prebuilt object
* deserializing

Variables are:
- fields: [1000, 10000]
- depth: [shallow, nested]
- percentage of fields shredded: [0, 33, 67, 100]

The benchmarks don't show any surprises, which is good.
spark-4.1:IcebergSourceVariantReadBenchmark
Generate Avro, unshredded Parquet and shredded Parquet tables with the same variant data, then compare performance for basic filter and project operations against the normal columns and the variant fields.
Key findings: