fix: clustered by virtual columns that depended on virtual columns now correctly preserve these dependencies by clintropolis · Pull Request #19262 · apache/druid

clintropolis · 2026-04-03T09:28:08Z

Description

This PR fixes an issue when clustering an MSQ insert/replace by a virtual column that depends on another virtual column. For example, LOWER(JSON_VALUE(obj, '$.a')) plans to an ExpressionVirtualColumn that references a NestedFieldVirtualColumn, and the resulting `DimensionRangeShardSpec` was missing the dependent virtual columns. Because the shard spec had incomplete virtual column context, this broke segment pruning for those queries, and it broke compaction for the same reason.

It also fixes an issue with virtual column equivalence in the same case, where a virtual column depends on another virtual column. Virtual columns whose virtual column dependencies are equivalent can now be considered equivalent: the virtual column is rewritten to use the equivalent inputs before the equivalence test.
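The rewrite-then-compare idea can be illustrated with a toy model. This is a minimal sketch, not Druid code: `ToyColumn`, `rewriteInputs`, `sameComputation`, and this `findEquivalent` are hypothetical stand-ins invented for illustration, and the real `VirtualColumns.findEquivalent` may differ in signature and behavior.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class EquivalenceSketch
{
  /** Toy stand-in for a virtual column: output name, computation label, and input column names. */
  record ToyColumn(String name, String fn, List<String> inputs)
  {
    /** Copy of this column whose inputs are renamed per the map; unmapped inputs are kept as-is. */
    ToyColumn rewriteInputs(Map<String, String> renames)
    {
      return new ToyColumn(name, fn, inputs.stream().map(i -> renames.getOrDefault(i, i)).toList());
    }

    /** Same computation over the same inputs; the output name is ignored. */
    boolean sameComputation(ToyColumn other)
    {
      return fn.equals(other.fn) && inputs.equals(other.inputs);
    }
  }

  /**
   * Returns the name of the column in 'target' that is equivalent to 'candidate', or null if none.
   * Inputs of 'candidate' that are themselves virtual are looked up in 'source' and resolved first.
   * Assumes the dependency graph is acyclic.
   */
  static String findEquivalent(Map<String, ToyColumn> target, Map<String, ToyColumn> source, ToyColumn candidate)
  {
    Map<String, String> renames = new HashMap<>();
    for (String input : candidate.inputs()) {
      ToyColumn dependency = source.get(input);
      if (dependency != null) {
        String equivalentName = findEquivalent(target, source, dependency);
        if (equivalentName == null) {
          return null; // a dependency has no counterpart, so the candidate cannot match
        }
        renames.put(input, equivalentName);
      }
    }
    // Compare only after the inputs have been renamed into the target's naming context.
    ToyColumn rewritten = candidate.rewriteInputs(renames);
    for (ToyColumn column : target.values()) {
      if (column.sameComputation(rewritten)) {
        return column.name();
      }
    }
    return null;
  }
}
```

In this model, a shard-side `lower` over `v0` matches a query-side `lower` over `v1` only when `v0` and `v1` themselves resolve to the same computation, which mirrors the case described above.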

changes:

* adds `addRequiredVirtualColumns` method to `SegmentGenerationStageSpec` which resolves transitive virtual column dependencies for virtual columns used by clustering, fixing a bug where these dependent virtual columns would be lost in the shard spec and compaction state
* adds `supportsRequiredRewrite` and `rewriteRequiredColumns` to `VirtualColumn` allowing a virtual column to rewrite its input references to equivalent names
* adds `Expr.rewriteBindings` to rewrite identifier bindings in an `Expr` tree
* `VirtualColumns.findEquivalent` is enhanced to transitively resolve dependent virtual columns across naming contexts before checking equivalence, enabling detection that e.g. `lower("v1")` ≡ `lower("v0")` when v0 and v1 are equivalent virtual columns
* `FilterSegmentPruner` updated to use transitive equivalence when matching shard virtual columns to query virtual columns (with Optional-based caching to correctly handle nulls)
* `Projections.matchQueryVirtualColumn` updated similarly
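To make the first item concrete, here is a minimal sketch of resolving transitive virtual column dependencies for a set of clustering columns. It is not the actual `addRequiredVirtualColumns` implementation: the class name, method name, and the use of a plain name-to-inputs map are assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class TransitiveDependencySketch
{
  /**
   * Collects every virtual column needed to evaluate the clustering columns.
   * 'virtualColumnInputs' maps a virtual column name to the columns it reads;
   * any name that is not a key of the map is treated as a physical column.
   */
  static Set<String> requiredVirtualColumns(List<String> clusterBy, Map<String, List<String>> virtualColumnInputs)
  {
    Set<String> required = new LinkedHashSet<>();
    Deque<String> pending = new ArrayDeque<>(clusterBy);
    while (!pending.isEmpty()) {
      String column = pending.pop();
      List<String> inputs = virtualColumnInputs.get(column);
      // Only virtual columns are recorded; add() returning false also guards against revisiting.
      if (inputs != null && required.add(column)) {
        pending.addAll(inputs);
      }
    }
    return required;
  }
}
```

With `v1` defined over `v0` and `v0` defined over the physical column `obj`, clustering by `v1` yields both `v1` and `v0`. Keeping only `v1` corresponds to the incomplete shard spec context this PR describes.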


gianm · 2026-04-07T05:07:20Z

processing/src/main/java/org/apache/druid/segment/VirtualColumn.java

  List<String> requiredColumns();

+
+  default boolean supportsRequiredRewrite()


Somewhat obvious what this means, but it would still be good to have a javadoc.

gianm · 2026-04-07T05:08:20Z

processing/src/main/java/org/apache/druid/segment/VirtualColumn.java

+
+  /**
+   * Return a copy of this virtual column that is identical to this virtual column except that it operates on different
+   * columns, based on a renaming map where the key is the column to be renamed and the value is the new column.


Useful to say that not all virtual columns implement this, and callers must check supportsRequiredRewrite.
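The contract these two review comments ask to document might look like the following. This is only a sketch under stated assumptions: `ToyVirtualColumn`, `Lower`, `Opaque`, and `rewriteIfSupported` are hypothetical, the parameter type of `rewriteRequiredColumns` is guessed from the javadoc fragment above, and throwing by default is an assumption rather than something the diff shows.

```java
import java.util.List;
import java.util.Map;

final class RewriteContractSketch
{
  interface ToyVirtualColumn
  {
    String outputName();

    List<String> requiredColumns();

    /** Whether rewriteRequiredColumns may be called; false unless an implementation opts in. */
    default boolean supportsRequiredRewrite()
    {
      return false;
    }

    /**
     * Copy of this column that reads from renamed inputs (key: column to rename, value: new column).
     * Not every implementation supports this; callers must check supportsRequiredRewrite() first.
     */
    default ToyVirtualColumn rewriteRequiredColumns(Map<String, String> renames)
    {
      // Assumption for this sketch: unsupported implementations fail loudly.
      throw new UnsupportedOperationException(outputName() + " does not support rewriting");
    }
  }

  /** A column that opts in to rewriting. */
  record Lower(String outputName, String input) implements ToyVirtualColumn
  {
    @Override
    public List<String> requiredColumns()
    {
      return List.of(input);
    }

    @Override
    public boolean supportsRequiredRewrite()
    {
      return true;
    }

    @Override
    public ToyVirtualColumn rewriteRequiredColumns(Map<String, String> renames)
    {
      return new Lower(outputName, renames.getOrDefault(input, input));
    }
  }

  /** A column that keeps the defaults and therefore cannot be rewritten. */
  record Opaque(String outputName, String input) implements ToyVirtualColumn
  {
    @Override
    public List<String> requiredColumns()
    {
      return List.of(input);
    }
  }

  /** Caller-side pattern: check support before rewriting, otherwise keep the column unchanged. */
  static ToyVirtualColumn rewriteIfSupported(ToyVirtualColumn column, Map<String, String> renames)
  {
    return column.supportsRequiredRewrite() ? column.rewriteRequiredColumns(renames) : column;
  }
}
```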

gianm · 2026-04-07T15:56:34Z

processing/src/main/java/org/apache/druid/query/filter/FilterSegmentPruner.java

+  private VirtualColumn getQueryEquivalent(VirtualColumns shardVirtualColumns, VirtualColumn shardVirtualColumn)
+  {
+    final Optional<VirtualColumn> cached = shardEquivalenceCache.computeIfAbsent(
+        shardVirtualColumn,


Is it ok that the key is just shardVirtualColumn even though the compute function uses shardVirtualColumns as well? Can this be called with two different shardVirtualColumns for the same FilterSegmentPruner object and get some incorrect cache hits?

Ah, good catch, no, it isn't OK. Fixed by making the cache key an internal class that captures the tree structure, so that the key is the exact virtual column structure and can't be tricked by identical names.

I'm kind of wondering if it would be worth making VirtualColumns store virtual columns wrapped like this so that we can make operations like this a bit cheaper, but I don't think I want to make a change like that in this PR.

Added some segments to the test that have the same names but different json_value expressions; these failed before this fix.
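The caching problem and the shape of the fix can be sketched as follows. This is an illustration only: `ToyColumn`, `StructuralKey`, and `lookup` are hypothetical names, and the internal key class actually added to `FilterSegmentPruner` is not shown in this thread. The sketch also shows why the cached values are `Optional`: `Map.computeIfAbsent` records no mapping when the function returns null, so a failed lookup would otherwise be recomputed every time.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

final class StructuralCacheKeySketch
{
  record ToyColumn(String name, String expression, List<String> inputs) {}

  /** Cache key that captures a column's expression plus the full structure of its inputs. */
  record StructuralKey(String label, List<StructuralKey> inputs)
  {
    /** Builds the key by resolving virtual inputs in 'context'; assumes no dependency cycles. */
    static StructuralKey of(ToyColumn column, Map<String, ToyColumn> context)
    {
      List<StructuralKey> resolved = column.inputs().stream()
          .map(input -> context.containsKey(input)
                        ? of(context.get(input), context)
                        : new StructuralKey(input, List.of())) // physical column leaf
          .toList();
      return new StructuralKey(column.expression(), resolved);
    }
  }

  private final Map<StructuralKey, Optional<String>> cache = new HashMap<>();

  /** Number of times the resolver actually ran; exposed only to make cache hits observable. */
  int computations = 0;

  /** Resolves 'column' through 'resolver' at most once per distinct structure, caching misses too. */
  Optional<String> lookup(ToyColumn column, Map<String, ToyColumn> context, Function<StructuralKey, String> resolver)
  {
    return cache.computeIfAbsent(StructuralKey.of(column, context), key -> {
      computations++;
      return Optional.ofNullable(resolver.apply(key)); // Optional.empty() caches "no equivalent"
    });
  }
}
```

Two shards that both define `v1` as `lower` over `v0`, but with different `json_value` paths behind `v0`, produce equal `ToyColumn` values for `v1` and yet different `StructuralKey` values, so neither can receive the other's cached answer.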


clintropolis added the Bug label Apr 3, 2026

github-actions bot added the Area - Batch Ingestion, Area - Segment Format and Ser/De, and Area - MSQ (For multi stage queries - https://github.com/apache/druid/issues/12262) labels Apr 3, 2026

fix test

9cfb313

gianm reviewed Apr 7, 2026

clintropolis added 2 commits April 7, 2026 12:56

fix

b0e469f

intern range shardspec dimension strings and virtual columns

af042ed

clintropolis added this to the 37.0.0 milestone Apr 8, 2026

gianm approved these changes Apr 8, 2026

clintropolis merged commit a0ec76e into apache:master Apr 8, 2026
63 of 64 checks passed

clintropolis deleted the fix-clustered-by-virtual-column branch April 8, 2026 22:41

clintropolis mentioned this pull request Apr 8, 2026

[Backport] fix: clustered by virtual columns that depended on virtual columns now correctly preserve these dependencies #19279

Open

clintropolis merged 4 commits into apache:master from clintropolis:fix-clustered-by-virtual-column
