Skip to content

fix: clustered by virtual columns that depended on virtual columns now correctly preserve these dependencies#19262

Merged
clintropolis merged 4 commits intoapache:masterfrom
clintropolis:fix-clustered-by-virtual-column
Apr 8, 2026
Merged

fix: clustered by virtual columns that depended on virtual columns now correctly preserve these dependencies#19262
clintropolis merged 4 commits intoapache:masterfrom
clintropolis:fix-clustered-by-virtual-column

Conversation

@clintropolis
Copy link
Copy Markdown
Member

Description

This PR fixes an issue when clustering MSQ insert/replace by a virtual column that depends on another virtual column e.g. something like LOWER(JSON_VALUE(obj, '$.a')) which plans to an ExpressionVirtualColumn referencing a NestedFieldVirtualColumn, the resulting `DimensionRangeShardSpec was missing the dependent virtual columns. This broke segment pruning for those queries since the shard spec had incomplete virtual column context, and compaction for the same reason.

Also, it fixes an issue with virtual column equivalence for the same case, where a virtual column depends on another virtual column, by allowing virtual columns with equivalent virtual column dependencies to be considered equivalent by rewriting the virtual column to use the equivalent inputs before testing.

changes:

  • adds addRequiredVirtualColumns method to SegmentGenerationStageSpec which resolves transitive virtual column dependencies for virtual columns used by clustering, fixing a bug where these dependent virtual columns would be lost in the shard spec and compaction state
  • adds supportsRequiredRewrite and rewriteRequiredColumns to VirtualColumn allowing a virtual column to rewrite its input references to equivalent names
  • adds Expr.rewriteBindings to rewrite identifier bindings in an Expr tree
  • VirtualColumns.findEquivalent is enhanced to transitively resolve dependent virtual columns across naming contexts before checking equivalence, enabling detection that e.g. lower("v1")lower("v0") when v0 and v1 are equivalent virtual columns
  • FilterSegmentPruner updated to use transitive equivalence when matching shard virtual columns to query virtual columns (with Optional-based caching to correctly handle nulls)
  • Projections.matchQueryVirtualColumn updated similarly

…w correctly preserve these dependencies

changes:
* adds `addRequiredVirtualColumns` method to `SegmentGenerationStageSpec` which resolves transitive virtual column dependencies for virtual columns used by clustering, fixing a bug where these dependent virtual columns would be lost in the shard spec and compaction state
* adds `supportsRequiredRewrite` and `rewriteRequiredColumns` to `VirtualColumn` allowing a virtual column to rewrite its input references to equivalent names
* adds `Expr.rewriteBindings` to rewrite identifier bindings in an `Expr` tree
* `VirtualColumns.findEquivalent` is enhanced to transitively resolve dependent virtual columns across naming contexts before checking equivalence, enabling detection that e.g. `lower("v1")` ≡ `lower("v0")` when v0 and v1 are equivalent virtual columns
* `FilterSegmentPruner` updated to use transitive equivalence when matching shard virtual columns to query virtual columns (with Optional-based caching to correctly handle nulls)
* `Projections.matchQueryVirtualColumn` updated similarly
@github-actions github-actions bot added Area - Batch Ingestion Area - Segment Format and Ser/De Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 labels Apr 3, 2026
List<String> requiredColumns();


default boolean supportsRequiredRewrite()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somewhat obvious what this means, but it would still be good to have a javadoc.


/**
* Return a copy of this virtual column that is identical to this virtual column except that it operates on different
* columns, based on a renaming map where the key is the column to be renamed and the value is the new column.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Useful to say that not all virtual columns implement this, and callers must check supportsRequiredRewrite.

private VirtualColumn getQueryEquivalent(VirtualColumns shardVirtualColumns, VirtualColumn shardVirtualColumn)
{
final Optional<VirtualColumn> cached = shardEquivalenceCache.computeIfAbsent(
shardVirtualColumn,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it ok that the key is just shardVirtualColumn even though the compute function uses shardVirtualColumns as well? Can this be called with two different shardVirtualColumns for the same FilterSegmentPruner object and get some incorrect cache hits?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, good catch, no it isn't cool. Fixed by making the cache key be an internal class to capture the tree structure so that it is the exact virtual column structure and can't be tricked by the same names.

I'm kind of wondering if it would be worth making VirtualColumns store virtual columns wrapped like this so that we can make operations like this a bit cheaper, but I don't think i want to make a change like that in this PR.

Added some segments to the test that have the same names but different json_value expression that failed before this fix.

@clintropolis clintropolis added this to the 37.0.0 milestone Apr 8, 2026
@clintropolis clintropolis merged commit a0ec76e into apache:master Apr 8, 2026
63 of 64 checks passed
@clintropolis clintropolis deleted the fix-clustered-by-virtual-column branch April 8, 2026 22:41
clintropolis added a commit to clintropolis/druid that referenced this pull request Apr 8, 2026
…w correctly preserve these dependencies (apache#19262)

changes:
* adds `addRequiredVirtualColumns` method to `SegmentGenerationStageSpec` which resolves transitive virtual column dependencies for virtual columns used by clustering, fixing a bug where these dependent virtual columns would be lost in the shard spec and compaction state
* adds `supportsRequiredRewrite` and `rewriteRequiredColumns` to `VirtualColumn` allowing a virtual column to rewrite its input references to equivalent names
* adds `Expr.rewriteBindings` to rewrite identifier bindings in an `Expr` tree
* `VirtualColumns.findEquivalent` is enhanced to transitively resolve dependent virtual columns across naming contexts before checking equivalence, enabling detection that e.g. `lower("v1")` ≡ `lower("v0")` when v0 and v1 are equivalent virtual columns
* `FilterSegmentPruner` updated to use transitive equivalence when matching shard virtual columns to query virtual columns (with Optional-based caching to correctly handle nulls)
* `Projections.matchQueryVirtualColumn` updated similarly
* intern range shardspec dimension strings and virtual columns
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Area - Batch Ingestion Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 Area - Segment Format and Ser/De Bug

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants