fix: clustered by virtual columns that depended on virtual columns now correctly preserve these dependencies#19262
Conversation
…w correctly preserve these dependencies
changes:
* adds `addRequiredVirtualColumns` method to `SegmentGenerationStageSpec` which resolves transitive virtual column dependencies for virtual columns used by clustering, fixing a bug where these dependent virtual columns would be lost in the shard spec and compaction state
* adds `supportsRequiredRewrite` and `rewriteRequiredColumns` to `VirtualColumn` allowing a virtual column to rewrite its input references to equivalent names
* adds `Expr.rewriteBindings` to rewrite identifier bindings in an `Expr` tree
* `VirtualColumns.findEquivalent` is enhanced to transitively resolve dependent virtual columns across naming contexts before checking equivalence, enabling detection that e.g. `lower("v1")` ≡ `lower("v0")` when v0 and v1 are equivalent virtual columns
* `FilterSegmentPruner` updated to use transitive equivalence when matching shard virtual columns to query virtual columns (with Optional-based caching to correctly handle nulls)
* `Projections.matchQueryVirtualColumn` updated similarly
| List<String> requiredColumns(); | ||
|
|
||
|
|
||
| default boolean supportsRequiredRewrite() |
There was a problem hiding this comment.
Somewhat obvious what this means, but it would still be good to have a javadoc.
|
|
||
| /** | ||
| * Return a copy of this virtual column that is identical to this virtual column except that it operates on different | ||
| * columns, based on a renaming map where the key is the column to be renamed and the value is the new column. |
There was a problem hiding this comment.
Useful to say that not all virtual columns implement this, and callers must check supportsRequiredRewrite.
| private VirtualColumn getQueryEquivalent(VirtualColumns shardVirtualColumns, VirtualColumn shardVirtualColumn) | ||
| { | ||
| final Optional<VirtualColumn> cached = shardEquivalenceCache.computeIfAbsent( | ||
| shardVirtualColumn, |
There was a problem hiding this comment.
Is it ok that the key is just shardVirtualColumn even though the compute function uses shardVirtualColumns as well? Can this be called with two different shardVirtualColumns for the same FilterSegmentPruner object and get some incorrect cache hits?
There was a problem hiding this comment.
ah, good catch, no it isn't cool. Fixed by making the cache key be an internal class to capture the tree structure so that it is the exact virtual column structure and can't be tricked by the same names.
I'm kind of wondering if it would be worth making VirtualColumns store virtual columns wrapped like this so that we can make operations like this a bit cheaper, but I don't think i want to make a change like that in this PR.
Added some segments to the test that have the same names but different json_value expression that failed before this fix.
…w correctly preserve these dependencies (apache#19262) changes: * adds `addRequiredVirtualColumns` method to `SegmentGenerationStageSpec` which resolves transitive virtual column dependencies for virtual columns used by clustering, fixing a bug where these dependent virtual columns would be lost in the shard spec and compaction state * adds `supportsRequiredRewrite` and `rewriteRequiredColumns` to `VirtualColumn` allowing a virtual column to rewrite its input references to equivalent names * adds `Expr.rewriteBindings` to rewrite identifier bindings in an `Expr` tree * `VirtualColumns.findEquivalent` is enhanced to transitively resolve dependent virtual columns across naming contexts before checking equivalence, enabling detection that e.g. `lower("v1")` ≡ `lower("v0")` when v0 and v1 are equivalent virtual columns * `FilterSegmentPruner` updated to use transitive equivalence when matching shard virtual columns to query virtual columns (with Optional-based caching to correctly handle nulls) * `Projections.matchQueryVirtualColumn` updated similarly * intern range shardspec dimension strings and virtual columns
Description
This PR fixes an issue when clustering MSQ insert/replace by a virtual column that depends on another virtual column e.g. something like
LOWER(JSON_VALUE(obj, '$.a'))which plans to anExpressionVirtualColumnreferencing aNestedFieldVirtualColumn, the resulting `DimensionRangeShardSpec was missing the dependent virtual columns. This broke segment pruning for those queries since the shard spec had incomplete virtual column context, and compaction for the same reason.Also, it fixes an issue with virtual column equivalence for the same case, where a virtual column depends on another virtual column, by allowing virtual columns with equivalent virtual column dependencies to be considered equivalent by rewriting the virtual column to use the equivalent inputs before testing.
changes:
addRequiredVirtualColumnsmethod toSegmentGenerationStageSpecwhich resolves transitive virtual column dependencies for virtual columns used by clustering, fixing a bug where these dependent virtual columns would be lost in the shard spec and compaction statesupportsRequiredRewriteandrewriteRequiredColumnstoVirtualColumnallowing a virtual column to rewrite its input references to equivalent namesExpr.rewriteBindingsto rewrite identifier bindings in anExprtreeVirtualColumns.findEquivalentis enhanced to transitively resolve dependent virtual columns across naming contexts before checking equivalence, enabling detection that e.g.lower("v1")≡lower("v0")when v0 and v1 are equivalent virtual columnsFilterSegmentPrunerupdated to use transitive equivalence when matching shard virtual columns to query virtual columns (with Optional-based caching to correctly handle nulls)Projections.matchQueryVirtualColumnupdated similarly