feat(distributed): wire coordinator-side distributed execution for scan queries #4
Open
vamsimanohar wants to merge 11 commits into main
…eline

Implement a distributed MPP query engine for PPL that executes queries across multiple OpenSearch nodes in parallel using direct Lucene access.

Key components:
- DistributedExecutionEngine: routes queries between legacy and distributed paths
- DistributedQueryPlanner: converts Calcite RelNode trees to multi-stage plans
- DistributedTaskScheduler: coordinates operator pipeline across cluster nodes
- TransportExecuteDistributedTaskAction: executes pipelines on data nodes
- LuceneScanOperator/LimitOperator: direct Lucene _source reads per shard
- Coordinator-side Calcite execution for complex operations (stats, eval, joins)
- Hash join support with parallel distributed table scans
- Filter pushdown, sort, rename, and limit in operator pipeline
- Phase 5A core operator framework (Page, Pipeline, ComputeStage, StagedPlan)
- Explain API showing distributed plan stages via _plugins/_ppl/_explain
- Architecture documentation with class hierarchy and execution plan details
- Comprehensive test coverage including integration tests

Architecture: two execution paths controlled by plugins.ppl.distributed.enabled
- Legacy (off): existing Calcite-based OpenSearchExecutionEngine
- Distributed (on): operator pipeline with no fallback
- Rename Split → DataUnit (abstract class), SplitSource → DataUnitSource, SplitAssignment → DataUnitAssignment
- Add Block interface (columnar, Arrow-aligned)
- Add PlanFragmenter, FragmentationContext, SubPlan for automatic stage creation
- Add OutputBuffer for exchange back-pressure
- Add execution lifecycle: QueryExecution, StageExecution, TaskExecution
- Add planFragment field to ComputeStage for query pushdown
- Extend Page with getBlock() and getRetainedSizeBytes() defaults
- Create OpenSearchDataUnit (index + shard, not remotely accessible)
- Delete H1 types: DistributedPhysicalPlan, ExecutionStage, WorkUnit, DataPartition, DistributedQueryPlanner, DistributedPlanAnalyzer, RelNodeAnalysis, PartitionDiscovery
- Delete execution code: DistributedTaskScheduler, HashJoinExecutor, InMemoryScannableTable, QueryResponseBuilder, TemporalValueNormalizer, RelNodeAnalyzer, FieldMapping, JoinInfo, SortKey, OpenSearchPartitionDiscovery
- Gut DistributedExecutionEngine to routing shell (throws when enabled)
- Simplify OpenSearchPluginModule constructor
- Default PPL_DISTRIBUTED_ENABLED to false
- Remove assumeFalse(isDistributedEnabled()) from integ tests
- Update architecture documentation
…vent infinite loop

The processOnce() loop only passed output between adjacent operator pairs (i to i+1), never calling getOutput() on the last operator. Operators that buffer pages (e.g., PassThroughOperator) would never have their buffer drained, causing isFinished() to never return true and an infinite loop in run().
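The fix described in this commit can be illustrated with a minimal sketch. It assumes a simplified operator contract (needsInput/addInput/getOutput/finish/isFinished); the names SketchOperator, BufferingPassThrough, and SketchDriver are hypothetical and are not the PR's actual classes. The point is the final drain step: without it, the last operator's buffer never empties and run() never terminates.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical, simplified operator contract; the PR's real interface may differ.
interface SketchOperator {
    boolean needsInput();
    void addInput(String page);
    String getOutput();   // returns null when no page is ready
    void finish();        // signals that no more input will arrive
    boolean isFinished();
}

// Buffers pages and hands them back unchanged.
class BufferingPassThrough implements SketchOperator {
    private final Deque<String> buffer = new ArrayDeque<>();
    private boolean finishing;

    public boolean needsInput() { return !finishing; }
    public void addInput(String page) { buffer.add(page); }
    public String getOutput() { return buffer.poll(); }
    public void finish() { finishing = true; }
    // Finished only once upstream is done AND the buffer has been drained.
    public boolean isFinished() { return finishing && buffer.isEmpty(); }
}

class SketchDriver {
    private final List<SketchOperator> operators;
    final List<String> results = new ArrayList<>();

    SketchDriver(List<SketchOperator> operators) { this.operators = operators; }

    void processOnce() {
        // Move pages between adjacent operator pairs (i to i+1).
        for (int i = 0; i < operators.size() - 1; i++) {
            SketchOperator current = operators.get(i);
            SketchOperator next = operators.get(i + 1);
            if (next.needsInput()) {
                String page = current.getOutput();
                if (page != null) next.addInput(page);
            }
            if (current.isFinished()) next.finish();
        }
        // The fix: also drain the last operator, which has no downstream neighbor.
        SketchOperator last = operators.get(operators.size() - 1);
        String out;
        while ((out = last.getOutput()) != null) results.add(out);
    }

    void run() {
        SketchOperator last = operators.get(operators.size() - 1);
        while (!last.isFinished()) processOnce();
    }
}
```

If the drain step is removed from processOnce(), a two-operator pipeline of BufferingPassThrough instances loops forever once the second operator holds a page, which mirrors the bug this commit addresses.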
…used operator factories

- Rename split/ package to dataunit/ in both core and opensearch modules
- Delete SourceOperatorFactory, OperatorFactory, and Pipeline (unused)
- Simplify ComputeStage constructor by removing factory fields
- Update all imports across 10+ files
…y-aware assignment

- RelNodeAnalyzer: walks Calcite RelNode tree to extract index name, field names, query limit, and filter conditions
- OpenSearchDataUnitSource: discovers shards from ClusterState routing table, creates OpenSearchDataUnit per shard with preferred nodes
- LocalityAwareDataUnitAssignment: assigns data units to nodes by matching preferred nodes to available nodes (groupBy locality)
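As a rough illustration of the group-by-locality idea, the sketch below assigns each shard to the first of its preferred nodes that is currently available. ShardUnit and LocalityAssignmentSketch are hypothetical names, and throwing when no preferred node is available is an assumption made here (based on the data units being described as not remotely accessible), not confirmed behavior of LocalityAwareDataUnitAssignment.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a data unit: an index shard plus the nodes holding a copy.
record ShardUnit(String index, int shardId, List<String> preferredNodes) {}

class LocalityAssignmentSketch {
    // Groups units by the first preferred node that is also an available node.
    static Map<String, List<ShardUnit>> assign(List<ShardUnit> units, List<String> availableNodes) {
        Map<String, List<ShardUnit>> byNode = new LinkedHashMap<>();
        for (ShardUnit unit : units) {
            String target = null;
            for (String preferred : unit.preferredNodes()) {
                if (availableNodes.contains(preferred)) {
                    target = preferred;
                    break;
                }
            }
            if (target == null) {
                // Assumption: a shard cannot be read remotely, so there is no fallback node.
                throw new IllegalStateException(
                    "No available node holds " + unit.index() + "[" + unit.shardId() + "]");
            }
            byNode.computeIfAbsent(target, k -> new ArrayList<>()).add(unit);
        }
        return byNode;
    }
}
```

Each resulting map entry corresponds to one transport request: the node ID and the list of shards that node should scan locally.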
…scan queries

- SimplePlanFragmenter: creates 2-stage plan (leaf scan + root merge) from a Calcite RelNode tree for single-table scan queries
- OpenSearchFragmentationContext: provides cluster topology (data node IDs, shard discovery) from ClusterState to the fragmenter
…engine

- DistributedQueryCoordinator: orchestrates distributed execution by assigning shards to nodes, sending transport requests, collecting responses async, merging rows, and applying coordinator-side limit
- DistributedExecutionEngine: when distributed enabled, fragments RelNode into staged plan and delegates to coordinator instead of throwing UnsupportedOperationException
- OpenSearchPluginModule: pass ClusterService and TransportService to DistributedExecutionEngine constructor
- Explain path: formats staged plan with stage details when distributed is enabled
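The collect, merge, and limit steps of the coordinator can be sketched as follows. This is a simplified illustration that uses CompletableFuture in place of OpenSearch transport responses and plain strings in place of rows; CoordinatorMergeSketch and mergeWithLimit are hypothetical names, not the PR's API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

class CoordinatorMergeSketch {
    // Waits for every per-node response, concatenates rows in response order,
    // and stops once the coordinator-side limit is reached.
    static List<String> mergeWithLimit(List<CompletableFuture<List<String>>> nodeResponses, int limit) {
        CompletableFuture.allOf(nodeResponses.toArray(new CompletableFuture[0])).join();
        List<String> merged = new ArrayList<>();
        for (CompletableFuture<List<String>> response : nodeResponses) {
            for (String row : response.join()) {
                if (merged.size() >= limit) return merged;
                merged.add(row);
            }
        }
        return merged;
    }
}
```

The limit is re-applied on the coordinator in this sketch because, even if every data node caps its own output at N rows, the concatenation across nodes can exceed N.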
Aggregation, sort, and window queries were silently producing wrong results because RelNodeAnalyzer walked past unrecognized single-input nodes. Now throws UnsupportedOperationException with clear messages for LogicalAggregate, LogicalSort with collation, and Window nodes.
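The fail-fast behavior can be sketched without Calcite. PlanNode below is a hypothetical stand-in for a RelNode, and the node kinds are illustrative strings rather than Calcite classes; the idea is that the walker descends only through node types it explicitly supports and rejects everything else, instead of silently skipping unknown single-input nodes.

```java
// Hypothetical stand-in for a Calcite RelNode with at most one input.
class PlanNode {
    final String kind;     // e.g. "Scan", "Filter", "Aggregate"
    final String detail;   // index name for a Scan, otherwise unused
    final PlanNode input;  // null for leaf nodes

    PlanNode(String kind, String detail, PlanNode input) {
        this.kind = kind;
        this.detail = detail;
        this.input = input;
    }
}

class ScanOnlyAnalyzerSketch {
    // Returns the scanned index name, or throws for any unsupported node kind.
    static String extractIndex(PlanNode node) {
        switch (node.kind) {
            case "Scan":
                return node.detail;
            case "Filter":
            case "Project":
            case "Limit":
                // Supported single-input nodes: keep walking down.
                return extractIndex(node.input);
            default:
                // Walking past an unknown node would drop its semantics and return wrong rows.
                throw new UnsupportedOperationException(
                    "Distributed execution does not support: " + node.kind);
        }
    }
}
```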
…l planning

Replace the ad-hoc RelNodeAnalyzer pattern matching system with proper MPP architecture using H2 interfaces. This eliminates hardcoded query analysis and enables intelligent multi-stage planning.

**Major Changes:**

• **CalciteDistributedPhysicalPlanner** - Replaces RelNodeAnalyzer
  - Proper Calcite visitor pattern for RelNode traversal
  - Implements PhysicalPlanner interface with plan(RelNode) method
  - Converts logical operators to typed physical operators

• **Physical Operator Hierarchy** - Type-safe intermediate representation
  - PhysicalOperatorTree, ScanPhysicalOperator, FilterPhysicalOperator
  - ProjectionPhysicalOperator, LimitPhysicalOperator
  - Bridge between Calcite RelNodes and runtime operators

• **ProjectionOperator** - New runtime operator for field selection
  - Handles field projection and nested field access
  - Page-based columnar data processing
  - Standard operator lifecycle (needsInput/addInput/getOutput)

• **IntelligentPlanFragmenter** - Replaces SimplePlanFragmenter
  - Smart stage boundary decisions based on operator types
  - Cost-driven fragmentation using real estimates
  - Eliminates hardcoded 2-stage assumptions

• **DynamicPipelineBuilder** - Dynamic operator construction
  - Builds pipelines from ComputeStage physical operators
  - Replaces hardcoded LuceneScan→Limit→Collect pattern
  - Supports filter pushdown and operator chaining

• **OpenSearchCostEstimator** - Real cost estimation
  - Uses Lucene index statistics and cluster metadata
  - Replaces stub cost estimator with actual data
  - Enables cost-based optimization decisions

• **Simplified Architecture** - Removed feature flag complexity
  - Single execution path using new physical planner
  - Eliminated legacy SimplePlanFragmenter
  - Streamlined DistributedExecutionEngine integration

**Enhanced Explain Output:**
- Shows physical operators in each stage
- Displays cost estimates and data size projections
- Operator-level execution details

This change establishes proper MPP foundations for complex query support while maintaining full backward compatibility for supported query patterns.
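Field projection with nested field access, as listed under ProjectionOperator, might look roughly like the sketch below. It works on a single row represented as a map rather than on the PR's Page/Block types, and it assumes nested fields are addressed with dotted paths such as geo.city; both choices, along with the name ProjectionSketch, are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ProjectionSketch {
    // Resolves a dotted path (e.g. "geo.city") against nested maps; returns null if any step is missing.
    static Object resolve(Map<String, Object> row, String path) {
        Object current = row;
        for (String part : path.split("\\.")) {
            if (!(current instanceof Map)) return null;
            current = ((Map<?, ?>) current).get(part);
        }
        return current;
    }

    // Keeps only the requested fields, in the requested order.
    static Map<String, Object> project(Map<String, Object> row, List<String> fields) {
        Map<String, Object> projected = new LinkedHashMap<>();
        for (String field : fields) {
            projected.put(field, resolve(row, field));
        }
        return projected;
    }
}
```

A query such as search source=<index> | fields f1, f2 would correspond to calling project with the field list on every row produced by the scan.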
The SimplePlanFragmenterTest was referencing the deleted SimplePlanFragmenter class, causing compilation failures after merging origin/main. Removed the obsolete test as SimplePlanFragmenter has been replaced with IntelligentPlanFragmenter.
Summary
- Renames split package to dataunit and removes unused SourceOperatorFactory/OperatorFactory from ComputeStage
- Adds RelNodeAnalyzer to extract query metadata (index, fields, limit, filters) from Calcite RelNode trees
- Adds OpenSearchDataUnitSource for shard discovery from cluster routing table
- Adds LocalityAwareDataUnitAssignment to group shards by preferred node
- Adds SimplePlanFragmenter and OpenSearchFragmentationContext for 2-stage plan creation (leaf scan + root merge)
- Adds DistributedQueryCoordinator that assigns shards to nodes, sends transport requests, and merges results
- Wires DistributedExecutionEngine to use the coordinator when plugins.ppl.distributed.enabled=true

Supported query patterns
- search source=<index>
- search source=<index> | head N
- search source=<index> | where <condition>
- search source=<index> | where <condition> | head N
- search source=<index> | fields f1, f2, ...

Test plan
- Unit tests (RelNodeAnalyzerTest, OpenSearchDataUnitSourceTest, LocalityAwareDataUnitAssignmentTest, SimplePlanFragmenterTest, DistributedQueryCoordinatorTest, DistributedExecutionEngineTest)
- setup-distributed-ppl.sh
- DistributedExecutionEngine → RelNodeAnalyzer → SimplePlanFragmenter → DistributedQueryCoordinator → TransportExecuteDistributedTaskAction → rows merged
- plugins.ppl.distributed.enabled=false