-
Notifications
You must be signed in to change notification settings - Fork 188
Description
Is your feature request related to a problem?
The SQL/PPL plugin currently only supports querying Lucene-backed OpenSearch indices. A new Analytics engine introduces Parquet as an alternate storage format with an embedded query engine (DataFusion) for log analytics workloads. Users expect to query Parquet-backed indices using the same _plugins/_sql and _plugins/_ppl REST endpoints with the same SQL/PPL language support, but the current plugin has no integration path to route queries to the Analytics engine for execution.
Technical Requirements
- UX consistency: Same REST endpoints, response format, error messages, and explain output for both Lucene and non-Lucene queries
- Engineering complexity: Minimal changes to existing SQL/PPL plugin code; reuse existing parsing, formatting, and metrics infrastructure
- Plugin coupling: Clean API boundary between SQL/PPL plugin and Analytics engine; each can evolve independently
- Performance: No unnecessary performance overhead or data copying in the query handoff path
- Resource management: Query size limits, circuit breakers, and timeouts enforced on the unified path
- Monitoring/Observability: Request/failure metrics, query profiling, and logging for the unified path consistent with existing paths
- Security: Index-level authorization during routing; security context propagated to execution thread
- Testability: End-to-end integration tests with the real Analytics engine plugin loaded in the test cluster
- Long-term alignment: Architecture supports future ANSI SQL, cross-engine JOINs, and additional storage engines
What solution would you like?
High-level Design
The SQL/PPL plugin integrates with the Analytics engine through a new Unified Query Pipeline: the plugin handles query routing, parsing, and generating engine-agnostic Calcite RelNode logical plans, then hands them off to the Analytics engine for physical optimization, distributed planning, and execution.
Here is the high-level workflow:
RestSqlAction/RestPPLQueryActionroutes queries targeting non-Lucene indices to a newRestUnifiedQueryAction; Lucene queries fall through to the existing V2/V3 path unchanged.RestUnifiedQueryActionbuilds a Calcite schema from cluster state via the Analytics engine'sOpenSearchSchemaBuilder(standard SQL types, no UDTs).UnifiedQueryPlanner.plan()parses the query (Calcite native SQL parser or PPL V3 Calcite path) and generates a logicalRelNode.AnalyticsExecutionEnginepasses theRelNodedirectly to the Analytics engine'sQueryPlanExecutor.execute()— no serialization needed (shared classloader via extendedPlugins).- Results (
Iterable<Object[]>) are converted toExprValueand formatted as JDBC JSON via the existingJdbcResponseFormatter.
Component Responsibilities
| SQL/PPL Plugin | Analytics Engine | |
|---|---|---|
| Query routing | Detect non-Lucene indices by routing rules from Analytics engine and route to unified pipeline | Provide index settings-based routing rules |
| Parsing | Parse SQL (Calcite native parser) and PPL (ANTLR → CalciteRelNodeVisitor) into RelNode | — |
| Logical optimization | Apply language-specific HepPlanner rewrites (PPL dedup/stats simplification) | — |
| Physical optimization | — | Pushdown optimization by rules registered in TableScan.register() |
| Distributed query planning | — | Fragment the plan across data nodes, manage distributed execution and result aggregation |
| Schema | Pass schema to UnifiedQueryContext for planning |
Provide OpenSearchSchemaBuilder to build Calcite SchemaPlus from cluster state |
| Execution | Hand off RelNode to QueryPlanExecutor |
Execute plan via back-end engines, return Iterable<Object[]> |
| Exception handling | Catch and return error responses | Fail-fast and throw exceptions for unsupported operations |
| Response formatting | Convert results to default format via existing formatter | — |
| Observability | Metrics, logging, profiling for unified path | — |
API Contract
- SQL/PPL → Analytics engine:
RelNodelogical plan with standard Calcite types and PPL/SQL UDF nodes (e.g., match, match_phrase) - Analytics engine → SQL/PPL:
SchemaBuilderfor schema provisioning,QueryPlanExecutor.execute(RelNode, context)returning Iterable<Object[]>, routing rules for index detection
What alternatives have you considered?
- Option A (Query Delegation): Forward raw SQL/PPL query text to the Analytics engine via transport action. Simplest coupling, but bypasses the SQL/PPL plugin entirely and does not cleanly separate the query language frontend from the execution backend.
- Option C (Calcite Schema Adapter): Register a Parquet Calcite adapter in the existing PPL V3 pipeline. Tightest coupling with the current codebase — inherits the most infrastructure, but carries compatibility risk as the Analytics engine. The SQL execution path would still need a separate integration, since the current pipeline is PPL-only.
Option B (Unified Query Pipeline) was selected for its clean separation of query language frontend (SQL/PPL plugin) and execution backend (Analytics engine), reuse of existing plugin infrastructure, and a single integration path that serves both SQL and PPL.
Do you have any additional context?
- Analytics engine: https://github.com/opensearch-project/OpenSearch/tree/main/sandbox/plugins/analytics-engine
- PoC branch: [PoC] SQL/PPL support for analytics engine dai-chen/sql-1#10
- Pending alignment:
- API contract finalization
- Plugin integration mechanism (extendedPlugins vs transport action)
- SQL search function support scope
- Integration-ready Analytics engine build
Metadata
Metadata
Assignees
Labels
Type
Projects
Status