feat: data-parity skill — TypeScript orchestrator, ClickHouse driver, partition support#493
suryaiyer95 wants to merge 21 commits into main from
Conversation
📝 Walkthrough

Adds an end-to-end “data parity” feature: new skill docs and prompts, a DataDiffTool, a dispatcher handler that runs a Rust DataParity state machine via a new orchestration module, driver updates to support no-limit execution, types for data-diff contracts, and test mock adjustments.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User as User/Client
    participant Tool as DataDiffTool
    participant Dispatcher as Dispatcher
    participant Orch as DataDiff Orchestration
    participant Engine as Rust DataParity Engine
    participant DB as Warehouse Connections
    User->>Tool: invoke data_diff(params)
    Tool->>Dispatcher: call("data.diff", params)
    Dispatcher->>Orch: runDataDiff(params)
    Orch->>Orch: resolveTableSources() / build CTEs
    Orch->>DB: fetch schema & partition values
    DB-->>Orch: column metadata / partition list
    Orch->>Engine: create DataParitySession
    loop Engine emits actions
        Engine-->>Orch: ExecuteSql(sql)
        Orch->>Orch: injectCte(sql)
        Orch->>DB: execute SQL (noLimit=true)
        DB-->>Orch: rows (normalized)
        Orch->>Engine: push results
    end
    alt partitioned
        Orch->>Orch: iterate partitions -> runDataDiff(partitioned_params)
        Orch->>Orch: mergeOutcomes(per_partition_results)
    end
    Engine-->>Orch: Done / Outcome
    Orch-->>Dispatcher: DataDiffResult
    Dispatcher-->>Tool: result
    Tool-->>User: formatted summary
```

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Closing — .opencode/ skill config and model defaults should not live in the open source repo.
Force-pushed from 0f8c7ac to 7909e55
Companion fix: column name collision in
suryaiyer95 left a comment
Multi-Model Code Review — altimate-code#493
Reviewed by 4-model panel: Claude Sonnet 4.6, Gemini 3.1 Pro, Grok 4, Kimi K2.5. Unanimous verdict: REQUEST CHANGES.
Note: The PR title significantly undersells scope. This contains a 799-line TypeScript orchestrator, new ClickHouse driver, `ExecuteOptions` refactored across 10+ drivers, column auto-discovery, partition support, and a 341-line SKILL.md.
Verdict: REQUEST CHANGES (fix MAJOR-1 before merge)
Major Issues: 1 | Minor Issues: 4 | NITs: 3
Major Issues
MAJOR-1 — SQL injection via table name interpolation in `buildColumnDiscoverySQL` (Security)
- Location: `packages/opencode/src/altimate/native/connections/data-diff.ts` — `buildColumnDiscoverySQL`
- Table name parts are interpolated directly into SQL:

  ```ts
  schemaFilter = `table_schema = '${parts[1]}'`
  tableFilter = `table_name = '${parts[2]}'`
  ```

  `tableName` flows from `DataDiffParams.source`, which comes from the LLM/user. A crafted table name like `orders' OR 1=1 --` produces injected SQL.
- Fix:

  ```ts
  const escape = (s: string) => s.replace(/'/g, "''")
  schemaFilter = `table_schema = '${escape(parts[1])}'`
  tableFilter = `table_name = '${escape(parts[2])}'`
  ```
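The proposed fix can be sketched as a small self-contained helper. This is an illustrative sketch only: `escapeSqlString` and `buildFilters` are hypothetical names, not the actual code in `data-diff.ts`, and the `parts` layout is assumed from the review snippet.

```typescript
// Hypothetical sketch of MAJOR-1's fix: escape single quotes in every
// value interpolated into an information_schema filter. Helper names
// (escapeSqlString, buildFilters) are illustrative, not from the PR.
const escapeSqlString = (s: string): string => s.replace(/'/g, "''")

function buildFilters(parts: string[]): { schemaFilter?: string; tableFilter: string } {
  // parts: segments of a dotted table name, e.g. ["db", "schema", "table"]
  const table = parts[parts.length - 1]
  const schema = parts.length > 1 ? parts[parts.length - 2] : undefined
  return {
    schemaFilter: schema ? `table_schema = '${escapeSqlString(schema)}'` : undefined,
    tableFilter: `table_name = '${escapeSqlString(table)}'`,
  }
}

// The crafted name from the finding can no longer terminate the string literal:
const filters = buildFilters(["public", "orders' OR 1=1 --"])
console.log(filters.tableFilter)
```

Doubling single quotes keeps the whole name inside one SQL string literal, so the `OR 1=1` fragment is matched as data rather than executed.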
Minor Issues
MINOR-1 — `discoverExtraColumns` silently falls back to key-only comparison (Logic Error)
- When column discovery fails, the diff runs in key-only mode — reporting all rows as identical even when non-key values differ. No warning is surfaced.
- Fix: Include a `warning` field in `DataDiffResult` when key-only fallback occurs.
MINOR-2 — `dateTruncExpr` missing Oracle dialect (Bug)
- Oracle falls through to the default `DATE_TRUNC` case. Oracle uses `TRUNC(col, 'MONTH')` — date-partitioned diffs on Oracle tables will fail.
- Fix: Add an Oracle case: `` case "oracle": return `TRUNC(${column}, '${g.toUpperCase()}')` ``
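A dialect switch with the suggested Oracle case might look like the sketch below. This is illustrative only, not the PR's actual `dateTruncExpr`; note that a later commit in this PR corrects the week format model to `'IW'`, because `TRUNC(col, 'WEEK')` raises ORA-01800, so the sketch uses a format-model map rather than `toUpperCase()`.

```typescript
// Illustrative dialect-aware date truncation helper; not the PR's actual
// dateTruncExpr. Oracle lacks DATE_TRUNC and takes a TRUNC format model.
type Granularity = "day" | "week" | "month" | "year"

function dateTruncExpr(dialect: string, column: string, g: Granularity): string {
  switch (dialect) {
    case "oracle": {
      // ISO week is 'IW'; passing 'WEEK' throws ORA-01800 on Oracle.
      const model: Record<Granularity, string> = { day: "DD", week: "IW", month: "MM", year: "YYYY" }
      return `TRUNC(${column}, '${model[g]}')`
    }
    default:
      // Common DATE_TRUNC('unit', col) form; real engines differ in detail
      // (e.g. BigQuery's argument order), so treat this default as a sketch.
      return `DATE_TRUNC('${g}', ${column})`
  }
}

console.log(dateTruncExpr("oracle", "order_date", "month"))   // TRUNC(order_date, 'MM')
console.log(dateTruncExpr("postgres", "order_date", "month")) // DATE_TRUNC('month', order_date)
```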
MINOR-3 — Unqualified table names match across multiple schemas (Bug)
- `buildColumnDiscoverySQL` with no schema filter returns columns from all schemas with that table name — producing incorrect `extra_columns`.
- Fix: Default `schemaFilter` to the connection's current schema, or document that fully-qualified names are required.
MINOR-4 — `MAX_STEPS = 200` hardcoded and undocumented (Design)
- Not configurable, not documented in `DataDiffParams` or tool description.
- Fix: Expose as optional `max_steps?: number` in `DataDiffParams`, defaulting to 200.
NITs
- N1: `mergeOutcomes` caps `diff_rows` at 100 with no documentation — callers don't know the result is partial
- N2: `partitionColumn` identifier interpolated directly in `buildPartitionDiscoverySQL` — inconsistent with MAJOR-1 fix
- N3: `effectiveLimit` pattern repeated across 10+ drivers
Positive Observations
- The cooperative state machine bridge (TypeScript orchestrator ↔ Rust engine) is the right architecture.
- Column auto-discovery with two-layer audit exclusion is thorough.
- `ExecuteOptions.noLimit` correctly applied across all 10+ drivers.
- SKILL.md `CRITICAL: joindiff only sees one connection for cross-warehouse comparisons` is exactly the right guardrail.
- ClickHouse driver using `JSONCompactEachRowWithNamesAndTypes` with pure HTTP is clean.
- Dynamic import gracefully handles NAPI unavailability — unblocks CI.
Force-pushed from 8c7ef31 to 1468eba
suryaiyer95 left a comment
Code Review: Data Parity TypeScript Orchestrator + ClickHouse Driver
Verdict: APPROVE with minor issues
Reviewed by: Claude (sole reviewer — external models unavailable)
Major Issue (1)
1. SQL Injection via buildColumnDiscoverySQL — ClickHouse/Snowflake paths (Security)
Location: packages/opencode/src/altimate/native/connections/data-diff.ts
The ClickHouse path uses DESCRIBE TABLE ${tableName} and Snowflake uses SHOW COLUMNS IN TABLE ${tableName} with raw string interpolation. While the information_schema paths properly escape with esc(), ClickHouse/Snowflake bypass escaping entirely.
A table name like orders; DROP TABLE users -- would be injected verbatim. The tableName comes from LLM tool params, so prompt injection or user error could trigger this.
Fix: Apply identifier quoting (backtick for ClickHouse, double-quote for Snowflake) or validate table names against [a-zA-Z0-9_.].
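The suggested mitigation could be sketched as follows. The helper name `quoteTableName` is hypothetical; the character class and quoting delimiters come from the review text.

```typescript
// Sketch of the suggested mitigation; the helper name is illustrative.
// Validate the table name against a conservative character set, then quote
// each dotted part with the dialect's identifier delimiter (backtick for
// ClickHouse, double quote for Snowflake).
function quoteTableName(tableName: string, dialect: "clickhouse" | "snowflake"): string {
  if (!/^[A-Za-z0-9_.]+$/.test(tableName)) {
    throw new Error(`Unsafe table name: ${tableName}`)
  }
  const q = dialect === "clickhouse" ? "`" : '"'
  return tableName
    .split(".")
    .map((part) => `${q}${part}${q}`)
    .join(".")
}

console.log(quoteTableName("db.orders", "clickhouse")) // `db`.`orders`
```

A name like `orders; DROP TABLE users --` fails the allow-list check before any SQL string is built.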
Minor Issues (4)
2. Profile outcome formatting uses wrong field names — tools/data-diff.ts
The TS formatter checks col.source_stats/col.target_stats but Rust serializes as col.table1/col.table2. Stats fields are non_null_count/distinct_count in Rust but the TS uses count/null_count. Profile output would show undefined values.
3. CTE injection doesn't handle ClickHouse SETTINGS clause — data-diff.ts:injectCte()
JoinDiff for ClickHouse emits ... SETTINGS join_use_nulls = 1. CTE-wrapped queries with SETTINGS may not propagate correctly in all ClickHouse versions.
4. Partition discovery fails on query sources — data-diff.ts:runPartitionedDiff()
When source is a SQL query, table1Name resolves to __diff_source but the partition discovery SQL doesn't include the CTE prefix. This would fail with "table __diff_source not found".
5. noLimit creates unbounded memory risk for JoinDiff — all 11 driver files
When noLimit is true, no LIMIT clause is added. For JoinDiff, the result set could be millions of rows loaded into Node.js memory. Consider a safety limit (e.g., 10M rows) even with noLimit.
Nit (1)
`simulation-suite.test.ts` changes (removing the `data:` wrapper, flattening mock response fields) are unrelated bug fixes that should be in a separate commit
Positive Observations
- Auto-discovery of audit columns with two-layer exclusion (name patterns + schema default detection) is production-quality. The SKILL.md workflow requiring user confirmation before excluding is excellent UX
- CTE injection for query-vs-query comparison is clever — wrapping arbitrary SQL in CTEs so the Rust engine treats them as tables
- Partition support (date/numeric/categorical) with per-partition breakdown reporting makes this usable on billion-row tables
- SKILL.md is thorough — the 9-step plan template, algorithm selection guide, `extra_columns` behavior docs, and common mistakes section are exactly what an LLM agent needs
- Consistent `noLimit` across all 11 drivers — clean, mechanical refactor with a correct `effectiveLimit > 0` guard for truncation detection
- Defensive null guards added to `lineage-check.ts`, `schema-inspect.ts`, `sql-analyze.ts` — good hardening of existing tools
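The CTE-injection idea praised above can be sketched in a few lines. The names `injectCte` and `__diff_source`/`__diff_target` follow the review text; the body is an assumption about one plausible implementation, not the PR's actual code.

```typescript
// Illustrative sketch of CTE injection: wrap arbitrary source/target SQL
// in CTEs so the engine-emitted SQL can reference them as tables.
function injectCte(engineSql: string, sourceSql: string, targetSql: string): string {
  const prefix = `WITH __diff_source AS (${sourceSql}), __diff_target AS (${targetSql}) `
  // If the engine's SQL already starts with WITH, merge the CTE lists
  // instead of emitting two WITH keywords.
  if (/^\s*WITH\b/i.test(engineSql)) {
    return engineSql.replace(/^\s*WITH\b/i, `${prefix.trimEnd()},`)
  }
  return prefix + engineSql
}

const wrapped = injectCte("SELECT * FROM __diff_source", "SELECT 1", "SELECT 2")
console.log(wrapped)
```

The merge branch is exactly the case the "Missing Tests" section below flags as untested: engine CTEs combined with source/target CTEs.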
Missing Tests
- No test for `injectCte` merging engine CTEs with source/target CTEs
- No test for partition discovery failure (error propagation from `runPartitionedDiff`)
- No integration test exercising the `DataParitySession` NAPI round-trip from TypeScript
Reviewed by 1 model: Claude. External models (Gemini 3.1 Pro, Kimi K2.5, Grok 4) unavailable due to insufficient OpenRouter credits.
🤖 Generated with Claude Code
Supplemental Review Findings

Two additional issues caught during deeper analysis: MAJOR: Partition discovery misses target-only partitions (
Force-pushed from e1195ba to 161051d
- Add DataParity engine integration via native Rust bindings
- Add data-diff tool for LLM agent (profile, joindiff, hashdiff, cascade, auto)
- Add ClickHouse driver support
- Add data-parity skill: profile-first workflow, algorithm selection guide, CRITICAL warning that joindiff cannot run cross-database (always returns 0 diffs), output style rules (facts only, no editorializing)
- Gitignore .altimate-code/ (credentials) and *.node (platform binaries)
Split large tables by a date or numeric column before diffing. Each partition is diffed independently, then results are aggregated.

New params:
- partition_column: column to split on (date or numeric)
- partition_granularity: day | week | month | year (for dates)
- partition_bucket_size: bucket width for numeric columns

New output field:
- partition_results: per-partition breakdown (identical / differ / error)

Dialect-aware SQL: Postgres, Snowflake, BigQuery, ClickHouse, MySQL. Skill updated with partition guidance and examples.
When partition_column is set without partition_granularity or partition_bucket_size, the diff groups by raw DISTINCT values. This works for any non-date, non-numeric column: status, region, country, etc. The WHERE clause uses equality (col = 'value') with proper escaping.
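The equality predicate this commit describes might look like the sketch below. The helper name mirrors the PR's `buildPartitionWhereClause`, but the body is an assumption; it also quotes the column identifier, which a later commit in this PR adds.

```typescript
// Sketch of the categorical-partition predicate: quoted identifier on the
// left, single-quote-escaped string literal on the right. Illustrative only.
function buildPartitionWhereClause(column: string, value: string): string {
  const ident = `"${column.replace(/"/g, '""')}"`      // identifier quoting
  const literal = `'${value.replace(/'/g, "''")}'`     // value escaping
  return `${ident} = ${literal}`
}

console.log(buildPartitionWhereClause("region", "O'Brien")) // "region" = 'O''Brien'
```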
Rust serializes ReladiffOutcome with serde tag 'mode', producing:
{mode: 'diff', diff_rows: [...], stats: {rows_table1, rows_table2, exclusive_table1, exclusive_table2, updated, unchanged}}
Previous code checked for {Match: {...}} / {Diff: {...}} shapes that
never matched, causing partitioned diff to report all partitions as
'identical' with 0 rows.
- extractStats(): check outcome.mode === 'diff', read from stats fields
- mergeOutcomes(): aggregate mode-based outcomes correctly
- summarize()/formatOutcome(): display mode-based shape with correct labels
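The serde-tagged shape this commit describes lends itself to a TypeScript discriminated union. The sketch below is illustrative: the `"diff"` variant fields come from the commit message above, while the other variants are abbreviated placeholders.

```typescript
// Sketch of the mode-tagged outcome contract; not the actual PR types.
interface DiffStats {
  rows_table1: number
  rows_table2: number
  exclusive_table1: number
  exclusive_table2: number
  updated: number
  unchanged: number
}

type ReladiffOutcome =
  | { mode: "diff"; diff_rows: unknown[]; stats: DiffStats }
  | { mode: "profile"; columns?: unknown[] }   // placeholder variant
  | { mode: "cascade"; results?: unknown[] }   // placeholder variant

// The fix: check outcome.mode, not the old {Match}/{Diff} shapes.
function extractStats(outcome: ReladiffOutcome): DiffStats | undefined {
  return outcome.mode === "diff" ? outcome.stats : undefined
}

const outcome: ReladiffOutcome = {
  mode: "diff",
  diff_rows: [],
  stats: { rows_table1: 10, rows_table2: 10, exclusive_table1: 0, exclusive_table2: 0, updated: 2, unchanged: 8 },
}
console.log(extractStats(outcome)?.updated) // 2
```

Narrowing on `mode` also addresses the later review nit about `as any` casts in the tool's formatter.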
Key changes based on feedback:
- Always generate TODO plan before any tool is called
- Enforce data_diff tool usage (never manual EXCEPT/JOIN SQL)
- Add PK discovery + explicit user confirmation step
- Profile pass is now mandatory before row-level diff
- Ask user before expensive row-level diff on large tables:
  - <100K rows: proceed automatically
  - 100K-10M rows: ask with where_clause option
  - >10M rows: offer window/partition/full choices
- Document partition modes (date/numeric/categorical) with examples
- Add warehouse_list as first step to confirm connections
…from data diff

The Rust engine only compares columns explicitly listed in extra_columns. When omitted, it was silently reporting all key-matched rows as 'identical' even when non-key values differed — a false-positive bug.

Changes:
- Auto-discover columns from information_schema when extra_columns is omitted and source is a plain table name (not a SQL query)
- Exclude audit/timestamp columns (updated_at, created_at, inserted_at, modified_at, _fivetran_*, _airbyte_*, publisher_last_updated_*, etc.) from comparison by default since they typically differ due to ETL timing
- Report excluded columns in tool output so users know what was skipped
- Fix misleading tool description that said 'Omit to compare all columns'
- Update SKILL.md with critical guidance on extra_columns behavior
…ult truncation
All drivers default to `LIMIT 1001` on SELECT queries and post-truncate to
1000 rows. This silently drops rows when the data-diff engine needs complete
result sets — a FULL OUTER JOIN returning >1000 diff rows would be truncated,
causing the engine to undercount differences.
- Add `ExecuteOptions { noLimit?: boolean }` to the `Connector` interface
- When `noLimit: true`, set `effectiveLimit = 0` (falsy) so the existing
LIMIT injection guard is skipped, and add `effectiveLimit > 0` to the
truncation check so rows aren't sliced to zero
- Update all 12 drivers: postgres, clickhouse, snowflake, bigquery, mysql,
redshift, databricks, duckdb, oracle, sqlserver, sqlite, mongodb
- Pass `{ noLimit: true }` from `data-diff.ts` `executeQuery()`
Interactive SQL callers are unaffected — they continue to get the default
1000-row limit. Only the data-diff pipeline opts out.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
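The pattern this commit message describes can be condensed into two helpers. This is a minimal sketch of the described behavior as it might appear inside one driver's `execute()`; names mirror the commit message, not the actual driver source.

```typescript
// Sketch of the ExecuteOptions/noLimit pattern; illustrative only.
interface ExecuteOptions {
  noLimit?: boolean
}

function applyLimit(sql: string, limit: number | undefined, options?: ExecuteOptions) {
  // noLimit => effectiveLimit = 0 (falsy), so the LIMIT injection is skipped.
  const effectiveLimit = options?.noLimit ? 0 : (limit ?? 1000)
  let query = sql
  const isSelectLike = /^\s*(SELECT|WITH|VALUES)\b/i.test(sql)
  if (isSelectLike && effectiveLimit && !/\bLIMIT\b/i.test(sql)) {
    // Fetch one extra row so truncation can be detected afterwards.
    query = `${sql.replace(/;\s*$/, "")} LIMIT ${effectiveLimit + 1}`
  }
  return { query, effectiveLimit }
}

function postTruncate<T>(rows: T[], effectiveLimit: number) {
  // The effectiveLimit > 0 guard keeps noLimit results from being sliced to zero.
  const truncated = effectiveLimit > 0 && rows.length > effectiveLimit
  return { rows: truncated ? rows.slice(0, effectiveLimit) : rows, truncated }
}
```

With `{ noLimit: true }` the SQL passes through untouched and all rows are returned; interactive callers keep the default 1,000-row cap.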
…m exclusions with user

Column exclusion now has two layers:
1. Name-pattern matching (existing) — updated_at, created_at, _fivetran_synced, etc.
2. Schema-level default detection (new) — queries column_default for NOW(), CURRENT_TIMESTAMP, GETDATE(), SYSDATE, SYSTIMESTAMP, etc.

Covers PostgreSQL, MySQL, Snowflake, SQL Server, Oracle, ClickHouse, DuckDB, SQLite, and Redshift in a single round-trip (no extra query).

The skill prompt now instructs the agent to present detected auto-timestamp columns to the user and ask for confirmation before excluding them, since migrations should preserve timestamps while ETL replication regenerates them.
- `buildColumnDiscoverySQL`: escape single quotes in all interpolated table name parts to prevent SQL injection via crafted source/target names
- `dateTruncExpr`: add Oracle case (`TRUNC(col, 'UNIT')`) — Oracle does not have `DATE_TRUNC`; date-partitioned diffs on Oracle tables previously failed

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Apply esc() to Oracle and SQLite paths in buildColumnDiscoverySQL
(SQL injection via table name was unpatched in these dialects)
- Quote identifiers in resolveTableSources to prevent injection via
table names containing semicolons or special characters
- Surface SQL execution errors before feeding empty rows to the engine
(silent false "match" when warehouse is unreachable is now an error)
- Fix Oracle TRUNC() format model map: 'WEEK' → 'IW' (ISO week)
('WEEK' throws ORA-01800 on all Oracle versions)
- Quote partition column identifier in buildPartitionWhereClause
…r propagation, and test mock formats
- `altimate-core-column-lineage`: fix `[object Object]` in `column_dict` output when source entries are `{ source_table, source_column }` objects instead of strings
- `schema-inspect`: propagate `{ success: false, error }` dispatcher responses to `metadata.error` instead of silently returning empty schema
- `sql-analyze`: guard against null/undefined result from dispatcher to prevent "undefined" literal in output
- `lineage-check`: guard against null/undefined result from dispatcher to prevent "undefined" literal in output
- `simulation-suite.test.ts`: fix `sql-translate` mock format — data fields must be flat (not wrapped in `data: {}`), add `source_dialect`/`target_dialect` to mock so assertions pass
- `simulation-suite.test.ts`: fix `dbt-manifest` mock format — unwrap `data: {}` so `model_count` and `models` are accessible at top level
Simulation suite: 695/839 → 839/839 (100%)
… corruption
The @clickhouse/client package enables ERROR-level logging by default and writes
`[ERROR][@clickhouse/client][Connection]` lines directly to stderr on auth/query
failures. These raw writes corrupt the terminal TUI rendering.
Set `log: { level: 127 }` (ClickHouseLogLevel.OFF) when creating the client —
consistent with how Snowflake (`logLevel: 'OFF'`) and Databricks (no-op logger)
already suppress their SDK loggers for the same reason.
Force-pushed from 38daa6d to 2c58580
…ack script

- Validate table names before interpolating into DESCRIBE/SHOW COLUMNS for ClickHouse and Snowflake — reject names with non-alphanumeric characters to prevent SQL injection; also quote parts with dialect-appropriate delimiters
- Discover partition values from BOTH source and target tables and union the results — previously only source was queried, silently missing rows that existed only in target-side partitions
- Add script/pack-local.ts: mirrors publish.ts but stops before npm publish; injects local altimate-core tarballs from /tmp/altimate-local-dist/ for local end-to-end testing
Require that every diff result summary surfaces:
- Exact scope (tables + warehouses compared)
- Filters and time period applied (or explicitly states none)
- Key columns used and how they were confirmed
- Columns compared and excluded, with reasons (auto-timestamp, user request)
- Algorithm used

Includes an example full result summary and guidance for identical results — emphasising that bare numbers without context are meaningless to the user.
The partitioned diff returned `{ Match: { row_count: 0, algorithm: 'partitioned' } }`
when no partition values were found or all partitions failed. This format lacks
`mode: 'diff'`, so `formatOutcome` fell through to raw JSON.stringify instead
of producing clean output.
Use the standard Rust engine format:
`{ mode: 'diff', stats: {...}, diff_rows: [] }`
…y comparison modes
Claude Code Review
This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.
Pull request overview
Adds a new data-parity / data_diff workflow to the Altimate toolchain by introducing a TypeScript “orchestrator bridge” for the Rust DataParitySession state machine, extending driver execution to support unbounded result sets, and documenting a new /data-parity skill for correct algorithm selection.
Changes:
- Introduces the `data_diff` tool + native connection handler that runs the Rust diff session by executing emitted SQL tasks across warehouse connectors (including partitioned diffs and column auto-discovery).
- Extends the drivers' `Connector.execute()` API with `ExecuteOptions.noLimit` to bypass default LIMIT injection and truncation.
- Adds `/data-parity` skill documentation, updates prompts/registry, and adjusts simulation mocks to match updated dispatcher response shapes.
Reviewed changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| packages/opencode/test/altimate/simulation-suite.test.ts | Updates mocked dispatcher response shapes for dbt manifest + SQL translation scenarios. |
| packages/opencode/src/tool/registry.ts | Registers the new DataDiffTool. |
| packages/opencode/src/altimate/tools/data-diff.ts | Adds the data_diff LLM tool wrapper and output formatting. |
| packages/opencode/src/altimate/prompts/builder.txt | Documents and wires up /data-parity skill invocation triggers. |
| packages/opencode/src/altimate/native/types.ts | Adds bridge types for data.diff params/results and registers the method. |
| packages/opencode/src/altimate/native/connections/register.ts | Registers the native data.diff dispatcher method. |
| packages/opencode/src/altimate/native/connections/data-diff.ts | Implements the TypeScript orchestrator bridging Rust session steps to warehouse SQL execution, plus partition support and column discovery. |
| packages/drivers/src/types.ts | Adds ExecuteOptions and extends Connector.execute() signature to accept options. |
| packages/drivers/src/{postgres,redshift,snowflake,bigquery,mysql,sqlite,duckdb,databricks,sqlserver,oracle,clickhouse}.ts | Threads noLimit option through each driver and fixes truncation detection when limit is disabled. |
| .opencode/skills/data-parity/SKILL.md | Adds the data-parity skill guide and algorithm-selection procedure. |
Comments suppressed due to low confidence (1)
packages/drivers/src/clickhouse.ts:20
- PR description claims the ClickHouse driver is a “pure HTTP driver” with “no external package required”, but this implementation still depends on `@clickhouse/client` and throws an install error if it’s missing.
Either update the PR description to match the current dependency, or rework the driver to avoid the external client if that’s the intended change.
```ts
import type { ConnectionConfig, Connector, ConnectorResult, ExecuteOptions, SchemaColumn } from "./types"

export async function connect(config: ConnectionConfig): Promise<Connector> {
  let createClient: any
  try {
    const mod = await import("@clickhouse/client")
    createClient = mod.createClient ?? mod.default?.createClient
    if (!createClient) {
      throw new Error("createClient export not found in @clickhouse/client")
    }
  } catch {
    throw new Error("ClickHouse driver not installed. Run: npm install @clickhouse/client")
  }
```
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
packages/drivers/src/redshift.ts (1)
49-64: ⚠️ Potential issue | 🔴 Critical — Redshift driver breaks parameterized queries by ignoring bind parameters.

The `execute` method accepts `_binds` but never passes them to `client.query()`, unlike all other drivers (Snowflake uses `executeQuery(query, binds)`, DuckDB uses a conditional `queryWithParams(finalSql, binds)`). This breaks SQL injection protection and query correctness for all callers passing binds.

Rename the parameter to `binds` (remove the underscore) and pass it to the query execution:

🔧 Proposed fix

```diff
- async execute(sql: string, limit?: number, _binds?: any[], options?: ExecuteOptions): Promise<ConnectorResult> {
+ async execute(sql: string, limit?: number, binds?: any[], options?: ExecuteOptions): Promise<ConnectorResult> {
    const client = await pool.connect()
    try {
      const effectiveLimit = options?.noLimit ? 0 : (limit ?? 1000)
      let query = sql
      const isSelectLike = /^\s*(SELECT|WITH|VALUES)\b/i.test(sql)
      if (isSelectLike && effectiveLimit && !/\bLIMIT\b/i.test(sql)) {
        query = `${sql.replace(/;\s*$/, "")} LIMIT ${effectiveLimit + 1}`
      }
-     const result = await client.query(query)
+     const result =
+       binds && binds.length > 0
+         ? await client.query(query, binds)
+         : await client.query(query)
```

packages/drivers/src/mysql.ts (1)
44-56: ⚠️ Potential issue | 🟠 Major — MySQL `execute` accepts a `binds` parameter but never passes it to `pool.query()`.

The interface contract (line 31 of types.ts) requires `binds?: any[]` to be used. Line 44 accepts it as `_binds` (the underscore prefix indicates intentional ignoring), but line 56 calls `pool.query(query)` without them. Call sites like `schema/tags.ts:92` and `finops/*.ts` pass parameterized binds expecting them to work. Parameterized queries with placeholders will fail silently in this driver.

Snowflake shows the correct pattern: pass `binds` to `executeQuery(query, binds)`.

🔧 Proposed fix

```diff
- async execute(sql: string, limit?: number, _binds?: any[], options?: ExecuteOptions): Promise<ConnectorResult> {
+ async execute(sql: string, limit?: number, binds?: any[], options?: ExecuteOptions): Promise<ConnectorResult> {
    const effectiveLimit = options?.noLimit ? 0 : (limit ?? 1000)
    let query = sql
    const isSelectLike = /^\s*(SELECT|WITH|VALUES)\b/i.test(sql)
@@
-   const [rows, fields] = await pool.query(query)
+   const [rows, fields] =
+     binds && binds.length > 0
+       ? await pool.query(query, binds)
+       : await pool.query(query)
```

packages/drivers/src/sqlserver.ts (1)
49-61: ⚠️ Potential issue | 🟡 Minor — Include CTEs in the default row-cap detection.

Line 49 only matches `SELECT`, so `WITH ... SELECT` queries never get `TOP` injected and bypass the default 1,000-row safety cap. All other drivers in this codebase (postgres, oracle, redshift, mysql, duckdb, databricks, snowflake, bigquery) include `WITH` in their `isSelectLike` pattern.

💡 Minimal fix

```diff
- const isSelectLike = /^\s*SELECT\b/i.test(sql)
+ const isSelectLike = /^\s*(SELECT|WITH)\b/i.test(sql)
```
🧹 Nitpick comments (6)
packages/drivers/src/oracle.ts (1)
40-75: Extract the limit/truncation policy into a shared helper.

`effectiveLimit`, `truncated`, and the final row slice are now repeated across the driver `execute` implementations touched in this PR. Centralizing that policy will make future `noLimit` fixes much less likely to drift by dialect.

packages/opencode/src/altimate/native/types.ts (1)
967-983: Narrow `algorithm` to the documented literal union.

Since `data.diff` is a public bridge contract, keeping `algorithm` as a plain `string` defers validation to runtime. The implementation only handles the five documented values; a literal union would catch invalid names at the boundary.

Type refinement

```diff
+export type DataDiffAlgorithm =
+  | "auto"
+  | "joindiff"
+  | "hashdiff"
+  | "profile"
+  | "cascade"
+
 export interface DataDiffParams {
   /** Source table name (e.g. "orders", "db.schema.orders") or full SQL query */
   source: string
@@
   /** Extra columns to compare beyond the key */
   extra_columns?: string[]
   /** Algorithm: "auto" | "joindiff" | "hashdiff" | "profile" | "cascade" */
-  algorithm?: string
+  algorithm?: DataDiffAlgorithm
```

packages/opencode/src/altimate/native/connections/data-diff.ts (1)
390-393: Silent fallback to key-only comparison when discovery fails.

When `discoverExtraColumns` fails (e.g., on a schema query error), it returns `undefined` and the orchestrator silently falls back to key-only comparison. This could mask configuration issues or produce misleading "identical" results.

Consider logging a warning when discovery fails so users are aware.

.opencode/skills/data-parity/SKILL.md (2)
12-23: Add a language specifier to the fenced code block.

The TODO-list code block lacks a language specifier. While this is a minor linting issue, adding a language (e.g., `markdown` or `text`) improves rendering consistency.

📝 Suggested fix

````diff
-```
+```text
 Here's my plan:
 1. [ ] List available warehouse connections
````
165-174: Add language specifiers for the tool invocation examples.

Multiple code blocks showing `data_diff` invocations lack language specifiers (lines 165, 217, 233). Consider using `python` or `javascript` syntax for better readability.

packages/opencode/src/altimate/tools/data-diff.ts (1)
114-115: Prefer a typed outcome over the `as any` cast.

The `outcome` is cast to `any` and passed to helper functions that also treat it as `any`. Consider defining a discriminated union type matching the Rust serialization (`{ mode: "diff" | "profile" | "cascade", ... }`) for better type safety and maintainability.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/drivers/src/clickhouse.ts`:
- Around line 60-63: Replace the magic number 127 with the exported enum
constant by importing ClickHouseLogLevel from '@clickhouse/client' and setting
clientConfig.log = { level: ClickHouseLogLevel.OFF }; update the import list
that currently includes createClient and ExecuteOptions to also import
ClickHouseLogLevel so the createClient(clientConfig) call uses the named
constant instead of 127.
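A minimal sketch of that change as a config fragment, assuming the `@clickhouse/client` package (which exports `ClickHouseLogLevel`, whose `OFF` member corresponds to the magic 127); the URL and surrounding config shape are placeholders, not the repository's actual driver code:

```typescript
import { createClient, ClickHouseLogLevel } from "@clickhouse/client"

// ClickHouseLogLevel.OFF is the named constant behind the magic number 127.
const clientConfig = {
  url: "http://localhost:8123", // placeholder connection settings
  log: { level: ClickHouseLogLevel.OFF },
}

const client = createClient(clientConfig)
```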
In `@packages/opencode/src/altimate/native/connections/data-diff.ts`:
- Around line 458-480: The SELECT in buildPartitionDiscoverySQL interpolates
partitionColumn unquoted causing syntax/injection risk; update
buildPartitionDiscoverySQL to quote the partition column the same way
buildPartitionWhereClause does (use the project’s identifier-quoting helper or
the same quoting logic used by buildPartitionWhereClause) when constructing expr
for all modes (numeric, date, categorical) and when passing partitionColumn into
dateTruncExpr/other helpers so the generated SQL uses a properly quoted
identifier (_p AS derived from the quoted column) and avoids injection/syntax
errors while preserving existing partitionMode/dateTruncExpr behavior.
- Around line 512-523: The ClickHouse branch is using ANSI double-quote
identifier quoting (see the quotedCol usage and the switch case for "clickhouse"
in data-diff.ts) which breaks on reserved words; update the dialect-aware
quoting so that when dialect === "clickhouse" identifiers are wrapped with
backticks (`) rather than double-quotes (") and ensure the partition WHERE
expression generation (the "clickhouse" case returning `${expr} =
toDate('${partitionValue}')`) uses that backtick-quoted identifier; ideally
centralize this in the existing quotedCol helper or add a small helper that
returns backtick-quoted names for ClickHouse and preserves existing quoting for
other dialects.
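A sketch of such a helper, using the `quoteIdentForDialect` name that the follow-up commit mentions; the dialect keys and escaping rules here are assumptions for illustration, not the repository's implementation:

```typescript
// Dialect-aware identifier quoting: backticks for MySQL/ClickHouse,
// brackets for TSQL/Fabric, ANSI double quotes for everything else.
function quoteIdentForDialect(name: string, dialect: string): string {
  switch (dialect) {
    case "mysql":
    case "clickhouse":
      // Backticks, with embedded backticks doubled.
      return "`" + name.replace(/`/g, "``") + "`"
    case "tsql":
    case "fabric":
      // Square brackets, with embedded closing brackets doubled.
      return "[" + name.replace(/]/g, "]]") + "]"
    default:
      // ANSI double quotes, with embedded quotes doubled.
      return '"' + name.replace(/"/g, '""') + '"'
  }
}

// The ClickHouse partition predicate then uses the backtick-quoted column:
const clause = `${quoteIdentForDialect("order_date", "clickhouse")} = toDate('2024-01-01')`
```

Centralizing this in one helper keeps the partition-discovery SQL and the WHERE-clause generation consistent across dialects.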
- Around line 78-92: injectCte currently strips a possible RECURSIVE keyword
because the regex /^WITH\s+/i only matches "WITH " so when rebuilding you end up
with "WITH <ourDefs>,\nRECURSIVE ..." which is invalid; update the detection to
capture an optional RECURSIVE (e.g. use a regex like /^WITH(\s+RECURSIVE)?\s+/i)
so you can preserve that token, compute the remainder (afterWith) the same way,
and rebuild the prefix using the captured recursive group (e.g.
`WITH${recursive} ${ourDefs},\n${afterWith}`) while still stripping only the
leading "WITH…" from ctePrefix via ourDefs = ctePrefix.replace(/^WITH\s+/i, "").
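A minimal sketch of the fixed `injectCte`, assuming it receives the engine's SQL plus a `WITH`-prefixed block of our own definitions (names are illustrative):

```typescript
// Merge our CTE definitions into the engine's SQL, preserving an optional
// RECURSIVE keyword on the engine's own WITH clause.
function injectCte(sql: string, ctePrefix: string): string {
  // Our definitions, with the leading WITH stripped.
  const ourDefs = ctePrefix.replace(/^WITH\s+/i, "")
  // Capture an optional RECURSIVE so it survives the rebuild.
  const match = sql.match(/^WITH(\s+RECURSIVE)?\s+/i)
  if (!match) return `WITH ${ourDefs}\n${sql}`
  const recursive = match[1] ?? ""
  const afterWith = sql.slice(match[0].length)
  return `WITH${recursive} ${ourDefs},\n${afterWith}`
}
```

In dialects like Postgres, `RECURSIVE` applies to the whole `WITH` list, so hoisting it to the front of the merged clause keeps the query valid.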
- Around line 660-667: The partition WHERE clauses are being generated only for
sourceDialect which breaks target parsing when sourceDialect !== targetDialect;
update the partition loop that calls buildPartitionWhereClause (inside the
partitionValues iteration) to produce dialect-specific where clauses for both
sourceDialect and targetDialect (similar to how partition discovery does),
storing them separately and passing both appropriately into runDataDiff / the
DataParitySession spec so the source uses the source WHERE and the target uses
the target WHERE; ensure you reference the same params (partition_column,
partition_granularity, partition_bucket_size) when calling
buildPartitionWhereClause for each dialect.
In `@packages/opencode/src/altimate/tools/data-diff.ts`:
- Around line 82-87: The current ctx.ask call passes args.source and args.target
(table names/queries) to the "sql_execute_read" permission check, which doesn't
validate warehouse-level access; update the permission validation to explicitly
include warehouse identifiers so both source and target warehouses are
authorized — e.g., change the ctx.ask invocation in data-diff.ts (the call to
ctx.ask for "sql_execute_read") to include args.source_warehouse and
args.target_warehouse in the patterns or perform separate ctx.ask calls for each
warehouse, ensuring both warehouses are checked before executing queries.
---
Outside diff comments:
In `@packages/drivers/src/mysql.ts`:
- Around line 44-56: The execute function currently ignores the binds parameter
(_binds) and calls pool.query(query) without passing parameters, causing
parameterized queries to fail; update the execute signature to accept binds
(e.g., binds?: any[]) or rename _binds to binds and pass them into pool.query
(call pool.query(query, binds) or pool.query(query, binds ?? []) so existing
parameterized callers work), ensuring the LIMIT augmentation logic still
operates on the query string and does not drop or alter binds; key symbols:
execute(sql, limit?, _binds?, options?), pool.query(query).
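A sketch of the corrected shape, with a minimal stub standing in for mysql2's pool; only the binds plumbing and the limit augmentation are the point, and the signature is assumed from the review text rather than copied from the driver:

```typescript
type Row = Record<string, unknown>

// Minimal stand-in for mysql2's Pool; only the call shape matters here.
interface PoolLike {
  query(sql: string, binds?: unknown[]): Promise<Row[]>
}

async function execute(
  pool: PoolLike,
  sql: string,
  limit?: number,
  binds?: unknown[], // was `_binds`: now forwarded to the driver
  options?: { noLimit?: boolean },
): Promise<Row[]> {
  const effectiveLimit = options?.noLimit ? 0 : limit ?? 1000
  let query = sql
  // LIMIT augmentation still operates on the query string only.
  if (effectiveLimit > 0 && !/\bLIMIT\b/i.test(query)) {
    query = `${query} LIMIT ${effectiveLimit}`
  }
  // Pass binds through so parameterized (?) placeholders keep working.
  return pool.query(query, binds ?? [])
}
```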
In `@packages/drivers/src/redshift.ts`:
- Around line 49-64: The execute method in packages/drivers/src/redshift.ts
currently ignores the bind parameters by naming the param _binds and not passing
them to client.query; rename the method parameter from _binds to binds (and
update any internal references) and pass binds into the database call (i.e.,
client.query(query, binds)) so parameterized queries work like other drivers
(see Snowflake/DuckDB patterns) and preserve SQL injection protection and
correctness.
In `@packages/drivers/src/sqlserver.ts`:
- Around line 49-61: The current isSelectLike regex only matches queries
starting with SELECT so CTEs (WITH ... SELECT) bypass the TOP injection; update
the isSelectLike detection in sqlserver.ts to also match queries that begin with
WITH ... SELECT (i.e., treat queries with leading CTEs as "select-like"), and
adjust the TOP-insertion logic so it still injects TOP for the main SELECT
following the CTE (keep using effectiveLimit and the existing checks for
existing TOP/LIMIT). Target symbols: isSelectLike, effectiveLimit, and the query
replacement that inserts TOP after SELECT.
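One way to sketch the CTE-aware detection and TOP insertion; this simplified version tracks parenthesis depth to find the main SELECT and ignores string literals and comments, so it is illustrative rather than a drop-in fix:

```typescript
// Find the first SELECT at parenthesis depth 0 (the main SELECT after any
// leading CTEs) and inject TOP there. Skips queries that already carry
// TOP or LIMIT, matching the existing driver checks.
function injectTop(sql: string, limit: number): string {
  if (!/^\s*(SELECT|WITH)\b/i.test(sql)) return sql
  if (/\bTOP\b/i.test(sql) || /\bLIMIT\b/i.test(sql)) return sql
  const upper = sql.toUpperCase()
  let depth = 0
  for (let i = 0; i < sql.length; i++) {
    const ch = sql[i]
    if (ch === "(") depth++
    else if (ch === ")") depth--
    else if (
      depth === 0 &&
      upper.startsWith("SELECT", i) &&
      !/\w/.test(sql[i - 1] ?? " ")
    ) {
      return sql.slice(0, i + 6) + ` TOP ${limit}` + sql.slice(i + 6)
    }
  }
  return sql
}
```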
---
Nitpick comments:
In @.opencode/skills/data-parity/SKILL.md:
- Around line 12-23: The fenced TODO list code block lacks a language tag;
update the triple-backtick fence surrounding the plan (the multiline TODO block
starting with "Here's my plan:") to include a language specifier such as "text"
or "markdown" (e.g., ```text) so the block renders consistently; ensure you
change both opening and closing fences only, leaving the inner checklist lines
unchanged.
- Around line 165-174: The code examples invoking the data_diff helper are
missing fenced-code language specifiers; update each code block that shows
data_diff (the three examples invoking data_diff with
source/target/key_columns/source_warehouse/target_warehouse/algorithm) to use a
language tag like ```python (or ```javascript) so syntax highlighting and
readability improve for those code fences.
In `@packages/drivers/src/oracle.ts`:
- Around line 40-75: The limit/truncation logic repeated in execute (variables
effectiveLimit, truncated, and slicing to produce limitedRows) should be moved
into a shared helper (e.g., a util like computeLimitAndRows or applyLimitPolicy)
and invoked from Oracle's async execute method; refactor so execute only calls
the helper with (sqlLimit, options?.noLimit, rows) and uses the returned {
effectiveLimit, truncated, rows: limitedRows, row_count } to build the
ConnectorResult, keeping existing behavior for isSelectLike/ROWNUM wrapping
inside execute and preserving columns extraction via result.metaData.
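A sketch of such a helper (the `applyLimitPolicy` name comes from the review's suggestion; the returned field names mirror the described `ConnectorResult` usage but are otherwise assumptions):

```typescript
// Shared limit/truncation policy: noLimit disables the cap entirely,
// otherwise the explicit limit (or a 1,000-row default) applies.
function applyLimitPolicy<T>(
  rows: T[],
  sqlLimit: number | undefined,
  noLimit?: boolean,
) {
  const effectiveLimit = noLimit ? 0 : sqlLimit ?? 1000
  const truncated = effectiveLimit > 0 && rows.length > effectiveLimit
  const limited = truncated ? rows.slice(0, effectiveLimit) : rows
  return { effectiveLimit, truncated, rows: limited, row_count: limited.length }
}
```

Each driver's `execute` would then call this once and keep its dialect-specific concerns (ROWNUM wrapping, column extraction) local.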
In `@packages/opencode/src/altimate/native/connections/data-diff.ts`:
- Around line 390-393: The catch block in discoverExtraColumns swallows errors
and returns undefined, causing a silent fallback to key-only comparison; modify
the catch to capture the error (catch (err)) and emit a warning log before
returning undefined so users are informed (e.g., use the module's logger or
propagate to the orchestrator) and include the error message/context in the log;
ensure the change is made inside the discoverExtraColumns function where the
current `catch { return undefined }` resides.
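A minimal sketch of the non-silent fallback; the `fetchSchema` and `log` parameters are stand-ins for the real schema query and the module's logger:

```typescript
// Capture the error and warn before falling back, instead of silently
// downgrading to a key-only comparison.
async function discoverExtraColumns(
  fetchSchema: () => Promise<string[]>,
  log: (msg: string) => void,
): Promise<string[] | undefined> {
  try {
    return await fetchSchema()
  } catch (err) {
    const detail = err instanceof Error ? err.message : String(err)
    log(`extra-column discovery failed, falling back to key-only diff: ${detail}`)
    return undefined
  }
}
```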
In `@packages/opencode/src/altimate/native/types.ts`:
- Around line 967-983: The DataDiffParams interface currently types the
algorithm field as a plain string; narrow it to the documented literal union so
invalid values are caught at compile time. Update the DataDiffParams type (the
algorithm property) to be the union "auto" | "joindiff" | "hashdiff" | "profile"
| "cascade" (optionally keep it optional if intended), and adjust any call sites
or tests that pass non-literal strings to use one of those five literals; ensure
any type imports or references to DataDiffParams reflect the new union.
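A sketch of the narrowed type, plus a runtime guard that mirrors it; the five literals are taken from the review, while the other `DataDiffParams` fields shown are an illustrative subset:

```typescript
// The five documented values, expressed once and reused for both the
// compile-time union and a runtime guard.
const ALGORITHMS = ["auto", "joindiff", "hashdiff", "profile", "cascade"] as const
type DataDiffAlgorithm = (typeof ALGORITHMS)[number]

// Illustrative subset of DataDiffParams; only `algorithm` is the point here.
interface DataDiffParams {
  source: string
  target: string
  key_columns: string[]
  algorithm?: DataDiffAlgorithm
}

function isAlgorithm(v: string): v is DataDiffAlgorithm {
  return (ALGORITHMS as readonly string[]).includes(v)
}
```

With this in place, a call site passing `algorithm: "fast"` fails at compile time rather than at run time.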
In `@packages/opencode/src/altimate/tools/data-diff.ts`:
- Around line 114-115: The code uses an untyped cast `const outcome =
result.outcome as any` and then passes it into `formatOutcome(outcome,
args.source, args.target)`, losing type safety; define a discriminated union
(e.g., type Outcome = { mode: "diff"; ... } | { mode: "profile"; ... } | { mode:
"cascade"; ... }) that matches the Rust serialization, replace the `as any` cast
with `const outcome: Outcome = result.outcome`, and update `formatOutcome` (and
any helper functions it calls) to accept `Outcome` so callers can narrow on
`outcome.mode` instead of relying on `any`.
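A hedged sketch of what that union and a narrowing consumer could look like; only the `mode` discriminant comes from the review text, and the field names inside each variant are assumptions:

```typescript
// Discriminated union matching the described Rust serialization shape.
type Outcome =
  | { mode: "diff"; rows_added: number; rows_removed: number; rows_changed: number }
  | { mode: "profile"; column_stats: Record<string, unknown> }
  | { mode: "cascade"; tables_compared: number }

// With the union in place, callers narrow on `mode` instead of using `any`.
function summarize(outcome: Outcome): string {
  switch (outcome.mode) {
    case "diff":
      return `+${outcome.rows_added} -${outcome.rows_removed} ~${outcome.rows_changed}`
    case "profile":
      return `profiled ${Object.keys(outcome.column_stats).length} columns`
    case "cascade":
      return `compared ${outcome.tables_compared} tables`
  }
}
```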
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 4ae147d4-a6e5-4a2e-8d5c-cfa62bc9db0a
📒 Files selected for processing (20)
- .opencode/skills/data-parity/SKILL.md
- packages/drivers/src/bigquery.ts
- packages/drivers/src/clickhouse.ts
- packages/drivers/src/databricks.ts
- packages/drivers/src/duckdb.ts
- packages/drivers/src/mysql.ts
- packages/drivers/src/oracle.ts
- packages/drivers/src/postgres.ts
- packages/drivers/src/redshift.ts
- packages/drivers/src/snowflake.ts
- packages/drivers/src/sqlite.ts
- packages/drivers/src/sqlserver.ts
- packages/drivers/src/types.ts
- packages/opencode/src/altimate/native/connections/data-diff.ts
- packages/opencode/src/altimate/native/connections/register.ts
- packages/opencode/src/altimate/native/types.ts
- packages/opencode/src/altimate/prompts/builder.txt
- packages/opencode/src/altimate/tools/data-diff.ts
- packages/opencode/src/tool/registry.ts
- packages/opencode/test/altimate/simulation-suite.test.ts
suryaiyer95
left a comment
Multi-Model Code Review — PR #493 + PR #114
Verdict: REQUEST CHANGES
Critical Issues: 1 | Major Issues: 6 | Minor Issues: 4 | Nits: 3
Reviewed by 3 AI models: Claude, Gemini 3.1 Pro, Grok 4. 1 convergence round. Full review posted on companion PR altimate-core-internal #114.
This PR's TypeScript orchestrator is well-crafted — the noLimit driver flag, CTE injection, and column auto-discovery are clean implementations. The issues below are specific to this PR.
Issues in this PR
isQuery detection could misclassify table names (Minor — Logic Error)
- File: `packages/opencode/src/altimate/native/connections/data-diff.ts:23-26`
- Regex `/^\s*(SELECT|WITH|VALUES)\b/i` misclassifies table names like `with_metadata` or `select_results`.
- Fix: Document the limitation or require explicit signaling (e.g., a separate `sourceIsQuery` parameter).
- Flagged by: Claude
Implicit truthy check on effectiveLimit (Nit — Code Quality)
- File: `packages/drivers/src/*.ts` (all 12 driver files)
- The limit-appending condition checks `effectiveLimit && !/\bLIMIT\b/i.test(sql)`. Since `noLimit: true` sets `effectiveLimit` to `0` (falsy), this works but relies on implicit coercion.
- Fix: Change to `effectiveLimit > 0` for explicit intent.
- Flagged by: Gemini
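The explicit check could be factored as, for example (function name assumed for illustration):

```typescript
// Explicit-intent version of the shared driver condition: append a LIMIT
// only when a positive cap is set and the query doesn't already have one.
function shouldAppendLimit(sql: string, effectiveLimit: number): boolean {
  return effectiveLimit > 0 && !/\bLIMIT\b/i.test(sql)
}
```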
Positive Observations
- `noLimit` flag — Clean, consistent addition across all 10+ drivers with a backwards-compatible optional parameter
- Auto-timestamp exclusion — Two-layer detection (name patterns + schema-level DEFAULT analysis) is smart
- ClickHouse stderr silencing — `clientConfig.log = { level: 127 }` prevents raw ERROR output from corrupting the terminal TUI
- CTE injection — `injectCte` correctly handles merging CTE blocks when the engine emits its own `WITH` clauses
- Well-designed SKILL.md — Critical sections on algorithm selection, cost-aware interaction, and the joindiff cross-DB trap
Missing Tests
- No test for `isQuery` edge cases (table names like `with_metadata`)
- No test for CTE injection edge cases (nested/`RECURSIVE` CTEs)
- No partitioned diff simulation tests end-to-end
- No test for `discoverExtraColumns` with auto-timestamp default detection
See full review with all 14 findings and attribution table on altimate-core-internal PR #114.
Follow-up: Review Issue Status (post-fixes)
All major issues from the review have been addressed. Here's the verified status:
FIXED
STILL OPEN (minor)
…ng, query+partition guard
- Oracle day granularity: 'DDD' (day-of-year) → 'DD' (day-of-month)
- Add `quoteIdentForDialect()` helper: MySQL/ClickHouse use backticks, TSQL/Fabric use brackets, others use ANSI double-quotes
- `buildPartitionDiscoverySQL` and `buildPartitionWhereClause` now use dialect-aware quoting instead of hardcoded double-quotes
- `runPartitionedDiff` rejects SQL queries as source/target with a clear error — partitioning requires table names to discover column values
🧹 Nitpick comments (1)
packages/opencode/src/altimate/native/connections/data-diff.ts (1)
766-773: Consider extracting `resolveDialect` as a module-level helper. This function is duplicated at lines 623-630. Extract it once and reuse in both `runPartitionedDiff` and `runDataDiff`.

♻️ Proposed extraction

Add at module level (e.g., around line 400):

```typescript
function resolveDialect(warehouse: string | undefined): string {
  if (warehouse) {
    const cfg = Registry.getConfig(warehouse)
    return cfg?.type ?? "generic"
  }
  const warehouses = Registry.list().warehouses
  return warehouses[0]?.type ?? "generic"
}
```

Then remove the local definitions in both functions and use the shared helper.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/opencode/src/altimate/native/connections/data-diff.ts` around lines 766 - 773, Extract the duplicated resolveDialect logic into a single module-level helper named resolveDialect and replace the local copies inside runPartitionedDiff and runDataDiff with calls to that helper; specifically, remove the duplicate function definitions in both functions, add one shared resolveDialect(warehouse: string | undefined): string at module scope (copying the existing logic that uses Registry.getConfig and Registry.list().warehouses) and update runPartitionedDiff and runDataDiff to call resolveDialect(warehouse) instead of their local implementations.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@packages/opencode/src/altimate/native/connections/data-diff.ts`:
- Around line 766-773: Extract the duplicated resolveDialect logic into a single
module-level helper named resolveDialect and replace the local copies inside
runPartitionedDiff and runDataDiff with calls to that helper; specifically,
remove the duplicate function definitions in both functions, add one shared
resolveDialect(warehouse: string | undefined): string at module scope (copying
the existing logic that uses Registry.getConfig and Registry.list().warehouses)
and update runPartitionedDiff and runDataDiff to call resolveDialect(warehouse)
instead of their local implementations.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: d342a309-c72e-4721-8c31-8272ec693ffa
📒 Files selected for processing (1)
packages/opencode/src/altimate/native/connections/data-diff.ts
Summary
Adds the full data-parity diffing pipeline for the `data_diff` LLM tool: a TypeScript orchestrator that bridges the Rust `DataParitySession` state machine with all database drivers, plus a comprehensive SKILL.md that guides the model through algorithm selection.
Core: TypeScript orchestrator (`data-diff.ts`, +799 lines)
Cooperative state machine bridge — TypeScript executes SQL tasks generated by the Rust engine and feeds results back. Neither layer touches the other's domain.
New: ClickHouse driver (`clickhouse.ts`, +135 lines)
Pure HTTP driver using `JSONCompactEachRowWithNamesAndTypes` — no external package required. Handles DDL (empty response) vs SELECT (typed rows) branches correctly.
`ExecuteOptions` — all 10+ drivers updated
New `noLimit?: boolean` flag bypasses the default 1,000-row cap for data-diff pipelines. Applied consistently across Postgres, Snowflake, BigQuery, Redshift, MySQL, SQLite, DuckDB, Trino, Databricks, ClickHouse.
SKILL.md (+341 lines)
Comprehensive algorithm-selection guide:
Fixes applied (from code review)
Test plan
Summary by CodeRabbit
New Features
Documentation
Other