feat: data-parity skill — TypeScript orchestrator, ClickHouse driver, partition support#493
suryaiyer95 wants to merge 21 commits into main from
Conversation
📝 Walkthrough

Adds an end-to-end “data parity” feature: new skill docs and prompts, a DataDiffTool, a dispatcher handler that runs a Rust DataParity state machine via a new orchestration module, driver updates to support no-limit execution, types for data-diff contracts, and test mock adjustments.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User as User/Client
    participant Tool as DataDiffTool
    participant Dispatcher as Dispatcher
    participant Orch as DataDiff Orchestration
    participant Engine as Rust DataParity Engine
    participant DB as Warehouse Connections
    User->>Tool: invoke data_diff(params)
    Tool->>Dispatcher: call("data.diff", params)
    Dispatcher->>Orch: runDataDiff(params)
    Orch->>Orch: resolveTableSources() / build CTEs
    Orch->>DB: fetch schema & partition values
    DB-->>Orch: column metadata / partition list
    Orch->>Engine: create DataParitySession
    loop Engine emits actions
        Engine-->>Orch: ExecuteSql(sql)
        Orch->>Orch: injectCte(sql)
        Orch->>DB: execute SQL (noLimit=true)
        DB-->>Orch: rows (normalized)
        Orch->>Engine: push results
    end
    alt partitioned
        Orch->>Orch: iterate partitions -> runDataDiff(partitioned_params)
        Orch->>Orch: mergeOutcomes(per_partition_results)
    end
    Engine-->>Orch: Done / Outcome
    Orch-->>Dispatcher: DataDiffResult
    Dispatcher-->>Tool: result
    Tool-->>User: formatted summary
```

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Closing — .opencode/ skill config and model defaults should not live in the open source repo.
Force-pushed from 0f8c7ac to 7909e55
Companion fix: column name collision in
suryaiyer95 left a comment
Multi-Model Code Review — altimate-code#493
Reviewed by 4-model panel: Claude Sonnet 4.6, Gemini 3.1 Pro, Grok 4, Kimi K2.5. Unanimous verdict: REQUEST CHANGES.
Note: The PR title significantly undersells scope. This contains a 799-line TypeScript orchestrator, new ClickHouse driver, `ExecuteOptions` refactored across 10+ drivers, column auto-discovery, partition support, and a 341-line SKILL.md.
Verdict: REQUEST CHANGES (fix MAJOR-1 before merge)
Major Issues: 1 | Minor Issues: 4 | NITs: 3
Major Issues
MAJOR-1 — SQL injection via table name interpolation in `buildColumnDiscoverySQL` (Security)
- Location: `packages/opencode/src/altimate/native/connections/data-diff.ts` — `buildColumnDiscoverySQL`
- Table name parts are interpolated directly into SQL:

  ```ts
  schemaFilter = `table_schema = '${parts[1]}'`
  tableFilter = `table_name = '${parts[2]}'`
  ```

  `tableName` flows from `DataDiffParams.source`, which comes from the LLM/user. A crafted table name like `orders' OR 1=1 --` produces injected SQL.
- Fix:

  ```ts
  const escape = (s: string) => s.replace(/'/g, "''")
  schemaFilter = `table_schema = '${escape(parts[1])}'`
  tableFilter = `table_name = '${escape(parts[2])}'`
  ```
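The proposed fix can be sketched as a small self-contained helper. This is an illustrative sketch only: `escapeSqlString` and `buildFilters` are hypothetical names, not the actual code in `data-diff.ts`, and the `parts` layout is assumed from the review snippet.

```typescript
// Hypothetical sketch of MAJOR-1's fix: escape single quotes in every
// value interpolated into an information_schema filter. Helper names
// (escapeSqlString, buildFilters) are illustrative, not from the PR.
const escapeSqlString = (s: string): string => s.replace(/'/g, "''")

function buildFilters(parts: string[]): { schemaFilter?: string; tableFilter: string } {
  // parts: segments of a dotted table name, e.g. ["db", "schema", "table"]
  const table = parts[parts.length - 1]
  const schema = parts.length > 1 ? parts[parts.length - 2] : undefined
  return {
    schemaFilter: schema ? `table_schema = '${escapeSqlString(schema)}'` : undefined,
    tableFilter: `table_name = '${escapeSqlString(table)}'`,
  }
}

// The crafted name from the finding can no longer terminate the string literal:
const filters = buildFilters(["public", "orders' OR 1=1 --"])
console.log(filters.tableFilter)
```

Doubling single quotes keeps the whole name inside one SQL string literal, so the `OR 1=1` fragment is matched as data rather than executed.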
Minor Issues
MINOR-1 — `discoverExtraColumns` silently falls back to key-only comparison (Logic Error)
- When column discovery fails, the diff runs in key-only mode — reporting all rows as identical even when non-key values differ. No warning is surfaced.
- Fix: Include a `warning` field in `DataDiffResult` when key-only fallback occurs.
MINOR-2 — `dateTruncExpr` missing Oracle dialect (Bug)
- Oracle falls through to the default `DATE_TRUNC` case. Oracle uses `TRUNC(col, 'MONTH')` — date-partitioned diffs on Oracle tables will fail.
- Fix: Add an Oracle case: `` case "oracle": return `TRUNC(${column}, '${g.toUpperCase()}')` ``
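A dialect switch with the suggested Oracle case might look like the sketch below. This is illustrative only, not the PR's actual `dateTruncExpr`; note that a later commit in this PR corrects the week format model to `'IW'`, because `TRUNC(col, 'WEEK')` raises ORA-01800, so the sketch uses a format-model map rather than `toUpperCase()`.

```typescript
// Illustrative dialect-aware date truncation helper; not the PR's actual
// dateTruncExpr. Oracle lacks DATE_TRUNC and takes a TRUNC format model.
type Granularity = "day" | "week" | "month" | "year"

function dateTruncExpr(dialect: string, column: string, g: Granularity): string {
  switch (dialect) {
    case "oracle": {
      // ISO week is 'IW'; passing 'WEEK' throws ORA-01800 on Oracle.
      const model: Record<Granularity, string> = { day: "DD", week: "IW", month: "MM", year: "YYYY" }
      return `TRUNC(${column}, '${model[g]}')`
    }
    default:
      // Common DATE_TRUNC('unit', col) form; real engines differ in detail
      // (e.g. BigQuery's argument order), so treat this default as a sketch.
      return `DATE_TRUNC('${g}', ${column})`
  }
}

console.log(dateTruncExpr("oracle", "order_date", "month"))   // TRUNC(order_date, 'MM')
console.log(dateTruncExpr("postgres", "order_date", "month")) // DATE_TRUNC('month', order_date)
```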
MINOR-3 — Unqualified table names match across multiple schemas (Bug)
- `buildColumnDiscoverySQL` with no schema filter returns columns from all schemas with that table name — producing incorrect `extra_columns`.
- Fix: Default `schemaFilter` to the connection's current schema, or document that fully-qualified names are required.
MINOR-4 — `MAX_STEPS = 200` hardcoded and undocumented (Design)
- Not configurable, not documented in `DataDiffParams` or tool description.
- Fix: Expose as optional `max_steps?: number` in `DataDiffParams`, defaulting to 200.
NITs
- N1: `mergeOutcomes` caps `diff_rows` at 100 with no documentation — callers don't know the result is partial
- N2: `partitionColumn` identifier interpolated directly in `buildPartitionDiscoverySQL` — inconsistent with MAJOR-1 fix
- N3: `effectiveLimit` pattern repeated across 10+ drivers
Positive Observations
- The cooperative state machine bridge (TypeScript orchestrator ↔ Rust engine) is the right architecture.
- Column auto-discovery with two-layer audit exclusion is thorough.
- `ExecuteOptions.noLimit` correctly applied across all 10+ drivers.
- SKILL.md `CRITICAL: joindiff only sees one connection for cross-warehouse comparisons` is exactly the right guardrail.
- ClickHouse driver using `JSONCompactEachRowWithNamesAndTypes` with pure HTTP is clean.
- Dynamic import gracefully handles NAPI unavailability — unblocks CI.
Force-pushed from 8c7ef31 to 1468eba
suryaiyer95 left a comment
Code Review: Data Parity TypeScript Orchestrator + ClickHouse Driver
Verdict: APPROVE with minor issues
Reviewed by: Claude (sole reviewer — external models unavailable)
Major Issue (1)
1. SQL Injection via buildColumnDiscoverySQL — ClickHouse/Snowflake paths (Security)
Location: packages/opencode/src/altimate/native/connections/data-diff.ts
The ClickHouse path uses DESCRIBE TABLE ${tableName} and Snowflake uses SHOW COLUMNS IN TABLE ${tableName} with raw string interpolation. While the information_schema paths properly escape with esc(), ClickHouse/Snowflake bypass escaping entirely.
A table name like orders; DROP TABLE users -- would be injected verbatim. The tableName comes from LLM tool params, so prompt injection or user error could trigger this.
Fix: Apply identifier quoting (backtick for ClickHouse, double-quote for Snowflake) or validate table names against [a-zA-Z0-9_.].
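The suggested mitigation could be sketched as follows. The helper name `quoteTableName` is hypothetical; the character class and quoting delimiters come from the review text.

```typescript
// Sketch of the suggested mitigation; the helper name is illustrative.
// Validate the table name against a conservative character set, then quote
// each dotted part with the dialect's identifier delimiter (backtick for
// ClickHouse, double quote for Snowflake).
function quoteTableName(tableName: string, dialect: "clickhouse" | "snowflake"): string {
  if (!/^[A-Za-z0-9_.]+$/.test(tableName)) {
    throw new Error(`Unsafe table name: ${tableName}`)
  }
  const q = dialect === "clickhouse" ? "`" : '"'
  return tableName
    .split(".")
    .map((part) => `${q}${part}${q}`)
    .join(".")
}

console.log(quoteTableName("db.orders", "clickhouse")) // `db`.`orders`
```

A name like `orders; DROP TABLE users --` fails the allow-list check before any SQL string is built.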
Minor Issues (4)
2. Profile outcome formatting uses wrong field names — tools/data-diff.ts
The TS formatter checks col.source_stats/col.target_stats but Rust serializes as col.table1/col.table2. Stats fields are non_null_count/distinct_count in Rust but the TS uses count/null_count. Profile output would show undefined values.
3. CTE injection doesn't handle ClickHouse SETTINGS clause — data-diff.ts:injectCte()
JoinDiff for ClickHouse emits ... SETTINGS join_use_nulls = 1. CTE-wrapped queries with SETTINGS may not propagate correctly in all ClickHouse versions.
4. Partition discovery fails on query sources — data-diff.ts:runPartitionedDiff()
When source is a SQL query, table1Name resolves to __diff_source but the partition discovery SQL doesn't include the CTE prefix. This would fail with "table __diff_source not found".
5. noLimit creates unbounded memory risk for JoinDiff — all 11 driver files
When noLimit is true, no LIMIT clause is added. For JoinDiff, the result set could be millions of rows loaded into Node.js memory. Consider a safety limit (e.g., 10M rows) even with noLimit.
Nit (1)
`simulation-suite.test.ts` changes (removing the `data:` wrapper, flattening mock response fields) are unrelated bug fixes that should be in a separate commit
Positive Observations
- Auto-discovery of audit columns with two-layer exclusion (name patterns + schema default detection) is production-quality. The SKILL.md workflow requiring user confirmation before excluding is excellent UX
- CTE injection for query-vs-query comparison is clever — wrapping arbitrary SQL in CTEs so the Rust engine treats them as tables
- Partition support (date/numeric/categorical) with per-partition breakdown reporting makes this usable on billion-row tables
- SKILL.md is thorough — the 9-step plan template, algorithm selection guide, `extra_columns` behavior docs, and common mistakes section are exactly what an LLM agent needs
- Consistent `noLimit` across all 11 drivers — clean, mechanical refactor with a correct `effectiveLimit > 0` guard for truncation detection
- Defensive null guards added to `lineage-check.ts`, `schema-inspect.ts`, `sql-analyze.ts` — good hardening of existing tools
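The CTE-injection idea praised above can be sketched in a few lines. The names `injectCte` and `__diff_source`/`__diff_target` follow the review text; the body is an assumption about one plausible implementation, not the PR's actual code.

```typescript
// Illustrative sketch of CTE injection: wrap arbitrary source/target SQL
// in CTEs so the engine-emitted SQL can reference them as tables.
function injectCte(engineSql: string, sourceSql: string, targetSql: string): string {
  const prefix = `WITH __diff_source AS (${sourceSql}), __diff_target AS (${targetSql}) `
  // If the engine's SQL already starts with WITH, merge the CTE lists
  // instead of emitting two WITH keywords.
  if (/^\s*WITH\b/i.test(engineSql)) {
    return engineSql.replace(/^\s*WITH\b/i, `${prefix.trimEnd()},`)
  }
  return prefix + engineSql
}

const wrapped = injectCte("SELECT * FROM __diff_source", "SELECT 1", "SELECT 2")
console.log(wrapped)
```

The merge branch is exactly the case the "Missing Tests" section below flags as untested: engine CTEs combined with source/target CTEs.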
Missing Tests
- No test for `injectCte` merging engine CTEs with source/target CTEs
- No test for partition discovery failure (error propagation from `runPartitionedDiff`)
- No integration test exercising the `DataParitySession` NAPI round-trip from TypeScript
Reviewed by 1 model: Claude. External models (Gemini 3.1 Pro, Kimi K2.5, Grok 4) unavailable due to insufficient OpenRouter credits.
🤖 Generated with Claude Code
Supplemental Review Findings

Two additional issues caught during deeper analysis: MAJOR: Partition discovery misses target-only partitions (
Force-pushed from e1195ba to 161051d
- Add DataParity engine integration via native Rust bindings
- Add data-diff tool for LLM agent (profile, joindiff, hashdiff, cascade, auto)
- Add ClickHouse driver support
- Add data-parity skill: profile-first workflow, algorithm selection guide, CRITICAL warning that joindiff cannot run cross-database (always returns 0 diffs), output style rules (facts only, no editorializing)
- Gitignore .altimate-code/ (credentials) and *.node (platform binaries)
Split large tables by a date or numeric column before diffing. Each partition is diffed independently, then results are aggregated.

New params:
- partition_column: column to split on (date or numeric)
- partition_granularity: day | week | month | year (for dates)
- partition_bucket_size: bucket width for numeric columns

New output field:
- partition_results: per-partition breakdown (identical / differ / error)

Dialect-aware SQL: Postgres, Snowflake, BigQuery, ClickHouse, MySQL. Skill updated with partition guidance and examples.
When partition_column is set without partition_granularity or partition_bucket_size, the diff groups by raw DISTINCT values. This works for any non-date, non-numeric column: status, region, country, etc. The WHERE clause uses equality (col = 'value') with proper escaping.
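The equality predicate this commit describes might look like the sketch below. The helper name mirrors the PR's `buildPartitionWhereClause`, but the body is an assumption; it also quotes the column identifier, which a later commit in this PR adds.

```typescript
// Sketch of the categorical-partition predicate: quoted identifier on the
// left, single-quote-escaped string literal on the right. Illustrative only.
function buildPartitionWhereClause(column: string, value: string): string {
  const ident = `"${column.replace(/"/g, '""')}"`      // identifier quoting
  const literal = `'${value.replace(/'/g, "''")}'`     // value escaping
  return `${ident} = ${literal}`
}

console.log(buildPartitionWhereClause("region", "O'Brien")) // "region" = 'O''Brien'
```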
Rust serializes ReladiffOutcome with serde tag 'mode', producing:
{mode: 'diff', diff_rows: [...], stats: {rows_table1, rows_table2, exclusive_table1, exclusive_table2, updated, unchanged}}
Previous code checked for {Match: {...}} / {Diff: {...}} shapes that
never matched, causing partitioned diff to report all partitions as
'identical' with 0 rows.
- extractStats(): check outcome.mode === 'diff', read from stats fields
- mergeOutcomes(): aggregate mode-based outcomes correctly
- summarize()/formatOutcome(): display mode-based shape with correct labels
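The serde-tagged shape this commit describes lends itself to a TypeScript discriminated union. The sketch below is illustrative: the `"diff"` variant fields come from the commit message above, while the other variants are abbreviated placeholders.

```typescript
// Sketch of the mode-tagged outcome contract; not the actual PR types.
interface DiffStats {
  rows_table1: number
  rows_table2: number
  exclusive_table1: number
  exclusive_table2: number
  updated: number
  unchanged: number
}

type ReladiffOutcome =
  | { mode: "diff"; diff_rows: unknown[]; stats: DiffStats }
  | { mode: "profile"; columns?: unknown[] }   // placeholder variant
  | { mode: "cascade"; results?: unknown[] }   // placeholder variant

// The fix: check outcome.mode, not the old {Match}/{Diff} shapes.
function extractStats(outcome: ReladiffOutcome): DiffStats | undefined {
  return outcome.mode === "diff" ? outcome.stats : undefined
}

const outcome: ReladiffOutcome = {
  mode: "diff",
  diff_rows: [],
  stats: { rows_table1: 10, rows_table2: 10, exclusive_table1: 0, exclusive_table2: 0, updated: 2, unchanged: 8 },
}
console.log(extractStats(outcome)?.updated) // 2
```

Narrowing on `mode` also addresses the later review nit about `as any` casts in the tool's formatter.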
Key changes based on feedback:
- Always generate TODO plan before any tool is called
- Enforce data_diff tool usage (never manual EXCEPT/JOIN SQL)
- Add PK discovery + explicit user confirmation step
- Profile pass is now mandatory before row-level diff
- Ask user before expensive row-level diff on large tables:
  - <100K rows: proceed automatically
  - 100K-10M rows: ask with where_clause option
  - >10M rows: offer window/partition/full choices
- Document partition modes (date/numeric/categorical) with examples
- Add warehouse_list as first step to confirm connections
…from data diff

The Rust engine only compares columns explicitly listed in extra_columns. When omitted, it was silently reporting all key-matched rows as 'identical' even when non-key values differed — a false-positive bug.

Changes:
- Auto-discover columns from information_schema when extra_columns is omitted and source is a plain table name (not a SQL query)
- Exclude audit/timestamp columns (updated_at, created_at, inserted_at, modified_at, _fivetran_*, _airbyte_*, publisher_last_updated_*, etc.) from comparison by default since they typically differ due to ETL timing
- Report excluded columns in tool output so users know what was skipped
- Fix misleading tool description that said 'Omit to compare all columns'
- Update SKILL.md with critical guidance on extra_columns behavior
…ult truncation
All drivers default to `LIMIT 1001` on SELECT queries and post-truncate to
1000 rows. This silently drops rows when the data-diff engine needs complete
result sets — a FULL OUTER JOIN returning >1000 diff rows would be truncated,
causing the engine to undercount differences.
- Add `ExecuteOptions { noLimit?: boolean }` to the `Connector` interface
- When `noLimit: true`, set `effectiveLimit = 0` (falsy) so the existing
LIMIT injection guard is skipped, and add `effectiveLimit > 0` to the
truncation check so rows aren't sliced to zero
- Update all 12 drivers: postgres, clickhouse, snowflake, bigquery, mysql,
redshift, databricks, duckdb, oracle, sqlserver, sqlite, mongodb
- Pass `{ noLimit: true }` from `data-diff.ts` `executeQuery()`
Interactive SQL callers are unaffected — they continue to get the default
1000-row limit. Only the data-diff pipeline opts out.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
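The pattern this commit message describes can be condensed into two helpers. This is a minimal sketch of the described behavior as it might appear inside one driver's `execute()`; names mirror the commit message, not the actual driver source.

```typescript
// Sketch of the ExecuteOptions/noLimit pattern; illustrative only.
interface ExecuteOptions {
  noLimit?: boolean
}

function applyLimit(sql: string, limit: number | undefined, options?: ExecuteOptions) {
  // noLimit => effectiveLimit = 0 (falsy), so the LIMIT injection is skipped.
  const effectiveLimit = options?.noLimit ? 0 : (limit ?? 1000)
  let query = sql
  const isSelectLike = /^\s*(SELECT|WITH|VALUES)\b/i.test(sql)
  if (isSelectLike && effectiveLimit && !/\bLIMIT\b/i.test(sql)) {
    // Fetch one extra row so truncation can be detected afterwards.
    query = `${sql.replace(/;\s*$/, "")} LIMIT ${effectiveLimit + 1}`
  }
  return { query, effectiveLimit }
}

function postTruncate<T>(rows: T[], effectiveLimit: number) {
  // The effectiveLimit > 0 guard keeps noLimit results from being sliced to zero.
  const truncated = effectiveLimit > 0 && rows.length > effectiveLimit
  return { rows: truncated ? rows.slice(0, effectiveLimit) : rows, truncated }
}
```

With `{ noLimit: true }` the SQL passes through untouched and all rows are returned; interactive callers keep the default 1,000-row cap.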
…m exclusions with user

Column exclusion now has two layers:
1. Name-pattern matching (existing) — updated_at, created_at, _fivetran_synced, etc.
2. Schema-level default detection (new) — queries column_default for NOW(), CURRENT_TIMESTAMP, GETDATE(), SYSDATE, SYSTIMESTAMP, etc.

Covers PostgreSQL, MySQL, Snowflake, SQL Server, Oracle, ClickHouse, DuckDB, SQLite, and Redshift in a single round-trip (no extra query).

The skill prompt now instructs the agent to present detected auto-timestamp columns to the user and ask for confirmation before excluding them, since migrations should preserve timestamps while ETL replication regenerates them.
- `buildColumnDiscoverySQL`: escape single quotes in all interpolated table name parts to prevent SQL injection via crafted source/target names
- `dateTruncExpr`: add Oracle case (`TRUNC(col, 'UNIT')`) — Oracle does not have `DATE_TRUNC`; date-partitioned diffs on Oracle tables previously failed

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Apply esc() to Oracle and SQLite paths in buildColumnDiscoverySQL
(SQL injection via table name was unpatched in these dialects)
- Quote identifiers in resolveTableSources to prevent injection via
table names containing semicolons or special characters
- Surface SQL execution errors before feeding empty rows to the engine
(silent false "match" when warehouse is unreachable is now an error)
- Fix Oracle TRUNC() format model map: 'WEEK' → 'IW' (ISO week)
('WEEK' throws ORA-01800 on all Oracle versions)
- Quote partition column identifier in buildPartitionWhereClause
…r propagation, and test mock formats
- `altimate-core-column-lineage`: fix `[object Object]` in `column_dict` output when source entries are `{ source_table, source_column }` objects instead of strings
- `schema-inspect`: propagate `{ success: false, error }` dispatcher responses to `metadata.error` instead of silently returning empty schema
- `sql-analyze`: guard against null/undefined result from dispatcher to prevent "undefined" literal in output
- `lineage-check`: guard against null/undefined result from dispatcher to prevent "undefined" literal in output
- `simulation-suite.test.ts`: fix `sql-translate` mock format — data fields must be flat (not wrapped in `data: {}`), add `source_dialect`/`target_dialect` to mock so assertions pass
- `simulation-suite.test.ts`: fix `dbt-manifest` mock format — unwrap `data: {}` so `model_count` and `models` are accessible at top level
Simulation suite: 695/839 → 839/839 (100%)
… corruption
The @clickhouse/client package enables ERROR-level logging by default and writes
`[ERROR][@clickhouse/client][Connection]` lines directly to stderr on auth/query
failures. These raw writes corrupt the terminal TUI rendering.
Set `log: { level: 127 }` (ClickHouseLogLevel.OFF) when creating the client —
consistent with how Snowflake (`logLevel: 'OFF'`) and Databricks (no-op logger)
already suppress their SDK loggers for the same reason.
Force-pushed from 38daa6d to 2c58580
…ack script

- Validate table names before interpolating into DESCRIBE/SHOW COLUMNS for ClickHouse and Snowflake — reject names with non-alphanumeric characters to prevent SQL injection; also quote parts with dialect-appropriate delimiters
- Discover partition values from BOTH source and target tables and union the results — previously only source was queried, silently missing rows that existed only in target-side partitions
- Add script/pack-local.ts: mirrors publish.ts but stops before npm publish; injects local altimate-core tarballs from /tmp/altimate-local-dist/ for local end-to-end testing
Require that every diff result summary surfaces:
- Exact scope (tables + warehouses compared)
- Filters and time period applied (or explicitly states none)
- Key columns used and how they were confirmed
- Columns compared and excluded, with reasons (auto-timestamp, user request)
- Algorithm used

Includes an example full result summary and guidance for identical results — emphasising that bare numbers without context are meaningless to the user.
The partitioned diff returned `{ Match: { row_count: 0, algorithm: 'partitioned' } }`
when no partition values were found or all partitions failed. This format lacks
`mode: 'diff'`, so `formatOutcome` fell through to raw JSON.stringify instead
of producing clean output.
Use the standard Rust engine format:
`{ mode: 'diff', stats: {...}, diff_rows: [] }`
…y comparison modes
Claude Code Review
This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.
Pull request overview
Adds a new data-parity / data_diff workflow to the Altimate toolchain by introducing a TypeScript “orchestrator bridge” for the Rust DataParitySession state machine, extending driver execution to support unbounded result sets, and documenting a new /data-parity skill for correct algorithm selection.
Changes:
- Introduces the `data_diff` tool + native connection handler that runs the Rust diff session by executing emitted SQL tasks across warehouse connectors (including partitioned diffs and column auto-discovery).
- Extends the drivers' `Connector.execute()` API with `ExecuteOptions.noLimit` to bypass default LIMIT injection and truncation.
- Adds `/data-parity` skill documentation, updates prompts/registry, and adjusts simulation mocks to match updated dispatcher response shapes.
Reviewed changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| packages/opencode/test/altimate/simulation-suite.test.ts | Updates mocked dispatcher response shapes for dbt manifest + SQL translation scenarios. |
| packages/opencode/src/tool/registry.ts | Registers the new DataDiffTool. |
| packages/opencode/src/altimate/tools/data-diff.ts | Adds the data_diff LLM tool wrapper and output formatting. |
| packages/opencode/src/altimate/prompts/builder.txt | Documents and wires up /data-parity skill invocation triggers. |
| packages/opencode/src/altimate/native/types.ts | Adds bridge types for data.diff params/results and registers the method. |
| packages/opencode/src/altimate/native/connections/register.ts | Registers the native data.diff dispatcher method. |
| packages/opencode/src/altimate/native/connections/data-diff.ts | Implements the TypeScript orchestrator bridging Rust session steps to warehouse SQL execution, plus partition support and column discovery. |
| packages/drivers/src/types.ts | Adds ExecuteOptions and extends Connector.execute() signature to accept options. |
| packages/drivers/src/{postgres,redshift,snowflake,bigquery,mysql,sqlite,duckdb,databricks,sqlserver,oracle,clickhouse}.ts | Threads noLimit option through each driver and fixes truncation detection when limit is disabled. |
| .opencode/skills/data-parity/SKILL.md | Adds the data-parity skill guide and algorithm-selection procedure. |
Comments suppressed due to low confidence (1)
packages/drivers/src/clickhouse.ts:20
- PR description claims the ClickHouse driver is a “pure HTTP driver” with “no external package required”, but this implementation still depends on `@clickhouse/client` and throws an install error if it’s missing.
Either update the PR description to match the current dependency, or rework the driver to avoid the external client if that’s the intended change.
```ts
import type { ConnectionConfig, Connector, ConnectorResult, ExecuteOptions, SchemaColumn } from "./types"

export async function connect(config: ConnectionConfig): Promise<Connector> {
  let createClient: any
  try {
    const mod = await import("@clickhouse/client")
    createClient = mod.createClient ?? mod.default?.createClient
    if (!createClient) {
      throw new Error("createClient export not found in @clickhouse/client")
    }
  } catch {
    throw new Error("ClickHouse driver not installed. Run: npm install @clickhouse/client")
  }
```
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
packages/drivers/src/redshift.ts (1)
49-64: ⚠️ Potential issue | 🔴 Critical — Redshift driver breaks parameterized queries by ignoring bind parameters.

The `execute` method accepts `_binds` but never passes them to `client.query()`, unlike all other drivers (Snowflake uses `executeQuery(query, binds)`, DuckDB uses a conditional `queryWithParams(finalSql, binds)`). This breaks SQL injection protection and query correctness for all callers passing binds.

Rename the parameter to `binds` (remove the underscore) and pass it to the query execution:

🔧 Proposed fix

```diff
- async execute(sql: string, limit?: number, _binds?: any[], options?: ExecuteOptions): Promise<ConnectorResult> {
+ async execute(sql: string, limit?: number, binds?: any[], options?: ExecuteOptions): Promise<ConnectorResult> {
    const client = await pool.connect()
    try {
      const effectiveLimit = options?.noLimit ? 0 : (limit ?? 1000)
      let query = sql
      const isSelectLike = /^\s*(SELECT|WITH|VALUES)\b/i.test(sql)
      if (isSelectLike && effectiveLimit && !/\bLIMIT\b/i.test(sql)) {
        query = `${sql.replace(/;\s*$/, "")} LIMIT ${effectiveLimit + 1}`
      }
-     const result = await client.query(query)
+     const result =
+       binds && binds.length > 0
+         ? await client.query(query, binds)
+         : await client.query(query)
```

packages/drivers/src/mysql.ts (1)
44-56: ⚠️ Potential issue | 🟠 Major — MySQL `execute` accepts a `binds` parameter but never passes it to `pool.query()`.

The interface contract (line 31 of types.ts) requires `binds?: any[]` to be used. Line 44 accepts it as `_binds` (the underscore prefix indicates intentional ignoring), but line 56 calls `pool.query(query)` without them. Call sites like `schema/tags.ts:92` and `finops/*.ts` pass parameterized binds expecting them to work. Parameterized queries with placeholders will fail silently in this driver.

Snowflake shows the correct pattern: pass `binds` to `executeQuery(query, binds)`.

🔧 Proposed fix

```diff
- async execute(sql: string, limit?: number, _binds?: any[], options?: ExecuteOptions): Promise<ConnectorResult> {
+ async execute(sql: string, limit?: number, binds?: any[], options?: ExecuteOptions): Promise<ConnectorResult> {
    const effectiveLimit = options?.noLimit ? 0 : (limit ?? 1000)
    let query = sql
    const isSelectLike = /^\s*(SELECT|WITH|VALUES)\b/i.test(sql)
@@
-   const [rows, fields] = await pool.query(query)
+   const [rows, fields] =
+     binds && binds.length > 0
+       ? await pool.query(query, binds)
+       : await pool.query(query)
```

packages/drivers/src/sqlserver.ts (1)
49-61: ⚠️ Potential issue | 🟡 Minor — Include CTEs in the default row-cap detection.

Line 49 only matches `SELECT`, so `WITH ... SELECT` queries never get `TOP` injected and bypass the default 1,000-row safety cap. All other drivers in this codebase (postgres, oracle, redshift, mysql, duckdb, databricks, snowflake, bigquery) include `WITH` in their `isSelectLike` pattern.

💡 Minimal fix

```diff
- const isSelectLike = /^\s*SELECT\b/i.test(sql)
+ const isSelectLike = /^\s*(SELECT|WITH)\b/i.test(sql)
```
🧹 Nitpick comments (6)
packages/drivers/src/oracle.ts (1)
40-75: Extract the limit/truncation policy into a shared helper.

`effectiveLimit`, `truncated`, and the final row slice are now repeated across the driver `execute` implementations touched in this PR. Centralizing that policy will make future `noLimit` fixes much less likely to drift by dialect.

packages/opencode/src/altimate/native/types.ts (1)
967-983: Narrow `algorithm` to the documented literal union.

Since `data.diff` is a public bridge contract, keeping `algorithm` as a plain `string` defers validation to runtime. The implementation only handles the five documented values; a literal union would catch invalid names at the boundary.

Type refinement

```diff
+export type DataDiffAlgorithm =
+  | "auto"
+  | "joindiff"
+  | "hashdiff"
+  | "profile"
+  | "cascade"
+
 export interface DataDiffParams {
   /** Source table name (e.g. "orders", "db.schema.orders") or full SQL query */
   source: string
@@
   /** Extra columns to compare beyond the key */
   extra_columns?: string[]
   /** Algorithm: "auto" | "joindiff" | "hashdiff" | "profile" | "cascade" */
-  algorithm?: string
+  algorithm?: DataDiffAlgorithm
```

packages/opencode/src/altimate/native/connections/data-diff.ts (1)
390-393: Silent fallback to key-only comparison when discovery fails.

When `discoverExtraColumns` fails (e.g., on a schema query error), it returns `undefined` and the orchestrator silently falls back to key-only comparison. This could mask configuration issues or produce misleading "identical" results.

Consider logging a warning when discovery fails so users are aware.

.opencode/skills/data-parity/SKILL.md (2)
12-23: Add a language specifier to the fenced code block.

The TODO-list code block lacks a language specifier. While this is a minor linting issue, adding a language (e.g., `markdown` or `text`) improves rendering consistency.

📝 Suggested fix

````diff
-```
+```text
 Here's my plan:
 1. [ ] List available warehouse connections
````
165-174: Add language specifiers for the tool invocation examples.

Multiple code blocks showing `data_diff` invocations lack language specifiers (lines 165, 217, 233). Consider using `python` or `javascript` syntax for better readability.

packages/opencode/src/altimate/tools/data-diff.ts (1)
114-115: Prefer a typed outcome over the `as any` cast.

The `outcome` is cast to `any` and passed to helper functions that also treat it as `any`. Consider defining a discriminated union type matching the Rust serialization (`{ mode: "diff" | "profile" | "cascade", ... }`) for better type safety and maintainability.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/drivers/src/clickhouse.ts`:
- Around line 60-63: Replace the magic number 127 with the exported enum
constant by importing ClickHouseLogLevel from '@clickhouse/client' and setting
clientConfig.log = { level: ClickHouseLogLevel.OFF }; update the import list
that currently includes createClient and ExecuteOptions to also import
ClickHouseLogLevel so the createClient(clientConfig) call uses the named
constant instead of 127.
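A minimal sketch of that change as a config fragment, assuming the `@clickhouse/client` package (which exports `ClickHouseLogLevel`, whose `OFF` member corresponds to the magic 127); the URL and surrounding config shape are placeholders, not the repository's actual driver code:

```typescript
import { createClient, ClickHouseLogLevel } from "@clickhouse/client"

// ClickHouseLogLevel.OFF is the named constant behind the magic number 127.
const clientConfig = {
  url: "http://localhost:8123", // placeholder connection settings
  log: { level: ClickHouseLogLevel.OFF },
}

const client = createClient(clientConfig)
```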
In `@packages/opencode/src/altimate/native/connections/data-diff.ts`:
- Around line 458-480: The SELECT in buildPartitionDiscoverySQL interpolates
partitionColumn unquoted causing syntax/injection risk; update
buildPartitionDiscoverySQL to quote the partition column the same way
buildPartitionWhereClause does (use the project’s identifier-quoting helper or
the same quoting logic used by buildPartitionWhereClause) when constructing expr
for all modes (numeric, date, categorical) and when passing partitionColumn into
dateTruncExpr/other helpers so the generated SQL uses a properly quoted
identifier (_p AS derived from the quoted column) and avoids injection/syntax
errors while preserving existing partitionMode/dateTruncExpr behavior.
- Around line 512-523: The ClickHouse branch is using ANSI double-quote
identifier quoting (see the quotedCol usage and the switch case for "clickhouse"
in data-diff.ts) which breaks on reserved words; update the dialect-aware
quoting so that when dialect === "clickhouse" identifiers are wrapped with
backticks (`) rather than double-quotes (") and ensure the partition WHERE
expression generation (the "clickhouse" case returning `${expr} =
toDate('${partitionValue}')`) uses that backtick-quoted identifier; ideally
centralize this in the existing quotedCol helper or add a small helper that
returns backtick-quoted names for ClickHouse and preserves existing quoting for
other dialects.
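A sketch of such a helper, using the `quoteIdentForDialect` name that the follow-up commit mentions; the dialect keys and escaping rules here are assumptions for illustration, not the repository's implementation:

```typescript
// Dialect-aware identifier quoting: backticks for MySQL/ClickHouse,
// brackets for TSQL/Fabric, ANSI double quotes for everything else.
function quoteIdentForDialect(name: string, dialect: string): string {
  switch (dialect) {
    case "mysql":
    case "clickhouse":
      // Backticks, with embedded backticks doubled.
      return "`" + name.replace(/`/g, "``") + "`"
    case "tsql":
    case "fabric":
      // Square brackets, with embedded closing brackets doubled.
      return "[" + name.replace(/]/g, "]]") + "]"
    default:
      // ANSI double quotes, with embedded quotes doubled.
      return '"' + name.replace(/"/g, '""') + '"'
  }
}

// The ClickHouse partition predicate then uses the backtick-quoted column:
const clause = `${quoteIdentForDialect("order_date", "clickhouse")} = toDate('2024-01-01')`
```

Centralizing this in one helper keeps the partition-discovery SQL and the WHERE-clause generation consistent across dialects.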
- Around line 78-92: injectCte currently strips a possible RECURSIVE keyword
because the regex /^WITH\s+/i only matches "WITH " so when rebuilding you end up
with "WITH <ourDefs>,\nRECURSIVE ..." which is invalid; update the detection to
capture an optional RECURSIVE (e.g. use a regex like /^WITH(\s+RECURSIVE)?\s+/i)
so you can preserve that token, compute the remainder (afterWith) the same way,
and rebuild the prefix using the captured recursive group (e.g.
`WITH${recursive} ${ourDefs},\n${afterWith}`) while still stripping only the
leading "WITH…" from ctePrefix via ourDefs = ctePrefix.replace(/^WITH\s+/i, "").
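A minimal sketch of the fixed `injectCte`, assuming it receives the engine's SQL plus a `WITH`-prefixed block of our own definitions (names are illustrative):

```typescript
// Merge our CTE definitions into the engine's SQL, preserving an optional
// RECURSIVE keyword on the engine's own WITH clause.
function injectCte(sql: string, ctePrefix: string): string {
  // Our definitions, with the leading WITH stripped.
  const ourDefs = ctePrefix.replace(/^WITH\s+/i, "")
  // Capture an optional RECURSIVE so it survives the rebuild.
  const match = sql.match(/^WITH(\s+RECURSIVE)?\s+/i)
  if (!match) return `WITH ${ourDefs}\n${sql}`
  const recursive = match[1] ?? ""
  const afterWith = sql.slice(match[0].length)
  return `WITH${recursive} ${ourDefs},\n${afterWith}`
}
```

In dialects like Postgres, `RECURSIVE` applies to the whole `WITH` list, so hoisting it to the front of the merged clause keeps the query valid.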
- Around line 660-667: The partition WHERE clauses are being generated only for
sourceDialect which breaks target parsing when sourceDialect !== targetDialect;
update the partition loop that calls buildPartitionWhereClause (inside the
partitionValues iteration) to produce dialect-specific where clauses for both
sourceDialect and targetDialect (similar to how partition discovery does),
storing them separately and passing both appropriately into runDataDiff / the
DataParitySession spec so the source uses the source WHERE and the target uses
the target WHERE; ensure you reference the same params (partition_column,
partition_granularity, partition_bucket_size) when calling
buildPartitionWhereClause for each dialect.
In `@packages/opencode/src/altimate/tools/data-diff.ts`:
- Around line 82-87: The current ctx.ask call passes args.source and args.target
(table names/queries) to the "sql_execute_read" permission check, which doesn't
validate warehouse-level access; update the permission validation to explicitly
include warehouse identifiers so both source and target warehouses are
authorized — e.g., change the ctx.ask invocation in data-diff.ts (the call to
ctx.ask for "sql_execute_read") to include args.source_warehouse and
args.target_warehouse in the patterns or perform separate ctx.ask calls for each
warehouse, ensuring both warehouses are checked before executing queries.
---
Outside diff comments:
In `@packages/drivers/src/mysql.ts`:
- Around line 44-56: The execute function currently ignores the binds parameter
(_binds) and calls pool.query(query) without passing parameters, causing
parameterized queries to fail; update the execute signature to accept binds
(e.g., binds?: any[]) or rename _binds to binds and pass them into pool.query
(call pool.query(query, binds) or pool.query(query, binds ?? []) so existing
parameterized callers work), ensuring the LIMIT augmentation logic still
operates on the query string and does not drop or alter binds; key symbols:
execute(sql, limit?, _binds?, options?), pool.query(query).
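A sketch of the corrected shape, with a minimal stub standing in for mysql2's pool; only the binds plumbing and the limit augmentation are the point, and the signature is assumed from the review text rather than copied from the driver:

```typescript
type Row = Record<string, unknown>

// Minimal stand-in for mysql2's Pool; only the call shape matters here.
interface PoolLike {
  query(sql: string, binds?: unknown[]): Promise<Row[]>
}

async function execute(
  pool: PoolLike,
  sql: string,
  limit?: number,
  binds?: unknown[], // was `_binds`: now forwarded to the driver
  options?: { noLimit?: boolean },
): Promise<Row[]> {
  const effectiveLimit = options?.noLimit ? 0 : limit ?? 1000
  let query = sql
  // LIMIT augmentation still operates on the query string only.
  if (effectiveLimit > 0 && !/\bLIMIT\b/i.test(query)) {
    query = `${query} LIMIT ${effectiveLimit}`
  }
  // Pass binds through so parameterized (?) placeholders keep working.
  return pool.query(query, binds ?? [])
}
```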
In `@packages/drivers/src/redshift.ts`:
- Around line 49-64: The execute method in packages/drivers/src/redshift.ts
currently ignores the bind parameters by naming the param _binds and not passing
them to client.query; rename the method parameter from _binds to binds (and
update any internal references) and pass binds into the database call (i.e.,
client.query(query, binds)) so parameterized queries work like other drivers
(see Snowflake/DuckDB patterns) and preserve SQL injection protection and
correctness.
In `@packages/drivers/src/sqlserver.ts`:
- Around line 49-61: The current isSelectLike regex only matches queries
starting with SELECT so CTEs (WITH ... SELECT) bypass the TOP injection; update
the isSelectLike detection in sqlserver.ts to also match queries that begin with
WITH ... SELECT (i.e., treat queries with leading CTEs as "select-like"), and
adjust the TOP-insertion logic so it still injects TOP for the main SELECT
following the CTE (keep using effectiveLimit and the existing checks for
existing TOP/LIMIT). Target symbols: isSelectLike, effectiveLimit, and the query
replacement that inserts TOP after SELECT.
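One way to sketch the CTE-aware detection and TOP insertion; this simplified version tracks parenthesis depth to find the main SELECT and ignores string literals and comments, so it is illustrative rather than a drop-in fix:

```typescript
// Find the first SELECT at parenthesis depth 0 (the main SELECT after any
// leading CTEs) and inject TOP there. Skips queries that already carry
// TOP or LIMIT, matching the existing driver checks.
function injectTop(sql: string, limit: number): string {
  if (!/^\s*(SELECT|WITH)\b/i.test(sql)) return sql
  if (/\bTOP\b/i.test(sql) || /\bLIMIT\b/i.test(sql)) return sql
  const upper = sql.toUpperCase()
  let depth = 0
  for (let i = 0; i < sql.length; i++) {
    const ch = sql[i]
    if (ch === "(") depth++
    else if (ch === ")") depth--
    else if (
      depth === 0 &&
      upper.startsWith("SELECT", i) &&
      !/\w/.test(sql[i - 1] ?? " ")
    ) {
      return sql.slice(0, i + 6) + ` TOP ${limit}` + sql.slice(i + 6)
    }
  }
  return sql
}
```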
---
Nitpick comments:
In @.opencode/skills/data-parity/SKILL.md:
- Around line 12-23: The fenced TODO list code block lacks a language tag;
update the triple-backtick fence surrounding the plan (the multiline TODO block
starting with "Here's my plan:") to include a language specifier such as "text"
or "markdown" (e.g., ```text) so the block renders consistently; ensure you
change both opening and closing fences only, leaving the inner checklist lines
unchanged.
- Around line 165-174: The code examples invoking the data_diff helper are
missing fenced-code language specifiers; update each code block that shows
data_diff (the three examples invoking data_diff with
source/target/key_columns/source_warehouse/target_warehouse/algorithm) to use a
language tag like ```python (or ```javascript) so syntax highlighting and
readability improve for those code fences.
In `@packages/drivers/src/oracle.ts`:
- Around line 40-75: The limit/truncation logic repeated in execute (variables
effectiveLimit, truncated, and slicing to produce limitedRows) should be moved
into a shared helper (e.g., a util like computeLimitAndRows or applyLimitPolicy)
and invoked from Oracle's async execute method; refactor so execute only calls
the helper with (sqlLimit, options?.noLimit, rows) and uses the returned {
effectiveLimit, truncated, rows: limitedRows, row_count } to build the
ConnectorResult, keeping existing behavior for isSelectLike/ROWNUM wrapping
inside execute and preserving columns extraction via result.metaData.
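A sketch of such a helper (the `applyLimitPolicy` name comes from the review's suggestion; the returned field names mirror the described `ConnectorResult` usage but are otherwise assumptions):

```typescript
// Shared limit/truncation policy: noLimit disables the cap entirely,
// otherwise the explicit limit (or a 1,000-row default) applies.
function applyLimitPolicy<T>(
  rows: T[],
  sqlLimit: number | undefined,
  noLimit?: boolean,
) {
  const effectiveLimit = noLimit ? 0 : sqlLimit ?? 1000
  const truncated = effectiveLimit > 0 && rows.length > effectiveLimit
  const limited = truncated ? rows.slice(0, effectiveLimit) : rows
  return { effectiveLimit, truncated, rows: limited, row_count: limited.length }
}
```

Each driver's `execute` would then call this once and keep its dialect-specific concerns (ROWNUM wrapping, column extraction) local.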
In `@packages/opencode/src/altimate/native/connections/data-diff.ts`:
- Around line 390-393: The catch block in discoverExtraColumns swallows errors
and returns undefined, causing a silent fallback to key-only comparison; modify
the catch to capture the error (catch (err)) and emit a warning log before
returning undefined so users are informed (e.g., use the module's logger or
propagate to the orchestrator) and include the error message/context in the log;
ensure the change is made inside the discoverExtraColumns function where the
current `catch { return undefined }` resides.
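A minimal sketch of the non-silent fallback; the `fetchSchema` and `log` parameters are stand-ins for the real schema query and the module's logger:

```typescript
// Capture the error and warn before falling back, instead of silently
// downgrading to a key-only comparison.
async function discoverExtraColumns(
  fetchSchema: () => Promise<string[]>,
  log: (msg: string) => void,
): Promise<string[] | undefined> {
  try {
    return await fetchSchema()
  } catch (err) {
    const detail = err instanceof Error ? err.message : String(err)
    log(`extra-column discovery failed, falling back to key-only diff: ${detail}`)
    return undefined
  }
}
```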
In `@packages/opencode/src/altimate/native/types.ts`:
- Around line 967-983: The DataDiffParams interface currently types the
algorithm field as a plain string; narrow it to the documented literal union so
invalid values are caught at compile time. Update the DataDiffParams type (the
algorithm property) to be the union "auto" | "joindiff" | "hashdiff" | "profile"
| "cascade" (optionally keep it optional if intended), and adjust any call sites
or tests that pass non-literal strings to use one of those five literals; ensure
any type imports or references to DataDiffParams reflect the new union.
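A sketch of the narrowed type, plus a runtime guard that mirrors it; the five literals are taken from the review, while the other `DataDiffParams` fields shown are an illustrative subset:

```typescript
// The five documented values, expressed once and reused for both the
// compile-time union and a runtime guard.
const ALGORITHMS = ["auto", "joindiff", "hashdiff", "profile", "cascade"] as const
type DataDiffAlgorithm = (typeof ALGORITHMS)[number]

// Illustrative subset of DataDiffParams; only `algorithm` is the point here.
interface DataDiffParams {
  source: string
  target: string
  key_columns: string[]
  algorithm?: DataDiffAlgorithm
}

function isAlgorithm(v: string): v is DataDiffAlgorithm {
  return (ALGORITHMS as readonly string[]).includes(v)
}
```

With this in place, a call site passing `algorithm: "fast"` fails at compile time rather than at run time.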
In `@packages/opencode/src/altimate/tools/data-diff.ts`:
- Around line 114-115: The code uses an untyped cast `const outcome =
result.outcome as any` and then passes it into `formatOutcome(outcome,
args.source, args.target)`, losing type safety; define a discriminated union
(e.g., type Outcome = { mode: "diff"; ... } | { mode: "profile"; ... } | { mode:
"cascade"; ... }) that matches the Rust serialization, replace the `as any` cast
with `const outcome: Outcome = result.outcome`, and update `formatOutcome` (and
any helper functions it calls) to accept `Outcome` so callers can narrow on
`outcome.mode` instead of relying on `any`.
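A hedged sketch of what that union and a narrowing consumer could look like; only the `mode` discriminant comes from the review text, and the field names inside each variant are assumptions:

```typescript
// Discriminated union matching the described Rust serialization shape.
type Outcome =
  | { mode: "diff"; rows_added: number; rows_removed: number; rows_changed: number }
  | { mode: "profile"; column_stats: Record<string, unknown> }
  | { mode: "cascade"; tables_compared: number }

// With the union in place, callers narrow on `mode` instead of using `any`.
function summarize(outcome: Outcome): string {
  switch (outcome.mode) {
    case "diff":
      return `+${outcome.rows_added} -${outcome.rows_removed} ~${outcome.rows_changed}`
    case "profile":
      return `profiled ${Object.keys(outcome.column_stats).length} columns`
    case "cascade":
      return `compared ${outcome.tables_compared} tables`
  }
}
```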
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 4ae147d4-a6e5-4a2e-8d5c-cfa62bc9db0a
📒 Files selected for processing (20)
- .opencode/skills/data-parity/SKILL.md
- packages/drivers/src/bigquery.ts
- packages/drivers/src/clickhouse.ts
- packages/drivers/src/databricks.ts
- packages/drivers/src/duckdb.ts
- packages/drivers/src/mysql.ts
- packages/drivers/src/oracle.ts
- packages/drivers/src/postgres.ts
- packages/drivers/src/redshift.ts
- packages/drivers/src/snowflake.ts
- packages/drivers/src/sqlite.ts
- packages/drivers/src/sqlserver.ts
- packages/drivers/src/types.ts
- packages/opencode/src/altimate/native/connections/data-diff.ts
- packages/opencode/src/altimate/native/connections/register.ts
- packages/opencode/src/altimate/native/types.ts
- packages/opencode/src/altimate/prompts/builder.txt
- packages/opencode/src/altimate/tools/data-diff.ts
- packages/opencode/src/tool/registry.ts
- packages/opencode/test/altimate/simulation-suite.test.ts
suryaiyer95
left a comment
Multi-Model Code Review — PR #493 + PR #114
Verdict: REQUEST CHANGES
Critical Issues: 1 | Major Issues: 6 | Minor Issues: 4 | Nits: 3
Reviewed by 3 AI models: Claude, Gemini 3.1 Pro, Grok 4. 1 convergence round. Full review posted on companion PR altimate-core-internal #114.
This PR's TypeScript orchestrator is well-crafted — the noLimit driver flag, CTE injection, and column auto-discovery are clean implementations. The issues below are specific to this PR.
Issues in this PR
isQuery detection could misclassify table names (Minor — Logic Error)
- File: `packages/opencode/src/altimate/native/connections/data-diff.ts:23-26`
- Regex `/^\s*(SELECT|WITH|VALUES)\b/i` misclassifies table names like `with_metadata` or `select_results`.
- Fix: Document the limitation or require explicit signaling (e.g., a separate `sourceIsQuery` parameter).
- Flagged by: Claude
Implicit truthy check on effectiveLimit (Nit — Code Quality)
- File: `packages/drivers/src/*.ts` (all 12 driver files)
- The limit-appending condition checks `effectiveLimit && !/\bLIMIT\b/i.test(sql)`. Since `noLimit: true` sets `effectiveLimit` to `0` (falsy), this works but relies on implicit coercion.
- Fix: Change to `effectiveLimit > 0` for explicit intent.
- Flagged by: Gemini
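The explicit check could be factored as, for example (function name assumed for illustration):

```typescript
// Explicit-intent version of the shared driver condition: append a LIMIT
// only when a positive cap is set and the query doesn't already have one.
function shouldAppendLimit(sql: string, effectiveLimit: number): boolean {
  return effectiveLimit > 0 && !/\bLIMIT\b/i.test(sql)
}
```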
Positive Observations
- `noLimit` flag — Clean, consistent addition across all 10+ drivers with a backwards-compatible optional parameter
- Auto-timestamp exclusion — Two-layer detection (name patterns + schema-level DEFAULT analysis) is smart
- ClickHouse stderr silencing — `clientConfig.log = { level: 127 }` prevents raw ERROR output from corrupting the terminal TUI
- CTE injection — `injectCte` correctly handles merging CTE blocks when the engine emits its own `WITH` clauses
- Well-designed SKILL.md — Critical sections on algorithm selection, cost-aware interaction, and the joindiff cross-DB trap
Missing Tests
- No test for `isQuery` edge cases (table names like `with_metadata`)
- No test for CTE injection edge cases (nested/`RECURSIVE` CTEs)
- No partitioned diff simulation tests end-to-end
- No test for `discoverExtraColumns` with auto-timestamp default detection
See full review with all 14 findings and attribution table on altimate-core-internal PR #114.
Follow-up: Review Issue Status (post-fixes)
All major issues from the review have been addressed. Here's the verified status:
FIXED
STILL OPEN (minor)
…ng, query+partition guard
- Oracle day granularity: 'DDD' (day-of-year) → 'DD' (day-of-month)
- Add `quoteIdentForDialect()` helper: MySQL/ClickHouse use backticks, TSQL/Fabric use brackets, others use ANSI double-quotes
- `buildPartitionDiscoverySQL` and `buildPartitionWhereClause` now use dialect-aware quoting instead of hardcoded double-quotes
- `runPartitionedDiff` rejects SQL queries as source/target with a clear error — partitioning requires table names to discover column values
🧹 Nitpick comments (1)
packages/opencode/src/altimate/native/connections/data-diff.ts (1)
766-773: Consider extracting `resolveDialect` as a module-level helper. This function is duplicated at lines 623-630. Extract it once and reuse in both `runPartitionedDiff` and `runDataDiff`.

♻️ Proposed extraction

Add at module level (e.g., around line 400):

```typescript
function resolveDialect(warehouse: string | undefined): string {
  if (warehouse) {
    const cfg = Registry.getConfig(warehouse)
    return cfg?.type ?? "generic"
  }
  const warehouses = Registry.list().warehouses
  return warehouses[0]?.type ?? "generic"
}
```

Then remove the local definitions in both functions and use the shared helper.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/opencode/src/altimate/native/connections/data-diff.ts` around lines 766 - 773, Extract the duplicated resolveDialect logic into a single module-level helper named resolveDialect and replace the local copies inside runPartitionedDiff and runDataDiff with calls to that helper; specifically, remove the duplicate function definitions in both functions, add one shared resolveDialect(warehouse: string | undefined): string at module scope (copying the existing logic that uses Registry.getConfig and Registry.list().warehouses) and update runPartitionedDiff and runDataDiff to call resolveDialect(warehouse) instead of their local implementations.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@packages/opencode/src/altimate/native/connections/data-diff.ts`:
- Around line 766-773: Extract the duplicated resolveDialect logic into a single
module-level helper named resolveDialect and replace the local copies inside
runPartitionedDiff and runDataDiff with calls to that helper; specifically,
remove the duplicate function definitions in both functions, add one shared
resolveDialect(warehouse: string | undefined): string at module scope (copying
the existing logic that uses Registry.getConfig and Registry.list().warehouses)
and update runPartitionedDiff and runDataDiff to call resolveDialect(warehouse)
instead of their local implementations.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: d342a309-c72e-4721-8c31-8272ec693ffa
📒 Files selected for processing (1)
packages/opencode/src/altimate/native/connections/data-diff.ts
Summary
Adds the full data-parity diffing pipeline for the `data_diff` LLM tool: a TypeScript orchestrator that bridges the Rust `DataParitySession` state machine with all database drivers, plus a comprehensive SKILL.md that guides the model through algorithm selection.
Core: TypeScript orchestrator (`data-diff.ts`, +799 lines)
Cooperative state machine bridge — TypeScript executes SQL tasks generated by the Rust engine and feeds results back. Neither layer touches the other's domain.
New: ClickHouse driver (`clickhouse.ts`, +135 lines)
Pure HTTP driver using `JSONCompactEachRowWithNamesAndTypes` — no external package required. Handles DDL (empty response) vs SELECT (typed rows) branches correctly.
`ExecuteOptions` — all 10+ drivers updated
New `noLimit?: boolean` flag bypasses the default 1,000-row cap for data-diff pipelines. Applied consistently across Postgres, Snowflake, BigQuery, Redshift, MySQL, SQLite, DuckDB, Trino, Databricks, ClickHouse.
SKILL.md (+341 lines)
Comprehensive algorithm-selection guide:
Fixes applied (from code review)
Test plan
Summary by CodeRabbit
New Features
Documentation
Other