Skip to content

feat: add pgsql-parse package with comment and whitespace preservation#290

Merged
pyramation merged 11 commits intomainfrom
devin/1775205353-pgsql-parse-package
Apr 9, 2026
Merged

feat: add pgsql-parse package with comment and whitespace preservation#290
pyramation merged 11 commits intomainfrom
devin/1775205353-pgsql-parse-package

Conversation

@pyramation
Copy link
Copy Markdown
Collaborator

@pyramation pyramation commented Apr 3, 2026

Summary

Adds a new self-contained packages/parse/ package that preserves SQL -- line comments and vertical whitespace (blank lines) through parse→deparse round trips. No existing packages are modified.

How it works:

  1. The scanner (scanner.ts) uses PostgreSQL's real lexer via @libpg-query/parser's scanSync() to extract SQL_COMMENT (275) tokens with exact byte positions. Whitespace detection uses token gaps to find blank lines between statements/comments. All string literal types (single-quoted, dollar-quoted, escape strings, etc.) are handled correctly by the actual PostgreSQL scanner — no custom TypeScript reimplementation.
  2. Enhanced parse/parseSync call the standard parser, then interleave synthetic RawComment and RawWhitespace nodes into the stmts array based on byte position.
  3. deparseEnhanced() dispatches on node type — real RawStmt entries go through the standard deparser, while synthetic nodes emit their comment text or blank lines directly. Trailing comments (on the same line as a statement) are appended to the previous line rather than emitted on a new line.

Key design decisions:

  • interleave() uses a unified sort with priority levels (comment < whitespace < statement) to handle ties when stmt_location overlaps with preceding comments
  • findActualSqlStart() iteratively skips whitespace and scanned elements within a statement's stmt_location range to find the actual SQL keyword position — needed because PostgreSQL's parser includes preceding stripped content in stmt_location
  • Only -- line comments are supported (not /* */ block comments). This was a deliberate decision — block comments are not used in our PostgreSQL workflow.
  • Trailing comments (e.g., SELECT 2; -- note) are detected by checking for the absence of a newline between the preceding token and the comment. The trailing flag flows from the scanner through RawComment to the deparser.
  • Mid-statement comments (e.g., SELECT id, -- pick the ID\n name FROM users) are hoisted above their enclosing statement. The deparser cannot inject comments back into the middle of a deparsed statement, so they are silently repositioned as standalone lines before the statement.
  • 68 tests across scanner (15), integration (13), fixture round-trip (32), and snapshot (8) suites

Updates since last revision

  • Mid-statement comment hoisting: Comments whose byte position falls within a statement's byte range (between actualStart and end) are now hoisted above the enclosing statement with priority: 0 (before the statement). The trailing flag is cleared for hoisted comments since they become standalone lines. New StmtRange interface and buildStmtRanges() helper compute statement byte ranges once per parse. New fixture mid-statement-comments.sql documents this behavior with 4 test cases (simple mid-statement, multiple mid-statement, INSERT values, between-clauses). 68 total tests (15 scanner + 13 integration + 32 round-trip + 8 snapshot).

Previous updates (still apply):

  • Fixed trailing/inline comment repositioning: SELECT 2; -- trailing note now correctly stays on one line instead of being split across two. The scanner marks comments as trailing when no newline separates them from the preceding token, and the deparser appends them to the previous line. Snapshot updated accordingly.
  • Added Jest snapshot assertions: Each fixture's deparseEnhanced() output is now captured in __tests__/__snapshots__/roundtrip.test.ts.snap (8 snapshots). Any future deparser change will surface as a snapshot diff.
  • Added fixture-based round-trip CST tests: 8 SQL fixture files in __tests__/fixtures/ covering PGPM headers, multi-statement schemas, grants/RLS policies, PL/pgSQL functions, views/triggers, ALTER/DROP, edge cases, and mid-statement comments. Each fixture runs assertions for: snapshot match, idempotency (parse→deparse→parse→deparse produces identical output), comment preservation, statement survival, and CST node ordering.
  • Removed safeScanSync() workaround entirely: The upstream JSON serialization bug in build_scan_json() has been fixed in @libpg-query/parser@17.6.10 (see libpg-query-node PR #147). The scanner now calls scanSync() directly — no fixScanJson(), no JSON.parse monkey-patching, no retry logic. The dependency has been bumped from ^17.6.3 to ^17.6.10.
  • Removed all block comment (/* */) support: RawComment.type is now 'line' only.
  • Simplified loadModule(): Now a simple re-export from @libpg-query/parser.
  • Simplified deparseComment(): Always emits --{text}.
  • No lockfile/workspace config changes: The only substantive lockfile change is the @libpg-query/parser dependency for packages/parse. The rest of the lockfile diff is quoting style changes from pnpm v10.

Review & Testing Checklist for Human

  • CI does not run this package's tests: The CI test matrix (run-tests.yaml) does not include pgsql-parse. All 68 Jest tests only run locally. Consider adding pgsql-parse to the matrix before merging.
  • README.md is stale: The README still references /* */ block comments and "A pure TypeScript scanner". Should be updated to reflect that only -- line comments are supported and the scanner uses the WASM lexer via scanSync().
  • Mid-statement hoisting boundary conditions (parse.ts): buildStmtRanges() uses stmt_len ?? sql.length - loc for the last statement. Verify hoisting is correct for: (a) a comment between two statements that falls exactly at the boundary, (b) a file where the last statement has no stmt_len, (c) a comment that falls between stmt_location and actualStart (should be handled by the existing pre-statement logic, NOT hoisted).
  • findActualSqlStart() correctness (parse.ts:28-59): This function walks forward from stmt_location skipping whitespace and scanned elements. Verify it handles: multiple adjacent comments before a statement, a comment immediately followed by a statement with no whitespace, and the first statement at position 0.
  • Trailing comment detection edge cases (scanner.ts:92-95): The trailing flag is set when prevEnd > 0 && !gapBeforeComment.includes('\n'). Verify this is correct for: a comment immediately after a semicolon with no space, multiple trailing comments on the same line, and a comment on the very first line of input (should NOT be trailing since prevEnd === 0).

Suggested test plan: Clone the branch, run cd packages/parse && npx jest to verify 68/68 pass. Then try parsing your own SQL files with -- comments through parseSyncdeparseEnhanced and inspect the output, especially: (1) files with trailing comments like SELECT 1; -- note, (2) files with mid-statement comments like SELECT id, -- note\n name FROM users (should hoist above), (3) PL/pgSQL function bodies with comments inside dollar-quoted strings (should NOT be extracted), and (4) run npx jest --verbose to see all snapshot assertions pass.

Notes

  • The extractTopLevelComments() helper in roundtrip.test.ts has its own dollar-quote tracking logic (separate from the WASM scanner) for determining which -- comments in the fixture source are top-level. This is test-only code, but divergence from the scanner's behavior could cause false passes or failures.
  • The package depends on workspace packages (pgsql-parser, pgsql-deparser, @pgsql/types) via workspace:* protocol. tsconfig.test.json has path mappings so tests resolve TypeScript source directly without requiring a build step.
  • The pnpm-lock.yaml diff is large but mostly quoting style changes ("@scope/pkg"'@scope/pkg'). The only substantive change is the @libpg-query/parser dependency for packages/parse.
  • Comments inside PL/pgSQL function bodies (within $$...$$) are correctly preserved as opaque text by the SQL-level scanner (the WASM lexer sees them as part of the dollar-quoted string token). However, if the function body is parsed and deparsed through plpgsql-parser/plpgsql-deparser, those internal comments will be stripped — this is a known limitation to address in a follow-up plpgsql-parse package.
  • Mid-statement comment hoisting is silent — no flag or marker is set on hoisted comments. The mid-statement-comments.sql fixture documents the behavior for future developers.

Link to Devin session: https://app.devin.ai/sessions/67facbcfe0ae424bad3eafb4e6ca9059
Requested by: @pyramation

New package that preserves SQL comments and vertical whitespace through
parse→deparse round trips by scanning source text for comment tokens
and interleaving synthetic RawComment and RawWhitespace AST nodes into
the stmts array by byte position.

Features:
- Pure TypeScript scanner for -- line and /* block */ comments
- Handles string literals, dollar-quoted strings, escape strings
- RawWhitespace nodes for blank lines between statements
- Enhanced deparseEnhanced() that emits comments and whitespace
- Idempotent: parse→deparse→parse→deparse produces identical output
- Drop-in replacement API (re-exports parse, deparse, loadModule)
- 36 tests across scanner and integration test suites

No changes to any existing packages.
@devin-ai-integration
Copy link
Copy Markdown
Contributor

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

…query

Replace the custom TypeScript comment scanner with PostgreSQL's real
lexer via libpg-query's WASM _wasm_scan function. This eliminates the
risk of bugs from reimplementing PostgreSQL's lexer in TypeScript.

Key changes:
- scanner.ts now loads the WASM module directly and calls _wasm_scan
  with proper JSON escaping (works around upstream bug where control
  characters in token text are not escaped)
- Dependency changed from libpg-query to @libpg-query/parser (full
  build with scan support)
- Unified loadModule() initializes both parse/deparse and scanner WASM
- All 36 tests passing including multi-line block comments and
  dollar-quoted string handling
- Remove all block comment (/* */) handling — only -- line comments supported
- Simplify scanner.ts to use @libpg-query/parser scanSync directly
- Add pnpm patch for upstream scanSync JSON serialization bug (control chars in token text)
- Update types.ts: RawComment.type is now just 'line'
- Update deparse.ts: remove block comment case
- Update index.ts: re-export loadModule from @libpg-query/parser directly
- Remove block comment tests from scanner.test.ts and parse.test.ts
- All 28 tests passing
Instead of patching @libpg-query/parser via pnpm patch (which caused
CI issues with pnpm v9/v10 lockfile incompatibility), handle the
upstream JSON serialization bug inline in scanner.ts.

The approach: try scanSync normally, and if it throws due to unescaped
control characters in the JSON output, retry with a temporarily
monkey-patched JSON.parse that escapes control chars before parsing.
This is synchronous so there are no concurrency concerns.

All 28 tests pass. No changes to lockfile format or workspace config.
@libpg-query/parser@17.6.10 fixes the JSON escaping bug in
build_scan_json() so the workaround is no longer needed.
7 SQL fixture files covering:
- PGPM headers with deploy/requires comments
- Multi-statement schema setup (CREATE TABLE, INSERT)
- RLS policies and GRANT statements
- PL/pgSQL functions with dollar-quoted bodies
- Views and triggers
- ALTER/DROP statements
- Edge cases (trailing comments, adjacent comments, dollar-quoted internals)

Each fixture verifies:
1. parse→deparse→parse→deparse idempotency (CST round trip)
2. All top-level -- comments preserved
3. All SQL statements survive
4. CST node ordering matches source order

56 total tests (28 existing + 28 new).
Each fixture's deparseEnhanced() output is now snapshotted so any
future deparser changes show up as snapshot diffs.

63 total tests (15 scanner + 13 integration + 35 fixture round-trip),
7 snapshots.
Comments like 'SELECT 2; -- trailing note' now stay on one line
instead of being split across two lines. The scanner detects when
a comment has no newline between it and the preceding token, marks
it as trailing, and the deparser appends it to the previous line.
Comments like 'SELECT id, -- note' that fall inside a statement's
byte range are detected and repositioned above the statement rather
than trailing at the end. New fixture mid-statement-comments.sql
documents this behavior with snapshots.

68 total tests, 8 snapshots.
@pyramation pyramation merged commit babc2c0 into main Apr 9, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant