feat: add pgsql-parse package with comment and whitespace preservation #290
Merged
pyramation merged 11 commits into main on Apr 9, 2026
New package that preserves SQL comments and vertical whitespace through parse→deparse round trips by scanning source text for comment tokens and interleaving synthetic RawComment and RawWhitespace AST nodes into the stmts array by byte position.

Features:
- Pure TypeScript scanner for -- line and /* block */ comments
- Handles string literals, dollar-quoted strings, escape strings
- RawWhitespace nodes for blank lines between statements
- Enhanced deparseEnhanced() that emits comments and whitespace
- Idempotent: parse→deparse→parse→deparse produces identical output
- Drop-in replacement API (re-exports parse, deparse, loadModule)
- 36 tests across scanner and integration test suites

No changes to any existing packages.
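As a rough illustration of interleaving by byte position, the sketch below merges parsed statements with synthetic nodes and sorts them by offset. The node shape (kind, location, text) and the name interleaveSketch are hypothetical and not the package's actual types; the tie-break order (comment, then whitespace, then statement) follows the priority levels described in the PR summary.

```typescript
// Hypothetical node shape for illustration; the real package's types may differ.
interface SketchNode {
  kind: "comment" | "whitespace" | "stmt";
  location: number; // offset of the node in the source text
  text: string;
}

// Lower number sorts first when two nodes share an offset (assumed tie-break).
const SKETCH_PRIORITY: Record<SketchNode["kind"], number> = {
  comment: 0,
  whitespace: 1,
  stmt: 2,
};

function interleaveSketch(stmts: SketchNode[], synthetic: SketchNode[]): SketchNode[] {
  // concat() returns a new array, so neither input is mutated by sort().
  return stmts.concat(synthetic).sort(
    (a, b) =>
      a.location - b.location ||
      SKETCH_PRIORITY[a.kind] - SKETCH_PRIORITY[b.kind]
  );
}
```

A single sort over one combined array keeps the ordering rule in one place, which is easier to reason about than merging three pre-sorted lists by hand.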
…query

Replace the custom TypeScript comment scanner with PostgreSQL's real lexer via libpg-query's WASM _wasm_scan function. This eliminates the risk of bugs from reimplementing PostgreSQL's lexer in TypeScript.

Key changes:
- scanner.ts now loads the WASM module directly and calls _wasm_scan with proper JSON escaping (works around upstream bug where control characters in token text are not escaped)
- Dependency changed from libpg-query to @libpg-query/parser (full build with scan support)
- Unified loadModule() initializes both parse/deparse and scanner WASM
- All 36 tests passing including multi-line block comments and dollar-quoted string handling
- Remove all block comment (/* */) handling; only -- line comments supported
- Simplify scanner.ts to use @libpg-query/parser scanSync directly
- Add pnpm patch for upstream scanSync JSON serialization bug (control chars in token text)
- Update types.ts: RawComment.type is now just 'line'
- Update deparse.ts: remove block comment case
- Update index.ts: re-export loadModule from @libpg-query/parser directly
- Remove block comment tests from scanner.test.ts and parse.test.ts
- All 28 tests passing
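With block comments gone, the synthetic node types and the deparser's dispatch can stay small. The sketch below shows one way this could look. The kind discriminator, the lines field, and deparseSketch are hypothetical names, and the statement deparser is passed in by the caller rather than taken from any real package.

```typescript
// Hypothetical shapes; only `type: "line"` on the comment node is stated in the PR.
interface RawCommentSketch {
  kind: "comment";
  type: "line";
  text: string; // comment body without the leading "--"
}
interface RawWhitespaceSketch {
  kind: "whitespace";
  lines: number; // number of blank lines to emit
}
interface RawStmtSketch {
  kind: "stmt";
  sql: string;
}
type EnhancedNodeSketch = RawCommentSketch | RawWhitespaceSketch | RawStmtSketch;

function deparseSketch(
  nodes: EnhancedNodeSketch[],
  deparseStmt: (stmt: RawStmtSketch) => string
): string {
  const out: string[] = [];
  for (const node of nodes) {
    switch (node.kind) {
      case "comment":
        out.push("--" + node.text); // line comments only
        break;
      case "whitespace":
        for (let i = 0; i < node.lines; i++) out.push("");
        break;
      case "stmt":
        out.push(deparseStmt(node)); // delegate real statements
        break;
    }
  }
  return out.join("\n");
}
```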
Instead of patching @libpg-query/parser via pnpm patch (which caused CI issues with pnpm v9/v10 lockfile incompatibility), handle the upstream JSON serialization bug inline in scanner.ts. The approach: try scanSync normally, and if it throws due to unescaped control characters in the JSON output, retry with a temporarily monkey-patched JSON.parse that escapes control chars before parsing. This is synchronous so there are no concurrency concerns. All 28 tests pass. No changes to lockfile format or workspace config.
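The escaping step of such a workaround can be sketched on its own. This is not the code from the PR, which temporarily patched JSON.parse around scanSync; it is a standalone variant, with hypothetical function names, that leaves the global untouched and escapes raw control characters only inside JSON string literals.

```typescript
// Escape raw control characters (U+0000 to U+001F) found inside JSON string
// literals. Characters outside strings (structural whitespace) are left alone.
function escapeControlCharsInStrings(json: string): string {
  let out = "";
  let inString = false;
  let escaped = false;
  for (let i = 0; i < json.length; i++) {
    const ch = json.charAt(i);
    const code = json.charCodeAt(i);
    if (!inString) {
      if (ch === '"') inString = true;
      out += ch;
    } else if (escaped) {
      escaped = false; // character after a backslash is copied verbatim
      out += ch;
    } else if (ch === "\\") {
      escaped = true;
      out += ch;
    } else if (ch === '"') {
      inString = false;
      out += ch;
    } else if (code < 0x20) {
      out += "\\u" + ("0000" + code.toString(16)).slice(-4);
    } else {
      out += ch;
    }
  }
  return out;
}

// Try a strict parse first; fall back to the escaped text only on failure.
function parseLenientSketch<T>(json: string): T {
  try {
    return JSON.parse(json) as T;
  } catch (_err) {
    return JSON.parse(escapeControlCharsInStrings(json)) as T;
  }
}
```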
@libpg-query/parser@17.6.10 fixes the JSON escaping bug in build_scan_json() so the workaround is no longer needed.
7 SQL fixture files covering:
- PGPM headers with deploy/requires comments
- Multi-statement schema setup (CREATE TABLE, INSERT)
- RLS policies and GRANT statements
- PL/pgSQL functions with dollar-quoted bodies
- Views and triggers
- ALTER/DROP statements
- Edge cases (trailing comments, adjacent comments, dollar-quoted internals)

Each fixture verifies:
1. parse→deparse→parse→deparse idempotency (CST round trip)
2. All top-level -- comments preserved
3. All SQL statements survive
4. CST node ordering matches source order

56 total tests (28 existing + 28 new).
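The idempotency check in item 1 amounts to comparing the output of one round trip with the output of a second. A minimal sketch, assuming only that some parse and deparse pair is available: the helper name roundTripIsStable and the toy splitter are hypothetical and stand in for the package's real functions so the example runs on its own.

```typescript
// Generic check: deparse(parse(x)) must be a fixed point of the round trip.
function roundTripIsStable<T>(
  sql: string,
  parse: (source: string) => T,
  deparse: (tree: T) => string
): boolean {
  const once = deparse(parse(sql));
  const twice = deparse(parse(once));
  return once === twice;
}

// Toy stand-ins (not the real parser): split on ";" and re-join one per line.
function toyParse(source: string): string[] {
  return source
    .split(";")
    .map((part) => part.trim())
    .filter((part) => part.length > 0);
}

function toyDeparse(parts: string[]): string {
  return parts.map((part) => part + ";").join("\n");
}

const stable = roundTripIsStable("SELECT 1;   SELECT 2;", toyParse, toyDeparse);
```

Comparing the first and second outputs, rather than the output with the original input, lets the deparser normalize formatting once without failing the check.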
Each fixture's deparseEnhanced() output is now snapshotted so any future deparser changes show up as snapshot diffs. 63 total tests (15 scanner + 13 integration + 35 fixture round-trip), 7 snapshots.
Comments like 'SELECT 2; -- trailing note' now stay on one line instead of being split across two lines. The scanner detects when a comment has no newline between it and the preceding token, marks it as trailing, and the deparser appends it to the previous line.
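The detection rule and the corresponding emit step can be sketched as two small functions. Both names are hypothetical; prevEnd is assumed to be the offset where the preceding token ends, with 0 meaning nothing precedes the comment.

```typescript
// A comment is trailing when something precedes it and no newline sits
// between the end of the previous token and the start of the comment.
function isTrailingSketch(sql: string, prevEnd: number, commentStart: number): boolean {
  const gap = sql.slice(prevEnd, commentStart);
  return prevEnd > 0 && gap.indexOf("\n") === -1;
}

// Append a trailing comment to the last emitted line; otherwise start a new line.
// Returns a new array and leaves the input untouched.
function emitCommentSketch(lines: string[], text: string, trailing: boolean): string[] {
  const out = lines.slice();
  const last = out.pop();
  if (last === undefined) {
    return ["--" + text]; // nothing emitted yet, so the comment stands alone
  }
  if (trailing) {
    out.push(last + " --" + text);
  } else {
    out.push(last, "--" + text);
  }
  return out;
}
```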
Comments like 'SELECT id, -- note' that fall inside a statement's byte range are detected and repositioned above the statement rather than trailing at the end. New fixture mid-statement-comments.sql documents this behavior with snapshots. 68 total tests, 8 snapshots.
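One way to picture the repositioning: any comment whose offset lies strictly inside a statement's range is moved to the statement's start and loses its trailing flag. The interfaces and the hoistSketch function below are hypothetical and only illustrate the idea; the actual implementation may track ranges and ordering differently.

```typescript
// Hypothetical shapes for illustration.
interface StmtRangeSketch {
  actualStart: number; // offset of the statement's first keyword
  end: number; // offset just past the statement
}
interface ScannedCommentSketch {
  start: number;
  text: string;
  trailing: boolean;
}

function hoistSketch(
  comments: ScannedCommentSketch[],
  ranges: StmtRangeSketch[]
): ScannedCommentSketch[] {
  return comments.map((comment) => {
    for (const range of ranges) {
      if (comment.start > range.actualStart && comment.start < range.end) {
        // Move the comment to the statement's start; it becomes a standalone line.
        return { start: range.actualStart, text: comment.text, trailing: false };
      }
    }
    return comment; // outside every statement: leave as scanned
  });
}
```

After hoisting, the comment shares an offset with its statement, so a tie-break that sorts comments before statements places it on the line above.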
Summary
Adds a new self-contained packages/parse/ package that preserves SQL -- line comments and vertical whitespace (blank lines) through parse→deparse round trips. No existing packages are modified.

How it works:
- The scanner (scanner.ts) uses PostgreSQL's real lexer via @libpg-query/parser's scanSync() to extract SQL_COMMENT (275) tokens with exact byte positions. Whitespace detection uses token gaps to find blank lines between statements/comments. All string literal types (single-quoted, dollar-quoted, escape strings, etc.) are handled correctly by the actual PostgreSQL scanner, with no custom TypeScript reimplementation.
- parse/parseSync call the standard parser, then interleave synthetic RawComment and RawWhitespace nodes into the stmts array based on byte position.
- deparseEnhanced() dispatches on node type: real RawStmt entries go through the standard deparser, while synthetic nodes emit their comment text or blank lines directly. Trailing comments (on the same line as a statement) are appended to the previous line rather than emitted on a new line.

Key design decisions:
- interleave() uses a unified sort with priority levels (comment < whitespace < statement) to handle ties when stmt_location overlaps with preceding comments.
- findActualSqlStart() iteratively skips whitespace and scanned elements within a statement's stmt_location range to find the actual SQL keyword position. This is needed because PostgreSQL's parser includes preceding stripped content in stmt_location.
- Only -- line comments are supported (not /* */ block comments). This was a deliberate decision: block comments are not used in our PostgreSQL workflow.
- Trailing comments (SELECT 2; -- note) are detected by checking for the absence of a newline between the preceding token and the comment. The trailing flag flows from the scanner through RawComment to the deparser.
- Mid-statement comments (SELECT id, -- pick the ID\n name FROM users) are hoisted above their enclosing statement. The deparser cannot inject comments back into the middle of a deparsed statement, so they are silently repositioned as standalone lines before the statement.

Updates since last revision
- Mid-statement comments (those positioned between a statement's actualStart and end) are now hoisted above the enclosing statement with priority: 0 (before the statement). The trailing flag is cleared for hoisted comments since they become standalone lines. New StmtRange interface and buildStmtRanges() helper compute statement byte ranges once per parse. New fixture mid-statement-comments.sql documents this behavior with 4 test cases (simple mid-statement, multiple mid-statement, INSERT values, between-clauses). 68 total tests (15 scanner + 13 integration + 32 round-trip + 8 snapshot).

Previous updates (still apply):
- Trailing comments: SELECT 2; -- trailing note now correctly stays on one line instead of being split across two. The scanner marks comments as trailing when no newline separates them from the preceding token, and the deparser appends them to the previous line. Snapshot updated accordingly.
- Each fixture's deparseEnhanced() output is now captured in __tests__/__snapshots__/roundtrip.test.ts.snap (8 snapshots). Any future deparser change will surface as a snapshot diff.
- SQL fixtures in __tests__/fixtures/ cover PGPM headers, multi-statement schemas, grants/RLS policies, PL/pgSQL functions, views/triggers, ALTER/DROP, edge cases, and mid-statement comments. Each fixture runs assertions for: snapshot match, idempotency (parse→deparse→parse→deparse produces identical output), comment preservation, statement survival, and CST node ordering.
- Removed the safeScanSync() workaround entirely: the upstream JSON serialization bug in build_scan_json() has been fixed in @libpg-query/parser@17.6.10 (see libpg-query-node PR #147). The scanner now calls scanSync() directly, with no fixScanJson(), no JSON.parse monkey-patching, and no retry logic. The dependency has been bumped from ^17.6.3 to ^17.6.10.
- Removed block comment (/* */) support: RawComment.type is now 'line' only.
- loadModule(): now a simple re-export from @libpg-query/parser.
- deparseComment(): always emits --{text}.
- The only substantive lockfile change is the @libpg-query/parser dependency for packages/parse. The rest of the lockfile diff is quoting style changes from pnpm v10.

Review & Testing Checklist for Human
- The CI workflow (run-tests.yaml) does not include pgsql-parse. All 68 Jest tests only run locally. Consider adding pgsql-parse to the matrix before merging.
- References to /* */ block comments and "A pure TypeScript scanner" should be updated to reflect that only -- line comments are supported and the scanner uses the WASM lexer via scanSync().
- Hoisting edge cases (parse.ts): buildStmtRanges() uses stmt_len ?? sql.length - loc for the last statement. Verify hoisting is correct for: (a) a comment between two statements that falls exactly at the boundary, (b) a file where the last statement has no stmt_len, (c) a comment that falls between stmt_location and actualStart (should be handled by the existing pre-statement logic, NOT hoisted).
- findActualSqlStart() correctness (parse.ts:28-59): This function walks forward from stmt_location skipping whitespace and scanned elements. Verify it handles: multiple adjacent comments before a statement, a comment immediately followed by a statement with no whitespace, and the first statement at position 0.
- Trailing-comment detection (scanner.ts:92-95): The trailing flag is set when prevEnd > 0 && !gapBeforeComment.includes('\n'). Verify this is correct for: a comment immediately after a semicolon with no space, multiple trailing comments on the same line, and a comment on the very first line of input (should NOT be trailing since prevEnd === 0).

Suggested test plan: Clone the branch, run cd packages/parse && npx jest to verify 68/68 pass. Then try parsing your own SQL files with -- comments through parseSync → deparseEnhanced and inspect the output, especially: (1) files with trailing comments like SELECT 1; -- note, (2) files with mid-statement comments like SELECT id, -- note\n name FROM users (should hoist above), and (3) PL/pgSQL function bodies with comments inside dollar-quoted strings (should NOT be extracted). Finally, run npx jest --verbose to see all snapshot assertions pass.

Notes
- The extractTopLevelComments() helper in roundtrip.test.ts has its own dollar-quote tracking logic (separate from the WASM scanner) for determining which -- comments in the fixture source are top-level. This is test-only code, but divergence from the scanner's behavior could cause false passes or failures.
- Depends on workspace packages (pgsql-parser, pgsql-deparser, @pgsql/types) via the workspace:* protocol.
- tsconfig.test.json has path mappings so tests resolve TypeScript source directly without requiring a build step.
- The pnpm-lock.yaml diff is large but mostly quoting style changes ("@scope/pkg" → '@scope/pkg'). The only substantive change is the @libpg-query/parser dependency for packages/parse.
- Comments inside dollar-quoted function bodies ($$...$$) are correctly preserved as opaque text by the SQL-level scanner (the WASM lexer sees them as part of the dollar-quoted string token). However, if the function body is parsed and deparsed through plpgsql-parser/plpgsql-deparser, those internal comments will be stripped. This is a known limitation to address in a follow-up plpgsql-parse package.
- The mid-statement-comments.sql fixture documents the hoisting behavior for future developers.

Link to Devin session: https://app.devin.ai/sessions/67facbcfe0ae424bad3eafb4e6ca9059
Requested by: @pyramation