Skip to content

fix: escape control characters in build_scan_json() for multi-line tokens#147

Merged
pyramation merged 2 commits intomainfrom
devin/1775609815-fix-scan-json-escaping
Apr 8, 2026
Merged

fix: escape control characters in build_scan_json() for multi-line tokens#147
pyramation merged 2 commits intomainfrom
devin/1775609815-fix-scan-json-escaping

Conversation

@pyramation
Copy link
Copy Markdown
Collaborator

@pyramation pyramation commented Apr 8, 2026

Summary

Fixes a bug in build_scan_json() where literal control characters (\n, \r, \t) in token text were written directly into the JSON string without escaping. This caused JSON.parse to throw "Bad control character in string literal" when scanning SQL containing multi-line tokens (e.g., dollar-quoted function bodies like $$\nBEGIN\n...$$ or multi-line /* */ comments).

The existing escape loop only handled " and \. This adds the three missing control character cases to produce valid JSON escape sequences (\n\\n, etc.).

Three new tests added covering dollar-quoted strings with newlines, dollar-quoted tokens with tabs, and multi-line block comments. Tests use template literals with actual newlines/tabs so the multi-line SQL is visually obvious in the source code.

Review & Testing Checklist for Human

  • Verify the if/else restructuring is correct (wasm_wrapper.c:379-396): The original code used a flat if + unconditional escaped_text[escaped_pos++] = c which worked for " and \ but silently passed through control chars. The new code uses an if/else if/else chain where each branch writes exactly the right pair of characters. Confirm no character is double-written or skipped.
  • Buffer size adequacy: escaped_text is allocated as token_length * 2 + 1. Worst case is every char needing a 2-byte escape, so this is sufficient. But worth a glance.
  • Other control characters (0x00-0x1F): Only \n, \r, \t are handled. Other control chars are technically invalid in JSON strings but extremely unlikely in PostgreSQL token text. Decide if a general c < 0x20 guard is worth adding.

Test plan: The new tests will only pass after a WASM rebuild (they exercise the C code path). CI should rebuild and run them. To verify manually: rebuild WASM with pnpm wasm:build in /full, then run node --test test/scan.test.js and confirm all tests pass, especially the three new multi-line token tests.

See also PR #148 which contains only these same tests without the C fix — CI fails there with the exact "Bad control character in string literal" errors, proving the fix is necessary.

Notes

  • The scan and scanSync functions in index.cjs call JSON.parse on the raw string returned by _wasm_scan. This fix ensures that string is always valid JSON.
  • Downstream consumer pgsql-parse (PR #290 in pgsql-parser) has a safeScanSync() workaround that monkey-patches JSON.parse to fix this at the JS level. Once this fix is published, that workaround can be removed.

Link to Devin session: https://app.devin.ai/sessions/67facbcfe0ae424bad3eafb4e6ca9059
Requested by: @pyramation

…kens

The JSON escape loop in build_scan_json() only handled '"' and '\',
but not '\n', '\r', '\t'. When token text contains literal newlines
(e.g., dollar-quoted function bodies like $$\nBEGIN\n...$$),
the raw JSON string had unescaped control characters, causing
JSON.parse to throw 'Bad control character in string literal'.

This adds proper escape sequences for \n, \r, and \t in the
token text escaping loop, matching standard JSON string escaping rules.

New tests added for multi-line dollar-quoted strings, tabs, and
multi-line C-style comments.
@devin-ai-integration
Copy link
Copy Markdown

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

devin-ai-integration bot pushed a commit that referenced this pull request Apr 8, 2026
These tests demonstrate that scanSync throws 'Bad control character in
string literal' when scanning SQL with multi-line tokens (dollar-quoted
function bodies, tabs, multi-line C-style comments).

The root cause is that build_scan_json() in wasm_wrapper.c only escapes
'"' and '\\' in token text, but not '\n', '\r', '\t'. When token text
contains literal newlines, the JSON output has unescaped control chars
that break JSON.parse.

These tests are expected to FAIL on this branch (no fix applied).
See PR #147 for the fix.
@pyramation pyramation merged commit 99dee5f into main Apr 8, 2026
58 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant