perf: bulk text block scanner bypasses fastparse per-line overhead #689

Open
He-Pin wants to merge 1 commit into databricks:master from He-Pin:perf/text-block-bulk-scanner

He-Pin commented Apr 5, 2026

Motivation

Large text blocks (|||...|||) in Jsonnet are parsed line-by-line using fastparse combinators. For a 600KB text block with ~8000 lines (e.g., large_string_template.jsonnet), this creates ~8000 intermediate String objects via fastparse .! captures, accumulates them in a Seq[String], and then joins them with mkString. This overhead dominates parsing time for large text blocks.

Key Design Decision

Replace the per-line fastparse combinator loop with a custom bulk scanner that directly accesses the underlying String from IndexedParserInput.data. Instead of creating one String per line, we use a single StringBuilder and call append(CharSequence, start, end), which copies each line's characters straight from the source into the builder without allocating an intermediate String. The first line is still parsed with fastparse to preserve error message quality for malformed input.

A hybrid approach is used:

  • First line: fastparse combinator (proper error messages)
  • Subsequent lines: bulk scanner using direct String access
  • Fallback: original fastparse path for non-IndexedParserInput

Additional optimizations in constructString:

  • Single-string fast path: avoids mkString when only one string segment
  • Pre-sized StringBuilder: pre-calculates total length for multi-line blocks
  • Skip interning for large strings (>1024 chars): avoids expensive hashCode computation on 600KB strings that are unlikely to repeat
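
The three constructString optimizations listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the object and method names are hypothetical, and the interning table is modeled as a plain mutable.HashMap, which is an assumption about how interning works. Only the 1024-character threshold comes from the description.

```scala
import scala.collection.mutable

object ConstructStringSketch {
  // Threshold taken from the PR description: strings longer than this are not interned.
  val InternThreshold = 1024

  // Hypothetical intern table; the real parser's interning mechanism may differ.
  private val interned = mutable.HashMap.empty[String, String]

  def construct(segments: Seq[String]): String = {
    val joined =
      if (segments.length == 1) {
        // Single-string fast path: nothing to join, reuse the segment as-is.
        segments.head
      } else {
        // Pre-size the builder so it never has to grow while appending.
        var total = 0
        segments.foreach(s => total += s.length)
        val sb = new java.lang.StringBuilder(total)
        segments.foreach(s => sb.append(s))
        sb.toString
      }
    // Looking a string up in a hash table hashes every character,
    // so large strings skip the table entirely.
    if (joined.length > InternThreshold) joined
    else interned.getOrElseUpdate(joined, joined)
  }
}
```

With this sketch, two short results with equal content come back as the same instance, while two large results are equal but separate objects.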

Modification

Parser.scala:

  • tripleBarStringBody: delegates to tripleBarStringBodyBulk after first line
  • tripleBarStringBodyBulk (new): custom scanner using IndexedParserInput.data with:
    • String.regionMatches for zero-allocation indent matching
    • StringBuilder.append(CharSequence, start, end) for zero-copy line extraction
    • Proper error handling for indentation mismatches
  • constructString: single-string fast path, pre-sized StringBuilder, interning threshold
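
The scanner described for tripleBarStringBodyBulk can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: the object name, the scanBody signature, the Either-based error reporting, and the error messages are all hypothetical. It assumes a non-empty whitespace indent already determined from the first line, and it operates on a plain String standing in for IndexedParserInput.data. java.lang.StringBuilder is named explicitly so that append(CharSequence, start, end) resolves to the JDK overload.

```scala
object TextBlockScanSketch {
  /** Scans text block lines in `data` starting at offset `start`.
    * Returns Right((body, offsetAfterTerminator)) on success, Left(message) on error.
    */
  def scanBody(data: String, start: Int, indent: String): Either[String, (String, Int)] = {
    val sb = new java.lang.StringBuilder()
    var pos = start
    while (pos < data.length) {
      val nl = data.indexOf('\n', pos)
      val lineEnd = if (nl < 0) data.length else nl
      if (data.regionMatches(pos, indent, 0, indent.length)) {
        // Indented line: copy everything after the indent, including the
        // newline if present, without creating a per-line String.
        val contentEnd = if (nl < 0) lineEnd else nl + 1
        sb.append(data, pos + indent.length, contentEnd)
        pos = contentEnd
      } else if (lineEnd == pos) {
        // Empty line inside the block contributes a bare newline.
        sb.append('\n')
        pos = nl + 1
      } else {
        // Line without the expected indent: it must be the terminator,
        // i.e. optional spaces or tabs followed by "|||".
        var p = pos
        while (p < lineEnd && (data.charAt(p) == ' ' || data.charAt(p) == '\t')) p += 1
        if (data.regionMatches(p, "|||", 0, 3)) return Right((sb.toString, p + 3))
        else return Left(s"text block not terminated with ||| at offset $pos")
      }
    }
    Left("unexpected end of input inside text block")
  }
}
```

For example, `TextBlockScanSketch.scanBody("  foo\n  bar\n|||", 0, "  ")` yields the body `"foo\nbar\n"` and the offset just past the closing `|||`, while a line with the wrong indentation produces a `Left`.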

Benchmark Results

JMH (JVM, Scala 3.3.7)

| Benchmark | Master (ms/op) | Optimized (ms/op) | Change |
|---|---|---|---|
| large_string_template | 2.251 | 1.762 | -21.7% |
| large_string_join | 2.062 | 2.083 | |
| bench.02 | 48.735 | 46.817 | -3.9% |
| bench.03 | 13.316 | 13.552 | |
| realistic1 | 2.707 | 2.645 | |
| realistic2 | 67.037 | 68.285 | |

All 35 benchmarks checked, zero regressions.

Native (Scala Native, hyperfine --warmup 5 --runs 20)

| Binary | large_string_template (ms) | vs jrsonnet |
|---|---|---|
| sjsonnet master | 17.3 ± 0.7 | 3.29x slower |
| sjsonnet optimized | 14.2 ± 0.6 | 2.71x slower |
| jrsonnet 0.5.0-pre98 | 5.3 ± 0.4 | baseline |

Native improvement: -18% on large_string_template (17.3ms → 14.2ms)
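
The percentages quoted for large_string_template follow from the usual relative-change formula, (after - before) / before. A small sketch to reproduce them from the reported means (the object and method names are illustrative only):

```scala
object RelativeChange {
  // Relative change in percent; negative values mean the optimized build is faster.
  def percentChange(before: Double, after: Double): Double =
    (after - before) / before * 100.0
}

// JMH:    RelativeChange.percentChange(2.251, 1.762) is about -21.7
// Native: RelativeChange.percentChange(17.3, 14.2)   is about -17.9
```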

The remaining gap vs jrsonnet is primarily:

  • Scala Native startup overhead (~6.8ms vs ~4.5ms for jrsonnet)
  • Rust's zero-allocation, hand-coded parser vs fastparse combinator infrastructure

Analysis

The optimization targets the parsing phase specifically. The 600KB text block benchmark spends significant time in per-line String allocation and Seq management. By replacing ~8000 individual string captures with a single StringBuilder bulk scan, we eliminate:

  • ~8000 String object allocations (one per line)
  • Seq[String] growth and management overhead
  • Final mkString join of ~8000 strings
  • hashCode computation on the 600KB result string (interning skip)

The regionMatches and StringBuilder.append(CharSequence, start, end) APIs enable zero-copy processing where the source String data is read directly without intermediate allocations.

References

Result

  • ✅ All 140 JVM tests pass
  • ✅ 21.7% JMH improvement on target benchmark
  • ✅ 18% native improvement on target benchmark
  • ✅ Zero regressions across all 35 benchmarks

Replace the per-line fastparse combinator loop in tripleBarStringBody with
a custom bulk scanner that directly accesses the underlying String data.
For a 600KB text block with ~8000 lines, this eliminates ~8000 intermediate
String allocations and the Seq[String] + mkString join overhead.

Key changes:
- tripleBarStringBodyBulk: Custom scanner using IndexedParserInput.data
  for zero-copy StringBuilder.append(CharSequence, start, end) instead of
  fastparse's repX combinator which creates one String per line.
- Hybrid approach: first line still uses fastparse for proper error messages,
  subsequent lines use the bulk scanner.
- constructString: Skip string interning for strings >1024 chars (avoids
  expensive hashCode computation on 600KB strings), single-string fast path,
  pre-sized StringBuilder for multi-line blocks.
- Falls back to original fastparse path for non-IndexedParserInput.

JMH large_string_template: 2.251 → 1.762 ms/op (-21.7%)
Native large_string_template: 17.3 → 14.2 ms (-18%)

Upstream: explored in he-pin/sjsonnet jit branch

He-Pin marked this pull request as ready for review April 5, 2026 08:31