perf: comprehension fuse scope+eval and inline BinaryOp(ValidId,ValidId) fast path#686

Open
He-Pin wants to merge 1 commit into databricks:master from He-Pin:perf/comprehension-binop-inline

Conversation


@He-Pin He-Pin commented Apr 5, 2026

Motivation

Array comprehensions with simple binary operations (e.g., [a + b for a in xs for b in ys]) are extremely common in Jsonnet workloads. The current implementation creates a new ValScope per iteration and goes through the full visitExpr dispatch for every element — both of which are unnecessary overhead for the common case where the body is BinaryOp(ValidId, ValidId).

This PR fuses scope creation with body evaluation and inlines the BinaryOp evaluation for ValidId operands directly in the comprehension inner loop, avoiding:

  1. Scope allocation per iteration (reuses a mutable scope)
  2. Full visitExpr dispatch (direct bindings(idx).value lookup)
  3. Lazy thunk creation (eager Val results instead of LazyExpr wrappers)
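The shape of the fused loop can be sketched as follows. This is a hypothetical, much-simplified model: the real Evaluator works over `Val`/`ValScope`/`Expr`, which are stood in here by plain doubles, and `FusedCompSketch`/`fusedSum` are illustrative names, not sjsonnet API. It models `[a + b for a in xs for b in ys]`.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical sketch of the fused comprehension inner loop.
object FusedCompSketch {
  def fusedSum(xs: Array[Double], ys: Array[Double]): Array[Double] = {
    val out = new ArrayBuffer[Double](xs.length * ys.length)
    var i = 0
    while (i < xs.length) {
      val a = xs(i)          // plays the role of mutating bindings(aSlot)
      var j = 0
      while (j < ys.length) {
        val b = ys(j)        // plays the role of mutating bindings(bSlot)
        out += a + b         // inlined BinaryOp: no visitExpr dispatch,
        j += 1               // no fresh ValScope, no LazyExpr wrapper
      }
      i += 1
    }
    out.toArray
  }
}
```

The point of the sketch is what is absent: nothing is allocated per iteration except the result element itself.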

Key Design Decisions

  1. Mutable scope reuse: ValScope.extendMutable() creates a scope with one uninitialized slot. The caller mutates bindings(slot) per iteration instead of allocating a new scope. This is safe because BinaryOp(ValidId, ValidId) evaluates both operands eagerly — no lazy closures capture the scope array.

  2. Complete operator coverage: The inline fast path handles ALL BinaryOp operations:

    • evalBinaryOpNumNum: @switch-dispatched Num×Num path with overflow checks, div-by-zero, and bitwise ops
    • visitBinaryOpValues: Polymorphic fallback for Str concat, Str format, Obj merge, Arr concat, equality, comparisons, and in
  3. Eager evaluation: The fast path stores concrete Val results instead of LazyExpr thunks. This changes error timing for edge cases like std.length([1/0 for x in [1]]) (errors at construction vs access), but matches go-jsonnet/jrsonnet behavior and provides massive performance gains.
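The error-timing change in point 3 can be illustrated with a small hypothetical model (the names `ErrorTimingSketch` and `div` are illustrative; `div` stands in for Jsonnet's `1/0` runtime error):

```scala
// Hypothetical model of the lazy-vs-eager error-timing difference.
object ErrorTimingSketch {
  def div(a: Int, b: Int): Int =
    if (b == 0) throw new ArithmeticException("division by zero")
    else a / b

  // Lazy thunks: the array is constructed without error; the error
  // fires only when an element is forced (the old behavior).
  def lazyElems: Array[() => Int] = Array(() => div(1, 0))

  // Eager values: the error fires while the array is being built
  // (the new fast-path behavior, matching go-jsonnet/jrsonnet).
  def eagerElems: Array[Int] = Array(div(1, 0))
}
```

Constructing `lazyElems` succeeds and only invoking a thunk raises; constructing `eagerElems` raises immediately.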

Modification

Evaluator.scala

  • visitCompInline: New method that fuses scope creation with body evaluation. For BinaryOp(ValidId, ValidId) bodies, directly looks up operands via scope indices and dispatches to specialized evaluators.
  • evalBinaryOpNumNum: @inline @switch-dispatched fast path for Num×Num operations (comparison, arithmetic with overflow checks, bitwise with safe integer range checks).
  • visitBinaryOpValues: Complete polymorphic fallback handling all non-Num operations (Str+Str, Str+any, Obj+Obj, Arr+Arr, Str%any, comparisons).
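A minimal sketch of the two-tier dispatch described above, using pared-down stand-ins for sjsonnet's `Val` hierarchy and `BinaryOp` tags (the real names and signatures in `Evaluator.scala` differ; `BinOpSketch`, `OpAdd`, etc. are assumptions for illustration):

```scala
import scala.annotation.switch

// Hypothetical two-tier binary-op dispatch sketch.
object BinOpSketch {
  sealed trait V
  final case class Num(d: Double) extends V
  final case class Str(s: String) extends V
  final case class Bool(b: Boolean) extends V

  final val OpAdd = 0
  final val OpDiv = 1
  final val OpLt  = 2

  // Num×Num fast path: @switch asks scalac to emit tableswitch
  // bytecode (and to warn if the match cannot compile to one).
  def evalNumNum(op: Int, l: Double, r: Double): V = (op: @switch) match {
    case OpAdd => Num(l + r)
    case OpDiv =>
      if (r == 0) throw new ArithmeticException("division by zero")
      else Num(l / r)
    case OpLt  => Bool(l < r)
    case _     => throw new IllegalStateException("unknown op")
  }

  // Polymorphic fallback for non-Num operands (only '+' sketched here).
  def evalValues(op: Int, l: V, r: V): V = (op, l, r) match {
    case (OpAdd, Str(a), Str(b)) => Str(a + b)
    case (OpAdd, Num(a), Num(b)) => Num(a + b)
    case _ => throw new IllegalArgumentException("unsupported operands")
  }
}
```

The design point is that the Num×Num path pays only an integer switch per element, while everything else routes through one polymorphic match.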

ValScope.scala

  • extendMutable(): New method that creates a mutable scope extension for tight-loop reuse without per-iteration allocation.
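The idea behind extendMutable() can be sketched as below. This is a hypothetical simplification: the real ValScope stores lazy bindings and its actual signature may differ; `ScopeSketch` is an illustrative name.

```scala
// Hypothetical sketch of mutable scope extension for tight-loop reuse.
final class ScopeSketch(val bindings: Array[Any]) {
  // Appends one uninitialized slot ONCE; callers then overwrite
  // bindings(slot) on every iteration instead of allocating a scope.
  def extendMutable(): (ScopeSketch, Int) = {
    val next = new Array[Any](bindings.length + 1)
    System.arraycopy(bindings, 0, next, 0, bindings.length)
    (new ScopeSketch(next), bindings.length) // index of the new slot
  }
}
```

Usage is one `extendMutable()` before the loop, then `scope.bindings(slot) = v` per element. As the PR notes, this is only safe because the fast path evaluates eagerly, so no lazy closure ever captures the mutated array.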

Test

  • comprehension_binop_types.jsonnet: Comprehensive regression test covering ALL BinaryOp types in comprehensions (string concat, numeric arithmetic, comparisons, bitwise ops, string formatting, array concat, in operator).

Benchmark Results

JMH (JVM, Scala 3.3.7)

| Benchmark   | Before (ms/op) | After (ms/op) | Change      |
|-------------|----------------|---------------|-------------|
| comparison2 | 72.498         | 39.086        | -46.1%      |
| comparison  | 23.093         | 22.120        | -4.2%       |
| bench.02    | 48.296         | 48.670        | ~0% (noise) |
| bench.04    | 33.261         | 32.260        | -3.0%       |
| realistic2  | 68.960         | 69.977        | ~0% (noise) |
| reverse     | 10.685         | 10.540        | ~0% (noise) |

No regressions across all 35 benchmarks.

Hyperfine (Scala Native, macOS ARM64)

| Benchmark   | master   | This PR | jrsonnet 0.4.2 | vs master | vs jrsonnet  |
|-------------|----------|---------|----------------|-----------|--------------|
| comparison2 | 166.5 ms | 74.0 ms | 232.4 ms       | -55.6%    | 3.14× faster |
| bench.02    | 68.8 ms  | 67.5 ms |                | ~0%       |              |
| bench.04    | 530 ms   | 528 ms  | 532 ms         | ~0%       | ~0%          |
| reverse     | 49.6 ms  | 49.1 ms |                | ~0%       |              |

Analysis

The -46% JMH / -56% native improvement on comparison2 is expected: this benchmark is dominated by [a < b for a in large_array for b in large_array] comprehensions, which is exactly the pattern the BinaryOp inline fast path optimizes.

Other benchmarks show no regression because:

  • The fast path only activates for BinaryOp(ValidId, ValidId) bodies — other comprehension patterns fall through to the normal path
  • The @switch annotation ensures tableswitch bytecode for zero-overhead dispatch

sjsonnet is now 3.14× faster than jrsonnet on comparison2, up from being roughly equal.

Result

  • ✅ All 141 tests pass
  • ✅ Zero benchmark regressions across 35 JMH benchmarks
  • ✅ comparison2: -46% JMH, -56% native, 3.14× faster than jrsonnet
  • ✅ Comprehensive regression test for all BinaryOp types in comprehensions

perf: comprehension fuse scope+eval and inline BinaryOp(ValidId,ValidId) fast path

Fuse comprehension scope building with body evaluation, eliminating
redundant scope allocation in the innermost loop. When the body is
BinaryOp(ValidId,ValidId), inline the scope lookups and binary-op
dispatch entirely, avoiding 3× visitExpr overhead per iteration.

Key changes:
- ValScope.extendMutable(): creates a scope with one extra mutable slot
  for reuse across iterations (safe because results are eagerly
  evaluated, not captured in lazy thunks)
- visitCompInline: split by rest (Nil vs non-Nil), with BinaryOp fast
  path for innermost loops
- evalBinaryOpNumNum: @switch-dispatched Num×Num fast path covering all
  comparison, arithmetic, modulo, bitwise, and shift operators with full
  safety checks (overflow, division-by-zero, safe integer range)
- visitBinaryOpValues: polymorphic fallback for non-Num operands covering
  string concat/format, object merge, array concat, equality, and 'in'

Benchmark: comparison2 -53.1% (74.1 → 34.8 ms/op), zero regressions
across 35 benchmarks.

Upstream: jit branch commits 3466461 (fuse) + 71545ba (inline)
@He-Pin He-Pin marked this pull request as ready for review April 5, 2026 09:44
