gh-146306: Optimize float operations by mutating uniquely-referenced operands in place (JIT only)#146307

Merged
Fidget-Spinner merged 12 commits into python:main from eendebakpt:jit_float_inplace_rebased
Mar 24, 2026

Conversation

@eendebakpt
Contributor

@eendebakpt eendebakpt commented Mar 22, 2026

We can add the following tier 2 micro-ops that mutate the uniquely-referenced operand:

  • _BINARY_OP_ADD_FLOAT_INPLACE / _INPLACE_RIGHT — unique LHS / RHS
  • _BINARY_OP_SUBTRACT_FLOAT_INPLACE / _INPLACE_RIGHT — unique LHS / RHS
  • _BINARY_OP_MULTIPLY_FLOAT_INPLACE / _INPLACE_RIGHT — unique LHS / RHS
  • _UNARY_NEGATIVE_FLOAT_INPLACE — unique operand

The _RIGHT variants handle commutative ops (add, multiply) plus subtract when only the RHS is unique. The optimizer emits these in optimizer_bytecodes.c when PyJitRef_IsUnique(left) or PyJitRef_IsUnique(right) is true and the operand is a known float. The mutated operand is marked as borrowed so the following _POP_TOP becomes _POP_TOP_NOP.
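
Why uniqueness matters can be sketched at the Python level. This is an illustrative sketch, not the optimizer's actual check (which operates on JIT symbols, not live objects):

```python
import sys

# A float bound to two names is shared; mutating it in place would be
# observable through the other reference, so the JIT must not do it.
x = 1.5
y = x
assert x is y  # refcount > 1: not safe to mutate

# The intermediate result of a*b in `total += a*b + c` is a fresh
# PyFloatObject referenced only by the evaluation stack. Here `tmp`
# stands in for that intermediate: one reference from the namespace
# plus getrefcount's own argument gives a count of 2 on a standard
# CPython build.
a, b = 2.0, 3.0
tmp = a * b
assert sys.getrefcount(tmp) == 2  # uniquely referenced
```

When the count is exactly one at runtime, overwriting `ob_fval` of the intermediate reproduces the semantics of allocating a new float, which is what the `_INPLACE` micro-ops exploit.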

Micro-benchmarks:

Expression           main          optimized     Speedup
total += a*b + c     24.0 ns/iter  11.5 ns/iter  2.1x
total += a + b       16.9 ns/iter  11.0 ns/iter  1.5x
total += a*b + c*d   28.5 ns/iter  18.3 ns/iter  1.6x

pyperformance nbody (20k iterations):

Benchmark  main     optimized  Speedup
nbody      60.6 ms  49.0 ms    1.19x (19% faster)

Followup

Some operations that could be added in follow-up PRs:

  • division of floats (same as this PR, but needs to handle division by zero)
  • operations on a float and a (compact) int with a uniquely referenced float
  • integer operations (more involved, because of small ints and the varying number of digits)
  • operations on complex numbers (same as this PR, though the use case is probably smaller)
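
For the float-division case, the extra wrinkle is that the operation can raise. A minimal Python model of the guard a hypothetical in-place divide micro-op would need (`safe_inplace_div` is an illustrative name, not a real uop):

```python
def safe_inplace_div(lhs, rhs):
    """Model the zero check an in-place float divide would need before
    mutating the uniquely-referenced lhs; on failure it must raise (or
    deopt) instead of writing into ob_fval."""
    if rhs == 0.0:
        raise ZeroDivisionError("float division by zero")
    return lhs / rhs  # the JIT version would store this into lhs in place

assert safe_inplace_div(6.0, 3.0) == 2.0
try:
    safe_inplace_div(1.0, 0.0)
except ZeroDivisionError:
    pass
else:
    raise AssertionError("expected ZeroDivisionError")
```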

Script
"""Demo script for the inplace float mutation optimization in the tier 2 optimizer.

./configure --enable-experimental-jit=interpreter --with-pydebug
./configure --enable-experimental-jit=yes --with-pydebug

Usage:
    ./python jit_float_demo.py           # filtered trace (float-related ops)
    ./python jit_float_demo.py --all     # full trace (all ops)
"""

import sys
import timeit

SHOW_ALL = "--all" in sys.argv

# --- System info ---
print("=" * 60)
print("CPython JIT / Tier 2 Demo")
print("=" * 60)
print(f"Python version:  {sys.version}")
print(f"Debug build:     {hasattr(sys, 'gettotalrefcount')}")
print(f"Free-threaded:   {not sys._is_gil_enabled()}")
jit_mod = getattr(sys, "_jit", None)
if jit_mod is not None:
    print(f"JIT available:   {jit_mod.is_available()}")
    print(f"JIT enabled:     {jit_mod.is_enabled()}")
else:
    print("JIT:             not compiled in")

tier2 = False
try:
    from _testinternalcapi import TIER2_THRESHOLD
    tier2 = True
    print(f"Tier 2:          enabled (threshold={TIER2_THRESHOLD})")
except (ImportError, AttributeError):
    print("Tier 2:          disabled (build without _Py_TIER2)")

print()

# --- Example functions ---

def f_adds(n, a, b, c):
    """a*b + c per iteration — the multiply result is unique and reused."""
    total = 0.0
    for i in range(n):
        total = a + b + c
    return total

def f_chain(n, a, b, c):
    """a*b + c per iteration — the multiply result is unique and reused."""
    total = 0.0
    for i in range(n):
        total += a * b + c
    return total


def f_simple_add(n, a, b):
    """a + b per iteration — no unique intermediate."""
    total = 0.0
    for i in range(n):
        total += a + b
    return total


def f_long_chain(n, a, b, c, d):
    """a*b + c*d per iteration — two unique intermediates."""
    total = 0.0
    for i in range(n):
        total += a * b + c * d
    return total

def f_negate(n, a, b):
    """a*b + c*d per iteration — two unique intermediates."""
    total = 0.0
    for i in range(n):
        total = - (a + b)
    return total


# --- Warm up to trigger tier 2 ---
LOOP = 10_000
f_adds(LOOP, 2.0, 3.0, 4.0)
f_chain(LOOP, 2.0, 3.0, 4.0)
f_simple_add(LOOP, 2.0, 3.0)
f_long_chain(LOOP, 2.0, 3.0, 4.0, 5.0)
f_negate(LOOP, 2.0, 3.0)


# --- Op annotation ---

def annotate_op(name, oparg, func):
    """Return a human-readable annotation for a uop."""
    varnames = func.__code__.co_varnames
    consts = func.__code__.co_consts

    # _LOAD_FAST_BORROW_3 → local index 3
    for prefix in ("_LOAD_FAST_BORROW_", "_LOAD_FAST_", "_SWAP_FAST_"):
        if name.startswith(prefix):
            idx = int(name[len(prefix):])
            local = varnames[idx] if idx < len(varnames) else f"local{idx}"
            return local

    # _LOAD_CONST_INLINE_BORROW etc — operand is a pointer, not useful
    # but if oparg is a small index into consts, show it
    if "LOAD_CONST" in name and oparg < len(consts):
        return repr(consts[oparg])

    # Binary ops
    if "MULTIPLY" in name:
        return "*"
    if "SUBTRACT" in name:
        return "-"
    if "ADD" in name and "UNICODE" not in name:
        return "+"

    # Guards
    if name == "_GUARD_TOS_FLOAT":
        return "top is float?"
    if name == "_GUARD_NOS_FLOAT":
        return "2nd is float?"
    if name == "_GUARD_TOS_INT":
        return "top is int?"
    if name == "_GUARD_NOS_INT":
        return "2nd is int?"
    if "NOT_EXHAUSTED" in name:
        return "iter not done?"

    # Pop / cleanup
    if name == "_POP_TOP_NOP":
        return "skip (borrowed/null)"
    if name == "_POP_TOP_FLOAT":
        return "decref float"
    if name == "_POP_TOP_INT":
        return "decref int"

    # Control flow
    if name == "_JUMP_TO_TOP":
        return "loop"
    if name == "_EXIT_TRACE":
        return "exit"
    if name == "_DEOPT":
        return "deoptimize"
    if name == "_ERROR_POP_N":
        return "error handler"
    if name == "_START_EXECUTOR":
        return "trace entry"
    if name == "_MAKE_WARM":
        return "warmup counter"

    return ""


# --- Show traces ---
FILTER_KEYWORDS = (
    "FLOAT", "INPLACE", "BINARY_OP", "NOP",
    "LOAD_FAST", "LOAD_CONST", "GUARD",
)

has_get_executor = False
if tier2:
    try:
        from _opcode import get_executor
        has_get_executor = True
    except ImportError:
        pass

if has_get_executor:
    mode = "all ops" if SHOW_ALL else "float-related ops only"
    print("-" * 60)
    print(f"Tier 2 traces ({mode})")
    print("-" * 60)

    for label, func in [
        ("f_adds: total = a + b + c", f_adds),
        ("f_chain: total += a * b + c", f_chain),
        ("f_simple_add: total += a + b", f_simple_add),
        ("f_long_chain: total += a * b + c * d", f_long_chain),
    ]:
        code = func.__code__
        found = False
        for i in range(len(code.co_code) // 2):
            try:
                ex = get_executor(code, i * 2)
            except (ValueError, TypeError, RuntimeError):
                continue
            if ex is None:
                continue

            print(f"\n  {label}")
            for j, op in enumerate(ex):
                name, oparg = op[0], op[1]

                if not SHOW_ALL:
                    if not any(k in name for k in FILTER_KEYWORDS):
                        continue

                annotation = annotate_op(name, oparg, func)
                marker = " <<<" if "INPLACE" in name else ""

                if annotation:
                    print(f"    {j:3d}: {name:45s} # {annotation}{marker}")
                else:
                    print(f"    {j:3d}: {name}{marker}")
            found = True
            break

        if not found:
            print(f"\n  {label}: (no executor found)")

    print()
else:
    print("-" * 60)
    print("Tier 2 traces: skipped (tier 2 not available)")
    print("-" * 60)
    print()

# --- Benchmark ---
print("-" * 60)
print("Benchmark")
print("-" * 60)

N = 2_000_000
INNER = 1000

benchmarks = [
    ("total = a + b + c", lambda: f_adds(INNER, 2.0, 3.0, 4.0)),
    ("total += a*b + c  ", lambda: f_chain(INNER, 2.0, 3.0, 4.0)),
    ("total += a + b    ", lambda: f_simple_add(INNER, 2.0, 3.0)),
    ("total += a*b + c*d", lambda: f_long_chain(INNER, 2.0, 3.0, 4.0, 5.0)),
    ("total = - (a + b)    ", lambda: f_negate(INNER, 2.0, 3.0)),
]

for label, fn in benchmarks:
    iters = N // INNER
    t = timeit.timeit(fn, number=iters)
    ns_per = t / N * 1e9
    print(f"  {label}:  {t:.3f}s  ({ns_per:.0f} ns/iter)")

print()
print("The 'a*b + c' case benefits from _BINARY_OP_ADD_FLOAT_INPLACE:")
print("the result of a*b is uniquely referenced, so the addition")
print("mutates it in place instead of allocating a new float.")

# --- N-body benchmark from pyperformance ---
print()
print("-" * 60)
print("N-body benchmark (from pyperformance)")
print("-" * 60)

PI = 3.14159265358979323
SOLAR_MASS = 4 * PI * PI
DAYS_PER_YEAR = 365.24

def _nbody_make_system():
    bodies = [
        # sun
        ([0.0, 0.0, 0.0], [0.0, 0.0, 0.0], SOLAR_MASS),
        # jupiter
        ([4.84143144246472090e+00, -1.16032004402742839e+00, -1.03622044471123109e-01],
         [1.66007664274403694e-03*DAYS_PER_YEAR, 7.69901118419740425e-03*DAYS_PER_YEAR, -6.90460016972063023e-05*DAYS_PER_YEAR],
         9.54791938424326609e-04*SOLAR_MASS),
        # saturn
        ([8.34336671824457987e+00, 4.12479856412430479e+00, -4.03523417114321381e-01],
         [-2.76742510726862411e-03*DAYS_PER_YEAR, 4.99852801234917238e-03*DAYS_PER_YEAR, 2.30417297573763929e-05*DAYS_PER_YEAR],
         2.85885980666130812e-04*SOLAR_MASS),
        # uranus
        ([1.28943695621391310e+01, -1.51111514016986312e+01, -2.23307578892655734e-01],
         [2.96460137564761618e-03*DAYS_PER_YEAR, 2.37847173959480950e-03*DAYS_PER_YEAR, -2.96589568540237556e-05*DAYS_PER_YEAR],
         4.36624404335156298e-05*SOLAR_MASS),
        # neptune
        ([1.53796971148509165e+01, -2.59193146099879641e+01, 1.79258772950371181e-01],
         [2.68067772490389322e-03*DAYS_PER_YEAR, 1.62824170038242295e-03*DAYS_PER_YEAR, -9.51592254519715870e-05*DAYS_PER_YEAR],
         5.15138902046611451e-05*SOLAR_MASS),
    ]
    pairs = []
    for x in range(len(bodies) - 1):
        for y in bodies[x + 1:]:
            pairs.append((bodies[x], y))
    return bodies, pairs

def _nbody_advance(dt, n, bodies, pairs):
    for i in range(n):
        for (([x1, y1, z1], v1, m1), ([x2, y2, z2], v2, m2)) in pairs:
            dx = x1 - x2
            dy = y1 - y2
            dz = z1 - z2
            mag = dt * ((dx * dx + dy * dy + dz * dz) ** (-1.5))
            b1m = m1 * mag
            b2m = m2 * mag
            v1[0] -= dx * b2m
            v1[1] -= dy * b2m
            v1[2] -= dz * b2m
            v2[0] += dx * b1m
            v2[1] += dy * b1m
            v2[2] += dz * b1m
        for (r, [vx, vy, vz], m) in bodies:
            r[0] += dt * vx
            r[1] += dt * vy
            r[2] += dt * vz

def _nbody_report_energy(bodies, pairs, e=0.0):
    for (((x1, y1, z1), v1, m1), ((x2, y2, z2), v2, m2)) in pairs:
        dx = x1 - x2
        dy = y1 - y2
        dz = z1 - z2
        e -= (m1 * m2) / ((dx * dx + dy * dy + dz * dz) ** 0.5)
    for (r, [vx, vy, vz], m) in bodies:
        e += m * (vx * vx + vy * vy + vz * vz) / 2.
    return e

def _nbody_offset_momentum(ref, bodies, px=0.0, py=0.0, pz=0.0):
    for (r, [vx, vy, vz], m) in bodies:
        px -= vx * m
        py -= vy * m
        pz -= vz * m
    (r, v, m) = ref
    v[0] = px / m
    v[1] = py / m
    v[2] = pz / m

def bench_nbody(iterations=20000):
    bodies, pairs = _nbody_make_system()
    _nbody_offset_momentum(bodies[0], bodies)
    _nbody_report_energy(bodies, pairs)
    _nbody_advance(0.01, iterations, bodies, pairs)
    _nbody_report_energy(bodies, pairs)

# Warmup
bench_nbody(1000)

NBODY_RUNS = 5
t = timeit.timeit(lambda: bench_nbody(20000), number=NBODY_RUNS)
print(f"  nbody (20k iterations, {NBODY_RUNS} runs): {t/NBODY_RUNS*1000:.1f} ms/run")

…y-referenced operands in place

When the tier 2 optimizer can prove that an operand to a float
operation is uniquely referenced (refcount 1), mutate it in place
instead of allocating a new PyFloatObject.

New tier 2 micro-ops:
- _BINARY_OP_{ADD,SUBTRACT,MULTIPLY}_FLOAT_INPLACE (unique LHS)
- _BINARY_OP_{ADD,SUBTRACT,MULTIPLY}_FLOAT_INPLACE_RIGHT (unique RHS)
- _UNARY_NEGATIVE_FLOAT_INPLACE (unique operand)

Speeds up the pyperformance nbody benchmark by ~19%.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Avoid compound assignment (+=, -=, *=) directly on ob_fval in
inplace float ops. On 32-bit Windows, this generates JIT stencils
with _xmm register references that MSVC cannot parse. Instead,
read into a local double, compute, and write back.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add -mno-sse to clang args for i686-pc-windows-msvc target. The COFF32
stencil converter cannot handle _xmm register references that clang
emits for inline float arithmetic. Using x87 FPU instructions avoids
this. SSE is optional on 32-bit x86; x87 is the baseline.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@eendebakpt eendebakpt changed the title Draft: gh-146306: Optimize float operations by mutating uniquely-referenced operands in place (JIT only) gh-146306: Optimize float operations by mutating uniquely-referenced operands in place (JIT only) Mar 22, 2026
@eendebakpt
Contributor Author

Selected pyperformance benchmarks:

### chaos ###
Mean +- std dev: 60.7 ms +- 1.5 ms -> 59.5 ms +- 1.5 ms: 1.02x faster
Not significant

### fannkuch ###
Mean +- std dev: 376 ms +- 4 ms -> 375 ms +- 3 ms: 1.00x faster
Not significant

### float ###
Mean +- std dev: 60.9 ms +- 3.0 ms -> 58.8 ms +- 2.7 ms: 1.04x faster
Significant (t=4.00)

### nbody ###
Mean +- std dev: 87.2 ms +- 1.1 ms -> 72.3 ms +- 0.4 ms: 1.21x faster
Significant (t=98.34)

### raytrace ###
Mean +- std dev: 304 ms +- 5 ms -> 299 ms +- 5 ms: 1.02x faster
Not significant

### scimark_fft ###
Mean +- std dev: 305 ms +- 1 ms -> 297 ms +- 3 ms: 1.03x faster
Significant (t=19.01)

### scimark_lu ###
Mean +- std dev: 86.5 ms +- 0.4 ms -> 84.8 ms +- 0.4 ms: 1.02x faster
Not significant

### scimark_monte_carlo ###
Mean +- std dev: 60.7 ms +- 0.7 ms -> 59.8 ms +- 0.3 ms: 1.02x faster
Not significant

### scimark_sor ###
Mean +- std dev: 101 ms +- 1 ms -> 99 ms +- 3 ms: 1.03x faster
Significant (t=7.38)

### scimark_sparse_mat_mult ###
Mean +- std dev: 5.53 ms +- 0.02 ms -> 5.14 ms +- 0.02 ms: 1.08x faster
Significant (t=130.74)

### spectral_norm ###
Mean +- std dev: 84.0 ms +- 0.6 ms -> 79.0 ms +- 0.4 ms: 1.06x faster
Significant (t=53.73)

r = right;
if (PyJitRef_IsUnique(left)) {
    ADD_OP(_BINARY_OP_SUBTRACT_FLOAT_INPLACE, 0, 0);
    l = PyJitRef_Borrow(left);
Member

Isn't it more correct to say l = sym_new_null(ctx);? Same for below?

Contributor Author

To make this work I had to change the definition _POP_TOP_FLOAT. I think the change is ok, but please double check.

Member

It should be PyJitRef_Borrow(sym_new_null(ctx)); sorry for not being clear.

Member

That way there's no need to change the case in _POP_TOP_FLOAT.

Contributor Author

Thanks for clarifying. Now _POP_TOP_FLOAT is unchanged.

Member

@markshannon markshannon left a comment

This looks like a nice performance improvement for floating-point code. I have a few comments.

// Note: read into a local double and write back to avoid compound
// assignment (+=) on ob_fval, which generates problematic JIT
// stencils on i686-pc-windows-msvc.
tier2 op(_BINARY_OP_ADD_FLOAT_INPLACE, (left, right -- res, l, r)) {
Member

@markshannon markshannon Mar 23, 2026

This op and its variants all share a lot of common code.
Could you factor out the code into a macro to perform the inplace operation?
Like:

tier2 op(_BINARY_OP_ADD_FLOAT_INPLACE, (left, right -- res, l, r)) {
    res = FLOAT_INPLACE_OP(left, +, right);
    l = PyStackRef_NULL;
    r = right;
    INPUTS_DEAD();
}

tier2 op(_BINARY_OP_MULTIPLY_FLOAT_INPLACE_RIGHT, (left, right -- res, l, r)) {
    res = FLOAT_INPLACE_OP(right, *, left);
    l = left;
    r = PyStackRef_NULL;
    INPUTS_DEAD();
}

Contributor Author

Normal C macros are not allowed in the bytecodes.c opcodes (they are not expanded). I have not yet found a way to refactor this nicely.

We could add a new macro to the DSL (like INPUTS_DEAD) for this, but it feels a bit odd to add something to the DSL for this particular case.

Member

Normal C macros are not allowed in the bytecodes.c opcodes

They are. You just need to define them in ceval_macros.h. Would you like me to do this, or would you like to do it?

eendebakpt and others added 9 commits March 24, 2026 10:51
Co-authored-by: Ken Jin <kenjin4096@gmail.com>
The inplace ops set l or r to PyStackRef_NULL at runtime, so the
optimizer should model this as sym_new_null(ctx) rather than
PyJitRef_Borrow(). Both produce _POP_TOP_NOP but sym_new_null
correctly matches the runtime semantics.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# Conflicts:
#	Include/internal/pycore_uop_ids.h
Member

@Fidget-Spinner Fidget-Spinner left a comment

Awesome. Thanks for doing this!

In the future, this will be partially superseded by the partial evaluation pass, as any Py_REFCNT(f) == 1 op is also safe to unbox. However, for 3.15 this is a good win.

@eendebakpt
Contributor Author

In the future, this will be partially superseded by the partial evaluation pass, as any Py_REFCNT(f) == 1 op is also safe to unbox. However, for 3.15 this is a good win.

Good to know! Given that, which of the operations listed in the followup section of the first post do you think would be candidates to add in new PRs?

@Fidget-Spinner
Member

Fidget-Spinner commented Mar 24, 2026

Good to know! Given that, which of the operations listed in the followup section of the first post do you think would be candidates to add in new PRs?

I think compact_int OP compact_int is likely the next most common/worthwhile. The remaining mixed float-int stuff is less common in benchmarks.

They might be worthwhile, hard to tell.
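
The small-int caveat mentioned in the follow-up discussion is easy to demonstrate: CPython caches the ints -5 through 256, so those objects are shared (and immortal on recent versions) and can never be mutated in place, and larger ints vary in allocation size with their digit count. A quick illustration:

```python
import sys

# Small ints are interned: two independently computed 256s are the
# same object, so an in-place integer uop could never touch them.
a = 100 + 156
b = 2 ** 8
assert a is b  # same cached object

# Bigger ints are variable-width: a result may need more internal
# "digits" than the operand's allocation provides, so in-place reuse
# would also need a size check.
assert sys.getsizeof(10 ** 30) > sys.getsizeof(257)
```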

@Fidget-Spinner Fidget-Spinner merged commit 951675c into python:main Mar 24, 2026
89 checks passed