gh-146306: Optimize float operations by mutating uniquely-referenced operands in place (JIT only)#146307

Merged
Fidget-Spinner merged 12 commits into python:main from eendebakpt:jit_float_inplace_rebased
Mar 24, 2026

Conversation

@eendebakpt
Contributor

@eendebakpt eendebakpt commented Mar 22, 2026

We can add the following tier 2 micro-ops that mutate the uniquely-referenced operand:

  • _BINARY_OP_ADD_FLOAT_INPLACE / _INPLACE_RIGHT — unique LHS / RHS
  • _BINARY_OP_SUBTRACT_FLOAT_INPLACE / _INPLACE_RIGHT — unique LHS / RHS
  • _BINARY_OP_MULTIPLY_FLOAT_INPLACE / _INPLACE_RIGHT — unique LHS / RHS
  • _UNARY_NEGATIVE_FLOAT_INPLACE — unique operand

The _RIGHT variants handle commutative ops (add, multiply) plus subtract when only the RHS is unique. The optimizer emits these in optimizer_bytecodes.c when PyJitRef_IsUnique(left) or PyJitRef_IsUnique(right) is true and the operand is a known float. The mutated operand is marked as borrowed so the following _POP_TOP becomes _POP_TOP_NOP.
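
Why uniqueness matters can be sketched at the Python level. This is an illustrative sketch, not the optimizer's actual check (which operates on JIT symbols, not live objects):

```python
import sys

# A float bound to two names is shared; mutating it in place would be
# observable through the other reference, so the JIT must not do it.
x = 1.5
y = x
assert x is y  # refcount > 1: not safe to mutate

# The intermediate result of a*b in `total += a*b + c` is a fresh
# PyFloatObject referenced only by the evaluation stack. Here `tmp`
# stands in for that intermediate: one reference from the namespace
# plus getrefcount's own argument gives a count of 2 on a standard
# CPython build.
a, b = 2.0, 3.0
tmp = a * b
assert sys.getrefcount(tmp) == 2  # uniquely referenced
```

When the count is exactly one at runtime, overwriting `ob_fval` of the intermediate reproduces the semantics of allocating a new float, which is what the `_INPLACE` micro-ops exploit.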

Micro-benchmarks:

Expression           main          optimized     Speedup
total += a*b + c     24.0 ns/iter  11.5 ns/iter  2.1x
total += a + b       16.9 ns/iter  11.0 ns/iter  1.5x
total += a*b + c*d   28.5 ns/iter  18.3 ns/iter  1.6x

pyperformance nbody (20k iterations):

Benchmark  main     optimized  Speedup
nbody      60.6 ms  49.0 ms    1.19x (19% faster)

Followup

Some operations that could be added in follow-up PRs:

  • division of floats (same as this PR, but needs to handle division by zero)
  • operations on a float and a (compact) int with a uniquely referenced float
  • integer operations (more involved, because of small ints and the varying number of digits)
  • operations on complex numbers (same as this PR, though the use case is probably smaller)
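
For the float-division case, the extra wrinkle is that the operation can raise. A minimal Python model of the guard a hypothetical in-place divide micro-op would need (`safe_inplace_div` is an illustrative name, not a real uop):

```python
def safe_inplace_div(lhs, rhs):
    """Model the zero check an in-place float divide would need before
    mutating the uniquely-referenced lhs; on failure it must raise (or
    deopt) instead of writing into ob_fval."""
    if rhs == 0.0:
        raise ZeroDivisionError("float division by zero")
    return lhs / rhs  # the JIT version would store this into lhs in place

assert safe_inplace_div(6.0, 3.0) == 2.0
try:
    safe_inplace_div(1.0, 0.0)
except ZeroDivisionError:
    pass
else:
    raise AssertionError("expected ZeroDivisionError")
```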

Script
"""Demo script for the inplace float mutation optimization in the tier 2 optimizer.

./configure --enable-experimental-jit=interpreter --with-pydebug
./configure --enable-experimental-jit=yes --with-pydebug

Usage:
    ./python jit_float_demo.py           # filtered trace (float-related ops)
    ./python jit_float_demo.py --all     # full trace (all ops)
"""

import sys
import timeit

SHOW_ALL = "--all" in sys.argv

# --- System info ---
print("=" * 60)
print("CPython JIT / Tier 2 Demo")
print("=" * 60)
print(f"Python version:  {sys.version}")
print(f"Debug build:     {hasattr(sys, 'gettotalrefcount')}")
print(f"Free-threaded:   {not sys._is_gil_enabled()}")
jit_mod = getattr(sys, "_jit", None)
if jit_mod is not None:
    print(f"JIT available:   {jit_mod.is_available()}")
    print(f"JIT enabled:     {jit_mod.is_enabled()}")
else:
    print("JIT:             not compiled in")

tier2 = False
try:
    from _testinternalcapi import TIER2_THRESHOLD
    tier2 = True
    print(f"Tier 2:          enabled (threshold={TIER2_THRESHOLD})")
except (ImportError, AttributeError):
    print("Tier 2:          disabled (build without _Py_TIER2)")

print()

# --- Example functions ---

def f_adds(n, a, b, c):
    """a*b + c per iteration — the multiply result is unique and reused."""
    total = 0.0
    for i in range(n):
        total = a + b + c
    return total

def f_chain(n, a, b, c):
    """a*b + c per iteration — the multiply result is unique and reused."""
    total = 0.0
    for i in range(n):
        total += a * b + c
    return total


def f_simple_add(n, a, b):
    """a + b per iteration — no unique intermediate."""
    total = 0.0
    for i in range(n):
        total += a + b
    return total


def f_long_chain(n, a, b, c, d):
    """a*b + c*d per iteration — two unique intermediates."""
    total = 0.0
    for i in range(n):
        total += a * b + c * d
    return total

def f_negate(n, a, b):
    """a*b + c*d per iteration — two unique intermediates."""
    total = 0.0
    for i in range(n):
        total = - (a + b)
    return total


# --- Warm up to trigger tier 2 ---
LOOP = 10_000
f_adds(LOOP, 2.0, 3.0, 4.0)
f_chain(LOOP, 2.0, 3.0, 4.0)
f_simple_add(LOOP, 2.0, 3.0)
f_long_chain(LOOP, 2.0, 3.0, 4.0, 5.0)
f_negate(LOOP, 2.0, 3.0)


# --- Op annotation ---

def annotate_op(name, oparg, func):
    """Return a human-readable annotation for a uop."""
    varnames = func.__code__.co_varnames
    consts = func.__code__.co_consts

    # _LOAD_FAST_BORROW_3 → local index 3
    for prefix in ("_LOAD_FAST_BORROW_", "_LOAD_FAST_", "_SWAP_FAST_"):
        if name.startswith(prefix):
            idx = int(name[len(prefix):])
            local = varnames[idx] if idx < len(varnames) else f"local{idx}"
            return local

    # _LOAD_CONST_INLINE_BORROW etc — operand is a pointer, not useful
    # but if oparg is a small index into consts, show it
    if "LOAD_CONST" in name and oparg < len(consts):
        return repr(consts[oparg])

    # Binary ops
    if "MULTIPLY" in name:
        return "*"
    if "SUBTRACT" in name:
        return "-"
    if "ADD" in name and "UNICODE" not in name:
        return "+"

    # Guards
    if name == "_GUARD_TOS_FLOAT":
        return "top is float?"
    if name == "_GUARD_NOS_FLOAT":
        return "2nd is float?"
    if name == "_GUARD_TOS_INT":
        return "top is int?"
    if name == "_GUARD_NOS_INT":
        return "2nd is int?"
    if "NOT_EXHAUSTED" in name:
        return "iter not done?"

    # Pop / cleanup
    if name == "_POP_TOP_NOP":
        return "skip (borrowed/null)"
    if name == "_POP_TOP_FLOAT":
        return "decref float"
    if name == "_POP_TOP_INT":
        return "decref int"

    # Control flow
    if name == "_JUMP_TO_TOP":
        return "loop"
    if name == "_EXIT_TRACE":
        return "exit"
    if name == "_DEOPT":
        return "deoptimize"
    if name == "_ERROR_POP_N":
        return "error handler"
    if name == "_START_EXECUTOR":
        return "trace entry"
    if name == "_MAKE_WARM":
        return "warmup counter"

    return ""


# --- Show traces ---
FILTER_KEYWORDS = (
    "FLOAT", "INPLACE", "BINARY_OP", "NOP",
    "LOAD_FAST", "LOAD_CONST", "GUARD",
)

has_get_executor = False
if tier2:
    try:
        from _opcode import get_executor
        has_get_executor = True
    except ImportError:
        pass

if has_get_executor:
    mode = "all ops" if SHOW_ALL else "float-related ops only"
    print("-" * 60)
    print(f"Tier 2 traces ({mode})")
    print("-" * 60)

    for label, func in [
        ("f_adds: total = a + b + c", f_adds),
        ("f_chain: total += a * b + c", f_chain),
        ("f_simple_add: total += a + b", f_simple_add),
        ("f_long_chain: total += a * b + c * d", f_long_chain),
    ]:
        code = func.__code__
        found = False
        for i in range(len(code.co_code) // 2):
            try:
                ex = get_executor(code, i * 2)
            except (ValueError, TypeError, RuntimeError):
                continue
            if ex is None:
                continue

            print(f"\n  {label}")
            for j, op in enumerate(ex):
                name, oparg = op[0], op[1]

                if not SHOW_ALL:
                    if not any(k in name for k in FILTER_KEYWORDS):
                        continue

                annotation = annotate_op(name, oparg, func)
                marker = " <<<" if "INPLACE" in name else ""

                if annotation:
                    print(f"    {j:3d}: {name:45s} # {annotation}{marker}")
                else:
                    print(f"    {j:3d}: {name}{marker}")
            found = True
            break

        if not found:
            print(f"\n  {label}: (no executor found)")

    print()
else:
    print("-" * 60)
    print("Tier 2 traces: skipped (tier 2 not available)")
    print("-" * 60)
    print()

# --- Benchmark ---
print("-" * 60)
print("Benchmark")
print("-" * 60)

N = 2_000_000
INNER = 1000

benchmarks = [
    ("total = a + b + c", lambda: f_adds(INNER, 2.0, 3.0, 4.0)),
    ("total += a*b + c  ", lambda: f_chain(INNER, 2.0, 3.0, 4.0)),
    ("total += a + b    ", lambda: f_simple_add(INNER, 2.0, 3.0)),
    ("total += a*b + c*d", lambda: f_long_chain(INNER, 2.0, 3.0, 4.0, 5.0)),
    ("total = - (a + b)    ", lambda: f_negate(INNER, 2.0, 3.0)),
]

for label, fn in benchmarks:
    iters = N // INNER
    t = timeit.timeit(fn, number=iters)
    ns_per = t / N * 1e9
    print(f"  {label}:  {t:.3f}s  ({ns_per:.0f} ns/iter)")

print()
print("The 'a*b + c' case benefits from _BINARY_OP_ADD_FLOAT_INPLACE:")
print("the result of a*b is uniquely referenced, so the addition")
print("mutates it in place instead of allocating a new float.")

# --- N-body benchmark from pyperformance ---
print()
print("-" * 60)
print("N-body benchmark (from pyperformance)")
print("-" * 60)

PI = 3.14159265358979323
SOLAR_MASS = 4 * PI * PI
DAYS_PER_YEAR = 365.24

def _nbody_make_system():
    bodies = [
        # sun
        ([0.0, 0.0, 0.0], [0.0, 0.0, 0.0], SOLAR_MASS),
        # jupiter
        ([4.84143144246472090e+00, -1.16032004402742839e+00, -1.03622044471123109e-01],
         [1.66007664274403694e-03*DAYS_PER_YEAR, 7.69901118419740425e-03*DAYS_PER_YEAR, -6.90460016972063023e-05*DAYS_PER_YEAR],
         9.54791938424326609e-04*SOLAR_MASS),
        # saturn
        ([8.34336671824457987e+00, 4.12479856412430479e+00, -4.03523417114321381e-01],
         [-2.76742510726862411e-03*DAYS_PER_YEAR, 4.99852801234917238e-03*DAYS_PER_YEAR, 2.30417297573763929e-05*DAYS_PER_YEAR],
         2.85885980666130812e-04*SOLAR_MASS),
        # uranus
        ([1.28943695621391310e+01, -1.51111514016986312e+01, -2.23307578892655734e-01],
         [2.96460137564761618e-03*DAYS_PER_YEAR, 2.37847173959480950e-03*DAYS_PER_YEAR, -2.96589568540237556e-05*DAYS_PER_YEAR],
         4.36624404335156298e-05*SOLAR_MASS),
        # neptune
        ([1.53796971148509165e+01, -2.59193146099879641e+01, 1.79258772950371181e-01],
         [2.68067772490389322e-03*DAYS_PER_YEAR, 1.62824170038242295e-03*DAYS_PER_YEAR, -9.51592254519715870e-05*DAYS_PER_YEAR],
         5.15138902046611451e-05*SOLAR_MASS),
    ]
    pairs = []
    for x in range(len(bodies) - 1):
        for y in bodies[x + 1:]:
            pairs.append((bodies[x], y))
    return bodies, pairs

def _nbody_advance(dt, n, bodies, pairs):
    for i in range(n):
        for (([x1, y1, z1], v1, m1), ([x2, y2, z2], v2, m2)) in pairs:
            dx = x1 - x2
            dy = y1 - y2
            dz = z1 - z2
            mag = dt * ((dx * dx + dy * dy + dz * dz) ** (-1.5))
            b1m = m1 * mag
            b2m = m2 * mag
            v1[0] -= dx * b2m
            v1[1] -= dy * b2m
            v1[2] -= dz * b2m
            v2[0] += dx * b1m
            v2[1] += dy * b1m
            v2[2] += dz * b1m
        for (r, [vx, vy, vz], m) in bodies:
            r[0] += dt * vx
            r[1] += dt * vy
            r[2] += dt * vz

def _nbody_report_energy(bodies, pairs, e=0.0):
    for (((x1, y1, z1), v1, m1), ((x2, y2, z2), v2, m2)) in pairs:
        dx = x1 - x2
        dy = y1 - y2
        dz = z1 - z2
        e -= (m1 * m2) / ((dx * dx + dy * dy + dz * dz) ** 0.5)
    for (r, [vx, vy, vz], m) in bodies:
        e += m * (vx * vx + vy * vy + vz * vz) / 2.
    return e

def _nbody_offset_momentum(ref, bodies, px=0.0, py=0.0, pz=0.0):
    for (r, [vx, vy, vz], m) in bodies:
        px -= vx * m
        py -= vy * m
        pz -= vz * m
    (r, v, m) = ref
    v[0] = px / m
    v[1] = py / m
    v[2] = pz / m

def bench_nbody(iterations=20000):
    bodies, pairs = _nbody_make_system()
    _nbody_offset_momentum(bodies[0], bodies)
    _nbody_report_energy(bodies, pairs)
    _nbody_advance(0.01, iterations, bodies, pairs)
    _nbody_report_energy(bodies, pairs)

# Warmup
bench_nbody(1000)

NBODY_RUNS = 5
t = timeit.timeit(lambda: bench_nbody(20000), number=NBODY_RUNS)
print(f"  nbody (20k iterations, {NBODY_RUNS} runs): {t/NBODY_RUNS*1000:.1f} ms/run")

…y-referenced operands in place

When the tier 2 optimizer can prove that an operand to a float
operation is uniquely referenced (refcount 1), mutate it in place
instead of allocating a new PyFloatObject.

New tier 2 micro-ops:
- _BINARY_OP_{ADD,SUBTRACT,MULTIPLY}_FLOAT_INPLACE (unique LHS)
- _BINARY_OP_{ADD,SUBTRACT,MULTIPLY}_FLOAT_INPLACE_RIGHT (unique RHS)
- _UNARY_NEGATIVE_FLOAT_INPLACE (unique operand)

Speeds up the pyperformance nbody benchmark by ~19%.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Avoid compound assignment (+=, -=, *=) directly on ob_fval in
inplace float ops. On 32-bit Windows, this generates JIT stencils
with _xmm register references that MSVC cannot parse. Instead,
read into a local double, compute, and write back.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add -mno-sse to clang args for i686-pc-windows-msvc target. The COFF32
stencil converter cannot handle _xmm register references that clang
emits for inline float arithmetic. Using x87 FPU instructions avoids
this. SSE is optional on 32-bit x86; x87 is the baseline.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@eendebakpt eendebakpt changed the title Draft: gh-146306: Optimize float operations by mutating uniquely-referenced operands in place (JIT only) gh-146306: Optimize float operations by mutating uniquely-referenced operands in place (JIT only) Mar 22, 2026
@eendebakpt
Contributor Author

Selected pyperformance benchmarks:

### chaos ###
Mean +- std dev: 60.7 ms +- 1.5 ms -> 59.5 ms +- 1.5 ms: 1.02x faster
Not significant

### fannkuch ###
Mean +- std dev: 376 ms +- 4 ms -> 375 ms +- 3 ms: 1.00x faster
Not significant

### float ###
Mean +- std dev: 60.9 ms +- 3.0 ms -> 58.8 ms +- 2.7 ms: 1.04x faster
Significant (t=4.00)

### nbody ###
Mean +- std dev: 87.2 ms +- 1.1 ms -> 72.3 ms +- 0.4 ms: 1.21x faster
Significant (t=98.34)

### raytrace ###
Mean +- std dev: 304 ms +- 5 ms -> 299 ms +- 5 ms: 1.02x faster
Not significant

### scimark_fft ###
Mean +- std dev: 305 ms +- 1 ms -> 297 ms +- 3 ms: 1.03x faster
Significant (t=19.01)

### scimark_lu ###
Mean +- std dev: 86.5 ms +- 0.4 ms -> 84.8 ms +- 0.4 ms: 1.02x faster
Not significant

### scimark_monte_carlo ###
Mean +- std dev: 60.7 ms +- 0.7 ms -> 59.8 ms +- 0.3 ms: 1.02x faster
Not significant

### scimark_sor ###
Mean +- std dev: 101 ms +- 1 ms -> 99 ms +- 3 ms: 1.03x faster
Significant (t=7.38)

### scimark_sparse_mat_mult ###
Mean +- std dev: 5.53 ms +- 0.02 ms -> 5.14 ms +- 0.02 ms: 1.08x faster
Significant (t=130.74)

### spectral_norm ###
Mean +- std dev: 84.0 ms +- 0.6 ms -> 79.0 ms +- 0.4 ms: 1.06x faster
Significant (t=53.73)

r = right;
if (PyJitRef_IsUnique(left)) {
    ADD_OP(_BINARY_OP_SUBTRACT_FLOAT_INPLACE, 0, 0);
    l = PyJitRef_Borrow(left);
Member

Isn't it more correct to say l = sym_new_null(ctx);? Same for below?

Contributor Author

To make this work I had to change the definition _POP_TOP_FLOAT. I think the change is ok, but please double check.

Member

It should be PyJitRef_Borrow(sym_new_null(ctx)); sorry for not being clear.

Member

That way there's no need to change the case in _POP_TOP_FLOAT.

Contributor Author

Thanks for clarifying. Now _POP_TOP_FLOAT is unchanged.

Member

@markshannon markshannon left a comment

This looks like a nice performance improvement for floating-point code. I have a few comments.

// Note: read into a local double and write back to avoid compound
// assignment (+=) on ob_fval, which generates problematic JIT
// stencils on i686-pc-windows-msvc.
tier2 op(_BINARY_OP_ADD_FLOAT_INPLACE, (left, right -- res, l, r)) {
Member

@markshannon markshannon Mar 23, 2026

This op and its variants all share a lot of common code.
Could you factor out the code into a macro to perform the inplace operation?
Like:

tier2 op(_BINARY_OP_ADD_FLOAT_INPLACE, (left, right -- res, l, r)) {
    res = FLOAT_INPLACE_OP(left, +, right);
    l = PyStackRef_NULL;
    r = right;
    INPUTS_DEAD();
}

tier2 op(_BINARY_OP_MULTIPLY_FLOAT_INPLACE_RIGHT, (left, right -- res, l, r)) {
    res = FLOAT_INPLACE_OP(right, *, left);
    l = left;
    r = PyStackRef_NULL;
    INPUTS_DEAD();
}

Contributor Author

Normal C macros are not allowed in the bytecodes.c opcodes (they are not expanded). I have not yet found a way to refactor this nicely.

We could add a new macro to the DSL (like INPUTS_DEAD) for this, but it feels a bit odd to add something to the DSL for this particular case.

Member

Normal C macros are not allowed in the bytecodes.c opcodes

They are. You just need to define them in ceval_macros.h. Would you like me to do this, or would you like to do it?

eendebakpt and others added 9 commits March 24, 2026 10:51
Co-authored-by: Ken Jin <kenjin4096@gmail.com>
The inplace ops set l or r to PyStackRef_NULL at runtime, so the
optimizer should model this as sym_new_null(ctx) rather than
PyJitRef_Borrow(). Both produce _POP_TOP_NOP but sym_new_null
correctly matches the runtime semantics.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# Conflicts:
#	Include/internal/pycore_uop_ids.h
Member

@Fidget-Spinner Fidget-Spinner left a comment

Awesome. Thanks for doing this!

In the future, this will be partially superseded by the partial evaluation pass, as any Py_REFCNT(f) == 1 op is also safe to unbox. However, for 3.15 this is a good win.

@eendebakpt
Contributor Author

In the future, this will be partially superseded by the partial evaluation pass, as any Py_REFCNT(f) == 1 op is also safe to unbox. However, for 3.15 this is a good win.

Good to know! Given that, which of the operations listed in the followup section of the first post do you think would be candidates to add in new PRs?

@Fidget-Spinner
Member

Fidget-Spinner commented Mar 24, 2026

Good to know! Given that, which of the operations listed in the followup section of the first post do you think would be candidates to add in new PRs?

I think compact_int OP compact_int is likely the next most common/worthwhile. The remaining mixed float-int stuff is less common in benchmarks.

They might be worthwhile, hard to tell.
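
The small-int caveat mentioned in the follow-up discussion is easy to demonstrate: CPython caches the ints -5 through 256, so those objects are shared (and immortal on recent versions) and can never be mutated in place, and larger ints vary in allocation size with their digit count. A quick illustration:

```python
import sys

# Small ints are interned: two independently computed 256s are the
# same object, so an in-place integer uop could never touch them.
a = 100 + 156
b = 2 ** 8
assert a is b  # same cached object

# Bigger ints are variable-width: a result may need more internal
# "digits" than the operand's allocation provides, so in-place reuse
# would also need a size check.
assert sys.getsizeof(10 ** 30) > sys.getsizeof(257)
```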

@Fidget-Spinner Fidget-Spinner merged commit 951675c into python:main Mar 24, 2026
89 checks passed