Enable ROCm Triton backend for AllReduce#684

Open
mfrancepillois wants to merge 9 commits into rocm-jaxlib-v0.9.1 from ci_maxime_allreduce_triton_rocm_elementwise_rocm

Conversation

@mfrancepillois

📝 Summary of Changes
This PR enables the ROCm Triton backend for AllReduce (collective emitter).
To this end:

  • Add new passes to lower the atomic operations triton_xla.atomic_write, triton_xla.atomic_spin_wait, and triton_xla.get_tid. These passes rely on Triton extern_elementwise operations, thereby avoiding target-specific inline assembly. The extern_elementwise ops are caught later in the compiler pipeline and replaced by LLVM intrinsics.
  • Add the missing RocmExecutor::CanEnablePeerAccessTo(int other_device_ordinal) API to rocm_executor (this API is required to enable the collective_emitter thunk).

🎯 Justification
Prior to this PR, the triton_xla.get_tid, triton_xla.atomic_write, and triton_xla.atomic_spin_wait operations were lowered using PTX assembly, so the AllReduce Triton backend was only available for the CUDA target.
This PR adds a new way to lower these operations using only Triton operations (extern_elementwise).
As a result, the Triton backend for AllReduce is now available for the ROCm target.

🚀 Kind of Contribution
✨ New Feature

🧪 Unit Tests:
This PR includes a LIT test checking the lowering of atomic operations.

@mfrancepillois mfrancepillois force-pushed the ci_maxime_allreduce_triton_rocm_elementwise_rocm branch from 08ee4a7 to 1daa0d9 Compare March 18, 2026 14:51
@claude

claude bot commented Mar 18, 2026

Claude Code Review Summary

This PR enables the ROCm Triton backend for AllReduce collective operations via tt.extern_elementwise ops and a ROCm-specific LLVM IR implementation pass. The overall architecture — two-stage lowering (high-level ops → extern calls → platform-specific LLVM atomics) — is clean and extensible.

Key issues found (see inline comments for details):

  • Masks silently ignored in both LowerAtomicWriteOp and LowerAtomicSpinWaitOp — the CUDA pass handles masks via predicated stores/waits, but the extern lowering discards them entirely. The spin wait case can cause GPU hangs.
  • Signed vs unsigned comparison mismatch — the ROCm spin wait uses slt (signed) while the CUDA PTX uses setp.lt.u32 (unsigned), which can produce incorrect results for large signal values.
  • Syncscope "one-as" for system scope may not provide cross-device visibility guarantees on AMDGPU — needs verification against the AMDGPU memory model docs.
  • No validation of memory semantics — the CUDA pass rejects unsupported semantics (e.g., ACQUIRE for writes); this pass silently accepts them.
  • default: return "unknown" in enum-to-string switches will generate unmatchable function names instead of failing early.
  • Minor: duplicated scope-parsing logic, fragile substring matching, misplaced #include.

Missing test coverage for TritonXLAImplementExternAtomicsROCmPass and TritonXLALowerExternGetTidPass.
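The signed/unsigned mismatch flagged above is easy to demonstrate outside MLIR. A minimal C++ sketch (function names and values are illustrative, not taken from the PR):

```cpp
#include <cstdint>

// What the CUDA PTX `setp.lt.u32` computes: unsigned less-than.
bool KeepSpinningUnsigned(uint32_t loaded, uint32_t target) {
  return loaded < target;
}

// What a signed `icmp slt` on the same 32 bits computes.
bool KeepSpinningSigned(uint32_t loaded, uint32_t target) {
  return static_cast<int32_t>(loaded) < static_cast<int32_t>(target);
}
```

With loaded = 0x80000001 (a signal counter past INT32_MAX) and target = 0x10, the unsigned compare stops spinning, while the signed compare sees a negative loaded value and keeps spinning.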

@claude

claude bot commented Mar 18, 2026

Re-review Summary

Re-reviewed the latest diff. All 8 previously flagged issues remain unaddressed — no new findings. Key open items:

  • Masks silently discarded in both LowerAtomicWriteOp and LowerAtomicSpinWaitOp (risk of GPU hang / memory corruption)
  • Signed vs unsigned comparison mismatch with CUDA path in spin-wait
  • Incorrect AMDGPU syncscopes ("one-as" alone likely insufficient for system scope)
  • No semantic validation and hardcoded atomic orderings ignoring function-name-encoded semantics
  • Fragile substring matching and duplicated scope-parsing logic
  • Missing test coverage for TritonXLAImplementExternAtomicsROCmPass and TritonXLALowerExternGetTidPass
  • Include ordering issue in rocm_executor.cc

See existing inline comments for details.

🤖 Generated with Claude Code

@mfrancepillois mfrancepillois force-pushed the ci_maxime_allreduce_triton_rocm_elementwise_rocm branch from a866b69 to e885196 Compare March 18, 2026 16:00
@claude

claude bot commented Mar 18, 2026

Re-review Summary

Excellent progress — 7 of 8 previous findings have been addressed in this revision:

  • Mask handling: Both atomic_write and atomic_spin_wait now properly pass and handle masks
  • Exhaustive switches: default cases replaced with exhaustive switch + LOG(FATAL)
  • Semantic validation: Unsupported semantics now rejected via notifyMatchFailure
  • Sync scope: System scope corrected to empty string per AMDGPU convention
  • Unsigned comparison: Spin-wait comparator now uses ICmpPredicate::ult matching CUDA
  • Helper extraction: ParseSyncScope extracted; EndsWithComparator used for suffix matching

1 minor item remaining: #include <unistd.h> placement in rocm_executor.cc (between ROCm and XLA headers — should be in the system headers group).

Overall this is looking solid. The two-stage lowering design is clean and the ROCm atomics implementation is well-structured.

🤖 Generated with Claude Code

@claude

claude bot commented Mar 18, 2026

Re-review Summary

Good progress since the last review — 6 of 8 previous findings have been addressed (mask handling, validation, syncscope, signed/unsigned comparison, scope parsing). The include ordering issue in rocm_executor.cc is still present.

New findings (see inline comments):

  • Bug (high): SSA domination issue in masked and unmasked spin-wait — replaceAllUsesWith references a value not defined on all control-flow paths (triton_xla_implement_extern_atomics_rocm_pass.cc:213)
  • Bug (high): Scalar atomic write path silently discards the mask operand (triton_xla_lower_extern_atomics_pass.cc:162-169)
  • Nit: Memory ordering hardcoded to release/acquire even when relaxed was requested — safe but suboptimal
  • Nit: Substring matching in ParseSyncScope is fragile
  • Nit: Mask detection by operand count without documentation
  • Testing gap: No tests for masked operations, ROCm implementation pass, or GetTid pass

🤖 Generated with Claude Code

@claude

claude bot commented Mar 18, 2026

Re-review Summary

Re-reviewed the latest diff. All previously flagged issues remain applicable — no new findings. Key outstanding items from prior review:

  • SSA dominance bug in spin-wait loop (triton_xla_implement_extern_atomics_rocm_pass.cc:213,248) — loaded value defined inside loop block used after exit
  • Hardcoded release ordering ignoring _relaxed_ semantic in function names
  • Missing test coverage for the ROCm-specific atomics pass and the get_tid pass

🤖 Generated with Claude Code

@claude

claude bot commented Mar 18, 2026

Re-review Summary (commit 90ec42c)

All previously flagged issues have been addressed in this revision. Key fixes verified:

  • Mask handling: Scalar and vectorized mask paths now handled correctly with proper operand count checks
  • Memory ordering: Parsed from function names instead of hardcoded
  • SSA domination: Fixed via block arguments
  • Sync scope parsing: Uses robust field extraction
  • Unsigned comparison: Spin-wait loop now uses unsigned icmp as required
  • Test coverage: Comprehensive tests added for masked/unmasked operations in both lowering and ROCm implementation passes

No new issues found. One minor prior note (include ordering of <unistd.h> in rocm_executor.cc) remains outstanding from the initial review.

🤖 Generated with Claude Code

#if defined(TENSORFLOW_USE_ROCM)
// ROCm: Use constant value directly as ROCDL dialect doesn't define memory
// space enum
static constexpr int32_t kGlobalAddressSpace = 1;


bool is_supported = false;

// CUDA: Requires compute capability 9.0+ (Hopper or newer)
if (device_info.cuda_compute_capability().major >= 9) {


Where did this check exist before this change?


// Atomic block: perform atomic exchange
builder.setInsertionPointToStart(atomic_block);
auto atomic_xchg = LLVM::AtomicRMWOp::create(


We expect the result not to be used; if it is, that is a bug.

// Loop block: spin wait
builder.setInsertionPointToStart(loop_block);
auto loaded = LLVM::LoadOp::create(
builder, loc, i32_type, addr, 4, false, false, false, false,


Is it an atomic load? It is hard to follow. Maybe inline-comment (/* arg_name */ false) each of the bool args.


NVM, I see it from the test; still, please add the comments.
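For readers following the builder calls quoted above: the generated loop is morally an acquire spin-wait paired with a release write. A C++ sketch of the intended semantics (names are illustrative, assuming the unsigned compare and release/acquire pairing discussed in this review):

```cpp
#include <atomic>
#include <cstdint>

// Spin until the signal slot reaches `target`. The acquire load guarantees
// that writes made before the matching release store are visible once the
// wait completes; the comparison is unsigned, like the PTX `setp.lt.u32`.
void SpinWait(std::atomic<uint32_t>& slot, uint32_t target) {
  while (slot.load(std::memory_order_acquire) < target) {
    // busy-wait until another thread/device bumps the slot
  }
}

// The counterpart of the emitted atomic write: a release store.
void SignalWrite(std::atomic<uint32_t>& slot, uint32_t value) {
  slot.store(value, std::memory_order_release);
}
```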

mlir::ValueRange{exit_block->getArgument(0)});
call_op.erase();
} else {
// Unmasked spin wait: direct loop


Can you unify these two paths? Maybe via a lambda that you can call in both places.


// Clean up unused extern function declarations
llvm::SmallVector<LLVM::LLVMFuncOp> to_erase;
module.walk([&](LLVM::LLVMFuncOp func) {


Nice touch!

// Function names follow pattern: xla_atomic_*_<semantic>_<scope>[_<comparator>]
std::string ParseSyncScope(const std::string& func_name) {
// Per AMDGPU memory model (Table 31):
// - "" (empty) = system scope (cross-device visibility)


Same for nvptx backend.
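To make the suggestion concrete: matching whole '_'-separated fields avoids the fragile substring matching flagged earlier (e.g. "_system" matching inside another token). A hedged C++ sketch; the gpu/cta to agent/workgroup mapping here is an assumption for illustration, not taken verbatim from the pass:

```cpp
#include <string>
#include <vector>

// Split a function name like "xla_atomic_write_release_system" into its
// '_'-separated fields.
std::vector<std::string> SplitFields(const std::string& name) {
  std::vector<std::string> fields;
  size_t start = 0;
  while (start <= name.size()) {
    size_t end = name.find('_', start);
    if (end == std::string::npos) end = name.size();
    fields.push_back(name.substr(start, end - start));
    start = end + 1;
  }
  return fields;
}

// Map the scope token to an AMDGPU syncscope string. Per the AMDGPU memory
// model, the empty syncscope means system scope (cross-device visibility).
// The "gpu" -> "agent" and "cta" -> "workgroup" mappings are illustrative.
std::string ParseSyncScope(const std::string& func_name) {
  for (const std::string& field : SplitFields(func_name)) {
    if (field == "system") return "";
    if (field == "gpu") return "agent";
    if (field == "cta") return "workgroup";
  }
  return "";  // default: system scope
}
```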

// CHECK-NOT: llvm.call @xla_get_thread_id
// CHECK: [[TID:%.*]] = llvm.call_intrinsic "llvm.amdgcn.workitem.id.x"() : () -> i32
// CHECK: llvm.return [[TID]]
%tid = llvm.call @xla_get_thread_id() : () -> i32


maybe prefix them with __triton_xla

auto value = operands[1];
mlir::Value mask = operands.size() > 2 ? operands[2] : mlir::Value{};

std::string syncscope = ParseSyncScope(callee_name);


StringRefs (consider llvm::StringRef instead of std::string here).

// Exit block: phi node to select result
exit_block->addArgument(i32_type, loc);
call_op.replaceAllUsesWith(
mlir::ValueRange{exit_block->getArgument(0)});


Is the result expected to be used? If not, poison it.

@i-chaochen i-chaochen added the claude-review Request a Claude AI code review for this PR label Mar 24, 2026
@claude

claude bot commented Mar 24, 2026

Re-review Summary (Claude)

All 15 previously flagged issues have been addressed — nice work on the iteration.

New findings this round:

  1. [Critical] ROCm atomic write implementation uses LLVM::StoreOp but tests expect llvm.atomicrmw xchg — these are different instructions and the masked path also lacks block arguments for PHI merging. The LIT tests will fail as written. (inline comment)

  2. [Suggestion] Stale comment in triton_xla_lower_extern_atomics_pass.cc still says "ROCm-specific" but the lowering is now target-agnostic. (inline comment)

  3. [Suggestion] Missing LIT test for the CUDA implementation pass (triton_xla_implement_extern_atomics_cuda_pass.cc). (inline comment)

The critical item (#1) should be resolved before merge — either fix the implementation to use AtomicRMWOp or update the tests.
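On item 1, the behavioral difference is visible even in plain C++ with std::atomic: an atomic store writes a value and returns nothing, while an exchange also returns the previous contents, which is what llvm.atomicrmw xchg produces. A sketch with illustrative names:

```cpp
#include <atomic>
#include <cstdint>

// Writes `value` atomically; there is no result to consume. This is the
// shape of an LLVM atomic `store ... release`.
void AtomicWrite(std::atomic<uint32_t>& slot, uint32_t value) {
  slot.store(value, std::memory_order_release);
}

// Writes `value` atomically AND returns the previous contents. This is the
// shape of an LLVM `atomicrmw xchg`.
uint32_t AtomicExchange(std::atomic<uint32_t>& slot, uint32_t value) {
  return slot.exchange(value, std::memory_order_release);
}
```

Both write the same value into memory, so if the result is never consumed the runtime behavior coincides; LIT tests, however, match the textual IR and will fail if the implementation emits one form while the CHECK lines expect the other.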

@github-actions github-actions bot removed the claude-review Request a Claude AI code review for this PR label Mar 24, 2026
@@ -0,0 +1,276 @@
/* Copyright 2025 The OpenXLA Authors.


OK, this is a reasonably conservative approach. Have you checked whether CUDA actually needs it? Does the generated PTX differ that much when going through LLVM IR?

Author

No, I haven't. I struggled a bit to test on the CUDA target. But is it really up to us to modify the CUDA-specific compilation passes? I assume they probably had a reason for preferring inline PTX over LLVM intrinsics when they implemented this pass in the first place.


Yes, because it is easy for them to use PTX. And every time one does a CUDA/ROCm split, one side bitrots: ours. Try asking Chao for an NV machine to check on.

Author

In my view, it's simpler to maintain one pass dedicated exclusively to ROCm and another dedicated to CUDA rather than a single pass (especially since the Triton pipeline changes from one target to another anyway). Anyway, I updated the support and compared the generated PTX, which is similar:

1. Thread ID Retrieval (Line ~67 in both files)

Old (PTX Assembly):

// begin inline asm

    mov.u32 %r8, %tid.x;
  
// end inline asm

New (Intrinsics):

mov.u32 	%r2, %tid.x;

2. Atomic Store (Line ~75-80)

Old (PTX Assembly):

// begin inline asm

    st.global.sys.release.u32 [%rd25], %r12;
  
// end inline asm

New (Intrinsics):

st.release.sys.global.b32 	[%rd30], %r4;

3. Atomic Spin-Wait Loop (Line ~85-95)

Old (PTX Assembly):

// begin inline asm

    {
    .reg .pred %p<1>;
    .reg .b32 %r<1>;
    wait:
      ld.global.sys.acquire.u32 %r0, [%rd28];
      setp.lt.u32 %p0, %r0, %r12;
      @%p0 bra wait;
    }
  
// end inline asm

New (Intrinsics):

$L__BB0_2:
	ld.acquire.sys.global.b32 	%r14, [%rd7];
	setp.lt.u32 	%p2, %r14, %r4;
	@%p2 bra 	$L__BB0_2;

@mfrancepillois mfrancepillois marked this pull request as ready for review March 24, 2026 12:22
@mfrancepillois mfrancepillois force-pushed the ci_maxime_allreduce_triton_rocm_elementwise_rocm branch 3 times, most recently from 37396a3 to 0cc93da Compare March 25, 2026 17:44
if (func_name.contains("_system")) {
return ""; // System scope for cross-GPU visibility
} else if (func_name.contains("_gpu")) {
return "gpu";


Are you sure these are correct? I would expect "device" and "block". But you are right, this is cumbersome. Sorry for driving you around. Maybe stick to a per-arch pass, but have this be common logic that both can call, passing in the scope names and the threadIdx.x intrinsic name?

Author

Scope names have been changed (gpu and cta are indeed aliases for device and block).
But for the second point, I'm not sure I follow what you would like to have: two different passes using intrinsics?

Author

@draganmladjenovic Maybe we could open a PR upstream with this support (common pass) and see what google/nvidia guys say about it?

@mfrancepillois mfrancepillois force-pushed the ci_maxime_allreduce_triton_rocm_elementwise_rocm branch from 0cc93da to e5310e5 Compare March 26, 2026 11:44