Add hipCUB support for scan (prefix sum) op#668

Open
magaonka-amd wants to merge 1 commit into ROCm:main from magaonka-amd:feature/hipcub-scan-support-downstream

Conversation

@magaonka-amd

  • Add hipCUB DeviceScan::InclusiveSum kernel implementation mirroring the CUDA cub_scan_kernel from commit 75f8001
  • Register ROCm FFI handler for xla.gpu.ext.cub_scan with runtime type dispatch for all 12 supported types
  • Add parameterized kernel test covering all types, row/column sizes, and scan configurations

Submission Checklist

@magaonka-amd magaonka-amd marked this pull request as ready for review March 16, 2026 22:10
Comment on lines +53 to +58
for (int64_t col = 0; col < column_length; ++col) {
TF_RETURN_IF_ERROR(
stream_executor::gpu::ToStatus(hipcub::DeviceScan::InclusiveSum(
d_temp_storage, temp_bytes, d_in + col * row_length,
d_out + col * row_length, row_length, stream)));
}
Performance: per-column host loop issues N separate kernel launches

This loop launches one hipcub::DeviceScan::InclusiveSum per column, which means column_length separate GPU kernel launches. For large column_length values, the launch overhead will dominate.

The CUDA counterpart avoids this by using a custom BlockScanKernel with gridDim = column_length — all columns are processed in a single kernel launch, one block per column (see cub_scan_kernel_cuda_impl.cu.cc:122).

Consider implementing a similar custom hipCUB BlockScan-based kernel for the ROCm path to achieve comparable performance, or at minimum documenting this known performance gap with a TODO.
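For reference, a minimal host-side sketch of the semantics any fused single-launch kernel must reproduce (assuming the layout used by the loop above: `column_length` independent contiguous segments of `row_length` elements each). The function name `BatchedInclusiveSum` is illustrative, not from the PR; the real kernel would map one GPU block per segment instead of the outer host loop.

```cpp
#include <cstdint>
#include <vector>

// Reference model for the batched inclusive scan: column_length independent
// segments, each row_length contiguous elements. Segment `col` starts at
// offset col * row_length, matching `d_in + col * row_length` in the PR.
std::vector<int64_t> BatchedInclusiveSum(const std::vector<int64_t>& in,
                                         int64_t row_length,
                                         int64_t column_length) {
  std::vector<int64_t> out(in.size());
  for (int64_t col = 0; col < column_length; ++col) {
    int64_t running = 0;  // prefix sum restarts at every segment boundary
    for (int64_t i = 0; i < row_length; ++i) {
      running += in[col * row_length + i];
      out[col * row_length + i] = running;
    }
  }
  return out;
}
```

A block-per-segment kernel produces exactly this output in one launch, since segments are independent and never carry a prefix across the `col * row_length` boundary.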

Collaborator

@magaonka-amd have you checked whether hipCUB provides the BlockLoad we would need?


Yes, I agree. I see they recently changed from the host-side InclusiveSum to a device-side call made from a custom BlockScanKernel: openxla#39238

Author

Thanks for the feedback, I'll explore this today and get back with the update.

Comment on lines +48 to +51
if (d_in == nullptr) {
return stream_executor::gpu::ToStatus(hipcub::DeviceScan::InclusiveSum(
d_temp_storage, temp_bytes, d_in, d_out, row_length, stream));
}

Nit: query-mode sentinel checks d_in instead of d_temp_storage

The hipCUB convention for "query scratch size" is d_temp_storage == nullptr. This code uses d_in == nullptr as the sentinel instead. It works today because CubScanGetScratchSize (line 123) passes both as nullptr, but checking d_temp_storage == nullptr would be more idiomatic and robust against future callers that might pass a valid d_temp_storage with null d_in.

Suggested change
- if (d_in == nullptr) {
+ if (d_temp_storage == nullptr) {
    return stream_executor::gpu::ToStatus(hipcub::DeviceScan::InclusiveSum(
        d_temp_storage, temp_bytes, d_in, d_out, row_length, stream));
  }

Comment on lines +91 to +93
GTEST_SKIP() << "BF16 for row length > 128 has precision issues.",
absl::OkStatus();
}

Nit: GTEST_SKIP() comma-expression in non-void function

GTEST_SKIP() expands to the start of a return statement in Google Test, so the comma expression GTEST_SKIP() << "...", absl::OkStatus() compiles to return (<record skip>, absl::OkStatus()); — the skip is recorded and OkStatus() is the value returned. This works, but it relies on the macro beginning with return (GTEST_SKIP() on its own is only usable in void-returning functions, which is why the comma with absl::OkStatus() is needed here). A brief comment would make the intent explicit:

if (type == xla::PrimitiveType::BF16 && row_length > 128) {
  // GTEST_SKIP() opens a `return` statement; the comma operator makes
  // absl::OkStatus() the returned value after the skip is recorded.
  GTEST_SKIP() << "BF16 for row length > 128 has precision issues.",
      absl::OkStatus();
}

This is inherited from the CUDA test, so feel free to leave as-is for consistency if preferred.

@claude

claude bot commented Mar 16, 2026

Review Summary

Good implementation of hipCUB-based InclusiveSum for the ROCm scan op, with clean FFI registration and thorough parameterized testing across 12 data types.

Main concern: The per-column host loop in the impl issues one hipcub::DeviceScan::InclusiveSum launch per column, whereas the CUDA counterpart uses a custom BlockScanKernel that processes all columns in a single launch (gridDim = column_length). This may cause a significant performance gap for multi-column scans.

Two minor nits posted inline (sentinel check, test skip pattern).

🤖 Generated with Claude Code

@magaonka-amd magaonka-amd force-pushed the feature/hipcub-scan-support-downstream branch from 316f257 to 015864a Compare March 18, 2026 16:09
- Add custom BlockScanKernel using rocPRIM block-level primitives
  (block_scan, block_load, block_store) for efficient batched scan
- Single kernel launch with gridDim = column_length, one block per column
- Register ROCm FFI handler for xla.gpu.ext.cub_scan with runtime type
  dispatch for all 12 supported types
- Tuning via rocPRIM default_scan_config_base (architecture-aware)
- Add parameterized kernel test and performance benchmarks
@magaonka-amd magaonka-amd force-pushed the feature/hipcub-scan-support-downstream branch from 015864a to 85b82bb Compare March 18, 2026 23:28

3 participants