Skip to content

[ROCm] Use BFCAllocator for scratch allocations needed for MIOpen aut…#756

Open
draganmladjenovic wants to merge 1 commit intorocm-jaxlib-v0.9.1from
draganm/rocm-jaxlib-v0.9.1-conv-scratch
Open

[ROCm] Use BFCAllocator for scratch allocations needed for MIOpen aut…#756
draganmladjenovic wants to merge 1 commit intorocm-jaxlib-v0.9.1from
draganm/rocm-jaxlib-v0.9.1-conv-scratch

Conversation

@draganmladjenovic
Copy link

@draganmladjenovic draganmladjenovic commented Mar 25, 2026

…otuning

Motivation

Technical Details

openxla#39622

@draganmladjenovic draganmladjenovic force-pushed the draganm/rocm-jaxlib-v0.9.1-conv-scratch branch from 0341010 to 9ffc786 Compare March 25, 2026 12:52
@i-chaochen i-chaochen self-requested a review March 25, 2026 13:16
@i-chaochen i-chaochen added cherry-pick-candidate Mark a PR to be cherry-picked into the next ROCm JAX. Remove IIF the latest upstream contain the PR. rocm-jaxlib-v0.9.1 labels Mar 25, 2026
@draganmladjenovic draganmladjenovic added sanitizers Execute sanitizer pipelines for that PR claude-review Request a Claude AI code review for this PR labels Mar 26, 2026
private:
bool IsSupported(const HloInstruction& instr) override;
bool do_not_autotune_;
stream_executor::DeviceAddressAllocator* allocator_;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: The allocator_ member is a raw non-owning DeviceAddressAllocator* that can potentially be null. In GpuCompiler::AddConvAndGemmAutotuningPass (gpu_compiler.cc:3293), options.device_allocator is passed through, and CompileOptions::device_allocator defaults to nullptr (compiler.h:129). The old code was self-contained (created its own StreamExecutorMemoryAllocator locally), so this path was never null-unsafe before.

Consider adding a CHECK(allocator != nullptr) in the MIOpenBackend constructor, or at minimum documenting the non-null precondition.

struct GetCodegenBackends {
using Type = std::function<std::vector<std::unique_ptr<CodegenBackend>>(
stream_executor::StreamExecutor*, const DebugOptions*, Compiler*,
stream_executor::StreamExecutor*,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit (IWYU): factory.h uses DeviceAddressAllocator* in this signature but doesn't directly include xla/stream_executor/device_address_allocator.h. It relies on a transitive include through compiler.h. If compiler.h ever stops including that header, this file would break. Consider adding a direct include.

size_t result_buffers_count = instr->shape().tuple_shapes().size();
result_buffers.reserve(result_buffers_count);
absl::InlinedVector<se::DeviceAddressBase, 1> result_buffers;
size_t result_buffers_count = instr->shape().tuple_shapes().size() - 1;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: The fix from .size() to .size() - 1 (to exclude the workspace u8[0] element) is correct, but it's a logically independent bug fix from the allocator refactoring. Consider separating this into its own commit for cleaner bisect history.


std::vector<std::unique_ptr<CodegenBackend>> GetCodegenBackendsForCuda(
stream_executor::StreamExecutor* stream_executor,
stream_executor::DeviceAddressAllocator* device_allocator,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: device_allocator is unused in the CUDA path. Consider adding [[maybe_unused]] or (void)device_allocator; to suppress potential compiler warnings and make the intent explicit.

GetDNNDataTypeFromPrimitiveType(gpu_conv_config.output_type));
se::dnn::DnnSupport* dnn = stream_executor->AsDnn();
se::StreamExecutorMemoryAllocator allocator(stream_executor);
std::unique_ptr<se::Stream> owned_stream;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: The stream creation path changed from allocator.GetStream() (which reused a cached stream) to stream_executor->CreateStream() (which creates a fresh stream per call). This is functionally correct and the lifetime is properly managed, but it's a behavioral change worth being aware of — previously streams could be reused across calls.

@claude
Copy link

claude bot commented Mar 26, 2026

Claude Code Review Summary

Overall: Clean refactoring that threads DeviceAddressAllocator through the MIOpen autotuning path to enable BFC allocation for scratch memory. The change is well-structured and the allocator lifetime management is correct.

Key items flagged inline:

  • Potential null-pointer risk on the new allocator_ member in MIOpenBackend (previously self-contained, now depends on caller-provided allocator which defaults to nullptr)
  • Missing direct include for DeviceAddressAllocator in factory.h (IWYU)
  • Independent bug fix (result buffer count) mixed into the refactoring commit
  • Unused parameter in CUDA factory path
  • Behavioral change in stream management (creation vs. reuse)

See inline comments for details.

🤖 Generated with Claude Code

@github-actions github-actions bot removed the claude-review Request a Claude AI code review for this PR label Mar 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cherry-pick-candidate Mark a PR to be cherry-picked into the next ROCm JAX. Remove IIF the latest upstream contain the PR. rocm-jaxlib-v0.9.1 sanitizers Execute sanitizer pipelines for that PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants