[ROCm] Use BFCAllocator for scratch allocations needed for MIOpen autotuning #756
draganmladjenovic wants to merge 1 commit into rocm-jaxlib-v0.9.1
Conversation
Force-pushed from 0341010 to 9ffc786 (Compare)
 private:
  bool IsSupported(const HloInstruction& instr) override;
  bool do_not_autotune_;
  stream_executor::DeviceAddressAllocator* allocator_;
nit: The allocator_ member is a raw non-owning DeviceAddressAllocator* that can potentially be null. In GpuCompiler::AddConvAndGemmAutotuningPass (gpu_compiler.cc:3293), options.device_allocator is passed through, and CompileOptions::device_allocator defaults to nullptr (compiler.h:129). The old code was self-contained (created its own StreamExecutorMemoryAllocator locally), so this path was never null-unsafe before.
Consider adding a CHECK(allocator != nullptr) in the MIOpenBackend constructor, or at minimum documenting the non-null precondition.
struct GetCodegenBackends {
  using Type = std::function<std::vector<std::unique_ptr<CodegenBackend>>(
      stream_executor::StreamExecutor*, const DebugOptions*, Compiler*,
      stream_executor::StreamExecutor*,
nit (IWYU): factory.h uses DeviceAddressAllocator* in this signature but doesn't directly include xla/stream_executor/device_address_allocator.h. It relies on a transitive include through compiler.h. If compiler.h ever stops including that header, this file would break. Consider adding a direct include.
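A sketch of the suggested fix, assuming the header path named in the comment above:

```cpp
// factory.h — include directly instead of relying on compiler.h transitively:
#include "xla/stream_executor/device_address_allocator.h"
```

This keeps factory.h self-sufficient if compiler.h's include set ever changes.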
  size_t result_buffers_count = instr->shape().tuple_shapes().size();
  result_buffers.reserve(result_buffers_count);
  absl::InlinedVector<se::DeviceAddressBase, 1> result_buffers;
  size_t result_buffers_count = instr->shape().tuple_shapes().size() - 1;
nit: The fix from .size() to .size() - 1 (to exclude the workspace u8[0] element) is correct, but it's a logically independent bug fix from the allocator refactoring. Consider separating this into its own commit for cleaner bisect history.
std::vector<std::unique_ptr<CodegenBackend>> GetCodegenBackendsForCuda(
    stream_executor::StreamExecutor* stream_executor,
    stream_executor::DeviceAddressAllocator* device_allocator,
nit: device_allocator is unused in the CUDA path. Consider adding [[maybe_unused]] or (void)device_allocator; to suppress potential compiler warnings and make the intent explicit.
  GetDNNDataTypeFromPrimitiveType(gpu_conv_config.output_type));
  se::dnn::DnnSupport* dnn = stream_executor->AsDnn();
  se::StreamExecutorMemoryAllocator allocator(stream_executor);
  std::unique_ptr<se::Stream> owned_stream;
Note: The stream creation path changed from allocator.GetStream() (which reused a cached stream) to stream_executor->CreateStream() (which creates a fresh stream per call). This is functionally correct and the lifetime is properly managed, but it's a behavioral change worth being aware of — previously streams could be reused across calls.
Claude Code Review Summary
Overall: Clean refactoring that threads the device allocator through the autotuning path. Key items flagged inline:
See inline comments for details. 🤖 Generated with Claude Code
[ROCm] Use BFCAllocator for scratch allocations needed for MIOpen autotuning
Motivation
Technical Details
openxla#39622