Skip to content

Add scope_range_id support to ROCm profiler#670

Draft
magaonka-amd wants to merge 1 commit intoROCm:mainfrom
magaonka-amd:feature/scope-range-id
Draft

Add scope_range_id support to ROCm profiler#670
magaonka-amd wants to merge 1 commit intoROCm:mainfrom
magaonka-amd:feature/scope-range-id

Conversation

@magaonka-amd
Copy link

  • Capture scope_range_id from AnnotationStack during HIP API correlation callbacks and propagate it through AnnotationMap to GPU kernel events in the profiler trace
  • Export the scope_range_id tree (child-to-parent mapping) as a separate XPlane so xprof can reconstruct the scope hierarchy
  • This enables xprof to separate per-invocation events on the derived timeline (XLA Modules, XLA Ops rows), matching the existing CUPTI-side implementation for NVIDIA GPUs

Submission Checklist

Comment on lines 553 to +561

void RocmTraceCollectorImpl::ExportScopeRangeIdTree(XSpace* space) {
XPlaneBuilder plane(FindOrAddMutablePlaneWithName(
space, tsl::profiler::kScopeRangeIdTreePlaneName));
tensorflow::profiler::XStatMetadata metadata;
for (const auto& [child_id, parent_id] : scope_range_id_tree_) {
metadata.set_id(child_id);
plane.AddStatValue(metadata, parent_id);
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: scope_range_id_tree_ is never populated — exported XPlane will always be empty.

This method iterates over scope_range_id_tree_, but nothing in this PR ever inserts entries into it. Compare with the CUPTI path:

  • In cupti_tracer.cc:671-681, AddScopeRangeIdSequence walks the full range ID sequence from AnnotationStack::GetScopeRangeIds() and builds the child→parent mapping by iterating adjacent pairs (tree.emplace(*curr, *(curr - 1))).
  • In cupti_collector.cc:726-734, per-thread trees are merged into the collector's scope_range_id_tree_.

In this PR, only the leaf scope range ID (the last element of range_ids) is stored per correlation ID in the annotation map, but the parent-child tree relationships are never constructed. The tree-building logic equivalent to AddScopeRangeIdSequence is missing, so xprof will not be able to reconstruct the scope hierarchy.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved — addressed in this revision. The tree-building logic has been added in AnnotationMap::Add (walking adjacent pairs to build child-to-parent mappings), and TakeScopeRangeIdTree() transfers it to the collector before export.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved: scope_range_id_tree_ is now populated via the tree-building logic in AnnotationMap::Add and transferred through TakeScopeRangeIdTree(). The exported XPlane will now correctly contain the scope hierarchy.

Comment on lines +287 to +288
trace_event->scope_range_id =
annotation_map()->LookUpScopeRangeId(trace_event->correlation_id);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: scope_range_id is set only for kernel dispatch events, not for HipApi or Memcpy events.

HipApiEvent (around line 144-195) and MemcpyEvent (around line 197-263) both look up annotations via annotation_map()->LookUp() but do not call LookUpScopeRangeId. Since the scope range ID is stored in the annotation map keyed by correlation ID, and both event types have correlation IDs, these events will have scope_range_id = 0 even when a valid scope range ID exists.

The CUPTI path sets scope_range_id uniformly on all events at callback time (see cupti_tracer.cc:996).

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved — addressed in this revision. LookUpScopeRangeId is now called in HipApiEvent, MemcpyEvent, and KernelEvent uniformly.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved: LookUpScopeRangeId is now called uniformly for HipApi, Memcpy, and Kernel events, so all event types correctly receive their scope_range_id.

#include "rocm/include/rocprofiler-sdk/internal_threading.h"
#include "rocm/include/rocprofiler-sdk/registration.h"
#include "rocm/include/rocprofiler-sdk/rocprofiler.h"
#include <unistd.h>
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: #include <unistd.h> moved to incorrect position.

This C system header is now placed between rocprofiler-sdk/rocprofiler.h (third-party) and xla/backends/profiler/gpu/rocm_collector.h (project). Per Google C++ style and the project's include order convention, C system headers should be grouped together before third-party headers. This will likely be flagged by include-order linters.

void AddEvent(RocmTracerEvent&& event, bool is_auxiliary) override;
void Flush() override;
void Export(tsl::profiler::XSpace* space) override;
void ExportScopeRangeIdTree(tsl::profiler::XSpace* space);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: ExportScopeRangeIdTree should be private.

This method is an implementation detail called only by Export(). The CUPTI-side equivalent in cupti_collector.cc is a private member function. Making it public exposes an internal detail of the export pipeline.

Suggested change
void ExportScopeRangeIdTree(tsl::profiler::XSpace* space);
void ExportScopeRangeIdTree(tsl::profiler::XSpace* space);

(Move this declaration to the private: section below.)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved — addressed in this revision. ExportScopeRangeIdTree is now declared in the private: section of RocmTraceCollectorImpl.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved: ExportScopeRangeIdTree is now in the private: section.

@@ -213,6 +216,7 @@ class RocmTraceCollectorImpl : public RocmTraceCollector {
ABSL_EXCLUSIVE_LOCKS_REQUIRED(event_maps_mutex_);

absl::node_hash_map<uint32_t, PerDeviceCollector> per_device_collector_;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: scope_range_id_tree_ lacks ABSL_GUARDED_BY annotation.

Other mutable fields in this class are annotated with ABSL_EXCLUSIVE_LOCKS_REQUIRED(event_maps_mutex_) or similar. While Export is likely called only after tracing is stopped (no concurrent access), adding a guard annotation would maintain consistency with the rest of the class and prevent future misuse.

Comment on lines +578 to +579
// Build scope range ID tree from events in this device
// (events already have scope_range_id set from annotation lookup)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Misleading comment.

This comment says "Build scope range ID tree from events in this device" and "events already have scope_range_id set from annotation lookup", but no tree-building from events actually happens here. The per_device_collector_[id].Export(...) call does not contribute to scope_range_id_tree_. This comment should either be removed or updated to reflect what actually happens (once the tree-building logic is implemented).

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved — addressed in this revision. The misleading comments have been removed.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved: The misleading comments about tree building have been removed.

@claude
Copy link

claude bot commented Mar 17, 2026

Review Summary

This PR adds scope_range_id support to the ROCm profiler, mirroring the CUPTI-side implementation. The annotation map, event propagation, and XPlane export plumbing are in place, but there are two functional gaps that will result in incomplete profiling data:

  1. scope_range_id_tree_ is never populated — the tree-building logic equivalent to CUPTI's AddScopeRangeIdSequence is missing, so the exported scope range ID tree XPlane will always be empty.
  2. scope_range_id is only set on kernel dispatch events — HipApi and Memcpy events are missing the LookUpScopeRangeId call, unlike the CUPTI path which sets it uniformly on all events.

Additional minor findings (include order, visibility, thread-safety annotation, misleading comment) are noted inline.

No tests are included for the new functionality.


Automated review by Claude

@magaonka-amd magaonka-amd force-pushed the feature/scope-range-id branch from dfb32b5 to b33722d Compare March 17, 2026 02:46
Comment on lines +159 to +162
auto* collector =
static_cast<RocmTraceCollectorImpl*>(rocm_trace_collector_.get());
collector->SetScopeRangeIdTree(
rocm_tracer_->annotation_map()->TakeScopeRangeIdTree());
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Design: static_cast to RocmTraceCollectorImpl* breaks the RocmTraceCollector abstraction.

rocm_trace_collector_ is declared as std::unique_ptr<RocmTraceCollector> (the base interface), but this code downcasts it to the concrete RocmTraceCollectorImpl* to call SetScopeRangeIdTree. This couples GpuTracer to the concrete implementation, bypassing the polymorphic interface.

Consider either:

  1. Adding a SetScopeRangeIdTree (or a combined Export overload that accepts the tree) to the RocmTraceCollector base class, or
  2. Passing the tree as an argument to Export(space, tree).

The CUPTI path avoids this by merging the tree inside the collector itself (CuptiTraceCollectorImpl::Flush calls MergeScopeRangeIdTrees).

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved — addressed in this revision. The static_cast to RocmTraceCollectorImpl* has been removed. SetScopeRangeIdTree is now a virtual method on the RocmTraceCollector base class, so the call goes through the polymorphic interface.

Comment on lines +186 to +187
absl::flat_hash_map<uint32_t, int64_t> scope_range_id_map;
ScopeRangeIdTree scope_range_id_tree;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: New fields lack ABSL_GUARDED_BY(mutex) annotations.

scope_range_id_map and scope_range_id_tree are accessed under map_.mutex in Add, LookUpScopeRangeId, TakeScopeRangeIdTree, and Clear, but unlike the pattern established by the existing fields, they lack thread-safety annotations. Adding ABSL_GUARDED_BY(mutex) would let Clang's thread-safety analysis verify correctness at compile time.

(The existing annotations and correlation_map fields also lack these annotations, but since this PR is adding new fields, it's a good opportunity to set the right precedent.)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved — addressed in this revision. All fields in AnnotationMapImpl (annotations, correlation_map, scope_range_id_map, scope_range_id_tree) now carry ABSL_GUARDED_BY(mutex) annotations.

namespace xla {
namespace profiler {

using ScopeRangeIdTree = absl::flat_hash_map<int64_t, int64_t>;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Potential ScopeRangeIdTree type alias collision.

This ScopeRangeIdTree alias is defined in xla::profiler, the same namespace where the CUPTI path defines an identical alias in cupti_buffer_events.h. If both headers are ever included in the same translation unit (e.g., during future refactoring or in shared test code), this would cause a redefinition error. Consider using a shared definition in a common header instead.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved — addressed in this revision. A comment has been added clarifying that ROCm and CUPTI are never compiled together, so the duplicate alias is safe.

Comment on lines 553 to +557

void RocmTraceCollectorImpl::ExportScopeRangeIdTree(XSpace* space) {
XPlaneBuilder plane(FindOrAddMutablePlaneWithName(
space, tsl::profiler::kScopeRangeIdTreePlaneName));
tensorflow::profiler::XStatMetadata metadata;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Missing explanatory comment for empty metadata name.

The XStatMetadata object has only id set but no name. This is intentional (matching the CUPTI implementation), but the CUPTI code has a comment explaining the rationale: "No metadata is used for this plane, we just use the XStat to transfer the map without breaking any existing proto." Adding a similar comment here would help future maintainers understand why the metadata name is deliberately left empty.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved — addressed in this revision. An explanatory comment has been added: "No metadata is used for this plane, we just use the XStat to transfer the map without breaking any existing proto."

@claude
Copy link

claude bot commented Mar 17, 2026

Re-review Summary

The two critical bugs from the initial review (empty scope_range_id_tree_ and missing scope_range_id on HipApi/Memcpy events) have been fixed. ExportScopeRangeIdTree is now private, and the misleading comments have been removed. Good progress.

Remaining from initial review (2 items):

  • #include <unistd.h> is still in the wrong position in rocm_tracer.cc
  • scope_range_id_tree_ in rocm_collector.h still lacks ABSL_GUARDED_BY annotation

New findings (4 items):

  • static_cast downcast to RocmTraceCollectorImpl* in device_tracer_rocm.cc breaks the polymorphic abstraction
  • New fields in AnnotationMapImpl (rocm_tracer_utils.h) lack ABSL_GUARDED_BY(mutex) annotations
  • ScopeRangeIdTree type alias could collide with the identical CUPTI-side alias in the same namespace
  • ExportScopeRangeIdTree missing the explanatory comment from the CUPTI implementation about why metadata names are empty

No unit tests are included for the new scope_range_id functionality — consider adding tests modeled after the CUPTI-side coverage in cupti_buffer_events_test.cc and cupti_collector_test.cc.

@magaonka-amd magaonka-amd force-pushed the feature/scope-range-id branch from b33722d to b2bc42c Compare March 17, 2026 03:13
- Capture scope_range_id from AnnotationStack during HIP API
  correlation callbacks and propagate it through AnnotationMap
  to GPU kernel events in the profiler trace
- Build the scope_range_id tree (child-to-parent mapping) from
  the full range ID sequence in AnnotationMap::Add, matching the
  CUPTI AddScopeRangeIdSequence pattern
- Export the tree as a separate XPlane so xprof can reconstruct
  the scope hierarchy for derived timeline grouping
- Set scope_range_id on all event types (kernel, HIP API, memcpy)
  for uniform coverage, matching the CUPTI path
@magaonka-amd magaonka-amd force-pushed the feature/scope-range-id branch from b2bc42c to 084618d Compare March 17, 2026 03:16
#include "rocm/include/rocprofiler-sdk/internal_threading.h"
#include "rocm/include/rocprofiler-sdk/registration.h"
#include "rocm/include/rocprofiler-sdk/rocprofiler.h"
#include <unistd.h>
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#include <unistd.h> is a C system header and belongs with the other system headers (lines 22-28), not after the third-party rocprofiler-sdk headers. Google C++ style (and XLA's .clang-format) requires system headers to be grouped together after the corresponding header.

Suggested change
#include <unistd.h>
#include "xla/backends/profiler/gpu/rocm_collector.h"

And move #include <unistd.h> up into the C/C++ system headers block (after <vector>).

void ExportScopeRangeIdTree(tsl::profiler::XSpace* space);

absl::node_hash_map<uint32_t, PerDeviceCollector> per_device_collector_;
ScopeRangeIdTree scope_range_id_tree_;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

scope_range_id_tree_ is accessed from SetScopeRangeIdTree (called from CollectData on the profiler thread) and from ExportScopeRangeIdTree (called from Export, also on the profiler thread). In the current usage pattern these are sequential so there is no actual data race, but the CUPTI side stores this field in the base class without a guard as well. For consistency and defensive coding, consider documenting the single-threaded access assumption with a comment, or protecting it with event_maps_mutex_ and adding ABSL_GUARDED_BY(event_maps_mutex_) annotation, similar to the other maps in this class.

uint32_t num_events) = 0;
virtual void Flush() = 0;
virtual void Export(tsl::profiler::XSpace* space) = 0;
virtual void SetScopeRangeIdTree(ScopeRangeIdTree tree) {}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The base class SetScopeRangeIdTree has a default no-op body {}. This means if someone creates a custom subclass of RocmTraceCollector, the scope range ID tree will silently be dropped. Consider making this pure virtual alongside the other pure virtual methods (AddEvent, Flush, Export), or at minimum adding a comment explaining the intentional no-op default.

Comment on lines +102 to +111
if (!scope_range_ids.empty()) {
map_.scope_range_id_map.emplace(correlation_id, scope_range_ids.back());
if (scope_range_ids.size() > 1) {
const int64_t* head = scope_range_ids.data();
const int64_t* curr = &scope_range_ids.back();
for (; curr > head && !map_.scope_range_id_tree.contains(*curr);
--curr) {
map_.scope_range_id_tree.emplace(*curr, *(curr - 1));
}
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tree-building logic iterates backward through scope_range_ids and stops as soon as it finds a node already in the tree (map_.scope_range_id_tree.contains(*curr)). This relies on the assumption that if a node is already in the tree, all its ancestors are too. This is correct given that AnnotationStack::GetScopeRangeIds() always returns the full root-to-leaf path, so any previously-inserted subtree would have been built from a path sharing the same prefix. Worth adding a brief comment explaining this invariant — the CUPTI side has similar logic in CuptiBufferEvents::AddScopeRangeIdSequence but it is more explicit about the assumption.

Comment on lines +554 to +563
void RocmTraceCollectorImpl::ExportScopeRangeIdTree(XSpace* space) {
XPlaneBuilder plane(FindOrAddMutablePlaneWithName(
space, tsl::profiler::kScopeRangeIdTreePlaneName));
// No metadata is used for this plane, we just use the XStat to
// transfer the map without breaking any existing proto.
tensorflow::profiler::XStatMetadata metadata;
for (const auto& [child_id, parent_id] : scope_range_id_tree_) {
metadata.set_id(child_id);
plane.AddStatValue(metadata, parent_id);
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ExportScopeRangeIdTree implementation matches the CUPTI side closely, which is good. One observation: metadata is being mutated in place across loop iterations (only set_id is called, never set_name). The name field of XStatMetadata defaults to empty string, which is the intended behavior per the CUPTI implementation. This is fine, but the CUPTI side has a comment about why metadata names are empty — consider mirroring that comment here for clarity.

@claude
Copy link

claude bot commented Mar 17, 2026

Summary

This PR adds scope_range_id support to the ROCm profiler, mirroring the CUPTI-side implementation for NVIDIA GPUs. It captures the scope range ID from AnnotationStack during HIP API correlation callbacks, propagates it through AnnotationMap to all GPU events (HipApi, Memcpy, Kernel), and exports the scope range ID tree as a dedicated XPlane so xprof can reconstruct the scope hierarchy on derived timelines.

The implementation addresses the two critical gaps identified in the first review round (empty tree, missing scope_range_id on HipApi/Memcpy events) and adds ABSL_GUARDED_BY annotations to the AnnotationMapImpl fields.

Findings

  • [rocm_tracer.cc:45] #include <unistd.h> is positioned between third-party rocprofiler-sdk headers and xla project headers. Google C++ style requires C system headers to be grouped with C++ system headers (after <vector>, before absl headers). The diff removed it from its original (also incorrect) position and placed it in a new incorrect position.

  • [rocm_collector.h:130] SetScopeRangeIdTree in the base class RocmTraceCollector has a default no-op body, unlike the other interface methods (AddEvent, Flush, Export) which are pure virtual. A custom subclass would silently discard scope range ID data. Consider making it pure virtual or documenting the no-op default.

  • [rocm_collector.h:222] scope_range_id_tree_ member in RocmTraceCollectorImpl lacks thread-safety annotation. While current usage is single-threaded (set in CollectData, read in Export, called sequentially), the other map members in this class carry ABSL_GUARDED_BY(event_maps_mutex_). Adding an annotation or a comment documenting the sequential-access assumption would be consistent.

  • [rocm_tracer_utils.cc:102-111] The scope range ID tree construction logic uses a backwards walk that stops when it finds a node already in the tree. This relies on the invariant that AnnotationStack::GetScopeRangeIds() always returns the complete root-to-leaf path. The assumption is correct but undocumented — a brief comment would help future readers understand why the early termination is safe.

  • [rocm_collector.cc:554-563] ExportScopeRangeIdTree reuses a single XStatMetadata object across iterations, only setting id and leaving name empty. This matches the CUPTI implementation but lacks the explanatory comment that the CUPTI side has about why metadata names are intentionally empty.

  • No unit tests are included for the new scope_range_id functionality. The CUPTI side has test coverage in cupti_buffer_events_test.cc and cupti_collector_test.cc — similar tests for the ROCm path would catch regressions in tree construction and event propagation.


Automated review by Claude

@claude
Copy link

claude bot commented Mar 17, 2026

Re-review Summary

Most issues from the previous review have been addressed — nice work! Resolved 4 previously-flagged issues (static_cast abstraction break, missing ABSL_GUARDED_BY on new fields, ScopeRangeIdTree alias collision risk, missing explanatory comment for empty metadata name).

Two minor nits remain from the prior review (include ordering of <unistd.h>, and scope_range_id_tree_ lacking ABSL_GUARDED_BY in RocmTraceCollectorImpl) — see existing inline comments.

No new issues found in this revision.

@claude
Copy link

claude bot commented Mar 17, 2026

Re-review Summary (Round 3)

The two critical bugs from the initial review (empty tree export and missing scope_range_id on HipApi/Memcpy events) have been fixed — nice work.

The remaining minor items from previous rounds (include order, ABSL_GUARDED_BY annotation, base-class default no-op, tree-building invariant documentation) are still open but non-blocking. No new issues found in this round.

🤖 Generated with Claude Code

@magaonka-amd magaonka-amd marked this pull request as draft March 17, 2026 03:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant