1 change: 1 addition & 0 deletions .github/workflows/rocm-ci.yml
@@ -172,6 +172,7 @@ jobs:
export HSA_FORCE_FINE_GRAIN_PCIE=1
export HSA_ENABLE_SDMA=0
torchrun --nproc_per_node 8 apex/contrib/peer_memory/peer_halo_exchange_module_tests.py 2>&1 | tee halo_results.log
! grep -q 'FAILURE :' halo_results.log
Collaborator

@leo-amd @amd-sriram Whatever error detection logic we use should be applied to all the test runs. But that also raises the question of why we need the grep check above at all: why don't the halo tests exit with a nonzero exit code on failure? Is torchrun to blame?
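One possible contributor, independent of torchrun: a pipeline's exit status is that of its last command unless `pipefail` is set, so piping through `tee` can hide a nonzero exit from the test command. As far as I know, a GitHub Actions `run` step on Linux only gets `-o pipefail` when `shell: bash` is set explicitly, so that is worth checking in this workflow. Below is a minimal sketch of a check that could be shared by all test runs. `run_and_check` and the three stand-in commands are hypothetical names for illustration, not part of this PR, and the sketch does not establish whether the halo test script itself returns nonzero.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this PR): run a test command, keep a log,
# and fail if the command exits nonzero OR the log contains a failure marker.
run_and_check() {
  local log="$1"; shift
  local status=0
  # pipefail makes the pipeline report the test command's nonzero status
  # instead of tee's 0. The subshell keeps the option change local.
  ( set -o pipefail; "$@" 2>&1 | tee "$log" ) || status=$?
  # Using grep inside `if` avoids relying on `! grep` being the last command.
  if grep -q 'FAILURE :' "$log"; then
    echo "failure marker found in $log" >&2
    return 1
  fi
  return "$status"
}

# Stand-ins for the real torchrun invocations (assumptions for this demo).
passes()         { echo "all good"; }
prints_failure() { echo "FAILURE : rank 0"; }   # logs a failure but exits 0
crashes()        { echo "starting"; return 3; } # exits nonzero, no marker

tmp=$(mktemp -d)
rc_pass=0;  run_and_check "$tmp/a.log" passes         >/dev/null      || rc_pass=$?
rc_print=0; run_and_check "$tmp/b.log" prints_failure >/dev/null 2>&1 || rc_print=$?
rc_crash=0; run_and_check "$tmp/c.log" crashes        >/dev/null      || rc_crash=$?
echo "pass=$rc_pass printed_failure=$rc_print crash=$rc_crash"
rm -rf "$tmp"
```

In the workflow, each test step could then call something like `run_and_check halo_results.log torchrun --nproc_per_node 8 ...` so that both kinds of failure fail the step in the same way.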

Collaborator

@amd-sriram amd-sriram Mar 23, 2026

@jithunnair-amd @leo-amd We should try to use an assert statement similar to https://github.com/NVIDIA/apex/blob/master/apex/contrib/test/peer_memory/test_peer_halo_exchange_module.py#L134.

torch.testing.assert_close(list_y, list_y2, msg=memory_format_str)
I tried to run the halo tests but could only run them once, so I couldn't check whether the assert statement would help.

Collaborator

@amd-sriram amd-sriram Mar 27, 2026

@jithunnair-amd I made a PR that adds the assert statement and also addresses the timeout error: #323

- name: Run Distributed Synced BatchNorm tests