Failing tests in --peer_memory #92

@hubertlu-tw

Please see the comment in the PR where we enabled the --peer_memory and --nccl_p2p extensions: #87 (comment)

Some tests fail sporadically on ROCm when running the following test script:

cd apex/contrib/peer_memory
torchrun --nproc_per_node 2 peer_halo_exchange_module_tests.py
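
Because the failures are sporadic, a single run may not reproduce them. One way to get a rough failure rate is to repeat the command and tally the runs that exit with a nonzero status. This is a minimal sketch, not part of the apex test suite: the helper name `count_failures` is hypothetical, and it assumes a failing test makes `torchrun` exit nonzero.

```python
import subprocess


def count_failures(cmd, runs, cwd=None):
    """Run `cmd` `runs` times and return the indices of runs that exited nonzero.

    Hypothetical helper for illustration only. It assumes that a failing
    test causes the launched command to return a nonzero exit code.
    """
    failed = []
    for i in range(runs):
        # Capture output so repeated runs do not flood the terminal.
        result = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)
        if result.returncode != 0:
            failed.append(i)
    return failed
```

For the command above, this could be called as `count_failures(["torchrun", "--nproc_per_node", "2", "peer_halo_exchange_module_tests.py"], runs=20, cwd="apex/contrib/peer_memory")`, where the run count of 20 is an arbitrary choice.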

Labels: bug (Something isn't working)
