Conversation
```diff
 template<typename T>
-recv_request recv(message_buffer<T>& msg, rank_type src, tag_type tag)
+recv_request recv(message_buffer<T>& msg, rank_type src, tag_type tag, void* stream = nullptr)
```
Is this a good API?
This means that for NCCL the default stream is used if nothing is specified (a stream is always required for NCCL). For other backends the stream is ignored.
```diff
 template<typename T, typename CallBack>
-recv_request recv(message_buffer<T>&& msg, rank_type src, tag_type tag, CallBack&& callback)
+recv_request recv(message_buffer<T>&& msg, rank_type src, tag_type tag, CallBack&& callback, void* stream = nullptr)
```
These signatures can lead to ambiguous calls: leaving out the callback but supplying a stream can match this overload as well, with the stream taking the place of `CallBack`. Is this OK?
I would add some SFINAE constraints such as `std::enable_if_t<std::is_invocable_v<CallBack>>`, but I am not sure if this is a good idea.
Yeah, it's a bit unfortunate. There's OOMPH_CHECK_CALLBACK* that's used essentially for that in the body of the functions, but that's not SFINAE. Also unsure what's best here.
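For illustration, a minimal self-contained sketch of the SFINAE constraint under discussion. The `message` type and the `int` return markers are stand-ins, not oomph's actual API, and the constraint checks invocability with the message rather than oomph's exact callback signature:

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

struct message { int payload = 0; };  // stand-in for message_buffer<T>

// Overload without callback: accepts an optional stream pointer.
inline int recv(message&, int /*src*/, int /*tag*/, void* /*stream*/ = nullptr)
{
    return 1;  // marker: plain overload chosen
}

// Overload with callback: SFINAE-disabled unless CallBack is actually
// invocable with the message, so a void* argument can never match here.
template<typename CallBack,
    typename = std::enable_if_t<std::is_invocable_v<CallBack, message&>>>
inline int recv(message& msg, int /*src*/, int /*tag*/, CallBack&& cb,
    void* /*stream*/ = nullptr)
{
    std::forward<CallBack>(cb)(msg);
    return 2;  // marker: callback overload chosen
}
```

With this constraint, `recv(msg, src, tag, stream)` cannot bind the stream pointer to `CallBack`, because `std::is_invocable_v<void*&, message&>` is false, while a real callable still selects the callback overload.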
Force-pushed from e81de62 to 8a854e4.
test/test_send_recv.cpp
```cpp
// TODO: The sreq.wait was previously called immediately. With NCCL
// groups can't call wait so early (communication hasn't started yet).
```
Note the semantic change here: if one attempts to call `env.comm.send(...).wait()` within the NCCL group it will hang. `wait` will block forever since the group never starts. Should that just throw an exception instead (we can easily query whether the group has already been ended)?
I would say it should throw an exception.
Sounds good, I'll (try to) add that.
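A sketch of the throw-instead-of-hang behavior, using a mock communicator and request (these names are illustrative, not oomph's actual implementation); the point is only that `wait()` checks whether a group is still open before blocking:

```cpp
#include <cassert>
#include <stdexcept>

struct communicator
{
    bool group_open = false;
    void start_group() { group_open = true; }
    void end_group() { group_open = false; }
};

struct request
{
    communicator* comm;
    void wait() const
    {
        // Inside an open group the operation has not been submitted yet,
        // so blocking would deadlock; fail loudly instead.
        if (comm->group_open)
            throw std::runtime_error("wait() called inside an open group");
        // ... otherwise actually wait for completion ...
    }
};
```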
This now seems to work in ICON Fortran. While I still have some open TODOs, I'd be grateful for feedback on this already. The general implementation is pretty much what I want it to be, though I still have some profiling to do with NCCL to check if I'm missing some additional low-hanging fruit. Besides any comments you may have on the implementation itself (in particular, I'd be grateful for comments on anything where I've misunderstood oomph's requirements for backends), I guess we may need to discuss some sort of CI for the NCCL backend... I can't request reviews, so pinging @boeschf @biddisco @philip-paul-mueller.
philip-paul-mueller
left a comment
I have some comments/suggestions, but I am not sure what they are worth; probably not much.
```cpp
static cuda_event_pool pool{128};
return pool;
```
```diff
-static cuda_event_pool pool{128};
-return pool;
+static cuda_event_pool* pool = new cuda_event_pool(128);
+return *pool;
```
See: https://isocpp.org/wiki/faq/ctors#construct-on-first-use-v2
I have to say I disagree with that motivation, or at least the solution. IMO if the events outlive the pool, then the events should be returned earlier, not the pool leaked. But I can be convinced otherwise...
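For reference, the two variants side by side as compilable code, with a stand-in pool type. The leaked-pointer form from the isocpp FAQ avoids static destruction-order problems because the pool is intentionally never destroyed:

```cpp
#include <cassert>

struct cuda_event_pool  // stand-in for the real pool type
{
    int capacity;
    explicit cuda_event_pool(int n) : capacity(n) {}
};

// Variant A: function-local static. Destroyed during static destruction,
// which may run before other static objects that still hold events.
cuda_event_pool& pool_static()
{
    static cuda_event_pool pool{128};
    return pool;
}

// Variant B: construct-on-first-use with an intentional leak. The pool
// outlives every other object, at the cost of never running its destructor.
cuda_event_pool& pool_leaked()
{
    static cuda_event_pool* pool = new cuda_event_pool(128);
    return *pool;
}
```

Both are thread-safe to initialize (magic statics); the disagreement above is only about whether leaking the pool or returning events earlier is the right fix.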
```cpp
ncclResult_t result;
do {
    OOMPH_CHECK_NCCL_RESULT(ncclCommGetAsyncError(m_comm, &result));
} while (result == ncclInProgress);
```
This is more of a question for myself, but this can technically go on indefinitely.
So would it be a good idea to include a timeout?
I think NCCL internally has enough timeouts that this should not be a problem, but not completely sure... If there's a timeout, the question is what value it should be and how it's configured.
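If a timeout were added, the loop could look roughly like this. The NCCL calls are mocked out here so the sketch is self-contained, and the timeout value and how it is configured are exactly the open questions:

```cpp
#include <cassert>
#include <chrono>
#include <stdexcept>

enum mock_result { in_progress, completed };  // stand-in for ncclResult_t

// Mock of ncclCommGetAsyncError: reports in_progress a few times, then done.
mock_result poll_async_error()
{
    static int calls = 0;
    return (++calls < 3) ? in_progress : completed;
}

// Polling loop with a deadline instead of spinning indefinitely.
mock_result wait_for_completion(std::chrono::milliseconds timeout)
{
    auto const deadline = std::chrono::steady_clock::now() + timeout;
    mock_result result;
    do {
        result = poll_async_error();
        if (result != in_progress) break;
        if (std::chrono::steady_clock::now() > deadline)
            throw std::runtime_error("timed out waiting for NCCL operation");
    } while (true);
    return result;
}
```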
Mostly just copy MPI implementation to a new directory, not functional.
```yaml
# ConstructorInitializerAllOnOneLineOrOnePerLine: false
BreakConstructorInitializers: BeforeComma
ConstructorInitializerIndentWidth: 0
BreakInheritanceList: BeforeComma
```
src/nccl/request_state.hpp
```cpp
: base{ctxt, comm, scheduled, rank, tag, std::move(cb)}
, m_req{std::move(m)}
{
    // std::cerr << "creating nccl shared_request_state\n";
```
To do: remove the leftover debug output:
```diff
-// std::cerr << "creating nccl shared_request_state\n";
```
src/nccl/request_state.hpp
```cpp
: base{ctxt, comm, scheduled, rank, tag, std::move(cb)}
, m_req{std::move(m)}
{
    // std::cerr << "creating nccl request_state\n";
```
```diff
-// std::cerr << "creating nccl request_state\n";
```
src/nccl/request_queue.hpp
```cpp
{
    if (e->m_req.is_ready())
    {
        // std::cerr << "found ready request in shared queue\n";
```
```diff
-// std::cerr << "found ready request in shared queue\n";
```
NCCL can work with host memory on unified memory systems.
This adds an NCCL backend, with some strong constraints compared to the MPI, libfabric, and UCX backends:
If one sticks to these requirements, one should be able to use any backend. If one needs any of the above features, NCCL can't be used.
Adds a few extra features to communicators:

- `start_group`/`end_group`: These map to `ncclGroupStart`/`ncclGroupEnd` for NCCL, and are no-ops for other backends.
- `is_stream_aware`: The NCCL backend is the only one that returns `true` for this. If a backend `is_stream_aware`, it will take into account the optional `stream` argument that can be passed to `send`/`recv`.
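A hypothetical usage sketch of these features. The communicator below is a mock so the example is self-contained, but the semantics (group calls that are no-ops except for NCCL, the stream honored only when `is_stream_aware`) follow the description above:

```cpp
#include <cassert>

struct mock_communicator
{
    explicit mock_communicator(bool aware) : stream_aware(aware) {}

    bool  stream_aware;          // true only for the NCCL backend
    int   group_depth  = 0;
    void* last_stream  = nullptr;

    // start_group/end_group: ncclGroupStart/ncclGroupEnd for NCCL,
    // no-ops elsewhere (modeled here as a depth counter either way).
    void start_group() { ++group_depth; }
    void end_group()   { --group_depth; }

    bool is_stream_aware() const { return stream_aware; }

    // A stream-aware backend honors the optional stream; others ignore it.
    void send(void* stream = nullptr)
    {
        last_stream = is_stream_aware() ? stream : nullptr;
    }
};
```

Code written against this interface works unchanged on every backend as long as it tolerates the stream being ignored and treats the group calls as an opaque bracket around the communication calls.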