Three implementations of a library for routing linear algebra sub-problems in multi-GPU systems, with communication-computation overlap:
- CUDA version
- POSIX threads version
- HIP version (AMD)

The implementations use task queues and an event-based synchronization system. The latter two extend the applicability of the library and achieve similar or superior performance.
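To illustrate the general idea of task queues synchronized by events, here is a minimal CPU-only sketch in standard C++. It is not the library's code or API: `Event`, `TaskQueue` and `overlapped_sum` are hypothetical names chosen for this example, and plain `std::thread` workers stand in for GPU streams. One queue "transfers" tiles into a staging buffer while a second queue reduces the tiles that have already arrived, so the transfer of tile i+1 can overlap with the computation on tile i.

```cpp
#include <algorithm>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <numeric>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical one-shot event: record() marks it complete, wait() blocks until then.
class Event {
public:
    void record() {
        { std::lock_guard<std::mutex> lock(m_); done_ = true; }
        cv_.notify_all();
    }
    void wait() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return done_; });
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
};

// Hypothetical FIFO task queue served by a single worker thread.
class TaskQueue {
public:
    TaskQueue() : worker_([this] { run(); }) {}
    ~TaskQueue() {  // drains the remaining tasks, then joins the worker
        { std::lock_guard<std::mutex> lock(m_); stop_ = true; }
        cv_.notify_one();
        worker_.join();
    }
    void submit(std::function<void()> task) {
        { std::lock_guard<std::mutex> lock(m_); tasks_.push(std::move(task)); }
        cv_.notify_one();
    }
    // The event completes once every task submitted before it has run.
    void record(Event& e) { submit([&e] { e.record(); }); }
    // Tasks submitted after this call do not start until the event completes.
    void wait_for(Event& e) { submit([&e] { e.wait(); }); }
private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(m_);
                cv_.wait(lock, [this] { return stop_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // stop requested and queue drained
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();
        }
    }
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> tasks_;
    bool stop_ = false;
    std::thread worker_;  // declared last so the members above exist before it starts
};

// Sums src tile by tile: the "transfer" queue stages tile i+1 while the
// "compute" queue reduces tile i. Assumes tile > 0.
long long overlapped_sum(const std::vector<int>& src, std::size_t tile) {
    assert(tile > 0);
    const std::size_t n_tiles = (src.size() + tile - 1) / tile;
    std::vector<int> staged(src.size());
    std::vector<Event> copied(n_tiles);  // one event per staged tile
    Event finished;
    long long total = 0;                 // written only by the compute worker
    TaskQueue transfer, compute;
    for (std::size_t i = 0; i < n_tiles; ++i) {
        const std::size_t begin = i * tile;
        const std::size_t end = std::min(begin + tile, src.size());
        transfer.submit([&, begin, end] {
            std::copy(src.begin() + begin, src.begin() + end, staged.begin() + begin);
        });
        transfer.record(copied[i]);      // tile i is now staged
        compute.wait_for(copied[i]);     // compute must not read tile i earlier
        compute.submit([&, begin, end] {
            total += std::accumulate(staged.begin() + begin, staged.begin() + end, 0LL);
        });
    }
    compute.record(finished);
    finished.wait();                     // host blocks until all tiles are reduced
    return total;
}
```

Calling `overlapped_sum(v, 64)` on a vector holding the integers 1 to 1000 returns 500500. Having each queue wait on a per-tile event, rather than on the other queue as a whole, is what lets the two queues run ahead of each other independently.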
© Poutas Sokratis