So I have been implementing "packed" operations in order to hide the overhead of launching a bunch of small kernels.
I just learned there is already a solution for this: CUDA graphs!
I believe packed operations will still be faster if each operation is so small that several could run in parallel on the GPU (I have not looked into graphs much yet; they may allow nodes at the same level to launch in parallel, which would be awesome). But that advantage would only be a constant factor, bounded by the number of warps that can be in flight at once (probably not more than 30x). What I was actually noticing was much worse slowdowns as the number of kernel launches increased, because more and more launch overhead accumulated.
In other words, scaling up, say, the number of layers in a CNN was not resulting in constant-time slowdowns.
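To make that reasoning a bit more concrete, here is a rough back-of-the-envelope cost model in Python. Everything in it is an assumption for illustration: the function names are hypothetical, the 5 us launch overhead and 2 us kernel time are made-up numbers rather than measurements, and the parallel width of 30 is just my guess from above. The graph case assumes the whole graph pays a single launch overhead and that its nodes run one after another, which is a simplification.

```python
def unpacked_time_us(num_ops, launch_overhead_us, kernel_time_us):
    """One kernel launch per small op, run back to back."""
    return num_ops * (launch_overhead_us + kernel_time_us)


def graph_time_us(num_ops, launch_overhead_us, kernel_time_us):
    """Assumption: one launch overhead for the whole graph, nodes run serially."""
    return launch_overhead_us + num_ops * kernel_time_us


def packed_time_us(num_ops, launch_overhead_us, kernel_time_us, parallel_width):
    """Assumption: one launch, and up to parallel_width ops run concurrently."""
    waves = -(-num_ops // parallel_width)  # ceiling division
    return launch_overhead_us + waves * kernel_time_us


if __name__ == "__main__":
    # Hypothetical numbers, not measurements.
    overhead_us, kernel_us, width = 5.0, 2.0, 30
    for n in (10, 100, 1000):
        print(
            n,
            unpacked_time_us(n, overhead_us, kernel_us),
            graph_time_us(n, overhead_us, kernel_us),
            packed_time_us(n, overhead_us, kernel_us, width),
        )
```

Under these assumptions, the gap between the graph and packed versions never exceeds the parallel width, while the unpacked version keeps paying the launch overhead on every single op.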
I am going to keep going with packed operations, since I believe it will be a valuable experience, but afterwards I do want to take a look at graphs.
https://developer.nvidia.com/blog/cuda-graphs/