STIR and optimising GPU support (i.e. parallelproj and NiftyPET) #1239
Replies (excerpts):

- @casperdcl, just replacing the …
- Note that being able to expose the GPU arrays directly to Python would presumably speed up …
- @gschramm working on v2.0.0 release of https://github.com/KUL-recon-lab/libparallelproj: …
STIR currently supports CUDA-enabled projectors from NiftyPET (mMR only, due to hard-wiring) and Parallelproj, but performance is suboptimal. We're thinking about how to speed this up. As I'm more familiar with parallelproj, we can start there. The STIR code was based on @rijobro's work on NiftyPET anyway.

The current STIR strategy is to call the GPU code for all projection data, and then sort things out into subsets afterwards. (This is sub-optimal as well, but let's concentrate on speeding up the projection first.)

An example is the forward projection. Current strategy:
1. Copy the data into a `std::vector` (to get contiguous memory): `STIR/src/recon_buildblock/Parallelproj_projector/ForwardProjectorByBinParallelproj.cxx`, lines 148 to 149 in 4408419.
2. Copy the `std::vector` data to all GPUs: same file, lines 169 to 170.
3. Project into a `ProjDataInMemory` (i.e. CPU). Note that `parallelproj` will do the copy of projection data from GPU to host: same file, lines 181 to 189.
4. Copy from the `ProjDataInMemory` to the `ProjData` object that is actually given by the user (which could be in memory, but could also sit on disk, as `set_viewgram` etc. are overloaded accordingly).

Some reasons for this are:

- `stir::Array` is currently not contiguously stored. This is being addressed in "update Array hierarchy and allocate nD arrays in a contiguous block by default" #1236, which means that step 1 could be avoided.
- … `parallelproj` to do this in chunks.

Similar steps (in reverse) happen in the back-projection.
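To make step 1 concrete, here is a minimal sketch of the gather-into-contiguous-memory step. `NonContiguous3D` and `to_contiguous` are hypothetical stand-ins (not STIR API) that mimic the pre-#1236 situation where each row of a `stir::Array` may be allocated separately:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a non-contiguous 3D array: each row is a
// separately allocated std::vector, so there is no single pointer that
// could be handed to the GPU code.
struct NonContiguous3D {
  std::size_t nz, ny, nx;
  std::vector<std::vector<float>> rows; // nz*ny rows of length nx
  NonContiguous3D(std::size_t z, std::size_t y, std::size_t x)
      : nz(z), ny(y), nx(x), rows(z * y, std::vector<float>(x, 0.f)) {}
  float& at(std::size_t z, std::size_t y_, std::size_t x_) {
    return rows[z * ny + y_][x_];
  }
};

// Step 1 of the strategy above: copy everything into one contiguous
// buffer (z-major, then y, then x) that the GPU code can take as a
// single pointer. With #1236 merged, this copy could be avoided.
std::vector<float> to_contiguous(const NonContiguous3D& a) {
  std::vector<float> buf;
  buf.reserve(a.nz * a.ny * a.nx);
  for (const auto& row : a.rows)
    buf.insert(buf.end(), row.begin(), row.end());
  return buf;
}
```

This is exactly the kind of per-call O(N) copy that contiguous-by-default allocation would eliminate.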
Some points from a discussion held with @casperdcl, @markus-jehl, @evgueni-ovtchinnikov and others on 31 Aug 2023:

- Using `cudaMallocManaged` in a few places might avoid the explicit transfer (of images only?) between CPU and GPU (CUDA will take care of it). One option could be to use `cudaMallocManaged` for all `Array`s. This could be extended to numerical operations, either by using libraries or by using `#ifdef`s. An example of this is in https://github.com/AMYPAD/NumCu/blob/main/numcu/src/elemwise.cu

Some anticipated difficulties:

- `cudaMallocManaged` will probably fail when asking for more memory than available on the GPU. I don't know if this is a problem when allocating multiple blocks (i.e. do they all need to fit in the GPU together?)
- … `ProjDataInMemory`, but this needs a refactor of the `ForwardProjectorByBin` class.

Comments/suggestions/PRs welcome!
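A minimal sketch of the `#ifdef` approach mentioned above: one allocation helper that uses CUDA unified memory when built with CUDA, and plain host memory otherwise. The `HAVE_CUDA` macro and the helper names are hypothetical illustrations, not existing STIR API:

```cpp
#include <cassert>
#include <cstdlib>

#ifdef HAVE_CUDA
#include <cuda_runtime.h>
#endif

// Allocate a buffer usable from both host code and (if built with CUDA)
// device code. With cudaMallocManaged, CUDA migrates pages between host
// and device on demand, so the explicit copies in the current projector
// code would not be needed.
float* allocate_shared(std::size_t n) {
#ifdef HAVE_CUDA
  float* p = nullptr;
  if (cudaMallocManaged(&p, n * sizeof(float)) != cudaSuccess)
    return nullptr; // e.g. when the request exceeds available GPU memory
  return p;
#else
  return static_cast<float*>(std::malloc(n * sizeof(float)));
#endif
}

void free_shared(float* p) {
#ifdef HAVE_CUDA
  cudaFree(p);
#else
  std::free(p);
#endif
}
```

The same pattern could be extended to element-wise numerical operations (as in the NumCu `elemwise.cu` example linked above), dispatching to a CUDA kernel or a plain loop depending on the build.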