STIR and optimising GPU support (i.e. parallelproj and NiftyPET) #1239
Replies (excerpts):

- @casperdcl, just replacing the …
- Note that being able to expose the GPU arrays directly to Python would presumably speed up …
- @gschramm working on v2.0.0 release of https://github.com/KUL-recon-lab/libparallelproj: …
STIR currently supports CUDA-enabled projectors from NiftyPET (mMR only, due to hard-wiring) and Parallelproj, but performance is suboptimal. We're thinking about how to speed this up. As I'm more familiar with parallelproj, we can start there. The STIR code was based on @rijobro's work on NiftyPET anyway.

The current STIR strategy is to call the GPU code for all projection data, and then sort things out into subsets afterwards. (This is sub-optimal as well, but let's concentrate on speeding up the projection first.)

An example is the forward projection. Current strategy:
1. Copy the data into a `std::vector` (to get contiguous memory): `STIR/src/recon_buildblock/Parallelproj_projector/ForwardProjectorByBinParallelproj.cxx`, lines 148 to 149 in 4408419.
2. Copy the `std::vector` data to all GPUs: same file, lines 169 to 170.
3. Project into a `ProjDataInMemory` (i.e. CPU). Note that `parallelproj` will do the copy of projection data from GPU to host: same file, lines 181 to 189.
4. Copy from the `ProjDataInMemory` to the `ProjData` object that is actually given by the user (which could be in memory, but could also sit on disk, as `set_viewgram` etc. are overloaded accordingly).

Some reasons for this are:

- `stir::Array` is currently not contiguously stored. This is being addressed in "update Array hierarchy and allocate nD arrays in a contiguous block by default" #1236, which means that step 1 could be avoided.
- … `parallelproj` to do this in chunks.

Similar steps (in reverse) happen in the back-projection.
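To make step 1 concrete, here is a minimal sketch of the gather-into-contiguous-memory step. `NonContiguous3D` and `to_contiguous` are hypothetical stand-ins (not STIR API) that mimic the pre-#1236 situation where each row of a `stir::Array` may be allocated separately:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a non-contiguous 3D array: each row is a
// separately allocated std::vector, so there is no single pointer that
// could be handed to the GPU code.
struct NonContiguous3D {
  std::size_t nz, ny, nx;
  std::vector<std::vector<float>> rows; // nz*ny rows of length nx
  NonContiguous3D(std::size_t z, std::size_t y, std::size_t x)
      : nz(z), ny(y), nx(x), rows(z * y, std::vector<float>(x, 0.f)) {}
  float& at(std::size_t z, std::size_t y_, std::size_t x_) {
    return rows[z * ny + y_][x_];
  }
};

// Step 1 of the strategy above: copy everything into one contiguous
// buffer (z-major, then y, then x) that the GPU code can take as a
// single pointer. With #1236 merged, this copy could be avoided.
std::vector<float> to_contiguous(const NonContiguous3D& a) {
  std::vector<float> buf;
  buf.reserve(a.nz * a.ny * a.nx);
  for (const auto& row : a.rows)
    buf.insert(buf.end(), row.begin(), row.end());
  return buf;
}
```

This is exactly the kind of per-call O(N) copy that contiguous-by-default allocation would eliminate.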
Some points from a discussion held with @casperdcl, @markus-jehl, @evgueni-ovtchinnikov and others on 31 Aug 2023:

- Using `cudaMallocManaged` in a few places might avoid the explicit transfer (of images only?) between CPU and GPU (CUDA will take care of it). One option could be to use `cudaMallocManaged` for all `Array`s. This could be extended to numerical operations, either by using libraries or by using `#ifdef`s. An example of this is in https://github.com/AMYPAD/NumCu/blob/main/numcu/src/elemwise.cu

Some anticipated difficulties:

- `cudaMallocManaged` will probably fail when asking for more memory than available on the GPU. I don't know if this is a problem when allocating multiple blocks (i.e. do they all need to fit in the GPU together?)
- … `ProjDataInMemory`, but this needs a refactor of the `ForwardProjectorByBin` class.

Comments/suggestions/PRs welcome!
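A minimal sketch of the `#ifdef` approach mentioned above: one allocation helper that uses CUDA unified memory when built with CUDA, and plain host memory otherwise. The `HAVE_CUDA` macro and the helper names are hypothetical illustrations, not existing STIR API:

```cpp
#include <cassert>
#include <cstdlib>

#ifdef HAVE_CUDA
#include <cuda_runtime.h>
#endif

// Allocate a buffer usable from both host code and (if built with CUDA)
// device code. With cudaMallocManaged, CUDA migrates pages between host
// and device on demand, so the explicit copies in the current projector
// code would not be needed.
float* allocate_shared(std::size_t n) {
#ifdef HAVE_CUDA
  float* p = nullptr;
  if (cudaMallocManaged(&p, n * sizeof(float)) != cudaSuccess)
    return nullptr; // e.g. when the request exceeds available GPU memory
  return p;
#else
  return static_cast<float*>(std::malloc(n * sizeof(float)));
#endif
}

void free_shared(float* p) {
#ifdef HAVE_CUDA
  cudaFree(p);
#else
  std::free(p);
#endif
}
```

The same pattern could be extended to element-wise numerical operations (as in the NumCu `elemwise.cu` example linked above), dispatching to a CUDA kernel or a plain loop depending on the build.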