
Upgrade to parallelproj 2.0 and use cuvec#1689

Merged
KrisThielemans merged 30 commits into UCL:master from KrisThielemans:parallelproj2.0
Mar 19, 2026

Conversation

@KrisThielemans
Collaborator

@KrisThielemans KrisThielemans commented Mar 7, 2026

See https://github.com/KUL-recon-lab/libparallelproj

Currently this PR is on top of #1676, although at least initially there is no good reason for that. Therefore, look only at the last commit(s) and ignore the test_Array failure. Sorry!

WARNING: Commits here will be rebased/squashed etc. The PR will probably also be split in 2 or 3 other PRs.

@gschramm @markus-jehl feel free to comment :-)

@KrisThielemans KrisThielemans self-assigned this Mar 7, 2026
@gschramm
Contributor

gschramm commented Mar 7, 2026

@KrisThielemans: in case you are wondering why the tof_sino_fwd / back projections are slightly different compared to libparallelproj v1.x - that is expected. In the new version I make sure that the "sum over TOF bins" of a TOF forward projection is the same as the non-TOF forward projection (if num_TOF_bins is big enough), even with truncated TOF kernels.

@KrisThielemans
Collaborator Author

Currently just getting zero in both fwd and backprojection...

@KrisThielemans
Collaborator Author

The code is currently confusing as I tried to make minimal changes, but given that the pre-processor symbol parallelproj_built_with_CUDA is NOT defined, we currently just fall back to what we did for the CPU version before (aside from the name change in the TOF projectors). I therefore don't know why it doesn't work :-(

@gschramm
Contributor

gschramm commented Mar 7, 2026

The code is currently confusing as I tried to make minimal changes, but given that the pre-processor symbol parallelproj_built_with_CUDA is NOT defined, we currently just fall back to what we did for the CPU version before (aside from the name change in the TOF projectors). I therefore don't know why it doesn't work :-(

At runtime, you can check whether libparallelproj was built with cuda using:
https://libparallelproj.readthedocs.io/en/v2.0.3/c_api.html#_CPPv425parallelproj_cuda_enabledv

and at CMake configure time the PARALLELPROJ_CUDA variable can be used:
https://github.com/KUL-recon-lab/libparallelproj?tab=readme-ov-file#checking-whether-the-installed-library-was-built-with-cuda
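A sketch of the configure-time check (hedged: the exact variable name follows the README linked above; the version argument is illustrative and whatever is done inside the branches is up to the consuming project):

```cmake
find_package(parallelproj 2.0 REQUIRED)
if (PARALLELPROJ_CUDA)
  message(STATUS "libparallelproj was built with CUDA support")
  # e.g. enable the managed-pointer code paths here
else()
  message(STATUS "libparallelproj is CPU-only")
endif()
```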

@KrisThielemans
Collaborator Author

KrisThielemans commented Mar 7, 2026

Sure, I meant that the old CUDA code is still present in the file, but it's intentionally never used as the preprocessor symbol isn't set.

@KrisThielemans
Collaborator Author

MacOS failure is due to unrelated #1691

@KrisThielemans KrisThielemans changed the title Upgrade to parallelproj 2.0 Upgrade to parallelproj 2.0 and use cuvec Mar 11, 2026
@KrisThielemans
Collaborator Author

Current status:

  • after removing the commits from #1676 ("add Array template index and use long long for ProjDataInMemory", which is WIP), the code worked OK (both with and without CUDA)
  • we changed xstart and xend to use CuVec, which gives us CUDA managed pointers. These two should therefore remain on the device across several calls, avoiding data copies. Initial testing shows some run-time problems.
  • we will also use this PR for using CuVec for the "internal" variables of the CudaGibbsPenalty branch. There will currently be little/no reduction in memory overhead here, but we no longer need explicit cudaMalloc and cudaFree, so the code should become smaller. To avoid too many code changes, I've overloaded array_to_device and array_to_host for CuVec objects (where we can just fall back to std::copy).

@KrisThielemans
Collaborator Author

could it be related to device syncing?

@markus-jehl
Contributor

Rebuilding the Docker image from scratch without cache resolved the issue, so it must have been an out-of-date dependency!

@KrisThielemans
Collaborator Author

From what I understood about the CUDA managed memory, the transparency is about data location (automatic CPU↔GPU), not about execution ordering. The sync is needed because GPU kernels are asynchronous — the host continues executing immediately after a kernel launch, so without a sync you risk reading managed memory before the kernel has finished writing to it.

ok, that is clear.

The old cudaMemcpy(DeviceToHost) didn't need an explicit sync because it's implicitly synchronizing — it blocks until all previous GPU operations complete. std::copy on managed memory has no such guarantee.

ok

What about the following then: change array_to_host to

void array_to_host(Array<num_dimensions, elemT>& stir_array, const CuVec<elemT>& dev_data, bool sync=true)

and drop the explicit syncs that @denproc inserted. This way array_to_host for a CuVec behaves the same as for an ordinary CUDA pointer, and we have less code again.

Good idea?

@KrisThielemans KrisThielemans linked an issue Mar 16, 2026 that may be closed by this pull request
@Dimitra-Kyriakopoulou
Contributor

Dear Professor @KrisThielemans,

  • I tested your array_to_host(..., bool sync = true) suggestion and pushed it here:
    Dimitra-Kyriakopoulou/STIR:project08_pr1689_sync_cleanup
    commit 354f20950

    It passed the focused STIR #1689 tests (test_blocks_on_cylindrical_projectors, test_OSMAPOSL_parallelproj, test_priors), rebuilt through reduced SIRF, and still passed the Project 8 Stage 1/2/3 checks on the GPU VM:

    • Stage 1 RMSE: 0.0
    • Stage 2 RMSE: 0.0
    • Stage 3 RMSE: 5.241103662179114e-10
  • I also checked the other two follow-up points:

    • d_scalar does not currently look like a good standalone cleanup, because it is already a size-1 CuVec<double> reduction buffer, so it is not a raw device pointer that still needs converting.
    • I ran a separate TOF/chunking validation on the exact #1689 stack. Forward projection matched exactly between chunked and unchunked runs; backprojection differences were very small and appear consistent with floating-point accumulation-order effects rather than a functional issue. For a reduced TOF geometry with 8 subset views:
      • chunk count 7: backward RMSE 2.0786981487323249e-05, max abs 0.00030517578125
      • chunk count 3: backward RMSE 2.0888311642741219e-05, max abs 0.000335693359375

@KrisThielemans
Collaborator Author

I've put the parallelproj 1.* compatibility in and tested it on my Linux system without CUDA. It'd be great if someone could test it with CUDA.

@markus-jehl
Contributor

I've put the parallelproj 1.* compatibility in and tested it on my Linux system without CUDA. It'd be great if someone could test it with CUDA.

Just tested it - all fine.

@Dimitra-Kyriakopoulou
Contributor

Dear Professor @KrisThielemans ,
I prepared four small follow-up commits on my fork:

  • project08_pr1689_array_to_host_sync_helper
    120807e47
    Moves the CuVec host synchronization into array_to_host(...).

  • project08_pr1689_pair_chunks
    57b9d975a
    Exposes num_gpu_chunks at the
    ProjectorByBinPairUsingParallelproj parameter-file level.

  • project08_pr1689_projdata_external_buffer
    de12a0032
    Adds a minimal ProjDataInMemory constructor so externally
    allocated contiguous Array storage can be adopted.

  • project08_pr1689_array_cuvec_bridge
    cabf5d3df
    Adds a small optional helper to let Array share ownership of
    CuVec-backed storage.

I am not entirely sure whether the preferred route here is to open separate follow-up PRs against parallelproj2.0. In case not, I am drawing your attention to these commits in case any of them are useful and you would like to pick them.

THANK YOU!!
Dimitra

@KrisThielemans
Collaborator Author

Dear Professor @KrisThielemans , I prepared four small follow-up commits on my fork:

  • project08_pr1689_array_to_host_sync_helper
    120807e47
    Moves the CuVec host synchronization into array_to_host(...).

I'll do this myself, as we need some documentation etc

  • project08_pr1689_pair_chunks
    57b9d975a
    Exposes num_gpu_chunks at the
    ProjectorByBinPairUsingParallelproj parameter-file level.

I've created KrisThielemans#11

  • project08_pr1689_projdata_external_buffer
    de12a0032
    Adds a minimal ProjDataInMemory constructor so externally
    allocated contiguous Array storage can be adopted.

I think this is already merged on master? #1694

  • project08_pr1689_array_cuvec_bridge
    cabf5d3df
    Adds a small optional helper to let Array share ownership of
    CuVec-backed storage.

Are you referring to Dimitra-Kyriakopoulou@3b1ea8f? This is an interesting approach. I'd like to get @casperdcl's opinion on that. However, let's not do that here, but in #1679. See below.

I am not entirely sure whether the preferred route here is to open separate follow-up PRs against parallelproj2.0. In case not, I am drawing your attention to these commits in case any of them are useful and you would like to pick them.

Thanks! I generally prefer PRs with a small aim. Some of the above really have nothing to do with this PR, only with the overall project, so I'd keep those for separate PRs. It also keeps the discussion focused.

@KrisThielemans
Collaborator Author

Could anyone check, please?

@Dimitra-Kyriakopoulou
Contributor

Dimitra-Kyriakopoulou commented Mar 18, 2026

Dear Professor @KrisThielemans,
I checked cbc8e4f on the GPU VM, and the focused CUDA/STIR tests passed:

  • test_blocks_on_cylindrical_projectors
  • test_OSMAPOSL_parallelproj
  • test_priors

However for safety, I think it would still be good if someone else could also check, because I already made errors today ...
In particular,

I think this is already merged on master?

Indeed! I am really sorry about that ...

Are you referring to
Dimitra-Kyriakopoulou@3b1ea8f?

Yes. cabf5d3df was the prior version; 3b1ea8f is the current one.

I am really sorry for all the carelessness, and thank you so much for your reply!!
Dimitra

@KrisThielemans KrisThielemans marked this pull request as ready for review March 18, 2026 23:27
@KrisThielemans
Collaborator Author

@casperdcl I'm afraid I don't have the time to fix the CMake for using find_package(cuvec). It's a bit tricky as we then need to add it to our STIRConfig.cmake (but only when using an external package). I think it's fine as-is.

@KrisThielemans
Collaborator Author

This should be fine now, aside from the release notes. I'd appreciate a full check and review from a few of you :-)

@KrisThielemans KrisThielemans added this to the v6.4 milestone Mar 18, 2026
@markus-jehl
Contributor

This should be fine now, aside from the release notes. I'd appreciate a full check and review from a few of you :-)

Tested it and reviewed the code. All good from my side.

@KrisThielemans
Collaborator Author

I'm not going to change history on this after all, and will merge as-is.

Thanks a lot all for your contributions!

@KrisThielemans KrisThielemans merged commit 2c0192b into UCL:master Mar 19, 2026
1 check passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in 2026-03 AIRBI Hackathon Mar 19, 2026
@KrisThielemans KrisThielemans deleted the parallelproj2.0 branch March 19, 2026 17:32
@casperdcl casperdcl mentioned this pull request Mar 19, 2026

Successfully merging this pull request may close these issues.

enable CUDA managed pointers

7 participants