
Upgrade to parallelproj 2.0 and use cuvec#1689

Merged
KrisThielemans merged 30 commits into UCL:master from KrisThielemans:parallelproj2.0
Mar 19, 2026

Conversation

@KrisThielemans
Collaborator

@KrisThielemans KrisThielemans commented Mar 7, 2026

See https://github.com/KUL-recon-lab/libparallelproj

Currently this PR is on top of #1676, although at least initially there is no good reason for that. Therefore, look only at the last commit(s) and ignore the test_Array failure. Sorry!

WARNING: Commits here will be rebased/squashed etc. The PR will probably also be split in 2 or 3 other PRs.

@gschramm @markus-jehl feel free to comment :-)

@KrisThielemans KrisThielemans self-assigned this Mar 7, 2026
@gschramm
Contributor

gschramm commented Mar 7, 2026

@KrisThielemans: in case you are wondering why the tof_sino_fwd / back projections are slightly different compared to libparallelproj v1.x - that is expected. In the new version I make sure that the "sum over TOF bins" of a TOF forward projection is the same as the non-TOF forward projection (if num_TOF_bins is big enough), even with truncated TOF kernels.

@KrisThielemans
Collaborator Author

Currently just getting zero in both fwd and backprojection...

@KrisThielemans
Collaborator Author

The code is currently confusing as I tried to make minimal changes, but given that the pre-processor symbol parallelproj_built_with_CUDA is NOT defined, we currently just fall back to what we did for the CPU version before (aside from the name change in the TOF projectors). I therefore don't know why it doesn't work :-(

@gschramm
Contributor

gschramm commented Mar 7, 2026

The code is currently confusing as I tried to make minimal changes, but given that the pre-processor symbol parallelproj_built_with_CUDA is NOT defined, we currently just fall back to what we did for the CPU version before (aside from the name change in the TOF projectors). I therefore don't know why it doesn't work :-(

At runtime, you can check whether libparallelproj was built with cuda using:
https://libparallelproj.readthedocs.io/en/v2.0.3/c_api.html#_CPPv425parallelproj_cuda_enabledv

and at CMake configure time the PARALLELPROJ_CUDA variable can be used:
https://github.com/KUL-recon-lab/libparallelproj?tab=readme-ov-file#checking-whether-the-installed-library-was-built-with-cuda
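A sketch of the configure-time check (hedged: the exact variable name follows the README linked above; the version argument is illustrative and whatever is done inside the branches is up to the consuming project):

```cmake
find_package(parallelproj 2.0 REQUIRED)
if (PARALLELPROJ_CUDA)
  message(STATUS "libparallelproj was built with CUDA support")
  # e.g. enable the managed-pointer code paths here
else()
  message(STATUS "libparallelproj is CPU-only")
endif()
```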

@KrisThielemans
Collaborator Author

KrisThielemans commented Mar 7, 2026

Sure, I meant that the old CUDA code is still present in the file, but it's intentionally never used as the preprocessor symbol isn't set.

@KrisThielemans
Collaborator Author

MacOS failure is due to unrelated #1691

@KrisThielemans KrisThielemans changed the title Upgrade to parallelproj 2.0 Upgrade to parallelproj 2.0 and use cuvec Mar 11, 2026
@KrisThielemans
Collaborator Author

Current status:

  • after removing the commits from #1676 ("add Array template index and use long long for ProjDataInMemory", which is WIP), the code worked OK (both with and without CUDA)
  • we changed xstart and xend to use CuVec, which gives us CUDA managed pointers. These two should therefore remain on the device across several calls, avoiding data copies. Initial testing shows some run-time problems.
  • we will also use this PR for using CuVec for the "internal" variables of the CudaGibbsPenalty branch. There will currently be little/no reduction in memory overhead here, but we no longer need explicit cudaMalloc and cudaFree, so the code should become smaller. To avoid too many code changes, I've overloaded array_to_device and array_to_host for CuVec objects (where we can just fall back to std::copy).

@KrisThielemans
Collaborator Author

could it be related to device syncing?

@markus-jehl
Contributor

Rebuilding the Docker image from scratch without cache resolved the issue, so it must have been an out-of-date dependency!

@KrisThielemans
Collaborator Author

From what I understood about the CUDA managed memory, the transparency is about data location (automatic CPU↔GPU), not about execution ordering. The sync is needed because GPU kernels are asynchronous — the host continues executing immediately after a kernel launch, so without a sync you risk reading managed memory before the kernel has finished writing to it.

ok, that is clear.

The old cudaMemcpy(DeviceToHost) didn't need an explicit sync because it's implicitly synchronizing — it blocks until all previous GPU operations complete. std::copy on managed memory has no such guarantee.

ok

What about the following then: change array_to_host to

void array_to_host(Array<num_dimensions, elemT>& stir_array, const CuVec<elemT>& dev_data, bool sync=true)

and drop the explicit syncs that @denproc inserted. This way array_to_host for a CuVec behaves the same as for an ordinary CUDA pointer, and we have less code again.

Good idea?

@KrisThielemans KrisThielemans linked an issue Mar 16, 2026 that may be closed by this pull request
@Dimitra-Kyriakopoulou
Contributor

Dear Professor @KrisThielemans,

  • I tested your array_to_host(..., bool sync = true) suggestion and pushed it here:
    Dimitra-Kyriakopoulou/STIR:project08_pr1689_sync_cleanup
    commit 354f20950

    It passed the focused STIR #1689 tests (test_blocks_on_cylindrical_projectors, test_OSMAPOSL_parallelproj, test_priors), rebuilt through reduced SIRF, and still passed the Project 8 Stage 1/2/3 checks on the GPU VM:

    • Stage 1 RMSE: 0.0
    • Stage 2 RMSE: 0.0
    • Stage 3 RMSE: 5.241103662179114e-10
  • I also checked the other two follow-up points:

    • d_scalar does not currently look like a good standalone cleanup, because it is already a size-1 CuVec<double> reduction buffer, so it is not a raw device pointer that still needs converting.
    • I ran a separate TOF/chunking validation on the exact #1689 stack. Forward projection matched exactly between chunked and unchunked runs; backprojection differences were very small and appear consistent with floating-point accumulation-order effects rather than a functional issue. For a reduced TOF geometry with 8 subset views:
      • chunk count 7: backward RMSE 2.0786981487323249e-05, max abs 0.00030517578125
      • chunk count 3: backward RMSE 2.0888311642741219e-05, max abs 0.000335693359375

@KrisThielemans
Collaborator Author

I've put the parallelproj 1.* compatibility in and tested it on my Linux system without CUDA. It'd be great if someone could test it with CUDA.

@markus-jehl
Contributor

I've put the parallelproj 1.* compatibility in and tested it on my Linux system without CUDA. It'd be great if someone could test it with CUDA.

Just tested it - all fine.

@Dimitra-Kyriakopoulou
Contributor

Dear Professor @KrisThielemans ,
I prepared four small follow-up commits on my fork:

  • project08_pr1689_array_to_host_sync_helper
    120807e47
    Moves the CuVec host synchronization into array_to_host(...).

  • project08_pr1689_pair_chunks
    57b9d975a
    Exposes num_gpu_chunks at the
    ProjectorByBinPairUsingParallelproj parameter-file level.

  • project08_pr1689_projdata_external_buffer
    de12a0032
    Adds a minimal ProjDataInMemory constructor so externally
    allocated contiguous Array storage can be adopted.

  • project08_pr1689_array_cuvec_bridge
    cabf5d3df
    Adds a small optional helper to let Array share ownership of
    CuVec-backed storage.

I am not entirely sure whether the preferred route here is to open separate follow-up PRs against parallelproj2.0. In case not, I am drawing your attention to these commits in case any of them are useful and you would like to pick them.

THANK YOU!!
Dimitra

@KrisThielemans
Collaborator Author

Dear Professor @KrisThielemans , I prepared four small follow-up commits on my fork:

  • project08_pr1689_array_to_host_sync_helper
    120807e47
    Moves the CuVec host synchronization into array_to_host(...).

I'll do this myself, as we need some documentation etc

  • project08_pr1689_pair_chunks
    57b9d975a
    Exposes num_gpu_chunks at the
    ProjectorByBinPairUsingParallelproj parameter-file level.

I've created KrisThielemans#11

  • project08_pr1689_projdata_external_buffer
    de12a0032
    Adds a minimal ProjDataInMemory constructor so externally
    allocated contiguous Array storage can be adopted.

I think this is already merged on master? #1694

  • project08_pr1689_array_cuvec_bridge
    cabf5d3df
    Adds a small optional helper to let Array share ownership of
    CuVec-backed storage.

Are you referring to Dimitra-Kyriakopoulou@3b1ea8f? This is an interesting approach. I'd like to get @casperdcl's opinion on that. However, let's not do that here, but in #1679. See below.

I am not entirely sure whether the preferred route here is to open separate follow-up PRs against parallelproj2.0. In case not, I am drawing your attention to these commits in case any of them are useful and you would like to pick them.

Thanks! I generally prefer PRs with a small aim. Some of the above really have nothing to do with this PR, only with the overall project, so I'd keep those for separate PRs. It also keeps the discussion focused.

@KrisThielemans
Collaborator Author

Could anyone check, please?

@Dimitra-Kyriakopoulou
Contributor

Dimitra-Kyriakopoulou commented Mar 18, 2026

Dear Professor @KrisThielemans,
I checked cbc8e4f on the GPU VM, and the focused CUDA/STIR tests passed:

  • test_blocks_on_cylindrical_projectors
  • test_OSMAPOSL_parallelproj
  • test_priors

However for safety, I think it would still be good if someone else could also check, because I already made errors today ...
In particular,

I think this is already merged on master?

Indeed! I am really sorry about that ...

Are you referring to
Dimitra-Kyriakopoulou@3b1ea8f?

Yes. cabf5d3df was the prior version; 3b1ea8f is the current one.

I am really sorry for all the carelessness, and thank you so much for your reply!!
Dimitra

@KrisThielemans KrisThielemans marked this pull request as ready for review March 18, 2026 23:27
@KrisThielemans
Collaborator Author

@casperdcl I'm afraid I don't have the time to fix the CMake for using find_package(cuvec). It's a bit tricky as we then need to add it to our STIRConfig.cmake (but only when using an external package). I think it's fine as-is.

@KrisThielemans
Collaborator Author

This should be fine now, aside from the release notes. I'd appreciate a full check and review from a few of you :-)

@KrisThielemans KrisThielemans added this to the v6.4 milestone Mar 18, 2026
@markus-jehl
Contributor

This should be fine now, aside from the release notes. I'd appreciate a full check and review from a few of you :-)

Tested it and reviewed the code. All good from my side.

@KrisThielemans
Collaborator Author

I'm not going to change history on this after all, and will merge as-is.

Thanks a lot all for your contributions!

@KrisThielemans KrisThielemans merged commit 2c0192b into UCL:master Mar 19, 2026
1 check passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in 2026-03 AIRBI Hackathon Mar 19, 2026
@KrisThielemans KrisThielemans deleted the parallelproj2.0 branch March 19, 2026 17:32
@casperdcl casperdcl mentioned this pull request Mar 19, 2026

Successfully merging this pull request may close these issues.

enable CUDA managed pointers

7 participants