Skip to content

Add CuPy GPU simulation with double-buffered streaming pipeline and nsys profiling#5

Draft
Copilot wants to merge 8 commits intomainfrom
copilot/add-cupy-simulation
Draft

Add CuPy GPU simulation with double-buffered streaming pipeline and nsys profiling#5
Copilot wants to merge 8 commits intomainfrom
copilot/add-cupy-simulation

Conversation

Copy link
Copy Markdown
Contributor

Copilot AI commented Apr 6, 2026

Ports the simulation to a CuPy CUDA backend using a three-stream double-buffered pipeline that overlaps GPU computation with CPU disk I/O.

Architecture (game_of_life_cupy.py)

  • next_generation_cupy(grid) — drop-in CuPy equivalent of the NumPy generation step
  • simulate_cupy(...) — streaming pipeline:
    • One sim_stream owns all simulation kernels
    • Two alternating out_streams[0/1] handle non-blocking D2H transfers into pinned (page-locked) host memory after each chunk, using separate per-slot chunk_gpus[0/1] device buffers to eliminate read-after-write hazards and allow true stream overlap
    • out_streams[i%2].wait_event(sim_stream.record()) gates each transfer on its chunk's simulation completing
    • After enqueuing chunk N's work, the main thread synchronises chunk N-1's output stream and writes to disk while the GPU is already running chunk N
    • CLI interface identical to game_of_life.py
GPU sim_stream:   [sim chunk 0]             [sim chunk 1]             [sim chunk 2]
GPU out_streams:  [D2H chunk 0 → pinned]    [D2H chunk 1 → pinned]    [D2H chunk 2 → pinned]
CPU main thread:                            sync + save(chunk 0)       sync + save(chunk 1)

Per-chunk output files

Each chunk is saved to its own constant-size .npy file — no file ever grows over time:

simulation_000000.npy          — initial frame (generation 0), shape (1, H, W)
simulation_000001-000010.npy   — generations 1–10,  shape (chunk_size, H, W)
simulation_000011-000020.npy   — generations 11–20, shape (chunk_size, H, W)
…

The filename base and extension are derived from the --output argument. Every write is O(chunk_size) bytes regardless of how many chunks have run before.

GPU Profiling (nsys profile)

Each chunk's simulation-kernel batch and D2H transfer are wrapped with cupyx.profiler.time_range NVTX range annotations. The ranges appear as labelled bands in the NVIDIA Nsight Systems timeline when the process is launched under nsys profile, making it easy to distinguish kernel time from transfer time:

nsys profile python game_of_life_cupy.py --steps 100

Falls back to contextlib.nullcontext() when cupyx is not installed, so there is no overhead in environments without a GPU.

Bug Fixes

  • Fixed np.frombuffer crash when wrapping a PinnedMemoryPointer: added explicit count=n argument, which is required because PinnedMemoryPointer does not always expose its size through the Python buffer protocol.
  • Fixed stream serialisation: replaced the single shared chunk_gpu buffer with per-slot chunk_gpus[0/1] buffers so the CUDA driver sees no read-after-write dependency between the D2H transfer stream and the simulation stream, enabling true concurrent execution.

Tests (test_game_of_life_cupy.py)

21 tests backed by a numpy _CpArray mock (runs without GPU/CuPy) covering:

  • All Conway-rule unit tests, verified byte-identical against the NumPy implementation
  • Full integration suite mirroring test_game_of_life.py; helpers updated to collect and concatenate per-chunk files using TemporaryDirectory + glob
  • test_streams_alternated — asserts both output streams are used and synchronised across chunks
  • test_cupy_unavailable_raises — clean RuntimeError when CuPy is absent
  • test_nvtx_ranges_entered — asserts one "kernel *" range and one "D2H *" range are entered per chunk

Docs / deps

  • requirements.txt — optional CuPy install instructions (cupy-cuda12x / cupy-cuda11x)
  • README.md — documents both backends and the streaming pipeline design

Copilot AI linked an issue Apr 6, 2026 that may be closed by this pull request
Copilot AI changed the title [WIP] Add cupy simulation with non-blocking transfers Add CuPy GPU simulation with double-buffered streaming pipeline Apr 6, 2026
Copilot AI requested a review from tackandr April 6, 2026 18:00
@tackandr
Copy link
Copy Markdown
Owner

tackandr commented Apr 6, 2026

@copilot
File "/home/users/xxx/python/GameOfLife/game_of_life_cupy.py", line 132, in
np.frombuffer(pinned_mems[i], dtype=np.uint8).reshape(
ValueError: cannot reshape array of size 131072 into shape (10,100,100)

Copy link
Copy Markdown
Contributor Author

Copilot AI commented Apr 6, 2026

@copilot
File "/home/users/xxx/python/GameOfLife/game_of_life_cupy.py", line 132, in
np.frombuffer(pinned_mems[i], dtype=np.uint8).reshape(
...

The traceback at line 132 was caused by np.frombuffer being called on a PinnedMemoryPointer without an explicit count. CuPy's PinnedMemoryPointer does not reliably expose its size through the Python buffer protocol, so NumPy couldn't determine how many elements to read.

Fixed in 45b368a by adding count=n (matching the pattern in CuPy's own docs):

n = chunk_size * height * width
pinned_arrs = [
    np.frombuffer(pinned_mems[i], dtype=np.uint8, count=n).reshape(chunk_size, height, width)
    for i in range(2)
]

@tackandr
Copy link
Copy Markdown
Owner

tackandr commented Apr 6, 2026

@copilot add timing flags for gpu profiling. For memory transfers and kernels.

Copy link
Copy Markdown
Contributor Author

Copilot AI commented Apr 6, 2026

@copilot add timing flags for gpu profiling. For memory transfers and kernels.

Done in b638a93. Added --profile / profile=True flag that uses cp.cuda.Event pairs to time both the simulation kernels and D2H transfers on each chunk, then prints a summary:

  [profile] chunk 1–2: kernel=0.142 ms, D2H=0.031 ms
  [profile] chunk 3–4: kernel=0.139 ms, D2H=0.029 ms

[profile] Total kernel time : 0.281 ms
[profile] Total D2H time    : 0.060 ms
[profile] Total GPU+D2H     : 0.341 ms

Events are recorded inside each buffer slot so timing resolution is per-chunk and both slots alternate with the pipeline. Off by default (zero overhead when not used).

Copilot AI changed the title Add CuPy GPU simulation with double-buffered streaming pipeline Add CuPy GPU simulation with double-buffered streaming pipeline and GPU profiling Apr 6, 2026
@tackandr
Copy link
Copy Markdown
Owner

tackandr commented Apr 6, 2026

@copilot Do it so that I can use nsys profile to analyze.

Stuff like this

import cupyx
from cupyx.profiler import time_range
from cupyx.profiler import benchmark

Copy link
Copy Markdown
Contributor Author

Copilot AI commented Apr 6, 2026

@copilot Do it so that I can use nsys profile to analyze.

Stuff like this
...

Done in 778d4a3. Each chunk's kernel batch and D2H transfer are now wrapped with cupyx.profiler.time_range NVTX annotations, so they show up as labelled ranges in the Nsight Systems timeline:

from cupyx.profiler import time_range as _nvtx_range

with _profiling_range(f"kernel {frame}-{chunk_end - 1}"):
    for i in range(chunk_frames):
        grid_gpu = next_generation_cupy(grid_gpu)
        chunk_gpu[i] = grid_gpu

with _profiling_range(f"D2H {frame}-{chunk_end - 1}"):
    chunk_gpu[:chunk_frames].get(out=pinned_arrs[buf_idx][:chunk_frames], ...)

Run with:

nsys profile python game_of_life_cupy.py --steps 100

The ranges fall back to contextlib.nullcontext() when cupyx isn't installed, so there's no runtime cost without a GPU. The existing --profile Event-based timing still works independently.

@tackandr
Copy link
Copy Markdown
Owner

tackandr commented Apr 6, 2026

@copilot I inspected the report. The IO is running parallel with the simulation but they are running sequencially.

Copy link
Copy Markdown
Contributor Author

Copilot AI commented Apr 6, 2026

@copilot I inspected the report. The IO is running parallel with the simulation but they are running sequencially.

Fixed in ebb523a. The root cause was a shared chunk_gpu buffer across both double-buffer slots. When chunk N+1's kernels (on sim_stream) wrote to the same buffer that chunk N's D2H (on out_stream) was still reading, the CUDA driver detected the read-after-write hazard and serialised the two streams to preserve correctness — eliminating the intended overlap.

The fix: allocate a separate chunk_gpus[0] and chunk_gpus[1], one per slot, mirroring the existing pinned_arrs[0/1] pattern. Now sim_stream writing to chunk_gpus[buf_idx] and out_stream reading from chunk_gpus[prev_buf] always touch different device buffers, so the CUDA driver has no dependency to enforce and the streams run concurrently.

Before:  chunk_gpu (shared)
           sim_stream:  ──[write chunk N]──[write chunk N+1]──
           out_stream:              ──[read chunk N]──  ← same buffer → CUDA serialises

After:   chunk_gpus[0] / chunk_gpus[1] (per-slot)
           sim_stream:  ──[write to gpus[0]]──[write to gpus[1]]──
           out_stream:             ──[read from gpus[0]]──  ← different buffer → true overlap

@tackandr
Copy link
Copy Markdown
Owner

tackandr commented Apr 6, 2026

@copilot The file writing keeps getting bigger each time. Write each chunk to a separate file.

Copy link
Copy Markdown
Contributor Author

Copilot AI commented Apr 6, 2026

@copilot The file writing keeps getting bigger each time. Write each chunk to a separate file.

Fixed in 1528db3. The history accumulation array is gone — each chunk is now written directly to its own constant-size file derived from the output path:

simulation_000000.npy          — initial frame (generation 0), shape (1, H, W)
simulation_000001-000010.npy   — generations 1–10,  shape (10, H, W)
simulation_000011-000020.npy   — generations 11–20, shape (10, H, W)
…

Every write is exactly chunk_size × H × W bytes — no file ever grows.

@tackandr
Copy link
Copy Markdown
Owner

tackandr commented Apr 6, 2026

@copilot remove the --profile related code and only keep nsys profiling

Copy link
Copy Markdown
Contributor Author

Copilot AI commented Apr 6, 2026

@copilot remove the --profile related code and only keep nsys profiling

Done in f1cafab. Removed the --profile CLI flag, the profile parameter from simulate_cupy, and all cp.cuda.Event timing/summary code. The cupyx.profiler.time_range NVTX annotations remain, so kernel and D2H ranges are still visible in Nsight Systems:

nsys profile python game_of_life_cupy.py --steps 100

Copilot AI changed the title Add CuPy GPU simulation with double-buffered streaming pipeline and GPU profiling Add CuPy GPU simulation with double-buffered streaming pipeline and nsys profiling Apr 6, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add cupy simulation.

2 participants