Add CuPy GPU simulation with double-buffered streaming pipeline and nsys profiling #5
Conversation
Agent-Logs-Url: https://github.com/tackandr/GameOfLife/sessions/73c2072f-dbe4-42b6-a809-9f80d0c2d78e Co-authored-by: tackandr <26540124+tackandr@users.noreply.github.com>
@copilot
…yPointer Agent-Logs-Url: https://github.com/tackandr/GameOfLife/sessions/403464e4-7b97-4a76-94f9-0c3862dee3a4 Co-authored-by: tackandr <26540124+tackandr@users.noreply.github.com>
The traceback at line 132 was caused by `np.frombuffer` being unable to infer the buffer size from a `PinnedMemoryPointer`. Fixed in 45b368a by passing an explicit element count:

```python
n = chunk_size * height * width
pinned_arrs = [
    np.frombuffer(pinned_mems[i], dtype=np.uint8, count=n).reshape(chunk_size, height, width)
    for i in range(2)
]
```
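A rough CPU-only illustration of why the explicit `count` matters, using a plain `bytearray` as a stand-in for the pinned allocation (the padding amount is invented for the example):

```python
import numpy as np

chunk_size, height, width = 4, 8, 8
n = chunk_size * height * width

# Stand-in for a pinned allocation: the raw byte length of the backing
# buffer can exceed the n bytes we actually want to view.
buf = bytearray(n + 128)

# Without count=n, frombuffer would consume all n + 128 bytes and the
# reshape to (chunk_size, height, width) would fail.
arr = np.frombuffer(buf, dtype=np.uint8, count=n).reshape(chunk_size, height, width)
print(arr.shape)  # (4, 8, 8)
```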
@copilot add timing flags for GPU profiling, for both memory transfers and kernels.
Agent-Logs-Url: https://github.com/tackandr/GameOfLife/sessions/61155756-d5c8-4927-af94-e8a2b970ced3 Co-authored-by: tackandr <26540124+tackandr@users.noreply.github.com>
Done in b638a93. Added CUDA event-based timing for transfers and kernels. Events are recorded inside each buffer slot, so timing resolution is per-chunk and both slots alternate with the pipeline. Off by default (zero overhead when not used).
@copilot Do it so that I can use `nsys profile` to analyze. Stuff like this: `import cupyx`
Agent-Logs-Url: https://github.com/tackandr/GameOfLife/sessions/5b93455e-d5ae-49b6-b4e7-8a1dc1f80616 Co-authored-by: tackandr <26540124+tackandr@users.noreply.github.com>
Done in 778d4a3. Each chunk's kernel batch and D2H transfer are now wrapped with NVTX ranges via `cupyx.profiler.time_range`:

```python
from cupyx.profiler import time_range as _nvtx_range

with _profiling_range(f"kernel {frame}-{chunk_end - 1}"):
    for i in range(chunk_frames):
        grid_gpu = next_generation_cupy(grid_gpu)
        chunk_gpu[i] = grid_gpu

with _profiling_range(f"D2H {frame}-{chunk_end - 1}"):
    chunk_gpu[:chunk_frames].get(out=pinned_arrs[buf_idx][:chunk_frames], ...)
```

Run the process under `nsys profile` and the ranges appear as labelled bands in the Nsight Systems timeline. The ranges fall back to `contextlib.nullcontext()` when `cupyx` is not installed.
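A minimal sketch of the fallback pattern described above. The try/except shape and the way `_profiling_range` is bound are assumptions about the implementation; only the `time_range` / `nullcontext` pairing comes from the PR:

```python
import contextlib

try:
    # Real NVTX ranges, visible as labelled bands under nsys profile.
    from cupyx.profiler import time_range as _profiling_range
except ImportError:
    # No cupyx: a no-op context manager, so pipeline code can use
    # `with _profiling_range(...)` unconditionally with zero overhead.
    def _profiling_range(message):
        return contextlib.nullcontext()

with _profiling_range("kernel 0-9"):
    total = sum(range(10))  # stand-in for the kernel batch
print(total)  # 45
```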
@copilot I inspected the report. The IO is supposed to run in parallel with the simulation, but they are running sequentially.
…dependency Agent-Logs-Url: https://github.com/tackandr/GameOfLife/sessions/9b069bf9-8f50-4fe8-bfc5-c7bcdc028636 Co-authored-by: tackandr <26540124+tackandr@users.noreply.github.com>
Fixed in ebb523a. The root cause was a shared `chunk_gpu` device buffer: the CUDA driver saw a read-after-write dependency between the D2H transfer stream and the simulation stream and serialised them. The fix: allocate a separate `chunk_gpus[0/1]` device buffer per slot so the two streams can genuinely overlap.
@copilot The file writing keeps getting bigger each time. Write each chunk to a separate file.
Agent-Logs-Url: https://github.com/tackandr/GameOfLife/sessions/86e9d635-92a4-47d0-968d-8982581693e8 Co-authored-by: tackandr <26540124+tackandr@users.noreply.github.com>
Fixed in 1528db3. Each chunk is now written to its own constant-size `.npy` file, so no file grows over time. Every write is O(chunk_size) bytes, regardless of how many chunks have run before.
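A sketch of deriving per-chunk filenames from the `--output` argument. The `_{chunk_idx:05d}` suffix scheme and the helper name `chunk_path` are illustrative; the PR derives the base name and extension from `--output`, but the exact pattern may differ:

```python
from pathlib import Path

def chunk_path(output: str, chunk_idx: int) -> Path:
    # Split --output into stem and extension, then append a zero-padded
    # chunk index so each chunk lands in its own constant-size file.
    p = Path(output)
    return p.with_name(f"{p.stem}_{chunk_idx:05d}{p.suffix}")

print(chunk_path("frames.npy", 3))  # frames_00003.npy
```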
@copilot remove the --profile related code and only keep nsys profiling
Agent-Logs-Url: https://github.com/tackandr/GameOfLife/sessions/fb8692a8-a774-4996-9184-a54d856c6f04 Co-authored-by: tackandr <26540124+tackandr@users.noreply.github.com>
Done in f1cafab. Removed the `--profile` timing code; the NVTX/nsys instrumentation is unchanged.
Ports the simulation to a CuPy CUDA backend using a three-stream double-buffered pipeline that overlaps GPU computation with CPU disk I/O.
### Architecture (`game_of_life_cupy.py`)

- `next_generation_cupy(grid)` — drop-in CuPy equivalent of the NumPy generation step
- `simulate_cupy(...)` — streaming pipeline:
  - `sim_stream` owns all simulation kernels
  - `out_streams[0/1]` handle non-blocking D2H transfers into pinned (page-locked) host memory after each chunk, using separate per-slot `chunk_gpus[0/1]` device buffers to eliminate read-after-write hazards and allow true stream overlap
  - `out_streams[i % 2].wait_event(sim_stream.record())` gates each transfer on its chunk's simulation completing
- `game_of_life.py`

### Per-chunk output files

Each chunk is saved to its own constant-size `.npy` file — no file ever grows over time. The filename base and extension are derived from the `--output` argument. Every write is O(chunk_size) bytes regardless of how many chunks have run before.

### GPU profiling (`nsys profile`)

Each chunk's simulation-kernel batch and D2H transfer are wrapped with `cupyx.profiler.time_range` NVTX range annotations. The ranges appear as labelled bands in the NVIDIA Nsight Systems timeline when the process is launched under `nsys profile`, making it easy to distinguish kernel time from transfer time. Falls back to `contextlib.nullcontext()` when `cupyx` is not installed, so there is no overhead in environments without a GPU.

### Bug fixes

- `np.frombuffer` crash when wrapping a `PinnedMemoryPointer`: added an explicit `count=n` argument, required because `PinnedMemoryPointer` does not always expose its size through the Python buffer protocol.
- Replaced the shared `chunk_gpu` buffer with per-slot `chunk_gpus[0/1]` buffers so the CUDA driver sees no read-after-write dependency between the D2H transfer stream and the simulation stream, enabling true concurrent execution.

### Tests (`test_game_of_life_cupy.py`)

21 tests backed by a NumPy `_CpArray` mock (runs without GPU/CuPy), covering:

- parity with `test_game_of_life.py`; helpers updated to collect and concatenate per-chunk files using `TemporaryDirectory` + `glob`
- `test_streams_alternated` — asserts both output streams are used and synchronised across chunks
- `test_cupy_unavailable_raises` — clean `RuntimeError` when CuPy is absent
- `test_nvtx_ranges_entered` — asserts one `"kernel *"` range and one `"D2H *"` range are entered per chunk

### Docs / deps

- `requirements.txt` — optional CuPy install instructions (`cupy-cuda12x` / `cupy-cuda11x`)
- `README.md` — documents both backends and the streaming pipeline design
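A CPU-only sketch of the double-buffered chunking described above. NumPy stands in for CuPy and plain function calls for streams, so this shows only the slot/buffer bookkeeping (alternating `chunk_bufs[chunk_idx % 2]` so a "D2H" copy never races the next chunk's kernels), not real asynchrony; `next_generation` is a toroidal Life step written for the example:

```python
import numpy as np

def next_generation(grid):
    # Stand-in for next_generation_cupy: count the 8 toroidal neighbours.
    n = sum(np.roll(np.roll(grid, dy, 0), dx, 1)
            for dy in (-1, 0, 1) for dx in (-1, 0, 1)) - grid
    return ((n == 3) | ((grid == 1) & (n == 2))).astype(np.uint8)

def simulate(grid, frames, chunk_size):
    h, w = grid.shape
    # Two per-slot chunk buffers (chunk_gpus[0/1] in the PR): each "D2H"
    # copy reads its own slot, so the next chunk's kernels never
    # overwrite data that is still being copied out.
    chunk_bufs = [np.empty((chunk_size, h, w), np.uint8) for _ in range(2)]
    out = []
    for chunk_idx, start in enumerate(range(0, frames, chunk_size)):
        buf = chunk_bufs[chunk_idx % 2]          # alternate slots
        n_frames = min(chunk_size, frames - start)
        for i in range(n_frames):                # "sim_stream" work
            grid = next_generation(grid)
            buf[i] = grid
        out.append(buf[:n_frames].copy())        # "D2H" into host memory
    return np.concatenate(out), grid

glider = np.zeros((8, 8), np.uint8)
glider[0, 1] = glider[1, 2] = glider[2, 0] = glider[2, 1] = glider[2, 2] = 1
frames_arr, final = simulate(glider, frames=8, chunk_size=3)
print(frames_arr.shape)  # (8, 8, 8)
```

In the real pipeline the per-slot buffers are what let `out_streams[i % 2]` run concurrently with `sim_stream`; here they only demonstrate the indexing.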