
gpu-benchmarking

Here are 7 public repositories matching this topic...

jetson-orin-matmul-analysis

CUDA matrix multiplication benchmarking on Jetson Orin Nano. Four implementations, three power modes, five matrix sizes. 99.5% mathematical validation. C++/CUDA and Python.

  • Updated Mar 23, 2026
  • Python

🔍 Analyze CUDA matrix multiplication performance and power consumption on NVIDIA Jetson Orin Nano across multiple implementations and settings.

  • Updated Mar 23, 2026
  • Python
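The two entries above describe benchmarking several matrix-multiplication implementations across sizes and validating results against a reference. As a minimal sketch of that harness shape, here is a pure-Python stand-in (the CUDA kernels themselves are not reproduced; all function names here are illustrative assumptions, not taken from either repo):

```python
# Hypothetical sketch: time multiple matmul implementations across
# matrix sizes and validate each against a reference result, the
# general pattern a benchmark like the above would follow.
import time


def matmul_naive(a, b):
    """Reference triple-loop matrix multiply."""
    n, k, m = len(a), len(b), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = 0.0
            for p in range(k):
                s += a[i][p] * b[p][j]
            out[i][j] = s
    return out


def matmul_transposed(a, b):
    # Transpose b first for contiguous row access -- mirrors a common
    # memory-access optimization in CUDA matmul kernels.
    bt = list(map(list, zip(*b)))
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def bench(fn, a, b, reps=3):
    """Best-of-N wall-clock time for one implementation."""
    best = float("inf")
    for _ in range(reps):
        t0 = time.perf_counter()
        fn(a, b)
        best = min(best, time.perf_counter() - t0)
    return best


def validate(ref, got, tol=1e-9):
    """Element-wise comparison against the reference result."""
    return all(
        abs(r - g) <= tol
        for rr, gr in zip(ref, got)
        for r, g in zip(rr, gr)
    )


if __name__ == "__main__":
    for n in (8, 16, 32):
        a = [[float(i + j) for j in range(n)] for i in range(n)]
        b = [[float(i - j) for j in range(n)] for i in range(n)]
        ref = matmul_naive(a, b)
        for fn in (matmul_naive, matmul_transposed):
            ok = validate(ref, fn(a, b))
            print(f"n={n} {fn.__name__}: {bench(fn, a, b):.6f}s ok={ok}")
```

The real repos would additionally sweep Jetson power modes (e.g. via `nvpmodel`) and record power draw alongside timings; that hardware-specific part is omitted here.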

GPT-2 (124M) fixed-work distributed training benchmark on NYU BigPurple (Slurm), scaling from 1 to 8× V100 across 2 nodes using DeepSpeed ZeRO-1 with FP16/AMP. Includes a reproducible harness that writes training_metrics.json, RUN_COMPLETE.txt, and launcher metadata for each run, plus NCCL topology/log artifacts and Nsight Systems traces and summaries (NVTX + NCCL ranges).

  • Updated Feb 4, 2026
  • Python
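The entry above mentions a harness that emits per-run artifacts (training_metrics.json, RUN_COMPLETE.txt, launcher metadata). A minimal sketch of that artifact layout, assuming illustrative field names not taken from the repo:

```python
# Hypothetical sketch of a per-run artifact writer: metrics JSON,
# launcher metadata JSON, and a completion sentinel file that lets
# downstream analysis skip crashed or incomplete runs.
import json
import time
from pathlib import Path


def write_run_artifacts(run_dir, metrics, launcher_meta):
    run_dir = Path(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)
    # Metrics gathered during training (loss, throughput, etc.).
    (run_dir / "training_metrics.json").write_text(json.dumps(metrics, indent=2))
    # Launcher metadata: how the job was started (world size, nodes, precision).
    (run_dir / "launcher_metadata.json").write_text(json.dumps(launcher_meta, indent=2))
    # Sentinel written last, so its presence implies the run finished.
    (run_dir / "RUN_COMPLETE.txt").write_text(f"completed at {time.time():.0f}\n")


def run_is_complete(run_dir):
    return (Path(run_dir) / "RUN_COMPLETE.txt").exists()


if __name__ == "__main__":
    import tempfile
    with tempfile.TemporaryDirectory() as d:
        write_run_artifacts(
            d,
            metrics={"final_loss": 3.21, "tokens_per_sec": 15000},
            launcher_meta={"world_size": 8, "nodes": 2, "precision": "fp16"},
        )
        print(run_is_complete(d))
```

Writing the sentinel file last is the key design choice: any analysis script can trust that a directory containing RUN_COMPLETE.txt holds a full set of artifacts.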
