CUDA matrix multiplication benchmarking on Jetson Orin Nano. Four implementations, three power modes, five matrix sizes. 99.5% mathematical validation. C++/CUDA and Python.
Updated Mar 23, 2026 - Python
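The "99.5% mathematical validation" figure above presumably means the fraction of output elements that match a CPU reference within tolerance. A minimal sketch of that kind of check, using NumPy in place of an actual CUDA kernel's output (the `validation_rate` helper and the tolerance are assumptions, not the repo's actual method):

```python
import numpy as np

def validation_rate(gpu_result, reference, rtol=1e-3):
    """Fraction of elements matching the reference within relative tolerance."""
    return np.isclose(gpu_result, reference, rtol=rtol).mean()

# Simulate: float64 reference vs. a float32 product standing in for a CUDA kernel
rng = np.random.default_rng(0)
a = rng.standard_normal((256, 256)).astype(np.float32)
b = rng.standard_normal((256, 256)).astype(np.float32)
reference = a.astype(np.float64) @ b.astype(np.float64)
gpu_like = (a @ b).astype(np.float64)  # hypothetical device result

rate = validation_rate(gpu_like, reference)
print(f"validation rate: {rate:.1%}")
```

Comparing against a higher-precision reference (float64 here) is the usual way to separate genuine kernel bugs from ordinary float32 rounding.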
Standalone LLM inference benchmarking pipelines on AMD GPUs using ROCm, vLLM, MAD, and data visualization scripts.
🔍 Analyze CUDA matrix multiplication performance and power consumption on NVIDIA Jetson Orin Nano across multiple implementations and settings.
One-shot script to audit GPU, CUDA, PyTorch, CPU, and disk performance before debugging a slow or broken ML environment.
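The shape of such an audit script can be sketched with the standard library alone; `audit_environment` and its report keys are hypothetical, and real GPU/CUDA probes (e.g. via `torch.cuda` or `nvidia-smi`) would slot in alongside the presence checks:

```python
import importlib.util, os, platform, tempfile, time

def audit_environment(disk_mb=16):
    """Quick sanity report: Python/CPU info, key ML packages, disk write speed."""
    report = {
        "python": platform.python_version(),
        "machine": platform.machine(),
        "cpu_count": os.cpu_count(),
    }
    # Presence checks only -- importing torch just to probe CUDA can be slow
    for pkg in ("torch", "numpy"):
        report[f"has_{pkg}"] = importlib.util.find_spec(pkg) is not None
    # Crude sequential-write benchmark
    payload = b"\0" * (1024 * 1024)
    with tempfile.NamedTemporaryFile(delete=False) as f:
        start = time.perf_counter()
        for _ in range(disk_mb):
            f.write(payload)
        f.flush()
        os.fsync(f.fileno())
        elapsed = time.perf_counter() - start
    os.unlink(f.name)
    report["disk_write_mb_s"] = round(disk_mb / elapsed, 1)
    return report

print(audit_environment())
```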
benchHUB is a Python-based project to parse, aggregate, and visualize system and performance benchmarks. It includes a Streamlit dashboard to display and compare results.
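The parse-and-aggregate step such a tool performs can be sketched as grouping per-run records by benchmark name and summarizing latency; the JSON schema and the `aggregate` helper below are assumptions for illustration, not benchHUB's actual format:

```python
import json, statistics
from collections import defaultdict

# Hypothetical raw records, one per benchmark run
raw = """
[{"bench": "matmul_1024", "ms": 3.1}, {"bench": "matmul_1024", "ms": 2.9},
 {"bench": "matmul_2048", "ms": 21.4}, {"bench": "matmul_2048", "ms": 22.0}]
"""

def aggregate(records):
    """Group runs by benchmark name and summarize latency per group."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["bench"]].append(rec["ms"])
    return {
        name: {"runs": len(times),
               "mean_ms": round(statistics.mean(times), 2),
               "min_ms": min(times)}
        for name, times in groups.items()
    }

summary = aggregate(json.loads(raw))
print(summary)
```

A dashboard layer (Streamlit, in benchHUB's case) would then render tables or charts from a summary dict like this one.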
GPT-2 (124M) fixed-work distributed training benchmark on NYU BigPurple (Slurm), scaling from 1 to 8× V100 GPUs across 2 nodes with DeepSpeed ZeRO-1 and FP16/AMP. Includes a reproducible harness that writes training_metrics.json, RUN_COMPLETE.txt, and launcher metadata per run, along with NCCL topology/log artifacts and Nsight Systems traces and summaries (NVTX + NCCL ranges).
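The per-run artifact pattern described above (metrics file plus a completion sentinel) can be sketched as follows; `write_run_artifacts`, the file contents, and the metric values are illustrative assumptions, with only the training_metrics.json and RUN_COMPLETE.txt filenames taken from the description:

```python
import json, os, tempfile, time

def write_run_artifacts(run_dir, metrics, launcher_meta):
    """Persist per-run metrics, launcher metadata, and a completion sentinel."""
    os.makedirs(run_dir, exist_ok=True)
    with open(os.path.join(run_dir, "training_metrics.json"), "w") as f:
        json.dump(metrics, f, indent=2)
    with open(os.path.join(run_dir, "launcher_meta.json"), "w") as f:
        json.dump(launcher_meta, f, indent=2)
    # Sentinel written last, so its presence implies the other files are complete
    with open(os.path.join(run_dir, "RUN_COMPLETE.txt"), "w") as f:
        f.write(f"completed {time.strftime('%Y-%m-%dT%H:%M:%S')}\n")

run_dir = os.path.join(tempfile.mkdtemp(), "run_0001")
write_run_artifacts(
    run_dir,
    metrics={"tokens_per_sec": 12345.0, "world_size": 8},      # illustrative values
    launcher_meta={"nodes": 2, "gpus_per_node": 4, "zero_stage": 1},
)
print(sorted(os.listdir(run_dir)))
```

Writing the sentinel last lets downstream aggregation scripts skip runs that were killed mid-write, which matters when Slurm preempts jobs.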