gpu-pcie-path-validator

Deterministic PCIe transport validation for NVIDIA GPUs.

This tool collects live transport and telemetry signals from the GPU using NVIDIA-supported interfaces (CUDA, NVML) and Linux sysfs, and computes all metrics directly from the negotiated PCIe link state.

What It Measures

CUDA memcpy throughput — H2D and D2H, bulk transfer and link latency
NVML PCIe bus-level throughput — RX, TX, combined
Link negotiation state — pre and post load, speed and width
PCIe replay counter delta
AER error counters — correctable, non-fatal, fatal deltas
GPU clock state — SM, memory, graphics clocks, P-state pre and post load
Max Payload Size and Max Read Request Size
Persistence mode, ASPM policy, IOMMU state
NUMA affinity validation
Power and thermal telemetry — pre, average, peak, end, delta

All values are sourced from live hardware via CUDA, NVML, and sysfs, with an lspci fallback for MPS/MRRS. Nothing is estimated.
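As a rough illustration of where link-state values can come from, Linux exposes the negotiated and maximum PCIe link parameters as plain-text sysfs attributes (current_link_speed, current_link_width, max_link_speed, max_link_width) under /sys/bus/pci/devices/<BDF>/. The Python sketch below is not the tool's implementation. The helper names and the example bus address are hypothetical.

```python
from pathlib import Path

SYSFS_PCI = Path("/sys/bus/pci/devices")


def parse_link_speed(text: str) -> float:
    """Parse a sysfs speed string such as '16.0 GT/s PCIe' into GT/s.

    Raises ValueError if the kernel reports a non-numeric value.
    """
    return float(text.split()[0])


def read_link_state(bdf: str, root: Path = SYSFS_PCI) -> dict:
    """Read current and maximum link speed/width for one PCI device.

    `bdf` is the PCI address, e.g. '0000:01:00.0' (a hypothetical example).
    """
    dev = Path(root) / bdf
    state = {}
    for name in ("current_link_speed", "max_link_speed"):
        state[name] = parse_link_speed((dev / name).read_text())
    for name in ("current_link_width", "max_link_width"):
        state[name] = int((dev / name).read_text().strip())
    return state


def link_below_max(state: dict) -> bool:
    """True if the link is currently running below its advertised maximum."""
    return (state["current_link_speed"] < state["max_link_speed"]
            or state["current_link_width"] < state["max_link_width"])
```

A link that reads below its maximum at idle is not necessarily faulty, since GPUs commonly lower link speed to save power. Comparing readings taken before and after load, as the list above describes, is what makes the value meaningful.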

Requirements

Linux
NVIDIA GPU
NVIDIA Driver (with NVML support)
CUDA Toolkit
pciutils

Install

Clone and build:

git clone https://github.com/parallelArchitect/gpu-pcie-path-validator.git
cd gpu-pcie-path-validator
make

Optional system-wide install (if the Makefile provides an install target):

sudo make install

Quick Start

Note: On some systems, a sudo prompt may appear if the lspci fallback is needed to read Max Payload Size (MPS) and Max Read Request Size (MRRS) from PCIe configuration space.

List all detected GPUs:

./gpu_pcie_validator --list-devices

Validate a single device:

./gpu_pcie_validator --device 0

Validate all devices:

./gpu_pcie_validator --all-devices

Output

Single device:

./logs/runs/<timestamp>_GPU<N>/
    report.txt
    report.json

All devices:

./logs/runs/<timestamp>_ALL/
    report.txt
    report.json
    gpu0.json
    gpu1.json
    ...
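For scripted post-processing, the run directories above can be located and loaded with a few lines of Python. This is a minimal sketch under stated assumptions: the JSON schema is not documented in this README, so the code only loads the files and lists their top-level keys. The function names are hypothetical, and the newest run is chosen by modification time because the exact timestamp format is not specified here.

```python
import json
from pathlib import Path


def latest_run_dir(runs_root="logs/runs"):
    """Return the most recently modified run directory, or None if none exist."""
    root = Path(runs_root)
    if not root.is_dir():
        return None
    runs = [p for p in root.iterdir() if p.is_dir()]
    return max(runs, key=lambda p: p.stat().st_mtime, default=None)


def load_reports(run_dir):
    """Load every *.json file in a run directory, keyed by file name."""
    return {p.name: json.loads(p.read_text())
            for p in sorted(Path(run_dir).glob("*.json"))}


if __name__ == "__main__":
    run = latest_run_dir()
    if run is None:
        print("no runs found under logs/runs")
    else:
        for name, data in load_reports(run).items():
            # Only top-level keys are printed; no field names are assumed.
            keys = sorted(data) if isinstance(data, dict) else type(data).__name__
            print(name, keys)
```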

Exit Codes

0 — All validated GPUs HEALTHY
1 — Runtime error
2 — One or more GPUs DEGRADED or LINK_DEGRADED
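The exit codes make the tool straightforward to call from automation. The sketch below wraps the --all-devices invocation from Quick Start and maps the return code to the meanings listed above. The wrapper and its function names are hypothetical, and the binary path assumes the tool was built in the current directory.

```python
import subprocess

# Meanings taken from the exit-code list in this README.
EXIT_MEANINGS = {
    0: "all validated GPUs HEALTHY",
    1: "runtime error",
    2: "one or more GPUs DEGRADED or LINK_DEGRADED",
}


def describe_exit(code: int) -> str:
    """Translate a validator exit code into a short description."""
    return EXIT_MEANINGS.get(code, f"undocumented exit code {code}")


def validate_all(binary="./gpu_pcie_validator"):
    """Run the validator on all devices and return (exit_code, meaning).

    Raises FileNotFoundError if the binary does not exist at `binary`.
    """
    result = subprocess.run([binary, "--all-devices"])
    return result.returncode, describe_exit(result.returncode)
```

A scheduler or CI job could call validate_all() and treat code 2 as a hardware-attention signal while treating code 1 as a failure of the check itself.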

Usage

See: docs/Usage.md

Interpretation

CUDA memcpy throughput measures payload transfer rate per direction for a defined buffer size.

NVML PCIe throughput (nvmlDeviceGetPcieThroughput) reports bus-level traffic counters sampled over a measurement window. Per the NVML documentation, each query returns KB/s measured over a 20 ms interval.

The two metrics may differ because of sampling-window alignment, DMA overlap, or other PCIe traffic on the bus.

NVML API reference: https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceQueries.html

NVIDIA Developer: https://developer.nvidia.com/
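When reading memcpy throughput against the negotiated link, it can help to know the link's theoretical ceiling. The sketch below computes per-direction bandwidth from the per-lane transfer rate and lane count, using the line encodings defined by the PCIe specifications (8b/10b at 2.5 and 5 GT/s, 128b/130b at 8, 16, and 32 GT/s). It is an illustration, not something this README documents the tool as doing, and the function names are hypothetical. Measured payload throughput will sit below this ceiling because packet headers and flow control also consume link bandwidth.

```python
# Line-encoding efficiency keyed by per-lane transfer rate in GT/s.
ENCODING = {
    2.5: 8 / 10,
    5.0: 8 / 10,
    8.0: 128 / 130,
    16.0: 128 / 130,
    32.0: 128 / 130,
}


def theoretical_gbps(speed_gts: float, width: int) -> float:
    """Per-direction link bandwidth in GB/s (10^9 bytes/s), before packet overhead."""
    return speed_gts * ENCODING[speed_gts] * width / 8


def link_efficiency(measured_gbps: float, speed_gts: float, width: int) -> float:
    """Fraction of the theoretical ceiling achieved by a measured throughput."""
    return measured_gbps / theoretical_gbps(speed_gts, width)
```

For an 8 GT/s x16 link this gives about 15.75 GB/s per direction, so a measured 12 GB/s H2D transfer would correspond to roughly 76% of the ceiling.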

Scope

This tool does not:

Benchmark kernel compute performance
Diagnose motherboard electrical faults
Replace full-system hardware diagnostics

Author

Joe McLaren (parallelArchitect)
Human-directed GPU engineering with AI assistance.

Contact

gpu.validation@gmail.com

License

MIT
