wjabbour/rocm-learning


Currently Working On

Here is everything I've worked on so far, ordered and dated.

I recently achieved 84.4% memory bandwidth efficiency (~541 GB/s) on my Radeon RX 9070 XT (RDNA 4) using wave shuffles, LDS, and a grid-stride loop. You can see the code and profiling write-up for the kernel here.

My primary focus right now is learning vLLM and landing commits that push vLLM on RDNA forward. My secondary focus is digging further into ROCm and HIP, with an emphasis on vectorization and quantization.

About Me

Hi, I’m Turner Jabbour. I’ve been a software engineer for ~7 years, primarily working in Node and React. Around September 2025, I became deeply interested in GPU programming, ROCm, and the broader world of low-level performance engineering.

This repository is my space to learn in public as I delve into GPU kernel engineering and inference systems work.

There are three important directories:

  • kernels - I explore different kernels, each with a write-up of what I learned and how it relates to inference.
  • papers - I summarize and discuss different papers.
  • topics - I dive deep into specific topics.

My long-term goal is to build strong competency in HIP, Triton, and AMD’s GPU software stack, with a focus on high-performance inference.

Scheduled Learning

RCCL

AMD’s collectives library for multi-GPU communication (AllReduce, AllGather, ReduceScatter, etc.) used heavily in distributed inference.

Triton

A higher-level DSL for writing high-performance kernels, increasingly used in modern inference work (FlashAttention, fused ops, reductions).

GPU Architecture

Wavefronts, SIMDs, LDS, VGPRs, vectorized memory access, latency hiding, ROCm ISA, wavefront scheduler.

Model Serving at Scale

vLLM, PagedAttention, KV-cache management, continuous batching, speculative decoding, and multi-GPU parallelism.

Profiling & Debugging

rocprofv3, PMCs, Perfetto, and bottleneck analysis (latency- vs. memory- vs. compute-bound).

Contact

doubleujabbour@gmail.com
LinkedIn

About

This repository showcases my growth over time with ROCm.
