WaveBoost is my personal repository for experimenting with inference-time optimizations. It implements individual CUDA kernels for LLM inference.
Performance comparison between Multi-Head Attention (MHA) and Grouped Query Attention (GQA):
GQA reduces KV-cache memory by sharing each key/value head across a group of query heads, while keeping latency competitive with MHA. The grouped computation indexes into the shared KV heads directly, so no explicit KV replication is needed.
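The grouping idea can be sketched outside CUDA. The snippet below is an illustrative NumPy reference (not the repo's kernel): each query head is mapped to one shared KV head, so K/V are stored once per group rather than once per query head.

```python
import numpy as np

def gqa_attention(q, k, v):
    """Grouped Query Attention reference sketch (hypothetical, NumPy).

    q:    (n_q_heads,  seq, d)  query heads
    k, v: (n_kv_heads, seq, d)  shared KV heads, n_q_heads % n_kv_heads == 0

    Each contiguous group of query heads attends to a single KV head,
    so K/V are never materialized n_q_heads times.
    """
    n_q_heads, seq, d = q.shape
    n_kv_heads = k.shape[0]
    group = n_q_heads // n_kv_heads
    out = np.empty_like(q)
    for h in range(n_q_heads):
        kv = h // group                        # map query head -> shared KV head
        scores = q[h] @ k[kv].T / np.sqrt(d)   # (seq, seq) attention logits
        scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
        p = np.exp(scores)
        p /= p.sum(axis=-1, keepdims=True)
        out[h] = p @ v[kv]                     # weighted sum of shared values
    return out
```

With `n_kv_heads == n_q_heads` this reduces to standard MHA; otherwise the KV cache shrinks by a factor of `n_q_heads / n_kv_heads`, and the result matches MHA run on explicitly replicated K/V.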
- FlashAttention: Dao et al., 2022 - Fast and Memory-Efficient Exact Attention with IO-Awareness
- FlashAttention-2: Dao, 2023 - Faster Attention with Better Parallelism and Work Partitioning
- NVIDIA CUDA C++ Programming Guide
- NVIDIA CUTLASS - CUDA Templates for Linear Algebra Subroutines
- Kirk and Hwu - Programming Massively Parallel Processors
