GPT-2 inference from scratch in ~1,200 lines of C++. No PyTorch. No TensorFlow. No ONNX Runtime. Just a custom tensor library, a transformer implementation, and a BPE tokenizer.
A minimal, readable implementation of GPT-2 (124M) inference in C++.
What's implemented:
- Tensor class with matmul, softmax, layer norm, GELU, broadcasting
- Multi-head causal self-attention
- Full GPT-2 transformer (12 blocks, 768-dim, 12 heads)
- BPE tokenizer with byte-level encoding
- ONNX weight loader
- Autoregressive text generation (greedy decoding)
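As a taste of the tensor ops above: GPT-2 uses the tanh approximation of GELU rather than the exact erf form. The sketch below is a standalone version of what a `Tensor::gelu()` method might compute (the vector-based API here is a hypothetical stand-in, not the repo's actual `Tensor` class):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Tanh approximation of GELU, as used by GPT-2.
inline float gelu(float x) {
    const float k = 0.7978845608f; // sqrt(2 / pi)
    return 0.5f * x * (1.0f + std::tanh(k * (x + 0.044715f * x * x * x)));
}

// Elementwise GELU over a flat buffer, as a Tensor method might apply it.
std::vector<float> gelu(const std::vector<float>& v) {
    std::vector<float> out(v.size());
    for (std::size_t i = 0; i < v.size(); ++i) out[i] = gelu(v[i]);
    return out;
}
```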
What's NOT implemented (on purpose):
- KV cache (each forward pass recomputes from scratch)
- SIMD / GPU acceleration
- Training / backpropagation
- Sampling strategies (top-k, top-p, temperature)
This is an educational project, not a production inference engine.
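The "no KV cache" point is worth seeing concretely: each new token triggers a full forward pass over the entire sequence. A minimal sketch of that greedy loop, with `forward` as a hypothetical stand-in for the model's forward pass (it maps the token sequence to logits for the last position):

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <iterator>
#include <vector>

// Greedy decoding without a KV cache: every step re-runs the full forward
// pass over the whole sequence, so generation cost grows quadratically.
std::vector<int> generate(
    const std::function<std::vector<float>(const std::vector<int>&)>& forward,
    std::vector<int> tokens, std::size_t max_new_tokens) {
    for (std::size_t step = 0; step < max_new_tokens; ++step) {
        // Recomputed from scratch each iteration — this is what a KV cache avoids.
        std::vector<float> logits = forward(tokens);
        int next = static_cast<int>(std::distance(
            logits.begin(), std::max_element(logits.begin(), logits.end())));
        tokens.push_back(next); // append the argmax token and repeat
    }
    return tokens;
}
```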
Requirements:
- C++20 compiler (g++ or clang++)
- Protocol Buffers (for loading ONNX weights)
On macOS with Homebrew:
brew install protobuf

1. Clone the repo:
git clone https://github.com/YOUR_USERNAME/minicpp-gpt2.git
cd minicpp-gpt2

2. Download GPT-2 model and tokenizer files:
Download these three files from HuggingFace and place them in the project root:
gpt2.onnx — the model weights (~475 MB). Export from the HuggingFace model or find a pre-exported ONNX file.
vocab.json — the token-to-ID vocabulary.
merges.txt — the BPE merge rules, one merge per line.
3. Build and run:
make
./gpt2

You should see something like:
Loading GPT-2 model...
Model loaded.
Loaded 50257 vocab entries
Loaded 50000 merges
Prompt: The meaning of life is
The meaning of life is to be a good person...
Project structure:

minicpp-gpt2/
├── include/
│ ├── tensor.h # Tensor class declaration
│ ├── transformer.h # Linear, Attention, LayerNorm, TransformerBlock, Wrapper
│ ├── tokenizer.h # BPE tokenizer
│ ├── onnx_loader.h # ONNX weight loading
│ └── onnx.pb.h # Generated protobuf for ONNX format
├── src/
│ ├── tensor.cpp # Tensor operations (matmul, softmax, GELU, etc.)
│ ├── transformer.cpp # Neural network layers and forward passes
│ ├── tokenizer.cpp # BPE encode/decode with byte-level encoding
│ ├── onnx_loader.cpp # Load GPT-2 weights from ONNX file
│ └── onnx.pb.cc # Generated protobuf implementation
├── main.cpp # Entry point: load model, tokenize, generate
├── Makefile
├── onnx.proto # ONNX protobuf schema
└── LICENSE
The forward pass for a single generation step:
tokens → [Embedding + Position Embedding]
       → [TransformerBlock × 12]
           LayerNorm → Multi-Head Attention → Residual
           LayerNorm → MLP (expand → GELU → shrink) → Residual
       → [Final LayerNorm]
       → [Linear projection to vocab]
       → [Softmax]
       → next token
Each transformer block uses pre-norm (GPT-2 style), causal masking to prevent attending to future tokens, and tied weights between the embedding and output projection.
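The pre-norm residual pattern above — x = x + Attn(LN(x)), then x = x + MLP(LN(x)) — can be sketched on a single vector. The sublayer here is a placeholder (the real attention and MLP live in transformer.cpp), and the function names are illustrative, not the repo's API:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<float>;

// Layer norm over one vector: subtract the mean, divide by the std dev.
Vec layer_norm(const Vec& x, float eps = 1e-5f) {
    float mean = 0.0f, var = 0.0f;
    for (float v : x) mean += v;
    mean /= static_cast<float>(x.size());
    for (float v : x) var += (v - mean) * (v - mean);
    var /= static_cast<float>(x.size());
    Vec out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        out[i] = (x[i] - mean) / std::sqrt(var + eps);
    return out;
}

// Pre-norm residual: normalize first, run the sublayer, add back the input.
template <typename Sub>
Vec residual(const Vec& x, Sub sublayer) {
    Vec h = sublayer(layer_norm(x));
    Vec out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) out[i] = x[i] + h[i];
    return out;
}
```

A block is then two `residual` calls in sequence, one wrapping attention and one wrapping the MLP.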
License: MIT