GPT-2 inference from scratch in ~1,200 lines of C++. No PyTorch. No TensorFlow. No ONNX Runtime. Just a custom tensor library, a transformer implementation, and a BPE tokenizer.
A minimal, readable implementation of GPT-2 (124M) inference in C++.
What's implemented:
- Tensor class with matmul, softmax, layer norm, GELU, broadcasting
- Multi-head causal self-attention
- Full GPT-2 transformer (12 blocks, 768-dim, 12 heads)
- BPE tokenizer with byte-level encoding
- ONNX weight loader
- Autoregressive text generation (greedy decoding)
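As a taste of the tensor ops above: GPT-2 uses the tanh approximation of GELU rather than the exact erf form. The sketch below is a standalone version of what a `Tensor::gelu()` method might compute (the vector-based API here is a hypothetical stand-in, not the repo's actual `Tensor` class):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Tanh approximation of GELU, as used by GPT-2.
inline float gelu(float x) {
    const float k = 0.7978845608f; // sqrt(2 / pi)
    return 0.5f * x * (1.0f + std::tanh(k * (x + 0.044715f * x * x * x)));
}

// Elementwise GELU over a flat buffer, as a Tensor method might apply it.
std::vector<float> gelu(const std::vector<float>& v) {
    std::vector<float> out(v.size());
    for (std::size_t i = 0; i < v.size(); ++i) out[i] = gelu(v[i]);
    return out;
}
```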
What's NOT implemented (on purpose):
- KV cache (each forward pass recomputes from scratch)
- SIMD / GPU acceleration
- Training / backpropagation
- Sampling strategies (top-k, top-p, temperature)
This is an educational project, not a production inference engine.
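The "no KV cache" point is worth seeing concretely: each new token triggers a full forward pass over the entire sequence. A minimal sketch of that greedy loop, with `forward` as a hypothetical stand-in for the model's forward pass (it maps the token sequence to logits for the last position):

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <iterator>
#include <vector>

// Greedy decoding without a KV cache: every step re-runs the full forward
// pass over the whole sequence, so generation cost grows quadratically.
std::vector<int> generate(
    const std::function<std::vector<float>(const std::vector<int>&)>& forward,
    std::vector<int> tokens, std::size_t max_new_tokens) {
    for (std::size_t step = 0; step < max_new_tokens; ++step) {
        // Recomputed from scratch each iteration — this is what a KV cache avoids.
        std::vector<float> logits = forward(tokens);
        int next = static_cast<int>(std::distance(
            logits.begin(), std::max_element(logits.begin(), logits.end())));
        tokens.push_back(next); // append the argmax token and repeat
    }
    return tokens;
}
```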
Requirements:
- C++20 compiler (g++ or clang++)
- Protocol Buffers (for loading ONNX weights)
On macOS with Homebrew:
brew install protobuf

1. Clone the repo:
git clone https://github.com/YOUR_USERNAME/minicpp-gpt2.git
cd minicpp-gpt2

2. Download GPT-2 model and tokenizer files:
Download these three files from HuggingFace and place them in the project root:
gpt2.onnx — the model weights (~475 MB). Export from the HuggingFace model or find a pre-exported ONNX file.
vocab.json — the token-to-ID vocabulary.
merges.txt — the BPE merge rules, one merge per line.
3. Build and run:
make
./gpt2

You should see something like:
Loading GPT-2 model...
Model loaded.
Loaded 50257 vocab entries
Loaded 50000 merges
Prompt: The meaning of life is
The meaning of life is to be a good person...
Project structure:

minicpp-gpt2/
├── include/
│ ├── tensor.h # Tensor class declaration
│ ├── transformer.h # Linear, Attention, LayerNorm, TransformerBlock, Wrapper
│ ├── tokenizer.h # BPE tokenizer
│ ├── onnx_loader.h # ONNX weight loading
│ └── onnx.pb.h # Generated protobuf for ONNX format
├── src/
│ ├── tensor.cpp # Tensor operations (matmul, softmax, GELU, etc.)
│ ├── transformer.cpp # Neural network layers and forward passes
│ ├── tokenizer.cpp # BPE encode/decode with byte-level encoding
│ ├── onnx_loader.cpp # Load GPT-2 weights from ONNX file
│ └── onnx.pb.cc # Generated protobuf implementation
├── main.cpp # Entry point: load model, tokenize, generate
├── Makefile
├── onnx.proto # ONNX protobuf schema
└── LICENSE
The forward pass for a single generation step:
tokens → [Embedding + Position Embedding]
       → [TransformerBlock × 12]
           LayerNorm → Multi-Head Attention → Residual
           LayerNorm → MLP (expand → GELU → shrink) → Residual
       → [Final LayerNorm]
       → [Linear projection to vocab]
       → [Softmax]
       → next token
Each transformer block uses pre-norm (GPT-2 style), causal masking to prevent attending to future tokens, and tied weights between the embedding and output projection.
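The pre-norm residual pattern above — x = x + Attn(LN(x)), then x = x + MLP(LN(x)) — can be sketched on a single vector. The sublayer here is a placeholder (the real attention and MLP live in transformer.cpp), and the function names are illustrative, not the repo's API:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<float>;

// Layer norm over one vector: subtract the mean, divide by the std dev.
Vec layer_norm(const Vec& x, float eps = 1e-5f) {
    float mean = 0.0f, var = 0.0f;
    for (float v : x) mean += v;
    mean /= static_cast<float>(x.size());
    for (float v : x) var += (v - mean) * (v - mean);
    var /= static_cast<float>(x.size());
    Vec out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        out[i] = (x[i] - mean) / std::sqrt(var + eps);
    return out;
}

// Pre-norm residual: normalize first, run the sublayer, add back the input.
template <typename Sub>
Vec residual(const Vec& x, Sub sublayer) {
    Vec h = sublayer(layer_norm(x));
    Vec out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) out[i] = x[i] + h[i];
    return out;
}
```

A block is then two `residual` calls in sequence, one wrapping attention and one wrapping the MLP.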
License: MIT