minicpp-gpt2

GPT-2 inference from scratch in ~1,200 lines of C++. No PyTorch. No TensorFlow. No ONNX Runtime. Just a custom tensor library, a transformer implementation, and a BPE tokenizer.

What this is

A minimal, readable implementation of GPT-2 (124M) inference in C++.

What's implemented:

  • Tensor class with matmul, softmax, layer norm, GELU, broadcasting
  • Multi-head causal self-attention
  • Full GPT-2 transformer (12 blocks, 768-dim, 12 heads)
  • BPE tokenizer with byte-level encoding
  • ONNX weight loader
  • Autoregressive text generation (greedy decoding)
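
Two of the tensor ops above can be sketched as stand-alone functions (the names below are illustrative, not the repo's actual Tensor API). GPT-2 uses the tanh approximation of GELU, and softmax is made numerically stable by subtracting the row maximum before exponentiating:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// GPT-2's GELU: the tanh approximation, not the exact erf form.
float gelu(float x) {
    const float k = 0.7978845608f; // sqrt(2/pi)
    return 0.5f * x * (1.0f + std::tanh(k * (x + 0.044715f * x * x * x)));
}

// Numerically stable softmax: subtract the max before exponentiating
// so std::exp never overflows on large logits.
std::vector<float> softmax(const std::vector<float>& logits) {
    float m = *std::max_element(logits.begin(), logits.end());
    std::vector<float> out(logits.size());
    float sum = 0.0f;
    for (size_t i = 0; i < logits.size(); ++i) {
        out[i] = std::exp(logits[i] - m);
        sum += out[i];
    }
    for (float& v : out) v /= sum;
    return out;
}
```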

What's NOT implemented (on purpose):

  • KV cache (each forward pass recomputes from scratch)
  • SIMD / GPU acceleration
  • Training / backpropagation
  • Sampling strategies (top-k, top-p, temperature)

This is an educational project, not a production inference engine.
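
The BPE tokenizer mentioned above works by repeatedly applying merge rules from merges.txt to adjacent symbol pairs. A minimal sketch of a single merge step (a hypothetical helper, not the repo's API):

```cpp
#include <string>
#include <vector>

// Merge every adjacent occurrence of the pair (a, b) into the single
// symbol a+b. Real BPE encoding applies the highest-priority applicable
// rule from merges.txt and repeats until no rule matches.
std::vector<std::string> apply_merge(const std::vector<std::string>& symbols,
                                     const std::string& a, const std::string& b) {
    std::vector<std::string> out;
    size_t i = 0;
    while (i < symbols.size()) {
        if (i + 1 < symbols.size() && symbols[i] == a && symbols[i + 1] == b) {
            out.push_back(a + b); // the pair collapses into one symbol
            i += 2;
        } else {
            out.push_back(symbols[i]);
            i += 1;
        }
    }
    return out;
}
```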

Prerequisites

Protobuf is needed to parse the ONNX model file. On macOS with Homebrew:

brew install protobuf

Setup

1. Clone the repo:

git clone https://github.com/YOUR_USERNAME/minicpp-gpt2.git
cd minicpp-gpt2

2. Download GPT-2 model and tokenizer files:

Download these three files from HuggingFace and place them in the project root:

  • gpt2.onnx — the model weights (~475 MB). Export from the HuggingFace model or find a pre-exported ONNX file.
  • vocab.json — the token-to-ID vocabulary
  • merges.txt — the BPE merge rules

3. Build and run:

make
./gpt2

You should see something like:

Loading GPT-2 model...
Model loaded.
Loaded 50257 vocab entries
Loaded 50000 merges
Prompt: The meaning of life is
The meaning of life is to be a good person...

Project structure

minicpp-gpt2/
├── include/
│   ├── tensor.h          # Tensor class declaration
│   ├── transformer.h     # Linear, Attention, LayerNorm, TransformerBlock, Wrapper
│   ├── tokenizer.h       # BPE tokenizer
│   ├── onnx_loader.h     # ONNX weight loading
│   └── onnx.pb.h         # Generated protobuf for ONNX format
├── src/
│   ├── tensor.cpp         # Tensor operations (matmul, softmax, GELU, etc.)
│   ├── transformer.cpp    # Neural network layers and forward passes
│   ├── tokenizer.cpp      # BPE encode/decode with byte-level encoding
│   ├── onnx_loader.cpp    # Load GPT-2 weights from ONNX file
│   └── onnx.pb.cc         # Generated protobuf implementation
├── main.cpp               # Entry point: load model, tokenize, generate
├── Makefile
├── onnx.proto             # ONNX protobuf schema
└── LICENSE

How it works

The forward pass for a single generation step:

tokens → [Embedding + Position Embedding]
       → [TransformerBlock x 12]
           → LayerNorm → Multi-Head Attention → Residual
           → LayerNorm → MLP (expand → GELU → shrink) → Residual
       → [Final LayerNorm]
       → [Linear projection to vocab]
       → [Softmax]
       → next token
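
Since greedy decoding just picks the highest-scoring token, the final softmax can even be skipped at inference time: the argmax of the logits is the argmax of the probabilities. A toy sketch of the generation loop, where fake_forward stands in for the real transformer forward pass:

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// Stand-in for the model: fake logits over a 5-token vocabulary that
// always favor (last token + 1) mod 5. The real code would run the
// full transformer forward pass here.
std::vector<float> fake_forward(const std::vector<int>& tokens) {
    std::vector<float> logits(5, 0.0f);
    logits[(tokens.back() + 1) % 5] = 1.0f;
    return logits;
}

// Greedy autoregressive loop. Note there is no KV cache: every
// iteration recomputes the forward pass over the whole sequence.
std::vector<int> generate_greedy(std::vector<int> tokens, int max_new_tokens) {
    for (int step = 0; step < max_new_tokens; ++step) {
        std::vector<float> logits = fake_forward(tokens);
        int next = (int)std::distance(
            logits.begin(), std::max_element(logits.begin(), logits.end()));
        tokens.push_back(next);
    }
    return tokens;
}
```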

Each transformer block uses pre-norm (GPT-2 style) and causal masking to prevent attending to future tokens. The final projection to the vocabulary ties its weights to the token embedding matrix.
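
Causal masking amounts to setting each query's scores for future key positions to negative infinity before the row-wise softmax, so those positions receive exactly zero weight. A free-function sketch (in the repo this logic lives inside the attention layer):

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

// Row-wise softmax over a T x T score matrix with a causal mask:
// query position i may only attend to key positions j <= i.
std::vector<std::vector<float>> causal_softmax(std::vector<std::vector<float>> scores) {
    const size_t T = scores.size();
    const float neg_inf = -std::numeric_limits<float>::infinity();
    for (size_t i = 0; i < T; ++i) {
        for (size_t j = i + 1; j < T; ++j)
            scores[i][j] = neg_inf;        // exp(-inf) = 0: no future leakage
        float m = scores[i][0];            // max over the unmasked prefix
        for (size_t j = 1; j <= i; ++j) m = std::max(m, scores[i][j]);
        float sum = 0.0f;
        for (size_t j = 0; j < T; ++j) {
            scores[i][j] = std::exp(scores[i][j] - m);
            sum += scores[i][j];
        }
        for (size_t j = 0; j < T; ++j) scores[i][j] /= sum;
    }
    return scores;
}
```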

License

MIT
