Itchy

A 16MB language model designed to be small from birth — not a shrunken giant.

Built for the OpenAI Parameter Golf challenge.

Why "Itchy"

Like giving someone a wool blanket to keep them warm when the reason they were complaining was that they were itchy. Every other submission shrinks a big model into 16MB — solving the wrong problem. Itchy asks: what if you just built something the right size?

The Thesis

622+ submissions to Parameter Golf. All of them are transformers with BPE tokenizers. Itchy is the first byte-level submission.

Byte-level (256 vocab) — no tokenizer. The entire 16MB budget goes to the brain. With a 1024-token vocabulary, the embedding table consumes a large fraction of the parameter budget. With 256 bytes, it's negligible — those parameters go into transformer layers instead.

Patch processing (4 bytes -> 1 patch) keeps effective sequence length shorter than the token-level baseline.

Architecture

Input: raw UTF-8 bytes (0-255)
  -> BytePatchEmbed (12 bytes -> 1 patch -> model dim)
  -> 11x Block (attention + LeakyReLU(0.5)² MLP, 3x expansion)
  -> PerPositionDecode (12 independent MLP heads, one per byte position)
Output: next-byte probabilities

17.5M params | 13.1MB at int6 | 384 dim, 11 layers, 8 heads, patch=12

Validation Results (T4 GPU, 3000 steps, 1 shard)

Patch size ablation (the big finding):

Patch Size	Val BPB	Delta
2	1.1512	+0.557
3	0.7800	+0.186
4	0.5944	— (original)
8	0.3298	-0.265
12	0.2903	-0.304
16	0.3343	-0.260

Patch size is the dominant hyperparameter — going from 4 to 12 improved BPB by 0.30, which is 40x larger than all other tricks combined.

Unpatch decode ablation:

Decode method	Val BPB	Delta
Flat linear (baseline)	0.2903	—
Per-position MLP heads	0.2329	-0.057
Autoregressive decoder	0.7937	+0.503 (too complex for scale)
MoE unpatch	3.2594	broken

Trick ablation:

Config	Val BPB	Delta
Baseline (relu², 2x MLP)	0.6010	—
+LeakyReLU(0.5)²	0.5996	-0.0014
+3x MLP	0.5949	-0.0061
+Partial RoPE	0.6119	+0.0109 (hurts)
+LN Scale	0.6125	+0.0115 (hurts)
+N-gram hash	0.6010	+0.0000 (no effect)

Head-to-head vs token-level baseline:

Itchy (byte-level): 0.66 BPB with 4.3M params
Baseline (token-level): 2.56 BPB with 1.1M params (same total size budget)

What Didn't Work

LoRA adapters + TTT meta-learning: Built LoRA adapters into every block, trained with meta-learning episodes. Adapters never learned to adapt — zero improvement across all tests. The zero-initialized gate creates a gradient dead zone. Stripped entirely.
Partial RoPE: Helps token-level models, hurts byte-level (+0.011 BPB).
LN Scale per layer: Same — helps at token level, hurts at byte level (+0.012 BPB).
Hash n-gram embeddings: Added 600K params for zero BPB improvement.

Status

Model architecture (byte-level, LeakyReLU², 3x MLP)
Ablation study on T4 (7 configs tested)
Head-to-head vs token-level baseline
MLX training loop (local Mac)
CUDA training script (8xH100, ready to run)
Byte data pipeline
Full training runs on H100s (waiting for compute grant)
3-seed validation
Submission PR

Running

# Local (Mac, MLX)
python data/convert_to_bytes.py --train-shards 2
ITERATIONS=30 TRAIN_LOG_EVERY=10 .venv/bin/python train_itchy_mlx.py

# Competition (8xH100)
python data/convert_to_bytes.py --train-shards 80
torchrun --standalone --nproc_per_node=8 train_itchy_final.py

Name		Name	Last commit message	Last commit date
Latest commit History 112 Commits
data		data
docs		docs
records		records
.gitignore		.gitignore
COMPETITION.md		COMPETITION.md
ITCHY_README.md		ITCHY_README.md
LICENSE		LICENSE
README.md		README.md
THIRD_PARTY_NOTICES.md		THIRD_PARTY_NOTICES.md
itchy_10shard_colab.ipynb		itchy_10shard_colab.ipynb
itchy_80shard_colab.ipynb		itchy_80shard_colab.ipynb
itchy_a100_colab.ipynb		itchy_a100_colab.ipynb
itchy_ablation_colab.ipynb		itchy_ablation_colab.ipynb
itchy_apples_colab.ipynb		itchy_apples_colab.ipynb
itchy_colab.ipynb		itchy_colab.ipynb
itchy_fixed_colab.ipynb		itchy_fixed_colab.ipynb
itchy_h100_colab.ipynb		itchy_h100_colab.ipynb
itchy_honest_colab.ipynb		itchy_honest_colab.ipynb
itchy_megabyte_colab.ipynb		itchy_megabyte_colab.ipynb
itchy_optimal_colab.ipynb		itchy_optimal_colab.ipynb
itchy_patch_ablation_colab.ipynb		itchy_patch_ablation_colab.ipynb
itchy_unpatch_colab.ipynb		itchy_unpatch_colab.ipynb
itchy_v2_colab.ipynb		itchy_v2_colab.ipynb
model_itchy.py		model_itchy.py
model_itchy_final.py		model_itchy_final.py
model_itchy_v2.py		model_itchy_v2.py
requirements.txt		requirements.txt
test_ttt_thesis.py		test_ttt_thesis.py
test_ttt_thesis_v2.py		test_ttt_thesis_v2.py
train_gpt.py		train_gpt.py
train_gpt_mlx.py		train_gpt_mlx.py
train_itchy.py		train_itchy.py
train_itchy_final.py		train_itchy_final.py
train_itchy_mlx.py		train_itchy_mlx.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Itchy

Why "Itchy"

The Thesis

Architecture

Validation Results (T4 GPU, 3000 steps, 1 shard)

Patch size ablation (the big finding):

Unpatch decode ablation:

Trick ablation:

Head-to-head vs token-level baseline:

What Didn't Work

Status

Running

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Itchy

Why "Itchy"

The Thesis

Architecture

Validation Results (T4 GPU, 3000 steps, 1 shard)

Patch size ablation (the big finding):

Unpatch decode ablation:

Trick ablation:

Head-to-head vs token-level baseline:

What Didn't Work

Status

Running

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages