A beginner-friendly TypeScript project for learning how modern tokenizers work by building them yourself.
GitHub repository: UtkarshTheDev/tokenizer
This repository currently contains two tokenizer families:
- Byte-Pair Encoding (BPE) in bpe/README.md
- WordPiece in wordpiece/README.md
The goal of this project is not just to run tokenizers.
The goal is to help a beginner understand:
- why tokenization is needed
- how different subword algorithms think
- how training, encoding, and decoding fit together
- how to experiment with those ideas from a CLI
If you are new to tokenization, this repo helps you move through the topic in a practical order.
You will learn:
- why large language models need tokenization
- how BPE learns merge rules
- how WordPiece learns a vocabulary and uses greedy longest matching
- how text becomes token IDs
- how token IDs become text again
- how training data affects the pieces a tokenizer learns
This project is built in a way that encourages reading the code, not just running it.
If you are a complete beginner, use this order:
Read bpe/README.md first.
Why?
- BPE is easier to visualize
- it is easier to see repeated pair counting and merging
- it builds intuition for subword tokenization
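To make the "repeated pair counting" idea concrete, here is a toy sketch (not the repo's actual implementation) of the counting step that BPE repeats during training: scan every word, count each adjacent pair of symbols, and merge the most frequent pair.

```typescript
// Toy sketch of BPE's core training step (not the repo's code):
// count adjacent symbol pairs across all words.
function countPairs(words: string[][]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const word of words) {
    for (let i = 0; i < word.length - 1; i++) {
      const pair = `${word[i]} ${word[i + 1]}`;
      counts.set(pair, (counts.get(pair) ?? 0) + 1);
    }
  }
  return counts;
}

// "low" and "lower" split into symbols: "l o" and "o w" each appear twice,
// so they are the first candidates for merging.
const words = [["l", "o", "w"], ["l", "o", "w", "e", "r"]];
const counts = countPairs(words);
console.log(counts.get("l o")); // 2
console.log(counts.get("w e")); // 1
```

Training is this step in a loop: count, merge the winner into a single new symbol, and repeat until the vocabulary is big enough.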
After BPE, read wordpiece/README.md.
Why second?
- WordPiece is similar enough to compare with BPE
- but different enough to teach a new way of thinking
- it introduces ideas like `##` continuation pieces and greedy longest-match encoding
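Greedy longest-match encoding can be sketched in a few lines. This is a toy illustration (not the repo's code): at each position, try the longest possible piece first, fall back to shorter ones, and prefix continuation pieces with `##`.

```typescript
// Toy sketch of WordPiece greedy longest-match encoding (not the repo's
// implementation). Pieces that continue a word carry a "##" prefix.
function greedyEncode(word: string, vocab: Set<string>): string[] {
  const pieces: string[] = [];
  let start = 0;
  while (start < word.length) {
    let end = word.length;
    let found: string | null = null;
    while (end > start) {
      let piece = word.slice(start, end);
      if (start > 0) piece = "##" + piece; // continuation piece
      if (vocab.has(piece)) {
        found = piece;
        break;
      }
      end--; // shrink the candidate and try again
    }
    if (found === null) return ["[UNK]"]; // the word cannot be segmented
    pieces.push(found);
    start = end;
  }
  return pieces;
}

const vocab = new Set(["play", "##ing", "##er", "##ed"]);
console.log(greedyEncode("playing", vocab)); // ["play", "##ing"]
console.log(greedyEncode("playful", vocab)); // ["[UNK]"]
```

Note how a single missing piece (`##ful`) makes the whole word fall back to `[UNK]` — this is exactly the behavior you can reproduce later from the CLI.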
| Method | Main idea | Best beginner mental model |
|---|---|---|
| BPE | Learn merge rules | "Keep merging the most frequent adjacent pair." |
| WordPiece | Learn a vocabulary of valid pieces | "At each position, take the longest valid piece from the vocabulary." |
```
.
├── bpe/
│   ├── README.md
│   ├── preTokenizer.ts
│   └── tokenizer.ts
├── wordpiece/
│   ├── README.md
│   ├── manualPreTokenizer.ts
│   ├── preTokenizer.ts
│   ├── tokenizer.ts
│   ├── trainHelpers.ts
│   ├── tokenizer.test.ts
│   ├── types.ts
│   ├── wordpiece.excalidraw
│   └── wordpiece.png
├── data/
│   └── data.txt
├── index.ts
├── package.json
└── tsconfig.json
```

bpe/
- contains the BPE implementation
- contains the detailed BPE teaching guide
wordpiece/
- contains the WordPiece implementation
- contains the WordPiece teaching guide
- includes both a practical pre-tokenizer and a manual learning version
- includes tests and visual diagram assets
data/data.txt
- sample corpus used for training from the CLI
index.ts
- the interactive CLI entry point
- the easiest place to experiment with the project
If you want to learn this repo properly, follow this order.
This file gives you the map of the whole project.
Go to: bpe/README.md
Focus on:
- what a token is
- why BPE starts from bytes
- how merge rules are learned
- how encoding and decoding work

Then read the BPE code.
Start with: bpe/preTokenizer.ts
Then: bpe/tokenizer.ts
Go to: wordpiece/README.md
Focus on:
- what makes WordPiece different from BPE
- why `##` exists
- how greedy longest-match encoding works
- how WordPiece training builds a vocabulary
Recommended order:
- wordpiece/manualPreTokenizer.ts
- wordpiece/preTokenizer.ts
- wordpiece/types.ts
- wordpiece/tokenizer.ts
- wordpiece/trainHelpers.ts
- wordpiece/tokenizer.test.ts
Once the mental model is clear, run the CLI and try:
- training on your own text
- training on data/data.txt
- encoding known and unknown words
- decoding tokens back into text
- comparing BPE and WordPiece on the same input
This project uses Bun as its runtime.
```bash
bun install
bun run index.ts
```

This opens the project’s main interactive interface.
The CLI supports both BPE and WordPiece.
When you run it, you can use either:
- the numbered menu
- or direct commands like `bpe`, `wordpiece`, `train`, `encode`, `decode`, `save`, `load`
| Action | What it does |
|---|---|
| `select` or `1` | Choose BPE or WordPiece |
| `train` or `2` | Train on text you type directly |
| `data` or `3` | Train on `data/data.txt` |
| `encode` or `4` | Convert text into token IDs |
| `decode` or `5` | Convert token IDs back into text |
| `save` or `6` | Save the currently active tokenizer model to `models/` |
| `load` or `7` | Load a tokenizer model from `models/` into memory |
| `stats` or `8` | Show training stats |
| `clear` | Clear the terminal |
| `exit` or `quit` | Exit the CLI |
| `bpe` | Switch directly to BPE |
| `wordpiece` / `wp` | Switch directly to WordPiece |
A quick BPE walkthrough:

1. Run CLI
2. Type: bpe
3. Type: train
4. Enter some text
5. Choose a vocabulary size
6. Type: encode
7. Try a sentence
8. Type: decode
9. Paste token IDs back in

Then a WordPiece walkthrough:

1. Run CLI
2. Type: wordpiece
3. Type: data
4. Let it train on data/data.txt
5. Type: encode
6. Try words like:
- playing
- player
- tokenizer
- playful
7. Type: decode
8. Observe how `##` continuation pieces become normal words again

You must train the currently selected tokenizer before encode or decode will work.
That means:
- if you switch to BPE, train BPE first
- if you switch to WordPiece, train WordPiece first
Each tokenizer keeps its own learned state in memory while the CLI session is running.
The CLI can now save and load both tokenizer families.
- `save` writes the currently active tokenizer to the `models/` folder
- `load` reads a saved tokenizer file back into memory
- pressing Enter uses the default file name for the active tokenizer: `models/bpe.json` or `models/wordpiece.json`
One important design choice in this project:
- loading a model does not automatically switch the active tokenizer
- instead, it fills that tokenizer's in-memory slot
- this means you can keep a BPE model and a WordPiece model loaded at the same time during one CLI session
Example:
1. Active tokenizer is WordPiece
2. Type: load
3. Load a saved BPE model
4. BPE becomes available in memory
5. WordPiece stays active until you switch to BPE yourself

This makes experimentation easier because you do not lose the model that is already loaded for the other tokenizer.
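A minimal sketch of that slot design (the names here are illustrative, not the repo's actual types): each tokenizer family gets its own in-memory slot, and loading fills a slot without changing which family is active.

```typescript
// Hypothetical sketch of the "load fills a slot, does not switch" design.
// Types and names are illustrative, not the repo's actual code.
type Family = "bpe" | "wordpiece";

interface Session {
  active: Family; // which tokenizer commands currently operate on
  models: Partial<Record<Family, unknown>>; // one slot per family
}

function loadModel(session: Session, family: Family, model: unknown): void {
  session.models[family] = model; // fill that family's slot...
  // ...but leave session.active untouched: switching stays explicit.
}

const session: Session = { active: "wordpiece", models: {} };
loadModel(session, "bpe", { merges: [] });
console.log(session.active);          // still "wordpiece"
console.log("bpe" in session.models); // true — BPE is ready when you switch
```

The design choice keeps `load` side-effect free with respect to the active tokenizer, so loading a model for one family can never silently change what `encode` and `decode` act on.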
To save a model:

1. Train or load a tokenizer first
2. Type: save
3. Press Enter to use the default file:
- bpe -> models/bpe.json
- wordpiece -> models/wordpiece.json
4. Or type a custom name like:
- my-bpe-model
- lesson-1-wordpiece.json

To load a model:

1. Type: load
2. Press Enter to load the default file for the active tokenizer
3. Or type a saved file name from models/
4. The CLI loads that tokenizer into memory
5. If it is different from the active tokenizer, the active tokenizer stays unchanged

You can also use the tokenizers directly in code.
BPE:

```typescript
import { decode, encode, train } from "./bpe/tokenizer";

const text = "hello world! hello world!";
const { mergeTable } = train(text, 300);
const encoded = encode("hello world", mergeTable);
const decoded = decode(encoded, mergeTable);
```

WordPiece:

```typescript
import { decode, encode, train } from "./wordpiece/tokenizer";

const model = train("play playing player played", 18);
const encoded = encode("playing played", model);
const decoded = decode(encoded, model);
```

This project is built in a learning-friendly way:
- the BPE code is heavily commented
- the WordPiece code now has beginner-friendly explanations
- both methods have dedicated README guides
- there is a CLI for hands-on exploration
- WordPiece has tests that act like small executable examples
This means you can learn in three different ways:
- read the guides
- read the code
- run the tokenizer yourself
That combination is what makes the project useful.
If you want to study actively instead of only reading, try these exercises.
Take the word: `playing`

Ask:
- how would BPE learn this over time?
- how would WordPiece segment this using a vocabulary?
Use text like: `play playing player played`

Then compare what each tokenizer learns.
In WordPiece, test: `playful`

Observe how `[UNK]` appears when the vocabulary cannot fully segment the word.
Open wordpiece/tokenizer.test.ts and treat each test as a behavior example.
Once you understand this repo, good next topics are:
- WordPiece scoring strategies in more depth
- Unigram tokenization
- SentencePiece
- Unicode-aware tokenization
- vocabulary serialization and loading
- production tokenization performance
But first, make sure you really understand:
- what a token is
- how BPE learns merges
- how WordPiece learns a vocabulary
- why encoding and decoding are different
Those ideas are the foundation.
- BPE guide: bpe/README.md
- WordPiece guide: wordpiece/README.md
- CLI entry point: index.ts
- sample corpus: data/data.txt
- contributing guide: CONTRIBUTING.md
- license: LICENSE
Best way to use this repo: read a guide, open the matching code, then try it in the CLI.
Made with love by Utkarsh. Contributions are welcome.