Tokenizer 101 - For Beginners

A beginner-friendly TypeScript project for learning how modern tokenizers work by building them yourself.

GitHub repository: UtkarshTheDev/tokenizer

This repository currently contains two tokenizer families:

  • BPE (Byte-Pair Encoding), in bpe/
  • WordPiece, in wordpiece/

The goal of this project is not just to run tokenizers.
The goal is to help a beginner understand:

  • why tokenization is needed
  • how different subword algorithms think
  • how training, encoding, and decoding fit together
  • how to experiment with those ideas from a CLI

What This Project Teaches

If you are new to tokenization, this repo helps you move through the topic in a practical order.

You will learn:

  1. why large language models need tokenization
  2. how BPE learns merge rules
  3. how WordPiece learns a vocabulary and uses greedy longest matching
  4. how text becomes token IDs
  5. how token IDs become text again
  6. how training data affects the pieces a tokenizer learns

This project is built in a way that encourages reading the code, not just running it.


Which Tokenizer Should You Study First?

If you are a complete beginner, use this order:

1. Start with BPE

Read bpe/README.md first.

Why?

  • BPE is easier to visualize
  • it is easier to see repeated pair counting and merging
  • it builds intuition for subword tokenization

2. Then move to WordPiece

After BPE, read wordpiece/README.md.

Why second?

  • WordPiece is similar enough to compare with BPE
  • but different enough to teach a new way of thinking
  • it introduces ideas like ## continuation pieces and greedy longest-match encoding

Short comparison

Method      Main idea                            Best beginner mental model
BPE         Learn merge rules                    "Keep merging the most frequent adjacent pair."
WordPiece   Learn a vocabulary of valid pieces   "At each position, take the longest valid piece from the vocabulary."
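
The BPE mental model above can be sketched in a few lines. This is a simplified illustration of one training step (counting adjacent pairs and picking the winner), not the repository's actual implementation; the function name mostFrequentPair is invented for this example.

```typescript
// Sketch of one BPE training step: count every adjacent symbol pair
// and return the most frequent one, which would then be merged.
// Illustrative only -- not taken from bpe/tokenizer.ts.
function mostFrequentPair(symbols: string[]): [string, string] | null {
  const counts = new Map<string, number>();
  for (let i = 0; i < symbols.length - 1; i++) {
    // Join with a separator that cannot appear inside a symbol.
    const key = symbols[i] + "\u0000" + symbols[i + 1];
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  let best: string | null = null;
  let bestCount = 0;
  for (const [key, count] of counts) {
    if (count > bestCount) {
      best = key;
      bestCount = count;
    }
  }
  if (best === null) return null;
  const [a, b] = best.split("\u0000");
  return [a, b];
}

mostFrequentPair([..."hello hello"]); // ["h", "e"] -- several pairs tie at 2, the first seen wins
```

Real BPE repeats this step until the target vocabulary size is reached, replacing each winning pair with a new merged symbol before counting again.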

Repository Tour

.
├── bpe/
│   ├── README.md
│   ├── preTokenizer.ts
│   └── tokenizer.ts
├── wordpiece/
│   ├── README.md
│   ├── manualPreTokenizer.ts
│   ├── preTokenizer.ts
│   ├── tokenizer.ts
│   ├── trainHelpers.ts
│   ├── tokenizer.test.ts
│   ├── types.ts
│   ├── wordpiece.excalidraw
│   └── wordpiece.png
├── data/
│   └── data.txt
├── index.ts
├── package.json
└── tsconfig.json

What each area is for

bpe/

  • contains the BPE implementation
  • contains the detailed BPE teaching guide

wordpiece/

  • contains the WordPiece implementation
  • contains the WordPiece teaching guide
  • includes both a practical pre-tokenizer and a manual learning version
  • includes tests and visual diagram assets

data/data.txt

  • sample corpus used for training from the CLI

index.ts

  • the interactive CLI entry point
  • the easiest place to experiment with the project

Best Study Path for a Beginner

If you want to learn this repo properly, follow this order.

Step 1: Read the root README

This file gives you the map of the whole project.

Step 2: Read the BPE guide

Go to bpe/README.md.

Focus on:

  • what a token is
  • why BPE starts from bytes
  • how merge rules are learned
  • how encoding and decoding work

Step 3: Read the BPE code

Start with bpe/preTokenizer.ts, then move on to bpe/tokenizer.ts.

Step 4: Read the WordPiece guide

Go to wordpiece/README.md.

Focus on:

  • what makes WordPiece different from BPE
  • why ## exists
  • how greedy longest-match encoding works
  • how WordPiece training builds a vocabulary
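
The greedy longest-match idea in that list can be sketched directly. This is a simplified stand-in, not the code in wordpiece/tokenizer.ts; the function name and vocabulary here are invented for the example.

```typescript
// Sketch of WordPiece greedy longest-match encoding. At each position,
// take the longest piece present in the vocabulary; pieces that do not
// start a word are looked up with a "##" prefix. If no piece matches,
// the whole word becomes [UNK]. Illustrative only.
function wordPieceEncode(word: string, vocab: Set<string>): string[] {
  const pieces: string[] = [];
  let start = 0;
  while (start < word.length) {
    let end = word.length;
    let found: string | null = null;
    while (end > start) {
      let piece = word.slice(start, end);
      if (start > 0) piece = "##" + piece; // continuation piece
      if (vocab.has(piece)) {
        found = piece;
        break;
      }
      end--; // shrink from the right until something matches
    }
    if (found === null) return ["[UNK]"]; // cannot segment the word
    pieces.push(found);
    start = end;
  }
  return pieces;
}

const vocab = new Set(["play", "##ing", "##er", "##ed"]);
wordPieceEncode("playing", vocab); // ["play", "##ing"]
wordPieceEncode("playful", vocab); // ["[UNK]"] -- "##ful" is not in the vocabulary
```

Note the all-or-nothing behavior on "playful": WordPiece does not emit a partial segmentation when a word cannot be fully covered, which is exactly what Exercise 3 below asks you to observe.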

Step 5: Read the WordPiece code

Recommended order:

  1. wordpiece/manualPreTokenizer.ts
  2. wordpiece/preTokenizer.ts
  3. wordpiece/types.ts
  4. wordpiece/tokenizer.ts
  5. wordpiece/trainHelpers.ts
  6. wordpiece/tokenizer.test.ts

Step 6: Use the CLI and experiment

Once the mental model is clear, run the CLI and try:

  • training on your own text
  • training on data/data.txt
  • encoding known and unknown words
  • decoding tokens back into text
  • comparing BPE and WordPiece on the same input

Getting Started

This project uses Bun as its runtime.

Install dependencies

bun install

Run the interactive CLI

bun run index.ts

This opens the project’s main interactive interface.


How To Use the CLI

The CLI supports both BPE and WordPiece.

When you run it, you can use either:

  • the numbered menu
  • or direct commands like bpe, wordpiece, train, encode, decode, save, load

Main actions

Action            What it does
select or 1       Choose BPE or WordPiece
train or 2        Train on text you type directly
data or 3         Train on data/data.txt
encode or 4       Convert text into token IDs
decode or 5       Convert token IDs back into text
save or 6         Save the currently active tokenizer model to models/
load or 7         Load a tokenizer model from models/ into memory
stats or 8        Show training stats
clear             Clear the terminal
exit or quit      Exit the CLI
bpe               Switch directly to BPE
wordpiece / wp    Switch directly to WordPiece

Typical beginner workflow

Explore BPE

1. Run CLI
2. Type: bpe
3. Type: train
4. Enter some text
5. Choose a vocabulary size
6. Type: encode
7. Try a sentence
8. Type: decode
9. Paste the token IDs back in

Explore WordPiece

1. Run CLI
2. Type: wordpiece
3. Type: data
4. Let it train on data/data.txt
5. Type: encode
6. Try words like:
   - playing
   - player
   - tokenizer
   - playful
7. Type: decode
8. Observe how ## continuation pieces become normal words again
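
The joining behavior in step 8 can be sketched as a few lines: a "##" piece glues onto the previous word with its prefix stripped, while any other piece starts a new word. This is a simplified illustration with an invented function name, not the repository's decode implementation.

```typescript
// Sketch of how ## continuation pieces become normal words again.
// Illustrative only -- not taken from wordpiece/tokenizer.ts.
function joinPieces(pieces: string[]): string {
  let out = "";
  for (const piece of pieces) {
    if (piece.startsWith("##")) {
      out += piece.slice(2); // continue the previous word, no space
    } else {
      out += (out.length > 0 ? " " : "") + piece; // start a new word
    }
  }
  return out;
}

joinPieces(["play", "##ing", "play", "##ed"]); // "playing played"
```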

Important CLI note

You must train the currently selected tokenizer before encode or decode will work.

That means:

  • if you switch to BPE, train BPE first
  • if you switch to WordPiece, train WordPiece first

Each tokenizer keeps its own learned state in memory while the CLI session is running.

Save and load models

The CLI can now save and load both tokenizer families.

  • save writes the currently active tokenizer to the models/ folder
  • load reads a saved tokenizer file back into memory
  • pressing Enter uses the default file name for the active tokenizer:
    • models/bpe.json
    • models/wordpiece.json

One important design choice in this project:

  • loading a model does not automatically switch the active tokenizer
  • instead, it fills that tokenizer's in-memory slot
  • this means you can keep a BPE model and a WordPiece model loaded at the same time during one CLI session

Example:

1. Active tokenizer is WordPiece
2. Type: load
3. Load a saved BPE model
4. BPE becomes available in memory
5. WordPiece stays active until you switch to BPE yourself

This makes experimentation easier because you do not lose the model that is already loaded for the other tokenizer.

Save and load command examples

Save the active tokenizer

1. Train or load a tokenizer first
2. Type: save
3. Press Enter to use the default file:
   - bpe -> models/bpe.json
   - wordpiece -> models/wordpiece.json
4. Or type a custom name like:
   - my-bpe-model
   - lesson-1-wordpiece.json

Load a tokenizer model

1. Type: load
2. Press Enter to load the default file for the active tokenizer
3. Or type a saved file name from models/
4. The CLI loads that tokenizer into memory
5. If it is different from the active tokenizer, the active tokenizer stays unchanged
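
On disk, a saved model is ultimately just JSON. As a rough sketch of what save and load can look like, here is a minimal pair of helpers; the SavedModel shape and both function names are hypothetical, invented for this example, and the real files written by the CLI may store different fields.

```typescript
import { readFile, writeFile } from "node:fs/promises";

// Hypothetical model shape for illustration only -- the actual
// models/*.json files produced by the CLI may look different.
interface SavedModel {
  kind: "bpe" | "wordpiece";
  vocab: string[];
}

// Serialize a model to pretty-printed JSON on disk.
async function saveModel(model: SavedModel, path: string): Promise<void> {
  await writeFile(path, JSON.stringify(model, null, 2));
}

// Read a saved model back into memory.
async function loadModel(path: string): Promise<SavedModel> {
  return JSON.parse(await readFile(path, "utf8")) as SavedModel;
}
```

Keeping the `kind` field inside the file is one way to let a loader fill the matching in-memory slot without switching the active tokenizer.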

Programmatic Usage

You can also use the tokenizers directly in code.

BPE usage

import { decode, encode, train } from "./bpe/tokenizer";

const text = "hello world! hello world!";
const { mergeTable } = train(text, 300);

const encoded = encode("hello world", mergeTable);
const decoded = decode(encoded, mergeTable);

WordPiece usage

import { decode, encode, train } from "./wordpiece/tokenizer";

const model = train("play playing player played", 18);

const encoded = encode("playing played", model);
const decoded = decode(encoded, model);

What Makes This Project Good for Learning

This project is built in a learning-friendly way:

  • the BPE code is heavily commented
  • the WordPiece code now has beginner-friendly explanations
  • both methods have dedicated README guides
  • there is a CLI for hands-on exploration
  • WordPiece has tests that act like small executable examples

This means you can learn in three different ways:

  1. read the guides
  2. read the code
  3. run the tokenizer yourself

That combination is what makes the project useful.


Suggested Learning Exercises

If you want to study actively instead of only reading, try these exercises.

Exercise 1: Compare BPE and WordPiece mentally

Take the word:

playing

Ask:

  • how would BPE learn this over time?
  • how would WordPiece segment this using a vocabulary?

Exercise 2: Train on a tiny corpus

Use text like:

play playing player played

Then compare what each tokenizer learns.

Exercise 3: Try unknown words

In WordPiece, test:

playful

Observe how [UNK] appears when the vocabulary cannot fully segment the word.

Exercise 4: Read the tests

Open wordpiece/tokenizer.test.ts and treat each test as a behavior example.


Where To Go Next

Once you understand this repo, good next topics are:

  • WordPiece scoring strategies in more depth
  • Unigram tokenization
  • SentencePiece
  • Unicode-aware tokenization
  • vocabulary serialization and loading
  • production tokenization performance

But first, make sure you really understand:

  • what a token is
  • how BPE learns merges
  • how WordPiece learns a vocabulary
  • why encoding and decoding are different

Those ideas are the foundation.


Best way to use this repo: read a guide, open the matching code, then try it in the CLI.

Made with love by Utkarsh. Contributions are welcome.

About

Interactive BPE (Byte-Pair Encoding) tokenizer and CLI utility for TypeScript, designed for clarity and education.
