aditaa/llm

LLM From Scratch

Production-style, decoder-only LLM engineering project focused on reproducible data pipelines, tokenizer/sharding workflows, and GPU training from scratch.

About

  • Scope: end-to-end LLM workflow from raw corpora to checkpoints and generation.
  • Data focus: ZIM + FineWeb workflows with hot (./data) and warm (/mnt/ceph/llm/data) storage patterns.
  • Engineering focus: deterministic scripts, integrity checks, CI gating, and wiki-backed docs.

Project Goals

  • Build a minimal but production-style training stack incrementally.
  • Keep each subsystem testable (tokenizer, data, model, training, evaluation).
  • Favor reproducible experiments through explicit configs and scripts.

Repository Layout

  • src/llm/: core Python package
  • tests/: unit tests
  • docs/: architecture and roadmap notes
  • information/: reference material and external links for project guidance
  • requirements/: system and Python dependency lists for server setup
  • scripts/: bootstrap/install/doctor scripts
  • data/: local/intermediate corpora (gitignored except data/README.md)
  • artifacts/: local outputs (vocab, checkpoints, logs; gitignored)
  • Makefile: common developer commands

Quick Start

bash scripts/bootstrap_dev.sh

Common Commands

make setup-infer # install inference/deploy dependencies
make test        # run unit tests
make lint        # run Ruff checks
make format      # run Black formatter
make typecheck   # run MyPy
make smoke       # tiny CLI smoke check
make verify-shards # print shard integrity check usage
make train       # print baseline training command usage
make generate    # print checkpoint text-generation command usage
make eval-checkpoint # print standardized prompt-suite eval usage
make train-tokenizer-global # print shared-tokenizer command usage
make corpus-quality-report # print quality report command usage
make clean-corpus-batch # print batch cleanup command usage
make dataset-risk-report # print heuristic dataset risk audit command usage
make pull-hf-rows # print Hugging Face rows API pull helper usage
make fineweb-parquet-to-shards # print direct FineWeb parquet->token-shards usage
make stage-fineweb-from-warm # print warm->hot FineWeb chunk staging usage
make fineweb-stage-shard-loop # print rolling stage->shard->verify->sync->purge usage
make shard-corpus-batch # print shared-tokenizer batch sharding usage
make sync-warm   # sync raw/training data + artifacts to warm storage
make hydrate-warm # hydrate hot workspace from warm storage
make offload-zim # continuously move raw ZIMs hot -> warm
make hf-prepare-publish # print HF bundle/publish usage
make hf-download-model # print full HF model download usage
make serve-openai # print local OpenAI-compatible server usage
make doctor      # verify binaries and Python deps

CI/CD

GitHub Actions workflows are defined in .github/workflows/:

  • ci.yml: lint, typecheck, unit tests, smoke checks on pull requests and pushes to main
  • wiki-sync.yml: publish wiki/*.md changes to the GitHub Wiki
  • Dependabot config: .github/dependabot.yml (weekly updates for pip, requirements/, and GitHub Actions)

Recommended branch protection for main:

  • Require pull request before merging
  • Require status checks: CI Gate
  • Require branches to be up to date before merge

Server Setup (Ubuntu/Debian)

  1. Install system packages: bash scripts/install_server_system.sh
  2. Bootstrap dev environment: bash scripts/bootstrap_dev.sh
  3. Install training extras: bash scripts/bootstrap_train.sh
  4. Run health check: bash scripts/doctor.sh

Detailed guide: docs/SERVER_SETUP.md

ZIM Data Workflow (IIAB)

Keep raw .zim files on server storage (for example /data/iiab/zim/), not in Git.

For a first-pass talking-only dataset profile (English prose focus), generate include/exclude manifests:

bash scripts/first_pass_zim_profile.sh

To also move excluded local ZIMs from hot storage to warm storage:

bash scripts/first_pass_zim_profile.sh --move-excluded

This writes:

  • artifacts/reports/first_pass_include_targets.txt (target profile, includes Gutenberg)
  • artifacts/reports/first_pass_include_zims.txt (currently present and included)
  • artifacts/reports/first_pass_exclude_zims.txt (currently present and excluded)
  1. Extract text corpus from ZIM:
PYTHONPATH=src .venv/bin/python -m llm.cli extract-zim-text \
  --input-zim /data/iiab/zim/wikipedia_en_all_maxi.zim \
  --output data/extracted/wiki_corpus.txt \
  --max-articles 50000 \
  --min-chars 200

If extraction returns written_articles=0, retry with a lower --min-chars (for example 20). If extract-zim-text reports that the ZIM has no fulltext index, generate a --paths-file from the ZIM suggestion/title index and rerun extraction with that file.

  2. Analyze extracted corpora and generate boilerplate candidates:
PYTHONPATH=src .venv/bin/python -m llm.cli corpus-quality-report \
  --input-dir data/extracted \
  --output artifacts/reports/corpus_quality.json
  3. Clean corpora before tokenizer training:
PYTHONPATH=src .venv/bin/python -m llm.cli clean-corpus-batch \
  --input-dir data/extracted \
  --output-dir data/cleaned \
  --boilerplate-report artifacts/reports/corpus_quality.json \
  --en-only

By default this cleanup step also decodes HTML entities and strips common web-shell artifacts (HTML-like tags, repeated nav/menu phrases, site suffixes such as - Stack Overflow). Disable individual transforms with: --no-decode-html-entities, --no-strip-html-tags, --no-strip-site-suffixes, --no-strip-nav-phrases, --no-strip-stack-metadata, --no-collapse-repeated-prefix, --no-strip-inline-score-tokens.

To enforce English-only cleanup, add --en-only (with tunable thresholds: --en-min-words, --en-min-stopword-ratio, --en-min-stopword-count, --en-min-latin-ratio).

Additional quality guards are enabled by default:

  • minimum words per line (--min-words, default 6)
  • symbol-density filter (--max-symbol-ratio, default 0.20)
  • URL-heavy line filter (--max-urls-per-line, default 1)
  • repetitive-token noise filter (--repeated-token-run-threshold, default 8)

For talking-only passes, keep code filtering enabled (default) or tune with: --code-symbol-ratio-threshold and --code-keyword-hits-threshold.
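
The quality guards listed above can be sketched as a single per-line predicate. This is a minimal illustration, not the code behind clean-corpus-batch: the function names, the URL regex, the definition of "symbol" (any non-alphanumeric, non-space character), and the choice of `>=` for the repeated-token threshold are all assumptions. Only the default values come from the flag list above.

```python
import re

# Assumption: a URL is anything starting with http:// or https://.
URL_RE = re.compile(r"https?://\S+")


def symbol_ratio(line: str) -> float:
    """Fraction of non-space characters that are neither letters nor digits."""
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if not c.isalnum()) / len(chars)


def longest_token_run(tokens: list[str]) -> int:
    """Length of the longest run of identical consecutive tokens."""
    longest = run = 0
    prev = None
    for tok in tokens:
        run = run + 1 if tok == prev else 1
        longest = max(longest, run)
        prev = tok
    return longest


def keep_line(
    line: str,
    min_words: int = 6,                      # --min-words
    max_symbol_ratio: float = 0.20,          # --max-symbol-ratio
    max_urls_per_line: int = 1,              # --max-urls-per-line
    repeated_token_run_threshold: int = 8,   # --repeated-token-run-threshold
) -> bool:
    """Return True if the line passes all four hypothetical guards."""
    tokens = line.split()
    if len(tokens) < min_words:
        return False
    if symbol_ratio(line) > max_symbol_ratio:
        return False
    if len(URL_RE.findall(line)) > max_urls_per_line:
        return False
    if longest_token_run(tokens) >= repeated_token_run_threshold:
        return False
    return True
```

Checking cheap conditions first (word count) before the character scan keeps the filter fast on large corpora; the real CLI may order or define these checks differently.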

3a. Pull a bounded Hugging Face dataset slice (for example FineWeb sample rows):

python3 scripts/pull_hf_rows.py \
  --dataset HuggingFaceFW/fineweb \
  --config sample-10BT \
  --split train \
  --output /mnt/ceph/llm/data/extracted/fineweb_sample-10BT_rows100k.txt \
  --max-rows 100000

Write these pulls to warm storage first; full FineWeb variants are much larger than a typical hot disk.

3aa. Bulk-download FineWeb parquet shards (resumable):

# create token in Hugging Face web UI: Settings -> Access Tokens (read scope)
export HF_TOKEN=hf_xxx

# sample-10BT (~30.6 GB) -> hot storage
HF_HUB_DISABLE_XET=1 .venv/bin/hf download HuggingFaceFW/fineweb \
  --repo-type dataset \
  --include "sample/10BT/*.parquet" \
  --local-dir data/fineweb/sample-10BT \
  --max-workers 2 \
  --token "$HF_TOKEN"

# sample-350BT (~1.06 TB) -> warm storage
HF_HUB_DISABLE_XET=1 .venv/bin/hf download HuggingFaceFW/fineweb \
  --repo-type dataset \
  --include "sample/350BT/*.parquet" \
  --local-dir /mnt/ceph/llm/data/fineweb/sample-350BT \
  --max-workers 2 \
  --token "$HF_TOKEN"

Notes:

  • HF_TOKEN is recommended (higher rate limits) but not strictly required for public datasets.
  • Hugging Face SSH keys are for Git-over-SSH and are not used by hf download.

3ab. Stage FineWeb chunks from warm to hot as needed:

bash scripts/stage_fineweb_from_warm.sh --max-files 4 --max-gib 8
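
One way to read the --max-files and --max-gib limits is as a greedy selection over the sorted file list that stops at whichever bound is hit first. The sketch below illustrates that reading only; `select_for_staging` is a hypothetical helper, and the actual ordering and stopping rule in stage_fineweb_from_warm.sh may differ.

```python
GIB = 1024 ** 3


def select_for_staging(files, max_files, max_gib):
    """Pick files in sorted order until a count or size budget is reached.

    files: iterable of (relative_path, size_bytes) pairs.
    Assumption: selection stops at the first file that would exceed the budget.
    """
    budget = max_gib * GIB
    chosen = []
    used = 0
    for path, size in sorted(files):
        if len(chosen) >= max_files:
            break
        if used + size > budget:
            break
        chosen.append(path)
        used += size
    return chosen
```

For example, with four files of 3, 3, 3, and 1 GiB and limits of 4 files and 8 GiB, this sketch stages only the first two files, because adding the third would exceed the size budget.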

3ac. Run rolling warm->hot staging + sharding loop (recommended for 350BT on limited hot disk):

bash scripts/fineweb_stage_shard_loop.sh \
  --stage-max-files 10 \
  --process-max-files 10 \
  --sleep-seconds 120

This loop stages a bounded number of parquet files to hot storage, builds verified shard batches under data/shards_global/fineweb-global-bpe-v1/, syncs those batches back to warm storage, and purges processed hot parquet files.

3ad. Build tokenizer + token shards directly from FineWeb parquet:

PYTHONPATH=src .venv/bin/python scripts/fineweb_parquet_to_shards.py \
  --input-dir data/fineweb/sample-10BT \
  --output-dir data/shards_global/fineweb-s10bt-global-bpe-v1 \
  --tokenizer-out artifacts/tokenizer/fineweb-s10bt-global-bpe-v1.json \
  --field text \
  --min-chars 80 \
  --shard-size-tokens 5000000 \
  --val-ratio 0.01

This writes manifest.json + shard .bin files directly, skipping the intermediate extracted-text step. Use --max-files for bounded test runs.

3b. Run heuristic dataset risk audit:

PYTHONPATH=src .venv/bin/python -m llm.cli dataset-risk-report \
  --input-dir data/cleaned \
  --output artifacts/reports/dataset_risk.json

This reports lexical cues for toxicity, stereotypes, political content, and refusal-like phrases. Use it as a screening signal, then manually review high-risk segments.

  4. Train tokenizer on cleaned corpus:
PYTHONPATH=src .venv/bin/python -m llm.cli train-tokenizer \
  --input data/cleaned/wiki_corpus.clean.txt \
  --output artifacts/tokenizer/vocab.json \
  --bpe-vocab-size 32000 \
  --bpe-min-frequency 2
  5. Shard tokenized corpus for training:
PYTHONPATH=src .venv/bin/python -m llm.cli shard-corpus \
  --input data/cleaned/wiki_corpus.clean.txt \
  --tokenizer artifacts/tokenizer/vocab.json \
  --output-dir data/shards/wiki_bpe \
  --shard-size-tokens 5000000 \
  --val-ratio 0.01
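
To make --shard-size-tokens and --val-ratio concrete, here is a minimal sketch of how a token stream could be split into train/val shards. It assumes the validation split is taken from the tail of the stream and that the last shard in each split may be shorter than the others; `split_into_shards` is a hypothetical name, and shard-corpus may choose its validation tokens differently.

```python
def split_into_shards(token_ids, shard_size_tokens, val_ratio):
    """Split token IDs into train/val lists of fixed-size shards.

    Assumption: the final val_ratio fraction of tokens is held out for validation.
    """
    n_val = int(len(token_ids) * val_ratio)
    split_at = len(token_ids) - n_val

    def chunk(ids):
        # The final chunk may be shorter than shard_size_tokens.
        return [ids[i:i + shard_size_tokens] for i in range(0, len(ids), shard_size_tokens)]

    return {"train": chunk(token_ids[:split_at]), "val": chunk(token_ids[split_at:])}
```

With 1000 tokens, a shard size of 300, and a validation ratio of 0.1, this yields three train shards of 300 tokens and one validation shard of 100 tokens.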

5b. Build one global tokenizer for multi-dataset training:

PYTHONPATH=src .venv/bin/python -m llm.cli train-tokenizer-global \
  --input-dir data/cleaned \
  --pattern "*.clean.txt" \
  --from-shards-path data/shards \
  --output artifacts/tokenizer/global-bpe-v1.json \
  --bpe-vocab-size 32000 \
  --bpe-min-frequency 2

5c. Re-shard many corpora with that global tokenizer:

PYTHONPATH=src .venv/bin/python -m llm.cli shard-corpus-batch \
  --input-dir data/cleaned \
  --pattern "*.clean.txt" \
  --from-shards-path data/shards \
  --tokenizer artifacts/tokenizer/global-bpe-v1.json \
  --output-root data/shards_global/global-bpe-v1
  6. Inspect corpus quickly:
PYTHONPATH=src .venv/bin/python -m llm.cli stats --input data/cleaned/wiki_corpus.clean.txt
  7. Verify shard integrity before training:
PYTHONPATH=src .venv/bin/python -m llm.cli verify-shards \
  --path data/shards \
  --raw-zim-dir data/raw_zim \
  --strict-source
  8. Run a baseline training test:
PYTHONPATH=src .venv/bin/python -m llm.cli train \
  --shards-path data/shards/medlineplus.gov_en_all_2025-01 \
  --output-dir artifacts/checkpoints/medlineplus_baseline \
  --max-steps 200 \
  --batch-size 8 \
  --context-length 256 \
  --lr-schedule cosine \
  --lr-warmup-steps 50 \
  --grad-accum-steps 1 \
  --fail-on-eval-regression \
  --precision auto

Note: train requires all selected manifests to share the exact same tokenizer mapping. Use a global tokenizer + shard-corpus-batch output root for multi-dataset runs.

For higher sustained GPU utilization on CUDA, use --precision auto and keep validation less frequent (--eval-interval 500 --eval-steps 10). If utilization is still bursty on smaller models, test --compile-model.

Training now supports:

  • warmup + cosine LR schedule (--lr-schedule, --lr-warmup-steps, --lr-min-ratio)
  • gradient accumulation (--grad-accum-steps)
  • fixed held-out eval batches (--no-eval-freeze-batches to disable)
  • eval regression gate (--fail-on-eval-regression --eval-regression-tolerance 0.20)
  • optional weights-only export (--export-safetensors)
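
As an illustration of the warmup + cosine schedule, the sketch below computes a learning rate per step from the three schedule knobs. It is a common formulation, not necessarily the one in train: the function name, the linear warmup shape, and the `min_ratio=0.1` default are assumptions (the default of --lr-min-ratio is not stated here).

```python
import math


def lr_at_step(step, base_lr, max_steps, warmup_steps, min_ratio=0.1):
    """Linear warmup to base_lr, then cosine decay down to base_lr * min_ratio.

    Assumption: min_ratio plays the role of --lr-min-ratio; 0.1 is illustrative.
    """
    if warmup_steps > 0 and step < warmup_steps:
        # Ramp from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    decay_steps = max(1, max_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return base_lr * (min_ratio + (1.0 - min_ratio) * cosine)
```

With --max-steps 200 and --lr-warmup-steps 50 as in the baseline command, this sketch reaches the full learning rate at the end of warmup and decays to `base_lr * min_ratio` by the final step.
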
  9. Generate text from a checkpoint:
PYTHONPATH=src .venv/bin/python -m llm.cli generate \
  --checkpoint artifacts/checkpoints/medlineplus_baseline/last.pt \
  --prompt "The future of medicine is" \
  --max-new-tokens 200 \
  --temperature 0.9 \
  --top-k 50
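
For readers unfamiliar with the --temperature and --top-k flags, this is a minimal standard-library sketch of temperature-scaled top-k sampling over one step's logits. It is not the generate implementation; `sample_top_k` and its `rng` parameter are hypothetical, and a real model would operate on tensors rather than Python lists.

```python
import math
import random


def sample_top_k(logits, temperature=0.9, top_k=50, rng=None):
    """Sample one token index from the top_k highest logits after temperature scaling."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    rng = rng or random.Random()
    k = min(top_k, len(logits))
    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    # Subtract the max before exponentiating for numerical stability.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]
```

Lower temperatures sharpen the distribution toward the highest logit, and `top_k=1` reduces to greedy decoding.
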
  10. Run standardized checkpoint eval (fixed prompt suite + scored report):
PYTHONPATH=src .venv/bin/python scripts/eval_checkpoint_prompts.py \
  --checkpoint artifacts/checkpoints/medlineplus_baseline/last.pt \
  --suite configs/eval/standard_prompt_suite_v1.json

Writes a JSON report under artifacts/reports/evals/ so runs can be compared over time.

FineWeb-Only First-Pass Training

Use this when you want round-1 pretraining only from FineWeb (no ZIM mix yet):

# 1) build tokenizer + shards directly from parquet
PYTHONPATH=src .venv/bin/python scripts/fineweb_parquet_to_shards.py \
  --input-dir data/fineweb/sample-10BT \
  --output-dir data/shards_global/fineweb-s10bt-global-bpe-v1 \
  --tokenizer-out artifacts/tokenizer/fineweb-s10bt-global-bpe-v1.json \
  --field text \
  --min-chars 80 \
  --shard-size-tokens 5000000 \
  --val-ratio 0.01

# 2) verify and train
PYTHONPATH=src .venv/bin/python -m llm.cli verify-shards \
  --path data/shards_global/fineweb-s10bt-global-bpe-v1

PYTHONPATH=src .venv/bin/python -m llm.cli train \
  --shards-path data/shards_global/fineweb-s10bt-global-bpe-v1 \
  --output-dir artifacts/checkpoints/fineweb-s10bt-run1 \
  --device cuda \
  --max-steps 1000 \
  --batch-size 12 \
  --context-length 256 \
  --lr-schedule cosine \
  --lr-warmup-steps 200 \
  --fail-on-eval-regression \
  --precision auto

Resume training from the latest checkpoint:

PYTHONPATH=src .venv/bin/python -m llm.cli train \
  --shards-path data/shards_global/fineweb-s10bt-global-bpe-v1 \
  --output-dir artifacts/checkpoints/fineweb-s10bt-run1 \
  --device cuda \
  --resume-from artifacts/checkpoints/fineweb-s10bt-run1/last.pt \
  --max-steps 3000

Optional text-first path still exists for inspection-heavy runs: parquet_to_corpus -> clean-corpus-batch -> train-tokenizer-global -> shard-corpus-batch.

Incremental FineWeb Adds While Training

You can start training on a subset, then add new parquet files with the same tokenizer and resume:

# phase 1 file snapshot (example: first 10 files)
find data/fineweb/sample-10BT/sample/10BT -maxdepth 1 -type f -name '*.parquet' | sort | head -n 10 | sed 's#^data/fineweb/sample-10BT/##' > artifacts/reports/fineweb_sample10bt_phase1_files.txt

# build phase 1 tokenizer + shards
PYTHONPATH=src .venv/bin/python scripts/fineweb_parquet_to_shards.py \
  --input-dir data/fineweb/sample-10BT \
  --files-list artifacts/reports/fineweb_sample10bt_phase1_files.txt \
  --output-dir data/shards_global/fineweb-s10bt-incremental/phase1 \
  --tokenizer-out artifacts/tokenizer/fineweb-s10bt-incremental-bpe-v1.json \
  --field text

# start training on phase 1
PYTHONPATH=src .venv/bin/python -m llm.cli train \
  --shards-path data/shards_global/fineweb-s10bt-incremental \
  --output-dir artifacts/checkpoints/fineweb-s10bt-incremental-run1 \
  --device cuda

# later: build phase 2 from newly arrived files using same tokenizer
find data/fineweb/sample-10BT/sample/10BT -maxdepth 1 -type f -name '*.parquet' | sort | sed 's#^data/fineweb/sample-10BT/##' > /tmp/all_parquets.txt
comm -23 /tmp/all_parquets.txt artifacts/reports/fineweb_sample10bt_phase1_files.txt > artifacts/reports/fineweb_sample10bt_phase2_files.txt
PYTHONPATH=src .venv/bin/python scripts/fineweb_parquet_to_shards.py \
  --input-dir data/fineweb/sample-10BT \
  --files-list artifacts/reports/fineweb_sample10bt_phase2_files.txt \
  --output-dir data/shards_global/fineweb-s10bt-incremental/phase2 \
  --tokenizer-in artifacts/tokenizer/fineweb-s10bt-incremental-bpe-v1.json \
  --field text

# resume; train sees both manifests under shards-path
PYTHONPATH=src .venv/bin/python -m llm.cli train \
  --shards-path data/shards_global/fineweb-s10bt-incremental \
  --output-dir artifacts/checkpoints/fineweb-s10bt-incremental-run1 \
  --device cuda \
  --resume-from artifacts/checkpoints/fineweb-s10bt-incremental-run1/last.pt
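
The comm -23 step above computes "all files minus phase 1". For a third or later phase, the same idea needs the union of every earlier phase list. A small sketch of that set difference follows; `read_file_list` and `next_phase_files` are hypothetical helpers, not part of the repository.

```python
from pathlib import Path


def read_file_list(path):
    """Read one relative parquet path per line, skipping blank lines."""
    return [line.strip() for line in Path(path).read_text().splitlines() if line.strip()]


def next_phase_files(all_files, *previous_phases):
    """Return sorted files that appear in no earlier phase list."""
    done = set()
    for phase in previous_phases:
        done.update(phase)
    return sorted(set(all_files) - done)
```

Unlike comm, this does not require the inputs to be pre-sorted, and it accepts any number of earlier phase lists.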

On this 20-core host, default FineWeb shard splitting should use 15 parallel streams.

RTX 5070 Tuned Profiles

  • Tuned profile docs: docs/RTX5070_TUNING.md
  • Saved JSON profiles:
    • configs/train/rtx5070/fineweb_global_bpe_v1_big.json (recommended, BPE)
  • Launch tuned big profile:
bash scripts/train_rtx5070_fineweb_bpe_v1_big.sh

Warm Storage (Ceph Mount)

Use ./data and ./artifacts as the hot working set. Use /mnt/ceph/llm/data as warm cache/backup for durability and overflow.

  • Recommended mount layout:
    • /mnt/ceph/llm/data/raw_zim/
    • /mnt/ceph/llm/data/extracted/
    • /mnt/ceph/llm/data/shards/
    • /mnt/ceph/llm/data/tokenizer/
  • Version datasets by ZIM date stamp:
    • ZIM: serverfault.com_en_all_2025-08.zim
    • Version tag: serverfault_2025-08
    • Raw ZIM: /mnt/ceph/llm/data/raw_zim/serverfault.com_en_all_2025-08.zim
    • Extracted text: /mnt/ceph/llm/data/extracted/serverfault_2025-08.txt
    • Tokenizer: /mnt/ceph/llm/data/tokenizer/serverfault_2025-08-vocab.json
    • Shards: /mnt/ceph/llm/data/shards/serverfault_2025-08/
  • Default run model:
    • Process locally in data/extracted, data/shards, and artifacts/tokenizer.
    • Periodically sync to Ceph for backup/caching.
  • Push local artifacts to warm storage:
bash scripts/sync_warm_storage.sh /mnt/ceph/llm/data

This now syncs training-critical inputs/outputs including: data/raw_zim, data/fineweb, data/cleaned, data/extracted, data/shards, data/shards_global, artifacts/tokenizer, artifacts/checkpoints, and artifacts/reports.

  • Continuous ZIM offload worker (hot -> warm):
bash scripts/zim_offload_worker.sh data/raw_zim /mnt/ceph/llm/data/raw_zim 120
  • Pull artifacts back from warm storage to local hot workspace:
bash scripts/hydrate_from_warm_storage.sh /mnt/ceph/llm/data
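
The date-stamp versioning convention above can be expressed as a small helper that derives the version tag and warm-storage paths from a ZIM filename. This is a sketch inferred from the single serverfault example: the regex, the rule "site name is everything before the first dot or underscore", and both function names are assumptions.

```python
import re

# Assumption: filenames look like <site>[.tld]_<lang>_<flavour>_<YYYY-MM>.zim
ZIM_NAME_RE = re.compile(r"^(?P<site>[^._]+).*_(?P<stamp>\d{4}-\d{2})\.zim$")


def version_tag(zim_filename):
    """Derive a tag such as serverfault_2025-08 from a date-stamped ZIM filename."""
    match = ZIM_NAME_RE.match(zim_filename)
    if match is None:
        raise ValueError(f"no YYYY-MM date stamp found in {zim_filename!r}")
    return f"{match.group('site')}_{match.group('stamp')}"


def warm_paths(zim_filename, warm_root="/mnt/ceph/llm/data"):
    """Map a ZIM filename to the warm-storage layout described above."""
    tag = version_tag(zim_filename)
    return {
        "raw_zim": f"{warm_root}/raw_zim/{zim_filename}",
        "extracted": f"{warm_root}/extracted/{tag}.txt",
        "tokenizer": f"{warm_root}/tokenizer/{tag}-vocab.json",
        "shards": f"{warm_root}/shards/{tag}/",
    }
```

Filenames without a date stamp raise an error rather than producing an unversioned tag, which keeps undated inputs from silently colliding.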

Current Capabilities

  • Text stats CLI for quick corpus sanity checks.
  • Batch corpus quality report generation (corpus-quality-report).
  • Batch corpus cleanup and dedupe (clean-corpus-batch).
  • Heuristic dataset risk auditing (dataset-risk-report).
  • Direct FineWeb parquet -> tokenizer -> shard pipeline (scripts/fineweb_parquet_to_shards.py).
  • BPE tokenizer workflow with train/save/load + contract fingerprinting.
  • Token-window data pipeline (TokenWindowDataset) for next-token training pairs.
  • ZIM archive text extraction (extract-zim-text) for server-hosted .zim files.
    • Automatically falls back to suggestion-index paths if fulltext search returns no matches.
  • Corpus sharding (shard-corpus) into train/val token shard binaries + manifest.
  • Batch corpus sharding (shard-corpus-batch) with one shared tokenizer.
  • Baseline GPT training (train) with checkpoint save/resume.
    • Default architecture: RoPE + RMSNorm + SwiGLU (gpt_rope_rmsnorm_swiglu_v1).
    • Includes AdamW no-decay param groups, warmup/cosine LR, and grad accumulation.
  • Checkpoint-based text generation (generate) with temperature/top-k sampling.
  • Optional safetensors export for deployment (--export-safetensors).
  • Unit tests for tokenizer round-trips and unknown token behavior.
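
The idea behind next-token training pairs from token windows can be sketched in a few lines. This is not the TokenWindowDataset implementation; `token_windows` and its `stride` parameter are hypothetical, and the real class may handle overlap and the trailing partial window differently.

```python
def token_windows(token_ids, context_length, stride=None):
    """Build (input, target) pairs where target is input shifted by one token.

    Assumption: windows that cannot supply a full shifted target are dropped.
    """
    stride = stride or context_length
    pairs = []
    for start in range(0, len(token_ids) - context_length, stride):
        inputs = token_ids[start:start + context_length]
        targets = token_ids[start + 1:start + context_length + 1]
        pairs.append((inputs, targets))
    return pairs
```

Each target sequence is the input shifted left by one position, so the model at position i is trained to predict token i + 1.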

Next Milestones

  1. Expand checkpoint eval suite and track regressions in CI.
  2. Add tokenizer-aware dataset manifests for long-running incremental FineWeb phases.
  3. Add larger-context training profiles and memory/throughput benchmarking.
  4. Add finetuning flows for classification and instruction datasets.

References

Reference Repo Sync

git submodule update --init --recursive
git submodule update --remote information/external/LLMs-from-scratch

Use the first command after clone; use the second to pull newer upstream reference commits.

Wiki Documentation

Repository wiki pages are maintained from wiki/*.md.

Publish updates to GitHub wiki:

bash scripts/publish_wiki.sh git@github.com:aditaa/llm.wiki.git

Preferred workflow:

  1. Update README.md and AGENTS.md as needed.
  2. Update matching pages in wiki/.
  3. Publish wiki with scripts/publish_wiki.sh.

Dataset inventory and intended use are tracked in:

  • wiki/Dataset-Registry.md
