Production-style, decoder-only LLM engineering project focused on reproducible data pipelines, tokenizer/sharding workflows, and GPU training from scratch.
- Scope: end-to-end LLM workflow from raw corpora to checkpoints and generation.
- Data focus: ZIM + FineWeb workflows with hot (`./data`) and warm (`/mnt/ceph/llm/data`) storage patterns.
- Engineering focus: deterministic scripts, integrity checks, CI gating, and wiki-backed docs.
- Wiki: `wiki/`
- Setup: `docs/SERVER_SETUP.md`
- RTX 5070 tuning: `docs/RTX5070_TUNING.md`
- HF release + deploy: `docs/HF_RELEASE_AND_DEPLOY.md`
- Contributor guide: `AGENTS.md`
- Build a minimal but production-style training stack incrementally.
- Keep each subsystem testable (tokenizer, data, model, training, evaluation).
- Favor reproducible experiments through explicit configs and scripts.
- `src/llm/`: core Python package
- `tests/`: unit tests
- `docs/`: architecture and roadmap notes
- `information/`: reference material and external links for project guidance
- `requirements/`: system and Python dependency lists for server setup
- `scripts/`: bootstrap/install/doctor scripts
- `data/`: local/intermediate corpora (gitignored except `data/README.md`)
- `artifacts/`: local outputs (vocab, checkpoints, logs; gitignored)
- `Makefile`: common developer commands
```bash
bash scripts/bootstrap_dev.sh
make setup-infer               # install inference/deploy dependencies
make test                      # run unit tests
make lint                      # run Ruff checks
make format                    # run Black formatter
make typecheck                 # run MyPy
make smoke                     # tiny CLI smoke check
make verify-shards             # print shard integrity check usage
make train                     # print baseline training command usage
make generate                  # print checkpoint text-generation command usage
make eval-checkpoint           # print standardized prompt-suite eval usage
make train-tokenizer-global    # print shared-tokenizer command usage
make corpus-quality-report     # print quality report command usage
make clean-corpus-batch        # print batch cleanup command usage
make dataset-risk-report       # print heuristic dataset risk audit command usage
make pull-hf-rows              # print Hugging Face rows API pull helper usage
make fineweb-parquet-to-shards # print direct FineWeb parquet->token-shards usage
make stage-fineweb-from-warm   # print warm->hot FineWeb chunk staging usage
make fineweb-stage-shard-loop  # print rolling stage->shard->verify->sync->purge usage
make shard-corpus-batch        # print shared-tokenizer batch sharding usage
make sync-warm                 # sync raw/training data + artifacts to warm storage
make hydrate-warm              # hydrate hot workspace from warm storage
make offload-zim               # continuously move raw ZIMs hot -> warm
make hf-prepare-publish        # print HF bundle/publish usage
make hf-download-model         # print full HF model download usage
make serve-openai              # print local OpenAI-compatible server usage
make doctor                    # verify binaries and Python deps
```

GitHub Actions workflows are defined in `.github/workflows/`:
- `ci.yml`: lint, typecheck, unit tests, smoke checks on pull requests and pushes to `main`
- `wiki-sync.yml`: publish `wiki/*.md` changes to the GitHub Wiki
- Dependabot config: `.github/dependabot.yml` (weekly updates for `pip`, `requirements/`, and GitHub Actions)
Recommended branch protection for main:
- Require pull request before merging
- Require status checks: `CI Gate`
- Require branches to be up to date before merge
- Install system packages: `bash scripts/install_server_system.sh`
- Bootstrap dev environment: `bash scripts/bootstrap_dev.sh`
- Install training extras: `bash scripts/bootstrap_train.sh`
- Run health check: `bash scripts/doctor.sh`
Detailed guide: docs/SERVER_SETUP.md
Keep raw .zim files on server storage (for example /data/iiab/zim/), not in Git.
For a first-pass talking-only dataset profile (English prose focus), generate include/exclude manifests:
```bash
bash scripts/first_pass_zim_profile.sh
```
To also move excluded local ZIMs from hot storage to warm storage:
```bash
bash scripts/first_pass_zim_profile.sh --move-excluded
```
This writes:
- `artifacts/reports/first_pass_include_targets.txt` (target profile, includes Gutenberg)
- `artifacts/reports/first_pass_include_zims.txt` (currently present and included)
- `artifacts/reports/first_pass_exclude_zims.txt` (currently present and excluded)
- Extract text corpus from ZIM:
```bash
PYTHONPATH=src .venv/bin/python -m llm.cli extract-zim-text \
  --input-zim /data/iiab/zim/wikipedia_en_all_maxi.zim \
  --output data/extracted/wiki_corpus.txt \
  --max-articles 50000 \
  --min-chars 200
```
If extraction returns `written_articles=0`, retry with a lower `--min-chars` (for example 20).
If `extract-zim-text` reports no fulltext index, generate a `--paths-file` from the
ZIM suggestions/title index and rerun extraction with that file.
- Analyze extracted corpora and generate boilerplate candidates:
```bash
PYTHONPATH=src .venv/bin/python -m llm.cli corpus-quality-report \
  --input-dir data/extracted \
  --output artifacts/reports/corpus_quality.json
```
- Clean corpora before tokenizer training:
```bash
PYTHONPATH=src .venv/bin/python -m llm.cli clean-corpus-batch \
  --input-dir data/extracted \
  --output-dir data/cleaned \
  --boilerplate-report artifacts/reports/corpus_quality.json \
  --en-only
```
By default this cleanup step also decodes HTML entities and strips common web-shell artifacts
(HTML-like tags, repeated nav/menu phrases, site suffixes such as `- Stack Overflow`).
Disable individual transforms with:
`--no-decode-html-entities`, `--no-strip-html-tags`, `--no-strip-site-suffixes`,
`--no-strip-nav-phrases`, `--no-strip-stack-metadata`, `--no-collapse-repeated-prefix`,
`--no-strip-inline-score-tokens`.
To enforce English-only cleanup, add `--en-only` (with tunable thresholds:
`--en-min-words`, `--en-min-stopword-ratio`, `--en-min-stopword-count`,
`--en-min-latin-ratio`).
Additional quality guards are enabled by default:
- minimum words per line (`--min-words`, default 6)
- symbol-density filter (`--max-symbol-ratio`, default 0.20)
- URL-heavy line filter (`--max-urls-per-line`, default 1)
- repetitive-token noise filter (`--repeated-token-run-threshold`, default 8)

For talking-only passes, keep code filtering enabled (default) or tune with
`--code-symbol-ratio-threshold` and `--code-keyword-hits-threshold`.
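The guards above are simple per-line predicates. A minimal sketch, assuming the default thresholds listed above (the actual CLI implementation may differ in detail, e.g. how symbols are counted):

```python
import re

def keep_line(line: str,
              min_words: int = 6,
              max_symbol_ratio: float = 0.20,
              max_urls: int = 1,
              repeated_run_threshold: int = 8) -> bool:
    """Approximate the default per-line quality guards."""
    words = line.split()
    if len(words) < min_words:  # minimum words per line
        return False
    symbols = sum(not (c.isalnum() or c.isspace()) for c in line)
    if line and symbols / len(line) > max_symbol_ratio:  # symbol-density filter
        return False
    if len(re.findall(r"https?://", line)) > max_urls:  # URL-heavy line filter
        return False
    run = 1
    for prev, cur in zip(words, words[1:]):  # repetitive-token noise filter
        run = run + 1 if cur == prev else 1
        if run >= repeated_run_threshold:
            return False
    return True
```

Lines failing any predicate are dropped; loosening a flag only relaxes the corresponding check.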
3a. Pull a bounded Hugging Face dataset slice (for example FineWeb sample rows):
```bash
python3 scripts/pull_hf_rows.py \
  --dataset HuggingFaceFW/fineweb \
  --config sample-10BT \
  --split train \
  --output /mnt/ceph/llm/data/extracted/fineweb_sample-10BT_rows100k.txt \
  --max-rows 100000
```
Use warm storage for these pulls first; full FineWeb variants are much larger than a typical hot disk.
3aa. Bulk-download FineWeb parquet shards (resumable):
```bash
# create token in Hugging Face web UI: Settings -> Access Tokens (read scope)
export HF_TOKEN=hf_xxx

# sample-10BT (~30.6 GB) -> hot storage
HF_HUB_DISABLE_XET=1 .venv/bin/hf download HuggingFaceFW/fineweb \
  --repo-type dataset \
  --include "sample/10BT/*.parquet" \
  --local-dir data/fineweb/sample-10BT \
  --max-workers 2 \
  --token "$HF_TOKEN"

# sample-350BT (~1.06 TB) -> warm storage
HF_HUB_DISABLE_XET=1 .venv/bin/hf download HuggingFaceFW/fineweb \
  --repo-type dataset \
  --include "sample/350BT/*.parquet" \
  --local-dir /mnt/ceph/llm/data/fineweb/sample-350BT \
  --max-workers 2 \
  --token "$HF_TOKEN"
```
Notes:
- `HF_TOKEN` is recommended (higher rate limits) but not strictly required for public datasets.
- Hugging Face SSH keys are for Git-over-SSH and are not used by `hf download`.
3ab. Stage FineWeb chunks from warm to hot as needed:
```bash
bash scripts/stage_fineweb_from_warm.sh --max-files 4 --max-gib 8
```
3ac. Run rolling warm->hot staging + sharding loop (recommended for 350BT on limited hot disk):
```bash
bash scripts/fineweb_stage_shard_loop.sh \
  --stage-max-files 10 \
  --process-max-files 10 \
  --sleep-seconds 120
```
This loop stages bounded parquet files to hot storage, builds verified shard batches under
`data/shards_global/fineweb-global-bpe-v1/`, syncs those batches back to warm storage,
and purges processed hot parquet files.
3ad. Build tokenizer + token shards directly from FineWeb parquet:
```bash
PYTHONPATH=src .venv/bin/python scripts/fineweb_parquet_to_shards.py \
  --input-dir data/fineweb/sample-10BT \
  --output-dir data/shards_global/fineweb-s10bt-global-bpe-v1 \
  --tokenizer-out artifacts/tokenizer/fineweb-s10bt-global-bpe-v1.json \
  --field text \
  --min-chars 80 \
  --shard-size-tokens 5000000 \
  --val-ratio 0.01
```
This writes `manifest.json` + shard `.bin` files directly, skipping extracted text.
Use `--max-files` for bounded test runs.
3b. Run heuristic dataset risk audit:
```bash
PYTHONPATH=src .venv/bin/python -m llm.cli dataset-risk-report \
  --input-dir data/cleaned \
  --output artifacts/reports/dataset_risk.json
```
This reports lexical cues for toxicity, stereotypes, political content, and refusal-like phrases. Use it as a screening signal, then manually review high-risk segments.
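Lexical-cue screening of this kind amounts to counting phrase-list hits per category. The cue lists below are hypothetical placeholders (the real CLI ships its own lexicons); the sketch only shows the screening shape:

```python
from collections import Counter

# Hypothetical cue lists -- the real dataset-risk-report ships larger lexicons.
CUES = {
    "refusal_like": ["i cannot help", "as an ai", "i'm not able to"],
    "political": ["election", "partisan"],
}

def risk_screen(lines):
    """Count lines with at least one cue hit per category.

    A screening signal only, not a verdict: high counts flag segments
    for manual review rather than automatic exclusion.
    """
    counts = Counter()
    for line in lines:
        lowered = line.lower()
        for category, phrases in CUES.items():
            if any(p in lowered for p in phrases):
                counts[category] += 1
    return {cat: counts.get(cat, 0) for cat in CUES}
```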
- Train tokenizer on cleaned corpus:
```bash
PYTHONPATH=src .venv/bin/python -m llm.cli train-tokenizer \
  --input data/cleaned/wiki_corpus.clean.txt \
  --output artifacts/tokenizer/vocab.json \
  --bpe-vocab-size 32000 \
  --bpe-min-frequency 2
```
- Shard tokenized corpus for training:
```bash
PYTHONPATH=src .venv/bin/python -m llm.cli shard-corpus \
  --input data/cleaned/wiki_corpus.clean.txt \
  --tokenizer artifacts/tokenizer/vocab.json \
  --output-dir data/shards/wiki_bpe \
  --shard-size-tokens 5000000 \
  --val-ratio 0.01
```
5b. Build one global tokenizer for multi-dataset training:
```bash
PYTHONPATH=src .venv/bin/python -m llm.cli train-tokenizer-global \
  --input-dir data/cleaned \
  --pattern "*.clean.txt" \
  --from-shards-path data/shards \
  --output artifacts/tokenizer/global-bpe-v1.json \
  --bpe-vocab-size 32000 \
  --bpe-min-frequency 2
```
5c. Re-shard many corpora with that global tokenizer:
```bash
PYTHONPATH=src .venv/bin/python -m llm.cli shard-corpus-batch \
  --input-dir data/cleaned \
  --pattern "*.clean.txt" \
  --from-shards-path data/shards \
  --tokenizer artifacts/tokenizer/global-bpe-v1.json \
  --output-root data/shards_global/global-bpe-v1
```
- Inspect corpus quickly:
```bash
PYTHONPATH=src .venv/bin/python -m llm.cli stats --input data/cleaned/wiki_corpus.clean.txt
```
- Verify shard integrity before training:
```bash
PYTHONPATH=src .venv/bin/python -m llm.cli verify-shards \
  --path data/shards \
  --raw-zim-dir data/raw_zim \
  --strict-source
```
- Run a baseline training test:
```bash
PYTHONPATH=src .venv/bin/python -m llm.cli train \
  --shards-path data/shards/medlineplus.gov_en_all_2025-01 \
  --output-dir artifacts/checkpoints/medlineplus_baseline \
  --max-steps 200 \
  --batch-size 8 \
  --context-length 256 \
  --lr-schedule cosine \
  --lr-warmup-steps 50 \
  --grad-accum-steps 1 \
  --fail-on-eval-regression \
  --precision auto
```
Note: `train` requires all selected manifests to share the exact same tokenizer mapping.
Use a global tokenizer + `shard-corpus-batch` output root for multi-dataset runs.
For higher sustained GPU utilization on CUDA, use `--precision auto` and keep
validation less frequent (`--eval-interval 500 --eval-steps 10`).
If utilization is still bursty on smaller models, test `--compile-model`.
Training now supports:
- warmup + cosine LR schedule (`--lr-schedule`, `--lr-warmup-steps`, `--lr-min-ratio`)
- gradient accumulation (`--grad-accum-steps`)
- fixed held-out eval batches (`--no-eval-freeze-batches` to disable)
- eval regression gate (`--fail-on-eval-regression --eval-regression-tolerance 0.20`)
- optional weights-only export (`--export-safetensors`)
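The warmup + cosine schedule behind `--lr-schedule cosine` can be sketched as a pure function of the step number. Flag names map onto the parameters as noted below; the trainer's exact internals (e.g. off-by-one conventions) may differ slightly:

```python
import math

def lr_at_step(step, base_lr, warmup_steps, max_steps, min_ratio=0.1):
    """Warmup + cosine decay.

    Linear ramp to base_lr over warmup_steps (--lr-warmup-steps), then cosine
    decay down to base_lr * min_ratio (--lr-min-ratio) by max_steps.
    """
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    min_lr = base_lr * min_ratio
    return min_lr + (base_lr - min_lr) * cosine
```

With gradient accumulation, the schedule is usually advanced once per optimizer step, i.e. once per `--grad-accum-steps` micro-batches.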
- Generate text from a checkpoint:
```bash
PYTHONPATH=src .venv/bin/python -m llm.cli generate \
  --checkpoint artifacts/checkpoints/medlineplus_baseline/last.pt \
  --prompt "The future of medicine is" \
  --max-new-tokens 200 \
  --temperature 0.9 \
  --top-k 50
```
- Run standardized checkpoint eval (fixed prompt suite + scored report):
```bash
PYTHONPATH=src .venv/bin/python scripts/eval_checkpoint_prompts.py \
  --checkpoint artifacts/checkpoints/medlineplus_baseline/last.pt \
  --suite configs/eval/standard_prompt_suite_v1.json
```
Writes a JSON report under `artifacts/reports/evals/` so runs can be compared over time.
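The temperature/top-k sampling that `generate` applies per decoding step can be sketched independently of the project's model code. This is a minimal pure-Python version over a raw logits list, not the project's implementation:

```python
import math
import random

def sample_top_k(logits, temperature=0.9, top_k=50, rng=random):
    """Sample a token index: scale logits by 1/temperature, keep the top_k
    highest, softmax over the survivors, then draw proportionally."""
    scaled = [l / temperature for l in logits]
    top = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    m = max(scaled[i] for i in top)
    weights = [math.exp(scaled[i] - m) for i in top]  # numerically stable softmax
    r = rng.random() * sum(weights)
    acc = 0.0
    for idx, w in zip(top, weights):
        acc += w
        if r <= acc:
            return idx
    return top[-1]
```

Lower temperature sharpens the distribution; `top_k=1` degenerates to greedy decoding.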
Use this when you want round-1 pretraining only from FineWeb (no ZIM mix yet):
```bash
# 1) build tokenizer + shards directly from parquet
PYTHONPATH=src .venv/bin/python scripts/fineweb_parquet_to_shards.py \
  --input-dir data/fineweb/sample-10BT \
  --output-dir data/shards_global/fineweb-s10bt-global-bpe-v1 \
  --tokenizer-out artifacts/tokenizer/fineweb-s10bt-global-bpe-v1.json \
  --field text \
  --min-chars 80 \
  --shard-size-tokens 5000000 \
  --val-ratio 0.01

# 2) verify and train
PYTHONPATH=src .venv/bin/python -m llm.cli verify-shards \
  --path data/shards_global/fineweb-s10bt-global-bpe-v1

PYTHONPATH=src .venv/bin/python -m llm.cli train \
  --shards-path data/shards_global/fineweb-s10bt-global-bpe-v1 \
  --output-dir artifacts/checkpoints/fineweb-s10bt-run1 \
  --device cuda \
  --max-steps 1000 \
  --batch-size 12 \
  --context-length 256 \
  --lr-schedule cosine \
  --lr-warmup-steps 200 \
  --fail-on-eval-regression \
  --precision auto
```
Resume training from the latest checkpoint:
```bash
PYTHONPATH=src .venv/bin/python -m llm.cli train \
  --shards-path data/shards_global/fineweb-s10bt-global-bpe-v1 \
  --output-dir artifacts/checkpoints/fineweb-s10bt-run1 \
  --device cuda \
  --resume-from artifacts/checkpoints/fineweb-s10bt-run1/last.pt \
  --max-steps 3000
```
An optional text-first path still exists for inspection-heavy runs:
`parquet_to_corpus -> clean-corpus-batch -> train-tokenizer-global -> shard-corpus-batch`.
You can start training on a subset, then add new parquet files with the same tokenizer and resume:
```bash
# phase 1 file snapshot (example: first 10 files)
find data/fineweb/sample-10BT/sample/10BT -maxdepth 1 -type f -name '*.parquet' \
  | sort | head -n 10 | sed 's#^data/fineweb/sample-10BT/##' \
  > artifacts/reports/fineweb_sample10bt_phase1_files.txt

# build phase 1 tokenizer + shards
PYTHONPATH=src .venv/bin/python scripts/fineweb_parquet_to_shards.py \
  --input-dir data/fineweb/sample-10BT \
  --files-list artifacts/reports/fineweb_sample10bt_phase1_files.txt \
  --output-dir data/shards_global/fineweb-s10bt-incremental/phase1 \
  --tokenizer-out artifacts/tokenizer/fineweb-s10bt-incremental-bpe-v1.json \
  --field text

# start training on phase 1
PYTHONPATH=src .venv/bin/python -m llm.cli train \
  --shards-path data/shards_global/fineweb-s10bt-incremental \
  --output-dir artifacts/checkpoints/fineweb-s10bt-incremental-run1 \
  --device cuda

# later: build phase 2 from newly arrived files using the same tokenizer
find data/fineweb/sample-10BT/sample/10BT -maxdepth 1 -type f -name '*.parquet' \
  | sort | sed 's#^data/fineweb/sample-10BT/##' > /tmp/all_parquets.txt
comm -23 /tmp/all_parquets.txt artifacts/reports/fineweb_sample10bt_phase1_files.txt \
  > artifacts/reports/fineweb_sample10bt_phase2_files.txt
PYTHONPATH=src .venv/bin/python scripts/fineweb_parquet_to_shards.py \
  --input-dir data/fineweb/sample-10BT \
  --files-list artifacts/reports/fineweb_sample10bt_phase2_files.txt \
  --output-dir data/shards_global/fineweb-s10bt-incremental/phase2 \
  --tokenizer-in artifacts/tokenizer/fineweb-s10bt-incremental-bpe-v1.json \
  --field text

# resume; train sees both manifests under shards-path
PYTHONPATH=src .venv/bin/python -m llm.cli train \
  --shards-path data/shards_global/fineweb-s10bt-incremental \
  --output-dir artifacts/checkpoints/fineweb-s10bt-incremental-run1 \
  --device cuda \
  --resume-from artifacts/checkpoints/fineweb-s10bt-incremental-run1/last.pt
```
On this 20-core host, default FineWeb shard splitting should use 15 parallel streams.
- Tuned profile docs: `docs/RTX5070_TUNING.md`
- Saved JSON profiles: `configs/train/rtx5070/fineweb_global_bpe_v1_big.json` (recommended, BPE)
- Launch tuned big profile:
```bash
bash scripts/train_rtx5070_fineweb_bpe_v1_big.sh
```
Use `./data` and `./artifacts` as the hot working set.
Use /mnt/ceph/llm/data as warm cache/backup for durability and overflow.
- Recommended mount layout:
  - `/mnt/ceph/llm/data/raw_zim/`
  - `/mnt/ceph/llm/data/extracted/`
  - `/mnt/ceph/llm/data/shards/`
  - `/mnt/ceph/llm/data/tokenizer/`
- Version datasets by ZIM date stamp:
  - ZIM: `serverfault.com_en_all_2025-08.zim`
  - Version tag: `serverfault_2025-08`
  - Raw ZIM: `/mnt/ceph/llm/data/raw_zim/serverfault.com_en_all_2025-08.zim`
  - Extracted text: `/mnt/ceph/llm/data/extracted/serverfault_2025-08.txt`
  - Tokenizer: `/mnt/ceph/llm/data/tokenizer/serverfault_2025-08-vocab.json`
  - Shards: `/mnt/ceph/llm/data/shards/serverfault_2025-08/`
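The version tag can be derived mechanically from the ZIM filename. A hypothetical helper: the naming rule (site name before the first `.` or `_`, plus the `YYYY-MM` stamp) is inferred from the example above, not taken from project code:

```python
import re

def zim_version_tag(zim_filename: str) -> str:
    """e.g. 'serverfault.com_en_all_2025-08.zim' -> 'serverfault_2025-08'.

    Hypothetical helper: assumes Kiwix-style names ending in a YYYY-MM stamp.
    """
    stamp = re.search(r"(\d{4}-\d{2})\.zim$", zim_filename)
    if not stamp:
        raise ValueError(f"no date stamp in ZIM name: {zim_filename}")
    site = re.split(r"[._]", zim_filename, maxsplit=1)[0]
    return f"{site}_{stamp.group(1)}"
```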
- Default run model:
  - Process locally in `data/extracted`, `data/shards`, and `artifacts/tokenizer`.
  - Periodically sync to Ceph for backup/caching.
- Push local artifacts to warm storage:
```bash
bash scripts/sync_warm_storage.sh /mnt/ceph/llm/data
```
This now syncs training-critical inputs/outputs including:
`data/raw_zim`, `data/fineweb`, `data/cleaned`, `data/extracted`,
`data/shards`, `data/shards_global`, `artifacts/tokenizer`,
`artifacts/checkpoints`, and `artifacts/reports`.
- Continuous ZIM offload worker (hot -> warm):
```bash
bash scripts/zim_offload_worker.sh data/raw_zim /mnt/ceph/llm/data/raw_zim 120
```
- Pull artifacts back from warm storage to local hot workspace:
```bash
bash scripts/hydrate_from_warm_storage.sh /mnt/ceph/llm/data
```
- Text stats CLI for quick corpus sanity checks.
- Batch corpus quality report generation (`corpus-quality-report`).
- Batch corpus cleanup and dedupe (`clean-corpus-batch`).
- Heuristic dataset risk auditing (`dataset-risk-report`).
- Direct FineWeb parquet -> tokenizer -> shard pipeline (`scripts/fineweb_parquet_to_shards.py`).
- BPE tokenizer workflow with train/save/load + contract fingerprinting.
- Token-window data pipeline (`TokenWindowDataset`) for next-token training pairs.
- ZIM archive text extraction (`extract-zim-text`) for server-hosted `.zim` files.
  - Automatically falls back to suggestion-index paths if fulltext search returns no matches.
- Corpus sharding (`shard-corpus`) into train/val token shard binaries + manifest.
- Batch corpus sharding (`shard-corpus-batch`) with one shared tokenizer.
- Baseline GPT training (`train`) with checkpoint save/resume.
  - Default architecture: RoPE + RMSNorm + SwiGLU (`gpt_rope_rmsnorm_swiglu_v1`).
  - Includes AdamW no-decay param groups, warmup/cosine LR, and grad accumulation.
- Checkpoint-based text generation (`generate`) with temperature/top-k sampling.
- Optional safetensors export for deployment (`--export-safetensors`).
- Unit tests for tokenizer round-trips and unknown token behavior.
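The token-window idea behind `TokenWindowDataset` is that each training pair is a context window and the same window shifted by one token. A minimal sketch (the real class's API and windowing stride are project-internal):

```python
def token_windows(tokens, context_length):
    """Yield (input, target) next-token pairs from a flat token list.

    Each target is the input window shifted right by one token, so the model
    learns to predict token t+1 from tokens <= t at every position.
    """
    for start in range(0, len(tokens) - context_length):
        window = tokens[start:start + context_length + 1]
        yield window[:-1], window[1:]
```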
- Expand checkpoint eval suite and track regressions in CI.
- Add tokenizer-aware dataset manifests for long-running incremental FineWeb phases.
- Add larger-context training profiles and memory/throughput benchmarking.
- Add finetuning flows for classification and instruction datasets.
- Internal reference index: `information/README.md`
- Working notes from loaded PDF + external references: `information/raschka-reference-notes.md`
- Implementation checklist from those references: `information/raschka-implementation-checklist.md`
- Sebastian Raschka article: https://magazine.sebastianraschka.com/p/coding-llms-from-the-ground-up
- Raschka repository: https://github.com/rasbt/LLMs-from-scratch
- Local checkout (submodule): `information/external/LLMs-from-scratch`
```bash
git submodule update --init --recursive
git submodule update --remote information/external/LLMs-from-scratch
```
Use the first command after clone; use the second to pull newer upstream reference commits.
Repository wiki pages are maintained from `wiki/*.md`.
Publish updates to the GitHub wiki:
```bash
bash scripts/publish_wiki.sh git@github.com:aditaa/llm.wiki.git
```
Preferred workflow:
- Update `README.md` and `AGENTS.md` as needed.
- Update matching pages in `wiki/`.
- Publish the wiki with `scripts/publish_wiki.sh`.
Dataset inventory and intended use are tracked in `wiki/Dataset-Registry.md`.