This repository layer extends TinyRecursiveModels with a SlimOrca-based instruction-tuning workflow. The highlights below capture the decisions we finalized while wiring up end-to-end training and inference.
- Training: `pretrain_instruct.py` (aligned with the Hydra config `config/cfg_pretrain_instruct.yaml`).
- Smoke test: `scripts/pretrain_instruct_smoke.sh` (accepts `SMOKE_*` overrides for local debugging).
- Inference: `inference_instruct.py`, which mirrors the training config surface and reuses the shared `evaluate` loop.
- The SlimOrca JSONL (~1 GB) is Apache-2.0 licensed and hosted on Hugging Face: https://huggingface.co/datasets/Open-Orca/SlimOrca.
- By default the loader downloads the file on demand into `data/slimorca/SlimOrca.jsonl`. Because `data/` is git-ignored, collaborators must fetch it locally before training.
- For the full corpus you can either:
  - let `pretrain_instruct.py` or `inference_instruct.py` run once; the code will download the JSONL automatically if it is missing, or
  - manually fetch via `huggingface-cli`/`curl` (optionally the `.zst` variant), then place the decompressed file at `data/slimorca_full/SlimOrca.jsonl`.
- The LLaMA 32k tokenizer (not distributed here) should live at `tokenizers/llama-32k/tokenizer.model`, or at another path pointed to by `LLAMA_TOKENIZER`.
- `dataset/slimorca.py` downloads (or reuses) `SlimOrca.jsonl`, tokenizes conversations in-memory with the LLaMA-32k SentencePiece model, and now surfaces a `tqdm` progress bar while tokenizing. The bar always shows a true denominator; we removed the previous caching so it recomputes counts per run.
- Assistant supervision is now autoregressive: the loader replaces each assistant token in the input with the preceding token, so the model must predict the actual assistant text rather than copy it. Labels still target the true assistant tokens only.
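The autoregressive relabeling can be sketched as follows. This is a simplified stand-in for the loader's actual logic: the `IGNORE` label convention and the boolean assistant mask are illustrative assumptions.

```python
IGNORE = -100  # label value skipped by the loss (assumed convention)

def shift_assistant_tokens(tokens, is_assistant):
    """Build (inputs, labels) so assistant text must be predicted, not copied.

    For every assistant position i, the input carries token i-1 while the
    label carries the true token i; all other positions are unlabeled.
    """
    inputs, labels = [], []
    for i, (tok, asst) in enumerate(zip(tokens, is_assistant)):
        if asst and i > 0:
            inputs.append(tokens[i - 1])  # model sees the preceding token
            labels.append(tok)            # ...and must predict the real one
        else:
            inputs.append(tok)            # prompt tokens pass through unchanged
            labels.append(IGNORE)         # ...and contribute no loss
    return inputs, labels
```

With tokens `[1, 2, 3, 4]` and the last two positions marked as assistant, the inputs become `[1, 2, 2, 3]` while only positions 2 and 3 carry labels, so the assistant text never appears verbatim in the model's input at a supervised position.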
- Set `dataset.data_dir=data/slimorca_full` and `dataset.subset_size=null` to consume the full 518k-example corpus. For small-scale tests, override `dataset.subset_size`.
- Use `dataloader_workers=0` when training on a single GPU under WSL/Windows to avoid duplicated dataset copies and SSD thrash.
- Always export `LLAMA_TOKENIZER=/path/to/tokenizers/llama-32k/tokenizer.model`.
- Set `DISABLE_COMPILE=1` for now; it keeps PyTorch from wrapping the model in `torch.compile`, matching how the checkpoints were produced.
- The default config keeps ACT recursion active. For faster iterations you can reduce `arch.halt_max_steps` or the cycle counts (`arch.L_cycles`, `arch.H_cycles`) before scaling back up.
- With a 12 GB GPU, `global_batch_size=48` (single process) fits comfortably in memory once `dataloader_workers=0` is applied; the full-epoch runtime estimate is roughly 4.5 hours.
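The `DISABLE_COMPILE` switch can be honored with a guard like the one below; `maybe_compile` is a hypothetical helper name, shown only to illustrate the intent of the environment variable.

```python
import os

def maybe_compile(model):
    """Wrap the model in torch.compile unless DISABLE_COMPILE=1 is set."""
    if os.environ.get("DISABLE_COMPILE") == "1":
        return model  # match checkpoints that were saved uncompiled
    import torch  # imported lazily; only needed when compiling
    return torch.compile(model)
```

Keeping the guard in one place means training and inference stay consistent about whether the checkpointed module is the compiled wrapper or the plain model.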
- `inference_instruct.py` accepts the same dataset overrides as the trainer. The script defaults `DISABLE_COMPILE=1` to load checkpoints saved without compilation and will emit saved tensors (e.g., `preds`) under the chosen `checkpoint_path`.
- If the test split comes up empty (common with tiny subsets), tweak `--subset-size`, `dataset.test_ratio`, or target the train split.
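Why tiny subsets yield an empty test split: if the split size is computed as a fraction of the subset, small example counts round down to zero. The arithmetic below is an assumption about the loader, shown only to illustrate the failure mode.

```python
def split_counts(n_examples: int, test_ratio: float):
    """Return (n_train, n_test) for a fractional split that floors the test size."""
    n_test = int(n_examples * test_ratio)  # floors toward zero for small subsets
    return n_examples - n_test, n_test
```

For example, 8 examples at a 0.1 test ratio produce a test split of size zero, while 100 examples produce the expected 90/10 split.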
- When `global_batch_size` exceeds the number of available examples (e.g., with a tiny subset), the loader errors with "No train batches available." Reduce the batch size or increase the subset.
- Watching throughput: the SlimOrca tokenizer bar should progress immediately; once it completes you will see the standard training `tqdm` bar and W&B metrics. A lack of output usually means tokenization is still running.
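The "No train batches available" failure follows from integer division of the example count by the batch size when partial batches are dropped. A minimal sketch of that check (the helper name is hypothetical; the error message matches the one quoted above):

```python
def num_train_batches(n_examples: int, global_batch_size: int) -> int:
    """Number of full batches; raises when the subset is smaller than one batch."""
    n_batches = n_examples // global_batch_size  # partial batches are dropped
    if n_batches == 0:
        raise ValueError("No train batches available.")
    return n_batches
```

So with a 10-example subset and `global_batch_size=48` the count is zero and the run aborts, while 100 examples at batch size 48 yield two full batches.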