- Reproduction-oriented implementation of Kalai et al. (2025), centering on the mechanisms behind hallucinations in lightweight open-source LMs (e.g., GPT-2).
- Prioritizes transparency: curated datasets, deterministic decoding, and reproducible evaluation pipelines for IIV vs. generation gaps, singleton-rate effects, calibration metrics, and scoring incentives.
- Designed for hiring managers and research collaborators who want to quickly audit both the code quality and the experimental methodology.
```
why-lms-hallucinate-repro/
├── README.md
├── requirements.txt
├── src/
│   ├── datasets.py
│   ├── lm_iface.py
│   ├── iiv.py
│   ├── generation.py
│   ├── calibration.py
│   ├── scoring.py
│   └── utils.py
├── data/
├── experiments/
│   ├── 01_iiv_vs_generation.ipynb
│   ├── 02_singleton_rate.ipynb
│   └── 03_calibration_and_scoring.ipynb
└── figures/
```
- Create and activate a Python 3.10+ environment.
- Install dependencies with `pip install -r requirements.txt`.
- The first run of the notebooks will download the chosen HF model (default: `gpt2`).
- Every script/notebook will fix RNG seeds and record decoding parameters.
- Raw generations and metrics will be saved as JSONL/CSV under `data/` or `experiments/outputs/` (to be added).
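The seed-fixing and parameter-recording convention above could look like the following minimal sketch. The helper names `fix_seeds` and `record_run` are illustrative, not the repo's actual `src/utils.py` API, and a real run would also seed `numpy` and `torch`:

```python
import json
import random
from pathlib import Path

def fix_seeds(seed: int = 0) -> None:
    """Fix Python's RNG; extend with numpy/torch seeding when installed."""
    random.seed(seed)

def record_run(path: str, decoding_params: dict, seed: int = 0) -> dict:
    """Write the seed and decoding parameters next to the run outputs
    so every JSONL/CSV artifact is traceable to its configuration."""
    record = {"seed": seed, "decoding": decoding_params}
    Path(path).write_text(json.dumps(record, indent=2))
    return record

fix_seeds(0)
meta = record_run(
    "run_meta.json",
    {"model": "gpt2", "do_sample": False, "max_new_tokens": 32},
)
```

Deterministic decoding plus a persisted metadata file is what makes re-running a notebook months later produce byte-identical outputs.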
- `01_iiv_vs_generation.ipynb`: build IIV datasets, compare `err` vs `2 × err_iiv`.
- `02_singleton_rate.ipynb`: sweep birthday singleton rates and track hallucination rate.
- `03_calibration_and_scoring.ipynb`: estimate δ, ECE, and evaluate scoring policies.
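A minimal sketch of the two quantities the first notebook compares, assuming binary validity labels and a per-string validity score from the model; the full reduction in Kalai et al. also involves calibration terms omitted here:

```python
from typing import Iterable

def iiv_error(scores: Iterable[float], labels: Iterable[int],
              threshold: float = 0.5) -> float:
    """Misclassification rate of the Is-It-Valid (IIV) classifier:
    predict 'valid' (label 1) when the model's score exceeds threshold."""
    pairs = list(zip(scores, labels))
    wrong = sum((s > threshold) != bool(y) for s, y in pairs)
    return wrong / len(pairs)

def generative_error(generations_valid: Iterable[bool]) -> float:
    """Fraction of sampled generations that are invalid (hallucinations)."""
    flags = list(generations_valid)
    return sum(not v for v in flags) / len(flags)

# Toy data: a well-separated scorer gives err_iiv = 0,
# while sampling can still produce some invalid strings.
err_iiv = iiv_error([0.9, 0.8, 0.4, 0.2], [1, 1, 0, 0])
err = generative_error([True, False, True, True])
```

The notebook then plots `err` against `2 * err_iiv` across datasets to check how tightly the classification bound tracks generative error.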
- Scatter plot: generative error vs `2 × err_iiv`.
- Line/bar charts: hallucination rate vs singleton rate, calibration metrics, and scoring policy outcomes.
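The singleton rate driving the hallucination chart can be computed with a short helper; this is a sketch of the Good-Turing-style quantity from Kalai et al. (the fraction of training examples whose fact appears exactly once), not the repo's `datasets.py` implementation:

```python
from collections import Counter
from typing import Sequence

def singleton_rate(training_facts: Sequence[str]) -> float:
    """Fraction of training examples whose fact occurs exactly once.
    Facts seen once are exactly the ones a model cannot reliably
    distinguish from plausible-looking fabrications."""
    counts = Counter(training_facts)
    singletons = sum(1 for f in training_facts if counts[f] == 1)
    return singletons / len(training_facts)

# Toy birthday corpus: alice and bob appear once, carol twice.
facts = [
    "alice:1990-01-01",
    "bob:1985-06-15",
    "carol:2000-12-31",
    "carol:2000-12-31",
]
rate = singleton_rate(facts)  # 2 singletons out of 4 examples -> 0.5
```

Sweeping this rate on synthetic birthday data and plotting it against the measured hallucination rate reproduces the singleton-rate trend from the paper.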
- Uses compact models; qualitative trends may shift for frontier LMs.
- Relies on synthetic-data approximations; validating on real corpora is currently out of scope.
- Future work: scale to larger checkpoints, expand reward modeling variants, and integrate human preference evaluation.