This repository provides an evaluation framework for diffusion-based Large Language Models (LLMs) and Code LLMs, including models like Dream, DiffuCoder, and LLaDA. It supports running various diffusion decoding algorithms across standard benchmarks.
The codebase relies on Python and HuggingFace libraries.
We recommend using the `dllmexp` conda environment referenced in the scripts.
```shell
conda create -n dllmexp python=3.10
conda activate dllmexp
```

Install the required packages:

```shell
pip install -r requirements.txt
```

Key dependencies:

```
transformers==4.54.0
lm-eval==0.4.8
numpy>=1.24
tqdm
prettytable
```
Note: Ensure `HF_HOME` and `HF_DATASETS_CACHE` are set if you need to store models in a specific directory (already handled in `scripts/exp.sh`).
You can run a single evaluation job using `eval.py`. Here is an example showing all supported arguments:
```shell
python eval.py \
    # [Required] Model selection: dream, llada, llada1.5, diffucoder
    --model_alias llada \
    # [Required] Task selection: humaneval, mbpp, gsm8k, truthfulqa
    --task humaneval \
    # [Optional] Algorithm override (e.g., maskgit_plus, low_confidence, random)
    --alg low_confidence \
    # [Optional] Decoding parameters
    --tokens_per_step 1 \
    --num_steps 128 \
    --gen_length 512 \
    --block_length 64 \
    --temperature 0.0 \
    --top_p 0.95 \
    --max_new_tokens 512 \
    --remasking low_confidence \
    # [Optional] Job control
    --limit 10 \
    --output_dir results \
    --tag debug_run \
    --dtype bfloat16 \
    --device cuda \
    # [Optional] Model checkpoint overrides (defaults shown)
    --dream_ckpt "Dream-org/Dream-v0-Instruct-7B" \
    --llada_ckpt "GSAI-ML/LLaDA-8B-Instruct" \
    --llada15_ckpt "GSAI-ML/LLaDA-1.5" \
    --diffucoder_ckpt "apple/DiffuCoder-7B-cpGRPO"
```

Note: the `#` comment lines above are annotations for this README; remove them before running, since a comment line breaks the `\` line continuation.

To run the full suite of 48 experiments (all models x all tasks x a sweep over tokens-per-step), use the provided shell script. This script handles GPU scheduling and resumes from checkpoints automatically.
```shell
# Runs inside a tmux session
./scripts/exp.sh
```

Monitor progress with:

```shell
./scripts/progress.sh
```

| Model Alias | Description | Default Checkpoint |
|---|---|---|
| `dream` | Dream-v0-Instruct-7B | `Dream-org/Dream-v0-Instruct-7B` |
| `diffucoder` | DiffuCoder-7B | `apple/DiffuCoder-7B-cpGRPO` |
| `llada` | LLaDA-8B-Instruct | `GSAI-ML/LLaDA-8B-Instruct` |
| `llada1.5` | LLaDA-1.5 | `GSAI-ML/LLaDA-1.5` |
| Task Name | Description |
|---|---|
| `humaneval` | HumanEval Python coding benchmark |
| `mbpp` | MBPP Python coding benchmark |
| `gsm8k` | Grade-school math reasoning |
| `truthfulqa` | TruthfulQA (generation mode) |
- Dream / DiffuCoder: `maskgit_plus`
- LLaDA: `low_confidence`, `random`, `leftright`
- `--tokens_per_step`: Tokens generated per diffusion step (default: 1).
- `--num_steps`: Number of diffusion steps (or auto-calculated for LLaDA block decoding).
- `--temperature`: Sampling temperature.
- `--limit`: Limit the number of samples (useful for debugging).
- `--output_dir`: Directory to save JSON results (default: `results`).
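The relationship between `gen_length`, `block_length`, and `tokens_per_step` for LLaDA-style block decoding can be sketched as follows. This is an illustrative formula matching the parameter descriptions above, not necessarily the exact computation in `eval.py`:

```python
import math

def auto_num_steps(gen_length: int, tokens_per_step: int, block_length: int) -> int:
    """Estimate the total diffusion steps for block decoding (sketch).

    Each block of block_length tokens is decoded in
    ceil(block_length / tokens_per_step) steps, and the sequence
    spans ceil(gen_length / block_length) blocks.
    """
    num_blocks = math.ceil(gen_length / block_length)
    steps_per_block = math.ceil(block_length / tokens_per_step)
    return num_blocks * steps_per_block

# With the example settings above (gen_length=512, tokens_per_step=1,
# block_length=64), this yields 512 steps; raising tokens_per_step to 2
# halves it to 256.
```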
- Job resume: `eval.py` automatically maintains a `results/cache/` directory. If a job crashes, re-running it skips already-generated samples.
- Experiment skip: `scripts/exp.sh` detects whether a final result file already exists and skips that job entirely.
- Explanation: `scripts/exp.sh` runs the 4 models on the 4 benchmarks with `tokens_per_step` 1, 2, and 3 (48 experiments in total) inside a tmux window. Before starting, set `HF_HOME` and `HF_DATASETS_CACHE` correctly for your machine.
To add a new model, modify `eval.py`:

- Defaults: Add a configuration dictionary to `MODEL_DEFAULTS` defining default steps, algorithms, etc.
- Loader: Update the `get_model()` function to handle the new `model_alias` and initialize the appropriate harness.
- Harness: If special generation logic is needed, implement a new harness class in `harness.py` inheriting from `_ProfilingHarness` or `HFLM`.
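The three steps above can be sketched together. The dictionary keys, the harness constructor signature, and the `mymodel` alias are hypothetical illustrations; the real shapes live in `eval.py` and `harness.py`:

```python
# Hypothetical shape of the per-model configuration in eval.py; the real
# keys and values may differ.
MODEL_DEFAULTS = {
    "llada": {"alg": "low_confidence", "num_steps": 128, "block_length": 64},
    # Step 1 (Defaults): entry for the new model being added.
    "mymodel": {"alg": "maskgit_plus", "num_steps": 64, "block_length": 32},
}

class MyModelHarness:
    """Step 3 (Harness): stand-in for a class that would live in harness.py,
    inheriting from _ProfilingHarness or HFLM in the real codebase."""
    def __init__(self, ckpt: str, **cfg):
        self.ckpt = ckpt
        self.cfg = cfg

def get_model(model_alias: str, ckpt: str):
    """Step 2 (Loader): dispatch on model_alias and build the harness."""
    cfg = MODEL_DEFAULTS[model_alias]
    if model_alias == "mymodel":
        return MyModelHarness(ckpt, **cfg)
    raise NotImplementedError(f"loader for {model_alias} not shown in this sketch")
```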
To add a new benchmark, modify `eval.py`:

- Mapping: Update the `TASKS` dictionary to map your CLI task name to the `lm-eval` task name.
- Templates: If the task requires a specific chat template, update `get_prompt_template`.
- Pattern recognition: You may need to modify `utils.py` to adjust the output to the format you want.
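The mapping and template steps can be sketched as follows. The `mytask` name, the `lm-eval` task name, and the template strings are hypothetical; the real `TASKS` entries and template logic are in `eval.py`:

```python
# Hypothetical sketch of registering a new benchmark in eval.py.
TASKS = {
    "humaneval": "humaneval",
    "gsm8k": "gsm8k",
    # New CLI name mapped to the lm-eval task name (illustrative):
    "mytask": "my_lm_eval_task_name",
}

def get_prompt_template(task_name: str) -> str:
    """Return a prompt template for tasks that need one (sketch)."""
    if task_name == "mytask":
        # Task-specific template, filled with .format(prompt=...).
        return "Answer concisely.\n\n{prompt}"
    return "{prompt}"  # default: pass the prompt through unchanged
```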
- This repo is forked from [apd](https://github.com/danielmisrael/apd) and restructured on top of it.
- Acknowledgements to Gemini 3.0 Pro and Codex 5.1.