Merged
47 commits

- c541ecc [Add MMLU task]: (arubique, Jan 28, 2026)
- 49c83b3 [Add MMLU task]: (arubique, Jan 28, 2026)
- 945c80a [Add MMLU task]: (arubique, Jan 30, 2026)
- 0890ce0 Update dependencies (arubique, Jan 30, 2026)
- 2e18293 [Add MMLU task]: (arubique, Jan 30, 2026)
- 00bb500 [Add MMLU task]: (arubique, Jan 30, 2026)
- 92b2232 [Add MMLU task]: (arubique, Jan 30, 2026)
- 17b601e [Add MMLU task]: (arubique, Feb 1, 2026)
- 34025e7 [Add MMLU task]: (arubique, Feb 1, 2026)
- 140e0fb [Add MMLU task]: (arubique, Feb 1, 2026)
- 62a6d45 [Add MMLU task]: (arubique, Feb 1, 2026)
- 1ec50e6 [Add MMLU task]: (arubique, Feb 1, 2026)
- 7e3b2b3 [Add MMLU task]: (arubique, Feb 1, 2026)
- 2f97f29 [Add DISCO to HF]: (arubique, Feb 5, 2026)
- 36e7482 [Add to HF]: (arubique, Feb 5, 2026)
- 6c48aa8 Remove unused import (arubique, Feb 5, 2026)
- 8b034ed [Add to HF]: (arubique, Feb 6, 2026)
- c75b9be [Add to HF]: (arubique, Feb 6, 2026)
- beb9cfc [Add to HF]: (arubique, Feb 6, 2026)
- 7037d4e Deprecate --disco_prediction arg (arubique, Feb 6, 2026)
- aa1c3e8 [Add to HF]: (arubique, Feb 6, 2026)
- 6022bd7 [Add to HF]: (arubique, Feb 7, 2026)
- 4573787 Deprecate --use_full_prompt and --trust_remote_code args (arubique, Feb 7, 2026)
- a5981c4 [Add to HF]: (arubique, Feb 7, 2026)
- bdfd81b [Add to HF]: (arubique, Feb 7, 2026)
- b2fbf67 [Add to HF]: (arubique, Feb 7, 2026)
- 829643f [Add to HF]: (arubique, Feb 7, 2026)
- 93c9901 [Create PR]: (arubique, Feb 14, 2026)
- 92da22a [Create PR]: (arubique, Feb 15, 2026)
- ceed87d [Create PR]: (arubique, Feb 15, 2026)
- 03d220c [Create PR]: (arubique, Feb 15, 2026)
- 83c5dfd [Create PR]: (arubique, Feb 15, 2026)
- 7ce1daa [Create PR]: (arubique, Feb 15, 2026)
- 6113adf [Create PR]: (arubique, Feb 15, 2026)
- 4cdd28e [Create PR]: (arubique, Feb 15, 2026)
- 80b8a75 [Create PR]: (arubique, Feb 15, 2026)
- f4a70d4 [Create PR]: (arubique, Feb 15, 2026)
- 956508b [Create PR]: (arubique, Feb 15, 2026)
- 57f763b [Create PR]: (arubique, Feb 15, 2026)
- 8f9119d [Create PR]: (arubique, Feb 15, 2026)
- 961fbf4 [Create PR]: (arubique, Feb 16, 2026)
- 3b122bb [Create PR]: (arubique, Feb 16, 2026)
- 6c3cb22 [Create PR]: (arubique, Feb 16, 2026)
- 8b7343f [Create PR]: (arubique, Feb 16, 2026)
- d945cb2 [Create PR]: (arubique, Feb 16, 2026)
- 27e93a2 [Create PR]: (arubique, Feb 16, 2026)
- 2c705a3 [Create PR]: (arubique, Feb 18, 2026)
4 changes: 4 additions & 0 deletions .gitignore
@@ -5,6 +5,10 @@
.idea/
.DS_Store
.devcontainer/
results/
DISCO-MMLU/
flattened-MMLU/
tmp/

# Byte-compiled / optimized / DLL files
__pycache__/
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -11,6 +11,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

**Benchmarks**

- MMLU Benchmark with DISCO support: Integration for evaluating language models on MMLU (Massive Multitask Language Understanding) multiple-choice questions, compatible with DISCO anchor-point methodology. Includes `MMLUBenchmark`, `HuggingFaceMMLUBenchmark`, `MMLUEnvironment`, `MMLUEvaluator`, `MMLUModelAgent`, `MMLUAgentAdapter`, `AnchorPointsTaskQueue`, `load_tasks()`, and `compute_benchmark_metrics()`. Optional extras: `lm-eval` (for `HuggingFaceMMLUBenchmark.precompute_all_logprobs_lmeval`), `disco` (for DISCO prediction in the example). (PR: #34)

- CONVERSE benchmark for contextual safety evaluation in adversarial agent-to-agent conversations, including `ConverseBenchmark`, `DefaultAgentConverseBenchmark`, `ConverseEnvironment`, `ConverseExternalAgent`, `PrivacyEvaluator`, `SecurityEvaluator`, and `load_tasks()` utilities for `travel`, `real_estate`, and `insurance` domains. Benchmark source files are now downloaded on first use via `ensure_data_exists()` instead of being bundled in the package. (PR: #28)

- GAIA2 Benchmark: Integration with Meta's ARE (Agent Research Environments) platform for evaluating LLM-based agents on dynamic, multi-step scenarios (PR: #26)
@@ -32,6 +34,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

**Examples**

- MMLU benchmark example at `examples/mmlu_benchmark/` for evaluating HuggingFace models on MMLU with optional DISCO prediction (`--disco_model_path`, `--disco_transform_path`). Supports local data, HuggingFace dataset repos, and DISCO weights from .pkl/.npz or HF repos. (PR: #34)
- Added a dedicated runnable CONVERSE default benchmark example at `examples/converse_benchmark/default_converse_benchmark.py` for quick start with `DefaultAgentConverseBenchmark`. (PR: #28)
- Gaia2 benchmark example with Google GenAI and OpenAI model support (PR: #26)

@@ -62,6 +65,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Live API round-trip tests for all model adapters (`-m credentialed`) (PR: #29)
- CI jobs for slow tests (with benchmark data caching) and credentialed tests (behind GitHub Environment approval) (PR: #29)
- Added `respx` dev dependency for HTTP-level mocking (PR: #29)
- pytest marker `mmlu` for tests that require the MMLU benchmark (HuggingFace + DISCO). (PR: #34)

### Changed

7 changes: 7 additions & 0 deletions examples/mmlu_benchmark/.gitignore
@@ -0,0 +1,7 @@
# Results
results/
*.jsonl

# Predictions
predictions/
*.pkl
59 changes: 59 additions & 0 deletions examples/mmlu_benchmark/README.md
@@ -0,0 +1,59 @@
# MMLU Benchmark Example

Evaluate language models on [MMLU (Massive Multitask Language Understanding)](https://arxiv.org/abs/2009.03300) with optional efficient evaluation via [DISCO](https://arxiv.org/abs/2510.07959).

## Run without DISCO (full evaluation)

From the project root:

```bash
uv run python examples/mmlu_benchmark/mmlu_benchmark.py --model_id alignment-handbook/zephyr-7b-sft-full
```

Full evaluation results look like:

```
================================================================================
Results Summary (Evaluated Tasks)
================================================================================
Total tasks: 14042
Correct: 8291
Accuracy (on anchor points): 0.5904
Accuracy norm (on anchor points): 0.5904
```
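The two metrics above follow the usual convention for multiple-choice scoring in the style of lm-evaluation-harness: plain accuracy picks the answer choice with the highest total log-probability, while normalized accuracy divides each choice's log-probability by its length before taking the argmax. A minimal, self-contained sketch (the log-probabilities and choice texts below are made up for illustration):

```python
# Toy illustration of MMLU-style scoring: pick the choice with the highest
# total log-probability (acc) or the highest log-probability per character
# (acc_norm), then compare against the gold choice index.
def score_task(choice_logprobs, choice_texts, gold_idx):
    pred = max(range(len(choice_logprobs)), key=lambda i: choice_logprobs[i])
    pred_norm = max(
        range(len(choice_logprobs)),
        key=lambda i: choice_logprobs[i] / len(choice_texts[i]),
    )
    return pred == gold_idx, pred_norm == gold_idx

# Made-up log-probabilities for one 4-choice question (gold answer is index 1).
logprobs = [-12.3, -8.1, -15.0, -9.7]
choices = ["ribosome", "mitochondrion", "nucleus", "golgi"]
correct, correct_norm = score_task(logprobs, choices, gold_idx=1)
print(correct, correct_norm)  # True True
```

When all choices have similar lengths the two metrics agree, which is why the summary above reports the same value for both.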

## Run with DISCO (predicted full-benchmark score)

From the project root:

```bash
uv run python examples/mmlu_benchmark/mmlu_benchmark.py --model_id alignment-handbook/zephyr-7b-sft-full --disco_model_path arubique/DISCO-MMLU
```

Predicted score output:

```
----------------------------------------
DISCO Predicted Full Benchmark Accuracy:
----------------------------------------
Model 0: 0.606739
```
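To convey the general idea behind anchor-point prediction (this is a simplified illustration of the family of methods DISCO belongs to, not DISCO's actual algorithm): each anchor task stands in for a cluster of similar benchmark tasks, and the full-benchmark score is predicted from correctness on the anchors alone, weighted by cluster size. All numbers below are made up:

```python
# Simplified anchor-point prediction: the predicted full-benchmark accuracy
# is the cluster-size-weighted mean of per-anchor correctness.
def predict_full_accuracy(anchor_correct, cluster_sizes):
    total = sum(cluster_sizes)
    return sum(c * s for c, s in zip(anchor_correct, cluster_sizes)) / total

# Made-up example: 3 anchors representing clusters of 5000, 6000, 3042 tasks.
print(predict_full_accuracy([1.0, 0.0, 1.0], [5000, 6000, 3042]))
```

The payoff is that only the anchor tasks need to be run through the model, while the prediction still reflects the full 14042-task benchmark.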

## Arguments

| Argument | Description | Default |
|----------|-------------|---------|
| `--model_id` | HuggingFace model identifier (e.g. `meta-llama/Llama-2-7b-hf`) | *(required)* |
| `--data_path` | Path to MMLU prompts JSON file or Hugging Face dataset repo id | `arubique/flattened-MMLU` |
| `--anchor_points_path` | Path to anchor points pickle file; if set, only anchor tasks are evaluated | — |
| `--output_dir` | Directory to save results | `./results` |
| `--predictions_path` | Path to save predictions tensor as pickle (for DISCO) | — |
| `--limit` | Limit number of tasks to evaluate (for testing) | — |
| `--batch_size` | Batch size for evaluation (reserved for future use) | `1` |
| `--device` | Device to run model on (e.g. `cuda:0`, `cpu`) | `cuda:0` |
| `--num_workers` | Number of parallel workers for task execution | `1` |
| `--disco_model_path` | If set, run DISCO prediction; path to `.pkl`, `.npz`, or Hugging Face repo id | — |
| `--disco_transform_path` | Path to DISCO PCA transform `.pkl` or `.npz` (for local DISCO model when using `--pca`) | — |
| `--pca` | PCA dimension for DISCO embeddings | — |
| `--pad_to_size` | Pad predictions to this size with -inf | — |
| `--use_lmeval_batching` | Use [lm-evaluation-harness-style](https://github.com/EleutherAI/lm-evaluation-harness) batching for exact numerical match with DISCO repo | off |
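A note on `--pad_to_size`: padding the per-task prediction rows to a fixed width with `-inf` keeps tensor shapes uniform without ever letting a padded slot win the argmax. A hedged sketch of what such a padding step might look like (the function name is illustrative, not the example script's actual code):

```python
import math

def pad_predictions(rows, pad_to_size):
    # Pad each row of per-choice log-probabilities to a fixed width with -inf,
    # so padded slots can never be selected as the highest-scoring choice.
    return [row + [-math.inf] * (pad_to_size - len(row)) for row in rows]

padded = pad_predictions([[-1.0, -2.0], [-0.5, -3.0, -2.5]], pad_to_size=4)
print(padded[0])  # [-1.0, -2.0, -inf, -inf]
```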
5 changes: 5 additions & 0 deletions examples/mmlu_benchmark/__init__.py
@@ -0,0 +1,5 @@
"""MMLU Benchmark Example.

This example demonstrates how to evaluate HuggingFace models on MMLU
using anchor point-based task selection for DISCO prediction.
"""