Merged
47 commits

- c541ecc [Add MMLU task]: (arubique, Jan 28, 2026)
- 49c83b3 [Add MMLU task]: (arubique, Jan 28, 2026)
- 945c80a [Add MMLU task]: (arubique, Jan 30, 2026)
- 0890ce0 Update dependencies (arubique, Jan 30, 2026)
- 2e18293 [Add MMLU task]: (arubique, Jan 30, 2026)
- 00bb500 [Add MMLU task]: (arubique, Jan 30, 2026)
- 92b2232 [Add MMLU task]: (arubique, Jan 30, 2026)
- 17b601e [Add MMLU task]: (arubique, Feb 1, 2026)
- 34025e7 [Add MMLU task]: (arubique, Feb 1, 2026)
- 140e0fb [Add MMLU task]: (arubique, Feb 1, 2026)
- 62a6d45 [Add MMLU task]: (arubique, Feb 1, 2026)
- 1ec50e6 [Add MMLU task]: (arubique, Feb 1, 2026)
- 7e3b2b3 [Add MMLU task]: (arubique, Feb 1, 2026)
- 2f97f29 [Add DISCO to HF]: (arubique, Feb 5, 2026)
- 36e7482 [Add to HF]: (arubique, Feb 5, 2026)
- 6c48aa8 Remove unused import (arubique, Feb 5, 2026)
- 8b034ed [Add to HF]: (arubique, Feb 6, 2026)
- c75b9be [Add to HF]: (arubique, Feb 6, 2026)
- beb9cfc [Add to HF]: (arubique, Feb 6, 2026)
- 7037d4e Deprecate --disco_prediction arg (arubique, Feb 6, 2026)
- aa1c3e8 [Add to HF]: (arubique, Feb 6, 2026)
- 6022bd7 [Add to HF]: (arubique, Feb 7, 2026)
- 4573787 Deprecate --use_full_prompt and --trust_remote_code args (arubique, Feb 7, 2026)
- a5981c4 [Add to HF]: (arubique, Feb 7, 2026)
- bdfd81b [Add to HF]: (arubique, Feb 7, 2026)
- b2fbf67 [Add to HF]: (arubique, Feb 7, 2026)
- 829643f [Add to HF]: (arubique, Feb 7, 2026)
- 93c9901 [Create PR]: (arubique, Feb 14, 2026)
- 92da22a [Create PR]: (arubique, Feb 15, 2026)
- ceed87d [Create PR]: (arubique, Feb 15, 2026)
- 03d220c [Create PR]: (arubique, Feb 15, 2026)
- 83c5dfd [Create PR]: (arubique, Feb 15, 2026)
- 7ce1daa [Create PR]: (arubique, Feb 15, 2026)
- 6113adf [Create PR]: (arubique, Feb 15, 2026)
- 4cdd28e [Create PR]: (arubique, Feb 15, 2026)
- 80b8a75 [Create PR]: (arubique, Feb 15, 2026)
- f4a70d4 [Create PR]: (arubique, Feb 15, 2026)
- 956508b [Create PR]: (arubique, Feb 15, 2026)
- 57f763b [Create PR]: (arubique, Feb 15, 2026)
- 8f9119d [Create PR]: (arubique, Feb 15, 2026)
- 961fbf4 [Create PR]: (arubique, Feb 16, 2026)
- 3b122bb [Create PR]: (arubique, Feb 16, 2026)
- 6c3cb22 [Create PR]: (arubique, Feb 16, 2026)
- 8b7343f [Create PR]: (arubique, Feb 16, 2026)
- d945cb2 [Create PR]: (arubique, Feb 16, 2026)
- 27e93a2 [Create PR]: (arubique, Feb 16, 2026)
- 2c705a3 [Create PR]: (arubique, Feb 18, 2026)
4 changes: 4 additions & 0 deletions .gitignore
@@ -5,6 +5,10 @@
.idea/
.DS_Store
.devcontainer/
results/
DISCO-MMLU/
flattened-MMLU/
tmp/

# Byte-compiled / optimized / DLL files
__pycache__/
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -11,6 +11,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

**Benchmarks**

- MMLU Benchmark with DISCO support: Integration for evaluating language models on MMLU (Massive Multitask Language Understanding) multiple-choice questions, compatible with DISCO anchor-point methodology. Includes `MMLUBenchmark`, `HuggingFaceMMLUBenchmark`, `MMLUEnvironment`, `MMLUEvaluator`, `MMLUModelAgent`, `MMLUAgentAdapter`, `AnchorPointsTaskQueue`, `load_tasks()`, and `compute_benchmark_metrics()`. Optional extras: `lm-eval` (for `HuggingFaceMMLUBenchmark.precompute_all_logprobs_lmeval`), `disco` (for DISCO prediction in the example). (PR: #34)

- CONVERSE benchmark for contextual safety evaluation in adversarial agent-to-agent conversations, including `ConverseBenchmark`, `DefaultAgentConverseBenchmark`, `ConverseEnvironment`, `ConverseExternalAgent`, `PrivacyEvaluator`, `SecurityEvaluator`, and `load_tasks()` utilities for `travel`, `real_estate`, and `insurance` domains. Benchmark source files are now downloaded on first use via `ensure_data_exists()` instead of being bundled in the package. (PR: #28)

- GAIA2 Benchmark: Integration with Meta's ARE (Agent Research Environments) platform for evaluating LLM-based agents on dynamic, multi-step scenarios (PR: #26)
@@ -32,6 +34,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

**Examples**

- MMLU benchmark example at `examples/mmlu_benchmark/` for evaluating HuggingFace models on MMLU with optional DISCO prediction (`--disco_model_path`, `--disco_transform_path`). Supports local data, HuggingFace dataset repos, and DISCO weights from .pkl/.npz or HF repos. (PR: #34)
- Added a dedicated runnable CONVERSE default benchmark example at `examples/converse_benchmark/default_converse_benchmark.py` for quick start with `DefaultAgentConverseBenchmark`. (PR: #28)
- Gaia2 benchmark example with Google GenAI and OpenAI model support (PR: #26)

@@ -62,6 +65,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Live API round-trip tests for all model adapters (`-m credentialed`) (PR: #29)
- CI jobs for slow tests (with benchmark data caching) and credentialed tests (behind GitHub Environment approval) (PR: #29)
- Added `respx` dev dependency for HTTP-level mocking (PR: #29)
- pytest marker `mmlu` for tests that require the MMLU benchmark (HuggingFace + DISCO). (PR: #34)

### Changed

7 changes: 7 additions & 0 deletions examples/mmlu_benchmark/.gitignore
@@ -0,0 +1,7 @@
# Results
results/
*.jsonl

# Predictions
predictions/
*.pkl
59 changes: 59 additions & 0 deletions examples/mmlu_benchmark/README.md
@@ -0,0 +1,59 @@
# MMLU Benchmark Example

Evaluate language models on [MMLU (Massive Multitask Language Understanding)](https://arxiv.org/abs/2009.03300) with optional efficient evaluation via [DISCO](https://arxiv.org/abs/2510.07959).

## Run without DISCO (full evaluation)

From the project root:

```bash
uv run python examples/mmlu_benchmark/mmlu_benchmark.py --model_id alignment-handbook/zephyr-7b-sft-full
```

Full evaluation results look like:

```
================================================================================
Results Summary (Evaluated Tasks)
================================================================================
Total tasks: 14042
Correct: 8291
Accuracy (on anchor points): 0.5904
Accuracy norm (on anchor points): 0.5904
```
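The two metrics above follow the usual convention for multiple-choice scoring in the style of lm-evaluation-harness: plain accuracy picks the answer choice with the highest total log-probability, while normalized accuracy divides each choice's log-probability by its length before taking the argmax. A minimal, self-contained sketch (the log-probabilities and choice texts below are made up for illustration):

```python
# Toy illustration of MMLU-style scoring: pick the choice with the highest
# total log-probability (acc) or the highest log-probability per character
# (acc_norm), then compare against the gold choice index.
def score_task(choice_logprobs, choice_texts, gold_idx):
    pred = max(range(len(choice_logprobs)), key=lambda i: choice_logprobs[i])
    pred_norm = max(
        range(len(choice_logprobs)),
        key=lambda i: choice_logprobs[i] / len(choice_texts[i]),
    )
    return pred == gold_idx, pred_norm == gold_idx

# Made-up log-probabilities for one 4-choice question (gold answer is index 1).
logprobs = [-12.3, -8.1, -15.0, -9.7]
choices = ["ribosome", "mitochondrion", "nucleus", "golgi"]
correct, correct_norm = score_task(logprobs, choices, gold_idx=1)
print(correct, correct_norm)  # True True
```

When all choices have similar lengths the two metrics agree, which is why the summary above reports the same value for both.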

## Run with DISCO (predicted full-benchmark score)

From the project root:

```bash
uv run python examples/mmlu_benchmark/mmlu_benchmark.py --model_id alignment-handbook/zephyr-7b-sft-full --disco_model_path arubique/DISCO-MMLU
```

Predicted score output:

```
----------------------------------------
DISCO Predicted Full Benchmark Accuracy:
----------------------------------------
Model 0: 0.606739
```
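To convey the general idea behind anchor-point prediction (this is a simplified illustration of the family of methods DISCO belongs to, not DISCO's actual algorithm): each anchor task stands in for a cluster of similar benchmark tasks, and the full-benchmark score is predicted from correctness on the anchors alone, weighted by cluster size. All numbers below are made up:

```python
# Simplified anchor-point prediction: the predicted full-benchmark accuracy
# is the cluster-size-weighted mean of per-anchor correctness.
def predict_full_accuracy(anchor_correct, cluster_sizes):
    total = sum(cluster_sizes)
    return sum(c * s for c, s in zip(anchor_correct, cluster_sizes)) / total

# Made-up example: 3 anchors representing clusters of 5000, 6000, 3042 tasks.
print(predict_full_accuracy([1.0, 0.0, 1.0], [5000, 6000, 3042]))
```

The payoff is that only the anchor tasks need to be run through the model, while the prediction still reflects the full 14042-task benchmark.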

## Arguments

| Argument | Description | Default |
|----------|-------------|---------|
| `--model_id` | HuggingFace model identifier (e.g. `meta-llama/Llama-2-7b-hf`) | *(required)* |
| `--data_path` | Path to MMLU prompts JSON file or Hugging Face dataset repo id | `arubique/flattened-MMLU` |
| `--anchor_points_path` | Path to anchor points pickle file; if set, only anchor tasks are evaluated | — |
| `--output_dir` | Directory to save results | `./results` |
| `--predictions_path` | Path to save predictions tensor as pickle (for DISCO) | — |
| `--limit` | Limit number of tasks to evaluate (for testing) | — |
| `--batch_size` | Batch size for evaluation (reserved for future use) | `1` |
| `--device` | Device to run model on (e.g. `cuda:0`, `cpu`) | `cuda:0` |
| `--num_workers` | Number of parallel workers for task execution | `1` |
| `--disco_model_path` | If set, run DISCO prediction; path to `.pkl`, `.npz`, or Hugging Face repo id | — |
| `--disco_transform_path` | Path to DISCO PCA transform `.pkl` or `.npz` (for local DISCO model when using `--pca`) | — |
| `--pca` | PCA dimension for DISCO embeddings | — |
| `--pad_to_size` | Pad predictions to this size with -inf | — |
| `--use_lmeval_batching` | Use [lm-evaluation-harness-style](https://github.com/EleutherAI/lm-evaluation-harness) batching for exact numerical match with DISCO repo | off |
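A note on `--pad_to_size`: padding the per-task prediction rows to a fixed width with `-inf` keeps tensor shapes uniform without ever letting a padded slot win the argmax. A hedged sketch of what such a padding step might look like (the function name is illustrative, not the example script's actual code):

```python
import math

def pad_predictions(rows, pad_to_size):
    # Pad each row of per-choice log-probabilities to a fixed width with -inf,
    # so padded slots can never be selected as the highest-scoring choice.
    return [row + [-math.inf] * (pad_to_size - len(row)) for row in rows]

padded = pad_predictions([[-1.0, -2.0], [-0.5, -3.0, -2.5]], pad_to_size=4)
print(padded[0])  # [-1.0, -2.0, -inf, -inf]
```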
5 changes: 5 additions & 0 deletions examples/mmlu_benchmark/__init__.py
@@ -0,0 +1,5 @@
"""MMLU Benchmark Example.

This example demonstrates how to evaluate HuggingFace models on MMLU
using anchor point-based task selection for DISCO prediction.
"""