Merged
Commits
36 commits
a7f406e
initial fix for MAB
cemde Feb 17, 2026
18923c6
fixed multiagentbench
cemde Feb 18, 2026
510dddd
fixed more MAB issues
cemde Feb 18, 2026
d4f83d4
added tests for MAB
cemde Feb 18, 2026
e67b5e3
fixed tau2 fidelity
cemde Feb 18, 2026
5ecc771
added more features
cemde Feb 18, 2026
73a04ae
tau2 refactor user and a few bug fixes
cemde Feb 18, 2026
fc9fbdf
added fixes to multiagentbench
cemde Feb 18, 2026
4271019
fixes to marble dataloading and minecraft domain
cemde Feb 18, 2026
506a2e4
added fixes for tau2
cemde Feb 18, 2026
37abacb
added tests for tau2
cemde Feb 18, 2026
17f3d7c
fixed example and google genai adapter
cemde Feb 24, 2026
b783c70
fixed test
cemde Feb 24, 2026
9f67b0b
tau2 fixes
cemde Feb 24, 2026
2596cd2
fixed macs generic tool issue
cemde Feb 26, 2026
46db513
updated usersimulator to be more robust in decoding json
cemde Feb 26, 2026
95c937b
fixed bugs in tau2 and multiagentbench
cemde Feb 27, 2026
c4d085e
fixed bug where user LLM simulator returned empty response
cemde Feb 27, 2026
9f6e3a6
added maxiter to llamaindex
cemde Feb 27, 2026
d0b2b7e
fixed typing issues in tau2
cemde Feb 27, 2026
4972469
small fix to user error message
cemde Feb 27, 2026
76d2150
removed default attacker model for converse
cemde Mar 3, 2026
a914f55
noted issues with macs metrics
cemde Mar 3, 2026
c5413f6
added warning about multiagentbench not supporting communication eval
cemde Mar 3, 2026
10f8931
fixed PyPI install packaging
cemde Mar 3, 2026
867533c
fixed incorrect exception handling in smolagents adapter
cemde Mar 3, 2026
b503330
fixed multiagentbench eval issue
cemde Mar 3, 2026
54aeb79
fixed test
cemde Mar 3, 2026
e2f1a63
updated docstrings for clearer attribution to original work
cemde Mar 4, 2026
7fdcec0
formatting and doc hygiene
cemde Mar 4, 2026
cb54456
fixed docstrings
cemde Mar 4, 2026
210ebaf
updated changelog
cemde Mar 4, 2026
c86d576
Merge branch 'main' into fix-benchmark-implementations
cemde Mar 4, 2026
e3cf3d0
cleaned changelog
cemde Mar 4, 2026
b61f9d6
fixed bug in testing of multi-agent bench
cemde Mar 4, 2026
2efca71
fixed bug in multiagentbench loading
cemde Mar 4, 2026
18 changes: 14 additions & 4 deletions BENCHMARKS.md
@@ -15,22 +15,27 @@ This benchmark is designed to test and evaluate the collaborative problem-solvin

---

## 2. $\tau^2$-bench
## 2. $\tau^2$-bench (Beta)

$\tau^2$-bench is a benchmark for evaluating agentic systems in realistic, multi-turn interactive environments.

> **Beta:** This benchmark has been implemented carefully, but it is highly complex and we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome!

### Source and License

- **Original Repository:** [https://github.com/sierra-research/tau2-bench](https://github.com/sierra-research/tau2-bench)
- **Paper:** [Tau-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains](https://arxiv.org/abs/2406.12045)
- **Code License:** MIT
- **Data License:** MIT

---

## 3. MultiAgentBench (MARBLE)
## 3. MultiAgentBench (MARBLE) (Beta)

MultiAgentBench is a comprehensive benchmark suite for evaluating multi-agent collaboration and competition in LLM-based systems. It includes diverse scenarios across multiple domains including research collaboration, negotiation, coding tasks, and more.

> **Beta:** This benchmark has been implemented carefully, but it is highly complex and we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome!

### Source and License

- **Original Repository:** [https://github.com/ulab-uiuc/MARBLE](https://github.com/ulab-uiuc/MARBLE) (where the original work was done)
@@ -43,23 +48,28 @@ MultiAgentBench is a comprehensive benchmark suite for evaluating multi-agent co

---

## 4. GAIA2
## 4. GAIA2 (Beta)

Gaia2 is a benchmark for evaluating LLM-based agents on dynamic, multi-step scenarios using Meta's ARE (Agent Research Environments) platform. It tests agents across 7 capability dimensions: execution, search, adaptability, time, ambiguity, agent2agent, and noise.

> **Beta:** This benchmark has been implemented carefully, but it is highly complex and we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome!

### Source and License

- **Original Repository:** [https://github.com/facebookresearch/meta-agents-research-environments](https://github.com/facebookresearch/meta-agents-research-environments)
- **Paper:** [Gaia2: Benchmarking LLM Agents on Dynamic and Asynchronous Environments](https://openreview.net/forum?id=9gw03JpKK4) (ICLR 2026)
- **Dataset:** [https://huggingface.co/datasets/meta-agents-research-environments/gaia2](https://huggingface.co/datasets/meta-agents-research-environments/gaia2)
- **Code License:** MIT
- **Data License:** Subject to Meta's data usage terms (see HuggingFace dataset page)

---

## 5. CONVERSE
## 5. CONVERSE (Beta)

CONVERSE evaluates contextual safety in agent-to-agent conversations. It focuses on adversarial interactions where an external service-provider agent attempts privacy extraction or unauthorized action induction over multiple turns.

> **Beta:** This benchmark has been implemented carefully, but it is highly complex and we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome!

### Source and License

- **Original Repository:** [https://github.com/amrgomaaelhady/ConVerse](https://github.com/amrgomaaelhady/ConVerse)
35 changes: 32 additions & 3 deletions CHANGELOG.md
@@ -43,6 +43,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Added `seed` parameter to `ModelAdapter.__init__` for deterministic model inference (PR: #24)
- Added `SeedingError` exception for providers that don't support seeding (Anthropic models raise this if seed is provided) (PR: #24)
- Added seed support to interface adapters: `OpenAIModelAdapter`, `GoogleGenAIModelAdapter`, `LiteLLMModelAdapter`, `HuggingFaceModelAdapter` pass seeds to underlying APIs (PR: #24)
- Added `UserExhaustedError` exception in `maseval.core.exceptions` for flow control when a user's turns are exhausted (PR: #39)
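
As a quick illustration of the intended flow control (a minimal sketch under assumed usage; the `user` object and `agent_message` variable are placeholders, not the exact MASEval API):

```python
# Minimal sketch (assumed usage, not the exact MASEval API): catch
# UserExhaustedError to end an episode cleanly once the user has no turns left.
from maseval.core.exceptions import UserExhaustedError

def next_user_turn(user, agent_message):
    try:
        return user.respond(agent_message)  # raises once turns are exhausted
    except UserExhaustedError:
        return None  # tell the orchestration loop to stop the conversation
```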

**Interface**

@@ -69,6 +70,15 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

- Simplified seeding API: `seed_generator` parameter in setup methods is now always non-None (`SeedGenerator` instead of `Optional[SeedGenerator]`). When seeding is disabled (`seed=None`), `derive_seed()` returns `None` instead of raising an error. This eliminates all `if seed_generator is not None:` conditional checks - the same code path works whether seeding is enabled or disabled. (PR: #27)
- Clarified benchmark/evaluator component guidance in docstrings and docs, including recommended evaluator exception behavior with `fail_on_evaluation_error`. (PR: #28)
- `User.respond()` now raises `UserExhaustedError` instead of returning an empty string when the user has no more turns. Set the new `exhausted_response` parameter to return a configurable message instead (e.g. for tool-based integrations where agents call `ask_user`). Affects `LLMUser`, `AgenticLLMUser`, `Tau2User`, and `MACSUser`. (PR: #39)
- `_extract_json_object()` helper in `maseval.core.simulator` replaces brittle markdown-fence stripping with robust outermost-brace extraction for all LLM simulator JSON parsing (`ToolLLMSimulator`, `UserLLMSimulator`, `AgenticUserLLMSimulator`). (PR: #39)
- `UserLLMSimulator` and `AgenticUserLLMSimulator` now preserve stop tokens that appear outside the JSON object in raw LLM output, so `User._check_stop_token` can detect them. (PR: #39)
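
The two entries above boil down to a small parsing idea. The following self-contained sketch (not the actual `_extract_json_object()` implementation) pulls the first balanced `{...}` object out of raw LLM output and returns the leftover text, which is where stop tokens outside the JSON survive:

```python
import json

def extract_json_object(raw: str) -> tuple[dict, str]:
    """Sketch of outermost-brace extraction; details of the real helper may differ.

    Returns the first balanced JSON object plus the remaining text, so stop
    tokens that appear outside the object are preserved for later checks.
    Note: this simple counter does not special-case braces inside JSON strings.
    """
    start = raw.find("{")
    if start == -1:
        raise ValueError("no JSON object found in simulator output")
    depth = 0
    for i, ch in enumerate(raw[start:], start):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                obj = json.loads(raw[start : i + 1])
                leftover = raw[:start] + raw[i + 1 :]
                return obj, leftover
    raise ValueError("unbalanced braces in simulator output")
```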

**Interface**

- `LlamaIndexAgentAdapter`: Added `max_iterations` constructor parameter, forwarded to `AgentWorkflow.run()`. Fixes silent swallowing of `max_steps` by `FunctionAgent.__init__`. (PR: #39)
- `SmolAgentAdapter`: New `_determine_step_status()` detects crashed steps where `AgentGenerationError` was raised before `step.error` was set, preventing false "success" status on empty steps. (PR: #39)
- `GoogleGenAIModelAdapter`: Consecutive tool-response messages are now merged into a single `contents` entry, fixing Google API errors when multiple tool results are returned in one turn. (PR: #39)
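
For the `GoogleGenAIModelAdapter` change, the merging idea looks roughly like the sketch below; the message dicts and the `"tool"` role here are assumptions for illustration, not the adapter's actual internal representation:

```python
def merge_consecutive_tool_messages(messages: list[dict]) -> list[dict]:
    """Collapse runs of consecutive tool-response messages into one entry,
    so one agent turn with several tool results maps to a single contents item.
    Illustrative only; field names are assumptions."""
    merged: list[dict] = []
    for msg in messages:
        if merged and msg["role"] == "tool" and merged[-1]["role"] == "tool":
            merged[-1]["parts"] = merged[-1]["parts"] + msg["parts"]
        else:
            merged.append({"role": msg["role"], "parts": list(msg["parts"])})
    return merged
```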

**Benchmarks**

@@ -89,17 +99,36 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- `LangGraphUser` → `LangGraphLLMUser`
- `LlamaIndexUser` → `LlamaIndexLLMUser`

**Documentation**

- All benchmarks except MACS are now labeled as **Beta** in docs, BENCHMARKS.md, and benchmark index, with a warning that results have not yet been validated against original implementations. (PR: #39)

**Testing**

- Coverage script (`scripts/coverage_by_feature.py`) now accepts `--exclude` flag to skip additional markers; always excludes `credentialed` and `smoke` by default (PR: #29)

### Fixed

**Core**

- `ResultLogger._filter_report()` now includes `status` and `error` fields in persisted results, so saved logs can distinguish successful runs from infrastructure failures. Report schema is now consistent across success and failure paths (`error` is always present, `None` on success). (PR: #38)
- GAIA2: Various fixes for faithful reproduction of ARE reference results — scenario lifecycle, data loading, evaluation flow, multi-turn notification handling, tool filtering, default agent fidelity, and simulation time management (PR: #30)
- MultiAgentBench: Corrected domain mappings, added missing werewolf/minecraft support, fixed environment constructors, added result summarization matching MARBLE's evaluation pipeline (PR: #30)
- Packaging: Fixed `setuptools` configuration — `packages` now uses `find` with `include = ["maseval*"]` so subpackages and package data (`.json`, `.jsonl`, `.md`, etc.) are included in PyPI installs. (PR: #39)
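
A rough `setup.py`-style equivalent of that configuration is shown below; the project itself may well configure this through `pyproject.toml`, and the `package_data` patterns are illustrative assumptions:

```python
from setuptools import setup, find_packages

setup(
    name="maseval",
    packages=find_packages(include=["maseval*"]),              # pick up all subpackages
    package_data={"maseval": ["*.json", "*.jsonl", "*.md"]},   # assumed patterns
)
```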

**Benchmarks**

- Tau2: Fixed telecom domain schema to match tau2-bench, added agent/user state synchronization and deterministic network simulation, fixed initialization flow and tool result serialization (PR: #30)
- Fixed incorrect return type annotations on `DB.load()` and `DB.copy_deep()` in Tau2 benchmark — now use `Self` instead of `"DB"`, so subclass methods return the correct type (PR: #29)
- Tau2: Added initial agent greeting ("Hi! How can I help you today?") to user simulator's message history, matching the original tau2-bench orchestrator. Fixed tool call counter accumulating across agent turns instead of resetting per turn. Corrected `max_steps` comments (original default is 100, not 200). Documented all known architectural divergences from original tau2-bench in PROVENANCE.md. (PR: #39)
- Tau2: Various bugfixes including user tool routing, environment state synchronization, tool result serialization, telecom domain user models/tools, evaluator assertion logic, and the addition of the `addict` dependency for nested dict access. (PR: #39)
- Tau2: Fixed incorrect return type annotations on `DB.load()` and `DB.copy_deep()` — now use `Self` instead of `"DB"`, so subclass methods return the correct type (PR: #29)
- MultiAgentBench: Fixed bargaining evaluation to use both buyer and seller LLM evaluation prompts, matching the MARBLE paper's methodology. Previously only the seller prompt was used (mirroring a regression in the MARBLE codebase), causing buyer scores to always default to -1 and completion checks to always fail. Now reports `buyer_score`, `seller_score`, and `mean_score` scaled to 0-100. (PR: #39)
- MultiAgentBench: Corrected domain mappings, added missing werewolf/minecraft support, fixed environment constructors, added result summarization matching MARBLE's evaluation pipeline (PR: #30)
- MultiAgentBench: `MarbleMultiAgentBenchBenchmark` now implements MARBLE's multi-iteration coordination loop with all 4 modes (graph, star, chain, tree) instead of executing agents only once. Fixed default `coordinate_mode` from `"star"` to `"graph"` matching 1215/1226 MARBLE configs. Uses per-task `max_iterations` from task config (matching `engine.py:97`), respects per-agent LLM overrides, and initializes memory type from task config. (PR: #39)
- MultiAgentBench: Faithfulness audit fixes for reproduction mode — fixed wrong import path (`marble.utils.utils` → `marble.llms.model_prompting`), added Minecraft agent registration, per-domain defaults for `max_iterations`/`coordinate_mode`/`environment.type`/`memory.type` from MARBLE YAML configs, resolved hardcoded relative paths for `score.json` and `workspace/solution.py` via `_MARBLE_ROOT`, unified `coordinate_mode` defaults, corrected evaluator and agent model defaults to match MARBLE, replaced auto-generated agent IDs with strict validation. (PR: #39)
- MultiAgentBench: Fixed bargaining evaluation crash from `.format()` on single-brace JSON in evaluator prompts. Documented chain communication assertion bug in MARBLE's `engine.py`. (PR: #39)
- GAIA2: Various fixes for faithful reproduction of ARE reference results — scenario lifecycle, data loading, evaluation flow, multi-turn notification handling, tool filtering, default agent fidelity, and simulation time management (PR: #30)
- MACS: `MACSGenericTool._schema_to_inputs()` now preserves the `items` sub-schema for array-type properties, fixing tool registration with Gemini and OpenAI providers (see the sketch after this list). (PR: #39)
- MACS: Simplified `MACSUser._extract_user_profile()` — no longer attempts brittle parsing of scenario text; points profile section at the scenario to avoid duplication. (PR: #39)
- Converse: Removed silent `"gpt-4o"` default for `attacker_model_id`; now raises `ValueError` if not provided, preventing accidental benchmark misconfiguration. (PR: #39)
- ConVerse: Various fixes for faithful reproduction of the original. (PR: #32)
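
To illustrate why the MACS `items` fix noted earlier in this list matters, here is a generic sketch of converting JSON-schema properties into tool inputs while keeping the `items` sub-schema; the function and field names are assumptions, not MACS internals:

```python
def schema_to_inputs(properties: dict) -> dict:
    """Generic illustration (not the actual MACSGenericTool code)."""
    inputs = {}
    for name, prop in properties.items():
        entry = {"type": prop.get("type", "string"),
                 "description": prop.get("description", "")}
        if prop.get("type") == "array" and "items" in prop:
            # Providers such as Gemini and OpenAI reject array parameters that
            # do not declare their item type, so the sub-schema must be kept.
            entry["items"] = prop["items"]
        inputs[name] = entry
    return inputs

# Example: an array-typed property keeps its item type after conversion.
schema_to_inputs({"tags": {"type": "array", "items": {"type": "string"},
                           "description": "Labels to attach"}})
```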

### Removed
5 changes: 4 additions & 1 deletion docs/benchmark/converse.md
@@ -1,4 +1,7 @@
# CONVERSE Benchmark
# CONVERSE Benchmark (Beta)

!!! warning "Beta"
This benchmark has been implemented carefully, but it is highly complex and we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome!

CONVERSE evaluates privacy and security robustness in agent-to-agent conversations where the external counterpart is adversarial.

39 changes: 16 additions & 23 deletions docs/benchmark/gaia2.md
@@ -1,10 +1,13 @@
# Gaia2: Dynamic Multi-Step Scenario Benchmark
# GAIA2: Dynamic Multi-Step Scenario Benchmark (Beta)

The **Gaia2 Benchmark** evaluates LLM-based agents on dynamic, multi-step scenarios using Meta's ARE (Agent Research Environments) platform. It tests agents across multiple capability dimensions in a simulated mobile environment.
!!! warning "Beta"
This benchmark has been implemented carefully, but it is highly complex and we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome!

The **GAIA2 Benchmark** evaluates LLM-based agents on dynamic, multi-step scenarios using Meta's ARE (Agent Research Environments) platform. It tests agents across multiple capability dimensions in a simulated mobile environment.

## Overview

[Gaia2](https://huggingface.co/datasets/meta-agents-research-environments/gaia2) is designed to evaluate agents in realistic, time-sensitive scenarios. The benchmark features:
[GAIA2](https://huggingface.co/datasets/meta-agents-research-environments/gaia2) is designed to evaluate agents in realistic, time-sensitive scenarios. The benchmark features:

- **ARE simulation environment** with real-time dynamics and event scheduling
- **Tool-based time control** via `wait_for_notification()` for temporal reasoning
@@ -18,7 +21,7 @@ Check out the [BENCHMARKS.md](https://github.com/parameterlab/MASEval/blob/main/

## Installation

Gaia2 requires additional dependencies:
GAIA2 requires additional dependencies:

```bash
pip install maseval[gaia2]
@@ -88,15 +91,15 @@ results = benchmark.run(tasks)

## Capabilities

Gaia2 tasks are organized by capability dimension:
GAIA2 tasks are organized by capability dimension:

| Capability | Description |
| -------------- | ------------------------------------------------ |
| `execution` | Basic task execution |
| `search` | Information retrieval tasks |
| `adaptability` | Adapting to changing requirements |
| `time` | Temporal reasoning tasks |
| `ambiguity` | Handling ambiguous instructions |
| Capability | Description |
| -------------- | --------------------------------- |
| `execution` | Basic task execution |
| `search` | Information retrieval tasks |
| `adaptability` | Adapting to changing requirements |
| `time` | Temporal reasoning tasks |
| `ambiguity` | Handling ambiguous instructions |

Load specific capabilities:

@@ -110,7 +113,7 @@ tasks = load_tasks(limit=50)

## Multi-Turn Notification Loop

GAIA2 uses an **event-driven** multi-turn architecture, not user-turn interaction. Unlike Tau2 (where a user simulator drives multi-turn), GAIA2 scenarios have scheduled events (e.g., "calendar events added at t=240s", "friend replies at t=300s") that the agent must wait for and react to.
GAIA2 uses an **event-driven** multi-turn architecture. Scenarios have scheduled events (e.g., "calendar events added at t=240s", "friend replies at t=300s") that the agent must wait for and react to.

The benchmark invokes the agent **once**. The agent handles multi-turn internally via the notification loop:

@@ -147,16 +150,6 @@ if has_stop:

See `DefaultGaia2Agent` source for the canonical single-loop implementation.
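
The shape of that loop is roughly the following; this is a schematic sketch under assumed helper names (`llm_step`, `wait_for_notification`, notification dicts with a `type` field), not the `DefaultGaia2Agent` code:

```python
def agent_notification_loop(llm_step, wait_for_notification, task, max_turns=20):
    """Schematic only: act, wait for scheduled events, react, stop on a stop event."""
    context = [task]
    for _ in range(max_turns):
        context.append(llm_step(context))        # one agent turn over the ARE tools
        notifications = wait_for_notification()  # blocks until scheduled events fire
        has_stop = any(n.get("type") == "stop" for n in notifications)
        if has_stop:                             # environment signalled completion
            break
        context.extend(notifications)            # react to the new events
```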

## Key Differences from Tau2

| Aspect | Gaia2 | Tau2 |
| ---------------- | ---------------------------------------- | --------------------------------- |
| Interaction | Event-driven simulation | Turn-based user simulation |
| Time Control | Agent calls `wait_for_notification()` | Fixed turns |
| Tools | ARE app tools (12 apps) | Domain-specific tools (3 domains) |
| Evaluation | Event DAG comparison | Database state comparison |
| User Simulator | None (events are scheduled) | LLM-based customer simulator |

## API Reference

::: maseval.benchmark.gaia2.Gaia2Benchmark
5 changes: 5 additions & 0 deletions docs/benchmark/index.md
@@ -2,6 +2,11 @@

MASEval includes pre-implemented benchmarks for evaluating multi-agent systems.

!!! warning "Beta Benchmarks"
Several benchmarks are currently in **Beta**. They have been implemented carefully, but these are highly complex systems and we have not yet validated the results against the original implementations. Use with caution when comparing with existing results or original paper numbers. Contributions and compute donations welcome!

**MACS** is the only benchmark that has been fully validated.

## Adding Custom Benchmarks

You can also create your own benchmarks by subclassing the [`Benchmark`](../reference/benchmark.md) class. See the [Five-a-Day example](../examples/five_a_day_benchmark.ipynb) for a complete walkthrough.
5 changes: 4 additions & 1 deletion docs/benchmark/multiagentbench.md
@@ -1,4 +1,7 @@
# MultiAgentBench: Multi-Agent Collaboration Benchmark
# MultiAgentBench: Multi-Agent Collaboration Benchmark (Beta)

!!! warning "Beta"
This benchmark has been implemented carefully, but it is highly complex and we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome!

The **MultiAgentBench** benchmark evaluates multi-agent collaboration and competition in LLM-based systems across diverse scenarios including research, negotiation, coding, and more.

5 changes: 4 additions & 1 deletion docs/benchmark/tau2.md
@@ -1,4 +1,7 @@
# Tau2: Tool-Agent-User Interaction Benchmark
# Tau2: Tool-Agent-User Interaction Benchmark (Beta)

!!! warning "Beta"
This benchmark has been implemented carefully, but it is highly complex and we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome!

The **Tau2 Benchmark** evaluates LLM-based agents on customer service tasks across multiple real-world domains, testing their ability to use tools, follow policies, and interact with users.
