Merged
Commits
36 commits
a7f406e
initial fix for MAB
cemde Feb 17, 2026
18923c6
fixed multiagentbench
cemde Feb 18, 2026
510dddd
fixed more MAB issues
cemde Feb 18, 2026
d4f83d4
added tests for MAB
cemde Feb 18, 2026
e67b5e3
fixed tau2 fidelity
cemde Feb 18, 2026
5ecc771
added more features
cemde Feb 18, 2026
73a04ae
tau2 refactor user and a few bug fixes
cemde Feb 18, 2026
fc9fbdf
added fixes to multiagentbench
cemde Feb 18, 2026
4271019
fixes to marble dataloading and minecraft domain
cemde Feb 18, 2026
506a2e4
added fixes for tau2
cemde Feb 18, 2026
37abacb
added tests for tau2
cemde Feb 18, 2026
17f3d7c
fixed example and google genai adapter
cemde Feb 24, 2026
b783c70
fixed test
cemde Feb 24, 2026
9f67b0b
tau2 fixes
cemde Feb 24, 2026
2596cd2
fixed macs generic tool issue
cemde Feb 26, 2026
46db513
updated usersimulator to be more robust in decoding json
cemde Feb 26, 2026
95c937b
fixed bugs in tau2 and multiagentbench
cemde Feb 27, 2026
c4d085e
fixed bug where user LLM simulator returned empty response
cemde Feb 27, 2026
9f6e3a6
added maxiter to llamaindex
cemde Feb 27, 2026
d0b2b7e
fixed typing issues in tau2
cemde Feb 27, 2026
4972469
small fix to user error message
cemde Feb 27, 2026
76d2150
removed default attacker model for converse
cemde Mar 3, 2026
a914f55
noted issues with macs metrics
cemde Mar 3, 2026
c5413f6
added warning about multiagentbench not supporting communication eval
cemde Mar 3, 2026
10f8931
fixed PyPI install packaging
cemde Mar 3, 2026
867533c
fixed incorrect exception handling in smolagents adapter
cemde Mar 3, 2026
b503330
fixed multiagentbench eval issue
cemde Mar 3, 2026
54aeb79
fixed test
cemde Mar 3, 2026
e2f1a63
updated docstrings for clearer attribution to original work
cemde Mar 4, 2026
7fdcec0
formatting and doc hygiene
cemde Mar 4, 2026
cb54456
fixed docstrings
cemde Mar 4, 2026
210ebaf
updated changelog
cemde Mar 4, 2026
c86d576
Merge branch 'main' into fix-benchmark-implementations
cemde Mar 4, 2026
e3cf3d0
cleaned changelog
cemde Mar 4, 2026
b61f9d6
fixed bug in testing of multi-agent bench
cemde Mar 4, 2026
2efca71
fixed bug in multiagentbench loading
cemde Mar 4, 2026
18 changes: 14 additions & 4 deletions BENCHMARKS.md
@@ -15,22 +15,27 @@ This benchmark is designed to test and evaluate the collaborative problem-solvin

---

## 2. $\tau^2$-bench
## 2. $\tau^2$-bench (Beta)

$\tau^2$-bench is a benchmark for evaluating agentic systems in realistic, multi-turn interactive environments.

> **Beta:** This benchmark has been implemented carefully, but it is highly complex and we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome!

### Source and License

- **Original Repository:** [https://github.com/sierra-research/tau2-bench](https://github.com/sierra-research/tau2-bench)
- **Paper:** [Tau-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains](https://arxiv.org/abs/2406.12045)
- **Code License:** MIT
- **Data License:** MIT

---

## 3. MultiAgentBench (MARBLE)
## 3. MultiAgentBench (MARBLE) (Beta)

MultiAgentBench is a comprehensive benchmark suite for evaluating multi-agent collaboration and competition in LLM-based systems. It includes diverse scenarios across multiple domains including research collaboration, negotiation, coding tasks, and more.

> **Beta:** This benchmark has been implemented carefully, but it is highly complex and we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome!

### Source and License

- **Original Repository:** [https://github.com/ulab-uiuc/MARBLE](https://github.com/ulab-uiuc/MARBLE) (where the original work was done)
@@ -43,23 +48,28 @@ MultiAgentBench is a comprehensive benchmark suite for evaluating multi-agent co

---

## 4. GAIA2
## 4. GAIA2 (Beta)

Gaia2 is a benchmark for evaluating LLM-based agents on dynamic, multi-step scenarios using Meta's ARE (Agent Research Environments) platform. It tests agents across 7 capability dimensions: execution, search, adaptability, time, ambiguity, agent2agent, and noise.

> **Beta:** This benchmark has been implemented carefully, but it is highly complex and we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome!

### Source and License

- **Original Repository:** [https://github.com/facebookresearch/meta-agents-research-environments](https://github.com/facebookresearch/meta-agents-research-environments)
- **Paper:** [Gaia2: Benchmarking LLM Agents on Dynamic and Asynchronous Environments](https://openreview.net/forum?id=9gw03JpKK4) (ICLR 2026)
- **Dataset:** [https://huggingface.co/datasets/meta-agents-research-environments/gaia2](https://huggingface.co/datasets/meta-agents-research-environments/gaia2)
- **Code License:** MIT
- **Data License:** Subject to Meta's data usage terms (see HuggingFace dataset page)

---

## 5. CONVERSE
## 5. CONVERSE (Beta)

CONVERSE evaluates contextual safety in agent-to-agent conversations. It focuses on adversarial interactions where an external service-provider agent attempts privacy extraction or unauthorized action induction over multiple turns.

> **Beta:** This benchmark has been implemented carefully, but it is highly complex and we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome!

### Source and License

- **Original Repository:** [https://github.com/amrgomaaelhady/ConVerse](https://github.com/amrgomaaelhady/ConVerse)
35 changes: 32 additions & 3 deletions CHANGELOG.md
@@ -43,6 +43,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Added `seed` parameter to `ModelAdapter.__init__` for deterministic model inference (PR: #24)
- Added `SeedingError` exception for providers that don't support seeding (Anthropic models raise this if seed is provided) (PR: #24)
- Added seed support to interface adapters: `OpenAIModelAdapter`, `GoogleGenAIModelAdapter`, `LiteLLMModelAdapter`, `HuggingFaceModelAdapter` pass seeds to underlying APIs (PR: #24)
- Added `UserExhaustedError` exception in `maseval.core.exceptions` for flow control when a user's turns are exhausted (PR: #39)
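
As a quick illustration of the intended flow control (a minimal sketch under assumed usage; the `user` object and `agent_message` variable are placeholders, not the exact MASEval API):

```python
# Minimal sketch (assumed usage, not the exact MASEval API): catch
# UserExhaustedError to end an episode cleanly once the user has no turns left.
from maseval.core.exceptions import UserExhaustedError

def next_user_turn(user, agent_message):
    try:
        return user.respond(agent_message)  # raises once turns are exhausted
    except UserExhaustedError:
        return None  # tell the orchestration loop to stop the conversation
```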

**Interface**

@@ -69,6 +70,15 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

- Simplified seeding API: `seed_generator` parameter in setup methods is now always non-None (`SeedGenerator` instead of `Optional[SeedGenerator]`). When seeding is disabled (`seed=None`), `derive_seed()` returns `None` instead of raising an error. This eliminates all `if seed_generator is not None:` conditional checks - the same code path works whether seeding is enabled or disabled. (PR: #27)
- Clarified benchmark/evaluator component guidance in docstrings and docs, including recommended evaluator exception behavior with `fail_on_evaluation_error`. (PR: #28)
- `User.respond()` now raises `UserExhaustedError` instead of returning an empty string when the user has no more turns. Set the new `exhausted_response` parameter to return a configurable message instead (e.g. for tool-based integrations where agents call `ask_user`). Affects `LLMUser`, `AgenticLLMUser`, `Tau2User`, and `MACSUser`. (PR: #39)
- `_extract_json_object()` helper in `maseval.core.simulator` replaces brittle markdown-fence stripping with robust outermost-brace extraction for all LLM simulator JSON parsing (`ToolLLMSimulator`, `UserLLMSimulator`, `AgenticUserLLMSimulator`). (PR: #39)
- `UserLLMSimulator` and `AgenticUserLLMSimulator` now preserve stop tokens that appear outside the JSON object in raw LLM output, so `User._check_stop_token` can detect them. (PR: #39)
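
The two entries above boil down to a small parsing idea. The following self-contained sketch (not the actual `_extract_json_object()` implementation) pulls the first balanced `{...}` object out of raw LLM output and returns the leftover text, which is where stop tokens outside the JSON survive:

```python
import json

def extract_json_object(raw: str) -> tuple[dict, str]:
    """Sketch of outermost-brace extraction; details of the real helper may differ.

    Returns the first balanced JSON object plus the remaining text, so stop
    tokens that appear outside the object are preserved for later checks.
    Note: this simple counter does not special-case braces inside JSON strings.
    """
    start = raw.find("{")
    if start == -1:
        raise ValueError("no JSON object found in simulator output")
    depth = 0
    for i, ch in enumerate(raw[start:], start):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                obj = json.loads(raw[start : i + 1])
                leftover = raw[:start] + raw[i + 1 :]
                return obj, leftover
    raise ValueError("unbalanced braces in simulator output")
```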

**Interface**

- `LlamaIndexAgentAdapter`: Added `max_iterations` constructor parameter, forwarded to `AgentWorkflow.run()`. Fixes silent swallowing of `max_steps` by `FunctionAgent.__init__`. (PR: #39)
- `SmolAgentAdapter`: New `_determine_step_status()` detects crashed steps where `AgentGenerationError` was raised before `step.error` was set, preventing false "success" status on empty steps. (PR: #39)
- `GoogleGenAIModelAdapter`: Consecutive tool-response messages are now merged into a single `contents` entry, fixing Google API errors when multiple tool results are returned in one turn. (PR: #39)
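
For the `GoogleGenAIModelAdapter` change, the merging idea looks roughly like the sketch below; the message dicts and the `"tool"` role here are assumptions for illustration, not the adapter's actual internal representation:

```python
def merge_consecutive_tool_messages(messages: list[dict]) -> list[dict]:
    """Collapse runs of consecutive tool-response messages into one entry,
    so one agent turn with several tool results maps to a single contents item.
    Illustrative only; field names are assumptions."""
    merged: list[dict] = []
    for msg in messages:
        if merged and msg["role"] == "tool" and merged[-1]["role"] == "tool":
            merged[-1]["parts"] = merged[-1]["parts"] + msg["parts"]
        else:
            merged.append({"role": msg["role"], "parts": list(msg["parts"])})
    return merged
```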

**Benchmarks**

@@ -89,17 +99,36 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- `LangGraphUser` → `LangGraphLLMUser`
- `LlamaIndexUser` → `LlamaIndexLLMUser`

**Documentation**

- All benchmarks except MACS are now labeled as **Beta** in docs, BENCHMARKS.md, and benchmark index, with a warning that results have not yet been validated against original implementations. (PR: #39)

**Testing**

- Coverage script (`scripts/coverage_by_feature.py`) now accepts `--exclude` flag to skip additional markers; always excludes `credentialed` and `smoke` by default (PR: #29)

### Fixed

**Core**

- `ResultLogger._filter_report()` now includes `status` and `error` fields in persisted results, so saved logs can distinguish successful runs from infrastructure failures. Report schema is now consistent across success and failure paths (`error` is always present, `None` on success). (PR: #38)
- GAIA2: Various fixes for faithful reproduction of ARE reference results — scenario lifecycle, data loading, evaluation flow, multi-turn notification handling, tool filtering, default agent fidelity, and simulation time management (PR: #30)
- MultiAgentBench: Corrected domain mappings, added missing werewolf/minecraft support, fixed environment constructors, added result summarization matching MARBLE's evaluation pipeline (PR: #30)
- Packaging: Fixed `setuptools` configuration — `packages` now uses `find` with `include = ["maseval*"]` so subpackages and package data (`.json`, `.jsonl`, `.md`, etc.) are included in PyPI installs. (PR: #39)
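
A rough `setup.py`-style equivalent of that configuration is shown below; the project itself may well configure this through `pyproject.toml`, and the `package_data` patterns are illustrative assumptions:

```python
from setuptools import setup, find_packages

setup(
    name="maseval",
    packages=find_packages(include=["maseval*"]),              # pick up all subpackages
    package_data={"maseval": ["*.json", "*.jsonl", "*.md"]},   # assumed patterns
)
```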

**Benchmarks**

- Tau2: Fixed telecom domain schema to match tau2-bench, added agent/user state synchronization and deterministic network simulation, fixed initialization flow and tool result serialization (PR: #30)
- Fixed incorrect return type annotations on `DB.load()` and `DB.copy_deep()` in Tau2 benchmark — now use `Self` instead of `"DB"`, so subclass methods return the correct type (PR: #29)
- Tau2: Added initial agent greeting ("Hi! How can I help you today?") to user simulator's message history, matching the original tau2-bench orchestrator. Fixed tool call counter accumulating across agent turns instead of resetting per turn. Corrected `max_steps` comments (original default is 100, not 200). Documented all known architectural divergences from original tau2-bench in PROVENANCE.md. (PR: #39)
- Tau2: Various bugfixes including user tool routing, environment state synchronization, tool result serialization, telecom domain user models/tools, evaluator assertion logic, and the addition of the `addict` dependency for nested dict access. (PR: #39)
- Tau2: Fixed incorrect return type annotations on `DB.load()` and `DB.copy_deep()` — now use `Self` instead of `"DB"`, so subclass methods return the correct type (PR: #29)
- MultiAgentBench: Fixed bargaining evaluation to use both buyer and seller LLM evaluation prompts, matching the MARBLE paper's methodology. Previously only the seller prompt was used (mirroring a regression in the MARBLE codebase), causing buyer scores to always default to -1 and completion checks to always fail. Now reports `buyer_score`, `seller_score`, and `mean_score` scaled to 0-100. (PR: #39)
- MultiAgentBench: Corrected domain mappings, added missing werewolf/minecraft support, fixed environment constructors, added result summarization matching MARBLE's evaluation pipeline (PR: #30)
- MultiAgentBench: `MarbleMultiAgentBenchBenchmark` now implements MARBLE's multi-iteration coordination loop with all 4 modes (graph, star, chain, tree) instead of executing agents only once. Fixed default `coordinate_mode` from `"star"` to `"graph"` matching 1215/1226 MARBLE configs. Uses per-task `max_iterations` from task config (matching `engine.py:97`), respects per-agent LLM overrides, and initializes memory type from task config. (PR: #39)
- MultiAgentBench: Faithfulness audit fixes for reproduction mode — fixed wrong import path (`marble.utils.utils` → `marble.llms.model_prompting`), added Minecraft agent registration, per-domain defaults for `max_iterations`/`coordinate_mode`/`environment.type`/`memory.type` from MARBLE YAML configs, resolved hardcoded relative paths for `score.json` and `workspace/solution.py` via `_MARBLE_ROOT`, unified `coordinate_mode` defaults, corrected evaluator and agent model defaults to match MARBLE, replaced auto-generated agent IDs with strict validation. (PR: #39)
- MultiAgentBench: Fixed bargaining evaluation crash from `.format()` on single-brace JSON in evaluator prompts. Documented chain communication assertion bug in MARBLE's `engine.py`. (PR: #39)
- GAIA2: Various fixes for faithful reproduction of ARE reference results — scenario lifecycle, data loading, evaluation flow, multi-turn notification handling, tool filtering, default agent fidelity, and simulation time management (PR: #30)
- MACS: `MACSGenericTool._schema_to_inputs()` now preserves the `items` sub-schema for array-type properties, fixing tool registration with Gemini and OpenAI providers (see the sketch after this list). (PR: #39)
- MACS: Simplified `MACSUser._extract_user_profile()` — no longer attempts brittle parsing of scenario text; points profile section at the scenario to avoid duplication. (PR: #39)
- Converse: Removed silent `"gpt-4o"` default for `attacker_model_id`; now raises `ValueError` if not provided, preventing accidental benchmark misconfiguration. (PR: #39)
- ConVerse: Various fixes for faithful reproduction of the original. (PR: #32)
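
To illustrate why the MACS `items` fix noted earlier in this list matters, here is a generic sketch of converting JSON-schema properties into tool inputs while keeping the `items` sub-schema; the function and field names are assumptions, not MACS internals:

```python
def schema_to_inputs(properties: dict) -> dict:
    """Generic illustration (not the actual MACSGenericTool code)."""
    inputs = {}
    for name, prop in properties.items():
        entry = {"type": prop.get("type", "string"),
                 "description": prop.get("description", "")}
        if prop.get("type") == "array" and "items" in prop:
            # Providers such as Gemini and OpenAI reject array parameters that
            # do not declare their item type, so the sub-schema must be kept.
            entry["items"] = prop["items"]
        inputs[name] = entry
    return inputs

# Example: an array-typed property keeps its item type after conversion.
schema_to_inputs({"tags": {"type": "array", "items": {"type": "string"},
                           "description": "Labels to attach"}})
```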

### Removed
5 changes: 4 additions & 1 deletion docs/benchmark/converse.md
@@ -1,4 +1,7 @@
# CONVERSE Benchmark
# CONVERSE Benchmark (Beta)

!!! warning "Beta"
This benchmark has been implemented carefully, but it is highly complex and we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome!

CONVERSE evaluates privacy and security robustness in agent-to-agent conversations where the external counterpart is adversarial.

39 changes: 16 additions & 23 deletions docs/benchmark/gaia2.md
@@ -1,10 +1,13 @@
# Gaia2: Dynamic Multi-Step Scenario Benchmark
# GAIA2: Dynamic Multi-Step Scenario Benchmark (Beta)

The **Gaia2 Benchmark** evaluates LLM-based agents on dynamic, multi-step scenarios using Meta's ARE (Agent Research Environments) platform. It tests agents across multiple capability dimensions in a simulated mobile environment.
!!! warning "Beta"
This benchmark has been implemented carefully, but it is highly complex and we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome!

The **GAIA2 Benchmark** evaluates LLM-based agents on dynamic, multi-step scenarios using Meta's ARE (Agent Research Environments) platform. It tests agents across multiple capability dimensions in a simulated mobile environment.

## Overview

[Gaia2](https://huggingface.co/datasets/meta-agents-research-environments/gaia2) is designed to evaluate agents in realistic, time-sensitive scenarios. The benchmark features:
[GAIA2](https://huggingface.co/datasets/meta-agents-research-environments/gaia2) is designed to evaluate agents in realistic, time-sensitive scenarios. The benchmark features:

- **ARE simulation environment** with real-time dynamics and event scheduling
- **Tool-based time control** via `wait_for_notification()` for temporal reasoning
@@ -18,7 +21,7 @@ Check out the [BENCHMARKS.md](https://github.com/parameterlab/MASEval/blob/main/

## Installation

Gaia2 requires additional dependencies:
GAIA2 requires additional dependencies:

```bash
pip install maseval[gaia2]
@@ -88,15 +91,15 @@ results = benchmark.run(tasks)

## Capabilities

Gaia2 tasks are organized by capability dimension:
GAIA2 tasks are organized by capability dimension:

| Capability | Description |
| -------------- | ------------------------------------------------ |
| `execution` | Basic task execution |
| `search` | Information retrieval tasks |
| `adaptability` | Adapting to changing requirements |
| `time` | Temporal reasoning tasks |
| `ambiguity` | Handling ambiguous instructions |
| Capability | Description |
| -------------- | --------------------------------- |
| `execution` | Basic task execution |
| `search` | Information retrieval tasks |
| `adaptability` | Adapting to changing requirements |
| `time` | Temporal reasoning tasks |
| `ambiguity` | Handling ambiguous instructions |

Load specific capabilities:

@@ -110,7 +113,7 @@ tasks = load_tasks(limit=50)

## Multi-Turn Notification Loop

GAIA2 uses an **event-driven** multi-turn architecture, not user-turn interaction. Unlike Tau2 (where a user simulator drives multi-turn), GAIA2 scenarios have scheduled events (e.g., "calendar events added at t=240s", "friend replies at t=300s") that the agent must wait for and react to.
GAIA2 uses an **event-driven** multi-turn architecture. Scenarios have scheduled events (e.g., "calendar events added at t=240s", "friend replies at t=300s") that the agent must wait for and react to.

The benchmark invokes the agent **once**. The agent handles multi-turn internally via the notification loop:

@@ -147,16 +150,6 @@ if has_stop:

See `DefaultGaia2Agent` source for the canonical single-loop implementation.
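
The shape of that loop is roughly the following; this is a schematic sketch under assumed helper names (`llm_step`, `wait_for_notification`, notification dicts with a `type` field), not the `DefaultGaia2Agent` code:

```python
def agent_notification_loop(llm_step, wait_for_notification, task, max_turns=20):
    """Schematic only: act, wait for scheduled events, react, stop on a stop event."""
    context = [task]
    for _ in range(max_turns):
        context.append(llm_step(context))        # one agent turn over the ARE tools
        notifications = wait_for_notification()  # blocks until scheduled events fire
        has_stop = any(n.get("type") == "stop" for n in notifications)
        if has_stop:                             # environment signalled completion
            break
        context.extend(notifications)            # react to the new events
```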

## Key Differences from Tau2

| Aspect | Gaia2 | Tau2 |
| ---------------- | ---------------------------------------- | --------------------------------- |
| Interaction | Event-driven simulation | Turn-based user simulation |
| Time Control | Agent calls `wait_for_notification()` | Fixed turns |
| Tools | ARE app tools (12 apps) | Domain-specific tools (3 domains) |
| Evaluation | Event DAG comparison | Database state comparison |
| User Simulator | None (events are scheduled) | LLM-based customer simulator |

## API Reference

::: maseval.benchmark.gaia2.Gaia2Benchmark
5 changes: 5 additions & 0 deletions docs/benchmark/index.md
@@ -2,6 +2,11 @@

MASEval includes pre-implemented benchmarks for evaluating multi-agent systems.

!!! warning "Beta Benchmarks"
Several benchmarks are currently in **Beta**. They have been implemented carefully, but these are highly complex systems and we have not yet validated the results against the original implementations. Use with caution when comparing with existing results or original paper numbers. Contributions and compute donations welcome!

**MACS** is the only benchmark that has been fully validated.

## Adding Custom Benchmarks

You can also create your own benchmarks by subclassing the [`Benchmark`](../reference/benchmark.md) class. See the [Five-a-Day example](../examples/five_a_day_benchmark.ipynb) for a complete walkthrough.
5 changes: 4 additions & 1 deletion docs/benchmark/multiagentbench.md
@@ -1,4 +1,7 @@
# MultiAgentBench: Multi-Agent Collaboration Benchmark
# MultiAgentBench: Multi-Agent Collaboration Benchmark (Beta)

!!! warning "Beta"
This benchmark has been implemented carefully, but it is highly complex and we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome!

The **MultiAgentBench** benchmark evaluates multi-agent collaboration and competition in LLM-based systems across diverse scenarios including research, negotiation, coding, and more.

5 changes: 4 additions & 1 deletion docs/benchmark/tau2.md
@@ -1,4 +1,7 @@
# Tau2: Tool-Agent-User Interaction Benchmark
# Tau2: Tool-Agent-User Interaction Benchmark (Beta)

!!! warning "Beta"
This benchmark has been implemented carefully, but it is highly complex and we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome!

The **Tau2 Benchmark** evaluates LLM-based agents on customer service tasks across multiple real-world domains, testing their ability to use tools, follow policies, and interact with users.
