
Several Bug Fixes for Benchmarks #39

Merged
cemde merged 36 commits into main from fix-benchmark-implementations
Mar 4, 2026

Conversation


@cemde cemde commented Mar 4, 2026

Description

Summary

After shipping Tau2, MultiAgentBench, GAIA2, and Converse in previous PRs, we ran line-by-line faithfulness audits comparing our implementations against the original codebases. These uncovered divergences in Tau2 and in MultiAgentBench, ranging from wrong defaults and missing orchestration logic to silently broken evaluation pipelines.

This PR fixes the issues surfaced by those audits and adds the infrastructure changes needed to support them. All four benchmarks are now labeled Beta to make clear that, while the implementations are careful, the results have not yet been validated against the numbers reported in the original papers.

Key changes

  • Tau2: Fixed user tool routing, agent greeting injection, tool call counter reset, telecom domain models, evaluator assertions, and environment state sync — all to match the original tau2-bench behavior
  • MultiAgentBench: Implemented the full multi-iteration coordination loop (graph/star/chain/tree modes), and fixed data loading defaults, import paths, and bargaining evaluation (which was silently scoring only the seller side)
  • MACS/Converse: Smaller fixes — array schema handling for Gemini/OpenAI, removed dangerous silent default for attacker model
  • Core: User.respond() now raises UserExhaustedError instead of silently returning an empty string, which was masking bugs in orchestration loops. LLM simulator JSON parsing made robust against reasoning tokens and markdown fences.
  • Interface: Fixed Google GenAI tool-response merging, smolagents crashed-step detection, LlamaIndex max_iterations passthrough
  • Packaging: Fixed setuptools config so subpackages and data files are actually included in PyPI installs
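To illustrate the Core change above, here is a minimal sketch of a user simulator that raises instead of returning an empty string when its turn budget is spent. The class name, `max_turns` parameter, and method body are hypothetical stand-ins, not maseval's actual API; only the `UserExhaustedError` behavior described in the bullet is taken from this PR.

```python
class UserExhaustedError(Exception):
    """Raised when a simulated user has no turns left."""


class SimulatedUser:
    """Hypothetical sketch of a turn-budgeted user simulator."""

    def __init__(self, max_turns: int = 3):
        self.max_turns = max_turns
        self.turns_used = 0

    def respond(self, message: str) -> str:
        # Previously, an exhausted user returned "", which orchestration
        # loops silently treated as a real (empty) reply and kept going.
        if self.turns_used >= self.max_turns:
            raise UserExhaustedError(
                f"user budget of {self.max_turns} turns exhausted"
            )
        self.turns_used += 1
        return f"reply to: {message}"
```

Raising a dedicated exception forces the orchestration loop to handle exhaustion explicitly rather than looping on empty strings.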
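The hardened LLM-simulator JSON parsing mentioned above can be sketched as follows. The function name and exact regexes are illustrative assumptions, not the code merged in this PR; the idea is simply to strip `<think>…</think>`-style reasoning blocks and markdown code fences before calling `json.loads`.

```python
import json
import re


def extract_json(raw: str) -> dict:
    """Best-effort JSON extraction from an LLM reply (illustrative sketch).

    Removes reasoning blocks and unwraps markdown fences so that replies
    like '<think>...</think>```json\n{...}\n```' still parse.
    """
    # Drop reasoning blocks emitted by some models.
    cleaned = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL)
    # Unwrap a ```json ... ``` (or bare ```) fence if present.
    fence = re.search(r"```(?:json)?\s*(.*?)\s*```", cleaned, flags=re.DOTALL)
    if fence:
        cleaned = fence.group(1)
    return json.loads(cleaned)
```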
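For context on the coordination modes named in the MultiAgentBench bullet, here is a hypothetical sketch of how star and chain topologies constrain which agents may message each other. The function and its return shape are assumptions for illustration only; MultiAgentBench's actual loop and data structures differ.

```python
def build_topology(agents: list[str], mode: str) -> dict[str, list[str]]:
    """Map each agent to the agents it may message (illustrative sketch).

    "star": the first agent is the hub; spokes talk only to the hub.
    "chain": each agent forwards to its successor; the last has no target.
    """
    if mode == "star":
        hub, *spokes = agents
        edges = {hub: list(spokes)}
        edges.update({s: [hub] for s in spokes})
        return edges
    if mode == "chain":
        return {
            a: [agents[i + 1]] if i + 1 < len(agents) else []
            for i, a in enumerate(agents)
        }
    raise ValueError(f"unknown mode: {mode}")
```

Graph and tree modes generalize this to an arbitrary adjacency structure; the coordination loop then iterates message rounds over these edges.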

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Code quality improvement (refactoring, formatting, etc.)

Checklist

Contribution

Documentation

  • Added/updated docstrings for new/modified functions as instructed in CONTRIBUTING.md
  • Updated relevant documentation in docs/ (if applicable)
  • Tagged the GitHub issue with this PR (if applicable)

Changelog

  • Added entry to CHANGELOG.md under [Unreleased] section
    • Use Added section for new features
    • Use Changed section for modifications to existing functionality
    • Use Fixed section for bug fixes
    • Use Removed section for deprecated/removed features
  • OR this is a documentation-only change (no changelog needed)

Example:
- Support for multi-agent tracing (PR:#123)

Architecture (if applicable)

  • Core/Interface separation: Changes in maseval/core/ do NOT import from maseval/interface/
  • Dependencies: New core dependencies added sparingly; framework integrations go to optional dependencies

Additional Notes

@cemde cemde marked this pull request as ready for review March 4, 2026 14:43
github-actions bot commented Mar 4, 2026

Coverage report


File: lines missing coverage (new statements; per-file statement counts and percentages omitted)
  maseval/benchmark/converse
  converse.py 93
  maseval/benchmark/gaia2
  environment.py
  evaluator.py
  maseval/benchmark/macs
  macs.py 171, 179, 593
  maseval/benchmark/multiagentbench
  data_loader.py 277, 279
  environment.py
  evaluator.py 83, 87, 106-128, 193, 205-209
  multiagentbench.py 66-127, 564-577, 606-607, 624-634, 669-672, 684-741, 884, 1051-1052, 1211, 1231-1233, 1306, 1313-1316, 1325-1328, 1335-1344, 1354-1362, 1392-1436
  maseval/benchmark/multiagentbench/adapters
  marble_adapter.py 216-251
  maseval/benchmark/tau2
  environment.py 266, 274-310, 507, 526-527, 530-531, 620
  evaluator.py 313-314, 343-344, 388-389, 447
  tau2.py 330-331, 337-338, 536, 542, 547, 576, 800, 807, 871, 882, 885, 1376, 1436
  utils.py
  maseval/benchmark/tau2/domains
  base.py 143, 146-149, 153-162, 171, 392, 402, 410, 412
  maseval/benchmark/tau2/domains/airline
  tools.py 150-152
  maseval/benchmark/tau2/domains/retail
  tools.py 64, 538, 579
  maseval/benchmark/tau2/domains/telecom
  tools.py 34
  user_models.py 113, 289-301, 341
  user_tools.py 130-133, 163, 187, 191, 230, 233-234, 240, 254, 256, 258, 260, 264, 284, 313, 315, 319, 346-348, 356, 365, 372, 448-449, 455, 478, 485, 494, 524, 529-532, 552-556, 560-561, 595-598, 616, 623, 680-684, 686, 703, 712, 762, 783, 787, 796, 817, 825, 833, 840, 861, 865, 874, 918-919, 933-934, 947-949, 978, 986, 996, 1005-1006, 1013, 1019, 1025, 1031, 1037, 1043-1048, 1054-1055, 1061, 1067
  maseval/core
  exceptions.py
  simulator.py 537-538, 550, 676
  user.py
  maseval/interface/agents
  llamaindex.py 322
  smolagents.py
  maseval/interface/inference
  google_genai.py 201-211
Project Total  

The report is truncated to 25 files out of 30. To see the full report, please visit the workflow summary page.

This report was generated by python-coverage-comment-action

@cemde cemde merged commit c21750d into main Mar 4, 2026
27 checks passed
