
Several Bug Fixes for Benchmarks #39

Merged
cemde merged 36 commits into main from fix-benchmark-implementations
Mar 4, 2026

Conversation


@cemde cemde commented Mar 4, 2026

Description

Summary

After shipping Tau2, MultiAgentBench, GAIA2, and Converse in previous PRs, we ran line-by-line faithfulness audits comparing our implementations against the original codebases. These uncovered divergences in Tau2 and in MultiAgentBench, ranging from wrong defaults and missing orchestration logic to silently broken evaluation pipelines.

This PR fixes the issues surfaced by those audits and adds the infrastructure changes needed to support them. All four benchmarks are now labeled Beta to make clear that, while the implementations are careful, the results have not yet been validated against the numbers reported in the original papers.

Key changes

  • Tau2: Fixed user tool routing, agent greeting injection, tool call counter reset, telecom domain models, evaluator assertions, and environment state sync — all to match the original tau2-bench behavior
  • MultiAgentBench: Implemented the full multi-iteration coordination loop (graph/star/chain/tree modes), and fixed data loading defaults, import paths, and bargaining evaluation (which was silently scoring only the seller side)
  • MACS/Converse: Smaller fixes — array schema handling for Gemini/OpenAI, removed dangerous silent default for attacker model
  • Core: User.respond() now raises UserExhaustedError instead of silently returning an empty string, which was masking bugs in orchestration loops. LLM simulator JSON parsing made robust against reasoning tokens and markdown fences.
  • Interface: Fixed Google GenAI tool-response merging, smolagents crashed-step detection, LlamaIndex max_iterations passthrough
  • Packaging: Fixed setuptools config so subpackages and data files are actually included in PyPI installs
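To illustrate the Core change above, here is a minimal sketch of a user simulator that raises instead of returning an empty string when its turn budget is spent. The class name, `max_turns` parameter, and method body are hypothetical stand-ins, not maseval's actual API; only the `UserExhaustedError` behavior described in the bullet is taken from this PR.

```python
class UserExhaustedError(Exception):
    """Raised when a simulated user has no turns left."""


class SimulatedUser:
    """Hypothetical sketch of a turn-budgeted user simulator."""

    def __init__(self, max_turns: int = 3):
        self.max_turns = max_turns
        self.turns_used = 0

    def respond(self, message: str) -> str:
        # Previously, an exhausted user returned "", which orchestration
        # loops silently treated as a real (empty) reply and kept going.
        if self.turns_used >= self.max_turns:
            raise UserExhaustedError(
                f"user budget of {self.max_turns} turns exhausted"
            )
        self.turns_used += 1
        return f"reply to: {message}"
```

Raising a dedicated exception forces the orchestration loop to handle exhaustion explicitly rather than looping on empty strings.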
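The hardened LLM-simulator JSON parsing mentioned above can be sketched as follows. The function name and exact regexes are illustrative assumptions, not the code merged in this PR; the idea is simply to strip `<think>…</think>`-style reasoning blocks and markdown code fences before calling `json.loads`.

```python
import json
import re


def extract_json(raw: str) -> dict:
    """Best-effort JSON extraction from an LLM reply (illustrative sketch).

    Removes reasoning blocks and unwraps markdown fences so that replies
    like '<think>...</think>```json\n{...}\n```' still parse.
    """
    # Drop reasoning blocks emitted by some models.
    cleaned = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL)
    # Unwrap a ```json ... ``` (or bare ```) fence if present.
    fence = re.search(r"```(?:json)?\s*(.*?)\s*```", cleaned, flags=re.DOTALL)
    if fence:
        cleaned = fence.group(1)
    return json.loads(cleaned)
```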
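For context on the coordination modes named in the MultiAgentBench bullet, here is a hypothetical sketch of how star and chain topologies constrain which agents may message each other. The function and its return shape are assumptions for illustration only; MultiAgentBench's actual loop and data structures differ.

```python
def build_topology(agents: list[str], mode: str) -> dict[str, list[str]]:
    """Map each agent to the agents it may message (illustrative sketch).

    "star": the first agent is the hub; spokes talk only to the hub.
    "chain": each agent forwards to its successor; the last has no target.
    """
    if mode == "star":
        hub, *spokes = agents
        edges = {hub: list(spokes)}
        edges.update({s: [hub] for s in spokes})
        return edges
    if mode == "chain":
        return {
            a: [agents[i + 1]] if i + 1 < len(agents) else []
            for i, a in enumerate(agents)
        }
    raise ValueError(f"unknown mode: {mode}")
```

Graph and tree modes generalize this to an arbitrary adjacency structure; the coordination loop then iterates message rounds over these edges.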

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Code quality improvement (refactoring, formatting, etc.)

Checklist

Contribution

Documentation

  • Added/updated docstrings for new/modified functions as instructed in CONTRIBUTING.md
  • Updated relevant documentation in docs/ (if applicable)
  • Tagged the GitHub issue with this PR (if applicable)

Changelog

  • Added entry to CHANGELOG.md under [Unreleased] section
    • Use Added section for new features
    • Use Changed section for modifications to existing functionality
    • Use Fixed section for bug fixes
    • Use Removed section for deprecated/removed features
  • OR this is a documentation-only change (no changelog needed)

Example:
- Support for multi-agent tracing (PR:#123)

Architecture (if applicable)

  • Core/Interface separation: Changes in maseval/core/ do NOT import from maseval/interface/
  • Dependencies: New core dependencies added sparingly; framework integrations go to optional dependencies

Additional Notes

@cemde cemde marked this pull request as ready for review March 4, 2026 14:43
github-actions bot commented Mar 4, 2026

Coverage report


File: lines missing coverage (new statements; per-file statement counts and percentages omitted)
  maseval/benchmark/converse
  converse.py 93
  maseval/benchmark/gaia2
  environment.py
  evaluator.py
  maseval/benchmark/macs
  macs.py 171, 179, 593
  maseval/benchmark/multiagentbench
  data_loader.py 277, 279
  environment.py
  evaluator.py 83, 87, 106-128, 193, 205-209
  multiagentbench.py 66-127, 564-577, 606-607, 624-634, 669-672, 684-741, 884, 1051-1052, 1211, 1231-1233, 1306, 1313-1316, 1325-1328, 1335-1344, 1354-1362, 1392-1436
  maseval/benchmark/multiagentbench/adapters
  marble_adapter.py 216-251
  maseval/benchmark/tau2
  environment.py 266, 274-310, 507, 526-527, 530-531, 620
  evaluator.py 313-314, 343-344, 388-389, 447
  tau2.py 330-331, 337-338, 536, 542, 547, 576, 800, 807, 871, 882, 885, 1376, 1436
  utils.py
  maseval/benchmark/tau2/domains
  base.py 143, 146-149, 153-162, 171, 392, 402, 410, 412
  maseval/benchmark/tau2/domains/airline
  tools.py 150-152
  maseval/benchmark/tau2/domains/retail
  tools.py 64, 538, 579
  maseval/benchmark/tau2/domains/telecom
  tools.py 34
  user_models.py 113, 289-301, 341
  user_tools.py 130-133, 163, 187, 191, 230, 233-234, 240, 254, 256, 258, 260, 264, 284, 313, 315, 319, 346-348, 356, 365, 372, 448-449, 455, 478, 485, 494, 524, 529-532, 552-556, 560-561, 595-598, 616, 623, 680-684, 686, 703, 712, 762, 783, 787, 796, 817, 825, 833, 840, 861, 865, 874, 918-919, 933-934, 947-949, 978, 986, 996, 1005-1006, 1013, 1019, 1025, 1031, 1037, 1043-1048, 1054-1055, 1061, 1067
  maseval/core
  exceptions.py
  simulator.py 537-538, 550, 676
  user.py
  maseval/interface/agents
  llamaindex.py 322
  smolagents.py
  maseval/interface/inference
  google_genai.py 201-211
Project Total  

The report is truncated to 25 files out of 30. To see the full report, please visit the workflow summary page.

This report was generated by python-coverage-comment-action

@cemde cemde merged commit c21750d into main Mar 4, 2026
27 checks passed
