Merged
removed duplicate macs extraction
Description
Summary
After shipping Tau2, MultiAgentBench, GAIA2, and Converse in previous PRs, we ran line-by-line faithfulness audits comparing our implementations against the original codebases. These uncovered divergences in Tau2 and in MultiAgentBench, ranging from wrong defaults and missing orchestration logic to silently broken evaluation pipelines.
This PR fixes the issues surfaced by those audits and adds the infrastructure changes needed to support them. All four benchmarks are now labeled Beta to be transparent that, while the implementations are careful, results have not yet been validated against the original papers' numbers.
Key changes
- `User.respond()` now raises `UserExhaustedError` instead of silently returning an empty string, which was masking bugs in orchestration loops.
- LLM simulator JSON parsing is now robust against reasoning tokens and markdown fences.
Type of Change
Checklist
Contribution
Documentation
`docs/` (if applicable)
Changelog
`CHANGELOG.md` under `[Unreleased]` section
Example:
- Support for multi-agent tracing (PR: #123)
Architecture (if applicable)
`maseval/core/` do NOT import from `maseval/interface/`
Additional Notes
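For reviewers, here is a minimal sketch of the two behaviors listed under Key changes. It is illustrative only and is not the code in this PR: `ScriptedUser`, `parse_simulator_json`, the `<think>` tag name, and the local definition of `UserExhaustedError` are all assumptions made for the example.

```python
import json
import re

FENCE = "`" * 3  # markdown code-fence marker, built at runtime


class UserExhaustedError(RuntimeError):
    """Hypothetical definition: the simulated user has no replies left."""


class ScriptedUser:
    """Hypothetical stand-in for a simulated user with a fixed list of replies."""

    def __init__(self, replies):
        self._replies = list(replies)

    def respond(self):
        # Fail loudly instead of returning "", so an orchestration loop
        # cannot keep running on an empty user turn without noticing.
        if not self._replies:
            raise UserExhaustedError("no replies left")
        return self._replies.pop(0)


def parse_simulator_json(raw):
    """Extract a JSON object from LLM output that may carry extra wrapping."""
    # Drop reasoning blocks; the <think> tag name is an assumption.
    text = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL)
    # Prefer the body of a markdown fence if one is present.
    fenced = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, text, flags=re.DOTALL)
    if fenced:
        text = fenced.group(1)
    # Fall back to the outermost {...} span in what remains.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in simulator output")
    return json.loads(text[start:end + 1])
```

With this shape, a caller that previously looped on an empty string would instead see `UserExhaustedError` and can decide explicitly whether to end the episode or treat it as a bug.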