OMATS is a benchmark for evaluating LLMs in multi-agent room environments. It tests failure modes that don't show up in single-agent benchmarks: agents echoing each other, ignoring stop orders, leaking system prompts, planning instead of acting, and compounding one another's guardrail refusals.
The suite has 28 scripted scenarios across three capability stages:

- **Stage 3 (single-agent discipline):** loop avoidance, idle management, personality consistency.
- **Stage 4 (multi-agent communication):** stop order compliance, echo resistance, indirect address parsing, social pressure resistance.
- **Stage 5 (agent management):** task delegation, noise control, conflict resolution, escalation judgment.
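As a quick mental model, the stage taxonomy above could be captured in a small config sketch. All identifiers here are illustrative, not the repo's actual schema:

```python
# Illustrative sketch of the three capability stages and their scenario
# categories (names are hypothetical; see the repo for the real layout).
STAGES = {
    3: {"name": "single-agent discipline",
        "categories": ["loop_avoidance", "idle_management",
                       "personality_consistency"]},
    4: {"name": "multi-agent communication",
        "categories": ["stop_order_compliance", "echo_resistance",
                       "indirect_address_parsing",
                       "social_pressure_resistance"]},
    5: {"name": "agent management",
        "categories": ["task_delegation", "noise_control",
                       "conflict_resolution", "escalation_judgment"]},
}

# Each category maps to one or more of the 28 scripted scenarios.
print(len(STAGES))  # 3
```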
Scoring is continuous (0.0–1.0) with auto-fail gates for prompt leakage, impersonation, and silence violations. We've run 10 models so far and the results differentiate well between model tiers.
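The gated scoring rule can be sketched as follows. `ScenarioResult` and its field names are hypothetical, not OMATS's actual API; the point is that any gate violation overrides the continuous rubric score:

```python
# Hypothetical sketch of continuous scoring with auto-fail gates.
# Names are illustrative, not the benchmark's real interface.
from dataclasses import dataclass

@dataclass
class ScenarioResult:
    rubric_score: float          # continuous rubric score, 0.0-1.0
    leaked_system_prompt: bool   # auto-fail gate: prompt leakage
    impersonated_agent: bool     # auto-fail gate: impersonation
    broke_silence: bool          # auto-fail gate: silence violation

def final_score(result: ScenarioResult) -> float:
    """Clamp the rubric score to [0, 1]; any tripped gate forces 0.0."""
    if (result.leaked_system_prompt
            or result.impersonated_agent
            or result.broke_silence):
        return 0.0
    return max(0.0, min(1.0, result.rubric_score))

print(final_score(ScenarioResult(0.85, False, False, False)))  # 0.85
print(final_score(ScenarioResult(0.85, True, False, False)))   # 0.0
```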
Repo: https://github.com/ThinkOffApp/openclaw-multi-agent-test-suite
This would complement MASEval's existing benchmarks (GAIA, AgentBench) by adding room-based multi-agent communication evaluation, which none of the current integrations cover.