OMATS is a benchmark for evaluating LLMs in multi-agent room environments. It tests failure modes that don't show up in single-agent benchmarks: agents echoing each other, ignoring stop orders, leaking system prompts, planning instead of acting, and compounding one another's guardrail refusals.
The suite has 28 scripted scenarios across three capability stages:

- **Stage 3 (single-agent discipline):** loop avoidance, idle management, personality consistency.
- **Stage 4 (multi-agent communication):** stop order compliance, echo resistance, indirect address parsing, social pressure resistance.
- **Stage 5 (agent management):** task delegation, noise control, conflict resolution, escalation judgment.
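As a quick mental model, the stage taxonomy above could be captured in a small config sketch. All identifiers here are illustrative, not the repo's actual schema:

```python
# Illustrative sketch of the three capability stages and their scenario
# categories (names are hypothetical; see the repo for the real layout).
STAGES = {
    3: {"name": "single-agent discipline",
        "categories": ["loop_avoidance", "idle_management",
                       "personality_consistency"]},
    4: {"name": "multi-agent communication",
        "categories": ["stop_order_compliance", "echo_resistance",
                       "indirect_address_parsing",
                       "social_pressure_resistance"]},
    5: {"name": "agent management",
        "categories": ["task_delegation", "noise_control",
                       "conflict_resolution", "escalation_judgment"]},
}

# Each category maps to one or more of the 28 scripted scenarios.
print(len(STAGES))  # 3
```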
Scoring is continuous (0.0–1.0) with auto-fail gates for prompt leakage, impersonation, and silence violations. We've run 10 models so far and the results differentiate well between model tiers.
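The gated scoring rule can be sketched as follows. `ScenarioResult` and its field names are hypothetical, not OMATS's actual API; the point is that any gate violation overrides the continuous rubric score:

```python
# Hypothetical sketch of continuous scoring with auto-fail gates.
# Names are illustrative, not the benchmark's real interface.
from dataclasses import dataclass

@dataclass
class ScenarioResult:
    rubric_score: float          # continuous rubric score, 0.0-1.0
    leaked_system_prompt: bool   # auto-fail gate: prompt leakage
    impersonated_agent: bool     # auto-fail gate: impersonation
    broke_silence: bool          # auto-fail gate: silence violation

def final_score(result: ScenarioResult) -> float:
    """Clamp the rubric score to [0, 1]; any tripped gate forces 0.0."""
    if (result.leaked_system_prompt
            or result.impersonated_agent
            or result.broke_silence):
        return 0.0
    return max(0.0, min(1.0, result.rubric_score))

print(final_score(ScenarioResult(0.85, False, False, False)))  # 0.85
print(final_score(ScenarioResult(0.85, True, False, False)))   # 0.0
```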
Repo: https://github.com/ThinkOffApp/openclaw-multi-agent-test-suite
This would complement MASEval's existing benchmarks (GAIA, AgentBench) by adding room-based multi-agent communication evaluation, which none of the current integrations cover.