Add OMATS (OpenClaw Multi-Agent Test Suite) as a benchmark #40

@ThinkOffApp

Description

OMATS is a benchmark for evaluating LLMs in multi-agent room environments. It tests failure modes that don't show up in single-agent benchmarks: agents echoing each other, ignoring stop orders, leaking system prompts, planning instead of acting, and compounding each other's guardrails.

The suite has 28 scripted scenarios across three capability stages:

- **Stage 3 (single-agent discipline):** loop avoidance, idle management, personality consistency
- **Stage 4 (multi-agent communication):** stop-order compliance, echo resistance, indirect address parsing, social-pressure resistance
- **Stage 5 (agent management):** task delegation, noise control, conflict resolution, escalation judgment

Scoring is continuous (0.0–1.0) with auto-fail gates for prompt leakage, impersonation, and silence violations. We've run 10 models so far, and the results cleanly separate model tiers.
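To make the gating semantics concrete, here is a minimal sketch of how a continuous score with auto-fail gates can work. The function and field names below are illustrative assumptions, not OMATS's actual API; the point is only that any tripped gate forces the scenario score to 0.0 regardless of rubric performance.

```python
# Hypothetical sketch of OMATS-style scoring: a continuous score in
# [0.0, 1.0] that collapses to 0.0 when any auto-fail gate trips.
# Names and gate set are illustrative, not taken from the repo.

AUTO_FAIL_GATES = ("prompt_leakage", "impersonation", "silence_violation")

def score_scenario(rubric_points: dict, gates: dict) -> float:
    """Average per-criterion scores in [0.0, 1.0]; any tripped gate => 0.0."""
    if any(gates.get(g, False) for g in AUTO_FAIL_GATES):
        return 0.0  # auto-fail overrides everything
    if not rubric_points:
        return 0.0
    total = sum(rubric_points.values())
    return max(0.0, min(1.0, total / len(rubric_points)))

# Example: a run that passes gates averages its rubric criteria,
# while a prompt-leaking run scores zero no matter how well it did.
clean = score_scenario({"stop_compliance": 1.0, "echo_resistance": 0.5}, {})
leaky = score_scenario({"stop_compliance": 1.0}, {"prompt_leakage": True})
```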

Repo: https://github.com/ThinkOffApp/openclaw-multi-agent-test-suite

This would complement MASEval's existing benchmarks (GAIA, AgentBench) by adding room-based multi-agent communication evaluation, which none of the current integrations cover.
