
RLE v1.0: Multi-model colony management leaderboard #8

@jkbennitt

Description

Vision

RLE is two things:

  1. A rigorous multi-agent game benchmark modeled after FLE (Factorio Learning Environment, NeurIPS 2025) — but for multi-agent coordination under uncertainty instead of single-agent factory optimization
  2. Chatbot Arena for RimWorld — a public leaderboard where different LLMs compete at managing a colony through 6 specialized agents

The leaderboard is the product. FLE's methodology is the credibility.

The clip: A clean results table showing Claude vs GPT vs Nemotron vs Llama on colony survival. "Claude keeps 5/5 alive through a raid, GPT loses 2." That's what gets shared.

Three audiences, one dataset:

  • AI/ML researchers → FLE-style paper with rigorous methodology, baselines, p-values
  • Dev community → Felix SDK showcase, livestream demo with dashboard
  • RimWorld/gaming community → AI colonies, mod potential, entertaining failures

How RLE Differs from FLE

|                  | FLE                            | RLE                                        |
| ---------------- | ------------------------------ | ------------------------------------------ |
| Game             | Factorio (deterministic)       | RimWorld (stochastic)                      |
| Agents           | Single agent                   | 6 role-specialized, hub-spoke coordination |
| Communication    | None                           | CentralPost with phase/score broadcasts    |
| Environment      | Deterministic (fixed seeds)    | Stochastic (raids, disease, mood, weather) |
| Task structure   | 24 lab-play + open-play        | 6 scenarios + paired agent-vs-baseline     |
| Scoring          | Binary pass + Production Score | 8-metric composite + delta over baseline   |
| Model comparison | 6 frontier models              | Local (4B) to cloud (120B), any provider   |
| Baseline         | None (gap in FLE)              | Unmanaged colony (RimWorld built-in AI)    |
| Human baseline   | None (gap in FLE)              | Planned (RimWorld has a large player base) |
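The "8-metric composite + delta over baseline" scoring can be sketched as below. The metric names and the equal weighting are illustrative assumptions for this issue, not RLE's actual definitions:

```python
# Illustrative metric names -- the real 8 metrics may differ.
METRICS = ["survival", "mood", "food", "wealth",
           "medical", "defense", "research", "labor"]

def composite_score(metrics: dict[str, float]) -> float:
    """Equal-weight mean of metrics normalized to [0, 1] (assumed weighting)."""
    return sum(metrics[m] for m in METRICS) / len(METRICS)

def delta_over_baseline(agent: dict[str, float],
                        baseline: dict[str, float]) -> float:
    """Paired delta: agent composite minus unmanaged-colony composite."""
    return composite_score(agent) - composite_score(baseline)
```

The delta, rather than the raw composite, is what the leaderboard ranks on, since it cancels out scenario difficulty.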

FLE Patterns We're Following

  • Fixed-seed reproducibility: Save/load same colony state for every run
  • Multiple runs per model: N=4+ with mean ± std, report median for skewed distributions
  • Binary + continuous metrics: Victory/failure conditions AND composite score
  • Difficulty progression: Easy (Crashlanded) → Extreme (Ship Launch)
  • Comparative results table: Model × scenario matrix
  • Paired evaluation: Agent score vs baseline score (FLE doesn't have this — we're better here)
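The "mean ± std, report median" aggregation above is a small amount of code; a standard-library sketch:

```python
import statistics

def summarize_runs(scores: list[float]) -> dict[str, float]:
    """Aggregate N repeated runs of one model: mean +/- std, plus the
    median, which is more robust for skewed score distributions."""
    return {
        "n": len(scores),
        "mean": statistics.mean(scores),
        "std": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "median": statistics.median(scores),
    }
```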

FLE Patterns We're Adding

  • Stochastic robustness: Different random events per run (RimWorld storyteller varies), requiring more runs for significance
  • Multi-agent coordination: 6 agents must coordinate without conflicting, measured by conflict resolution stats
  • Ablation: Remove one agent at a time to measure per-agent contribution
  • Human baseline: Expert RimWorld players on same scenarios (FLE acknowledged this gap)
  • Real-time visualization: Helix + dashboard overlay for qualitative assessment
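The ablation pattern above can be sketched as a leave-one-out loop. The agent role names and the `run_benchmark` callable are placeholders, not RLE's actual API:

```python
# Hypothetical role names -- the real 6 agents may be named differently.
AGENTS = ["builder", "grower", "doctor", "cook", "defender", "researcher"]

def ablation(run_benchmark, full_score: float) -> dict[str, float]:
    """Leave-one-out ablation: re-run the benchmark with one agent
    removed at a time; the score drop is that agent's contribution.
    run_benchmark(active_agents) -> composite score (placeholder)."""
    contributions = {}
    for agent in AGENTS:
        active = [a for a in AGENTS if a != agent]
        contributions[agent] = full_score - run_benchmark(active)
    return contributions
```

In a stochastic environment each ablated configuration would itself need N repeated runs, so the full sweep is 6 × N benchmark runs on top of the baseline.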

Current State

Infrastructure: DONE

  • 6 role agents with CentralPost hub-spoke communication
  • Parallel deliberation (6 agents concurrently)
  • SSE events wired into agent decisions
  • Helix phase adaptation (exploration → analysis → synthesis)
  • Paired benchmark (agent vs unmanaged baseline with save/load)
  • Delta scoring with statistical tests (Cohen's d, Welch's t-test)
  • Detailed colonist data (skills, traits, current job, needs)
  • Dashboard with 5 RLE widgets
  • Terminal helix visualizer
  • Benchmark tracking (JSONL history, baselines, W&B, HuggingFace Hub)
  • Provider-agnostic (local 4B, cloud 120B, Anthropic, OpenAI)
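The statistical tests listed above (Cohen's d, Welch's t-test) need no heavy dependencies; a self-contained sketch, with degrees of freedom via the Welch-Satterthwaite equation:

```python
import math
import statistics

def cohens_d(a: list[float], b: list[float]) -> float:
    """Effect size: mean difference over pooled standard deviation."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (statistics.mean(a) - statistics.mean(b)) / pooled

def welch_t(a: list[float], b: list[float]) -> tuple[float, float]:
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom,
    for samples with unequal variances (agent runs vs baseline runs)."""
    va, vb = statistics.variance(a) / len(a), statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```

In practice `scipy.stats.ttest_ind(a, b, equal_var=False)` gives the same t statistic plus a p-value directly.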

Agent quality: IN PROGRESS

  • Agents beat baseline for the first time (+0.018, N=2, not yet significant)
  • N=4 run with detailed colonist data for statistical significance
  • Harder scenario saves where baseline struggles more (Create benchmark save files for all 6 scenarios #7)
  • Ablation (remove one agent, measure contribution)

Multi-model comparison: NOT STARTED

  • Run same scenario with 4+ different models
  • Build the leaderboard table
  • Identify which models are best at which scenarios

Milestones

M1: Agents consistently beat baseline (#6)

  • N=4 paired runs on Crashlanded with positive delta (p < 0.05)
  • Detailed colonist data improving agent decisions
  • Target: Agent 0.85 vs Baseline 0.75 → delta +0.10

M2: Multi-scenario benchmark suite (#7)

  • 4+ scenario save files created (Crashlanded, First Winter, Raid Defense, Plague Response)
  • Paired results across scenarios showing agents help more on harder scenarios
  • FLE parallel: like their 24 lab-play tasks but with RimWorld's stochastic challenge progression
  • Target: positive delta on at least 4/6 scenarios

M3: Multi-model leaderboard

  • Run the benchmark suite with 4+ models:
    • Nemotron Nano 4B (local, free)
    • Nemotron Super 120B (OpenRouter, ~$0.09/run)
    • Claude Sonnet (Anthropic)
    • GPT-4o (OpenAI)
    • Qwen3.5-9B (local, free)
  • FLE parallel: their 6-model comparison table but with delta scores instead of pass rates
  • Publish: model × scenario matrix with delta, p-value, effect size
  • Target: Clear differentiation between models
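The published model × scenario matrix reduces to a simple pivot over per-run results; a sketch, where the tuple shape `(model, scenario, delta)` is an assumption about how results are stored:

```python
from collections import defaultdict

def build_leaderboard(results) -> dict[str, dict[str, float]]:
    """Pivot (model, scenario, delta) tuples into a model x scenario
    matrix keyed by model name. p-values and effect sizes would be
    attached per cell in the real table."""
    table: dict[str, dict[str, float]] = defaultdict(dict)
    for model, scenario, delta in results:
        table[model][scenario] = delta
    return dict(table)
```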

M4: Public release

  • Results on HuggingFace Hub (appsprout/rle-benchmarks)
  • Blog post / Twitter thread with the leaderboard
  • Dashboard recording / livestream VOD
  • README with full reproduction instructions
  • FLE parallel: their open-source release with reproducible evaluation

M5: Paper

  • FLE-style methodology: environment description, agent architecture, evaluation protocol
  • Results table: model × scenario × metric
  • Ablation study: per-agent contribution
  • Human baseline comparison (3-5 expert RimWorld players)
  • Novel contributions over FLE: multi-agent coordination, stochastic environment, paired baseline, human comparison
  • Target venue: NeurIPS workshop, AAAI, or standalone arXiv

Success Criteria

  1. The leaderboard exists with 4+ models showing statistically significant differences
  2. At least one scenario where agents demonstrably save a colony that would otherwise fail
  3. People share the results because it's a fun, intuitive way to compare LLM capabilities
  4. The methodology is rigorous enough that researchers take it seriously

Timeline

No fixed date. Quality over speed. Momentum-dependent.
