Labels: epic (Long-term strategic milestone)
## Vision
RLE is two things:
- A rigorous multi-agent game benchmark modeled after FLE (Factorio Learning Environment, NeurIPS 2025) — but for multi-agent coordination under uncertainty instead of single-agent factory optimization
- Chatbot Arena for RimWorld — a public leaderboard where different LLMs compete at managing a colony through 6 specialized agents
The leaderboard is the product. FLE's methodology is the credibility.
The clip: A clean results table showing Claude vs GPT vs Nemotron vs Llama on colony survival. "Claude keeps 5/5 alive through a raid, GPT loses 2." That's what gets shared.
Three audiences, one dataset:
- AI/ML researchers → FLE-style paper with rigorous methodology, baselines, p-values
- Dev community → Felix SDK showcase, livestream demo with dashboard
- RimWorld/gaming community → AI colonies, mod potential, entertaining failures
## How RLE Differs from FLE
| | FLE | RLE |
|---|---|---|
| Game | Factorio (deterministic) | RimWorld (stochastic) |
| Agents | Single agent | 6 role-specialized, hub-spoke coordination |
| Communication | None | CentralPost with phase/score broadcasts |
| Environment | Deterministic (fixed seeds) | Stochastic (raids, disease, mood, weather) |
| Task structure | 24 lab-play + open-play | 6 scenarios + paired agent-vs-baseline |
| Scoring | Binary pass + Production Score | 8-metric composite + delta over baseline |
| Model comparison | 6 frontier models | Local (4B) to cloud (120B), any provider |
| Baseline | None (gap in FLE) | Unmanaged colony (RimWorld built-in AI) |
| Human baseline | None (gap in FLE) | Planned (RimWorld has large player base) |
## FLE Patterns We're Following
- Fixed-seed reproducibility: Save/load same colony state for every run
- Multiple runs per model: N=4+ with mean ± std, report median for skewed distributions
- Binary + continuous metrics: Victory/failure conditions AND composite score
- Difficulty progression: Easy (Crashlanded) → Extreme (Ship Launch)
- Comparative results table: Model × scenario matrix
- Paired evaluation: Agent score vs baseline score (FLE doesn't have this — we're better here)
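The paired pattern above can be sketched in a few lines. Everything here is illustrative: `run_colony` is a hypothetical stand-in for the real harness, and the fake scorer exists only to show the shape of the output.

```python
import statistics

def paired_eval(run_colony, seeds):
    """Run agent-managed and unmanaged colonies from the same saved seeds,
    then summarize the per-seed delta. Pairing on seed controls for the
    luck of a particular storyteller rollout."""
    deltas = []
    for seed in seeds:
        agent_score = run_colony(seed, managed=True)      # 6-agent colony
        baseline_score = run_colony(seed, managed=False)  # RimWorld built-in AI
        deltas.append(agent_score - baseline_score)
    return {
        "mean": statistics.mean(deltas),
        "std": statistics.stdev(deltas) if len(deltas) > 1 else 0.0,
        "median": statistics.median(deltas),  # robust for skewed distributions
        "n": len(deltas),
    }

# Fake scorer, for illustration only:
fake = lambda seed, managed: 0.85 if managed else 0.75
print(paired_eval(fake, seeds=[1, 2, 3, 4]))
```

Reporting the median alongside mean ± std follows the FLE convention for skewed score distributions.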
## FLE Patterns We're Adding
- Stochastic robustness: Different random events per run (RimWorld storyteller varies), requiring more runs for significance
- Multi-agent coordination: 6 agents must coordinate without conflicting, measured by conflict resolution stats
- Ablation: Remove one agent at a time to measure per-agent contribution
- Human baseline: Expert RimWorld players on same scenarios (FLE acknowledged this gap)
- Real-time visualization: Helix + dashboard overlay for qualitative assessment
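The leave-one-out ablation described above can be sketched as follows. The agent names and the additive fake harness are invented for illustration; real per-agent contributions would come from re-running the full benchmark with each agent disabled.

```python
def ablation(run_suite, agents):
    """Leave-one-out ablation: re-run the benchmark with each agent removed
    and attribute the resulting score drop to that agent."""
    full_score = run_suite(agents)
    contributions = {}
    for agent in agents:
        reduced = [a for a in agents if a != agent]
        contributions[agent] = full_score - run_suite(reduced)
    return contributions

# Fake additive harness with hypothetical role names, for shape only:
weights = {"builder": 0.2, "doctor": 0.15, "defender": 0.25}
fake_suite = lambda roster: sum(weights[a] for a in roster)
print(ablation(fake_suite, list(weights)))
```

In a stochastic environment each `run_suite` call would itself need N paired runs, so the ablation multiplies the compute budget by the number of agents.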
## Current State
### Infrastructure: DONE
- 6 role agents with CentralPost hub-spoke communication
- Parallel deliberation (6 agents concurrently)
- SSE events wired into agent decisions
- Helix phase adaptation (exploration → analysis → synthesis)
- Paired benchmark (agent vs unmanaged baseline with save/load)
- Delta scoring with statistical tests (Cohen's d, Welch's t-test)
- Detailed colonist data (skills, traits, current job, needs)
- Dashboard with 5 RLE widgets
- Terminal helix visualizer
- Benchmark tracking (JSONL history, baselines, W&B, HuggingFace Hub)
- Provider-agnostic (local 4B, cloud 120B, Anthropic, OpenAI)
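The two statistics named in the delta-scoring item above can be computed with only the standard library; the sample scores below are invented for illustration. A p-value would additionally require the Welch–Satterthwaite degrees of freedom and a t-distribution CDF (e.g. from `scipy.stats`).

```python
import math
import statistics

def welch_t(a, b):
    """Welch's t statistic: compares two means without assuming equal variance."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(
        va / len(a) + vb / len(b)
    )

def cohens_d(a, b):
    """Cohen's d effect size, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (statistics.mean(a) - statistics.mean(b)) / pooled

# Invented composite scores for an N=4 paired run:
agent_scores = [0.84, 0.88, 0.81, 0.86]
baseline_scores = [0.74, 0.77, 0.73, 0.76]
print(f"t = {welch_t(agent_scores, baseline_scores):.2f}, "
      f"d = {cohens_d(agent_scores, baseline_scores):.2f}")
```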
### Agent quality: IN PROGRESS
- Agents beat baseline for the first time (+0.018, N=2, not yet significant)
- N=4 run with detailed colonist data for statistical significance
- Harder scenario saves where baseline struggles more (Create benchmark save files for all 6 scenarios #7)
- Ablation (remove one agent, measure contribution)
### Multi-model comparison: NOT STARTED
- Run same scenario with 4+ different models
- Build the leaderboard table
- Identify which models are best at which scenarios
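The leaderboard table could be rendered straight from a model × scenario delta matrix. The function name, input layout, and numbers here are all illustrative; the real matrix would carry p-values and effect sizes alongside each delta.

```python
def leaderboard(results):
    """Render a model x scenario matrix of mean deltas as a GitHub-flavored
    markdown table. `results` maps (model, scenario) -> mean delta."""
    models = sorted({m for m, _ in results})
    scenarios = sorted({s for _, s in results})
    lines = [
        "| Model | " + " | ".join(scenarios) + " |",
        "|---" * (len(scenarios) + 1) + "|",
    ]
    for m in models:
        cells = [f"{results.get((m, s), float('nan')):+.3f}" for s in scenarios]
        lines.append(f"| {m} | " + " | ".join(cells) + " |")
    return "\n".join(lines)

# Invented deltas, for shape only:
demo = {("claude", "crashlanded"): 0.10, ("gpt-4o", "crashlanded"): 0.04}
print(leaderboard(demo))
```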
## Milestones
### M1: Agents consistently beat baseline (#6)
- N=4 paired runs on Crashlanded with positive delta (p < 0.05)
- Detailed colonist data improving agent decisions
- Target: Agent 0.85 vs Baseline 0.75 → delta +0.10
### M2: Multi-scenario benchmark suite (#7)
- 4+ scenario save files created (Crashlanded, First Winter, Raid Defense, Plague Response)
- Paired results across scenarios showing agents help more on harder scenarios
- FLE parallel: like their 24 lab-play tasks but with RimWorld's stochastic challenge progression
- Target: positive delta on at least 4/6 scenarios
### M3: Multi-model leaderboard
- Run the benchmark suite with 4+ models:
  - Nemotron Nano 4B (local, free)
  - Nemotron Super 120B (OpenRouter, ~$0.09/run)
  - Claude Sonnet (Anthropic)
  - GPT-4o (OpenAI)
  - Qwen3.5-9B (local, free)
- FLE parallel: their 6-model comparison table but with delta scores instead of pass rates
- Publish: model × scenario matrix with delta, p-value, effect size
- Target: Clear differentiation between models
### M4: Public release
- Results on HuggingFace Hub (appsprout/rle-benchmarks)
- Blog post / Twitter thread with the leaderboard
- Dashboard recording / livestream VOD
- README with full reproduction instructions
- FLE parallel: their open-source release with reproducible evaluation
### M5: Paper
- FLE-style methodology: environment description, agent architecture, evaluation protocol
- Results table: model × scenario × metric
- Ablation study: per-agent contribution
- Human baseline comparison (3-5 expert RimWorld players)
- Novel contributions over FLE: multi-agent coordination, stochastic environment, paired baseline, human comparison
- Target venue: NeurIPS workshop, AAAI, or standalone arXiv
## Success Criteria
- The leaderboard exists with 4+ models showing statistically significant differences
- At least one scenario where agents demonstrably save a colony that would otherwise fail
- People share the results because they offer a fun, intuitive way to compare LLM capabilities
- The methodology is rigorous enough that researchers take it seriously
## Timeline
No fixed date. Quality over speed. Momentum-dependent.