Migrate to SKILL.md format, merge Droid/Factory targets #14
justinmoon wants to merge 2 commits into master
Conversation
Replace per-target wrapper format (commands/rally.md, prompts/rally.md) with unified SKILL.md using YAML frontmatter across all agents. Remove rally-target field (skills are target-independent). Add legacy path cleanup on install/uninstall. Add python3 to devShell and concurrent e2e_matrix.py test harness with wave scheduling. Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
📝 Walkthrough
This pull request refactors the wrapper-based harness architecture to a skill-based approach, removes the Factory variant from command targets, updates test infrastructure with a new Python-based e2e matrix runner, adds Python 3 to the development environment, and updates test artifacts to reflect the new directory structure and file layout.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant CLI as CLI Entry
    participant Orchestrator as Test Orchestrator
    participant AgentLauncher as Agent Launcher
    participant StatePoller as State Poller
    participant Validator as Output Validator
    participant Reporter as Result Reporter

    CLI->>Orchestrator: main(solo, pairs, agents, timeout, jobs)
    Orchestrator->>Orchestrator: Filter available agents

    alt Solo Tests Mode
        loop For each agent
            Orchestrator->>AgentLauncher: launch_agent(agent, prompt, logfile)
            AgentLauncher->>AgentLauncher: Start process (stdin or exec mode)
            AgentLauncher-->>Orchestrator: Popen object
            Orchestrator->>StatePoller: Poll read_state(session_name)
            StatePoller-->>Orchestrator: Session state JSON
            Orchestrator->>Validator: Validate output file content
            Validator-->>Orchestrator: TestResult (pass/fail)
        end
    end

    alt Pair Tests Mode
        loop For each pair wave
            Orchestrator->>AgentLauncher: launch_agent(impl_agent, prompt, logfile)
            AgentLauncher-->>Orchestrator: Implementer process
            Orchestrator->>StatePoller: Poll state with phase trace
            StatePoller-->>Orchestrator: Phase transition data
            Orchestrator->>AgentLauncher: launch_agent(revw_agent, prompt, logfile)
            AgentLauncher-->>Orchestrator: Reviewer process
            Orchestrator->>Validator: Validate handoff pattern + output
            Validator-->>Orchestrator: TestResult (pass/fail)
        end
    end

    Orchestrator->>Reporter: Aggregate results
    Reporter->>Reporter: Format summary + failure details
    Reporter-->>CLI: Exit with status
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
```python
# Sort by name for stable output
results.sort(key=lambda r: r.name)
print_results(results)
sys.exit(0 if all(r.passed for r in results) else 1)
```
🔴 Test harness exits 0 (success) when all tests raise exceptions
The collect function catches exceptions from test futures and prints them but never appends a failed result to the results list. If every test raises an exception (e.g., due to broken infrastructure), results remains empty. At line 461, all(r.passed for r in results) is evaluated — but all() on an empty iterable returns True in Python, so the script exits with code 0 (success).
Root Cause
In collect() at tests/e2e_matrix.py:410-411, exceptions are caught and printed but no TestResult is appended to results. This means a test that crashes is silently dropped from the result set.
At line 461:

```python
sys.exit(0 if all(r.passed for r in results) else 1)
```

When results is empty (every test raised an exception), all([]) returns True, and the script reports success.
Impact: A completely broken test environment (e.g., agent binaries crash on startup) would be reported as a passing test run, masking real failures.
```diff
-sys.exit(0 if all(r.passed for r in results) else 1)
+sys.exit(0 if results and all(r.passed for r in results) else 1)
```
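The vacuous-truth behavior behind this finding can be checked directly; the snippet below is a standalone illustration, not code from the harness:

```python
# all() over an empty iterable is vacuously True, so an empty
# results list would be reported as a passing test run.
results = []
assert all(r for r in results) is True

# Guarding with a non-empty check makes a zero-result run fail:
exit_code = 0 if results and all(r for r in results) else 1
assert exit_code == 1
```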
```python
proc = subprocess.Popen(
    cmd,
    stdin=subprocess.PIPE,
    stdout=open(logfile, "w"),
```
🟡 File handle leak: open() passed directly to Popen without closing
In launch_agent, open(logfile, "w") is passed directly as the stdout argument to subprocess.Popen without being stored in a variable. The file handle is never explicitly closed by the parent process.
Detailed Explanation
At tests/e2e_matrix.py:105 and tests/e2e_matrix.py:115:
```python
stdout=open(logfile, "w"),
```

The open() call creates a file object that is passed to Popen, but the reference is immediately lost. While CPython's reference counting will eventually close these when garbage collected, this is not guaranteed (especially in other Python implementations), and with many concurrent tests, file descriptors could accumulate.
Impact: Leaked file handles per agent launch. In practice limited by the small number of tests, but violates resource management best practices and could cause issues if the test matrix grows.
Prompt for agents
In tests/e2e_matrix.py, the launch_agent function at lines 102-118 passes open(logfile, "w") directly to subprocess.Popen's stdout parameter without storing the file handle. This causes a file handle leak. Fix both occurrences (lines 105 and 115) by storing the file handle in a variable and returning it alongside the process so it can be closed later, or by using a context manager pattern. For example, store the file object and close it after proc.terminate()/proc.kill() in the calling functions run_solo_test and run_pair_test.
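A minimal sketch of the fix the prompt describes, keeping the log handle so the parent can close it deterministically. `launch_logged` is an illustrative stand-in for the harness's `launch_agent`, not its actual signature:

```python
import subprocess
import sys
import tempfile

def launch_logged(cmd, logfile):
    # Keep a reference to the log handle instead of discarding it,
    # so the caller can close it after the child process exits.
    log = open(logfile, "w")
    try:
        proc = subprocess.Popen(cmd, stdout=log, stderr=subprocess.STDOUT)
    except OSError:
        log.close()  # do not leak the descriptor if spawning fails
        raise
    return proc, log

with tempfile.NamedTemporaryFile(suffix=".log", delete=False) as tmp:
    logfile = tmp.name

proc, log = launch_logged([sys.executable, "-c", "print('hello')"], logfile)
proc.wait()
log.close()  # deterministic cleanup, no reliance on garbage collection

with open(logfile) as f:
    assert "hello" in f.read()
```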
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/command_surface.rs`:
- Around line 776-793: The branch that returns
UninstallAction::SkippedWrapperMissing when !skill.exists() should still remove
legacy wrapper files; update that arm so it iterates target.legacy_paths(&root)
and attempts to fs::remove_file for each existing legacy path before returning
UninstallAction::SkippedWrapperMissing. Use the same cleanup logic as in the
ManagedOwnership::ManagedForTarget case (checking legacy.exists() and ignoring
errors), referencing skill, target, root, and target.legacy_paths(&root) so old
wrapper-only installs are cleaned up even when SKILL.md is missing.
In `@tests/e2e_matrix.py`:
- Around line 371-372: The bug is that passing both
parser.add_argument("--solo", ...) and parser.add_argument("--pairs", ...) can
cause both phases to be skipped leaving results empty and the test run to pass;
fix by making the flags mutually exclusive (use
argparse.add_mutually_exclusive_group) or add a post-parse validation that
checks if args.solo and args.pairs are both True and then exit non‑zero with a
clear error; ensure the validation covers all places where these flags affect
execution (the code paths that build/execute phases and the variable results) so
the script fails fast instead of returning success when zero tests ran.
- Around line 402-412: When a future raises inside collect(futures_map) the
exception is only logged and no failure is recorded; modify the except block to
append a failing test record to the shared results list so the run will exit
non‑zero. Specifically, in collect(futures_map) (and the other identical collect
instance), construct and append a TestResult-like object (or dict) with fields
matching how results entries are used (e.g., name/test_name, passed=False,
duration=0 or computed, phase="error", plus the exception message) so downstream
code treats the future as a failed test and summary/exit code reflect the
failure.
- Line 377: The CLI --jobs argument is defined but not used; replace the
hardcoded worker/agent/wave sizes with the parsed value (args.jobs) so
concurrency matches the flag. Locate the hardcoded usages (the places that set
worker counts for agents/waves — e.g., where variables like
agent_count/wave_size/workers are set or where thread/process pools are
constructed) and change them to use args.jobs (or a computed value derived from
it, e.g., min(args.jobs, some_limit) or max(1, args.jobs) if needed), ensuring
all three occurrences noted (around the current agent/wave sizing at the three
blocks referenced) consistently read from the parser value instead of literals.
ℹ️ Review info
Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 1e37d9c1-ea66-4f23-bee0-67cd079ff9da
📒 Files selected for processing (7)
- .gitignore
- flake.nix
- justfile
- src/cli.rs
- src/command_surface.rs
- tests/command_install_run.rs
- tests/e2e_matrix.py
💤 Files with no reviewable changes (1)
- src/cli.rs
```diff
 } else if !skill.exists() {
     UninstallAction::SkippedWrapperMissing
 } else {
-    let bytes = fs::read(&wrapper)
-        .with_context(|| format!("reading wrapper {}", wrapper.display()))?;
+    let bytes = fs::read(&skill)
+        .with_context(|| format!("reading skill {}", skill.display()))?;
     match managed_ownership(&bytes, target) {
         ManagedOwnership::ManagedForTarget => {
             if args.dry_run {
                 UninstallAction::WouldRemoveManaged
             } else {
-                fs::remove_file(&wrapper)
-                    .with_context(|| format!("removing wrapper {}", wrapper.display()))?;
+                fs::remove_file(&skill)
+                    .with_context(|| format!("removing skill {}", skill.display()))?;
+                for legacy in target.legacy_paths(&root) {
+                    if legacy.exists() {
+                        let _ = fs::remove_file(&legacy);
+                    }
+                }
                 UninstallAction::RemovedManaged
```
Legacy wrapper cleanup is skipped when SKILL.md is missing.
On Line 776, uninstall returns SkippedWrapperMissing before any legacy cleanup. That leaves old managed wrapper files behind for wrapper-only installs.
Suggested fix
```diff
 fn uninstall_with_environment(
     args: &CommandUninstallArgs,
     env: &AdapterEnvironment,
 ) -> Result<Vec<UninstallOutcome>> {
 @@
     for target in selected_targets(args.target) {
         let root = env.root_for(target)?;
         let skill = target.skill_path(&root);
+
+        if root.exists() && root.is_dir() && !args.dry_run {
+            cleanup_legacy_managed_files(target, &root)?;
+        }
+
         let action = if !root.exists() {
             UninstallAction::SkippedEnvironmentMissing
         } else if !root.is_dir() {
             UninstallAction::SkippedInvalidEnvironment
         } else if !skill.exists() {
             UninstallAction::SkippedWrapperMissing
 @@
             ManagedOwnership::ManagedForTarget => {
                 if args.dry_run {
                     UninstallAction::WouldRemoveManaged
                 } else {
                     fs::remove_file(&skill)
                         .with_context(|| format!("removing skill {}", skill.display()))?;
-                    for legacy in target.legacy_paths(&root) {
-                        if legacy.exists() {
-                            let _ = fs::remove_file(&legacy);
-                        }
-                    }
                     UninstallAction::RemovedManaged
                 }
             }

+fn cleanup_legacy_managed_files(target: HarnessTarget, root: &Path) -> Result<()> {
+    for legacy in target.legacy_paths(root) {
+        if !legacy.exists() {
+            continue;
+        }
+        let bytes = fs::read(&legacy)
+            .with_context(|| format!("reading legacy wrapper {}", legacy.display()))?;
+        if is_rally_managed(&bytes) {
+            fs::remove_file(&legacy)
+                .with_context(|| format!("removing legacy wrapper {}", legacy.display()))?;
+        }
+    }
+    Ok(())
+}
```
+}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/command_surface.rs` around lines 776 - 793, The branch that returns
UninstallAction::SkippedWrapperMissing when !skill.exists() should still remove
legacy wrapper files; update that arm so it iterates target.legacy_paths(&root)
and attempts to fs::remove_file for each existing legacy path before returning
UninstallAction::SkippedWrapperMissing. Use the same cleanup logic as in the
ManagedOwnership::ManagedForTarget case (checking legacy.exists() and ignoring
errors), referencing skill, target, root, and target.legacy_paths(&root) so old
wrapper-only installs are cleaned up even when SKILL.md is missing.
```python
parser.add_argument("--solo", action="store_true", help="solo tests only")
parser.add_argument("--pairs", action="store_true", help="pair tests only")
```
--solo --pairs can run zero tests and still pass.
When both flags are set, both phases are skipped, results stays empty, and exit status can be success.
Suggested fix
```diff
-parser.add_argument("--solo", action="store_true", help="solo tests only")
-parser.add_argument("--pairs", action="store_true", help="pair tests only")
+mode = parser.add_mutually_exclusive_group()
+mode.add_argument("--solo", action="store_true", help="solo tests only")
+mode.add_argument("--pairs", action="store_true", help="pair tests only")
```

Also applies to: 414-425, 458-461
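The mutually-exclusive-group approach can be sketched standalone; the flag names mirror the harness CLI, but the parser below is self-contained:

```python
import argparse

parser = argparse.ArgumentParser()
mode = parser.add_mutually_exclusive_group()
mode.add_argument("--solo", action="store_true", help="solo tests only")
mode.add_argument("--pairs", action="store_true", help="pair tests only")

# A single flag parses normally:
args = parser.parse_args(["--solo"])
assert args.solo and not args.pairs

# argparse rejects the contradictory combination and exits with status 2,
# so the run fails fast instead of silently executing zero tests.
both_rejected = False
try:
    parser.parse_args(["--solo", "--pairs"])
except SystemExit as exc:
    both_rejected = exc.code == 2
assert both_rejected
```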
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/e2e_matrix.py` around lines 371 - 372, The bug is that passing both
parser.add_argument("--solo", ...) and parser.add_argument("--pairs", ...) can
cause both phases to be skipped leaving results empty and the test run to pass;
fix by making the flags mutually exclusive (use
argparse.add_mutually_exclusive_group) or add a post-parse validation that
checks if args.solo and args.pairs are both True and then exit non‑zero with a
clear error; ensure the validation covers all places where these flags affect
execution (the code paths that build/execute phases and the variable results) so
the script fails fast instead of returning success when zero tests ran.
```python
parser.add_argument("--revw", help="specific reviewer agent")
parser.add_argument("--timeout", type=int, default=420, help="per-test timeout (default 420)")
parser.add_argument("--solo-timeout", type=int, default=120, help="solo test timeout (default 120)")
parser.add_argument("--jobs", type=int, default=4, help="max concurrent tests (default 4)")
```
--jobs is currently ignored.
Line 377 advertises max concurrency, but worker counts are hardcoded to agent/wave sizes.
Suggested fix
```diff
-with ThreadPoolExecutor(max_workers=len(agents)) as pool:
+solo_workers = max(1, min(args.jobs, len(agents)))
+with ThreadPoolExecutor(max_workers=solo_workers) as pool:
@@
-with ThreadPoolExecutor(max_workers=len(wave)) as pool:
+wave_workers = max(1, min(args.jobs, len(wave)))
+with ThreadPoolExecutor(max_workers=wave_workers) as pool:
```

Also applies to: 416-417, 451-452
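The clamping pattern in the suggested fix can be demonstrated in isolation (`worker_count` is an illustrative helper, not part of the harness):

```python
from concurrent.futures import ThreadPoolExecutor

def worker_count(jobs, tasks):
    # Never below one worker, never more workers than tasks,
    # and never more than the --jobs flag allows.
    return max(1, min(jobs, tasks))

assert worker_count(4, 10) == 4  # capped by --jobs
assert worker_count(4, 2) == 2   # no idle workers beyond the task count
assert worker_count(0, 5) == 1   # degenerate flag value still runs

with ThreadPoolExecutor(max_workers=worker_count(4, 10)) as pool:
    squares = list(pool.map(lambda x: x * x, range(10)))
assert squares == [x * x for x in range(10)]
```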
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/e2e_matrix.py` at line 377, The CLI --jobs argument is defined but not
used; replace the hardcoded worker/agent/wave sizes with the parsed value
(args.jobs) so concurrency matches the flag. Locate the hardcoded usages (the
places that set worker counts for agents/waves — e.g., where variables like
agent_count/wave_size/workers are set or where thread/process pools are
constructed) and change them to use args.jobs (or a computed value derived from
it, e.g., min(args.jobs, some_limit) or max(1, args.jobs) if needed), ensuring
all three occurrences noted (around the current agent/wave sizing at the three
blocks referenced) consistently read from the parser value instead of literals.
```python
def collect(futures_map):
    for future in as_completed(futures_map):
        test_name = futures_map[future]
        try:
            result = future.result()
            results.append(result)
            status = "PASS" if result.passed else "FAIL"
            print(f"  [{status}] {test_name} ({result.duration:.0f}s) {result.phase}")
        except Exception as e:
            print(f"  [ERROR] {test_name}: {e}")
```
Future exceptions are not counted as test failures.
If a future raises, collect() only logs it. The run can still exit 0 because failed futures are not represented in results.
Suggested fix
```diff
 def main():
 @@
-    # Build test list
-    futures_map = {}
-    results = []
+    # Build test list
+    futures_map = {}
+    results = []
+    had_future_errors = False
 @@
     def collect(futures_map):
+        nonlocal had_future_errors
         for future in as_completed(futures_map):
             test_name = futures_map[future]
             try:
                 result = future.result()
                 results.append(result)
                 status = "PASS" if result.passed else "FAIL"
                 print(f"  [{status}] {test_name} ({result.duration:.0f}s) {result.phase}")
             except Exception as e:
                 print(f"  [ERROR] {test_name}: {e}")
+                had_future_errors = True
+                results.append(TestResult(
+                    name=test_name,
+                    impl_agent="unknown",
+                    revw_agent=None,
+                    session="",
+                    phase="error",
+                    agents_joined=[],
+                    file_content=None,
+                    duration=0.0,
+                    passed=False,
+                    failure_reason=str(e),
+                    errors=[str(e)],
+                ))
 @@
-    sys.exit(0 if all(r.passed for r in results) else 1)
+    sys.exit(0 if results and all(r.passed for r in results) and not had_future_errors else 1)
```

Also applies to: 461-461
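The record-the-crash pattern can be shown standalone; a dict stands in for the harness's TestResult dataclass, and `broken_test` simulates an agent that crashes on startup:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def broken_test():
    raise RuntimeError("agent binary missing")

results = []
with ThreadPoolExecutor(max_workers=2) as pool:
    futures_map = {pool.submit(broken_test): "solo-droid"}
    for future in as_completed(futures_map):
        test_name = futures_map[future]
        try:
            future.result()
            results.append({"name": test_name, "passed": True})
        except Exception as e:
            # Append a failing record instead of only logging the error,
            # so the crash is represented in the result set.
            results.append({"name": test_name, "passed": False, "error": str(e)})

# The crash now surfaces as a failure instead of vanishing:
assert results and not results[0]["passed"]
exit_code = 0 if results and all(r["passed"] for r in results) else 1
assert exit_code == 1
```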
🧰 Tools
🪛 Ruff (0.15.2)
[warning] 410-410: Do not catch blind exception: Exception (BLE001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/e2e_matrix.py` around lines 402 - 412, When a future raises inside
collect(futures_map) the exception is only logged and no failure is recorded;
modify the except block to append a failing test record to the shared results
list so the run will exit non‑zero. Specifically, in collect(futures_map) (and
the other identical collect instance), construct and append a TestResult-like
object (or dict) with fields matching how results entries are used (e.g.,
name/test_name, passed=False, duration=0 or computed, phase="error", plus the
exception message) so downstream code treats the future as a failed test and
summary/exit code reflect the failure.
Superseded by #15, which includes these changes plus additional refactoring.
Changes

- **SKILL.md format:** Replace per-target wrapper formats (`commands/rally.md`, `prompts/rally.md`) with a unified `skills/rally/SKILL.md` using YAML frontmatter. All 4 agents (Claude, Codex, Pi, Droid) now use the same skill content.
- **Merge Droid/Factory targets:** Remove the separate `Factory` target. `Droid` now points at `~/.factory` (where the `droid` binary actually reads skills from). Fixes the solo-droid e2e test.
- **Legacy cleanup:** Install/uninstall automatically removes old command/prompt wrapper files.
- **Concurrent e2e test matrix (`tests/e2e_matrix.py`):** Python test harness that runs solo and pair agent tests concurrently with wave scheduling (no agent used in two tests simultaneously). Added a `just test-matrix` recipe and `python3` to the `flake.nix` devShell.
- **Increased default pair timeout** to 420s to reduce flaky failures.
E2e matrix results
Summary by CodeRabbit
Release Notes
New Features
Chores