Migrate to SKILL.md format, merge Droid/Factory targets #14

Closed

justinmoon wants to merge 2 commits into master from skill-md-migration

Conversation

justinmoon (Contributor) commented Mar 4, 2026

Changes

  • SKILL.md format: Replace per-target wrapper formats (commands/rally.md, prompts/rally.md) with unified skills/rally/SKILL.md using YAML frontmatter. All 4 agents (Claude, Codex, Pi, Droid) now use the same skill content.

  • Merge Droid/Factory targets: Remove the separate Factory target. Droid now points at ~/.factory (where the droid binary actually reads skills from). Fixes solo-droid e2e test.

  • Legacy cleanup: Install/uninstall automatically removes old command/prompt wrapper files.

  • Concurrent e2e test matrix (tests/e2e_matrix.py): Python test harness that runs solo and pair agent tests concurrently with wave scheduling (no agent used in two tests simultaneously). Added just test-matrix recipe, python3 in flake.nix devShell.

  • Increased default pair timeout to 420s to reduce flaky failures.
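The wave scheduling mentioned above (no agent used in two tests simultaneously) can be sketched as a greedy grouping pass. This is an illustrative reconstruction, not the harness's actual code; `schedule_waves` and the `(name, agents)` tuple shape are assumptions:

```python
from itertools import chain

def schedule_waves(tests):
    """Greedily pack (name, agents) tests into waves with disjoint agent sets."""
    waves = []
    for name, agents in tests:
        needed = set(agents)
        for wave in waves:
            # Agents already claimed by tests in this wave
            used = set(chain.from_iterable(a for _, a in wave))
            if needed.isdisjoint(used):
                wave.append((name, agents))
                break
        else:
            # No existing wave can take this test without an agent conflict
            waves.append([(name, agents)])
    return waves

pairs = [
    ("pair-claude+pi", ("claude", "pi")),
    ("pair-pi+claude", ("pi", "claude")),
    ("pair-claude+codex", ("claude", "codex")),
]
for i, wave in enumerate(schedule_waves(pairs), 1):
    print(f"wave {i}: {[name for name, _ in wave]}")
```

Because every pair here shares an agent, the three pair tests land in three separate waves, while independent solo tests would share a wave.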

E2e matrix results

Test Result Time
solo-claude PASS 30s
solo-codex PASS 25s
solo-pi PASS 25s
solo-droid PASS 50s
pair-claude+pi PASS 238s
pair-pi+claude PASS 98s
pair-claude+codex FAIL (timeout) 428s


Summary by CodeRabbit

Release Notes

  • New Features

    • Added concurrent e2e test matrix runner for streamlined multi-agent testing with improved orchestration and reporting.
  • Chores

    • Updated development environment to include Python 3.
    • Reorganized project directory structure and file layouts for consistency.
    • Refined test artifact handling and project configuration.

justinmoon and others added 2 commits March 3, 2026 21:07
Replace per-target wrapper format (commands/rally.md, prompts/rally.md)
with unified SKILL.md using YAML frontmatter across all agents. Remove
rally-target field (skills are target-independent). Add legacy path
cleanup on install/uninstall. Add python3 to devShell and concurrent
e2e_matrix.py test harness with wave scheduling.

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
coderabbitai bot commented Mar 4, 2026

📝 Walkthrough

Walkthrough

This pull request refactors the wrapper-based harness architecture to a skill-based approach, removes the Factory variant from command targets, updates test infrastructure with a new Python-based e2e matrix runner, adds Python 3 to the development environment, and updates test artifacts to reflect the new directory structure and file layout.

Changes

  • Configuration & Environment — .gitignore, flake.nix: Added gitignore rules for test artifact files (solo-*.txt, e2e-*.txt, skill-*.txt, pair-*.txt, todo-run-cmd-*.md) and included Python 3 in the development shell.

  • Test Infrastructure — justfile, tests/e2e_matrix.py: Replaced the old e2e test target with test-matrix, which accepts arguments and runs concurrent Rally tests; introduced a new Python-based e2e matrix runner supporting solo/pair tests across multiple agents with configurable concurrency and timeout.

  • Core Harness Refactoring — src/cli.rs, src/command_surface.rs: Removed the Factory enum variant from CommandTargetArg and HarnessTarget; renamed wrapper_path to skill_path; replaced wrapper-based (prompts/commands) file paths with a skill-based SKILL.md layout; consolidated metadata methods into is_rally_managed; refactored wrapper rendering to emit YAML frontmatter; updated the default Droid home from .droid to .factory; introduced a legacy_paths helper for cleanup.

  • Integration Test Updates — tests/command_install_run.rs: Updated test structure to use the .factory directory and skills/rally/SKILL.md paths; adjusted target iteration, removed explicit factory references, and updated assertions to verify rally-managed frontmatter instead of wrapper-target metadata.

Sequence Diagram(s)

sequenceDiagram
    participant CLI as CLI Entry
    participant Orchestrator as Test Orchestrator
    participant AgentLauncher as Agent Launcher
    participant StatePoller as State Poller
    participant Validator as Output Validator
    participant Reporter as Result Reporter

    CLI->>Orchestrator: main(solo, pairs, agents, timeout, jobs)
    Orchestrator->>Orchestrator: Filter available agents
    
    alt Solo Tests Mode
        loop For each agent
            Orchestrator->>AgentLauncher: launch_agent(agent, prompt, logfile)
            AgentLauncher->>AgentLauncher: Start process (stdin or exec mode)
            AgentLauncher-->>Orchestrator: Popen object
            Orchestrator->>StatePoller: Poll read_state(session_name)
            StatePoller-->>Orchestrator: Session state JSON
            Orchestrator->>Validator: Validate output file content
            Validator-->>Orchestrator: TestResult (pass/fail)
        end
    end
    
    alt Pair Tests Mode
        loop For each pair wave
            Orchestrator->>AgentLauncher: launch_agent(impl_agent, prompt, logfile)
            AgentLauncher-->>Orchestrator: Implementer process
            Orchestrator->>StatePoller: Poll state with phase trace
            StatePoller-->>Orchestrator: Phase transition data
            Orchestrator->>AgentLauncher: launch_agent(revw_agent, prompt, logfile)
            AgentLauncher-->>Orchestrator: Reviewer process
            Orchestrator->>Validator: Validate handoff pattern + output
            Validator-->>Orchestrator: TestResult (pass/fail)
        end
    end
    
    Orchestrator->>Reporter: Aggregate results
    Reporter->>Reporter: Format summary + failure details
    Reporter-->>CLI: Exit with status

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

A rabbit's tale of paths refined,
From wrapper-wrapped to skill-aligned,
The factory now rests, no more in sight,
While YAML frontiers shine so bright.
Tests dance concurrent, e2e in flight—
Code hops forward, structured right! 🐰✨

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 warning

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 35.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Description Check — ✅ Passed: check skipped because CodeRabbit’s high-level summary is enabled.
  • Title Check — ✅ Passed: the title 'Migrate to SKILL.md format, merge Droid/Factory targets' accurately reflects the two main objectives of the PR: migrating to SKILL.md format and consolidating the Droid/Factory targets.



devin-ai-integration bot left a comment

Devin Review found 2 potential issues.

View 6 additional findings in Devin Review.


# Sort by name for stable output
results.sort(key=lambda r: r.name)
print_results(results)
sys.exit(0 if all(r.passed for r in results) else 1)

🔴 Test harness exits 0 (success) when all tests raise exceptions

The collect function catches exceptions from test futures and prints them but never appends a failed result to the results list. If every test raises an exception (e.g., due to broken infrastructure), results remains empty. At line 461, all(r.passed for r in results) is evaluated — but all() on an empty iterable returns True in Python, so the script exits with code 0 (success).

Root Cause

In collect() at tests/e2e_matrix.py:410-411, exceptions are caught and printed but no TestResult is appended to results. This means a test that crashes is silently dropped from the result set.

At line 461:

sys.exit(0 if all(r.passed for r in results) else 1)

When results is empty (all tests excepted), all([]) returns True, and the script reports success.

Impact: A completely broken test environment (e.g., agent binaries crash on startup) would be reported as a passing test run, masking real failures.

Suggested change
-sys.exit(0 if all(r.passed for r in results) else 1)
+sys.exit(0 if results and all(r.passed for r in results) else 1)
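The vacuous-truth behavior behind this finding is easy to reproduce in isolation:

```python
results = []  # every test raised, so nothing was ever appended

# all() over an empty iterable is vacuously True, so the original
# exit expression would report success with zero recorded results
print(all(r for r in results))

# guarding on non-emptiness distinguishes "no results" from "all passed"
print(bool(results) and all(r for r in results))
```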


proc = subprocess.Popen(
    cmd,
    stdin=subprocess.PIPE,
    stdout=open(logfile, "w"),

🟡 File handle leak: open() passed directly to Popen without closing

In launch_agent, open(logfile, "w") is passed directly as the stdout argument to subprocess.Popen without being stored in a variable. The file handle is never explicitly closed by the parent process.

Detailed Explanation

At tests/e2e_matrix.py:105 and tests/e2e_matrix.py:115:

stdout=open(logfile, "w"),

The open() call creates a file object that is passed to Popen but the reference is immediately lost. While CPython's reference counting will eventually close these when garbage collected, this is not guaranteed (especially in other Python implementations), and with many concurrent tests, file descriptors could accumulate.

Impact: Leaked file handles per agent launch. In practice limited by the small number of tests, but violates resource management best practices and could cause issues if the test matrix grows.

Prompt for agents
In tests/e2e_matrix.py, the launch_agent function at lines 102-118 passes open(logfile, "w") directly to subprocess.Popen's stdout parameter without storing the file handle. This causes a file handle leak. Fix both occurrences (lines 105 and 115) by storing the file handle in a variable and returning it alongside the process so it can be closed later, or by using a context manager pattern. For example, store the file object and close it after proc.terminate()/proc.kill() in the calling functions run_solo_test and run_pair_test.
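One shape the fix could take (a minimal sketch along the lines the reviewer suggests; the real `launch_agent` takes different arguments, and the signature below is hypothetical):

```python
import os
import subprocess
import tempfile

def launch_agent(cmd, logfile):
    # Keep a reference to the log handle so the caller can close it
    # deterministically instead of relying on garbage collection.
    log = open(logfile, "w")
    try:
        proc = subprocess.Popen(cmd, stdin=subprocess.PIPE,
                                stdout=log, stderr=subprocess.STDOUT)
    except Exception:
        log.close()  # don't leak the handle if Popen itself fails
        raise
    return proc, log

logfile = os.path.join(tempfile.mkdtemp(), "agent.log")
proc, log = launch_agent(["echo", "agent ready"], logfile)
proc.stdin.close()
proc.wait()
log.close()  # explicit close once the process is done
```

Returning the handle alongside the process lets `run_solo_test`/`run_pair_test` close it after `terminate()`/`kill()`, as the prompt suggests.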

coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/command_surface.rs`:
- Around line 776-793: The branch that returns
UninstallAction::SkippedWrapperMissing when !skill.exists() should still remove
legacy wrapper files; update that arm so it iterates target.legacy_paths(&root)
and attempts to fs::remove_file for each existing legacy path before returning
UninstallAction::SkippedWrapperMissing. Use the same cleanup logic as in the
ManagedOwnership::ManagedForTarget case (checking legacy.exists() and ignoring
errors), referencing skill, target, root, and target.legacy_paths(&root) so old
wrapper-only installs are cleaned up even when SKILL.md is missing.

In `@tests/e2e_matrix.py`:
- Around line 371-372: The bug is that passing both
parser.add_argument("--solo", ...) and parser.add_argument("--pairs", ...) can
cause both phases to be skipped leaving results empty and the test run to pass;
fix by making the flags mutually exclusive (use
argparse.add_mutually_exclusive_group) or add a post-parse validation that
checks if args.solo and args.pairs are both True and then exit non‑zero with a
clear error; ensure the validation covers all places where these flags affect
execution (the code paths that build/execute phases and the variable results) so
the script fails fast instead of returning success when zero tests ran.
- Around line 402-412: When a future raises inside collect(futures_map) the
exception is only logged and no failure is recorded; modify the except block to
append a failing test record to the shared results list so the run will exit
non‑zero. Specifically, in collect(futures_map) (and the other identical collect
instance), construct and append a TestResult-like object (or dict) with fields
matching how results entries are used (e.g., name/test_name, passed=False,
duration=0 or computed, phase="error", plus the exception message) so downstream
code treats the future as a failed test and summary/exit code reflect the
failure.
- Line 377: The CLI --jobs argument is defined but not used; replace the
hardcoded worker/agent/wave sizes with the parsed value (args.jobs) so
concurrency matches the flag. Locate the hardcoded usages (the places that set
worker counts for agents/waves — e.g., where variables like
agent_count/wave_size/workers are set or where thread/process pools are
constructed) and change them to use args.jobs (or a computed value derived from
it, e.g., min(args.jobs, some_limit) or max(1, args.jobs) if needed), ensuring
all three occurrences noted (around the current agent/wave sizing at the three
blocks referenced) consistently read from the parser value instead of literals.

ℹ️ Review info
Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1e37d9c1-ea66-4f23-bee0-67cd079ff9da

📥 Commits

Reviewing files that changed from the base of the PR and between 979a712 and 2f686ca.

📒 Files selected for processing (7)
  • .gitignore
  • flake.nix
  • justfile
  • src/cli.rs
  • src/command_surface.rs
  • tests/command_install_run.rs
  • tests/e2e_matrix.py
💤 Files with no reviewable changes (1)
  • src/cli.rs

Comment on lines +776 to 793

 } else if !skill.exists() {
     UninstallAction::SkippedWrapperMissing
 } else {
-    let bytes = fs::read(&wrapper)
-        .with_context(|| format!("reading wrapper {}", wrapper.display()))?;
+    let bytes = fs::read(&skill)
+        .with_context(|| format!("reading skill {}", skill.display()))?;
     match managed_ownership(&bytes, target) {
         ManagedOwnership::ManagedForTarget => {
             if args.dry_run {
                 UninstallAction::WouldRemoveManaged
             } else {
-                fs::remove_file(&wrapper)
-                    .with_context(|| format!("removing wrapper {}", wrapper.display()))?;
+                fs::remove_file(&skill)
+                    .with_context(|| format!("removing skill {}", skill.display()))?;
                 for legacy in target.legacy_paths(&root) {
                     if legacy.exists() {
                         let _ = fs::remove_file(&legacy);
                     }
                 }
                 UninstallAction::RemovedManaged

⚠️ Potential issue | 🟠 Major

Legacy wrapper cleanup is skipped when SKILL.md is missing.

On Line 776, uninstall returns SkippedWrapperMissing before any legacy cleanup. That leaves old managed wrapper files behind for wrapper-only installs.

Suggested fix
 fn uninstall_with_environment(
     args: &CommandUninstallArgs,
     env: &AdapterEnvironment,
 ) -> Result<Vec<UninstallOutcome>> {
@@
     for target in selected_targets(args.target) {
         let root = env.root_for(target)?;
         let skill = target.skill_path(&root);
+
+        if root.exists() && root.is_dir() && !args.dry_run {
+            cleanup_legacy_managed_files(target, &root)?;
+        }
+
         let action = if !root.exists() {
             UninstallAction::SkippedEnvironmentMissing
         } else if !root.is_dir() {
             UninstallAction::SkippedInvalidEnvironment
         } else if !skill.exists() {
             UninstallAction::SkippedWrapperMissing
@@
                 ManagedOwnership::ManagedForTarget => {
                     if args.dry_run {
                         UninstallAction::WouldRemoveManaged
                     } else {
                         fs::remove_file(&skill)
                             .with_context(|| format!("removing skill {}", skill.display()))?;
-                        for legacy in target.legacy_paths(&root) {
-                            if legacy.exists() {
-                                let _ = fs::remove_file(&legacy);
-                            }
-                        }
                         UninstallAction::RemovedManaged
                     }
                 }
+fn cleanup_legacy_managed_files(target: HarnessTarget, root: &Path) -> Result<()> {
+    for legacy in target.legacy_paths(root) {
+        if !legacy.exists() {
+            continue;
+        }
+        let bytes = fs::read(&legacy)
+            .with_context(|| format!("reading legacy wrapper {}", legacy.display()))?;
+        if is_rally_managed(&bytes) {
+            fs::remove_file(&legacy)
+                .with_context(|| format!("removing legacy wrapper {}", legacy.display()))?;
+        }
+    }
+    Ok(())
+}

Comment on lines +371 to +372
parser.add_argument("--solo", action="store_true", help="solo tests only")
parser.add_argument("--pairs", action="store_true", help="pair tests only")

⚠️ Potential issue | 🟡 Minor

--solo --pairs can run zero tests and still pass.

When both flags are set, both phases are skipped, results stays empty, and exit status can be success.

Suggested fix
-    parser.add_argument("--solo", action="store_true", help="solo tests only")
-    parser.add_argument("--pairs", action="store_true", help="pair tests only")
+    mode = parser.add_mutually_exclusive_group()
+    mode.add_argument("--solo", action="store_true", help="solo tests only")
+    mode.add_argument("--pairs", action="store_true", help="pair tests only")

Also applies to: 414-425, 458-461
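The suggested mutually exclusive group behaves as intended when exercised standalone (a minimal sketch of the fix, detached from the rest of the parser):

```python
import argparse

parser = argparse.ArgumentParser()
mode = parser.add_mutually_exclusive_group()
mode.add_argument("--solo", action="store_true", help="solo tests only")
mode.add_argument("--pairs", action="store_true", help="pair tests only")

print(parser.parse_args(["--solo"]).solo)  # a single flag parses normally

try:
    parser.parse_args(["--solo", "--pairs"])
except SystemExit:
    # argparse reports the conflict and exits non-zero
    # instead of silently running zero tests
    print("conflicting flags rejected")
```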


parser.add_argument("--revw", help="specific reviewer agent")
parser.add_argument("--timeout", type=int, default=420, help="per-test timeout (default 420)")
parser.add_argument("--solo-timeout", type=int, default=120, help="solo test timeout (default 120)")
parser.add_argument("--jobs", type=int, default=4, help="max concurrent tests (default 4)")

⚠️ Potential issue | 🟡 Minor

--jobs is currently ignored.

Line 377 advertises max concurrency, but worker counts are hardcoded to agent/wave sizes.

Suggested fix
-        with ThreadPoolExecutor(max_workers=len(agents)) as pool:
+        solo_workers = max(1, min(args.jobs, len(agents)))
+        with ThreadPoolExecutor(max_workers=solo_workers) as pool:
@@
-            with ThreadPoolExecutor(max_workers=len(wave)) as pool:
+            wave_workers = max(1, min(args.jobs, len(wave)))
+            with ThreadPoolExecutor(max_workers=wave_workers) as pool:

Also applies to: 416-417, 451-452


Comment on lines +402 to +412

def collect(futures_map):
    for future in as_completed(futures_map):
        test_name = futures_map[future]
        try:
            result = future.result()
            results.append(result)
            status = "PASS" if result.passed else "FAIL"
            print(f"  [{status}] {test_name} ({result.duration:.0f}s) {result.phase}")
        except Exception as e:
            print(f"  [ERROR] {test_name}: {e}")


⚠️ Potential issue | 🟠 Major

Future exceptions are not counted as test failures.

If a future raises, collect() only logs it. The run can still exit 0 because failed futures are not represented in results.

Suggested fix
 def main():
@@
-    # Build test list
-    futures_map = {}
-    results = []
+    # Build test list
+    futures_map = {}
+    results = []
+    had_future_errors = False
@@
-    def collect(futures_map):
+    def collect(futures_map):
+        nonlocal had_future_errors
         for future in as_completed(futures_map):
             test_name = futures_map[future]
             try:
                 result = future.result()
                 results.append(result)
                 status = "PASS" if result.passed else "FAIL"
                 print(f"  [{status}] {test_name} ({result.duration:.0f}s) {result.phase}")
             except Exception as e:
                 print(f"  [ERROR] {test_name}: {e}")
+                had_future_errors = True
+                results.append(TestResult(
+                    name=test_name,
+                    impl_agent="unknown",
+                    revw_agent=None,
+                    session="",
+                    phase="error",
+                    agents_joined=[],
+                    file_content=None,
+                    duration=0.0,
+                    passed=False,
+                    failure_reason=str(e),
+                    errors=[str(e)],
+                ))
@@
-    sys.exit(0 if all(r.passed for r in results) else 1)
+    sys.exit(0 if results and all(r.passed for r in results) and not had_future_errors else 1)

Also applies to: 461-461

🧰 Tools
🪛 Ruff (0.15.2)

[warning] 410-410: Do not catch blind exception: Exception

(BLE001)


justinmoon (Contributor, Author) commented:

Superseded by #15, which includes these changes plus additional refactoring.

@justinmoon justinmoon closed this Mar 4, 2026