Migrate to SKILL.md format, merge Droid/Factory targets #14
justinmoon wants to merge 2 commits into master
Conversation
Replace per-target wrapper format (commands/rally.md, prompts/rally.md) with unified SKILL.md using YAML frontmatter across all agents. Remove rally-target field (skills are target-independent). Add legacy path cleanup on install/uninstall. Add python3 to devShell and concurrent e2e_matrix.py test harness with wave scheduling. Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
📝 Walkthrough
This pull request refactors the wrapper-based harness architecture to a skill-based approach, removes the Factory variant from command targets, updates test infrastructure with a new Python-based e2e matrix runner, adds Python 3 to the development environment, and updates test artifacts to reflect the new directory structure and file layout.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant CLI as CLI Entry
    participant Orchestrator as Test Orchestrator
    participant AgentLauncher as Agent Launcher
    participant StatePoller as State Poller
    participant Validator as Output Validator
    participant Reporter as Result Reporter

    CLI->>Orchestrator: main(solo, pairs, agents, timeout, jobs)
    Orchestrator->>Orchestrator: Filter available agents

    alt Solo Tests Mode
        loop For each agent
            Orchestrator->>AgentLauncher: launch_agent(agent, prompt, logfile)
            AgentLauncher->>AgentLauncher: Start process (stdin or exec mode)
            AgentLauncher-->>Orchestrator: Popen object
            Orchestrator->>StatePoller: Poll read_state(session_name)
            StatePoller-->>Orchestrator: Session state JSON
            Orchestrator->>Validator: Validate output file content
            Validator-->>Orchestrator: TestResult (pass/fail)
        end
    end

    alt Pair Tests Mode
        loop For each pair wave
            Orchestrator->>AgentLauncher: launch_agent(impl_agent, prompt, logfile)
            AgentLauncher-->>Orchestrator: Implementer process
            Orchestrator->>StatePoller: Poll state with phase trace
            StatePoller-->>Orchestrator: Phase transition data
            Orchestrator->>AgentLauncher: launch_agent(revw_agent, prompt, logfile)
            AgentLauncher-->>Orchestrator: Reviewer process
            Orchestrator->>Validator: Validate handoff pattern + output
            Validator-->>Orchestrator: TestResult (pass/fail)
        end
    end

    Orchestrator->>Reporter: Aggregate results
    Reporter->>Reporter: Format summary + failure details
    Reporter-->>CLI: Exit with status
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
```python
# Sort by name for stable output
results.sort(key=lambda r: r.name)
print_results(results)
sys.exit(0 if all(r.passed for r in results) else 1)
```
🔴 Test harness exits 0 (success) when all tests raise exceptions
The collect function catches exceptions from test futures and prints them but never appends a failed result to the results list. If every test raises an exception (e.g., due to broken infrastructure), results remains empty. At line 461, all(r.passed for r in results) is evaluated — but all() on an empty iterable returns True in Python, so the script exits with code 0 (success).
Root Cause
In collect() at tests/e2e_matrix.py:410-411, exceptions are caught and printed but no TestResult is appended to results. This means a test that crashes is silently dropped from the result set.
At line 461:

```python
sys.exit(0 if all(r.passed for r in results) else 1)
```

When results is empty (every test raised an exception), all([]) returns True, and the script reports success.
Impact: A completely broken test environment (e.g., agent binaries crash on startup) would be reported as a passing test run, masking real failures.
```diff
-sys.exit(0 if all(r.passed for r in results) else 1)
+sys.exit(0 if results and all(r.passed for r in results) else 1)
```
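The vacuous-truth behavior behind this finding can be checked directly; the snippet below is a standalone illustration, not code from the harness:

```python
# all() over an empty iterable is vacuously True, so an empty
# results list would be reported as a passing test run.
results = []
assert all(r for r in results) is True

# Guarding with a non-empty check makes a zero-result run fail:
exit_code = 0 if results and all(r for r in results) else 1
assert exit_code == 1
```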
```python
proc = subprocess.Popen(
    cmd,
    stdin=subprocess.PIPE,
    stdout=open(logfile, "w"),
```
🟡 File handle leak: open() passed directly to Popen without closing
In launch_agent, open(logfile, "w") is passed directly as the stdout argument to subprocess.Popen without being stored in a variable. The file handle is never explicitly closed by the parent process.
Detailed Explanation
At tests/e2e_matrix.py:105 and tests/e2e_matrix.py:115:
```python
stdout=open(logfile, "w"),
```

The open() call creates a file object that is passed to Popen, but the reference is immediately lost. While CPython's reference counting will eventually close these when garbage collected, this is not guaranteed (especially in other Python implementations), and with many concurrent tests, file descriptors could accumulate.
Impact: Leaked file handles per agent launch. In practice limited by the small number of tests, but violates resource management best practices and could cause issues if the test matrix grows.
Prompt for agents
In tests/e2e_matrix.py, the launch_agent function at lines 102-118 passes open(logfile, "w") directly to subprocess.Popen's stdout parameter without storing the file handle. This causes a file handle leak. Fix both occurrences (lines 105 and 115) by storing the file handle in a variable and returning it alongside the process so it can be closed later, or by using a context manager pattern. For example, store the file object and close it after proc.terminate()/proc.kill() in the calling functions run_solo_test and run_pair_test.
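A minimal sketch of the fix the prompt describes, keeping the log handle so the parent can close it deterministically. `launch_logged` is an illustrative stand-in for the harness's `launch_agent`, not its actual signature:

```python
import subprocess
import sys
import tempfile

def launch_logged(cmd, logfile):
    # Keep a reference to the log handle instead of discarding it,
    # so the caller can close it after the child process exits.
    log = open(logfile, "w")
    try:
        proc = subprocess.Popen(cmd, stdout=log, stderr=subprocess.STDOUT)
    except OSError:
        log.close()  # do not leak the descriptor if spawning fails
        raise
    return proc, log

with tempfile.NamedTemporaryFile(suffix=".log", delete=False) as tmp:
    logfile = tmp.name

proc, log = launch_logged([sys.executable, "-c", "print('hello')"], logfile)
proc.wait()
log.close()  # deterministic cleanup, no reliance on garbage collection

with open(logfile) as f:
    assert "hello" in f.read()
```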
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/command_surface.rs`:
- Around line 776-793: The branch that returns
UninstallAction::SkippedWrapperMissing when !skill.exists() should still remove
legacy wrapper files; update that arm so it iterates target.legacy_paths(&root)
and attempts to fs::remove_file for each existing legacy path before returning
UninstallAction::SkippedWrapperMissing. Use the same cleanup logic as in the
ManagedOwnership::ManagedForTarget case (checking legacy.exists() and ignoring
errors), referencing skill, target, root, and target.legacy_paths(&root) so old
wrapper-only installs are cleaned up even when SKILL.md is missing.
In `@tests/e2e_matrix.py`:
- Around line 371-372: The bug is that passing both
parser.add_argument("--solo", ...) and parser.add_argument("--pairs", ...) can
cause both phases to be skipped leaving results empty and the test run to pass;
fix by making the flags mutually exclusive (use
argparse.add_mutually_exclusive_group) or add a post-parse validation that
checks if args.solo and args.pairs are both True and then exit non‑zero with a
clear error; ensure the validation covers all places where these flags affect
execution (the code paths that build/execute phases and the variable results) so
the script fails fast instead of returning success when zero tests ran.
- Around line 402-412: When a future raises inside collect(futures_map) the
exception is only logged and no failure is recorded; modify the except block to
append a failing test record to the shared results list so the run will exit
non‑zero. Specifically, in collect(futures_map) (and the other identical collect
instance), construct and append a TestResult-like object (or dict) with fields
matching how results entries are used (e.g., name/test_name, passed=False,
duration=0 or computed, phase="error", plus the exception message) so downstream
code treats the future as a failed test and summary/exit code reflect the
failure.
- Line 377: The CLI --jobs argument is defined but not used; replace the
hardcoded worker/agent/wave sizes with the parsed value (args.jobs) so
concurrency matches the flag. Locate the hardcoded usages (the places that set
worker counts for agents/waves — e.g., where variables like
agent_count/wave_size/workers are set or where thread/process pools are
constructed) and change them to use args.jobs (or a computed value derived from
it, e.g., min(args.jobs, some_limit) or max(1, args.jobs) if needed), ensuring
all three occurrences noted (around the current agent/wave sizing at the three
blocks referenced) consistently read from the parser value instead of literals.
ℹ️ Review info
Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 1e37d9c1-ea66-4f23-bee0-67cd079ff9da
📒 Files selected for processing (7)
- .gitignore
- flake.nix
- justfile
- src/cli.rs
- src/command_surface.rs
- tests/command_install_run.rs
- tests/e2e_matrix.py
💤 Files with no reviewable changes (1)
- src/cli.rs
```diff
 } else if !skill.exists() {
     UninstallAction::SkippedWrapperMissing
 } else {
-    let bytes = fs::read(&wrapper)
-        .with_context(|| format!("reading wrapper {}", wrapper.display()))?;
+    let bytes = fs::read(&skill)
+        .with_context(|| format!("reading skill {}", skill.display()))?;
     match managed_ownership(&bytes, target) {
         ManagedOwnership::ManagedForTarget => {
             if args.dry_run {
                 UninstallAction::WouldRemoveManaged
             } else {
-                fs::remove_file(&wrapper)
-                    .with_context(|| format!("removing wrapper {}", wrapper.display()))?;
+                fs::remove_file(&skill)
+                    .with_context(|| format!("removing skill {}", skill.display()))?;
+                for legacy in target.legacy_paths(&root) {
+                    if legacy.exists() {
+                        let _ = fs::remove_file(&legacy);
+                    }
+                }
                 UninstallAction::RemovedManaged
```
Legacy wrapper cleanup is skipped when SKILL.md is missing.
On Line 776, uninstall returns SkippedWrapperMissing before any legacy cleanup. That leaves old managed wrapper files behind for wrapper-only installs.
Suggested fix
```diff
 fn uninstall_with_environment(
     args: &CommandUninstallArgs,
     env: &AdapterEnvironment,
 ) -> Result<Vec<UninstallOutcome>> {
 @@
     for target in selected_targets(args.target) {
         let root = env.root_for(target)?;
         let skill = target.skill_path(&root);
+
+        if root.exists() && root.is_dir() && !args.dry_run {
+            cleanup_legacy_managed_files(target, &root)?;
+        }
+
         let action = if !root.exists() {
             UninstallAction::SkippedEnvironmentMissing
         } else if !root.is_dir() {
             UninstallAction::SkippedInvalidEnvironment
         } else if !skill.exists() {
             UninstallAction::SkippedWrapperMissing
 @@
             ManagedOwnership::ManagedForTarget => {
                 if args.dry_run {
                     UninstallAction::WouldRemoveManaged
                 } else {
                     fs::remove_file(&skill)
                         .with_context(|| format!("removing skill {}", skill.display()))?;
-                    for legacy in target.legacy_paths(&root) {
-                        if legacy.exists() {
-                            let _ = fs::remove_file(&legacy);
-                        }
-                    }
                     UninstallAction::RemovedManaged
                 }
             }

+fn cleanup_legacy_managed_files(target: HarnessTarget, root: &Path) -> Result<()> {
+    for legacy in target.legacy_paths(root) {
+        if !legacy.exists() {
+            continue;
+        }
+        let bytes = fs::read(&legacy)
+            .with_context(|| format!("reading legacy wrapper {}", legacy.display()))?;
+        if is_rally_managed(&bytes) {
+            fs::remove_file(&legacy)
+                .with_context(|| format!("removing legacy wrapper {}", legacy.display()))?;
+        }
+    }
+    Ok(())
+}
```
+}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/command_surface.rs` around lines 776 - 793, The branch that returns
UninstallAction::SkippedWrapperMissing when !skill.exists() should still remove
legacy wrapper files; update that arm so it iterates target.legacy_paths(&root)
and attempts to fs::remove_file for each existing legacy path before returning
UninstallAction::SkippedWrapperMissing. Use the same cleanup logic as in the
ManagedOwnership::ManagedForTarget case (checking legacy.exists() and ignoring
errors), referencing skill, target, root, and target.legacy_paths(&root) so old
wrapper-only installs are cleaned up even when SKILL.md is missing.
```python
parser.add_argument("--solo", action="store_true", help="solo tests only")
parser.add_argument("--pairs", action="store_true", help="pair tests only")
```
--solo --pairs can run zero tests and still pass.
When both flags are set, both phases are skipped, results stays empty, and exit status can be success.
Suggested fix
```diff
-parser.add_argument("--solo", action="store_true", help="solo tests only")
-parser.add_argument("--pairs", action="store_true", help="pair tests only")
+mode = parser.add_mutually_exclusive_group()
+mode.add_argument("--solo", action="store_true", help="solo tests only")
+mode.add_argument("--pairs", action="store_true", help="pair tests only")
```

Also applies to: 414-425, 458-461
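The mutually-exclusive-group approach can be sketched standalone; the flag names mirror the harness CLI, but the parser below is self-contained:

```python
import argparse

parser = argparse.ArgumentParser()
mode = parser.add_mutually_exclusive_group()
mode.add_argument("--solo", action="store_true", help="solo tests only")
mode.add_argument("--pairs", action="store_true", help="pair tests only")

# A single flag parses normally:
args = parser.parse_args(["--solo"])
assert args.solo and not args.pairs

# argparse rejects the contradictory combination and exits with status 2,
# so the run fails fast instead of silently executing zero tests.
both_rejected = False
try:
    parser.parse_args(["--solo", "--pairs"])
except SystemExit as exc:
    both_rejected = exc.code == 2
assert both_rejected
```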
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/e2e_matrix.py` around lines 371 - 372, The bug is that passing both
parser.add_argument("--solo", ...) and parser.add_argument("--pairs", ...) can
cause both phases to be skipped leaving results empty and the test run to pass;
fix by making the flags mutually exclusive (use
argparse.add_mutually_exclusive_group) or add a post-parse validation that
checks if args.solo and args.pairs are both True and then exit non‑zero with a
clear error; ensure the validation covers all places where these flags affect
execution (the code paths that build/execute phases and the variable results) so
the script fails fast instead of returning success when zero tests ran.
```python
parser.add_argument("--revw", help="specific reviewer agent")
parser.add_argument("--timeout", type=int, default=420, help="per-test timeout (default 420)")
parser.add_argument("--solo-timeout", type=int, default=120, help="solo test timeout (default 120)")
parser.add_argument("--jobs", type=int, default=4, help="max concurrent tests (default 4)")
```
--jobs is currently ignored.
Line 377 advertises max concurrency, but worker counts are hardcoded to agent/wave sizes.
Suggested fix
```diff
-with ThreadPoolExecutor(max_workers=len(agents)) as pool:
+solo_workers = max(1, min(args.jobs, len(agents)))
+with ThreadPoolExecutor(max_workers=solo_workers) as pool:
@@
-with ThreadPoolExecutor(max_workers=len(wave)) as pool:
+wave_workers = max(1, min(args.jobs, len(wave)))
+with ThreadPoolExecutor(max_workers=wave_workers) as pool:
```

Also applies to: 416-417, 451-452
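The clamping pattern in the suggested fix can be demonstrated in isolation (`worker_count` is an illustrative helper, not part of the harness):

```python
from concurrent.futures import ThreadPoolExecutor

def worker_count(jobs, tasks):
    # Never below one worker, never more workers than tasks,
    # and never more than the --jobs flag allows.
    return max(1, min(jobs, tasks))

assert worker_count(4, 10) == 4  # capped by --jobs
assert worker_count(4, 2) == 2   # no idle workers beyond the task count
assert worker_count(0, 5) == 1   # degenerate flag value still runs

with ThreadPoolExecutor(max_workers=worker_count(4, 10)) as pool:
    squares = list(pool.map(lambda x: x * x, range(10)))
assert squares == [x * x for x in range(10)]
```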
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/e2e_matrix.py` at line 377, The CLI --jobs argument is defined but not
used; replace the hardcoded worker/agent/wave sizes with the parsed value
(args.jobs) so concurrency matches the flag. Locate the hardcoded usages (the
places that set worker counts for agents/waves — e.g., where variables like
agent_count/wave_size/workers are set or where thread/process pools are
constructed) and change them to use args.jobs (or a computed value derived from
it, e.g., min(args.jobs, some_limit) or max(1, args.jobs) if needed), ensuring
all three occurrences noted (around the current agent/wave sizing at the three
blocks referenced) consistently read from the parser value instead of literals.
```python
def collect(futures_map):
    for future in as_completed(futures_map):
        test_name = futures_map[future]
        try:
            result = future.result()
            results.append(result)
            status = "PASS" if result.passed else "FAIL"
            print(f"  [{status}] {test_name} ({result.duration:.0f}s) {result.phase}")
        except Exception as e:
            print(f"  [ERROR] {test_name}: {e}")
```
Future exceptions are not counted as test failures.
If a future raises, collect() only logs it. The run can still exit 0 because failed futures are not represented in results.
Suggested fix
```diff
 def main():
 @@
-    # Build test list
-    futures_map = {}
-    results = []
+    # Build test list
+    futures_map = {}
+    results = []
+    had_future_errors = False
 @@
     def collect(futures_map):
+        nonlocal had_future_errors
         for future in as_completed(futures_map):
             test_name = futures_map[future]
             try:
                 result = future.result()
                 results.append(result)
                 status = "PASS" if result.passed else "FAIL"
                 print(f"  [{status}] {test_name} ({result.duration:.0f}s) {result.phase}")
             except Exception as e:
                 print(f"  [ERROR] {test_name}: {e}")
+                had_future_errors = True
+                results.append(TestResult(
+                    name=test_name,
+                    impl_agent="unknown",
+                    revw_agent=None,
+                    session="",
+                    phase="error",
+                    agents_joined=[],
+                    file_content=None,
+                    duration=0.0,
+                    passed=False,
+                    failure_reason=str(e),
+                    errors=[str(e)],
+                ))
 @@
-    sys.exit(0 if all(r.passed for r in results) else 1)
+    sys.exit(0 if results and all(r.passed for r in results) and not had_future_errors else 1)
```

Also applies to: 461-461
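The record-the-crash pattern can be shown standalone; a dict stands in for the harness's TestResult dataclass, and `broken_test` simulates an agent that crashes on startup:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def broken_test():
    raise RuntimeError("agent binary missing")

results = []
with ThreadPoolExecutor(max_workers=2) as pool:
    futures_map = {pool.submit(broken_test): "solo-droid"}
    for future in as_completed(futures_map):
        test_name = futures_map[future]
        try:
            future.result()
            results.append({"name": test_name, "passed": True})
        except Exception as e:
            # Append a failing record instead of only logging the error,
            # so the crash is represented in the result set.
            results.append({"name": test_name, "passed": False, "error": str(e)})

# The crash now surfaces as a failure instead of vanishing:
assert results and not results[0]["passed"]
exit_code = 0 if results and all(r["passed"] for r in results) else 1
assert exit_code == 1
```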
🧰 Tools
🪛 Ruff (0.15.2)
[warning] 410-410: Do not catch blind exception: Exception (BLE001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/e2e_matrix.py` around lines 402 - 412, When a future raises inside
collect(futures_map) the exception is only logged and no failure is recorded;
modify the except block to append a failing test record to the shared results
list so the run will exit non‑zero. Specifically, in collect(futures_map) (and
the other identical collect instance), construct and append a TestResult-like
object (or dict) with fields matching how results entries are used (e.g.,
name/test_name, passed=False, duration=0 or computed, phase="error", plus the
exception message) so downstream code treats the future as a failed test and
summary/exit code reflect the failure.
Superseded by #15, which includes these changes plus additional refactoring.
Changes

- **SKILL.md format:** Replace per-target wrapper formats (`commands/rally.md`, `prompts/rally.md`) with a unified `skills/rally/SKILL.md` using YAML frontmatter. All 4 agents (Claude, Codex, Pi, Droid) now use the same skill content.
- **Merge Droid/Factory targets:** Remove the separate `Factory` target. `Droid` now points at `~/.factory` (where the `droid` binary actually reads skills from). Fixes the solo-droid e2e test.
- **Legacy cleanup:** Install/uninstall automatically removes old command/prompt wrapper files.
- **Concurrent e2e test matrix (`tests/e2e_matrix.py`):** Python test harness that runs solo and pair agent tests concurrently with wave scheduling (no agent used in two tests simultaneously). Added a `just test-matrix` recipe and `python3` to the `flake.nix` devShell.
- **Increased default pair timeout** to 420s to reduce flaky failures.
E2e matrix results
Summary by CodeRabbit
Release Notes
New Features
Chores