Design testing strategy to isolate Squad failures from Copilot CLI failures #19

@diberry

Problem

When Squad CI tests fail, it's often unclear whether the failure is:

  1. A real Squad bug — code we changed broke something
  2. A Copilot CLI infrastructure issue — the test harness, agent spawning, or CLI environment is the actual failure point
  3. A pre-existing failure on the base branch — unrelated to the current PR's changes

This ambiguity wastes significant debugging time. In the current StorageProvider PR (#18), we chased `epl-ux-fixes.test.ts:921` and `shell.test.ts` failures that turned out to be either pre-existing on `dev` or caused by Copilot CLI's async migration patterns, not by StorageProvider changes.

Goals

  1. Classify test failures: automatically determine whether a failure is SA-scoped (files this PR touched), pre-existing, or infrastructure
  2. Baseline comparison: compare PR test results against `dev` branch results to identify pre-existing failures
  3. Failure attribution: tag each test failure with the likely root cause category
  4. Reduce false alarms: stop agents from chasing failures unrelated to their work

Proposed Approach

1. Baseline CI snapshot

  • Run the full test suite on `dev` nightly (or on demand) and store the results as a baseline
  • PR CI compares its failures against the baseline; any failure that also exists on `dev` is flagged as pre-existing
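
The comparison step above could be sketched roughly as follows. This is only an illustration: the issue does not prescribe a results format, so the `TestFailure` shape and the `compareToBaseline` helper are hypothetical names, and matching on file plus test title is an assumption about what makes two failures "the same".

```typescript
// Hypothetical shape of one failing test; not an existing Squad type.
interface TestFailure {
  file: string; // e.g. "test/shell.test.ts"
  name: string; // full test title
}

// Build a stable key so the same test can be matched across two runs.
function failureKey(failure: TestFailure): string {
  return `${failure.file}::${failure.name}`;
}

// Split the PR's failures into those already failing on the baseline
// and those that only fail on the PR branch.
function compareToBaseline(
  baseline: TestFailure[],
  pr: TestFailure[],
): { preExisting: TestFailure[]; newInPr: TestFailure[] } {
  const baselineKeys = new Set(baseline.map(failureKey));
  const preExisting: TestFailure[] = [];
  const newInPr: TestFailure[] = [];
  for (const failure of pr) {
    if (baselineKeys.has(failureKey(failure))) {
      preExisting.push(failure);
    } else {
      newInPr.push(failure);
    }
  }
  return { preExisting, newInPr };
}
```

Keying on the test title means a renamed test would show up as new in the PR, which seems like the safer direction to err in.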

2. Scope-aware test filtering

  • Given the set of files changed in a PR, determine which test files are relevant
  • Flag failures in unrelated test files as `out-of-scope` (informational, not blocking)
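
One naive way to decide relevance is a name-based heuristic, sketched below. It assumes a test file is in scope if it was changed itself or shares a base name with a changed file (for example `src/shell.ts` and `test/shell.test.ts`). The function names and the naming convention are assumptions for illustration; a real implementation might follow the import graph instead.

```typescript
// Strip directories and the ".ts" / ".test.ts" suffix:
// "test/shell.test.ts" -> "shell". Assumes forward-slash paths, as git prints them.
function stem(path: string): string {
  const base = path.split("/").pop() ?? path;
  return base.replace(/(\.test)?\.ts$/, "");
}

// A test file is considered in scope if the PR changed it directly
// or changed a file with the same stem.
function isInScope(testFile: string, changedFiles: string[]): boolean {
  return changedFiles.some(
    (changed) => changed === testFile || stem(changed) === stem(testFile),
  );
}

// Return the failing test files that have no apparent link to the PR's changes.
function outOfScopeFailures(
  failingTestFiles: string[],
  changedFiles: string[],
): string[] {
  return failingTestFiles.filter((testFile) => !isInScope(testFile, changedFiles));
}
```

This heuristic misses indirect dependencies (a test that imports a changed module under a different name), so its output is better treated as a hint than a verdict.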

3. Failure categorization

  • `🔴 PR-caused`: test passes on dev, fails on PR branch, test file is in scope
  • `🟡 Pre-existing`: test also fails on dev (zero diff on the test file between branches)
  • `⚪ Out-of-scope`: test fails but the test file has no relationship to changed files
  • `🔵 Infrastructure`: timeout, OOM, orphan process cleanup, Node.js deprecation warnings
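
The four categories above could be assigned by a small decision function like the sketch below. The `FailureFacts` shape, the regex signatures for infrastructure errors, and the precedence order (infrastructure first, then pre-existing, then out-of-scope) are all assumptions; the issue lists the categories but does not say which wins when more than one applies.

```typescript
type Category = "PR-caused" | "Pre-existing" | "Out-of-scope" | "Infrastructure";

// Hypothetical inputs gathered by earlier steps (baseline comparison, scope check).
interface FailureFacts {
  failsOnDev: boolean;
  inScope: boolean;
  errorMessage: string;
}

// Illustrative signatures only; real infrastructure errors may look different.
const INFRA_PATTERNS: RegExp[] = [
  /timed? ?out/i, // "timeout", "timed out"
  /out of memory|heap limit/i, // OOM
  /orphan/i, // orphan process cleanup
  /DeprecationWarning/, // Node.js deprecation warnings
];

function categorize(facts: FailureFacts): Category {
  // Assumed precedence: an infrastructure signature overrides everything else.
  if (INFRA_PATTERNS.some((pattern) => pattern.test(facts.errorMessage))) {
    return "Infrastructure";
  }
  if (facts.failsOnDev) return "Pre-existing";
  if (!facts.inScope) return "Out-of-scope";
  return "PR-caused";
}
```

Only a failure that passes on dev, is in scope, and shows no infrastructure signature ends up as PR-caused, which matches the definition in the list.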

4. Agent workflow integration

  • When Squad agents check CI, they should read the categorized results
  • Agents should only fix `🔴 PR-caused` failures
  • `🟡 Pre-existing` failures get logged but not chased
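
As a rough sketch of how categorized results might be handed to an agent, the helper below turns them into a short text brief. `CategorizedFailure` and `buildAgentBrief` are hypothetical names, the wording of the brief is made up, and treating every non-PR-caused category as "logged only" is an assumption (the issue states this only for pre-existing failures).

```typescript
// Hypothetical record produced by the categorization step.
interface CategorizedFailure {
  test: string;
  category: "PR-caused" | "Pre-existing" | "Out-of-scope" | "Infrastructure";
}

// Build a plain-text brief that separates work to do from failures to ignore.
function buildAgentBrief(failures: CategorizedFailure[]): string {
  const toFix = failures.filter((f) => f.category === "PR-caused");
  const logged = failures.filter((f) => f.category !== "PR-caused");
  const lines = [
    `Fix these ${toFix.length} PR-caused failure(s):`,
    ...toFix.map((f) => `- ${f.test}`),
    `Do not chase these ${logged.length} failure(s); they are logged only:`,
    ...logged.map((f) => `- ${f.test} (${f.category})`),
  ];
  return lines.join("\n");
}
```

The resulting string could be appended to an agent spawn prompt so the agent sees the categories before it reads the raw CI log.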

Examples from StorageProvider PR #18

| Failure | Category | Why |
| --- | --- | --- |
| `shell.test.ts`: async/await missing | 🔴 PR-caused | SA migration made functions async, tests needed updating |
| `storage-provider.test.ts:465`: EPERM/EACCES | 🔴 PR-caused | New test, cross-platform error code difference |
| `epl-ux-fixes.test.ts:921`: 'squad init' | 🟡 Pre-existing | Zero diff on this file between dev and SA branch |
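
For the EPERM/EACCES row, one way a test can tolerate the cross-platform difference is to accept either code. The helper below is a hypothetical sketch, not the fix applied in PR #18; it assumes the thrown value carries a Node.js-style `code` property.

```typescript
// Returns true when a thrown value carries either permission-related error code.
// Hypothetical helper; the actual assertion in storage-provider.test.ts is not shown in the issue.
function isPermissionError(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const code = (err as { code?: unknown }).code;
  return code === "EPERM" || code === "EACCES";
}
```

A test can then assert `isPermissionError(caught)` instead of comparing against a single platform-specific code.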

Acceptance Criteria

  • CI can identify pre-existing failures (baseline comparison)
  • Test failures are categorized by scope relevance to PR changes
  • Agent spawn prompts include failure categorization so agents don't chase unrelated failures
  • Skill document created at `.squad/skills/ci-failure-triage/SKILL.md` encoding the triage patterns

Labels

squad, squad:flight

Issue labels: go:needs-research (Needs investigation), squad (Squad triage inbox; Lead will assign to a member), squad:fido (Assigned to FIDO (Quality Owner))
