feat: QA restructure, browser ref staleness, eval efficiency metrics (v0.4.0) #83

Merged
garrytan merged 6 commits into main from garrytan/qa-2.1
Mar 16, 2026

@garrytan
Owner

Summary

  • QA-only skill (/qa-only): report-only mode that blocks the Edit tool entirely, so bugs are documented without fixes
  • QA fix loop: /qa now runs find-fix-verify cycles (discover bugs, fix them, commit, re-navigate to confirm)
  • Plan-to-QA artifact flow: /plan-eng-review writes test-plan artifacts to ~/.gstack/projects/<slug>/ that /qa picks up for targeted testing
  • {{QA_METHODOLOGY}} DRY placeholder: shared methodology block injected into both /qa and /qa-only templates
  • Browser ref staleness detection: resolveRef() now checks element count to detect stale refs after SPA navigation
  • Eval efficiency metrics: turns, duration, and cost displayed across all eval surfaces, with natural-language Takeaway commentary interpreting deltas
  • 3 new E2E tests: qa-only guardrail, qa fix loop with commit verification, plan-eng-review test-plan artifact
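
For readers unfamiliar with the ref staleness check, here is a minimal sketch of the idea, not the code from this PR. It assumes a stale ref shows up as an element count of zero, and that each RefEntry carries a locator plus the role and name metadata mentioned in the commits. The `CountableLocator` interface, the field names, and the error wording are all hypothetical.

```typescript
// Hypothetical shapes: the real RefEntry and resolveRef() may differ.
interface CountableLocator {
  count(): Promise<number>;
}

interface RefEntry {
  locator: CountableLocator;
  role: string; // kept for diagnostics
  name: string; // kept for diagnostics
}

async function resolveRef(
  refs: Map<string, RefEntry>,
  ref: string,
): Promise<CountableLocator> {
  const entry = refs.get(ref);
  if (entry === undefined) {
    throw new Error(`Unknown ref ${ref}: take a new snapshot first`);
  }
  // Assumption: a count of zero means the element recorded at snapshot
  // time is gone, for example after an SPA route change replaced the DOM.
  const count = await entry.locator.count();
  if (count === 0) {
    throw new Error(
      `Ref ${ref} (${entry.role} "${entry.name}") is stale: take a new snapshot`,
    );
  }
  return entry.locator;
}
```

Failing early with the stored role and name tells the caller which element disappeared, instead of leaving it to wait on an element that no longer exists.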

Pre-Landing Review

No issues found. All changes are developer tooling, test infrastructure, and skill templates — no SQL, auth, or trust boundary code.

Eval Results

16/16 PASS — two consecutive clean runs at $4.42 and $4.42

| Test | Status | Cost | Turns | Duration |
| --- | --- | --- | --- | --- |
| browse basic | PASS | $0.08 | 6t | 23s |
| browse snapshot | PASS | $0.05 | 6t | 21s |
| SKILL.md setup discovery | PASS | $0.04 | 4t | 12s |
| SKILL.md setup (no binary) | PASS | $0.04 | 2t | 6s |
| SKILL.md outside git | PASS | $0.04 | 2t | 6s |
| /qa quick | PASS | $0.47 | 29t | 156s |
| /review SQL injection | PASS | $0.15 | 11t | 48s |
| /qa b6-static | PASS | $0.23 | 18t | 109s |
| /qa b7-spa | PASS | $0.47 | 38t | 203s |
| /qa b8-checkout | PASS | $0.62 | 37t | 337s |
| /plan-ceo-review | PASS | $0.67 | 5t | 524s |
| /plan-eng-review | PASS | $0.17 | 4t | 130s |
| /retro | PASS | $0.35 | 26t | 210s |
| /qa-only no-fix | PASS | $0.42 | 25t | 170s |
| /qa fix loop | PASS | $0.41 | 24t | 160s |
| /plan-eng-review artifact | PASS | $0.21 | 15t | 110s |

Takeaway: Stable run — no significant efficiency changes, no regressions.
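
A Takeaway line like the one above can be derived from run-to-run deltas. The sketch below is one hedged illustration, not the PR's generateCommentary(). The `RunTotals` type, the `sketchCommentary` name, the 20% threshold, and the output wording are all assumptions made for this example.

```typescript
// Hypothetical types and threshold; the real generateCommentary() may differ.
interface RunTotals {
  costUsd: number;
  turns: number;
  durationSec: number;
}

// Relative change from before to after, as a fraction (0.25 means +25%).
function relativeDelta(before: number, after: number): number {
  if (before === 0) return 0; // no baseline to compare against
  return (after - before) / before;
}

function sketchCommentary(
  before: RunTotals,
  after: RunTotals,
  threshold: number = 0.2, // assumed cutoff for a "significant" change
): string {
  const metrics: Array<[string, number]> = [
    ["cost", relativeDelta(before.costUsd, after.costUsd)],
    ["turns", relativeDelta(before.turns, after.turns)],
    ["duration", relativeDelta(before.durationSec, after.durationSec)],
  ];
  const notable = metrics.filter(([, delta]) => Math.abs(delta) >= threshold);
  if (notable.length === 0) {
    return "Stable run: no significant efficiency changes.";
  }
  const parts = notable.map(([label, delta]) => {
    const pct = Math.round(Math.abs(delta) * 100);
    return `${label} ${delta > 0 ? "up" : "down"} ${pct}%`;
  });
  return `Efficiency changed: ${parts.join(", ")}.`;
}
```

Using relative rather than absolute deltas lets one threshold cover cost, turns, and duration, even though their units differ.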

TODOS

No TODO items completed in this PR. 102 items remaining.

Test plan

  • All unit tests pass (145 tests, 0 failures)
  • All E2E evals pass (16/16, two consecutive runs)
  • b8-checkout and qa-only — previously flaky, now passing consistently

🤖 Generated with Claude Code

garrytan and others added 6 commits March 15, 2026 21:17
resolveRef() now checks element count to detect stale refs after page
mutations (e.g. SPA navigation). RefEntry stores role+name metadata
for better diagnostics. 3 new snapshot tests for staleness detection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add /qa-only (report-only, Edit tool blocked), restructure /qa with
find-fix-verify cycle, add {{QA_METHODOLOGY}} DRY placeholder for
shared methodology. /plan-eng-review now writes test-plan artifacts
to ~/.gstack/projects/<slug>/ for QA consumption.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
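
As an illustration of the {{QA_METHODOLOGY}} placeholder described in the commit above, here is a minimal sketch of injecting one shared block into two templates. The `renderTemplate` helper, the template strings, and the methodology text are hypothetical; the repository's actual generator may work differently.

```typescript
// Hypothetical helper: replaces {{UPPER_SNAKE}} placeholders with shared blocks.
function renderTemplate(
  template: string,
  blocks: Record<string, string>,
): string {
  return template.replace(/\{\{([A-Z_]+)\}\}/g, (match: string, key: string) => {
    const value = Object.prototype.hasOwnProperty.call(blocks, key)
      ? blocks[key]
      : undefined;
    if (value === undefined) {
      // Fail loudly so a misspelled placeholder never ships in a template.
      throw new Error(`No content registered for placeholder ${match}`);
    }
    return value;
  });
}

// Placeholder text invented for this example only.
const sharedBlocks: Record<string, string> = {
  QA_METHODOLOGY: "1. Explore the app.\n2. Record each bug with evidence.",
};

const qaSkill = renderTemplate(
  "# /qa\n{{QA_METHODOLOGY}}\nThen fix and verify.",
  sharedBlocks,
);
const qaOnlySkill = renderTemplate(
  "# /qa-only\n{{QA_METHODOLOGY}}\nReport only; do not edit files.",
  sharedBlocks,
);
```

Both rendered skills contain the same methodology text, so a change to the shared block reaches /qa and /qa-only together.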
…l surfaces

Add generateCommentary() for natural-language delta interpretation,
per-test turns/duration in comparison and summary output, judgePassed
unit tests, 3 new E2E tests (qa-only, qa fix loop, plan artifact).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- ARCHITECTURE: add ref staleness detection section, update RefEntry type
- BROWSER: add ref staleness paragraph to snapshot system docs
- CONTRIBUTING: update eval tool descriptions with commentary feature
- README: fix missing qa-only in project-local uninstall command

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@garrytan merged commit f3ee0ee into main on Mar 16, 2026