docs(arxiv): survey paper on skill-based agentic coding for reductions#617

Open
GiggleLiu wants to merge 43 commits into main from worktree-survey-agentic-reductions

Conversation

@GiggleLiu
Contributor

Summary

  • Adds a complete arXiv paper draft (IEEEtran format, 15 pages) on skill-based agentic coding for NP-hard problem reductions
  • Includes 5 figures (Typst+CeTZ compiled to PDF), references, and supporting survey materials
  • Integrates real development metrics mined from Claude Code session history (~/.claude)

Key findings from Claude history data:

  • 15:1 automation ratio (9,429 assistant messages vs 630 user messages across 283 sessions)
  • 1,510 co-authored commits, 300 MB of conversation transcripts
  • 75% issue rejection rate by automated quality gate on 322 batch-submitted issues
  • Codebase grew from 17 models/0 rules to 27 models/50 rules in 9 weeks
  • Prompt evolution: from imperative step-by-step commands (Phase 1) to single-command orchestration (Phase 3)
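The 15:1 automation ratio above was mined from session transcripts under ~/.claude. A minimal sketch of how such a ratio could be computed, assuming one JSON object per line with a top-level "type" field of "user" or "assistant" (a hypothetical schema, not necessarily the actual transcript format):

```python
import json
from pathlib import Path

def automation_ratio(sessions_dir: str) -> float:
    """Estimate assistant-to-user message ratio across session transcripts.

    Assumes each *.jsonl file holds one session, one JSON object per line,
    with a top-level "type" of "user" or "assistant" (hypothetical layout;
    adjust field names to the real transcript schema).
    """
    counts = {"user": 0, "assistant": 0}
    for path in Path(sessions_dir).glob("**/*.jsonl"):
        for line in path.read_text().splitlines():
            try:
                msg = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip malformed lines
            kind = msg.get("type")
            if kind in counts:
                counts[kind] += 1
    # avoid division by zero on empty or assistant-only histories
    return counts["assistant"] / max(counts["user"], 1)
```

Applied over all 283 session files, a ratio of roughly 9,429/630 ≈ 15 would reproduce the headline number.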

Paper structure:

  1. Introduction — skill-based decomposition thesis
  2. Why Reductions — Goldilocks domain argument
  3. System Architecture — type-driven verification by construction
  4. Skill-Based Task Decomposition — 12 skills, 3 roles, card-based pipeline
  5. Multi-Layered Verification — 7-layer stack
  6. Evaluation — development metrics, issue quality gate, case studies
  7. Related Work — AI coding agents, AI-discovered reductions, formal verification
  8. Discussion & Conclusion — generalizability, limitations, future directions

Test plan

  • Paper compiles cleanly with cd docs/paper/arxiv && pdflatex paper.tex
  • All figures render correctly in compiled PDF
  • No undefined references or broken cross-references
  • Numbers in paper match actual codebase state

🤖 Generated with Claude Code

GiggleLiu and others added 23 commits March 12, 2026 15:13
…ductions

Design spec for a full research paper (ICSE/ASE-class) on using skill-based
AI agent pipelines to build verified NP-hard problem reduction libraries.

Key decisions from brainstorming:
- Methodology-first framing (Goldilocks domain + practical artifact)
- Three roles: contributors (issues), maintainer (board curation), agents (manage + execute)
- Multi-layered verification stack (7 layers from type system to documentation)
- Evaluation: ablation (skill vs no-skill) + git mining + 3 case studies
- Hardware solver motivation (Rydberg atoms, D-Wave)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add the two new card-based orchestration skills from origin/main:
- project-pipeline: picks Ready cards, runs issue-to-pr in worktrees
- review-pipeline: fixes Copilot comments, runs agentic tests, moves to In Review

Updated S4.3 with the two-stage pipeline and explicit human touch points
(Backlog→Ready and In Review→Done). Skills count updated to 13.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
16 tasks in 5 parallelizable chunks: scaffolding, figures, sections S1-S4,
sections S5-S6 with git mining, sections S7-S8 + final assembly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Switch figure generation from TikZ to Typst+CeTZ compiled to PDF,
included in LaTeX via \includegraphics. Paper body remains LaTeX
(IEEEtran class). Removed TikZ packages from preamble. Updated all
figure tasks (3-6), conventions block, compile commands, and Task 17
assembly step.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Set up paper.tex with IEEEtran conference class, 8 section stubs, and
a ~150-word abstract. Combined survey bibliography (22 entries) with 6
foundational references (Karp, Cook, Garey-Johnson, Glover, Lucas,
Barahona). Removed old paper.typ placeholder.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add ~800 words covering four paragraphs: self-contained/verifiable
reductions (LOC stats from graph-metrics.json), homogeneous task
structure vs SWE-Bench, hardware solver compilation layer (Rydberg
atoms for MIS, D-Wave for QUBO), and real-world applications. Fix
figure caption to use accurate counts (40 impl + 12 inferred edges).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add three subsections: S4.1 Three Roles (table + prose describing
Contributor/Maintainer/Agent responsibilities), S4.2 Skills as Agent
Functions (Table 1 with all 13 skills across 5 categories, detailed
paragraphs per category), and S4.3 Card-Based Orchestration (two-stage
pipeline with human touch points). Success rate column uses TBD
placeholder pending Task 11 git mining results.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add three evaluation subsections: S6.1 ablation study design (skill-based
vs raw agent, with TBD results), S6.2 git history mining (58 PRs across
3 phases, error taxonomy table with TBD counts), and S6.3 case studies
of MVC->MIS (96 LOC, simple complement), SAT->MIS (171 LOC, quadratic
gadget), and Factoring->CircuitSAT->ILP (272+225 LOC, composition).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fix critical issues identified by simulated peer review:
- Fix timeline contradiction: "six months" -> "seven weeks" (abstract + discussion)
- Fix author count: "two primary contributors" -> "three contributors"
- Soften unsubstantiated "60% of errors" claim to qualitative language
- Add agent platform identification (Claude Code, model versions)
- Reframe unexecuted ablation as experimental design, not pending results
- Add skills vs. prompt engineering differentiation paragraph
- Fix malformed BibTeX entries (dual booktitle/journal fields)
- Add Pichler 2018 citation for Rydberg atom MIS connection
- Note vendor report status on Anthropic 2026 citation
- Soften Table 2/3 captions to acknowledge pending data

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fix overfull hboxes in skills and error taxonomy tables by removing
outer padding (@{}) and abbreviating long skill names. Add .gitignore
for LaTeX build artifacts. All figures compile, cross-references
verified, no undefined citations.

Note: paper is 15 pages (over the 10-12 target).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Mine ~/.claude session data: 283 sessions, 300MB transcripts, 15:1
  automation ratio, 1510 co-authored commits
- Add development metrics paragraph with codebase growth timeline
- Add issue quality gate data: 75% rejection rate on 322 checked issues
- Add interaction evolution paragraph (imperative → declarative prompts)
- Update counts: 24→27 models, 40→50 rules, 58→59 PRs, 7→9 weeks
- Remove meta-power skill references (13→12 skills)
- Replace Figure 1 with three-layer problemtree (from NSFC proposal)
- Add future directions: reduction compiler with Pareto cost models
- Save raw Claude history data to survey/claude-history-data.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
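The 1,510 co-authored commits cited in this commit could be reproduced from git history with a small script. A sketch, assuming the trailer text "Co-Authored-By: Claude" identifies agent-assisted commits (the trailer string and repo path are assumptions):

```python
import subprocess

def coauthored_commit_count(repo: str, trailer: str = "Co-Authored-By: Claude") -> int:
    """Count commits whose message contains the given trailer.

    Uses `git log --grep`, which matches the pattern anywhere in the
    commit message body, so trailers are picked up too.
    """
    out = subprocess.run(
        ["git", "-C", repo, "log", "--oneline", f"--grep={trailer}"],
        capture_output=True, text=True, check=True,
    ).stdout
    return len(out.splitlines())
```

Running it at the repository root should recover the co-authored total for whatever trailer convention the project actually uses.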
@codecov

codecov bot commented Mar 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 96.86%. Comparing base (e56b61f) to head (5118ac3).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #617   +/-   ##
=======================================
  Coverage   96.86%   96.86%           
=======================================
  Files         264      264           
  Lines       35196    35196           
=======================================
  Hits        34091    34091           
  Misses       1105     1105           

☔ View full report in Codecov by Sentry.

GiggleLiu and others added 6 commits March 13, 2026 20:51
Restructures the paper around the "bridge problem" concept — software
too large for humans, made possible by agents constrained through
systematic verification. Three barriers (convention drift, effort
exhaustion, knowledge discontinuity) become the central thesis.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
GiggleLiu and others added 14 commits March 14, 2026 21:28
- Rewrite abstract and introduction with bridge problem concept
- Add new Section 2 (Bridge Problems): definition, three barriers,
  verification constrains agent output, other candidate domains
- Rename Section 3 to "Case Study: The Reduction Graph"
- Rewrite Discussion: remove content now in Sec 2, tighten
  "Why Human Experts Remain Essential", rewrite conclusion
- Move topology figure to appendix
- Renumber sections throughout

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…line

Uncomment figure placeholders and add timeline figure in Evidence section.
Figures will be compiled from .typ sources in a following commit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- scaling-wall: hero figure showing 3 barriers human teams hit
- verification-funnel: how verification constrains agent output
- timeline: cumulative growth over 9 weeks with phase bands

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The comparison against our own prior project weakens the argument.
The bridge problem thesis stands on its own through the three
structural barriers, not through a self-comparison.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The reduction graph is the most informative visual — real data, not
a conceptual sketch. Lead with what was built.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three agent roles with skill mappings:
- Mentor (4): propose, fix-issue, final-review, dev-setup
- Orchestrator (5): project-pipeline, review-pipeline, issue-to-pr,
  check-issue, topology-sanity-check
- Runner (7): add-model, add-rule, fix-pr, review-implementation,
  write-model-in-paper, write-rule-in-paper, release

Replaces the old "two roles" (guides + runners) text with the
three-role taxonomy and per-role TikZ diagrams.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reframe the agent taxonomy around knowledge asymmetry:
Mentors guide humans with superior project knowledge;
Workers execute routine heavy-lifting with less domain knowledge.
Merges Orchestrator+Runner into Worker with lightweight subcategories.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>