diff --git a/.claude/skills/academic-research-skills b/.claude/skills/academic-research-skills
new file mode 160000
index 00000000..79dedce1
--- /dev/null
+++ b/.claude/skills/academic-research-skills
@@ -0,0 +1 @@
+Subproject commit 79dedce126a25b854616bb3c47d67a57397f0622
diff --git a/.claude/survey/agentic-coding-reductions/claude-history-data.md b/.claude/survey/agentic-coding-reductions/claude-history-data.md
new file mode 100644
index 00000000..e8e30744
--- /dev/null
+++ b/.claude/survey/agentic-coding-reductions/claude-history-data.md
@@ -0,0 +1,235 @@
+# Claude History Data for Paper
+
+Raw metrics extracted from `~/.claude` on 2026-03-13.
+
+## Global Claude Code Stats (Jan 14 – Feb 25, 2026)
+
+| Metric | Value |
+|--------|-------|
+| Days active | 40 |
+| Total messages | 157,325 |
+| Total sessions | 329 |
+| Total tool calls | 36,553 |
+| Avg messages/day | 3,933 |
+| Avg sessions/day | 8.2 |
+| Peak messages/day | 18,224 (Feb 16) |
+| Peak sessions/day | 32 (Feb 13) |
+| Peak tool calls/day | 5,234 (Jan 28) |
+
+## ProblemReductions Project Stats
+
+### Session Data
+| Metric | Value |
+|--------|-------|
+| Session transcript files | 283 |
+| Total transcript data | 300 MB |
+| Largest session | 12.5 MB |
+| Median session | 457 KB |
+| Sessions > 1 MB | 79 |
+| Sessions > 5 MB | 12 |
+
+### From Session Metadata (108 sessions with timing)
+| Metric | Value |
+|--------|-------|
+| Total wall-clock time | 6,897 min (115 hours) |
+| User messages | 630 |
+| Assistant messages | 9,429 |
+| Automation ratio (asst/user) | 15.0x |
+| Avg user msgs/session | 5.8 |
+| Git commits (from sessions) | 140 |
+| Git pushes | 88 |
+| Commits per hour | 1.2 |
+| Tool calls per session | 51 |
+| Input tokens | 255,954 |
+| Output tokens | 732,996 |
+
+### Tool Usage (across 108 measured sessions)
+| Tool | Count |
+|------|-------|
+| Bash | 1,661 |
+| Read | 1,284 |
+| Grep | 629 |
+| Edit | 595 |
+| Task | 272 |
+| TaskUpdate | 245 |
+| TodoWrite | 161 |
+| AskUserQuestion | 151 |
+| TaskCreate | 133 |
+| Glob | 100 |
+| Skill | 97 |
+| Write | 81 |
+| WebFetch | 34 |
+| WebSearch | 34 |
+
+### Languages Touched
+| Language | File operations |
+|----------|----------------|
+| Rust | 1,239 |
+| Markdown | 431 |
+| JavaScript | 55 |
+| JSON | 37 |
+| YAML | 10 |
+
+## Git History
+
+### Commits
+| Metric | Value |
+|--------|-------|
+| Total commits (main) | 253 |
+| Commits (all branches) | 1,089 |
+| Co-Authored-By: Claude commits | 1,510 |
+| Contributors | 4 (GiggleLiu, Jinguo Liu, Shiwen An, Xiwei Pan) |
+| Merged PRs | 59 |
+| `Fix #` PRs (issue-driven) | 10 |
+| feat: PRs | 16 |
+| Project start date | 2026-01-09 |
+
+### Codebase Growth Timeline
+| Date | Models | Rules | Test files | Examples | Rust files |
+|------|--------|-------|------------|----------|------------|
+| Jan 10 (initial) | 17 | 0 | 0 | 0 | 36 |
+| Jan 26 (feature parity) | 20 | 22 | 0 | 1 | ~74 |
+| Feb 1 | 20 | 24 | 0 | 1 | 74 |
+| Feb 15 (arch redesign) | 21 | 44 | 101 | 35 | 204 |
+| Mar 1 | 23 | 51 | 105 | 42 | 218 |
+| Mar 13 (current) | 27 | 50 | 114 | 45 | 232 |
+
+### Current Project Size
+| Component | Count/Size |
+|-----------|------------|
+| Rust source (src/) | 54,599 LOC |
+| Test files (src/unit_tests + tests/) | 28,343 LOC |
+| Examples | 6,362 LOC |
+| Skill files | 3,664 LOC |
+| CLAUDE.md | 253 lines |
+| Models | 27 |
+| Rules | 50 |
+| Examples | 45 |
+| Skills | 14 |
+
+### Peak Development Days
+| Date | Commits | Sessions | Messages | Tool calls | Key activity |
+|------|---------|----------|----------|------------|--------------|
+| Jan 25 | 41 | 22 | 12,868 | 3,734 | Feature parity sprint (Julia port) |
+| Jan 28 | ~30 | 3 | 18,055 | 5,234 | UnitDiskMapping gadgets |
+| Feb 12 | 26 | 82 | 4,540 | 1,508 | Overhead expression system began |
+| Feb 13 | 61 | 43 | 13,169 | 2,529 | Variant system, MIS redesign |
+| Feb 14 | 67 | 16 | 10,454 | 1,885 | Circuit reductions |
+| Feb 15 | 40 | 13 | 4,526 | 783 | Expression system migration |
+| Feb 16 | 69 | 3 | 18,224 | 2,490 | problem_size trait, graph export |
+| Mar 12 | 113 | 26 | N/A | N/A | Pipeline automation, 6 PRs merged |
+
+## GitHub Issues
+
+### Overall
+| Metric | Value |
+|--------|-------|
+| Total issues | 500+ |
+| Open | 350 |
+| Closed | 150 |
+| Rule issues | 271 |
+| Model issues | 183 |
+
+### Issue Authors
+| Author | Issues |
+|--------|--------|
+| isPANN | 414 |
+| GiggleLiu | 34 |
+| zazabap | 28 |
+| QingyunQian | 19 |
+| hmyuuu | 4 |
+| fliingelephant | 2 |
+| exAClior | 1 |
+
+### Peak Issue Creation Days
+| Date | Issues |
+|------|--------|
+| Mar 11 | 251 |
+| Mar 12 | 78 |
+| Mar 10 | 38 |
+| Mar 9 | 26 |
+
+### Quality Gate Results (322 checked of isPANN's 414)
+| Verdict | Count | Percentage |
+|---------|-------|------------|
+| Good | 81 | 25% |
+| PoorWritten | 124 | 39% |
+| Wrong | 64 | 20% |
+| Trivial | 43 | 13% |
+| Useless | 18 | 6% |
+| **Rejection rate** | **241/322** | **75%** |
+
+### All Issues Quality Check (all authors)
+| Verdict | Count |
+|---------|-------|
+| Good | 105 |
+| PoorWritten | 138 |
+| Wrong | 64 |
+| Trivial | 45 |
+| Useless | 19 |
+| Total checked | 371 |
+
+## Skill Invocations (from history.jsonl)
+| Skill | Count |
+|-------|-------|
+| /compact | 33 |
+| /superpowers:brainstorm | 15 |
+| /mcp | 7 |
+| /fix-pr | 5 |
+| /passes | 4 |
+| /model | 4 |
+| /superpowers:execute-plan | 3 |
+| /test-feature | 3 |
+| /check-rule-redundancy | 3 |
+| /review-pipeline | 2 |
+| /review-implementation | 2 |
+| /writing-plans | 2 |
+
+## Prompt Length Distribution (2,196 prompts)
+| Category | Count | Percentage |
+|----------|-------|------------|
+| 1–3 words | 650 | 30% |
+| 4–10 words | 1,038 | 47% |
+| 11–30 words | 592 | 27% |
+| 31+ words | 79 | 4% |
+
+## User Prompt Evolution Examples
+
+### Phase 1 (Jan 9, Manual)
+```
+"start implementing milestone 1"
+"improve test coverage to >95 and start milestone 3"
+"detect missing tests compared with Julia package."
+"compare your implementation with UnitDiskMapping, do not skip any test"
+"incorrect, it is King's subgraph!"
+```
+
+### Phase 2 (Jan 26 – Feb, Basic Skills)
+```
+"/superpowers:brainstorm check issue 10 and 11"
+"implement Satisfiability -> Maximum Independent Set reduction"
+"resolve pr comments, fix ci"
+"commit this in a pr"
+```
+
+### Phase 3 (Mar, Full Pipeline)
+```
+"make run-pipeline"
+"/review-pipeline"
+"/check-rule-redundancy"
+"make run-issue N=570"
+```
+
+## All Projects by Usage (top 10)
+| Project | Prompts |
+|---------|---------|
+| problemreductions | 2,346 |
+| cryochamber | 582 |
+| sci-brainstorm | 329 |
+| DSAA3071TheoryOfComputation | 226 |
+| omeinsum-rs | 197 |
+| BPDecoderPlus | 157 |
+| private-note | 154 |
+| agentic-tests | 153 |
+| dev | 130 |
+| yao-rs | 127 |
diff --git a/.claude/survey/agentic-coding-reductions/references.bib b/.claude/survey/agentic-coding-reductions/references.bib
new file mode 100644
index 00000000..0cd139e2
--- /dev/null
+++ b/.claude/survey/agentic-coding-reductions/references.bib
@@ -0,0 +1,241 @@
+% Survey: Agentic Coding and Problem Reduction Rules
+% Generated: 2026-03-12
+% Papers: 22
+
+% ============================================================
+% Theme A: AI Coding Agents — Architectures and Benchmarks
+% ============================================================
+
+@inproceedings{Yang2024SWEagent,
+  author = {John Yang and Carlos E. Jimenez and Alexander Wettig and Kilian Adriano Lieret and Shunyu Yao and Karthik Narasimhan and Ofir Press},
+  title = {{SWE}-agent: Agent-Computer Interfaces Enable Automated Software Engineering},
+  booktitle = {Advances in Neural Information Processing Systems},
+ year = {2024},
+ doi = {10.48550/arXiv.2405.15793},
+ abstract = {Language model (LM) agents are increasingly being used to automate complicated tasks in digital environments. Just as humans benefit from powerful software applications, such as integrated development environments, for complex tasks like software engineering, we posit that LM agents represent a new category of end users with their own needs and abilities, and would benefit from specially-built interfaces to the software they use. We investigate how interface design affects the performance of language model agents. As a result of this exploration, we introduce SWE-agent: a system that facilitates LM agents to autonomously use computers to solve software engineering tasks. SWE-agent's custom agent-computer interface (ACI) significantly enhances an agent's ability to create and edit code files, navigate entire repositories, and execute tests and other programs. We evaluate SWE-agent on SWE-bench and HumanEvalFix, achieving state-of-the-art performance on both with a pass@1 rate of 12.5\% and 87.7\%, respectively, far exceeding the previous state-of-the-art achieved with non-interactive LMs. Finally, we provide insight on how the design of the ACI can impact agents' behavior and performance.},
+}
+
+@article{Wang2024OpenHands,
+ author = {Xingyao Wang and Boxuan Li and Yufan Song and Frank F. Xu and Xiangru Tang and Mingchen Zhuge and Jiayi Pan and Yueqi Song and Bowen Li and Jaskirat Singh and Hoang H. Tran and Fuqiang Li and Ren Ma and Mingzhang Zheng and Bill Qian and Yanjun Shao and Niklas Muennighoff and Yizhe Zhang and Binyuan Hui and Junyang Lin and Robert Brennan and Hao Peng and Heng Ji and Graham Neubig},
+ title = {{OpenHands}: An Open Platform for {AI} Software Developers as Generalist Agents},
+  journal = {ArXiv},
+  volume = {abs/2407.16741},
+ year = {2024},
+ url = {https://arxiv.org/abs/2407.16741},
+ abstract = {Software is one of the most powerful tools that we humans have at our disposal; it allows a skilled programmer to interact with the world in complex and profound ways. At the same time, thanks to improvements in large language models (LLMs), there has also been a rapid development in AI agents that interact with and affect change in their surrounding environments. In this paper, we introduce OpenHands (f.k.a. OpenDevin), a platform for the development of powerful and flexible AI agents that interact with the world in similar ways to those of a human developer: by writing code, interacting with a command line, and browsing the web. We describe how the platform allows for the implementation of new agents, safe interaction with sandboxed environments for code execution, coordination between multiple agents, and incorporation of evaluation benchmarks. Based on our currently incorporated benchmarks, we perform an evaluation of agents over 15 challenging tasks, including software engineering (e.g., SWE-BENCH) and web browsing (e.g., WEBARENA), among others. Released under the permissive MIT license, OpenHands is a community project spanning academia and industry with more than 2.1K contributions from over 188 contributors.},
+}
+
+@article{Wang2025OpenHandsSDK,
+ author = {Xingyao Wang and Simon Rosenberg and Juan Michelini and Calvin Smith and Hoang H. Tran and Engel Nyst and Rohit Malhotra and Xuhui Zhou and Valerie Chen and Robert Brennan and Graham Neubig},
+ title = {The {OpenHands} Software Agent {SDK}: A Composable and Extensible Foundation for Production Agents},
+ journal = {ArXiv},
+ volume = {abs/2511.03690},
+ year = {2025},
+ doi = {10.48550/arXiv.2511.03690},
+ abstract = {Agents are now used widely in the process of software development, but building production-ready software engineering agents is a complex task. Deploying software agents effectively requires flexibility in implementation and experimentation, reliable and secure execution, and interfaces for users to interact with agents. In this paper, we present the OpenHands Software Agent SDK, a toolkit for implementing software development agents that satisfy these desiderata. This toolkit is a complete architectural redesign of the agent components of the popular OpenHands framework for software development agents, which has 64k+ GitHub stars. To achieve flexibility, we design a simple interface for implementing agents that requires only a few lines of code in the default case, but is easily extensible to more complex, full-featured agents with features such as custom tools, memory management, and more. For security and reliability, it delivers seamless local-to-remote execution portability, integrated REST/WebSocket services. For interaction with human users, it can connect directly to a variety of interfaces, such as visual workspaces (VS Code, VNC, browser), command-line interfaces, and APIs. Compared with existing SDKs from OpenAI, Claude, and Google, OpenHands uniquely integrates native sandboxed execution, lifecycle control, model-agnostic multi-LLM routing, and built-in security analysis. Empirical results on SWE-Bench Verified and GAIA benchmarks demonstrate strong performance. Put together, these elements allow the OpenHands Software Agent SDK to provide a practical foundation for prototyping, unlocking new classes of custom applications, and reliably deploying agents at scale.},
+}
+
+@article{Thai2025SWEEVO,
+  author = {Minh V. T. Thai and Tue Le and D{\~u}ng Nguy{\~{\^e}}n M{\d{a}}nh and Huy Phan Nhat and Nghi D. Q. Bui},
+ title = {{SWE-EVO}: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios},
+ journal = {ArXiv},
+ volume = {abs/2512.18470},
+ year = {2025},
+ doi = {10.48550/arXiv.2512.18470},
+ abstract = {Existing benchmarks for AI coding agents focus on isolated, single-issue tasks such as fixing a bug or implementing a small feature. However, real-world software engineering is fundamentally a long-horizon endeavor: developers must interpret high-level requirements, plan coordinated changes across many files, and evolve codebases over multiple iterations while preserving existing functionality. We introduce SWE-EVO, a benchmark that evaluates agents on this long-horizon software evolution challenge. Constructed from release notes and version histories of seven mature open-source Python projects, SWE-EVO comprises 48 evolution tasks that require agents to implement multi-step modifications spanning an average of 21 files, validated against comprehensive test suites averaging 874 tests per instance. Experiments with state-of-the-art models reveal a striking capability gap: even GPT-5 with OpenHands achieves only a 21 percent resolution rate on SWE-EVO, compared to 65 percent on the single-issue SWE-Bench Verified. This demonstrates that current agents struggle with sustained, multi-file reasoning. We also propose Fix Rate, a fine-grained metric that captures partial progress toward solving these complex, long-horizon tasks.},
+}
+
+@article{Deng2025SWEBenchPro,
+ title = {{SWE-Bench Pro}: Can {AI} Agents Solve Long-Horizon Software Engineering Tasks?},
+ author = {Xiang Deng and Jeff Da and Edwin Pan and Yannis Y. He and Charles Ide and Kanak Garg and Niklas Lauffer and Andrew Park and Chetan Rane and Karmini Sampath and Maya Krishnan and Srivatsa R. Kundurthy and Sean M. Hendryx and Zifan Wang and Chen Bo Calvin Zhang and Noah Jacobson and Bing Liu and Brad Kenstler},
+ year = {2025},
+ journal = {arXiv preprint arXiv:2509.16941},
+ doi = {10.48550/arXiv.2509.16941},
+ url = {https://openreview.net/forum?id=9R2iUHhVfr},
+ note = {Under review at ICLR 2026},
+ abstract = {We introduce SWE-Bench Pro, a substantially more challenging benchmark that builds upon the best practices of SWE-Bench, but is explicitly designed to capture realistic, complex, enterprise-level problems beyond the scope of SWE-Bench. The benchmark comprises 1,865 problems from 41 repositories, split into public, held-out, and commercial sets. It features long-horizon tasks that may require hours to days for a professional software engineer to complete, often involving patches across multiple files and substantial code modifications. SWE-Bench Pro provides a contamination-resistant testbed that more faithfully captures the complexity and diversity of real-world software development, advancing the pursuit of truly autonomous software engineering agents at a professional level.},
+}
+
+@article{Xia2025LiveSWEagent,
+ author = {Chun Xia and Zhe Wang and Yan Yang and Yuxiang Wei and Ling-kai Zhang},
+ title = {{Live-SWE-agent}: Can Software Engineering Agents Self-Evolve on the Fly?},
+ journal = {ArXiv},
+ volume = {abs/2511.13646},
+ year = {2025},
+ doi = {10.48550/arXiv.2511.13646},
+ abstract = {Large Language Models (LLMs) are reshaping almost all industries, including software engineering. In recent years, a number of LLM agents have been proposed to solve real-world software problems. Such software agents are typically equipped with a suite of coding tools and can autonomously decide the next actions to form complete trajectories to solve end-to-end software tasks. While promising, they typically require dedicated design and may still be suboptimal, since it can be extremely challenging and costly to exhaust the entire agent scaffold design space. Recognizing that software agents are inherently software themselves that can be further refined/modified, researchers have proposed a number of self-improving software agents recently, including the Darwin-Goedel Machine (DGM). Meanwhile, such self-improving agents require costly offline training on specific benchmarks and may not generalize well across different LLMs or benchmarks. In this paper, we propose Live-SWE-agent, the first live software agent that can autonomously and continuously evolve itself on-the-fly during runtime when solving real-world software problems. More specifically, Live-SWE-agent starts with the most basic agent scaffold with only access to bash tools (e.g., mini-SWE-agent), and autonomously evolves its own scaffold implementation while solving real-world software problems. Our evaluation on the widely studied SWE-bench Verified benchmark shows that LIVE-SWE-AGENT can achieve an impressive solve rate of 77.4\% without test-time scaling, outperforming all existing software agents, including the best proprietary solution. Moreover, Live-SWE-agent outperforms state-of-the-art manually crafted software agents on the recent SWE-Bench Pro benchmark, achieving the best-known solve rate of 45.8\%.},
+}
+
+@misc{Anthropic2025ClaudeCode,
+ title = {Claude Code},
+ author = {{Anthropic}},
+ year = {2025},
+  url = {https://github.com/anthropics/claude-code},
+ note = {Agentic coding tool that lives in the terminal, understands codebases, and helps developers code faster through natural language commands},
+}
+
+@misc{Wu2024Devin,
+ title = {Introducing {Devin}, the First {AI} Software Engineer},
+ author = {Scott Wu},
+ year = {2024},
+ month = mar,
+ url = {https://cognition.ai/blog/introducing-devin},
+ howpublished = {Cognition AI Blog},
+ note = {Devin is a fully autonomous AI software engineering agent with access to shell, code editor, and browser in a sandboxed environment. On SWE-bench, Devin correctly resolves 13.86\% of issues end-to-end.},
+}
+
+@article{Roychoudhury2025AgenticAI,
+ author = {Abhik Roychoudhury},
+ title = {Agentic {AI} for Software: Thoughts from Software Engineering Community},
+ journal = {ArXiv},
+ volume = {abs/2508.17343},
+ year = {2025},
+ doi = {10.48550/arXiv.2508.17343},
+ abstract = {AI agents have recently shown significant promise in software engineering. Much public attention has been transfixed on the topic of code generation from Large Language Models (LLMs) via a prompt. However, software engineering is much more than programming, and AI agents go far beyond instructions given by a prompt. At the code level, common software tasks include code generation, testing, and program repair. Design level software tasks may include architecture exploration, requirements understanding, and requirements enforcement at the code level. Each of these software tasks involves micro-decisions which can be taken autonomously by an AI agent, aided by program analysis tools. This creates the vision of an AI software engineer, where the AI agent can be seen as a member of a development team. Conceptually, the key to successfully developing trustworthy agentic AI-based software workflows will be to resolve the core difficulty in software engineering --- the deciphering and clarification of developer intent. Specification inference, or deciphering the intent, thus lies at the heart of many software tasks, including software maintenance and program repair. A successful deployment of agentic technology into software engineering would involve making conceptual progress in such intent inference via agents. Trusting the AI agent becomes a key aspect, as software engineering becomes more automated. Higher automation also leads to higher volume of code being automatically generated, and then integrated into code-bases. Thus to deal with this explosion, an emerging direction is AI-based verification and validation (V\&V) of AI generated code. We posit that agentic software workflows in future will include such AI-based V\&V.},
+}
+
+@techreport{Anthropic2026AgenticCoding,
+ title = {2026 Agentic Coding Trends Report: How Coding Agents Are Reshaping Software Development},
+ author = {{Anthropic}},
+ year = {2026},
+ month = jan,
+ institution = {Anthropic},
+ url = {https://resources.anthropic.com/hubfs/2026\%20Agentic\%20Coding\%20Trends\%20Report.pdf},
+ abstract = {Industry report identifying eight trends across foundation, capability, and impact categories that are reshaping software development. Key findings include that developers now use AI in 60\% of their work while maintaining active oversight on 80--100\% of delegated tasks. The report covers shifting engineering roles, multi-agent coordination, human-AI collaboration patterns, and scaling agentic coding beyond engineering teams.},
+}
+
+% ============================================================
+% Theme C: AI-Assisted Discovery of Reductions & Complexity
+% ============================================================
+
+@article{Nagda2025ReinforcedGeneration,
+ author = {Ansh Nagda and Prabhakar Raghavan and Abhradeep Thakurta},
+ title = {Reinforced Generation of Combinatorial Structures: Hardness of Approximation},
+  journal = {ArXiv},
+  volume = {abs/2509.18057},
+  year = {2025},
+  url = {https://arxiv.org/abs/2509.18057},
+ abstract = {Can AI based methods help us make advances in complexity theory? We provide evidence towards answering this in the affirmative, using AlphaEvolve (an LLM code mutation agent) to obtain new results in three settings: a) We improve a recent result of Kunisky and Yu to obtain near-optimal upper and (conditional) lower bounds on certification algorithms for MAX-CUT and MAX-Independent Set on random 3- and 4-regular graphs. Our improved lower bounds are obtained by constructing nearly extremal Ramanujan graphs on as many as 163 vertices, and our upper bounds are obtained via analytical arguments. b) We obtain new inapproximability results for MAX-4-CUT and MAX-3-CUT, proving that it is NP-hard to approximate them within factors of 0.987 and 0.9649 respectively, using AlphaEvolve to discover new gadget reductions. Our MAX-4-CUT result improves upon the SOTA of 0.9883, and our MAX-3-CUT result improves on the current best gadget-based inapproximability result of 0.9853, but falls short of the SOTA of 16/17 that relies on a custom PCP (rather than a reduction from ``standard'' Hastad-style PCPs). c) Inapproximability for the metric Traveling Salesman Problem (TSP): We show that it is NP-hard to approximate the minimum cost tour within a factor of 111/110 using AlphaEvolve to discover a new gadget, thus improving the SOTA of 117/116. Along the way, we provide new modular soundness and completeness arguments that can be of independent interest. A key technical challenge we faced: verifying a candidate construction produced by AlphaEvolve is costly (sometimes requiring time exponential in the size of the construction). We used AlphaEvolve itself to evolve the verification procedure to be faster (sometimes by 10,000x for our gadgets). Our results suggest that gadget based proofs would benefit from a pass through AI-based tools to obtain stronger results.},
+}
+
+@article{Novikov2025AlphaEvolve,
+ author = {Alexander Novikov and Ng{\^a}n V{\~u} and Marvin Eisenberger and Emilien Dupont and Po-Sen Huang and Adam Zsolt Wagner and S. Shirobokov and Borislav M. Kozlovskii and Francisco J. R. Ruiz and Abbas Mehrabian and M. P. Kumar and Abigail See and Swarat Chaudhuri and George Holland and A. Davies and Sebastian Nowozin and Pushmeet Kohli and Matej Balog},
+ title = {{AlphaEvolve}: A Coding Agent for Scientific and Algorithmic Discovery},
+ journal = {ArXiv},
+ volume = {abs/2506.13131},
+ year = {2025},
+ doi = {10.48550/arXiv.2506.13131},
+ abstract = {In this white paper, we present AlphaEvolve, an evolutionary coding agent that substantially enhances capabilities of state-of-the-art LLMs on highly challenging tasks such as tackling open scientific problems or optimizing critical pieces of computational infrastructure. AlphaEvolve orchestrates an autonomous pipeline of LLMs, whose task is to improve an algorithm by making direct changes to the code. Using an evolutionary approach, continuously receiving feedback from one or more evaluators, AlphaEvolve iteratively improves the algorithm, potentially leading to new scientific and practical discoveries. We demonstrate the broad applicability of this approach by applying it to a number of important computational problems. When applied to optimizing critical components of large-scale computational stacks at Google, AlphaEvolve developed a more efficient scheduling algorithm for data centers, found a functionally equivalent simplification in the circuit design of hardware accelerators, and accelerated the training of the LLM underpinning AlphaEvolve itself. Furthermore, AlphaEvolve discovered novel, provably correct algorithms that surpass state-of-the-art solutions on a spectrum of problems in mathematics and computer science, significantly expanding the scope of prior automated discovery methods (Romera-Paredes et al., 2023). Notably, AlphaEvolve developed a search algorithm that found a procedure to multiply two $4 \times 4$ complex-valued matrices using 48 scalar multiplications; offering the first improvement, after 56 years, over Strassen's algorithm in this setting. We believe AlphaEvolve and coding agents like it can have a significant impact in improving solutions of problems across many areas of science and computation.},
+}
+
+@article{RomeraParedes2023FunSearch,
+ author = {Bernardino Romera-Paredes and M. Barekatain and Alexander Novikov and Matej Balog and M. P. Kumar and Emilien Dupont and Francisco J. R. Ruiz and J. Ellenberg and Pengming Wang and Omar Fawzi and Pushmeet Kohli and Alhussein Fawzi},
+ title = {Mathematical Discoveries from Program Search with Large Language Models},
+ journal = {Nature},
+ volume = {625},
+ pages = {468--475},
+  year = {2024},
+ doi = {10.1038/s41586-023-06924-6},
+ abstract = {Large language models (LLMs) have demonstrated tremendous capabilities in solving complex tasks, from quantitative reasoning to understanding natural language. However, LLMs sometimes suffer from confabulations (or hallucinations), which can result in them making plausible but incorrect statements. This hinders the use of current large models in scientific discovery. Here we introduce FunSearch (short for searching in the function space), an evolutionary procedure based on pairing a pretrained LLM with a systematic evaluator. We demonstrate the effectiveness of this approach to surpass the best-known results in important problems, pushing the boundary of existing LLM-based approaches. Applying FunSearch to a central problem in extremal combinatorics---the cap set problem---we discover new constructions of large cap sets going beyond the best-known ones, both in finite dimensional and asymptotic cases. This shows that it is possible to make discoveries for established open problems using LLMs. We showcase the generality of FunSearch by applying it to an algorithmic problem, online bin packing, finding new heuristics that improve on widely used baselines. In contrast to most computer search approaches, FunSearch searches for programs that describe how to solve a problem, rather than what the solution is. Beyond being an effective and scalable strategy, discovered programs tend to be more interpretable than raw solutions, enabling feedback loops between domain experts and FunSearch, and the deployment of such programs in real-world applications.},
+}
+
+@article{Imajuku2025ALEBench,
+ author = {Yuki Imajuku and Kohki Horie and Yoichi Iwata and Kensho Aoki and Naohiro Takahashi and Takuya Akiba},
+ title = {{ALE-Bench}: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering},
+ journal = {ArXiv},
+ volume = {abs/2506.09050},
+ year = {2025},
+ doi = {10.48550/arXiv.2506.09050},
+ abstract = {How well do AI systems perform in algorithm engineering for hard optimization problems in domains such as package-delivery routing, crew scheduling, factory production planning, and power-grid balancing? We introduce ALE-Bench, a new benchmark for evaluating AI systems on score-based algorithmic programming contests. Drawing on real tasks from the AtCoder Heuristic Contests, ALE-Bench presents optimization problems that are computationally hard and admit no known exact solution. Unlike short-duration, pass/fail coding benchmarks, ALE-Bench encourages iterative solution refinement over long time horizons. Our software framework supports interactive agent architectures that leverage test-run feedback and visualizations. Our evaluation of frontier LLMs revealed that while they demonstrate high performance on specific problems, a notable gap remains compared to humans in terms of consistency across problems and long-horizon problem-solving capabilities. This highlights the need for this benchmark to foster future AI advancements.},
+}
+
+@article{Janicic2025URSA,
+ author = {Predrag Jani{\v{c}}i{\'c}},
+ title = {A {SAT}-based Approach for Specification, Analysis, and Justification of Reductions between {NP}-complete Problems},
+ journal = {ArXiv},
+ volume = {abs/2511.18639},
+ year = {2025},
+ doi = {10.48550/arXiv.2511.18639},
+ abstract = {We propose a novel approach for the development, analysis, and verification of reductions between NP-complete problems. This method uses the URSA system, a SAT-based constraint solver and incorporates features that distinguish it from existing related systems.},
+}
+
+% ============================================================
+% Theme D (subset): Physics-Inspired QUBO/Ising Approaches
+% ============================================================
+
+@article{Schuetz2022PhysicsGNN,
+  author = {Martin J. A. Schuetz and J. Kyle Brubaker and Helmut G. Katzgraber},
+ title = {Combinatorial Optimization with Physics-Inspired Graph Neural Networks},
+ journal = {Nature Machine Intelligence},
+ volume = {4},
+ pages = {367--377},
+ year = {2022},
+ doi = {10.1038/s42256-022-00468-6},
+ abstract = {Combinatorial optimization problems are pervasive across science and industry. Modern deep learning tools are poised to solve these problems at unprecedented scales, but a unifying framework that incorporates insights from statistical physics is still outstanding. Here we demonstrate how graph neural networks can be used to solve combinatorial optimization problems. Our approach is broadly applicable to canonical NP-hard problems in the form of quadratic unconstrained binary optimization problems, such as maximum cut, minimum vertex cover, maximum independent set, as well as Ising spin glasses and higher-order generalizations thereof in the form of polynomial unconstrained binary optimization problems. We apply a relaxation strategy to the problem Hamiltonian to generate a differentiable loss function with which we train the graph neural network and apply a simple projection to integer variables once the unsupervised training process has completed. We showcase our approach with numerical results for the canonical maximum cut and maximum independent set problems. We find that the graph neural network optimizer performs on par or outperforms existing solvers, with the ability to scale beyond the state of the art to problems with millions of variables.},
+}
+
+@inproceedings{He2024QuantumTSP,
+ author = {Haoqi He},
+ title = {Quantum Annealing and {GNN} for Solving {TSP} with {QUBO}},
+ booktitle = {Algorithmic Applications in Management},
+ pages = {134--145},
+ year = {2024},
+ doi = {10.1007/978-981-97-7801-0_12},
+ abstract = {This paper explores the application of Quadratic Unconstrained Binary Optimization (QUBO) models in solving the Travelling Salesman Problem (TSP) through Quantum Annealing algorithms and Graph Neural Networks. Quantum Annealing (QA), a quantum-inspired optimization method that exploits quantum tunneling to escape local minima, is used to solve QUBO formulations of TSP instances on Coherent Ising Machines (CIMs). The paper also presents a novel approach where QUBO is employed as a loss function within a GNN architecture tailored for solving TSP efficiently. By leveraging GNN's capability to learn graph representations, this method finds approximate solutions to TSP with improved computational time compared to traditional exact solvers.},
+}
+
+% ============================================================
+% Theme E: LLM-Assisted Formal Verification & Program Synthesis
+% ============================================================
+
+@article{Bursuc2025VeriCoding,
+ author = {Sergiu Bursuc and Theodore Ehrenborg and Shaowei Lin and L. Astefanoaei and Ionel Emilian Chiosa and Jure Kukovec and Alok Singh and Oliver Butterley and Adem Bizid and Quinn Dougherty and Miranda Zhao and Max Tan and Max Tegmark},
+ title = {A Benchmark for Vericoding: Formally Verified Program Synthesis},
+ journal = {ArXiv},
+ volume = {abs/2509.22908},
+ year = {2025},
+ doi = {10.48550/arXiv.2509.22908},
+ abstract = {We present and test the largest benchmark for vericoding, LLM-generation of formally verified code from formal specifications --- in contrast to vibe coding, which generates potentially buggy code from a natural language description. Our benchmark contains 12,504 formal specifications, with 3,029 in Dafny, 2,334 in Verus/Rust and 7,141 in Lean. Of these, 6,174 are new unseen problems. We find vericoding success rates of 27\% in Lean, 44\% in Verus/Rust and 82\% in Dafny using off-the-shelf LLMs. Adding natural-language descriptions does not significantly improve performance. We also find that LLM progress has improved success rates on pure Dafny verification from 68\% to 96\% over the past year. The benchmark and vericoding results are shared at https://github.com/Beneficial-AI-Foundation/vericoding-benchmark},
+}
+
+@article{Thakur2025CLEVER,
+ author = {Amitayush Thakur and Jasper Lee and G. Tsoukalas and Meghana Sistla and Matthew Zhao and Stefan Zetzsche and Greg Durrett and Yisong Yue and Swarat Chaudhuri},
+ title = {{CLEVER}: A Curated Benchmark for Formally Verified Code Generation},
+ journal = {ArXiv},
+ volume = {abs/2505.13938},
+ year = {2025},
+ doi = {10.48550/arXiv.2505.13938},
+ abstract = {We introduce CLEVER, a high-quality, curated benchmark of 161 problems for end-to-end verified code generation in Lean. Each problem consists of (1) the task of generating a specification that matches a held-out ground-truth specification, and (2) the task of generating a Lean implementation that provably satisfies this specification. Unlike prior benchmarks, CLEVER avoids test-case supervision, LLM-generated annotations, and specifications that leak implementation logic or allow vacuous solutions. All outputs are verified post-hoc using Lean's type checker to ensure machine-checkable correctness. We use CLEVER to evaluate several few-shot and agentic approaches based on state-of-the-art language models. These methods all struggle to achieve full verification, establishing it as a challenging frontier benchmark for program synthesis and formal reasoning.},
+}
+
+@inproceedings{Miranda2025VeriBench,
+ title = {{VeriBench}: End-to-End Formal Verification Benchmark for {AI} Code Generation in {Lean} 4},
+ author = {Brando Miranda and Zhanke Zhou and Allen Nie and Elyas Obbad and Leni Aniva and Kai Fronsdal and Weston Kirk and Dilara Soylu and Andrea Yu and Ying Li and Sanmi Koyejo},
+ year = {2025},
+ booktitle = {2nd AI for Math Workshop at ICML 2025 (AI4Math@ICML)},
+ url = {https://openreview.net/forum?id=rWkGFmnSNl},
+ abstract = {VeriBench evaluates LLM capabilities in generating complete Lean 4 programs---implementations, unit tests, correctness theorems, and formal proofs---derived from reference Python functions or their docstrings. Testing 113 tasks across HumanEval problems, exercises, classical algorithms, and security challenges, the benchmark reveals that Claude 3.7 Sonnet achieves compilation on only 12.5\%, while LLaMA-70B fails to compile any programs in the Lean 4 HumanEval subset, even with 50 feedback-guided attempts. Only a self-optimizing agent architecture achieves meaningful compilation rates, approaching 90\%.},
+}
+
+@inproceedings{Mukherjee2025CoqPL,
+ title = {Towards Automated Verification of {LLM}-Synthesized {C} Programs},
+ author = {Prasita Mukherjee and Benjamin Delaware},
+ year = {2025},
+ month = jan,
+ booktitle = {CoqPL 2025: The Eleventh International Workshop on Coq for Programming Languages (co-located with POPL 2025)},
+ doi = {10.48550/arXiv.2410.14835},
+ url = {https://popl25.sigplan.org/details/CoqPL-2025-papers/5/Towards-Automated-Verification-of-LLM-Synthesized-C-Programs},
+ abstract = {We present a synthesis and verification framework for C programs that leverages LLMs to generate candidate programs while imposing syntactic and semantic biases on programs generated by LLMs, such that the synthesized program is more amenable to automated verification. The key contribution is a specification-verification tool built on the Verified Software Toolchain. Experiments on diverse benchmarks from the deductive program synthesis community, including basic coding examples, Separation Logic based assertions, and API specifications, demonstrate scalability and extensibility.},
+}
+
+@inproceedings{Mukherjee2025SynVer,
+ title = {{SYNVER}: {LLM}-Assisted Synthesis of High-Assurance {C} Programs},
+ author = {Prasita Mukherjee and Minghai Lu and Benjamin Delaware},
+ year = {2025},
+ month = nov,
+ booktitle = {2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE)},
+ address = {Seoul, Korea},
+ doi = {10.1109/ASE63991.2025.00255},
+ url = {https://ieeexplore.ieee.org/document/11334588/},
+ abstract = {We present SynVer---a novel, general purpose synthesizer for C programs equipped with machine-checked proofs of correctness using the Verified Software Toolchain. SynVer employs two Large Language Models: the first generates candidate programs from user-provided specifications, and the second helps automatically generate proofs of correctness in the Rocq proof assistant. SynVer combines symbolic reasoning with LLM-powered proof generation to discharge proof obligations.},
+}
diff --git a/.claude/survey/agentic-coding-reductions/summary.md b/.claude/survey/agentic-coding-reductions/summary.md
new file mode 100644
index 00000000..11bfcba6
--- /dev/null
+++ b/.claude/survey/agentic-coding-reductions/summary.md
@@ -0,0 +1,92 @@
+# Survey: Agentic Coding and Problem Reduction Rules
+
+**Date:** 2026-03-12
+**Papers:** 22
+**Strategies used:** Landscape mapping
+
+---
+
+## Theme A: AI Coding Agents — Architectures and Benchmarks
+
+The field has matured from proof-of-concept (Devin [Wu2024Devin], early 2024) to production-grade SDKs (OpenHands [Wang2024OpenHands], [Wang2025OpenHandsSDK]; Claude Code [Anthropic2025ClaudeCode]). The core architectural insight is the Agent-Computer Interface (ACI) — purpose-built tool interfaces for LLM agents [Yang2024SWEagent].
+
+**Benchmarks reveal a capability cliff:** single-issue bug fixes reach ~70-80% (SWE-Bench Verified), but long-horizon multi-file tasks drop to ~20% [Thai2025SWEEVO], [Deng2025SWEBenchPro]. Self-evolving agents (Live-SWE-agent [Xia2025LiveSWEagent]) show promising results at 77.4% on SWE-Bench Verified.
+
+**Industry perspective:** Developers use AI in 60% of work but maintain oversight on 80-100% of delegated tasks [Anthropic2026AgenticCoding]. The key challenge is specification inference — deciphering developer intent [Roychoudhury2025AgenticAI].
+
+**Active groups:** Princeton (SWE-agent), UIUC/OpenHands consortium, Anthropic (Claude Code), Cognition AI (Devin), Scale AI (SWE-Bench Pro).
+
+### Papers
+- [Yang2024SWEagent] — SWE-agent: ACI design for coding agents (2024)
+- [Wang2024OpenHands] — OpenHands: open platform for AI developers (2024)
+- [Wang2025OpenHandsSDK] — OpenHands SDK: composable agent foundation (2025)
+- [Thai2025SWEEVO] — SWE-EVO: long-horizon evolution benchmark (2025)
+- [Deng2025SWEBenchPro] — SWE-Bench Pro: enterprise-level tasks (2025)
+- [Xia2025LiveSWEagent] — Live-SWE-agent: self-evolving agents (2025)
+- [Anthropic2025ClaudeCode] — Claude Code: agentic CLI tool (2025)
+- [Wu2024Devin] — Devin: autonomous AI engineer (2024)
+- [Roychoudhury2025AgenticAI] — Position paper on agentic AI for SE (2025)
+- [Anthropic2026AgenticCoding] — Industry trends report (2026)
+
+---
+
+## Theme C: AI-Assisted Discovery of Reductions & Complexity Results
+
+**The most directly relevant theme.** DeepMind's evolutionary approach — FunSearch [RomeraParedes2023FunSearch] (Nature 2023) followed by AlphaEvolve [Novikov2025AlphaEvolve] (2025) — demonstrates that LLM-powered program search can discover genuinely novel mathematical constructions. The breakthrough application to complexity theory: AlphaEvolve discovered new gadget reductions proving improved NP-hardness bounds for MAX-3-CUT (0.9649), MAX-4-CUT (0.987), and metric TSP (111/110) [Nagda2025ReinforcedGeneration].
+
+**Key insight from Nagda et al.:** Verifying AI-discovered gadgets can be exponentially costly — they used AlphaEvolve itself to evolve faster verification procedures (10,000x speedup). This mirrors our project's need for automated reduction verification.
+
+On the formal verification side, URSA [Janicic2025URSA] uses SAT solvers to verify NP-complete reductions — a complementary approach to LLM-based discovery. ALE-Bench [Imajuku2025ALEBench] benchmarks coding agents on NP-hard optimization (competitive with top-100 human contestants).
+
+**Active groups:** Google DeepMind (FunSearch, AlphaEvolve), Sakana AI (ALE-Bench), University of Belgrade (URSA).
+
+### Papers
+- [Nagda2025ReinforcedGeneration] — AlphaEvolve discovers new NP-hardness gadgets (2025)
+- [Novikov2025AlphaEvolve] — AlphaEvolve: evolutionary coding agent (2025)
+- [RomeraParedes2023FunSearch] — FunSearch: LLM program search discovers cap set constructions (Nature 2023)
+- [Imajuku2025ALEBench] — ALE-Bench: agents vs humans on NP-hard optimization (2025)
+- [Janicic2025URSA] — URSA: SAT-based verification of NP-complete reductions (2025)
+
+---
+
+## Theme D (subset): Physics-Inspired QUBO/Ising Approaches
+
+GNNs trained via QUBO Hamiltonian relaxation can solve MIS, MaxCut, MinVC at million-variable scale [Schuetz2022PhysicsGNN]. QUBO serves as a unifying target representation for combinatorial optimization — directly paralleling this project's use of QUBO as a central reduction hub. Quantum annealing + GNN hybrid approaches show promise for TSP [He2024QuantumTSP].
+
+### Papers
+- [Schuetz2022PhysicsGNN] — Physics-inspired GNN for QUBO problems (Nature Machine Intelligence 2022)
+- [He2024QuantumTSP] — Quantum annealing + GNN for TSP via QUBO (2024)
+
+---
+
+## Theme E: LLM-Assisted Formal Verification & Program Synthesis
+
+End-to-end formally verified code generation remains largely unsolved. The largest benchmark (VeriCoding [Bursuc2025VeriCoding]) shows 27% success in Lean, 44% in Verus/Rust, 82% in Dafny. The curated CLEVER benchmark [Thakur2025CLEVER] reports near-zero success on 161 hard problems. VeriBench [Miranda2025VeriBench] finds that only self-optimizing agent architectures achieve meaningful compilation rates (~90%).
+
+For C programs specifically, the CoqPL/SYNVER line of work [Mukherjee2025CoqPL], [Mukherjee2025SynVer] demonstrates a two-LLM pipeline: one model generates candidate programs, the other generates correctness proofs in the Rocq (formerly Coq) proof assistant. This generate-and-verify pattern is the emerging paradigm.
+
+**Active groups:** MIT/Tegmark (VeriCoding), UT Austin/Caltech (CLEVER), Purdue (SYNVER), Stanford/ICML workshop (VeriBench).
+
+### Papers
+- [Bursuc2025VeriCoding] — VeriCoding: 12,504 formal specs across Lean/Dafny/Verus (2025)
+- [Thakur2025CLEVER] — CLEVER: curated Lean verification benchmark (2025)
+- [Miranda2025VeriBench] — VeriBench: end-to-end Lean 4 benchmark (2025)
+- [Mukherjee2025CoqPL] — Automated verification of LLM-synthesized C (CoqPL 2025)
+- [Mukherjee2025SynVer] — SYNVER: synthesis + Coq proof generation (ASE 2025)
+
+---
+
+## Key Open Problems
+
+1. **Automated gadget discovery at scale** — AlphaEvolve works but verification is exponentially costly; can we build faster feedback loops?
+2. **End-to-end reduction pipelines** — No system yet discovers a reduction, implements it, AND formally verifies correctness
+3. **Long-horizon agent capability** — Agents fail at ~80% of multi-file, multi-step tasks (the kind needed for implementing reductions)
+4. **Verified code generation** — Only 27% success on formal specs in Lean; major bottleneck for trustworthy AI-discovered reductions
+5. **QUBO as universal target** — Can GNN/physics-inspired solvers be integrated into a reduction-aware optimization pipeline?
+
+## Key Bottlenecks
+
+1. **Verification cost** — Checking candidate gadgets/reductions is often exponentially expensive
+2. **Specification gap** — LLMs struggle to produce formal specs from informal mathematical descriptions
+3. **Agent scaffolding** — No standard architecture for combining code generation + formal verification + domain-specific evaluation
+4. **Benchmark coverage** — No benchmark specifically targets reduction implementation and verification
diff --git a/.gitignore b/.gitignore
index 79202a3b..3fa7dd40 100644
--- a/.gitignore
+++ b/.gitignore
@@ -88,3 +88,4 @@ claude-output.log
docs/test-reports/
docs/superpowers/
*.log
+.superpowers/
diff --git a/.superpowers/brainstorm/785-1773296086/.server-info b/.superpowers/brainstorm/785-1773296086/.server-info
new file mode 100644
index 00000000..391dd259
--- /dev/null
+++ b/.superpowers/brainstorm/785-1773296086/.server-info
@@ -0,0 +1 @@
+{"type":"server-started","port":60547,"host":"127.0.0.1","url_host":"localhost","url":"http://localhost:60547","screen_dir":"/Users/liujinguo/rcode/problemreductions/.claude/worktrees/survey-agentic-reductions/.superpowers/brainstorm/785-1773296086"}
diff --git a/.superpowers/brainstorm/785-1773296086/.server.pid b/.superpowers/brainstorm/785-1773296086/.server.pid
new file mode 100644
index 00000000..4ab4c5d7
--- /dev/null
+++ b/.superpowers/brainstorm/785-1773296086/.server.pid
@@ -0,0 +1 @@
+791
diff --git a/.superpowers/brainstorm/785-1773296086/full-design.html b/.superpowers/brainstorm/785-1773296086/full-design.html
new file mode 100644
index 00000000..b971c9b0
--- /dev/null
+++ b/.superpowers/brainstorm/785-1773296086/full-design.html
@@ -0,0 +1,157 @@
+
Full Paper Design
+~10-12 pages, ICSE/ASE-class venue, Methodology-First (B)
+
+
+
Title (working)
+
+ "Skill-Based Agentic Coding for Mathematical Software: A Case Study in NP-Hard Problem Reductions"
+
+
Alternative: "From Cards to Code: Human-Directed Agent Execution for Verified Reduction Libraries"
+
+
+
+
Thesis Statement
+
+
The bottleneck in agentic coding is not agent capability but task decomposition and the division of labor between human creativity and agent execution. We demonstrate a skill-based pipeline where humans (contributors + maintainer) provide judgment — which problems matter, which reductions are useful — while agents handle mechanical execution: implementation, testing, documentation, and review. Applied to NP-hard problem reductions, this produces a verified library of 24 problem types and 52 reductions, with multi-layered correctness guarantees.
+
+
+
+
+
Paper Outline (~10-12 pages)
+
+
+
S1. Introduction (~1.5 pages)
+
+ • Agents hit ~20% on long-horizon tasks — not a capability problem, a decomposition problem
+ • The "review > generation" challenge for mathematical/scientific code
+ • Key insight: reposition humans as creativity source (issues, curation), agents as labor
+ • Three roles: contributors (create issues), maintainer (curate board, write skills), agents (execute)
+ • Contributions: (1) skill-based methodology, (2) verification stack, (3) reduction library artifact
+
+
+
S2. Why Reductions? The Goldilocks Domain (~1 page)
+
+ • Each reduction is self-contained (~50-200 LOC), formally specified, independently verifiable
+ • Round-trip correctness criterion: reduce → solve target → extract back → verify against source
+ • Practical value: QUBO as compilation target for quantum annealers, GNN solvers
+ • Contrast with SWE-Bench: homogeneous tasks enable systematic comparison
+ • Figure 1: The reduction graph (24 problems, 52 edges, variant lattice)
+
+
+
S3. System Architecture (~2 pages)
+
+ • Rust trait hierarchy: Problem → OptimizationProblem / SatisfactionProblem
+ • ReduceTo<T> trait + ReductionResult for type-safe reductions
+ • Compile-time machinery: overhead expressions, variant registration, complexity strings
+ • The design philosophy: make correctness checkable by construction
+ • Figure 2: System architecture diagram (traits, registry, verification layers)
+
+
+
S4. Skill-Based Task Decomposition (~2 pages)
+
+ 4.1 The Three Roles
+ • Contributors: open issues (creative — "is this reduction useful? non-trivial?")
+ • Maintainer: curates project board, writes/evolves skills, moves cards
+ • Agents: pick cards, execute skill pipelines
+
+ 4.2 Skills as Agent Functions
+ • check-issue: validates usefulness, non-triviality, literature correctness
+ • add-model / add-rule: brainstorm → plan → implement → test → review
+ • review-implementation: parallel subagents (structural + quality)
+ • fix-pr: resolve review comments, CI failures, coverage gaps
+
+ 4.3 Card-Based Orchestration
+ • GitHub project board as the coordination mechanism
+ • Manager agent auto-picks a card and drives it through the pipeline
+ • Human moves cards between columns (the creative decision: what to work on next)
+ • Figure 3: Pipeline diagram — issue → check → implement → review → PR → merge
+
+
+
S5. Multi-Layered Verification (~1.5 pages)
+
+ 5.1 The Verification Stack
+ • Layer 1: Rust type system — compile-time enforcement of trait contracts
+ • Layer 2: Unit tests — evaluation, serialization, edge cases
+ • Layer 3: Round-trip (closed-loop) tests — reduce, solve, extract, verify
+ • Layer 4: Overhead validation — symbolic expressions checked against actual sizes
+ • Layer 5: Materialized test data — JSON fixtures locked in version control
+ • Layer 6: Agentic review — parallel subagents with fresh context
+ • Layer 7: Documentation — paper entry forces human-readable proof sketch
+
+ 5.2 Why Layers?
+ • Each layer catches different error classes (table: which errors each layer catches)
+ • Materialized data prevents agents from silently changing tests
+ • The "lazy agent" problem: agents take shortest path to close issues
+ • Figure 4: Verification pyramid with error examples at each layer
+
+
+
S6. Evaluation (~2 pages)
+
+ 6.1 Git History Mining (quantitative)
+ • How many reductions were agent-implemented vs human-implemented
+ • Success rate per skill invocation (first-attempt pass rate)
+ • Review rounds before merge
+ • Error taxonomy: what went wrong and which layer caught it
+ • Coverage metrics across the codebase (>95% target)
+
+ 6.2 Case Studies (qualitative, 2-3 reductions)
+ • Simple: MVC → MIS — complement relationship, near-trivial mapping
+ • Complex: SAT → MIS — clause-variable gadget, quadratic blowup
+ • Multi-hop: Factoring → CircuitSAT → ILP — chain through circuit encoding
+ • For each: show the full pipeline from issue to merged PR with paper entry
+ • Highlight where human judgment was needed vs. where agent executed autonomously
+
+
+
S7. Related Work (~1 page)
+
+ • AI coding agents: SWE-agent, OpenHands, Claude Code, Devin — position vs. our skill approach
+ • AI for reductions: AlphaEvolve gadgets, URSA SAT verification — discovery vs. implementation
+ • Formal verification: VeriCoding, CLEVER — our pragmatic multi-layer alternative
+ • Physics-inspired solvers: QUBO/GNN — our graph as infrastructure for these
+
+
+
S8. Discussion & Conclusion (~1 page)
+
+ • Generalizability: what other domains have the "Goldilocks" property?
+ • Limitations: requires upfront skill engineering, domain expertise doesn't transfer
+ • The human value proposition: creativity, judgment, responsibility — not eliminated, repositioned
+ • Future: connecting to AlphaEvolve-style discovery, formal verification integration
+
+
+
+
+
+
Key Figures
+
+
+
Fig 1: Reduction Graph
+
24 problem nodes, 52 directed edges, QUBO hub visible. Color-coded by category (graph/formula/set/algebraic/misc). Variant lattice shown as inset.
+
+
+
Fig 2: System Architecture
+
Trait hierarchy + compile-time machinery. Shows how Problem/ReduceTo/Solver traits compose.
+
+
+
Fig 3: Pipeline Diagram
+
Three-role pipeline: contributor → issue → (agent: check) → maintainer moves card → (agent: implement/review) → PR → merge. Human decisions highlighted.
+
+
+
Fig 4: Verification Pyramid
+
7 layers from type system (bottom) to documentation (top). Each layer annotated with example errors it catches.
+
+
+
+
+
+
Key Tables
+
+
+
Table 1: Skills Inventory
+
Each skill with: trigger, inputs, outputs, typical agent turns, first-attempt success rate.
+
+
+
Table 2: Error Taxonomy
+
Error categories × which verification layer caught them. Shows why no single layer suffices.
+
+
+
diff --git a/.superpowers/brainstorm/785-1773296086/paper-structures.html b/.superpowers/brainstorm/785-1773296086/paper-structures.html
new file mode 100644
index 00000000..d14ee32b
--- /dev/null
+++ b/.superpowers/brainstorm/785-1773296086/paper-structures.html
@@ -0,0 +1,69 @@
+Paper Structure: 3 Approaches
+Full research paper (10-12 pages) for ICSE/ASE-class venue
+
+
+
+
A
+
+
System-First: "Here's What We Built"
+
Lead with the reduction graph as artifact, then explain the agentic pipeline that produced it.
+
+
1. Introduction — NP-hard reductions matter, building them is tedious
+
2. The Reduction Graph — 24 problems, 52 rules, QUBO hub, variant lattice
+
3. System Design — skills, traits, overhead system, verification layers
+
4. Card-Based Workflow — human moves cards, agent picks + executes
+
5. Evaluation — git mining (success rates, error taxonomy) + 3 case studies
+
6. Related Work — agentic coding, AI for reductions, formal verification
+
7. Discussion + Conclusion
+
+
+
Strengths
- Artifact speaks for itself
- Accessible to theory + SE audiences
- Natural figure: the full reduction graph
+
Risks
- May read as "just a tool paper"
- Methodology insight buried in section 4
+
+
+
+
+
+
B
+
+
Methodology-First: "Here's How Agents Should Code Math"
+
Lead with the insight that reductions are a Goldilocks domain, then present the skill-based methodology.
+
+
1. Introduction — Agents fail at long-horizon math tasks; why?
+
2. Why Reductions? — Goldilocks: self-contained, formally specified, verifiable
+
3. Skill-Based Decomposition — how skills encode domain knowledge as guardrails
+
4. Verification Stack — 5 layers: types, unit tests, round-trip, overhead, review
+
5. Card-Based Orchestration — graduated trust, human as curator
+
6. Evaluation — git mining + case studies + error taxonomy
+
7. The Artifact — reduction graph, QUBO hub, practical applications
+
8. Related Work + Conclusion
+
+
+
Strengths
- Clear research contribution
- Generalizable lessons for other domains
- Addresses the "verification gap" from survey
+
Risks
- Artifact feels like an afterthought
- Harder for theory audience to engage
+
+
+
+
+
+
C
+
+
Narrative: "From Issue to Theorem"
+
Open with a concrete example — one reduction flowing through the entire pipeline — then zoom out.
+
+
1. Introduction — Walk through SAT→MIS: issue → code → test → paper entry
+
2. Problem Setting — reduction rules, why they're hard, why they matter
+
3. System Overview — architecture, roles (human curator + agent executor)
+
4. The Pipeline — skills × verification, card-based orchestration
+
5. Two More Case Studies — simple (MVC→MIS) + complex (Factoring→Circuit)
+
6. Quantitative Results — git mining across all 52 reductions
+
7. Lessons & Limitations — what worked, what didn't, generalizability
+
8. Related Work + Conclusion
+
+
+
Strengths
- Most engaging to read
- Case studies front and center
- Easy to follow even for non-experts
+
Risks
- Methodology contribution less crisp
- May feel anecdotal without strong quantitative backing
+
+
+
+
diff --git a/.superpowers/brainstorm/785-1773296086/waiting.html b/.superpowers/brainstorm/785-1773296086/waiting.html
new file mode 100644
index 00000000..b07372b1
--- /dev/null
+++ b/.superpowers/brainstorm/785-1773296086/waiting.html
@@ -0,0 +1,3 @@
+
+
Writing spec document...
+
\ No newline at end of file
diff --git a/Makefile b/Makefile
index d7011b9e..63d3425e 100644
--- a/Makefile
+++ b/Makefile
@@ -1,6 +1,6 @@
# Makefile for problemreductions
-.PHONY: help build test mcp-test fmt clippy doc mdbook paper examples clean coverage rust-export compare qubo-testdata export-schemas release run-plan run-issue run-pipeline run-pipeline-forever run-review run-review-forever diagrams jl-testdata cli cli-demo copilot-review
+.PHONY: help build test mcp-test fmt clippy doc mdbook paper examples clean coverage rust-export compare qubo-testdata export-schemas release run-plan run-issue run-pipeline run-pipeline-forever run-review run-review-forever diagrams arxiv-figures jl-testdata cli cli-demo copilot-review
RUNNER ?= codex
CLAUDE_MODEL ?= opus
@@ -17,6 +17,7 @@ help:
@echo " clippy - Run clippy lints"
@echo " doc - Build mdBook documentation"
@echo " diagrams - Generate SVG diagrams from Typst (light + dark)"
+ @echo " arxiv-figures - Compile arxiv figure Typst files to PDF"
@echo " mdbook - Build and serve mdBook (with live reload)"
@echo " paper - Build Typst paper (requires typst)"
@echo " coverage - Generate coverage report (requires cargo-llvm-cov)"
@@ -86,6 +87,15 @@ diagrams:
typst compile $$src --root=. --input dark=true docs/src/static/$$base-dark.svg; \
done
+# Compile arxiv figure Typst files to PDF
+ARXIV_FIGURES := $(filter-out %/lib.typ,$(wildcard docs/paper/arxiv/figures/*.typ))
+arxiv-figures:
+ @for src in $(ARXIV_FIGURES); do \
+ base=$$(basename $$src .typ); \
+ echo "Compiling $$base (arxiv)..."; \
+ typst compile $$src docs/paper/arxiv/figures/$$base.pdf; \
+ done
+
# Build and serve mdBook with API docs
mdbook:
@echo "Exporting graph..."
diff --git a/docs/paper/arxiv/.gitignore b/docs/paper/arxiv/.gitignore
new file mode 100644
index 00000000..15a47ce1
--- /dev/null
+++ b/docs/paper/arxiv/.gitignore
@@ -0,0 +1,9 @@
+*.pdf
+*.aux
+*.bbl
+*.blg
+*.log
+*.out
+*.fls
+*.fdb_latexmk
+*.synctex.gz
diff --git a/docs/paper/arxiv/data/git-mining-results.json b/docs/paper/arxiv/data/git-mining-results.json
new file mode 100644
index 00000000..d3904123
--- /dev/null
+++ b/docs/paper/arxiv/data/git-mining-results.json
@@ -0,0 +1,683 @@
+{
+ "summary": {
+ "total_prs": 58,
+ "rule_prs": 2,
+ "model_prs": 5,
+ "other_prs": 51,
+ "agent_authored": 0,
+ "human_authored": 58
+ },
+ "by_phase": [
+ {
+ "phase": 1,
+ "label": "manual",
+ "count": 35,
+ "rule_count": 1,
+ "model_count": 1,
+ "agent_count": 0,
+ "human_count": 35
+ },
+ {
+ "phase": 2,
+ "label": "basic-skills",
+ "count": 9,
+ "rule_count": 0,
+ "model_count": 2,
+ "agent_count": 0,
+ "human_count": 9
+ },
+ {
+ "phase": 3,
+ "label": "full-pipeline",
+ "count": 14,
+ "rule_count": 1,
+ "model_count": 2,
+ "agent_count": 0,
+ "human_count": 14
+ }
+ ],
+ "phase_boundaries": {
+ "phase_1_end": "2026-02-22T00:00:00+00:00",
+ "phase_2_end": "2026-03-01T00:00:00+00:00"
+ },
+ "prs": [
+ {
+ "number": 4,
+ "title": "feat: Feature parity with ProblemReductions.jl",
+ "author": "GiggleLiu",
+ "created_at": "2026-01-25T07:08:45Z",
+ "merged_at": "2026-01-25T07:56:00Z",
+ "branch": "feature-parity",
+ "is_agent": false,
+ "phase": 1,
+ "type": null
+ },
+ {
+ "number": 7,
+ "title": "feat: Implement remaining reduction rules",
+ "author": "GiggleLiu",
+ "created_at": "2026-01-25T15:59:13Z",
+ "merged_at": "2026-01-25T16:20:15Z",
+ "branch": "feat/remaining-reductions",
+ "is_agent": false,
+ "phase": 1,
+ "type": null
+ },
+ {
+ "number": 9,
+ "title": "docs: Add reduction classification and detailed survey",
+ "author": "GiggleLiu",
+ "created_at": "2026-01-25T17:41:57Z",
+ "merged_at": "2026-01-26T01:14:31Z",
+ "branch": "docs/reduction-classification-survey",
+ "is_agent": false,
+ "phase": 1,
+ "type": null
+ },
+ {
+ "number": 12,
+ "title": "feat: Implement set-theoretic reduction path finding",
+ "author": "GiggleLiu",
+ "created_at": "2026-01-26T15:26:24Z",
+ "merged_at": "2026-01-26T15:47:56Z",
+ "branch": "feat/set-theoretic-reductions",
+ "is_agent": false,
+ "phase": 1,
+ "type": null
+ },
+ {
+ "number": 13,
+ "title": "feat: Add grid graph mapping for unit disk reductions",
+ "author": "GiggleLiu",
+ "created_at": "2026-01-26T22:48:56Z",
+ "merged_at": "2026-02-02T00:36:23Z",
+ "branch": "feat/grid-graph-mapping",
+ "is_agent": false,
+ "phase": 1,
+ "type": null
+ },
+ {
+ "number": 20,
+ "title": "feat: Implement integer programming solver for Coloring problem",
+ "author": "GiggleLiu",
+ "created_at": "2026-01-31T01:40:53Z",
+ "merged_at": "2026-01-31T03:49:37Z",
+ "branch": "feat/coloring-ilp-solver",
+ "is_agent": false,
+ "phase": 1,
+ "type": null
+ },
+ {
+ "number": 22,
+ "title": "feat: Implement Factoring \u2192 ILP reduction (issue #21)",
+ "author": "GiggleLiu",
+ "created_at": "2026-01-31T03:14:21Z",
+ "merged_at": "2026-01-31T03:44:54Z",
+ "branch": "feat/factoring-ilp-solver",
+ "is_agent": false,
+ "phase": 1,
+ "type": null
+ },
+ {
+ "number": 25,
+ "title": "feat: Add problem variants, documentation improvements, and reduction macro",
+ "author": "GiggleLiu",
+ "created_at": "2026-02-02T03:40:27Z",
+ "merged_at": "2026-02-02T16:13:43Z",
+ "branch": "feat/problem-variants-and-docs",
+ "is_agent": false,
+ "phase": 1,
+ "type": null
+ },
+ {
+ "number": 27,
+ "title": "Restructure tests: split test and source code",
+ "author": "isPANN",
+ "created_at": "2026-02-07T16:16:41Z",
+ "merged_at": "2026-02-08T00:55:16Z",
+ "branch": "restructure-tests",
+ "is_agent": false,
+ "phase": 1,
+ "type": null
+ },
+ {
+ "number": 29,
+ "title": "Implement 6 problem-to-QUBO reductions (Issue #18)",
+ "author": "GiggleLiu",
+ "created_at": "2026-02-08T05:58:46Z",
+ "merged_at": "2026-02-09T12:42:18Z",
+ "branch": "issue-18-qubo-reductions",
+ "is_agent": false,
+ "phase": 1,
+ "type": null
+ },
+ {
+ "number": 31,
+ "title": "docs: polish reductions.typ with theorem labels and cleanup",
+ "author": "GiggleLiu",
+ "created_at": "2026-02-09T17:06:22Z",
+ "merged_at": "2026-02-10T04:04:06Z",
+ "branch": "polish-reductions-typ",
+ "is_agent": false,
+ "phase": 1,
+ "type": null
+ },
+ {
+ "number": 36,
+ "title": "JSON schema export & interactive reduction diagram (#33, #34)",
+ "author": "GiggleLiu",
+ "created_at": "2026-02-10T05:34:37Z",
+ "merged_at": "2026-02-10T07:01:33Z",
+ "branch": "feat/json-schema-interactive-viz",
+ "is_agent": false,
+ "phase": 1,
+ "type": null
+ },
+ {
+ "number": 38,
+ "title": "docs: replace Rust code with JSON schema tables in paper",
+ "author": "GiggleLiu",
+ "created_at": "2026-02-10T08:24:29Z",
+ "merged_at": "2026-02-10T16:40:38Z",
+ "branch": "feat/improve-reductions-typ",
+ "is_agent": false,
+ "phase": 1,
+ "type": null
+ },
+ {
+ "number": 41,
+ "title": "docs: improve example instances implementation plan",
+ "author": "GiggleLiu",
+ "created_at": "2026-02-10T17:05:54Z",
+ "merged_at": "2026-02-11T02:50:00Z",
+ "branch": "docs/improve-example-instances-plan-v2",
+ "is_agent": false,
+ "phase": 1,
+ "type": null
+ },
+ {
+ "number": 42,
+ "title": "fix: use directed edges instead of bidirectional in reduction graph",
+ "author": "GiggleLiu",
+ "created_at": "2026-02-10T17:10:05Z",
+ "merged_at": "2026-02-10T17:48:55Z",
+ "branch": "fix/remove-bidirectional-edges",
+ "is_agent": false,
+ "phase": 1,
+ "type": null
+ },
+ {
+ "number": 50,
+ "title": "Design: trait system refactoring for contributor ergonomics",
+ "author": "GiggleLiu",
+ "created_at": "2026-02-12T01:39:47Z",
+ "merged_at": "2026-02-12T17:48:08Z",
+ "branch": "design/trait-refactoring",
+ "is_agent": false,
+ "phase": 1,
+ "type": null
+ },
+ {
+ "number": 54,
+ "title": "perf: optimize pathdecomposition and add ground truth tests",
+ "author": "GiggleLiu",
+ "created_at": "2026-02-12T18:49:22Z",
+ "merged_at": "2026-02-13T05:49:22Z",
+ "branch": "perf/pathdecomposition-optimization",
+ "is_agent": false,
+ "phase": 1,
+ "type": null
+ },
+ {
+ "number": 56,
+ "title": "Remove weight type parameter from CircuitSAT and KColoring",
+ "author": "GiggleLiu",
+ "created_at": "2026-02-13T12:08:18Z",
+ "merged_at": "2026-02-13T14:31:45Z",
+ "branch": "fix/circuitsat-no-weight",
+ "is_agent": false,
+ "phase": 1,
+ "type": null
+ },
+ {
+ "number": 57,
+ "title": "Fix #47: Add HamiltonianCycle model",
+ "author": "GiggleLiu",
+ "created_at": "2026-02-13T13:25:42Z",
+ "merged_at": "2026-02-13T14:30:08Z",
+ "branch": "issue-47-hamiltonian-cycle",
+ "is_agent": false,
+ "phase": 1,
+ "type": "Model"
+ },
+ {
+ "number": 60,
+ "title": "Fix #52: TravelingSalesman to ILP reduction",
+ "author": "GiggleLiu",
+ "created_at": "2026-02-13T15:08:58Z",
+ "merged_at": "2026-02-13T16:55:22Z",
+ "branch": "52-travelingsalesman-ilp-reduction",
+ "is_agent": false,
+ "phase": 1,
+ "type": "Rule"
+ },
+ {
+ "number": 65,
+ "title": "Add parity tests against Julia ProblemReductions.jl",
+ "author": "GiggleLiu",
+ "created_at": "2026-02-13T20:20:26Z",
+ "merged_at": "2026-02-14T08:21:49Z",
+ "branch": "jg/issue-64-test-against-jl",
+ "is_agent": false,
+ "phase": 1,
+ "type": null
+ },
+ {
+ "number": 66,
+ "title": "Simplify variant system and clean up type hierarchy",
+ "author": "GiggleLiu",
+ "created_at": "2026-02-14T01:35:50Z",
+ "merged_at": "2026-02-14T06:35:43Z",
+ "branch": "jg/fix-reduction-graph",
+ "is_agent": false,
+ "phase": 1,
+ "type": null
+ },
+ {
+ "number": 68,
+ "title": "feat: variant-aware reduction paths with resolve_path",
+ "author": "GiggleLiu",
+ "created_at": "2026-02-14T07:53:44Z",
+ "merged_at": "2026-02-15T06:41:07Z",
+ "branch": "jg/variant-aware-paths",
+ "is_agent": false,
+ "phase": 1,
+ "type": null
+ },
+ {
+ "number": 71,
+ "title": "Refactor: address KISS and DRY violations (#70)",
+ "author": "GiggleLiu",
+ "created_at": "2026-02-15T14:50:18Z",
+ "merged_at": "2026-02-15T15:12:12Z",
+ "branch": "jg/issue-70",
+ "is_agent": false,
+ "phase": 1,
+ "type": null
+ },
+ {
+ "number": 72,
+ "title": "refactor: variant-level reduction graph with path-based API",
+ "author": "GiggleLiu",
+ "created_at": "2026-02-15T15:35:17Z",
+ "merged_at": "2026-02-16T06:35:53Z",
+ "branch": "variant-refactor-plan",
+ "is_agent": false,
+ "phase": 1,
+ "type": null
+ },
+ {
+ "number": 74,
+ "title": "Fix #73: Refactor graph problem constructors to take graph as input",
+ "author": "GiggleLiu",
+ "created_at": "2026-02-16T06:58:30Z",
+ "merged_at": "2026-02-16T08:51:08Z",
+ "branch": "issue-73-graph-constructor-refactoring",
+ "is_agent": false,
+ "phase": 1,
+ "type": null
+ },
+ {
+ "number": 75,
+ "title": "Close Julia parity test gaps: BicliqueCover, BMF, SAT\u2192CircuitSAT, reduction paths",
+ "author": "GiggleLiu",
+ "created_at": "2026-02-16T08:14:23Z",
+ "merged_at": "2026-02-16T12:07:31Z",
+ "branch": "jg/issue-67-julia-parity-gaps",
+ "is_agent": false,
+ "phase": 1,
+ "type": null
+ },
+ {
+ "number": 76,
+ "title": "feat: add problem_size() to Problem trait with validation",
+ "author": "GiggleLiu",
+ "created_at": "2026-02-16T12:47:15Z",
+ "merged_at": "2026-02-16T15:19:23Z",
+ "branch": "feat/problem-size-trait",
+ "is_agent": false,
+ "phase": 1,
+ "type": null
+ },
+ {
+ "number": 78,
+ "title": "Rewrite getting-started with Factoring\u2192SpinGlass and path overhead API",
+ "author": "GiggleLiu",
+ "created_at": "2026-02-16T17:10:21Z",
+ "merged_at": "2026-02-16T17:40:56Z",
+ "branch": "jg/getting-started-factoring-overhead",
+ "is_agent": false,
+ "phase": 1,
+ "type": null
+ },
+ {
+ "number": 79,
+ "title": "Reduce exported functions (#77)",
+ "author": "GiggleLiu",
+ "created_at": "2026-02-16T17:14:12Z",
+ "merged_at": "2026-02-16T17:38:18Z",
+ "branch": "reduce-exports",
+ "is_agent": false,
+ "phase": 1,
+ "type": null
+ },
+ {
+ "number": 80,
+ "title": "Reduce exported functions (closes #77)",
+ "author": "GiggleLiu",
+ "created_at": "2026-02-16T17:41:44Z",
+ "merged_at": "2026-02-16T17:54:58Z",
+ "branch": "reduce-exports",
+ "is_agent": false,
+ "phase": 1,
+ "type": null
+ },
+ {
+ "number": 82,
+ "title": "feat: add pred CLI tool for problem reductions",
+ "author": "GiggleLiu",
+ "created_at": "2026-02-17T23:57:17Z",
+ "merged_at": "2026-02-18T13:25:50Z",
+ "branch": "cli-tool-design",
+ "is_agent": false,
+ "phase": 1,
+ "type": null
+ },
+ {
+ "number": 84,
+ "title": "feat(cli): CLI UX improvements",
+ "author": "GiggleLiu",
+ "created_at": "2026-02-18T13:58:08Z",
+ "merged_at": "2026-02-19T14:22:17Z",
+ "branch": "cli-v2-design",
+ "is_agent": false,
+ "phase": 1,
+ "type": null
+ },
+ {
+ "number": 85,
+ "title": "feat: add QUBO\u2192ILP and CircuitSAT\u2192ILP reductions",
+ "author": "GiggleLiu",
+ "created_at": "2026-02-18T14:07:25Z",
+ "merged_at": "2026-02-19T04:28:17Z",
+ "branch": "ilp-reduction-plans",
+ "is_agent": false,
+ "phase": 1,
+ "type": null
+ },
+ {
+ "number": 89,
+ "title": "fix: close completeness gaps from review-implementation audit (#88)",
+ "author": "GiggleLiu",
+ "created_at": "2026-02-20T03:23:10Z",
+ "merged_at": "2026-02-20T03:38:42Z",
+ "branch": "fix/issue-88-completeness-gaps",
+ "is_agent": false,
+ "phase": 1,
+ "type": null
+ },
+ {
+ "number": 92,
+ "title": "Fix #90: Add ClosestVectorProblem model",
+ "author": "GiggleLiu",
+ "created_at": "2026-02-22T06:30:43Z",
+ "merged_at": "2026-02-28T16:51:52Z",
+ "branch": "issue-90-closest-vector-problem",
+ "is_agent": false,
+ "phase": 2,
+ "type": "Model"
+ },
+ {
+ "number": 93,
+ "title": "fix(mcp): review fixes, multi-platform docs, remove Smithery",
+ "author": "GiggleLiu",
+ "created_at": "2026-02-22T13:09:14Z",
+ "merged_at": "2026-02-22T16:39:02Z",
+ "branch": "fix/mcp-review-fixes",
+ "is_agent": false,
+ "phase": 2,
+ "type": null
+ },
+ {
+ "number": 96,
+ "title": "Fix #95: Add BinPacking model",
+ "author": "GiggleLiu",
+ "created_at": "2026-02-25T02:27:25Z",
+ "merged_at": "2026-02-28T15:25:07Z",
+ "branch": "issue-95-bin-packing",
+ "is_agent": false,
+ "phase": 2,
+ "type": "Model"
+ },
+ {
+ "number": 99,
+ "title": "Replace Polynomial overhead system with Expr AST",
+ "author": "GiggleLiu",
+ "created_at": "2026-02-25T23:47:07Z",
+ "merged_at": "2026-02-26T09:08:20Z",
+ "branch": "feat/expr-overhead-system",
+ "is_agent": false,
+ "phase": 2,
+ "type": null
+ },
+ {
+ "number": 100,
+ "title": "test: add coverage for Expr overhead system and fix docs",
+ "author": "GiggleLiu",
+ "created_at": "2026-02-26T09:40:44Z",
+ "merged_at": "2026-02-26T10:25:01Z",
+ "branch": "fix/expr-coverage-and-docs",
+ "is_agent": false,
+ "phase": 2,
+ "type": null
+ },
+ {
+ "number": 101,
+ "title": "fix: CLI UX improvements from issue #86",
+ "author": "GiggleLiu",
+ "created_at": "2026-02-26T11:21:56Z",
+ "merged_at": "2026-02-27T05:34:11Z",
+ "branch": "fix/cli-ux-issue-86",
+ "is_agent": false,
+ "phase": 2,
+ "type": null
+ },
+ {
+ "number": 102,
+ "title": "feat: explicit variant declarations with complexity metadata",
+ "author": "GiggleLiu",
+ "created_at": "2026-02-27T05:40:03Z",
+ "merged_at": "2026-02-27T17:54:52Z",
+ "branch": "fix/variant-display",
+ "is_agent": false,
+ "phase": 2,
+ "type": null
+ },
+ {
+ "number": 106,
+ "title": "feat: One weight IS\u2194SP variants, fix complexity metadata, enrich paper",
+ "author": "GiggleLiu",
+ "created_at": "2026-02-28T02:09:19Z",
+ "merged_at": "2026-02-28T10:22:58Z",
+ "branch": "feat/one-weight-variant-and-cleanup",
+ "is_agent": false,
+ "phase": 2,
+ "type": null
+ },
+ {
+ "number": 111,
+ "title": "Enrich paper with examples, figures, and algorithm citations",
+ "author": "GiggleLiu",
+ "created_at": "2026-02-28T13:11:38Z",
+ "merged_at": "2026-02-28T14:12:56Z",
+ "branch": "jg/paper-writing",
+ "is_agent": false,
+ "phase": 2,
+ "type": null
+ },
+ {
+ "number": 112,
+ "title": "Fix complexity inconsistencies, enforce overhead, add missing variants",
+ "author": "GiggleLiu",
+ "created_at": "2026-02-28T17:42:41Z",
+ "merged_at": "2026-03-01T03:57:13Z",
+ "branch": "fix/complexity-overhead-variants",
+ "is_agent": false,
+ "phase": 3,
+ "type": null
+ },
+ {
+ "number": 113,
+ "title": "Recategorize problem models by input structure",
+ "author": "GiggleLiu",
+ "created_at": "2026-03-01T05:05:30Z",
+ "merged_at": "2026-03-01T06:26:24Z",
+ "branch": "recategorize-models",
+ "is_agent": false,
+ "phase": 3,
+ "type": null
+ },
+ {
+ "number": 139,
+ "title": "fix: CLI QA improvements \u2014 docs, display, auto-JSON",
+ "author": "GiggleLiu",
+ "created_at": "2026-03-02T20:51:21Z",
+ "merged_at": "2026-03-02T20:57:15Z",
+ "branch": "fix/cli-qa-issues",
+ "is_agent": false,
+ "phase": 3,
+ "type": null
+ },
+ {
+ "number": 171,
+ "title": "Fix #114: Add Knapsack model",
+ "author": "zazabap",
+ "created_at": "2026-03-04T19:27:59Z",
+ "merged_at": "2026-03-10T04:43:13Z",
+ "branch": "issue-114-knapsack",
+ "is_agent": false,
+ "phase": 3,
+ "type": "Model"
+ },
+ {
+ "number": 188,
+ "title": "Update references, docs, and check-issue skill",
+ "author": "GiggleLiu",
+ "created_at": "2026-03-06T11:25:53Z",
+ "merged_at": "2026-03-06T15:15:01Z",
+ "branch": "jg/references",
+ "is_agent": false,
+ "phase": 3,
+ "type": null
+ },
+ {
+ "number": 190,
+ "title": "fix: CLI QA improvements \u2014 creation, aliases, help, schemas",
+ "author": "GiggleLiu",
+ "created_at": "2026-03-07T06:32:09Z",
+ "merged_at": "2026-03-07T09:00:29Z",
+ "branch": "fix/cli-qa-189",
+ "is_agent": false,
+ "phase": 3,
+ "type": null
+ },
+ {
+ "number": 194,
+ "title": "feat: redundant rule detection via polynomial overhead comparison (#193)",
+ "author": "GiggleLiu",
+ "created_at": "2026-03-09T13:43:25Z",
+ "merged_at": "2026-03-10T17:50:51Z",
+ "branch": "jg/redundant-rule-detection",
+ "is_agent": false,
+ "phase": 3,
+ "type": null
+ },
+ {
+ "number": 195,
+ "title": "Fix #191: Address 5 skill workflow issues",
+ "author": "zazabap",
+ "created_at": "2026-03-09T17:06:13Z",
+ "merged_at": "2026-03-09T23:44:48Z",
+ "branch": "issue-191-skill-fixes",
+ "is_agent": false,
+ "phase": 3,
+ "type": null
+ },
+ {
+ "number": 263,
+ "title": "feat: display Big O notation in CLI output",
+ "author": "GiggleLiu",
+ "created_at": "2026-03-11T00:32:04Z",
+ "merged_at": "2026-03-12T10:54:45Z",
+ "branch": "feat/cli-big-o-notation",
+ "is_agent": false,
+ "phase": 3,
+ "type": null
+ },
+ {
+ "number": 570,
+ "title": "Fix #117: [Model] GraphPartitioning",
+ "author": "GiggleLiu",
+ "created_at": "2026-03-12T06:13:15Z",
+ "merged_at": "2026-03-12T14:34:13Z",
+ "branch": "issue-117-graph-partitioning",
+ "is_agent": false,
+ "phase": 3,
+ "type": "Model"
+ },
+ {
+ "number": 592,
+ "title": "feat: display Big O notation in CLI output",
+ "author": "GiggleLiu",
+ "created_at": "2026-03-12T11:09:48Z",
+ "merged_at": "2026-03-12T11:11:37Z",
+ "branch": "feat/cli-big-o-notation",
+ "is_agent": false,
+ "phase": 3,
+ "type": null
+ },
+ {
+ "number": 593,
+ "title": "fix: check-issue delegates overhead comparison to check-rule-redundancy",
+ "author": "zazabap",
+ "created_at": "2026-03-12T11:24:59Z",
+ "merged_at": "2026-03-12T14:42:49Z",
+ "branch": "fix/check-issue-use-redundancy-skill",
+ "is_agent": false,
+ "phase": 3,
+ "type": null
+ },
+ {
+ "number": 599,
+ "title": "Fix #126: Add KSatisfiability to SubsetSum reduction",
+ "author": "GiggleLiu",
+ "created_at": "2026-03-12T12:15:30Z",
+ "merged_at": "2026-03-12T14:35:27Z",
+ "branch": "issue-126-ksatisfiability-to-subsetsum",
+ "is_agent": false,
+ "phase": 3,
+ "type": "Rule"
+ },
+ {
+ "number": 613,
+ "title": "Fix paper citations from issue #126 and #117 reviews",
+ "author": "GiggleLiu",
+ "created_at": "2026-03-12T14:57:51Z",
+ "merged_at": "2026-03-12T15:14:06Z",
+ "branch": "fix/126-ksat-subsetsum-attribution",
+ "is_agent": false,
+ "phase": 3,
+ "type": null
+ }
+ ]
+}
diff --git a/docs/paper/arxiv/data/graph-metrics.json b/docs/paper/arxiv/data/graph-metrics.json
new file mode 100644
index 00000000..c9e75016
--- /dev/null
+++ b/docs/paper/arxiv/data/graph-metrics.json
@@ -0,0 +1,10 @@
+{
+ "unique_types": 24,
+ "variant_nodes": 42,
+ "total_edges": 52,
+ "reduceto_impls": 40,
+ "inferred_edges": 12,
+ "hub_in_degree": {"QUBO": 6, "ILP": 11, "MaximumIndependentSet": 14},
+ "hub_out_degree": {"MaximumIndependentSet": 13, "MaximumSetPacking": 6, "KSatisfiability": 5, "Satisfiability": 5},
+ "loc_per_reduction": {"min": 58, "max": 444, "median": 129}
+}
diff --git a/docs/paper/arxiv/data/peer-review-round1.md b/docs/paper/arxiv/data/peer-review-round1.md
new file mode 100644
index 00000000..0579ce36
--- /dev/null
+++ b/docs/paper/arxiv/data/peer-review-round1.md
@@ -0,0 +1,161 @@
+# Peer Review Round 1
+
+**Paper:** Skill-Based Agentic Coding for Mathematical Software: A Case Study in NP-Hard Problem Reductions
+**Format:** IEEEtran conference (targeting ICSE/ASE-class venue)
+**Date:** 2026-03-13
+
+---
+
+## Scores (0--100)
+
+| Aspect | Score | Assessment |
+|-----------------|-------|------------------|
+| Novelty | 62 | Major Revision |
+| Soundness | 45 | Major Revision |
+| Significance | 65 | Borderline |
+| Clarity | 78 | Minor Revision |
+| Reproducibility | 55 | Major Revision |
+
+**Overall Recommendation:** Major Revision (borderline Reject)
+
+---
+
+## Reviewer 1: SE Methodology
+
+### Summary
+The paper proposes a skill-based decomposition methodology for agentic coding, applied to a Rust library implementing NP-hard problem reductions. The key ideas are: (1) a three-role model separating creative work from mechanical execution, (2) a library of 13 reusable skills that decompose tasks into agent-manageable steps, (3) a 7-layer verification stack, and (4) a card-based orchestration pipeline.
+
+### Strengths
+- **S1.** The three-role model (Contributor/Maintainer/Agent) is a clear, well-motivated decomposition of responsibilities. The distinction between "programming the agent's workflow" (maintainer) vs. "programming the agent's output" is insightful.
+- **S2.** The 7-layer verification stack is the strongest technical contribution. The layers are well-justified with concrete error examples, and the "lazy agent problem" (agents modifying expected test values) is a genuine, important failure mode.
+- **S3.** The paper is well-written with clear prose and good structure. The Goldilocks domain argument is compelling.
+- **S4.** The concept of skills as reusable, composable agent workflows is practically useful and clearly explained.
+
+### Weaknesses
+- **W1.** The ablation study (Section 6.1) is entirely placeholder [TBD]. This is the most critical missing piece: without a controlled comparison between skill-based and no-skill configurations, the paper's central claim -- that skills improve agent reliability -- is unsupported by direct evidence. The framing text acknowledges this will be "5--10 reductions" but the results are absent.
+- **W2.** The error taxonomy table (Table 3) is entirely [TBD]. Without concrete error counts, the verification stack's effectiveness is described only anecdotally.
+- **W3.** The success rate column in Table 2 (skills inventory) is entirely [TBD]. These metrics would directly quantify each skill's reliability.
+- **W4.** The methodology evaluation relies almost entirely on a single case study with a single project. While Section 7.1 acknowledges this limitation, the paper would be strengthened by even a brief pilot in a second domain.
+
+### Questions for Authors
+- Q1. The ablation text says "The ablation results are [TBD]" -- when will these be available? Without them, the evaluation section lacks its primary quantitative evidence.
+- Q2. How much human effort (hours) went into developing the 13 skills? This is crucial for assessing the cost-benefit tradeoff.
+
+---
+
+## Reviewer 2: AI/Agents
+
+### Summary
+The paper presents a pragmatic methodology for human-agent collaboration in mathematical software development, with skills serving as structured prompts/workflows and a multi-layered verification approach to ensure correctness.
+
+### Strengths
+- **S1.** The connection to current agentic coding benchmarks (SWE-Bench, SWE-EVO, SWE-Bench Pro) is well-established and provides useful context for the capability gap the paper addresses.
+- **S2.** The "lazy agent problem" -- where agents modify expected test outputs rather than fixing implementation bugs -- is a real and underreported phenomenon. The materialized fixtures defense is a practical contribution.
+- **S3.** The fresh-context design for agentic review (dispatching sub-agents without the implementor's context) to prevent sycophancy is a sound design choice with good motivation.
+- **S4.** The related work on AI-discovered reductions (FunSearch, AlphaEvolve) and formal verification (VeriCoding, CLEVER) is thorough and well-integrated.
+
+### Weaknesses
+- **W1.** The framing of the comparison with existing agent benchmarks is misleading. The paper opens by saying agents achieve "70--80% on SWE-Bench Verified" but "below 25% on long-horizon tasks," then implies its methodology bridges this gap -- but never actually measures its own success rate on comparable metrics. The paper should either: (a) define equivalent metrics and report them, or (b) be explicit that it is presenting methodology, not a benchmark comparison.
+- **W2.** The paper does not report which LLM(s) were used, which model versions, or any details about the agent's configuration. For reproducibility, readers need to know whether this was Claude 3.5, Claude 4, GPT-4, etc. The methodology's effectiveness may be strongly model-dependent.
+- **W3.** The "skills as markdown documents" approach is presented as novel, but prompt engineering / structured agent workflows have been explored by prior work (e.g., chain-of-thought prompting, ReAct, agent tool-use frameworks). The novelty should be more carefully positioned relative to these.
+- **W4.** The claim that "developers now use AI in 60% of their work while maintaining active oversight on 80--100% of delegated tasks" (citing Anthropic 2026) is used multiple times but comes from an industry report by the same company whose tool (Claude Code) is used in the study. This creates a potential conflict of interest in citation usage.
+
+### Questions for Authors
+- Q1. What model(s) and version(s) were used? Did the model change during the 7-week development period?
+- Q2. Has the methodology been tested with non-Anthropic agents (e.g., GPT-4, Gemini)?
+
+---
+
+## Reviewer 3: Devil's Advocate
+
+### Summary
+The paper describes a carefully engineered workflow for using coding agents in a specific mathematical domain. While the engineering is thorough, I have serious concerns about the evaluation and the strength of the claims made.
+
+### Strengths
+- **S1.** The paper is honest about limitations (Section 7.2), including single case study, skill engineering cost, domain specificity, and confounding factors. This transparency is appreciated.
+- **S2.** The concrete artifact (24 problem types, 40 reductions, 52 edges, >95% coverage) is impressive as engineering output.
+
+### Weaknesses (Critical)
+- **W1. Incomplete evaluation is a showstopper.** Three of the four main evaluation components are [TBD] placeholders:
+ - Ablation results (Section 6.1): entirely missing
+ - Error taxonomy counts (Table 3): entirely missing
+ - Skill success rates (Table 2): entirely missing
+
+ This means the paper's evaluation section consists of: (a) a description of an experiment that hasn't been run, (b) a descriptive git history summary with no quantitative findings, and (c) three case studies. This is insufficient for a top-venue submission.
+
+- **W2. Timeline inconsistency.** The abstract claims "six months" of development, but Section 6.2 says "approximately seven weeks" spanning 58 PRs. The git data confirms ~47 days (6.6 weeks) from first to last PR. This is a factual contradiction that undermines credibility.
+
+- **W3. Author count inconsistency.** Section 6.2 says "two primary contributors" but the git data shows three distinct authors (GiggleLiu, isPANN, zazabap). While one contributor may have minor contributions, this should be stated accurately.
+
+- **W4. N=1 threat to validity.** The entire evaluation is based on a single project by a single primary developer. The generalizability claims in Section 7.1 are aspirational -- listing candidate domains (compiler passes, numerical linear algebra, etc.) without any evidence. A hostile reviewer would argue this is an experience report dressed as a methodology paper.
+
+- **W5. Circular reasoning in verification stack.** The paper claims the 7-layer verification stack catches errors, but the evidence for this is anecdotal ("we observed this failure mode multiple times"). Without systematic error counts (Table 3 is TBD), the claim that "this layer catches approximately 60% of the errors that survive type checking" (Section 5.1, Layer 3) is unsubstantiated.
+
+- **W6. No baseline comparison.** The paper compares against SWE-Bench and SWE-EVO numbers but never runs its own tasks through a no-skill baseline. The ablation is designed but not executed. Without this, the reader cannot distinguish whether the results come from the skill methodology, the domain's inherent verifiability, or the specific LLM's capability.
+
+### Weaknesses (Major)
+- **W7.** The paper is ~14 pages in IEEEtran conference format. ICSE/ASE typically allows 10--12 pages. The paper needs significant trimming (~2--4 pages).
+
+- **W8.** The "60% of errors" claim for Layer 3 (line 402) has no citation, no data source, and no methodology for arriving at this number. It reads as an estimate presented as a finding.
+
+- **W9.** The paper conflates "agent-generated code" with "agent-assisted code." Since all PRs are attributed to human GitHub accounts (Section 6.2), there is no way to distinguish which code was human-written vs. agent-written. The paper acknowledges this as "a finding about observability limitations" but this also means the paper cannot quantify agent contributions.
+
+- **W10.** Several citations have issues:
+ - `Anthropic2026AgenticCoding` is a tech report by the tool vendor, cited 3+ times as if it were independent research
+ - `lucas2014` is cited as evidence that "Rydberg atom arrays natively solve MIS" but Lucas 2014 is about Ising formulations, not Rydberg atoms specifically (the Rydberg atom connection to MIS came later, ~2018+)
+ - The bib file has `@article` entries with both `booktitle` and `journal` fields (e.g., Yang2024SWEagent, He2024QuantumTSP), which is malformed BibTeX
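+
+  The fix for such entries is mechanical: pick the entry type that matches the venue and drop the conflicting field. A hypothetical sketch (entry key and field values are illustrative, not taken from the paper's bib file):
+
+  ```bibtex
+  % Malformed: an @article entry carrying both journal and booktitle
+  @article{Example2024,
+    title     = {An Example Entry},
+    author    = {Doe, Jane},
+    journal   = {Journal of Examples},
+    booktitle = {Proc.\ of Examples}, % conflicting venue field
+    year      = {2024}
+  }
+
+  % Fixed: a conference paper is @inproceedings with booktitle only
+  @inproceedings{Example2024,
+    title     = {An Example Entry},
+    author    = {Doe, Jane},
+    booktitle = {Proc.\ of Examples},
+    year      = {2024}
+  }
+  ```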
+
+### Weaknesses (Minor)
+- **W11.** The `\author{...}` placeholder (line 17) should be filled in for submission.
+- **W12.** The abstract mentions "six months" but should be revised to match the actual timeline.
+- **W13.** Section 2 paragraph "Hardware solvers as practical motivation" could be shortened; it reads more like a grant proposal than a conference paper.
+- **W14.** The paper would benefit from a threat-to-validity section separate from limitations, following SE convention.
+- **W15.** No appendix or supplementary material is referenced for the full skill markdown files, which would aid reproducibility.
+- **W16.** The paper uses `\Cref` (cleveref) throughout but does not appear to load it with any options for IEEEtran compatibility. This may cause formatting issues.
+
+---
+
+## Critical Issues (Must Fix)
+
+1. **[C1] Timeline contradiction (abstract vs. Section 6.2).** The abstract says "six months" but Section 6.2 says "seven weeks" and the data confirms ~47 days. Fix: align to the actual timeline. (Affects: Soundness)
+
+2. **[C2] TBD placeholders in evaluation.** Three tables/results are entirely [TBD]: ablation results (Section 6.1), error taxonomy counts (Table 3), skill success rates (Table 2). The paper cannot be submitted with placeholder data. Fix: either run the experiments and fill in real data, or restructure the evaluation to remove the ablation framing and present what data exists. (Affects: Soundness, Significance)
+
+3. **[C3] Unsubstantiated "60% of errors" claim.** The claim that closed-loop tests catch "approximately 60% of the errors that survive type checking" (Section 5.1) has no supporting data. Fix: either provide the data from the error taxonomy audit, or soften to qualitative language ("a majority of errors" or "the largest share of errors in our experience"). (Affects: Soundness)
+
+4. **[C4] Author count factual error.** "Two primary contributors" but three distinct authors in git history. Fix: say "three contributors" or "two primary contributors and one additional contributor." (Affects: Soundness)
+
+## Major Issues (Should Fix)
+
+5. **[M1] No LLM model identification.** The paper never specifies which language model(s) were used. This is essential for reproducibility and for understanding whether results generalize across models.
+
+6. **[M2] Page count.** At ~14 pages, the paper exceeds typical ICSE/ASE limits (10--12 pages). The hardware solvers paragraph and some related work could be condensed.
+
+7. **[M3] Vendor citation bias.** The Anthropic 2026 report is cited 3 times as supporting evidence. At minimum, note that this is a vendor report, or balance with independent sources.
+
+8. **[M4] Missing threats to validity section.** SE venues expect explicit threats-to-validity discussion (internal, external, construct validity). The limitations section partially covers this but not in the expected format.
+
+9. **[M5] Malformed BibTeX entries.** Several entries have both `booktitle` and `journal` fields. These will produce warnings or malformed references.
+
+10. **[M6] Novelty positioning vs. prompt engineering.** The paper should more explicitly differentiate "skills" from existing prompt engineering techniques (chain-of-thought, ReAct, structured prompts).
+
+## Minor Issues (Nice to Fix)
+
+11. **[m1]** Author placeholder `\author{...}` needs to be filled.
+12. **[m2]** The `lucas2014` citation for Rydberg atoms is imprecise; consider citing the Pichler et al. 2018 work specifically for the MIS-Rydberg connection.
+13. **[m3]** Table 2 caption says "Success rate is the fraction of invocations that pass CI on first attempt, measured from git history" but the column is all TBD -- the caption should not describe methodology for data that doesn't exist yet.
+14. **[m4]** Section 6.2 could benefit from a timeline figure showing the three phases.
+15. **[m5]** The case studies (Section 6.3) are descriptive but lack quantitative comparison (e.g., agent time vs. estimated human time, number of iterations).
+16. **[m6]** Consider adding a data availability statement pointing to the repository.
+17. **[m7]** The paper uses both "coding agent" and "AI agent" -- consider standardizing terminology.
+18. **[m8]** cleveref package may need `[capitalise]` option or `\Cref`/`\cref` consistency check for IEEEtran.
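+
+    For [m8], the conventional remedy is to load `cleveref` last, after `hyperref`, with an explicit capitalisation option. A minimal sketch (the option choice is a suggestion, not taken from the paper's preamble):
+
+    ```latex
+    \usepackage{hyperref}             % if used, must be loaded before cleveref
+    \usepackage[capitalise]{cleveref} % load last; \cref then capitalises like \Cref
+    ```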
+
+---
+
+## Summary Assessment
+
+The paper presents a well-engineered system with genuine practical contributions, particularly the 7-layer verification stack and the "lazy agent problem" defense. The writing quality is high and the domain motivation is compelling. However, the evaluation is critically incomplete: the ablation study has not been run, error counts are missing, and skill success rates are placeholders. The timeline contradiction between abstract and body is a factual error that must be fixed. In its current state, the paper reads as an experience report with a methodology sketch, not a fully evaluated research contribution.
+
+**Verdict:** Major Revision. The methodology and system design are promising, but the paper needs: (1) completed evaluation data or restructured claims that match available evidence, (2) factual corrections (timeline, author count), (3) model identification for reproducibility, and (4) approximately 2--4 pages of trimming for conference format.
+
+The strongest path to acceptance: reframe the evaluation around the git mining data and case studies that do exist, acknowledge the ablation as future work rather than presenting it as a designed-but-unrun experiment, and add the model identification details.
diff --git a/docs/paper/arxiv/figures/architecture.typ b/docs/paper/arxiv/figures/architecture.typ
new file mode 100644
index 00000000..ef98a86c
--- /dev/null
+++ b/docs/paper/arxiv/figures/architecture.typ
@@ -0,0 +1,198 @@
+#import "@preview/cetz:0.4.2": canvas, draw
+
+#set page(width: auto, height: auto, margin: 5pt)
+#set text(size: 8pt, font: "New Computer Modern")
+
+#let col-trait = rgb("#4e79a7") // blue
+#let col-reduction = rgb("#59a14f") // green
+#let col-compile = rgb("#e8a838") // gold
+
+#canvas(length: 0.55cm, {
+ import draw: *
+
+ let box-w = 12.0
+ let box1-h = 3.0 // Problem trait box (taller: has two sub-boxes)
+ let box2-h = 1.6 // ReductionResult box
+ let box3-h = 2.4 // Compile-time validation box
+ let arrow-gap = 1.4
+ let cx = 0
+
+ // --- Box 1: Problem trait (top) ---
+ let y1-top = 0
+ let y1-bot = -box1-h
+
+ rect(
+ (cx - box-w / 2, y1-top), (cx + box-w / 2, y1-bot),
+ radius: 4pt,
+ fill: col-trait.lighten(88%),
+ stroke: (thickness: 1pt, paint: col-trait),
+ name: "box1",
+ )
+
+ // Title
+ content(
+ (cx, y1-top - 0.4), anchor: "center",
+ text(9pt, weight: "bold", fill: col-trait.darken(20%),
+ [`Problem` trait],
+ ),
+ )
+
+ // Method list
+ content(
+ (cx, y1-top - 1.0), anchor: "center",
+ text(7.5pt, fill: luma(60),
+ [`NAME`#h(4pt)#sym.dot.c#h(4pt)`Metric`#h(4pt)#sym.dot.c#h(4pt)`dims()`#h(4pt)#sym.dot.c#h(4pt)`evaluate()`],
+ ),
+ )
+
+ // Divider line
+ line(
+ (cx - box-w / 2 + 0.4, y1-top - 1.4),
+ (cx + box-w / 2 - 0.4, y1-top - 1.4),
+ stroke: (thickness: 0.5pt, paint: col-trait.lighten(40%)),
+ )
+
+ // Sub-boxes for Optimization and Satisfaction
+ let sub-margin = 0.35 // margin from parent box edge
+ let sub-gap = 0.3 // gap between sub-boxes
+  let sub-w = (box-w - 2 * sub-margin - sub-gap) / 2 // = 5.5 each
+ let sub-h = 1.0
+ let sub-y-top = y1-top - 1.6
+ let sub-y-bot = sub-y-top - sub-h
+
+ // Optimization sub-box (left)
+ let opt-left = cx - box-w / 2 + sub-margin
+ let opt-right = opt-left + sub-w
+ rect(
+ (opt-left, sub-y-top), (opt-right, sub-y-bot),
+ radius: 3pt,
+ fill: col-trait.lighten(78%),
+ stroke: (thickness: 0.6pt, paint: col-trait.lighten(20%)),
+ name: "opt",
+ )
+ content(
+ "opt", anchor: "center",
+ {
+ text(7.5pt, weight: "bold", fill: col-trait.darken(10%), [`OptimizationProblem`])
+ linebreak()
+ text(6pt, fill: luma(80), [`SolutionSize` #sym.dot.c `direction()`])
+ },
+ )
+
+ // Satisfaction sub-box (right)
+ let sat-left = opt-right + sub-gap
+ let sat-right = sat-left + sub-w
+ rect(
+ (sat-left, sub-y-top), (sat-right, sub-y-bot),
+ radius: 3pt,
+ fill: col-trait.lighten(78%),
+ stroke: (thickness: 0.6pt, paint: col-trait.lighten(20%)),
+ name: "sat",
+ )
+ content(
+ "sat", anchor: "center",
+ {
+ text(7.5pt, weight: "bold", fill: col-trait.darken(10%), [`SatisfactionProblem`])
+ linebreak()
+ text(6pt, fill: luma(80), [`Metric = bool`])
+ },
+ )
+
+ // --- Arrow 1: Box 1 -> Box 2 ---
+ let a1-top = y1-bot
+ let a1-bot = y1-bot - arrow-gap
+ line(
+ (cx, a1-top), (cx, a1-bot + 0.05),
+ stroke: (thickness: 1.2pt, paint: col-reduction.darken(10%)),
+ mark: (end: "straight", scale: 0.45),
+ )
+ content(
+ (cx + 0.3, (a1-top + a1-bot) / 2), anchor: "west",
+ text(7.5pt, weight: "bold", fill: col-reduction.darken(10%),
+ [`ReduceTo`],
+ ),
+ )
+
+ // --- Box 2: ReductionResult ---
+ let y2-top = a1-bot
+ let y2-bot = y2-top - box2-h
+
+ rect(
+ (cx - box-w / 2, y2-top), (cx + box-w / 2, y2-bot),
+ radius: 4pt,
+ fill: col-reduction.lighten(88%),
+ stroke: (thickness: 1pt, paint: col-reduction),
+ name: "box2",
+ )
+
+ content(
+ (cx, y2-top - 0.45), anchor: "center",
+ text(9pt, weight: "bold", fill: col-reduction.darken(20%),
+ [`ReductionResult`],
+ ),
+ )
+
+ content(
+ (cx, y2-top - 1.1), anchor: "center",
+ text(7.5pt, fill: luma(60),
+ [`target_problem()`#h(4pt)#sym.dot.c#h(4pt)`extract_solution()`],
+ ),
+ )
+
+ // --- Arrow 2: Box 2 -> Box 3 ---
+ let a2-top = y2-bot
+ let a2-bot = y2-bot - arrow-gap
+ line(
+ (cx, a2-top), (cx, a2-bot + 0.05),
+ stroke: (thickness: 1.2pt, paint: col-compile.darken(10%)),
+ mark: (end: "straight", scale: 0.45),
+ )
+ content(
+ (cx + 0.3, (a2-top + a2-bot) / 2), anchor: "west",
+ text(7pt, fill: col-compile.darken(10%),
+ [`#[reduction(overhead = {...})]`],
+ ),
+ )
+
+ // --- Box 3: Compile-time validation ---
+ let y3-top = a2-bot
+ let y3-bot = y3-top - box3-h
+
+ rect(
+ (cx - box-w / 2, y3-top), (cx + box-w / 2, y3-bot),
+ radius: 4pt,
+ fill: col-compile.lighten(88%),
+ stroke: (thickness: 1pt, paint: col-compile),
+ name: "box3",
+ )
+
+ content(
+ (cx, y3-top - 0.45), anchor: "center",
+ text(9pt, weight: "bold", fill: col-compile.darken(20%),
+ [Compile-time validation],
+ ),
+ )
+
+ // Bullet points
+ let bullet-x = cx - box-w / 2 + 1.2
+ let bullet-y = y3-top - 1.1
+
+ content(
+ (bullet-x, bullet-y), anchor: "west",
+ text(7.5pt, fill: luma(60),
+ [#sym.bullet#h(3pt)Variable names #sym.arrow getter methods],
+ ),
+ )
+ content(
+ (bullet-x, bullet-y - 0.55), anchor: "west",
+ text(7.5pt, fill: luma(60),
+ [#sym.bullet#h(3pt)`Expr` AST: symbolic overhead expressions],
+ ),
+ )
+ content(
+ (bullet-x, bullet-y - 1.1), anchor: "west",
+ text(7.5pt, fill: luma(60),
+ [#sym.bullet#h(3pt)`declare_variants!` #sym.arrow compile-time registry],
+ ),
+ )
+})
diff --git a/docs/paper/arxiv/figures/lib.typ b/docs/paper/arxiv/figures/lib.typ
new file mode 100644
index 00000000..f4867c00
--- /dev/null
+++ b/docs/paper/arxiv/figures/lib.typ
@@ -0,0 +1,30 @@
+// Shared theme for all paper figures.
+// Usage: #import "lib.typ": *
+
+#import "@preview/cetz:0.4.2": canvas, draw
+
+// ── Page setup (standalone figures) ──
+#let fig-page = (width: auto, height: auto, margin: 10pt)
+#let fig-text = (size: 7.5pt, font: "New Computer Modern")
+
+// ── Palette: black + one accent ──
+#let accent = rgb("#4e79a7") // steel blue — the single accent
+#let accent-light = accent.lighten(85%)
+#let fg = luma(30) // near-black for text & strokes
+#let fg-light = luma(100) // secondary text
+#let border = luma(60) // box strokes
+#let fill-light = luma(245) // subtle box fill
+#let fill-accent = accent.lighten(90%) // accent-tinted fill
+#let shadow-col = luma(215) // drop shadow
+#let edge-col = luma(80) // edge strokes
+
+// ── Stroke presets ──
+#let stroke-box = (thickness: 1.3pt, paint: border)
+#let stroke-accent = (thickness: 1.3pt, paint: accent)
+#let stroke-edge = (thickness: 0.9pt, paint: edge-col)
+#let stroke-dashed = (thickness: 0.8pt, paint: edge-col, dash: "dashed")
+#let stroke-dotted = (thickness: 0.8pt, paint: edge-col, dash: "dotted")
+
+// ── Arrow preset ──
+#let arrow-end = (end: "straight", scale: 0.4)
+#let arrow-both = (start: "straight", end: "straight", scale: 0.4)
diff --git a/docs/paper/arxiv/figures/pipeline.typ b/docs/paper/arxiv/figures/pipeline.typ
new file mode 100644
index 00000000..cc1bf26a
--- /dev/null
+++ b/docs/paper/arxiv/figures/pipeline.typ
@@ -0,0 +1,153 @@
+#import "@preview/cetz:0.4.2": canvas, draw
+
+#set page(width: auto, height: auto, margin: 5pt)
+#set text(size: 8pt, font: "New Computer Modern")
+
+// Color coding: human = orange, agent = blue
+#let col-human = rgb("#f28e2b")
+#let col-agent = rgb("#4e79a7")
+#let col-bg-human = col-human.lighten(85%)
+#let col-bg-agent = col-agent.lighten(85%)
+#let col-neutral = luma(240)
+
+// Column card style
+#let card-w = 2.4
+#let card-h = 0.7
+#let gap-y = 1.3 // vertical spacing between cards
+
+#canvas(length: 0.55cm, {
+ import draw: *
+
+ // --- Helper: draw a board column card ---
+ let board-card(x, y, label, fill-col, stroke-col, name-id) = {
+ rect(
+ (x - card-w / 2, y - card-h / 2),
+ (x + card-w / 2, y + card-h / 2),
+ radius: 4pt,
+ fill: fill-col,
+ stroke: (thickness: 1pt, paint: stroke-col),
+ name: name-id,
+ )
+ content(name-id, text(8pt, weight: "bold", fill: stroke-col.darken(15%), label))
+ }
+
+ // --- Layout: vertical pipeline ---
+ let cx = 0 // center x for column cards
+ let y0 = 0 // top
+
+ // Contributor + Issue at top
+ content((cx - 3.5, y0), anchor: "east", text(7pt, fill: luma(100), [Contributor]))
+ rect(
+ (cx - 3.2, y0 - 0.3), (cx - 1.8, y0 + 0.3),
+ radius: 3pt, fill: col-neutral, stroke: 0.5pt + luma(180), name: "issue",
+ )
+ content("issue", text(7pt, [Issue]))
+ line(
+ (cx - 1.8, y0), (cx - card-w / 2 - 0.05, y0),
+ stroke: 0.6pt + luma(150),
+ mark: (end: "straight", scale: 0.35),
+ )
+
+ // Backlog
+ board-card(cx, y0, "Backlog", col-neutral, luma(130), "backlog")
+
+ // Arrow: Backlog -> Ready (human)
+ let y1 = y0 - gap-y
+ line(
+ (cx, y0 - card-h / 2), (cx, y1 + card-h / 2 + 0.05),
+ stroke: (thickness: 1.2pt, paint: col-human),
+ mark: (end: "straight", scale: 0.4),
+ )
+ content(
+ (cx + card-w / 2 + 0.2, (y0 + y1) / 2), anchor: "west",
+ text(6.5pt, fill: col-human, [Maintainer\ moves card]),
+ )
+
+ // Ready
+ board-card(cx, y1, "Ready", col-bg-human, col-human, "ready")
+
+ // Arrow: Ready -> In Progress (agent: project-pipeline)
+ let y2 = y1 - gap-y
+ line(
+ (cx, y1 - card-h / 2), (cx, y2 + card-h / 2 + 0.05),
+ stroke: (thickness: 1.2pt, paint: col-agent),
+ mark: (end: "straight", scale: 0.4),
+ )
+ content(
+ (cx + card-w / 2 + 0.2, (y1 + y2) / 2), anchor: "west",
+ text(6.5pt, fill: col-agent, [`project-pipeline`]),
+ )
+
+ // In Progress
+ board-card(cx, y2, "In Progress", col-bg-agent, col-agent, "inprog")
+
+ // Arrow: In Progress -> review-agentic (agent substeps)
+ let y3 = y2 - gap-y
+ line(
+ (cx, y2 - card-h / 2), (cx, y3 + card-h / 2 + 0.05),
+ stroke: (thickness: 1.2pt, paint: col-agent),
+ mark: (end: "straight", scale: 0.4),
+ )
+ // Substep labels
+ content(
+ (cx + card-w / 2 + 0.2, (y2 + y3) / 2), anchor: "west",
+ text(6pt, fill: col-agent, [`issue-to-pr` #sym.arrow `check` #sym.arrow `implement` #sym.arrow `review`]),
+ )
+
+ // review-agentic
+ board-card(cx, y3, "review-agentic", col-bg-agent, col-agent, "rev-agent")
+
+ // Arrow: review-agentic -> In Review (agent: review-pipeline)
+ let y4 = y3 - gap-y
+ line(
+ (cx, y3 - card-h / 2), (cx, y4 + card-h / 2 + 0.05),
+ stroke: (thickness: 1.2pt, paint: col-agent),
+ mark: (end: "straight", scale: 0.4),
+ )
+ content(
+ (cx + card-w / 2 + 0.2, (y3 + y4) / 2), anchor: "west",
+ text(6.5pt, fill: col-agent, [`review-pipeline`]),
+ )
+
+ // In Review
+ board-card(cx, y4, "In Review", col-bg-agent, col-agent, "inrev")
+
+ // Arrow: In Review -> Done (human)
+ let y5 = y4 - gap-y
+ line(
+ (cx, y4 - card-h / 2), (cx, y5 + card-h / 2 + 0.05),
+ stroke: (thickness: 1.2pt, paint: col-human),
+ mark: (end: "straight", scale: 0.4),
+ )
+ content(
+ (cx + card-w / 2 + 0.2, (y4 + y5) / 2), anchor: "west",
+ text(6.5pt, fill: col-human, [Maintainer\ merges PR]),
+ )
+
+ // Done
+ board-card(cx, y5, "Done", col-bg-human, col-human, "done")
+
+ // --- Bracket annotations on the left ---
+ // Agent zone bracket (Ready -> In Review)
+ let bx = cx - card-w / 2 - 0.6
+ let bracket-top = y1 - card-h / 2 - 0.05
+ let bracket-bot = y4 + card-h / 2 + 0.05
+ line(
+ (bx + 0.15, bracket-top), (bx, bracket-top), (bx, bracket-bot), (bx + 0.15, bracket-bot),
+ stroke: (thickness: 0.8pt, paint: col-agent, dash: "dashed"),
+ )
+ content(
+ (bx - 0.15, (bracket-top + bracket-bot) / 2), anchor: "east",
+ text(6pt, fill: col-agent, weight: "bold", [Agent\ zone]),
+ )
+
+ // --- Legend at bottom ---
+ let ly = y5 - card-h / 2 - 0.7
+ let lx = cx - 2.5
+ // Human
+ line((lx, ly), (lx + 0.6, ly), stroke: (thickness: 1.2pt, paint: col-human))
+ content((lx + 0.8, ly), anchor: "west", text(6pt, [Human decision]))
+ // Agent
+ line((lx + 3.2, ly), (lx + 3.8, ly), stroke: (thickness: 1.2pt, paint: col-agent))
+ content((lx + 4.0, ly), anchor: "west", text(6pt, [Agent action]))
+})
diff --git a/docs/paper/arxiv/figures/problemtree.typ b/docs/paper/arxiv/figures/problemtree.typ
new file mode 100644
index 00000000..a5b1d643
--- /dev/null
+++ b/docs/paper/arxiv/figures/problemtree.typ
@@ -0,0 +1,198 @@
+#import "@preview/cetz:0.4.2": canvas, draw
+
+#set page(width: auto, height: auto, margin: 8pt)
+#set text(size: 8pt, font: "New Computer Modern")
+
+// Color palette (matching the NSFC figure's style, adapted for the English paper)
+#let col-platform = rgb("#1A5276")
+#let col-platform-fill = rgb("#D4E6F1")
+#let col-human = rgb("#1E8449")
+#let col-human-fill = rgb("#D5F5E3")
+#let col-ai = rgb("#7D3C98")
+#let col-ai-fill = rgb("#E8DAEF")
+#let col-edge = rgb("#5D6D7E")
+#let col-dash = rgb("#ABB2B9")
+
+#canvas(length: 0.5cm, {
+ import draw: *
+
+ // ============================================================
+ // LEVEL 0 — Hardware native problems (solver backends)
+ // ============================================================
+ let plat-w = 3.4
+ let plat-h = 0.9
+
+ rect((-4.5, -plat-h / 2), (-4.5 + plat-w, plat-h / 2),
+ fill: col-platform-fill, stroke: 1.2pt + col-platform, radius: 4pt, name: "udmis")
+ content("udmis", text(9pt, weight: "bold", fill: col-platform, "UD-MIS on grids"))
+ content((-2.8, -0.85), text(6.5pt, fill: col-platform.lighten(20%), "(Rydberg atom arrays)"))
+
+ rect((1.1, -plat-h / 2), (1.1 + plat-w, plat-h / 2),
+ fill: col-platform-fill, stroke: 1.2pt + col-platform, radius: 4pt, name: "qubo")
+ content("qubo", text(9pt, weight: "bold", fill: col-platform, "QUBO"))
+ content((2.8, -0.85), text(6.5pt, fill: col-platform.lighten(20%), "(D-Wave annealers)"))
+
+ // Also show ILP as a third backend
+ rect((6.5, -plat-h / 2), (6.5 + plat-w, plat-h / 2),
+ fill: col-platform-fill, stroke: 1.2pt + col-platform, radius: 4pt, name: "ilp")
+ content("ilp", text(9pt, weight: "bold", fill: col-platform, "ILP"))
+ content((8.2, -0.85), text(6.5pt, fill: col-platform.lighten(20%), "(Gurobi, CPLEX)"))
+
+ // ============================================================
+ // LEVEL 1 — Human-implemented reductions (~40 rules)
+ // ============================================================
+ let hnode(pos, label, name: none) = {
+ let n = if name != none { name } else { label }
+ rect(
+ (pos.at(0) - 0.9, pos.at(1) - 0.28),
+ (pos.at(0) + 0.9, pos.at(1) + 0.28),
+ fill: col-human-fill, stroke: 0.7pt + col-human.lighten(20%),
+ radius: 3pt, name: n,
+ )
+ content(n, text(6.5pt, weight: "bold", fill: col-human.darken(30%), label))
+ }
+
+ // Left subtree → UD-MIS
+ hnode((-6.0, 2.0), "Set Packing", name: "sp")
+ hnode((-4.8, 3.2), "Vertex Cover", name: "vc")
+ hnode((-3.4, 1.7), "MIS", name: "mis")
+ hnode((-1.6, 3.2), "3-SAT", name: "sat")
+ hnode((-0.4, 2.2), "Clique", name: "clq")
+
+ // Middle/Right subtree → QUBO / ILP
+ hnode((1.0, 1.7), "MAX-CUT", name: "mc")
+ hnode((2.8, 3.4), "Graph Coloring", name: "gc")
+ hnode((3.6, 2.0), "Set Cover", name: "sc")
+ hnode((5.8, 3.0), "Hamilton Cycle", name: "hc")
+ hnode((7.2, 1.9), "Num. Partition", name: "np")
+ hnode((8.8, 3.2), "Factoring", name: "fac")
+
+ // Reduction edges (downward = reduction direction)
+ let redge(from, to) = {
+ line(from, to,
+ stroke: (paint: col-edge, thickness: 0.5pt),
+ mark: (end: "straight", scale: 0.35))
+ }
+
+ redge("mis.south", "udmis.north")
+ redge("clq.south", "udmis.north")
+ redge("sp.south", "udmis.north")
+ redge("sat.south", "mis.north")
+ redge("vc.south", "mis.north")
+
+ redge("mc.south", "qubo.north")
+ redge("gc.south", "qubo.north")
+ redge("sc.south", "qubo.north")
+ redge("hc.south", "qubo.north")
+ redge("np.south", "ilp.north")
+ redge("fac.south", "ilp.north")
+
+ // Cross-reductions
+ let dredge(from, to) = {
+ line(from, to,
+ stroke: (paint: col-edge.lighten(30%), thickness: 0.4pt, dash: "densely-dashed"),
+ mark: (end: "straight", scale: 0.3))
+ }
+ dredge("vc.south", "qubo.north")
+ dredge("sat.south", "ilp.north")
+ dredge("gc.south", "ilp.north")
+ dredge("mc.south", "ilp.north")
+ dredge("clq.south", "ilp.north")
+
+ // ============================================================
+ // DIVIDER — boundary between current and AI-scaled
+ // ============================================================
+ line((-7.0, 4.1), (10.5, 4.1),
+ stroke: (paint: col-dash, thickness: 0.7pt, dash: "dashed"))
+
+ // ============================================================
+ // LEVEL 2 — AI-synthesized reductions (~100+ rules)
+ // Staggered dot grid forming a canopy shape
+ // ============================================================
+ let ai-dot(x, y) = {
+ circle((x, y), radius: 0.08,
+ fill: col-ai.lighten(60%), stroke: 0.2pt + col-ai.lighten(40%))
+ }
+
+ // Row 1 (y=4.6)
+ for x in (-5.5, -4.0, -2.5, -1.0, 0.5, 2.0, 3.5, 5.0, 6.5, 8.0) { ai-dot(x, 4.6) }
+ // Row 2 (y=5.0)
+ for x in (-6.2, -4.7, -3.2, -1.7, -0.2, 1.3, 2.8, 4.3, 5.8, 7.3, 8.8) { ai-dot(x, 5.0) }
+ // Row 3 (y=5.4)
+ for x in (-6.0, -4.5, -3.0, -1.5, 0.0, 1.5, 3.0, 4.5, 6.0, 7.5, 8.5) { ai-dot(x, 5.4) }
+ // Row 4 (y=5.8)
+ for x in (-5.5, -4.0, -2.5, -1.0, 0.5, 2.0, 3.5, 5.0, 6.5, 8.0) { ai-dot(x, 5.8) }
+ // Row 5 (y=6.2)
+ for x in (-4.8, -3.3, -1.8, -0.3, 1.2, 2.7, 4.2, 5.7, 7.2) { ai-dot(x, 6.2) }
+ // Row 6 (y=6.6)
+ for x in (-3.8, -2.3, -0.8, 0.7, 2.2, 3.7, 5.2, 6.5) { ai-dot(x, 6.6) }
+ // Row 7 (y=7.0)
+ for x in (-2.5, -1.0, 0.5, 2.0, 3.5, 5.0) { ai-dot(x, 7.0) }
+ // Row 8 (y=7.4)
+ for x in (-1.0, 0.5, 2.0, 3.5) { ai-dot(x, 7.4) }
+
+ // Representative labeled problems in AI layer
+ let ai-label(pos, label, n) = {
+ rect(
+ (pos.at(0) - 0.85, pos.at(1) - 0.2),
+ (pos.at(0) + 0.85, pos.at(1) + 0.2),
+ fill: col-ai-fill, stroke: 0.3pt + col-ai.lighten(30%),
+ radius: 2pt, name: n,
+ )
+ content(n, text(5pt, weight: "bold", fill: col-ai.darken(20%), label))
+ }
+
+ ai-label((-5.2, 4.8), "Scheduling", "ai-sched")
+ ai-label((7.5, 4.8), "TSP", "ai-tsp")
+ ai-label((-2.5, 5.5), [$k$-SAT], "ai-ksat")
+ ai-label((3.0, 5.5), "Steiner Tree", "ai-steiner")
+ ai-label((0.5, 6.8), "Bin Packing", "ai-binp")
+ ai-label((-4.0, 6.4), "Dom. Set", "ai-domset")
+ ai-label((5.5, 6.4), [Max $k$-Cut], "ai-mkcut")
+
+ // Faint edges from AI layer down to human layer
+ let aedge(from, to) = {
+ line(from, to,
+ stroke: (paint: col-ai.lighten(50%), thickness: 0.3pt),
+ mark: (end: "straight", scale: 0.2))
+ }
+ aedge((-4.0, 4.35), "sat.north")
+ aedge((-1.0, 4.35), "vc.north")
+ aedge((1.3, 4.35), "gc.north")
+ aedge((2.8, 4.35), "mc.north")
+ aedge((5.8, 4.35), "hc.north")
+ aedge((8.0, 4.35), "np.north")
+
+ // Ellipsis
+ content((1.5, 7.7), text(12pt, fill: col-ai.lighten(30%), $dots.c$))
+
+ // ============================================================
+ // ANNOTATIONS — right-side braces
+ // ============================================================
+ // Hardware native
+ on-layer(-1, {
+ // Use simple bracket lines instead of decorations
+ let bx = 10.8
+
+ // Hardware brace
+ line((bx, -0.5), (bx + 0.3, -0.5), stroke: 0.6pt + col-platform.lighten(20%))
+ line((bx + 0.3, -0.5), (bx + 0.3, 0.5), stroke: 0.6pt + col-platform.lighten(20%))
+ line((bx, 0.5), (bx + 0.3, 0.5), stroke: 0.6pt + col-platform.lighten(20%))
+ content((bx + 0.6, 0.0), anchor: "west", text(7pt, fill: col-platform, weight: "bold",
+ [Hardware-native\ problems]))
+
+ // Human brace
+ line((bx, 1.0), (bx + 0.3, 1.0), stroke: 0.6pt + col-human.lighten(20%))
+ line((bx + 0.3, 1.0), (bx + 0.3, 3.8), stroke: 0.6pt + col-human.lighten(20%))
+ line((bx, 3.8), (bx + 0.3, 3.8), stroke: 0.6pt + col-human.lighten(20%))
+ content((bx + 0.6, 2.4), anchor: "west", text(7pt, fill: col-human.darken(10%), weight: "bold",
+ [Human-implemented\ #text(6pt)[$tilde.op$40 reduction rules]]))
+
+ // AI brace
+ line((bx, 4.3), (bx + 0.3, 4.3), stroke: 0.6pt + col-ai.lighten(20%))
+ line((bx + 0.3, 4.3), (bx + 0.3, 7.5), stroke: 0.6pt + col-ai.lighten(20%))
+ line((bx, 7.5), (bx + 0.3, 7.5), stroke: 0.6pt + col-ai.lighten(20%))
+ content((bx + 0.6, 5.9), anchor: "west", text(7pt, fill: col-ai.darken(10%), weight: "bold",
+ [Agent-synthesized\ #text(6pt)[$tilde.op$100+ new rules]]))
+ })
+})
diff --git a/docs/paper/arxiv/figures/reduction-graph.typ b/docs/paper/arxiv/figures/reduction-graph.typ
new file mode 100644
index 00000000..bba99ce9
--- /dev/null
+++ b/docs/paper/arxiv/figures/reduction-graph.typ
@@ -0,0 +1,188 @@
+#import "@preview/cetz:0.4.2": canvas, draw
+
+#set page(width: auto, height: auto, margin: 10pt)
+#set text(size: 7pt, font: "New Computer Modern")
+
+// Category colors
+#let col-graph = rgb("#4e79a7")
+#let col-formula = rgb("#59a14f")
+#let col-set = rgb("#e15759")
+#let col-alg = rgb("#b07aa1")
+#let col-misc = rgb("#999999")
+
+#let hub-r = 0.44
+#let node-r = 0.26
+
+// Layout: Semi-circular fan around ILP (the biggest sink with 10 in-edges).
+// ILP at center-right; QUBO below it; MIS at center-left.
+// Feeder nodes arranged in an arc around ILP so edges don't bundle.
+// Bottom: SG <-> MaxCut <-> QUBO cluster.
+// Left: SAT cluster fans out to MIS, MinDS, KCol, kSAT, CSAT.
+// Isolated nodes in dashed boxes at far left and far right.
+
+#let nodes = (
+ // === Hubs ===
+ ("MIS", "MIS", 3.0, 5.0, col-graph, hub-r),
+ ("ILP", "ILP", 12.0, 5.0, col-alg, hub-r),
+ ("QUBO", "QUBO", 12.0, 1.5, col-alg, hub-r),
+
+ // === ILP feeders — spread in a wide arc above/below/left of ILP ===
+ ("MaxClq", "MaxClq", 8.5, 8.5, col-graph, node-r),
+ ("TSP", "TSP", 10.0, 8.5, col-graph, node-r),
+ ("MaxMatch", "MaxM", 5.0, 8.0, col-graph, node-r),
+ ("MinDS", "MinDS", 1.0, 7.0, col-graph, node-r),
+
+ // === MIS neighbors ===
+ ("MinVC", "MinVC", 1.0, 3.5, col-graph, node-r),
+
+ // === Middle band ===
+ ("KCol", "KCol", 8.5, 3.5, col-graph, node-r),
+ ("SG", "SG", 10.0, 0.0, col-graph, node-r),
+ ("MaxCut", "MaxCut", 8.0, 0.0, col-graph, node-r),
+
+ // === Isolated graph ===
+ ("MaxIS", "MaxIS", -0.8, 5.0, col-graph, node-r),
+ ("BiClq", "BiClq", -0.8, 3.5, col-graph, node-r),
+
+ // === Formula ===
+ ("SAT", "SAT", 5.0, 5.0, col-formula, node-r),
+ ("kSAT", "kSAT", 7.5, 5.0, col-formula, node-r),
+ ("CSAT", "CSAT", 7.5, 1.5, col-formula, node-r),
+
+ // === Set ===
+ ("MaxSP", "MaxSP", 5.0, 6.5, col-set, node-r),
+ ("MinSC", "MinSC", 10.0, 3.5, col-set, node-r),
+
+ // === Isolated algebraic ===
+ ("CVP", "CVP", 14.5, 3.5, col-alg, node-r),
+ ("BMF", "BMF", 14.5, 1.5, col-alg, node-r),
+ ("Knap", "Knap", 14.5, 5.5, col-alg, node-r),
+
+ // === Misc ===
+ ("Fact", "Fact", 5.0, 0.0, col-misc, node-r),
+ ("BinP", "BinP", 14.5, 7.5, col-misc, node-r),
+ ("PS", "PS", 14.5, 0.0, col-misc, node-r),
+)
+
+// 32 unique type-level directed edges
+#let edges = (
+ ("SAT", "CSAT"), ("SAT", "KCol"), ("SAT", "kSAT"),
+ ("SAT", "MIS"), ("SAT", "MinDS"),
+ ("kSAT", "QUBO"), ("kSAT", "SAT"),
+ ("CSAT", "ILP"), ("CSAT", "SG"),
+ ("Fact", "CSAT"), ("Fact", "ILP"),
+ ("MIS", "MaxSP"), ("MIS", "MinVC"),
+ ("MinVC", "MIS"), ("MinVC", "MinSC"),
+ ("MaxSP", "ILP"), ("MaxSP", "MIS"), ("MaxSP", "QUBO"),
+ ("MaxClq","ILP"),
+ ("MaxMatch","ILP"), ("MaxMatch","MaxSP"),
+ ("MinDS", "ILP"),
+ ("MinSC", "ILP"),
+ ("KCol", "ILP"), ("KCol", "QUBO"),
+ ("QUBO", "ILP"), ("QUBO", "SG"),
+ ("SG", "MaxCut"), ("SG", "QUBO"),
+ ("MaxCut","SG"),
+ ("ILP", "QUBO"),
+ ("TSP", "ILP"),
+)
+
+// Bidirectional pairs for perpendicular offset
+#let bidi-pairs = (
+ ("MIS", "MinVC"), ("MIS", "MaxSP"), ("SAT", "kSAT"),
+ ("SG", "MaxCut"), ("SG", "QUBO"), ("ILP", "QUBO"),
+)
+
+#canvas(length: 0.55cm, {
+ import draw: *
+
+ // Build lookup
+ let node-map = (:)
+ for n in nodes {
+ let (id, abbr, x, y, col, r) = n
+ node-map.insert(id, (x: x, y: y, r: r))
+ }
+
+ let is-bidi(src, tgt) = {
+ let found = false
+ for (a, b) in bidi-pairs {
+ if (src == a and tgt == b) or (src == b and tgt == a) { found = true }
+ }
+ found
+ }
+
+ let bidi-offset = 0.2
+
+ // --- Edges ---
+ for (src, tgt) in edges {
+ let s = node-map.at(src)
+ let t = node-map.at(tgt)
+ let dx = t.x - s.x
+ let dy = t.y - s.y
+ let dist = calc.sqrt(dx * dx + dy * dy)
+ if dist > 0 {
+ let ux = dx / dist
+ let uy = dy / dist
+ let px = -uy
+ let py = ux
+ let off = 0.0
+ if is-bidi(src, tgt) {
+ if src < tgt { off = bidi-offset } else { off = -bidi-offset }
+ }
+ let sx = s.x + px * off
+ let sy = s.y + py * off
+ let tx = t.x + px * off
+ let ty = t.y + py * off
+ let x1 = sx + ux * (s.r + 0.06)
+ let y1 = sy + uy * (s.r + 0.06)
+ let x2 = tx - ux * (t.r + 0.1)
+ let y2 = ty - uy * (t.r + 0.1)
+ line(
+ (x1, y1), (x2, y2),
+ stroke: 0.35pt + luma(150),
+ mark: (end: "straight", scale: 0.3),
+ )
+ }
+ }
+
+ // --- Nodes ---
+ for n in nodes {
+ let (id, abbr, x, y, col, r) = n
+ let is-hub = r > 0.3
+ circle(
+ (x, y), radius: r,
+ fill: col.lighten(if is-hub { 58% } else { 80% }),
+ stroke: (thickness: if is-hub { 1.4pt } else { 0.5pt }, paint: col),
+ name: id,
+ )
+ content(id, text(
+ if is-hub { 7.5pt } else { 5.5pt },
+ weight: if is-hub { "bold" } else { "regular" },
+ fill: col.darken(25%), abbr,
+ ))
+ }
+
+ // --- Dashed boxes for isolated nodes ---
+ rect((-1.4, 2.9), (0.0, 5.7),
+ stroke: (thickness: 0.3pt, paint: luma(190), dash: "dashed"), radius: 4pt)
+ content((-0.7, 2.55), text(4pt, fill: luma(150), "no reductions"))
+
+ rect((13.85, -0.6), (15.2, 8.1),
+ stroke: (thickness: 0.3pt, paint: luma(190), dash: "dashed"), radius: 4pt)
+ content((14.5, -0.9), text(4pt, fill: luma(150), "no reductions"))
+
+ // --- Legend ---
+ let lx = 1.0
+ let ly = -1.3
+ rect((lx - 0.3, ly - 0.2), (lx + 10.0, ly + 0.85),
+ stroke: 0.3pt + luma(180), fill: white, radius: 3pt)
+ let items = (
+ ("Graph", col-graph), ("Formula", col-formula), ("Set", col-set),
+ ("Algebraic", col-alg), ("Misc", col-misc),
+ )
+ for (i, (label, col)) in items.enumerate() {
+ let ex = lx + 0.25 + i * 2.0
+ let ey = ly + 0.33
+ circle((ex, ey), radius: 0.15, fill: col.lighten(80%), stroke: 0.5pt + col)
+ content((ex + 0.3, ey), anchor: "west", text(5pt, label))
+ }
+})
diff --git a/docs/paper/arxiv/figures/role-mentor.typ b/docs/paper/arxiv/figures/role-mentor.typ
new file mode 100644
index 00000000..19d10ac6
--- /dev/null
+++ b/docs/paper/arxiv/figures/role-mentor.typ
@@ -0,0 +1,45 @@
+#import "lib.typ": *
+#import "@preview/pixel-family:0.1.0": alice, bolt
+
+#set page(..fig-page)
+#set text(..fig-text)
+
+#canvas(length: 0.55cm, {
+ import draw: *
+
+ // Helper: auto-sized rounded box
+ let node(pos, label, name-id, accented: false) = {
+ let s = if accented { stroke-accent } else { stroke-box }
+ let f = if accented { fill-accent } else { fill-light }
+ let c = if accented { accent.darken(20%) } else { fg }
+ content(pos,
+ box(fill: f, stroke: s, inset: (x: 8pt, y: 4pt), radius: 5pt,
+ text(8pt, weight: "bold", fill: c, label)),
+ name: name-id)
+ }
+
+ let elabel(pos, body) = {
+ content(pos, box(fill: white, inset: (x: 3pt, y: 1.5pt), radius: 2pt,
+ text(6pt, fill: fg-light, body)))
+ }
+
+ // Contributor (human)
+ content((0, 1.4), alice(size: 1.8em, baseline: 0pt))
+ node((0, 0), [Contributor], "contrib")
+
+ // Mentor (agent)
+ content((6, 1.4), bolt(size: 1.8em, baseline: 0pt))
+ node((6, 0), [Mentor], "mentor", accented: true)
+
+ // GitHub Issue
+ node((3, -2.2), [GitHub Issue], "issue")
+
+ // Contributor ↔ Mentor
+ line("contrib.east", "mentor.west",
+ stroke: stroke-edge, mark: arrow-both)
+ elabel(("contrib", 50%, "mentor"), [interactive])
+
+ // Mentor → Issue
+ line("mentor.south", "issue.east",
+ stroke: stroke-edge, mark: arrow-end)
+})
diff --git a/docs/paper/arxiv/figures/role-orchestrator.typ b/docs/paper/arxiv/figures/role-orchestrator.typ
new file mode 100644
index 00000000..08a9e79d
--- /dev/null
+++ b/docs/paper/arxiv/figures/role-orchestrator.typ
@@ -0,0 +1,58 @@
+#import "lib.typ": *
+#import "@preview/pixel-family:0.1.0": bob, nova
+
+#set page(..fig-page)
+#set text(..fig-text)
+
+#canvas(length: 0.55cm, {
+ import draw: *
+
+ // Helper: auto-sized rounded box
+ let node(pos, label, name-id, accented: false) = {
+ let s = if accented { stroke-accent } else { stroke-box }
+ let f = if accented { fill-accent } else { fill-light }
+ let c = if accented { accent.darken(20%) } else { fg }
+ content(pos,
+ box(fill: f, stroke: s, inset: (x: 8pt, y: 4pt), radius: 5pt,
+ text(8pt, weight: "bold", fill: c, label)),
+ name: name-id)
+ }
+
+ let elabel(pos, body) = {
+ content(pos, box(fill: white, inset: (x: 3pt, y: 1.5pt), radius: 2pt,
+ text(6pt, fill: fg-light, body)))
+ }
+
+ let gap = 4.5
+
+ // Maintainer (left)
+ content((0, 1.4), bob(size: 1.8em, baseline: 0pt))
+ node((0, 0), [Maintainer], "maint")
+
+ // Board
+ node((gap, 0), [Board], "board")
+
+ // Orchestrator (agent)
+ content((2 * gap, 1.4), nova(size: 1.8em, baseline: 0pt))
+ node((2 * gap, 0), [Orchestrator], "orch", accented: true)
+
+ // PR
+ node((3 * gap, 0), [PR], "pr")
+
+ // Maintainer (right)
+ content((4 * gap, 1.4), bob(size: 1.8em, baseline: 0pt))
+ node((4 * gap, 0), [Maintainer], "maint2")
+
+ // Edges
+ line("maint.east", "board.west", stroke: stroke-edge, mark: arrow-end)
+ elabel(("maint", 50%, "board"), [ready])
+
+ line("board.east", "orch.west", stroke: stroke-edge, mark: arrow-end)
+ elabel(("board", 50%, "orch"), [pick])
+
+ line("orch.east", "pr.west", stroke: stroke-edge, mark: arrow-end)
+ elabel(("orch", 50%, "pr"), [create])
+
+ line("pr.east", "maint2.west", stroke: stroke-edge, mark: arrow-end)
+ elabel(("pr", 50%, "maint2"), [merge])
+})
diff --git a/docs/paper/arxiv/figures/role-runner.typ b/docs/paper/arxiv/figures/role-runner.typ
new file mode 100644
index 00000000..6e512561
--- /dev/null
+++ b/docs/paper/arxiv/figures/role-runner.typ
@@ -0,0 +1,36 @@
+#import "lib.typ": *
+#import "@preview/pixel-family:0.1.0": crank
+
+#set page(..fig-page)
+#set text(..fig-text)
+
+#canvas(length: 0.55cm, {
+ import draw: *
+
+ // Helper: auto-sized rounded box
+ let node(pos, label, name-id, accented: false) = {
+ let s = if accented { stroke-accent } else { stroke-box }
+ let f = if accented { fill-accent } else { fill-light }
+ let c = if accented { accent.darken(20%) } else { fg }
+ content(pos,
+ box(fill: f, stroke: s, inset: (x: 8pt, y: 4pt), radius: 5pt,
+ text(8pt, weight: "bold", fill: c, label)),
+ name: name-id)
+ }
+
+ // Runner (agent, top center)
+ content((3, 2.1), crank(size: 1.8em, baseline: 0pt))
+ node((3, 0.7), [Runner], "runner", accented: true)
+
+ // Code + Tests (bottom left)
+ node((0.5, -0.7), [Code + Tests], "code")
+
+ // Paper Entry (bottom right)
+ node((5.5, -0.7), [Paper Entry], "paper")
+
+ // Edges
+ line("runner.south-west", "code.north-east",
+ stroke: stroke-edge, mark: arrow-end)
+ line("runner.south-east", "paper.north-west",
+ stroke: stroke-edge, mark: arrow-end)
+})
diff --git a/docs/paper/arxiv/figures/role-worker.typ b/docs/paper/arxiv/figures/role-worker.typ
new file mode 100644
index 00000000..ccf7ffef
--- /dev/null
+++ b/docs/paper/arxiv/figures/role-worker.typ
@@ -0,0 +1,76 @@
+#import "lib.typ": *
+#import "@preview/pixel-family:0.1.0": bob, nova, crank
+
+#set page(..fig-page)
+#set text(..fig-text)
+
+#canvas(length: 0.55cm, {
+ import draw: *
+
+ // Helper: auto-sized rounded box
+ let node(pos, label, name-id, accented: false) = {
+ let s = if accented { stroke-accent } else { stroke-box }
+ let f = if accented { fill-accent } else { fill-light }
+ let c = if accented { accent.darken(20%) } else { fg }
+ content(pos,
+ box(fill: f, stroke: s, inset: (x: 8pt, y: 4pt), radius: 5pt,
+ text(8pt, weight: "bold", fill: c, label)),
+ name: name-id)
+ }
+
+ let elabel(pos, body) = {
+ content(pos, box(fill: white, inset: (x: 3pt, y: 1.5pt), radius: 2pt,
+ text(6pt, fill: fg-light, body)))
+ }
+
+ let gap = 5.0
+
+ // Maintainer (left)
+ content((0, 1.4), bob(size: 1.8em, baseline: 0pt))
+ node((0, 0), [Maintainer], "maint")
+
+ // Board
+ node((gap, 0), [Board], "board")
+
+ // Orchestrator worker (agent)
+ content((2 * gap, 1.4), nova(size: 1.8em, baseline: 0pt))
+ node((2 * gap, 0), [Orchestrate], "orch", accented: true)
+
+ // Implementation worker (agent)
+ content((3 * gap, 1.4), crank(size: 1.8em, baseline: 0pt))
+ node((3 * gap, 0), [Implement], "impl", accented: true)
+
+ // Outputs (bottom)
+ node((2.5 * gap - 2.2, -2.0), [Code + Tests], "code")
+ node((2.5 * gap + 2.2, -2.0), [Paper Entry], "paper")
+
+ // PR
+ node((4 * gap, 0), [PR], "pr")
+
+ // Maintainer (right)
+ content((5 * gap, 1.4), bob(size: 1.8em, baseline: 0pt))
+ node((5 * gap, 0), [Maintainer], "maint2")
+
+ // Edges
+ line("maint.east", "board.west", stroke: stroke-edge, mark: arrow-end)
+ elabel(("maint", 50%, "board"), [ready])
+
+ line("board.east", "orch.west", stroke: stroke-edge, mark: arrow-end)
+ elabel(("board", 50%, "orch"), [pick])
+
+ line("orch.east", "impl.west",
+ stroke: (thickness: 1.1pt, paint: accent), mark: arrow-end)
+ elabel(("orch", 50%, "impl"), [dispatch])
+
+ // Implementation outputs
+ line("impl.south-west", "code.north-east",
+ stroke: stroke-edge, mark: arrow-end)
+ line("impl.south-east", "paper.north-west",
+ stroke: stroke-edge, mark: arrow-end)
+
+ line("impl.east", "pr.west", stroke: stroke-edge, mark: arrow-end)
+ elabel(("impl", 50%, "pr"), [create])
+
+ line("pr.east", "maint2.west", stroke: stroke-edge, mark: arrow-end)
+ elabel(("pr", 50%, "maint2"), [merge])
+})
diff --git a/docs/paper/arxiv/figures/roles.png b/docs/paper/arxiv/figures/roles.png
new file mode 100644
index 00000000..dcb447ef
Binary files /dev/null and b/docs/paper/arxiv/figures/roles.png differ
diff --git a/docs/paper/arxiv/figures/roles.typ b/docs/paper/arxiv/figures/roles.typ
new file mode 100644
index 00000000..c6ecacd4
--- /dev/null
+++ b/docs/paper/arxiv/figures/roles.typ
@@ -0,0 +1,75 @@
+#import "lib.typ": *
+
+#set page(..fig-page)
+#set text(..fig-text)
+
+#canvas(length: 0.55cm, {
+ import draw: *
+
+ // Helper: box with shadow
+ let node(pos, label, sub, name-id, accented: false, w: 2.2, h: 0.8) = {
+ let (x, y) = pos
+ let s = if accented { stroke-accent } else { stroke-box }
+ let f = if accented { fill-accent } else { fill-light }
+ rect((x - w + 0.1, y - h + 0.1), (x + w + 0.1, y + h + 0.1),
+ radius: 7pt, fill: shadow-col, stroke: none)
+ rect((x - w, y - h), (x + w, y + h),
+ radius: 7pt, fill: f, stroke: s, name: name-id)
+ let c = if accented { accent.darken(20%) } else { fg }
+ content((x, y + 0.22), text(10pt, weight: "bold", fill: c, label))
+ content((x, y - 0.32), text(6.5pt, fill: fg-light, sub))
+ }
+
+ // Helper: edge label
+ let elabel(pos, body) = {
+ content(pos, box(fill: white, inset: (x: 3pt, y: 1.5pt), radius: 2pt,
+ text(6.5pt, fill: fg-light, body)))
+ }
+
+ let cx = 8
+ let cy = 5.5
+
+ // ── Codebase (center, larger) ──
+ rect((cx - 2.7 + 0.12, cy - 1.4 + 0.12), (cx + 2.7 + 0.12, cy + 1.4 + 0.12),
+ radius: 8pt, fill: shadow-col, stroke: none)
+ rect((cx - 2.7, cy - 1.4), (cx + 2.7, cy + 1.4),
+ radius: 8pt, fill: fill-light, stroke: (thickness: 1.5pt, paint: border),
+ name: "code")
+ content((cx, cy + 0.45), text(11pt, weight: "bold", fill: fg, [Codebase]))
+ content((cx, cy - 0.3), text(7pt, style: "italic", fill: fg-light, [agent-maintained]))
+
+ // ── Three roles ──
+ node((3.0, 11.0), [Contributor], [domain expert], "contrib")
+ node((3.0, 0.8), [Maintainer], [no code], "maint")
+ node((13.5, 2.0), [Agent], [implement · test · review], "agent", accented: true, w: 2.5)
+
+ // ── Contributor → Codebase: issue ──
+ line((5.2, 11.0 - 0.8), (cx - 0.5, cy + 1.4),
+ stroke: stroke-edge, mark: arrow-end)
+ elabel((6.8, 8.8), [issue (creative elements)])
+
+ // ── Codebase → Contributor: visual check ──
+ line((cx - 2.0, cy + 1.4), (2.2, 11.0 - 0.8),
+ stroke: stroke-dotted, mark: arrow-end)
+ elabel((2.2, 8.6), [generated paper\ (visual check)])
+
+ // ── Maintainer → Codebase: approve, merge ──
+ line((4.5, 0.8 + 0.8), (cx - 2.0, cy - 1.4),
+ stroke: stroke-edge, mark: arrow-end)
+ elabel((3.8, 3.0), [approve, merge])
+
+ // ── Agent ↔ Codebase: execute skills ──
+ line((13.5 - 2.3, 2.0 + 0.8), (cx + 2.0, cy - 1.4),
+ stroke: (thickness: 1.1pt, paint: accent), mark: arrow-both)
+ elabel((12.0, 4.2), text(fill: accent.darken(15%), [execute skills]))
+
+ // ── Maintainer → Agent: author skills ──
+ line((3.0 + 2.2, 0.8 + 0.2), (13.5 - 2.5, 2.0 - 0.3),
+ stroke: (thickness: 1.1pt, paint: accent), mark: arrow-end)
+ elabel((8.2, 0.35), text(weight: "bold", fill: accent.darken(15%), [author skills]))
+
+ // ── Maintainer ↔ Contributor: community calls ──
+ line((3.0 - 1.0, 0.8 + 0.8), (3.0 - 1.0, 11.0 - 0.8),
+ stroke: stroke-dashed, mark: arrow-both)
+ elabel((0.3, 5.9), [community\ calls])
+})
diff --git a/docs/paper/arxiv/figures/scaling-wall.typ b/docs/paper/arxiv/figures/scaling-wall.typ
new file mode 100644
index 00000000..4693cfb7
--- /dev/null
+++ b/docs/paper/arxiv/figures/scaling-wall.typ
@@ -0,0 +1,253 @@
+#import "@preview/cetz:0.4.2": canvas, draw
+
+#set page(width: auto, height: auto, margin: 10pt)
+#set text(size: 7.5pt, font: "New Computer Modern")
+
+// ── Colors ──
+#let col-human = rgb("#e15759") // warm red for human team
+#let col-agent = rgb("#59a14f") // green for agent + verification
+#let col-barrier = rgb("#f28e2b") // orange for barrier bands
+#let fg = luma(30)
+#let fg-light = luma(100)
+#let axis-col = luma(60)
+
+#canvas(length: 0.55cm, {
+ import draw: *
+
+ // ── Layout constants ──
+ let ox = 0 // origin x
+ let oy = 0 // origin y
+ let plot-w = 20.0 // plot width (in canvas units)
+ let plot-h = 10.0 // plot height
+
+ // Data scaling: x-axis 0..150, y-axis 0..1
+ let x-max = 150.0
+ let y-max = 1.0
+ let sx = plot-w / x-max // scale x
+ let sy = plot-h / y-max // scale y
+
+ // Helpers: data coords -> canvas coords
+ let px(x) = ox + x * sx
+ let py(y) = oy + y * sy
+ let pt(x, y) = (px(x), py(y))
+
+ // ── Barrier bands (draw first, behind everything) ──
+ let barriers = (
+ (15, 22, "Convention\ndrift"),
+ (32, 42, "Effort\nexhaustion"),
+ (52, 65, "Knowledge\ndiscontinuity"),
+ )
+
+ for (x0, x1, label) in barriers {
+ rect(
+ pt(x0, 0), pt(x1, y-max),
+ fill: col-barrier.lighten(82%),
+ stroke: none,
+ )
+ // Thin border lines
+ line(pt(x0, 0), pt(x0, y-max), stroke: (thickness: 0.4pt, paint: col-barrier.lighten(50%)))
+ line(pt(x1, 0), pt(x1, y-max), stroke: (thickness: 0.4pt, paint: col-barrier.lighten(50%)))
+
+ // Barrier label at top
+ content(
+ pt((x0 + x1) / 2, y-max + 0.06),
+ anchor: "south",
+ text(6pt, fill: col-barrier.darken(15%), weight: "bold", label),
+ )
+ }
+
+ // ── Axes ──
+ // X-axis
+ line(
+ pt(0, 0), pt(x-max, 0),
+ stroke: (thickness: 1pt, paint: axis-col),
+ mark: (end: "straight", scale: 0.35),
+ )
+ // Y-axis
+ line(
+ pt(0, 0), pt(0, y-max + 0.08),
+ stroke: (thickness: 1pt, paint: axis-col),
+ mark: (end: "straight", scale: 0.35),
+ )
+
+ // X-axis label
+ content(
+ pt(x-max / 2, -0.14),
+ anchor: "north",
+ text(8pt, fill: fg, [Number of problem types]),
+ )
+
+ // Y-axis label
+ content(
+ (ox - 1.8, py(y-max / 2)),
+ anchor: "center",
+ angle: 90deg,
+ text(8pt, fill: fg, [Quality]),
+ )
+
+ // X-axis tick marks
+ for x in (0, 20, 40, 60, 80, 100, 120, 140) {
+ line(pt(x, 0), pt(x, -0.02), stroke: (thickness: 0.6pt, paint: axis-col))
+ content(
+ pt(x, -0.04),
+ anchor: "north",
+ text(6pt, fill: fg-light, str(x)),
+ )
+ }
+
+ // Y-axis: just "Low" and "High" labels (no numeric scale)
+ content(
+ (ox - 0.4, py(0.05)),
+ anchor: "east",
+ text(6pt, fill: fg-light, [Low]),
+ )
+ content(
+ (ox - 0.4, py(0.95)),
+ anchor: "east",
+ text(6pt, fill: fg-light, [High]),
+ )
+
+ // ── Human team line ──
+ // Hobby spline for the steep descent; straight line for the flat tail.
+ let human-pts = (
+ (0, 0.92),
+ (5, 0.93),
+ (10, 0.91),
+ (14, 0.88),
+ // Hit first barrier: convention drift
+ (18, 0.80),
+ (20, 0.75),
+ (24, 0.68),
+ (28, 0.64),
+ (30, 0.62),
+ // Hit second barrier: effort exhaustion
+ (35, 0.50),
+ (40, 0.42),
+ (45, 0.38),
+ (50, 0.34),
+ // Hit third barrier: knowledge discontinuity
+ (55, 0.28),
+ (60, 0.22),
+ (65, 0.19),
+ (75, 0.15),
+ (85, 0.13),
+ )
+
+ let human-canvas = human-pts.map(((x, y)) => pt(x, y))
+ hobby(
+ ..human-canvas,
+ stroke: (thickness: 1.8pt, paint: col-human),
+ )
+ // Flat tail: straight line from where the spline ends
+ line(
+ pt(85, 0.13), pt(145, 0.12),
+ stroke: (thickness: 1.8pt, paint: col-human),
+ )
+
+ // ── Agent + verification line ──
+ // Starts at same point, maintains quality throughout
+ let agent-pts = (
+ (0, 0.92),
+ (5, 0.93),
+ (10, 0.92),
+ (15, 0.91),
+ (20, 0.91),
+ (25, 0.92),
+ (27, 0.92),
+ (30, 0.91),
+ (35, 0.91),
+ (40, 0.92),
+ (45, 0.91),
+ (50, 0.91),
+ (55, 0.92),
+ (60, 0.91),
+ (65, 0.92),
+ (70, 0.91),
+ (80, 0.92),
+ (90, 0.91),
+ (100, 0.92),
+ )
+
+ let agent-canvas = agent-pts.map(((x, y)) => pt(x, y))
+ hobby(
+ ..agent-canvas,
+ stroke: (thickness: 1.8pt, paint: col-agent),
+ )
+
+ // Dashed continuation of agent line beyond data
+ line(
+ pt(100, 0.92), pt(145, 0.91),
+ stroke: (thickness: 1.4pt, paint: col-agent, dash: "dashed"),
+ )
+
+ // ── Data points ──
+
+ // This work: x=27 on the agent line
+ circle(
+ pt(27, 0.92),
+ radius: 0.2,
+ fill: col-agent,
+ stroke: (thickness: 1pt, paint: white),
+ name: "this-pt",
+ )
+ // Label below-right to avoid overlapping with barrier labels at top
+ content(
+ (rel: (0.4, -0.5), to: "this-pt"),
+ anchor: "north-west",
+ frame: "rect",
+ padding: (x: 0.12, y: 0.06),
+ fill: white,
+ stroke: (thickness: 0.5pt, paint: col-agent.lighten(40%)),
+ text(6.5pt, fill: col-agent.darken(20%), weight: "bold", [This work (9 weeks)]),
+ )
+
+ // Vision arrow: from x=100 toward x=140
+ line(
+ pt(105, 0.80), pt(138, 0.80),
+ stroke: (thickness: 1.2pt, paint: col-agent.darken(10%)),
+ mark: (end: "straight", scale: 0.4),
+ name: "vision-arrow",
+ )
+ content(
+ "vision-arrow.mid",
+ anchor: "south",
+ padding: 0.12,
+ text(7pt, fill: col-agent.darken(20%), weight: "bold", [Vision: 100+]),
+ )
+
+ // ── Legend ──
+ let lx = px(80)
+ let ly = py(0.42)
+ let leg-gap = 1.1
+
+ // Legend background (fully opaque to cover the human line behind it)
+ rect(
+ (lx - 0.4, ly + 0.6),
+ (lx + 7.0, ly - leg-gap - 0.4),
+ radius: 3pt,
+ fill: white,
+ stroke: (thickness: 0.5pt, paint: luma(200)),
+ )
+
+ // Human line legend
+ line(
+ (lx, ly), (lx + 1.2, ly),
+ stroke: (thickness: 1.8pt, paint: col-human),
+ )
+ content(
+ (lx + 1.5, ly),
+ anchor: "west",
+ text(6.5pt, fill: fg, [Human team]),
+ )
+
+ // Agent line legend
+ line(
+ (lx, ly - leg-gap), (lx + 1.2, ly - leg-gap),
+ stroke: (thickness: 1.8pt, paint: col-agent),
+ )
+ content(
+ (lx + 1.5, ly - leg-gap),
+ anchor: "west",
+ text(6.5pt, fill: fg, [Agent + verification]),
+ )
+})
diff --git a/docs/paper/arxiv/figures/skill-map.typ b/docs/paper/arxiv/figures/skill-map.typ
new file mode 100644
index 00000000..121a629a
--- /dev/null
+++ b/docs/paper/arxiv/figures/skill-map.typ
@@ -0,0 +1,89 @@
+#import "lib.typ": *
+
+#set page(..fig-page)
+#set text(..fig-text)
+
+#canvas(length: 0.55cm, {
+ import draw: *
+
+ // Helper: core node (inner ring)
+ let core(pos, label, name-id) = {
+ let (x, y) = pos
+ rect((x - 1.5, y - 0.35 + 0.06), (x + 1.5, y + 0.35 + 0.06),
+ radius: 4pt, fill: shadow-col, stroke: none)
+ rect((x - 1.5, y - 0.35), (x + 1.5, y + 0.35),
+ radius: 4pt, fill: fill-light, stroke: stroke-box, name: name-id)
+ content(name-id, text(7pt, weight: "bold", fill: fg, raw(label)))
+ }
+
+ // Helper: skill node (outer ring)
+ let skill(pos, label, name-id) = {
+ let (x, y) = pos
+ rect((x - 1.6, y - 0.28), (x + 1.6, y + 0.28),
+ radius: 3pt, fill: white, stroke: (thickness: 0.5pt, paint: border), name: name-id)
+ content(name-id, text(6pt, fill: fg, raw(label)))
+ }
+
+ // Helper: category label
+ let cat(pos, label) = {
+ content(pos, text(6pt, weight: "bold", fill: fg-light, label))
+ }
+
+ let cx = 8
+ let cy = 5.5
+
+ // ── Center: CLAUDE.md ──
+ rect((cx - 1.8 + 0.08, cy - 0.45 + 0.08), (cx + 1.8 + 0.08, cy + 0.45 + 0.08),
+ radius: 5pt, fill: shadow-col, stroke: none)
+ rect((cx - 1.8, cy - 0.45), (cx + 1.8, cy + 0.45),
+ radius: 5pt, fill: fill-accent, stroke: stroke-accent, name: "claude")
+ content("claude", text(9pt, weight: "bold", fill: accent.darken(20%), raw("CLAUDE.md")))
+
+ // ── Inner ring: key project files ──
+ core((cx, cy + 2.0), "src/traits.rs", "traits")
+ core((cx, cy - 2.0), "Makefile", "make")
+ core((cx - 3.0, cy), "src/models/", "models")
+ core((cx + 3.0, cy), "src/rules/", "rules")
+
+ // Links from center to inner ring
+ line("claude.north", "traits.south", stroke: (thickness: 0.6pt, paint: edge-col, dash: "densely-dashed"), mark: arrow-end)
+ line("claude.south", "make.north", stroke: (thickness: 0.6pt, paint: edge-col, dash: "densely-dashed"), mark: arrow-end)
+ line("claude.west", "models.east", stroke: (thickness: 0.6pt, paint: edge-col, dash: "densely-dashed"), mark: arrow-end)
+ line("claude.east", "rules.west", stroke: (thickness: 0.6pt, paint: edge-col, dash: "densely-dashed"), mark: arrow-end)
+
+ // ── Outer ring: skill groups ──
+
+ // Top-left: Orchestration
+ let ol = (cx - 5.5, cy + 4.5)
+ cat((ol.at(0), ol.at(1) + 0.5), [orchestration])
+ skill((ol.at(0), ol.at(1)), "project-pipeline", "s1")
+ skill((ol.at(0), ol.at(1) - 0.7), "review-pipeline", "s2")
+ skill((ol.at(0), ol.at(1) - 1.4), "issue-to-pr", "s3")
+
+ // Top-right: Quality gates
+ let qr = (cx + 5.5, cy + 4.5)
+ cat((qr.at(0), qr.at(1) + 0.5), [quality gates])
+ skill((qr.at(0), qr.at(1)), "check-issue", "s4")
+ skill((qr.at(0), qr.at(1) - 0.7), "review-impl", "s5")
+ skill((qr.at(0), qr.at(1) - 1.4), "fix-pr", "s6")
+ skill((qr.at(0), qr.at(1) - 2.1), "topology-check", "s7")
+
+ // Bottom-left: Implementation
+ let il = (cx - 5.5, cy - 3.0)
+ cat((il.at(0), il.at(1) + 0.5), [implementation])
+ skill((il.at(0), il.at(1)), "add-model", "s8")
+ skill((il.at(0), il.at(1) - 0.7), "add-rule", "s9")
+
+ // Bottom-right: Docs / community
+ let dr = (cx + 5.5, cy - 3.0)
+ cat((dr.at(0), dr.at(1) + 0.5), [docs / community])
+ skill((dr.at(0), dr.at(1)), "write-in-paper", "s10")
+ skill((dr.at(0), dr.at(1) - 0.7), "propose", "s11")
+ skill((dr.at(0), dr.at(1) - 1.4), "dev-setup", "s12")
+
+ // Dashed links from skill groups to center
+ line("s3.east", "claude.north-west", stroke: stroke-dashed, mark: arrow-end)
+ line("s7.west", "claude.north-east", stroke: stroke-dashed, mark: arrow-end)
+ line("s9.east", "claude.south-west", stroke: stroke-dashed, mark: arrow-end)
+ line("s12.west", "claude.south-east", stroke: stroke-dashed, mark: arrow-end)
+})
diff --git a/docs/paper/arxiv/figures/timeline.typ b/docs/paper/arxiv/figures/timeline.typ
new file mode 100644
index 00000000..e2fdf526
--- /dev/null
+++ b/docs/paper/arxiv/figures/timeline.typ
@@ -0,0 +1,93 @@
+#import "@preview/cetz:0.4.0": canvas, draw
+#import "@preview/cetz-plot:0.1.2": plot
+
+#set page(width: auto, height: auto, margin: 10pt)
+#set text(size: 8pt, font: "New Computer Modern")
+
+// --- Colors ---
+#let col-models = rgb("#4e79a7") // steel blue for problem types
+#let col-rules = rgb("#59a14f") // green for reduction rules
+// Phase background colors
+#let phase1-fill = rgb("#4e79a7").lighten(92%) // light blue
+#let phase2-fill = rgb("#f0d060").lighten(70%) // light yellow
+#let phase3-fill = rgb("#59a14f").lighten(88%) // light green
+
+// --- Data ---
+// Week numbers are weeks since project start (Jan 9, 2026 = week 0)
+// Jan 10 = ~week 0.14, Jan 26 = week 2.43, Feb 15 = week 5.29, Mar 13 = week 9.0
+#let models-data = ((0.14, 17), (2.43, 20), (5.29, 21), (9.0, 27))
+#let rules-data = ((0.14, 0), (2.43, 22), (5.29, 44), (9.0, 45))
+
+// Phase boundaries in weeks:
+// Phase 1: Jan 9 - Feb 22 = weeks 0 to 6.29
+// Phase 2: Feb 22 - Mar 1 = weeks 6.29 to 7.29
+// Phase 3: Mar 1 - Mar 13 = weeks 7.29 to 9.0
+#let phase1-end = 6.29
+#let phase2-end = 7.29
+#let week-max = 9.5
+
+#canvas(length: 0.6cm, {
+ import draw: *
+
+ plot.plot(
+ size: (12, 7),
+ x-label: [Weeks since project start (Jan 9, 2026)],
+ y-label: [Cumulative count],
+ x-min: 0, x-max: week-max,
+ y-min: 0, y-max: 52,
+ x-tick-step: 1,
+ y-tick-step: 10,
+ x-grid: "major",
+ y-grid: "major",
+ axis-style: "scientific",
+ legend: "inner-north-west",
+ legend-style: (
+ stroke: 0.5pt + luma(200),
+ fill: white,
+ padding: 0.3,
+ ),
+ {
+ // --- Phase background bands ---
+ plot.add-fill-between(
+ domain: (0, phase1-end),
+ x => 52, x => 0,
+ style: (stroke: none, fill: phase1-fill),
+ label: none,
+ )
+ plot.add-fill-between(
+ domain: (phase1-end, phase2-end),
+ x => 52, x => 0,
+ style: (stroke: none, fill: phase2-fill),
+ label: none,
+ )
+ plot.add-fill-between(
+ domain: (phase2-end, week-max),
+ x => 52, x => 0,
+ style: (stroke: none, fill: phase3-fill),
+ label: none,
+ )
+
+ // --- Data lines ---
+ // Problem types (solid blue)
+ plot.add(
+ models-data,
+ mark: "o",
+ mark-size: 0.15,
+ line: "linear",
+ style: (stroke: (paint: col-models, thickness: 1.6pt), fill: col-models),
+ label: [Problem types],
+ )
+
+ // Reduction rules (dashed green)
+ plot.add(
+ rules-data,
+ mark: "square",
+ mark-size: 0.15,
+ line: "linear",
+ style: (stroke: (paint: col-rules, thickness: 1.6pt, dash: "dashed"), fill: col-rules),
+ label: [Reduction rules],
+ )
+
+ },
+ )
+})
diff --git a/docs/paper/arxiv/figures/topology-issues.typ b/docs/paper/arxiv/figures/topology-issues.typ
new file mode 100644
index 00000000..2df434fc
--- /dev/null
+++ b/docs/paper/arxiv/figures/topology-issues.typ
@@ -0,0 +1,144 @@
+#import "@preview/cetz:0.4.2": canvas, draw
+
+#set page(width: auto, height: auto, margin: 8pt)
+#set text(size: 7pt, font: "New Computer Modern")
+
+// Colors
+#let col-ok = rgb("#4e79a7") // healthy node
+#let col-ok-fill = rgb("#d0ddef")
+#let col-warn = rgb("#e15759") // problem highlight
+#let col-warn-fill = rgb("#fce4e4")
+#let col-sat = rgb("#59a14f") // 3-SAT / proof chain
+#let col-sat-fill = rgb("#ddf0dd")
+#let col-edge = rgb("#5D6D7E")
+#let col-redun = rgb("#e8913a") // redundant
+#let col-ghost = rgb("#cccccc")
+
+#let node-r = 0.32
+
+#canvas(length: 0.5cm, {
+ import draw: *
+
+ // Helper: draw a problem node
+ let pnode(pos, label, col: col-ok, fill: col-ok-fill, name: none, r: node-r) = {
+ let n = if name != none { name } else { label }
+ circle(pos, radius: r, fill: fill, stroke: 0.8pt + col, name: n)
+ content(n, text(6pt, weight: "bold", fill: col.darken(30%), label))
+ }
+
+ // Helper: draw a directed edge
+ let edge(from, to, col: col-edge, thick: 0.5pt, dash: none) = {
+ line(from, to,
+ stroke: (paint: col, thickness: thick, dash: dash),
+ mark: (end: "straight", scale: 0.35))
+ }
+
+ // ============================================================
+ // PANEL (a): Orphan node
+ // ============================================================
+ let ax = 0.0
+ let ay = 0.0
+
+ // Title
+ content((ax + 2.5, ay + 4.8), text(8pt, weight: "bold", "(a) Orphan node"))
+
+ // Connected subgraph
+ pnode((ax + 0.5, ay + 3.5), "SAT", col: col-sat, fill: col-sat-fill, name: "a-sat")
+ pnode((ax + 2.5, ay + 2.0), "MIS", name: "a-mis")
+ pnode((ax + 0.5, ay + 0.5), "QUBO", name: "a-qubo")
+ pnode((ax + 2.5, ay + 3.5), "MVC", name: "a-mvc")
+ pnode((ax + 4.0, ay + 0.5), "ILP", name: "a-ilp")
+
+ edge("a-sat.south", "a-mis.north")
+ edge("a-mvc.south", "a-mis.north")
+ edge("a-mis.south", "a-qubo.north")
+ edge("a-mis.south", "a-ilp.north")
+
+ // Orphan node — isolated, no edges
+ pnode((ax + 5.0, ay + 3.0), "BMF", col: col-warn, fill: col-warn-fill, name: "a-orphan")
+
+ // Dashed box around orphan
+ rect((ax + 4.3, ay + 2.3), (ax + 5.7, ay + 3.7),
+ stroke: (thickness: 0.6pt, paint: col-warn, dash: "dashed"), radius: 4pt)
+
+ // Annotation
+ content((ax + 5.0, ay + 1.8), text(5.5pt, fill: col-warn,
+ [no reductions\ to or from]))
+
+ // ============================================================
+ // PANEL (b): Redundant rule
+ // ============================================================
+ let bx = 8.0
+ let by = 0.0
+
+ content((bx + 2.5, by + 4.8), text(8pt, weight: "bold", "(b) Redundant rule"))
+
+ // Three nodes in a row, with the composite path on top
+ pnode((bx + 0.0, by + 3.5), "A", name: "b-a")
+ pnode((bx + 2.5, by + 3.5), "B", name: "b-b")
+ pnode((bx + 5.0, by + 3.5), "C", name: "b-c")
+
+ // Good composite path: A → B → C (two hops, low overhead)
+ edge("b-a.east", "b-b.west", col: col-ok, thick: 0.8pt)
+ edge("b-b.east", "b-c.west", col: col-ok, thick: 0.8pt)
+
+ // Cost labels on good path
+ content((bx + 1.25, by + 4.1), text(5.5pt, fill: col-ok.darken(20%), $O(n)$))
+ content((bx + 3.75, by + 4.1), text(5.5pt, fill: col-ok.darken(20%), $O(n m)$))
+
+ // Redundant direct edge: A → C (higher overhead, curves below)
+ bezier("b-a.south", "b-c.south",
+ (bx + 1.5, by + 1.2), (bx + 3.5, by + 1.2),
+ stroke: (paint: col-warn, thickness: 0.9pt, dash: "densely-dashed"),
+ mark: (end: "straight", scale: 0.35))
+
+ // Cost label on redundant edge
+ content((bx + 2.5, by + 1.5), text(5.5pt, fill: col-warn,
+ [direct: $O(n^2 m)$]))
+
+ // Annotation
+  content((bx + 2.5, by + 0.5), text(5.5pt, fill: col-warn,
+    [composite $O(n m)$ $lt.eq$ direct $O(n^2 m)$\ $arrow.r.double$ rule is dominated]))
+
+ // ============================================================
+ // PANEL (c): Missing NP-hardness proof path
+ // ============================================================
+ let cx = 16.5
+ let cy = 0.0
+
+ content((cx + 2.5, cy + 4.8), text(8pt, weight: "bold", "(c) Missing proof path"))
+
+ // 3-SAT as the NP-hardness source
+ pnode((cx + 0.0, cy + 3.5), "3-SAT", col: col-sat, fill: col-sat-fill, name: "c-3sat")
+ pnode((cx + 2.0, cy + 3.5), "SAT", col: col-sat, fill: col-sat-fill, name: "c-sat")
+ pnode((cx + 4.0, cy + 3.5), "MIS", col: col-sat, fill: col-sat-fill, name: "c-mis")
+ pnode((cx + 2.0, cy + 1.5), "ILP", col: col-sat, fill: col-sat-fill, name: "c-ilp")
+
+ // Green proof chain
+ edge("c-3sat.east", "c-sat.west", col: col-sat, thick: 0.7pt)
+ edge("c-sat.east", "c-mis.west", col: col-sat, thick: 0.7pt)
+ edge("c-sat.south", "c-ilp.north", col: col-sat, thick: 0.7pt)
+
+ // Disconnected node — has edges but no path FROM 3-SAT
+ pnode((cx + 5.5, cy + 1.5), "TSP", col: col-warn, fill: col-warn-fill, name: "c-tsp")
+ pnode((cx + 5.5, cy + 3.5), "BinP", col: col-warn, fill: col-warn-fill, name: "c-binp")
+
+ // TSP has outgoing edge to ILP, but no incoming from 3-SAT
+ edge("c-tsp.west", "c-ilp.east", col: col-ghost)
+ edge("c-binp.west", "c-mis.east", col: col-ghost)
+
+ // Missing edges shown as dotted with "?"
+ line((cx + 3.0, cy + 2.8), (cx + 5.0, cy + 1.8),
+ stroke: (paint: col-warn, thickness: 0.5pt, dash: "dotted"),
+ mark: (end: "straight", scale: 0.3))
+ content((cx + 4.3, cy + 2.6), text(6pt, fill: col-warn, "?"))
+
+ line((cx + 3.0, cy + 3.8), (cx + 5.0, cy + 3.8),
+ stroke: (paint: col-warn, thickness: 0.5pt, dash: "dotted"),
+ mark: (end: "straight", scale: 0.3))
+ content((cx + 4.3, cy + 4.2), text(6pt, fill: col-warn, "?"))
+
+ // Annotation
+ content((cx + 5.5, cy + 0.5), text(5.5pt, fill: col-warn,
+ [no path from 3-SAT\ $arrow.r.double$ NP-hardness\ unproven in graph]))
+})
diff --git a/docs/paper/arxiv/figures/verification-funnel.typ b/docs/paper/arxiv/figures/verification-funnel.typ
new file mode 100644
index 00000000..7d844b8d
--- /dev/null
+++ b/docs/paper/arxiv/figures/verification-funnel.typ
@@ -0,0 +1,220 @@
+#import "@preview/cetz:0.4.2": canvas, draw
+
+#set page(width: auto, height: auto, margin: 5pt)
+#set text(size: 7pt, font: "New Computer Modern")
+
+// Filter layer data: (name, description)
+#let filters = (
+ ("Type system", "rejects structural errors"),
+ ("Round-trip tests", "rejects semantic errors"),
+ ("Overhead validation", "rejects incorrect complexity claims"),
+ ("Agentic feature tests", "rejects usability issues"),
+)
+
+// Color palette: gradient from soft red (top) to green (bottom)
+#let col-top = rgb("#d94f4f") // red — many errors
+#let col-bot = rgb("#4ea45e") // green — correct
+#let col-ground = rgb("#4e79a7") // steel blue — ground truth
+
+#let lerp-color(t) = {
+ color.mix((col-top, (1 - t) * 100%), (col-bot, t * 100%))
+}
+
+#canvas(length: 0.55cm, {
+ import draw: *
+
+ let n = 4 // number of filter layers
+ let layer-h = 1.3 // height of each filter layer
+ let gap = 0.3 // gap between layers
+ let max-w = 13.0 // width at top (agent output)
+ let min-w = 5.0 // width at bottom (correct code)
+ let cx = 0 // center x
+ let cap-h = 1.3 // height of top/bottom cap regions
+ let right-x = max-w / 2 + 0.8 // x for right-side descriptions
+
+ // Compute total funnel geometry
+ let funnel-top = cap-h + 0.6 // y where first filter starts
+ let funnel-bot = -(n * (layer-h + gap) - gap) - 0.4 // y where last filter ends
+ let total-h = funnel-top - funnel-bot
+
+ // --- Top cap: "Agent output" ---
+ let top-y = funnel-top + cap-h
+ let top-w = max-w + 1.0
+
+ // Wide entry region
+ merge-path(
+ close: true,
+ fill: col-top.lighten(85%),
+ stroke: (thickness: 0.8pt, paint: col-top.lighten(30%)),
+ name: "top-cap",
+ {
+ line(
+ (cx - top-w / 2, top-y),
+ (cx + top-w / 2, top-y),
+ (cx + max-w / 2, funnel-top),
+ (cx - max-w / 2, funnel-top),
+ )
+ },
+ )
+ content(
+ (cx, (top-y + funnel-top) / 2 + 0.15),
+ anchor: "center",
+ text(8.5pt, weight: "bold", fill: col-top.darken(20%), [Agent output]),
+ )
+ content(
+ (cx, (top-y + funnel-top) / 2 - 0.4),
+ anchor: "center",
+ text(6.5pt, fill: col-top.darken(5%), style: "italic", [many candidate implementations]),
+ )
+
+ // --- Filter layers (narrowing from top to bottom) ---
+ for i in range(n) {
+ // t ranges from 0 (top filter) to 1 (bottom filter)
+ let t-top = i / n
+ let t-bot = (i + 1) / n
+
+ // Widths: linear interpolation from max-w to min-w
+ let w-top = max-w - (max-w - min-w) * t-top
+ let w-bot = max-w - (max-w - min-w) * t-bot
+
+ // Y coordinates (growing downward from funnel-top)
+ let y-top = funnel-top - i * (layer-h + gap)
+ let y-bot = y-top - layer-h
+ let y-mid = (y-top + y-bot) / 2
+
+ // Width at midpoint
+ let t-mid = (i + 0.5) / n
+ let w-mid = max-w - (max-w - min-w) * t-mid
+
+ // Color for this layer
+ let col = lerp-color(t-mid)
+ let col-fill = col.lighten(75%)
+ let col-stroke = col.darken(10%)
+ let col-text = col.darken(35%)
+
+ let name-id = "filter" + str(i)
+
+ // Draw trapezoid
+ merge-path(
+ close: true,
+ fill: col-fill,
+ stroke: (thickness: 1pt, paint: col-stroke),
+ name: name-id,
+ {
+ line(
+ (cx - w-top / 2, y-top),
+ (cx + w-top / 2, y-top),
+ (cx + w-bot / 2, y-bot),
+ (cx - w-bot / 2, y-bot),
+ )
+ },
+ )
+
+ let (mechanism, desc) = filters.at(i)
+
+ // Mechanism label inside the trapezoid
+ content(
+ (cx, y-mid + 0.15),
+ anchor: "center",
+ text(8pt, weight: "bold", fill: col-text, mechanism),
+ )
+
+ // Description below the name
+ content(
+ (cx, y-mid - 0.35),
+ anchor: "center",
+ text(6.5pt, fill: col-text.lighten(20%), style: "italic", desc),
+ )
+
+ // Right-side connecting dotted line + filter icon
+ let edge-x = cx + w-mid / 2
+ line(
+ (edge-x + 0.05, y-mid), (right-x - 0.15, y-mid),
+ stroke: (thickness: 0.5pt, paint: col-stroke.lighten(40%), dash: "dotted"),
+ )
+ content(
+ (right-x, y-mid),
+ anchor: "west",
+ text(6pt, fill: col-stroke, [#sym.times.o rejected]),
+ )
+ }
+
+ // --- Bottom cap: "Correct code" ---
+ let last-y-bot = funnel-top - (n - 1) * (layer-h + gap) - layer-h
+ let bot-y = last-y-bot - 0.4
+ let bot-cap-y = bot-y - cap-h
+ let bot-w = min-w
+
+ merge-path(
+ close: true,
+ fill: col-bot.lighten(80%),
+ stroke: (thickness: 1pt, paint: col-bot.darken(10%)),
+ name: "bot-cap",
+ {
+ line(
+ (cx - bot-w / 2, bot-y),
+ (cx + bot-w / 2, bot-y),
+ (cx + bot-w / 2 - 0.8, bot-cap-y),
+ (cx - bot-w / 2 + 0.8, bot-cap-y),
+ )
+ },
+ )
+ content(
+ (cx, (bot-y + bot-cap-y) / 2 + 0.1),
+ anchor: "center",
+ text(8.5pt, weight: "bold", fill: col-bot.darken(30%), [Correct code]),
+ )
+ content(
+ (cx, (bot-y + bot-cap-y) / 2 - 0.45),
+ anchor: "center",
+ text(6.5pt, fill: col-bot.darken(10%), style: "italic", [matches contributor ground truth]),
+ )
+
+ // --- Left side: "Contributor-specified ground truth" vertical arrow ---
+ let gt-x = cx - max-w / 2 - 2.2
+ let gt-top = funnel-top + cap-h * 0.5
+ let gt-bot = bot-cap-y + 0.3
+
+ // Main vertical arrow
+ line(
+ (gt-x, gt-top), (gt-x, gt-bot),
+ stroke: (thickness: 1.4pt, paint: col-ground),
+ mark: (end: "straight", scale: 0.5),
+ )
+
+ // Label for the vertical arrow
+ content(
+ (gt-x - 0.25, (gt-top + gt-bot) / 2),
+ anchor: "east",
+ angle: 90deg,
+ text(7pt, weight: "bold", fill: col-ground.darken(10%),
+ [Contributor-specified ground truth],
+ ),
+ )
+
+ // Dashed arrows from ground-truth line into each filter layer
+ for i in range(n) {
+ let t-mid = (i + 0.5) / n
+ let w-mid = max-w - (max-w - min-w) * t-mid
+
+ let y-top = funnel-top - i * (layer-h + gap)
+ let y-bot = y-top - layer-h
+ let y-mid = (y-top + y-bot) / 2
+
+ let target-x = cx - w-mid / 2
+
+ line(
+ (gt-x + 0.1, y-mid), (target-x - 0.1, y-mid),
+ stroke: (thickness: 0.7pt, paint: col-ground.lighten(30%), dash: "dashed"),
+ mark: (end: "straight", scale: 0.35),
+ )
+ }
+
+ // Also connect to the bottom cap
+ line(
+ (gt-x + 0.1, (bot-y + bot-cap-y) / 2),
+ (cx - bot-w / 2 + 0.3, (bot-y + bot-cap-y) / 2),
+ stroke: (thickness: 0.7pt, paint: col-ground.lighten(30%), dash: "dashed"),
+ mark: (end: "straight", scale: 0.35),
+ )
+})
diff --git a/docs/paper/arxiv/figures/verification-pyramid.typ b/docs/paper/arxiv/figures/verification-pyramid.typ
new file mode 100644
index 00000000..d0226b1f
--- /dev/null
+++ b/docs/paper/arxiv/figures/verification-pyramid.typ
@@ -0,0 +1,133 @@
+#import "@preview/cetz:0.4.2": canvas, draw
+
+#set page(width: auto, height: auto, margin: 5pt)
+#set text(size: 7pt, font: "New Computer Modern")
+
+// Layer data: (mechanism, error class caught)
+#let layers = (
+ ("Type system (Rust compiler)", "API misuse"),
+ ("Unit tests (eval, serialization)", "evaluation errors"),
+ ("Closed-loop tests (round-trip)", "mapping errors"),
+ ("Overhead validation (symbolic exprs)", "formula errors"),
+ ("Materialized fixtures (JSON ground truth)", "test gaming"),
+ ("Agentic review (parallel subagents)", "convention violations"),
+ ("Documentation (proof sketch)", "logical errors"),
+)
+
+// Color gradient: blue (automated, bottom) -> gold (human, top)
+#let col-auto = rgb("#4e79a7") // blue
+#let col-human = rgb("#e8a838") // gold
+
+#let lerp-color(t) = {
+ color.mix((col-auto, (1 - t) * 100%), (col-human, t * 100%))
+}
+
+#canvas(length: 0.55cm, {
+ import draw: *
+
+ let n = 7 // number of layers
+ let layer-h = 1.1 // height of each layer
+ let gap = 0.12 // gap between layers
+ let max-w = 14.0 // width of bottom layer
+ let min-w = 5.5 // width of top layer
+ let cx = 0 // center x
+ let right-col-x = max-w / 2 + 0.6 // x position for right-side labels
+
+ // Draw layers from bottom to top
+ for i in range(n) {
+ let t-bot = i / n
+ let t-top = (i + 1) / n
+
+ // Widths: linear interpolation
+ let w-bot = max-w - (max-w - min-w) * t-bot
+ let w-top = max-w - (max-w - min-w) * t-top
+
+ // Width at midpoint (for label positioning)
+ let t-mid = (i + 0.5) / n
+ let w-mid = max-w - (max-w - min-w) * t-mid
+
+ // Y coordinates (layer 0 at bottom)
+ let y-bot = i * (layer-h + gap)
+ let y-top = y-bot + layer-h
+ let y-mid = (y-bot + y-top) / 2
+
+ // Color for this layer
+ let col = lerp-color(t-bot)
+ let col-fill = col.lighten(70%)
+ let col-stroke = col.darken(10%)
+ let col-text = col-stroke.darken(30%)
+
+ let name-id = "layer" + str(i)
+
+ // Draw trapezoid
+ merge-path(
+ close: true,
+ fill: col-fill,
+ stroke: (thickness: 0.8pt, paint: col-stroke),
+ name: name-id,
+ {
+ line(
+ (cx - w-bot / 2, y-bot),
+ (cx + w-bot / 2, y-bot),
+ (cx + w-top / 2, y-top),
+ (cx - w-top / 2, y-top),
+ )
+ },
+ )
+
+ let (mechanism, catches) = layers.at(i)
+
+ // Mechanism label centered inside the trapezoid
+ content(
+ (cx, y-mid),
+ anchor: "center",
+ text(7.5pt, weight: "bold", fill: col-text,
+ [L#(i + 1): #mechanism],
+ ),
+ )
+
+ // "catches:" label outside on the right, connected by a thin line
+ let edge-x = cx + w-mid / 2 // right edge at midpoint height
+
+ // Small connecting line from trapezoid edge to label
+ line(
+ (edge-x + 0.05, y-mid), (right-col-x - 0.15, y-mid),
+ stroke: (thickness: 0.4pt, paint: col-stroke.lighten(30%), dash: "dotted"),
+ )
+
+ content(
+ (right-col-x, y-mid),
+ anchor: "west",
+ text(6.5pt, fill: col-text.lighten(20%),
+ [#sym.arrow.r #emph(catches)],
+ ),
+ )
+ }
+
+ // Side annotations
+ let total-h = n * (layer-h + gap) - gap
+
+ // Left bracket: "Automated" for bottom 4 layers (L1-L4)
+ let bx-left = cx - max-w / 2 - 0.8
+ let auto-top = 4 * (layer-h + gap) - gap
+ line(
+ (bx-left + 0.15, 0), (bx-left, 0), (bx-left, auto-top), (bx-left + 0.15, auto-top),
+ stroke: (thickness: 0.7pt, paint: col-auto, dash: "dashed"),
+ )
+ content(
+ (bx-left - 0.15, auto-top / 2), anchor: "east",
+ text(6pt, fill: col-auto, weight: "bold", [Fully\ automated]),
+ )
+
+ // Left bracket: "Human-readable" for top 3 layers (L5-L7)
+ let human-bot = 4 * (layer-h + gap)
+ let human-top = total-h
+ line(
+ (bx-left + 0.15, human-bot), (bx-left, human-bot), (bx-left, human-top), (bx-left + 0.15, human-top),
+ stroke: (thickness: 0.7pt, paint: col-human, dash: "dashed"),
+ )
+ content(
+ (bx-left - 0.15, (human-bot + human-top) / 2), anchor: "east",
+ text(6pt, fill: col-human.darken(10%), weight: "bold", [Human-\ readable]),
+ )
+})
diff --git a/docs/paper/arxiv/paper-redesign-spec.md b/docs/paper/arxiv/paper-redesign-spec.md
new file mode 100644
index 00000000..5fbdecd0
--- /dev/null
+++ b/docs/paper/arxiv/paper-redesign-spec.md
@@ -0,0 +1,136 @@
+# Paper Redesign Spec
+
+**Date:** 2026-03-14
+**Title:** Bridging NP-Hard Problems: Scaling Software Beyond Human Capacity with Agentic Coding
+
+## Core Thesis
+
+**Bridge problems** are software projects too large for human teams to build and maintain at scale. Agents can build them because systematic verification constrains agent output to match contributor-specified ground truth. NP-hard reductions are the first convincing example.
+
+## Bridge Problem Definition
+
+A software project where subtasks are homogeneous and formally verifiable, but three structural barriers make human-only execution infeasible at scale:
+
+1. **Convention drift** — humans can't maintain uniform conventions across hundreds of contributions; agents read CLAUDE.md every time
+2. **Effort exhaustion** — humans can't sustain the effort to verify 100+ problem types and keep tests passing without a user community; agents never run out of energy
+3. **Knowledge discontinuity** — humans graduate, newcomers can't absorb implicit knowledge; skills make onboarding executable
+
+The correctness concern: contributor-specified ground truth (definitions, examples, expected behavior) flows through the verification stack (type system → round-trip tests → overhead validation → agentic tests), constraining what agents can produce. Agent output ⊆ contributor intent.
+
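+A minimal sketch of this constraint as a round-trip check (hypothetical trait and method
+names — the actual `src/traits.rs` API may differ):
+
+```rust
+// Sketch: a reduction rule is accepted only if solutions survive the round trip.
+fn round_trip_ok<A: Problem, B: Problem>(
+    rule: &impl Reduction<A, B>,
+    instance: &A,
+    solve: impl Fn(&B) -> B::Solution,
+) -> bool {
+    let target = rule.reduce(instance);        // forward map: A -> B
+    let sol_b = solve(&target);                // solve the target problem
+    let sol_a = rule.extract(&target, &sol_b); // map the solution back to A
+    instance.is_valid(&sol_a)                  // check against contributor ground truth
+}
+```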
+## Section Structure
+
+### Section 1: Introduction (~1.5 pages)
+- Open with familiar examples (airlines, chips, logistics → NP-hard problems → need reductions)
+- The reduction graph idea (connect problems to solvers)
+- **The claim:** This is a *bridge problem* — software too large for humans, made possible by agents constrained through systematic verification
+- Three barriers preview
+- **Fig 1: Scaling Wall**
+- Contributions bullet list
+
+### Section 2: Bridge Problems (~2 pages)
+- Define bridge problems formally
+- Three barriers, each with concrete evidence from this project:
+ - Convention drift: agents never deviated from file naming, trait implementation, test patterns
+ - Effort exhaustion: 45 rules in 9 weeks vs Julia predecessor 20 types in 4 years; agents never tire of running the same verification loop
+ - Knowledge discontinuity: skills encode workflow as executable documents; new maintainers invoke same skills that produced the codebase
+- The correctness concern and how verification addresses it
+- **Fig 2: Verification Funnel** — agent generates candidates → type system rejects invalid structure → round-trip tests reject wrong semantics → agentic tests reject poor UX → only correct code survives
+- Other candidate domains (algorithm libraries, compiler optimization passes, HDL, numerical linear algebra)
+
+### Section 3: Case Study — The Reduction Graph (~2 pages)
+- What is a reduction (brief definition)
+- Graph structure: 27 problems, 45 rules, 56 edges
+- **Fig 3: Reduction Graph** with color-coded solver-reachability arrows:
+ - Blue = paths reaching ILP (Gurobi/CPLEX)
+ - Red = paths reaching QUBO (D-Wave)
+ - Green = paths reaching UD-MIS (Rydberg atoms)
+ - Nodes colored by which solvers they can reach (multi-color = multiple solvers)
+ - Reader traces arrows forward (problem → solver) or backward (solver → problems)
+- Emergent compositionality highlighted as multi-hop colored paths (e.g., Factoring → CircuitSAT → SAT → ILP)
+- Round-trip testing (brief)
+
+### Section 4: Methodology — Skills + Verification (~2 pages)
+- Skills: persistent, versioned workflow scripts that encode convention
+- **Fig 4: Pipeline** (existing 6-stage board, orange=human, blue=agent)
+- The 14 skills and what they do (table or compact list)
+- Verification stack in practice (type system, unit tests, closed-loop, overhead validation, agentic tests)
+
+### Section 5: Evidence (~2 pages)
+- **Fig 5: Development Timeline** — cumulative plot of problem types + rules over 9 weeks, phase annotations (manual → basic-skills → full-pipeline), Julia predecessor 4-year trajectory overlaid
+- Development metrics (58 PRs, 15:1 agent-to-human message ratio)
+- Quality gate: 75% rejection rate on 322 batch-submitted proposals
+- Barrier-by-barrier evidence:
+ - Convention: zero convention violations in agent-authored code
+ - Effort: acceleration curve across phases
+ - Continuity: skills as executable onboarding (dev-setup, add-rule, etc.)
+
+### Section 6: Discussion (~1.5 pages)
+- Limitations (N=1, skill engineering cost, confounding factors)
+- Why human experts remain essential (LLM reasoning limits, citing existing research)
+- Future work (100+ problems, formal verification with Lean/Coq, cost-aware path selection, automated discovery via AlphaEvolve)
+
+### Appendices
+- A1: System Architecture (trait hierarchy, ReduceTo, macros) — existing figure
+- A2: Verification Stack Details (7-layer pyramid) — existing figure
+- A3: Topology Issues (orphan nodes, redundant rules, NP-hardness gaps) — moved from main text
+
+## Figure Specifications
+
+### Fig 1: Scaling Wall (NEW)
+- **Type:** Line chart with annotations
+- **X-axis:** Number of problem types (0 → 200)
+- **Y-axis:** Software quality (convention compliance, test coverage, documentation completeness)
+- **Lines:**
+ - Human team trajectory: rises then plateaus/declines as it hits 3 barrier walls
+ - Agent + verification trajectory: breaks through all 3 walls
+- **Annotations:** Three vertical dashed lines marking the barriers (convention drift, effort exhaustion, knowledge discontinuity)
+- **Data points:** Julia predecessor at 20 (4 years), this work at 27 (9 weeks), vision at 100+
+- **Format:** Typst/CeTZ or TikZ
+
+### Fig 2: Verification Funnel (NEW)
+- **Type:** Funnel/filter diagram
+- **Flow:** Wide at top (agent generates many candidate implementations) → narrowing through filters → narrow at bottom (correct code)
+- **Layers (top to bottom):**
+ 1. Agent output (wide) — "many plausible implementations"
+ 2. Type system filter — "rejects structural errors (wrong trait impl, type mismatch)"
+ 3. Round-trip tests filter — "rejects semantic errors (wrong transformation, broken inverse)"
+ 4. Overhead validation — "rejects incorrect complexity claims"
+ 5. Agentic feature tests — "rejects UX/documentation issues"
+ 6. Correct code (narrow) — "matches contributor ground truth"
+- **Side annotation:** "Contributor-specified ground truth" arrow pointing into each filter level
+- **Format:** Typst/CeTZ or TikZ
+
+### Fig 3: Reduction Graph (REDESIGN)
+- **Base:** Existing 27-node directed graph
+- **Enhancement:** Color-coded edges by solver reachability
+ - Blue edges/paths → ILP (Gurobi/CPLEX)
+ - Red edges/paths → QUBO (D-Wave quantum annealer)
+ - Green edges/paths → UD-MIS (Rydberg atom arrays)
+- **Node coloring:** Nodes tinted by which solvers they can reach (multi-color for multiple solvers)
+- **Solver hubs:** Prominent labels at bottom: "Gurobi/CPLEX", "D-Wave", "Rydberg"
+- **Key insight:** Same graph answers both "what can this solver solve?" and "what solvers can this problem reach?"
+- **Format:** Typst/CeTZ (redesign existing reduction-graph.typ)
+
+### Fig 4: Pipeline (EXISTING — keep as-is)
+- 6-stage Kanban board
+- Orange = human judgment points, Blue = agent-automated steps
+
+### Fig 5: Development Timeline (NEW)
+- **Type:** Cumulative line plot
+- **X-axis:** Time (weeks 1-9, with dates)
+- **Y-axis (left):** Cumulative count (problem types, reduction rules)
+- **Lines:** Two lines — problem types (solid) and reduction rules (dashed)
+- **Phase bands:** Background shading for manual / basic-skills / full-pipeline phases
+- **Overlay:** Julia predecessor trajectory (4 years to 20 types) shown as a faint reference line, dramatically slower
+- **Data source:** git-mining-results.json
+- **Format:** Typst/CeTZ-plot or matplotlib-generated PDF
+
+## Key Changes from Current Paper
+1. "Bridge problem" concept elevated from Discussion to Section 2
+2. Reduction graph becomes case study illustrating the concept, not the main event
+3. 4 impossibilities merged to 3 barriers (effort exhaustion + testing frequency combined)
+4. Framing: "agents break through barriers, verification ensures correctness" (not "humans + agents must combine")
+5. Reduction graph figure redesigned with solver-reachability coloring
+6. Three Roles figure cut (pipeline is sufficient)
+7. Topology Issues figure moved to appendix
+8. Problem tree figure absorbed into reduction graph or cut
diff --git a/docs/paper/arxiv/paper.tex b/docs/paper/arxiv/paper.tex
new file mode 100644
index 00000000..00675a33
--- /dev/null
+++ b/docs/paper/arxiv/paper.tex
@@ -0,0 +1,770 @@
+\documentclass[conference]{IEEEtran}
+\usepackage{cite}
+\usepackage{amsmath,amssymb,amsfonts}
+\usepackage{graphicx}
+\usepackage{textcomp}
+\usepackage{xcolor}
+\usepackage{booktabs}
+\usepackage{listings}
+\usepackage{hyperref}
+\usepackage{cleveref}
+% TikZ removed — all diagrams are now Typst-compiled PDFs
+
+\begin{document}
+
+\title{Grand Assembly of Computationally Hard Problems: The Art of Agentic Coding}
+
+\author{...} % placeholder
+
+\maketitle
+
+\begin{abstract}
+A unified library of reductions between NP-hard problems would let practitioners route any supported problem to a specialized solver---quantum hardware, commercial optimizers, or domain-specific algorithms---through a single interface.
+Yet building such a library by human effort alone is impractical: it requires many researchers to adopt a common language and conventions, and demands continuous full-time maintenance as new reduction rules are discovered.
+We show that AI coding agents, guided by a system of reusable \emph{skills}---versioned, composable workflow documents that encode project conventions and domain knowledge---can overcome these barriers.
+We demonstrate the approach by building a large library of problem reductions. During this process, we make three contributions.
+First, a \emph{no-code contribution route}: domain experts contribute reductions by filing structured issues with AI assistance, requiring no knowledge of the implementation language or codebase.
+Second, a \emph{seven-layer verification stack} that enforces mathematical correctness, culminating in \emph{agentic feature tests}---AI agents that act as first-time users, exercising the library end-to-end---replacing the years of community trial-and-error traditionally needed to surface integration bugs.
+Third, a \emph{fully automated pipeline} that enables sustainable maintenance by a single maintainer, with an onboarding path that lets a new maintainer take over the project in half a day.
+Building this stack took approximately nine weeks and produced a library of 27~problem types connected by 45~verified reduction rules. We show that AI agents can produce correct, maintainable software beyond the scale of human capability, while human developers focus on the creative decisions.
+\end{abstract}
+
+%======================================================================
+% SECTION 1: INTRODUCTION
+%======================================================================
+\section{Introduction}\label{sec:intro}
+
+\subsection{The Problem: Many Hard Problems, Few Solvers}
+
+Combinatorial optimization problems arise throughout science and engineering.
+An airline needs to assign crews to flights.
+A chip designer needs to allocate registers.
+A logistics company needs to route delivery trucks.
+Each of these is an instance of an NP-hard problem---a class of problems for which no efficient general-purpose algorithm is known, but which can be solved in practice for moderate sizes by specialized solvers.
+
+The difficulty is that each solver speaks its own narrow language.
+Rydberg atom arrays~\cite{lucas2014, pichler2018}---a type of quantum hardware---natively solve the Maximum Independent Set problem on geometric graphs.
+D-Wave quantum annealers~\cite{glover2019} solve Quadratic Unconstrained Binary Optimization.
+Commercial engines like Gurobi and CPLEX solve Integer Linear Programs.
+A practitioner with a graph coloring problem cannot directly use any of these solvers without first \emph{translating} the problem into a form the solver understands.
+
+This translation is called a \emph{reduction}: a polynomial-time algorithm that converts an instance of one problem into an instance of another, together with an inverse map that translates the solution back.
+Reductions are the central tool of computational complexity theory~\cite{karp1972}, but they also have immediate practical value: each verified reduction is a bridge connecting a new problem to an existing solver.
+
+\subsection{Our Approach: A Reduction Graph}
+
+We organize these reductions as a \emph{directed graph}.
+Each node is a problem type (e.g., Satisfiability, Max-Cut, Traveling Salesman).
+Each directed edge is a verified reduction---code that transforms instances forward and maps solutions back.
+Given any supported problem, a path through the graph leads to a solver, with every edge backed by tested code.
+
+\begin{figure*}[t]
+ \centering
+ \includegraphics[width=0.88\textwidth]{figures/problemtree.pdf}
+ \caption{The reduction graph connects 27~NP-hard problem types to three solver families through 45~verified transformation rules.
+ \textbf{Bottom layer}: solvers---Maximum Independent Set on unit-disk graphs (UD-MIS) for Rydberg atom arrays, QUBO for D-Wave quantum annealers, and ILP for commercial solvers.
+ \textbf{Middle layer}: 27~problem types with color-coded edges showing solver reachability.
+ MIS is the dominant hub with 14~incoming and 13~outgoing edges.
+ Each new edge creates composite paths through the entire graph.}
+ \label{fig:reduction-graph}
+\end{figure*}
+
+The graph currently contains 27~problem types connected by 56~directed edges (\Cref{fig:reduction-graph}).
+Reductions implemented independently compose automatically through the graph.
+For example, one contributor implemented Factoring $\to$ Circuit Satisfiability, and another implemented Circuit Satisfiability $\to$ Integer Linear Programming.
+Neither intended the composition, yet the graph enables factoring integers via linear programming by chaining the two reductions.
+Each new edge creates not just one connection but new paths through the entire graph.
+
+\subsection{Bridge Problems}
+
+Building this graph requires implementing 45~transformation rules, each involving 50--400 lines of verified code.
+The obstacle is not any single rule but the aggregate scale: the project is a \emph{bridge problem}---software too large for human teams to build and maintain at the required quality level.
+
+We identify three structural barriers that make bridge problems infeasible for human-only teams:
+\begin{enumerate}
+ \item \textbf{Convention drift.} Hundreds of contributions must follow identical file-naming, interface, and testing conventions.
+ Human contributors inevitably diverge; agents read the project specification on every invocation and never deviate.
+ \item \textbf{Effort exhaustion.} Each new reduction demands the same cycle of coding, testing, documentation, and review.
+ Human energy is finite; agents execute the same verification loop indefinitely without fatigue.
+ \item \textbf{Knowledge discontinuity.} Contributors graduate, change jobs, or lose context.
+ Implicit workflow knowledge---which files to create, which tests to write, which edge cases to check---is lost with each departure.
+ Reusable skills encode this knowledge as executable documents that any new contributor or agent can invoke.
+\end{enumerate}
+
+AI coding agents can break through these barriers, but only if their output is constrained to match contributor-specified ground truth.
+A contributor supplies the creative elements: which problems matter, what the formal definitions are, which examples reveal correctness.
+These flow through a verification stack---type system, round-trip tests, overhead validation, agentic feature tests---that rejects any agent output inconsistent with the specification.
+The agent produces volume; verification ensures correctness.
+
+\subsection{Contributions}
+
+Over nine weeks, a single maintainer and AI agents produced a Rust library with 27~problem types, 45~reduction rules, and $>$95\% test coverage.
+Our contributions are:
+\begin{itemize}
+ \item \textbf{The bridge problem concept}: a characterization of software projects where homogeneous, verifiable subtasks create scale barriers that agents can overcome (\Cref{sec:bridge}).
+ \item \textbf{A verified reduction graph} connecting 27~NP-hard problem types to specialized solvers, with emergent compositionality through automatic path composition (\Cref{sec:graph}).
+ \item \textbf{A skill-based methodology} for mathematical software engineering, encoding workflow knowledge as reusable, versioned scripts that decompose multi-file tasks into agent-executable steps (\Cref{sec:method}).
+ \item \textbf{Quantitative evidence}: a 15:1 agent-to-human message ratio, a 75\% rejection rate on 322~batch-submitted proposals demonstrating the quality gate's selectivity, and a 9-week development timeline (\Cref{sec:evaluation}).
+\end{itemize}
+
+The rest of this paper is organized as follows.
+\Cref{sec:bridge} defines bridge problems and the three barriers.
+\Cref{sec:graph} presents the reduction graph as a case study.
+\Cref{sec:method} describes the methodology.
+\Cref{sec:evaluation} provides evidence.
+\Cref{sec:related} surveys related work.
+\Cref{sec:conclusion} discusses limitations and future directions.
+
+%======================================================================
+% SECTION 2: BRIDGE PROBLEMS
+%======================================================================
+\section{Bridge Problems}\label{sec:bridge}
+
+A \emph{bridge problem} is a software project whose subtasks are homogeneous and formally verifiable, but whose scale exceeds what human teams can sustain.
+We identify three structural barriers that distinguish bridge problems from ordinary large projects.
+
+\subsection{Three Barriers to Human-Scale Development}
+
+\paragraph{Convention drift.}
+A reduction library demands strict uniformity: every rule file follows the same naming convention, implements the same trait, includes the same test pattern, and produces the same documentation artifacts.
+Human teams cannot maintain this discipline across hundreds of contributions.
+Conventions drift, shortcuts accumulate, and style guides go unread.
+In our approach, the project specification (\texttt{CLAUDE.md}) and reusable skills encode every convention as a machine-readable document.
+The agent reads it on every invocation and never deviates---convention compliance becomes a property of the tool, not a discipline of the team.
+
+\paragraph{Effort exhaustion.}
+Each new problem type requires definitions, examples, reductions, tests, overhead formulas, and documentation.
+Verifying each component one by one---running the same round-trip test for the 45th time, checking the same overhead formula against the same symbolic expressions---requires sustained, detail-oriented effort that no individual or small team can maintain indefinitely.
+A niche mathematical library also lacks the user base to surface issues through organic community usage.
+Agents execute the same verification loop without fatigue, and agentic feature tests---where an agent role-plays as a downstream user---provide the testing frequency that the community size cannot.
+
+\paragraph{Knowledge discontinuity.}
+Human maintainers graduate, change jobs, or lose interest.
+New contributors face a steep onboarding curve: understanding the architecture, the conventions, the testing patterns, and the implicit knowledge accumulated over years.
+Skills encode this knowledge in executable, versionable form.
+A new maintainer does not need to reconstruct the original developer's mental model from scattered comments and commit messages---they invoke the same skills that produced the codebase.
+The \texttt{dev-setup} skill configures the development environment; the \texttt{add-rule} skill encodes the complete workflow for adding a reduction.
+Open-source contribution, traditionally gated by the ability to absorb a project's implicit culture, becomes gated only by domain knowledge.
+
+\subsection{Verification Constrains Agent Output}\label{sec:verification-bridge}
+
+The natural concern is correctness: how can agent-written mathematical software be trusted?
+Our answer is that contributor-specified ground truth flows through a verification stack that constrains what agents can produce.
+
+\begin{figure}[t]
+ \centering
+ \includegraphics[width=\columnwidth]{figures/verification-funnel.pdf}
+ \caption{Verification funnel.
+ Agent-generated code is progressively filtered: the type system rejects structural errors, round-trip tests reject semantic errors, overhead validation rejects incorrect complexity claims, and agentic feature tests reject usability issues.
+ Only code matching contributor-specified ground truth survives.}
+ \label{fig:verification-funnel}
+\end{figure}
+
+A contributor supplies the creative elements that define correctness: formal problem definitions, worked examples with known solutions, and expected overhead formulas.
+These flow through four verification layers:
+\begin{enumerate}
+ \item \textbf{Type system}: the Rust compiler rejects structural errors---wrong return types, missing inverse maps, references to nonexistent problem attributes---at compile time.
+ \item \textbf{Round-trip tests}: for each reduction, a small instance is transformed forward, solved by brute force, mapped back, and verified optimal for the source. This catches the most mathematically subtle errors without per-reduction test logic.
+ \item \textbf{Overhead validation}: symbolic size expressions are evaluated against actual target sizes, catching formula errors that are type-correct but mathematically wrong.
+ \item \textbf{Agentic feature tests}: an agent reads the documentation, exercises the feature through the CLI, and judges whether results are consistent with domain knowledge---replacing the community feedback loop that niche libraries lack.
+\end{enumerate}
+
+The agent has freedom in \emph{how} to implement; verification eliminates freedom in \emph{what} the result must be.
+Agent output $\subseteq$ contributor intent.
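The overhead-validation layer can be sketched in a few lines. This is a toy illustration, not the library's API: a hypothetical forward map emits one box constraint per vertex and one conflict constraint per edge, and the declared symbolic size expression $|target| = n + m$ is evaluated on concrete instances and compared against the actual output size.

```rust
// Toy sketch of overhead validation (hypothetical; not the library's API).
// A forward map produces one constraint per vertex and one per edge.
fn forward(n: usize, edges: &[(usize, usize)]) -> Vec<String> {
    let mut constraints: Vec<String> =
        (0..n).map(|v| format!("0 <= x{v} <= 1")).collect();
    constraints.extend(edges.iter().map(|(u, v)| format!("x{u} + x{v} <= 1")));
    constraints
}

fn main() {
    // Declared symbolic overhead, evaluated on concrete sizes.
    let declared = |n: usize, m: usize| n + m;
    let edges = [(0, 1), (1, 2), (2, 0)];
    let target = forward(3, &edges);
    // A formula that is type-correct but mathematically wrong fails here.
    assert_eq!(target.len(), declared(3, edges.len()));
}
```

The check is deliberately dumb: it knows nothing about the reduction's semantics, yet it catches any drift between the claimed and actual instance sizes.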
+
+\subsection{Other Candidate Domains}
+
+NP-hard reductions are not the only bridge problem.
+Algorithm libraries, compiler optimization passes, numerical linear algebra routines, and hardware description languages share the same structure: homogeneous subtasks, formal correctness criteria, and a scale that exceeds what small teams can sustain.
+In each case, domain experts can encode the ``how'' into reusable skills while the ``what'' remains a human judgment call.
+The methodology does \emph{not} generalize to heterogeneous tasks---the staple of SWE-Bench---where each issue is structurally unique and resists skill-based decomposition.
+
+%======================================================================
+% SECTION 3: CASE STUDY --- THE REDUCTION GRAPH
+%======================================================================
+\section{Case Study: The Reduction Graph}\label{sec:graph}
+
+\subsection{What Is a Reduction?}
+
+A \emph{reduction} from problem~$A$ to problem~$B$ is a pair of functions: a \emph{forward map} that transforms any instance of~$A$ into an instance of~$B$ in polynomial time, and an \emph{inverse map} that converts any solution of the $B$-instance back into a solution of the original $A$-instance.
+If the reduction is correct, the extracted solution is optimal (or satisfying) for the original problem.
+
+For example, the complement relationship between Minimum Vertex Cover (MVC) and Maximum Independent Set (MIS) gives a simple reduction: given a graph, the vertex cover problem asks for the smallest set of vertices that touches every edge, while the independent set problem asks for the largest set of vertices with no edges between them.
+A set~$S$ is independent if and only if its complement $V \setminus S$ is a vertex cover, so the forward map is the identity (the same graph), and the inverse map complements the solution.
+This 96-line implementation connects MVC to every solver reachable from MIS.
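The reduction described above can be sketched in a few lines of Rust; the types and function names here are illustrative, not the library's actual interfaces:

```rust
// Illustrative sketch of the MVC -> MIS reduction (hypothetical types).

/// A graph as a vertex count plus an edge list.
#[derive(Clone)]
struct Graph {
    n: usize,
    edges: Vec<(usize, usize)>,
}

/// Forward map: the MIS instance is the *same* graph (identity).
fn mvc_to_mis(g: &Graph) -> Graph {
    g.clone()
}

/// Inverse map: the complement of a maximum independent set
/// is a minimum vertex cover.
fn extract_cover(g: &Graph, independent_set: &[usize]) -> Vec<usize> {
    (0..g.n).filter(|v| !independent_set.contains(v)).collect()
}

fn main() {
    // Triangle: any single vertex is a maximum independent set,
    // so the other two vertices form a minimum vertex cover.
    let g = Graph { n: 3, edges: vec![(0, 1), (1, 2), (0, 2)] };
    let mis_instance = mvc_to_mis(&g);
    let cover = extract_cover(&g, &[0]);
    assert_eq!(cover, vec![1, 2]);
    // The extracted cover touches every edge.
    assert!(mis_instance
        .edges
        .iter()
        .all(|&(u, v)| cover.contains(&u) || cover.contains(&v)));
}
```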
+
+\subsection{Graph Structure}\label{sec:graph-structure}
+
+We organize reductions into a directed graph $G = (V, E)$, where each vertex $v \in V$ represents an NP-hard problem type and each directed edge $(u, v) \in E$ represents a verified reduction from problem~$u$ to problem~$v$.
+The graph contains 27~vertices (problem types) and 56~directed edges: 45~hand-coded reduction rules plus 11~edges inferred from subtype relationships (e.g., MIS on a geometric subgraph can always be treated as MIS on a general graph, because the geometric structure is a special case).
+
+Three problems serve as ``compilation targets,'' each corresponding to a class of specialized solvers (\Cref{fig:reduction-graph}, bottom layer):
+\begin{itemize}
+ \item \textbf{MIS} (Maximum Independent Set): target for Rydberg atom arrays, which solve MIS natively on geometric graphs.
+ \item \textbf{QUBO} (Quadratic Unconstrained Binary Optimization): target for D-Wave quantum annealers.
+ \item \textbf{ILP} (Integer Linear Programming): target for commercial solvers like Gurobi and CPLEX.
+\end{itemize}
+MIS is the dominant hub with 14~incoming and 13~outgoing edges, reflecting its role as both a hardware target and an intermediary.
+ILP, with 11~incoming edges, functions as a universal algebraic target.
+A path from any problem to one of these targets provides a route to the corresponding solver.
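Routing a problem to a solver can be sketched as a breadth-first search over the edge list, stopping at the first reachable compilation target. The code below is an illustration under that assumption; the library's actual path-finding (and any cost model over reduction overheads) differs:

```rust
use std::collections::{HashMap, VecDeque};

/// BFS from `source` to the first reachable solver target,
/// returning the chain of problem types along the way.
fn route(edges: &[(&str, &str)], source: &str, targets: &[&str]) -> Option<Vec<String>> {
    let mut prev: HashMap<&str, &str> = HashMap::new();
    let mut queue = VecDeque::from([source]);
    prev.insert(source, source); // source is its own predecessor
    while let Some(u) = queue.pop_front() {
        if targets.contains(&u) {
            // Reconstruct the path source -> ... -> target.
            let mut path = vec![u.to_string()];
            let mut cur = u;
            while prev[cur] != cur {
                cur = prev[cur];
                path.push(cur.to_string());
            }
            path.reverse();
            return Some(path);
        }
        for &(a, b) in edges {
            if a == u && !prev.contains_key(b) {
                prev.insert(b, u);
                queue.push_back(b);
            }
        }
    }
    None
}

fn main() {
    // A tiny excerpt of the reduction graph.
    let edges = [
        ("Factoring", "CircuitSAT"),
        ("CircuitSAT", "SAT"),
        ("SAT", "ILP"),
        ("MVC", "MIS"),
    ];
    let path = route(&edges, "Factoring", &["ILP", "QUBO", "MIS"]).unwrap();
    assert_eq!(path, ["Factoring", "CircuitSAT", "SAT", "ILP"]);
}
```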
+
+\subsection{Emergent Compositionality}\label{sec:compositionality}
+
+The graph's most valuable property is that independently implemented reductions compose automatically to solve problems that no single reduction was designed for.
+
+Consider the problem of integer factoring---decomposing a number $N$ into its prime factors.
+The Factoring $\to$ CircuitSAT reduction (272~lines of code) constructs a Boolean circuit representing an array multiplier: two bit-vector inputs $p$ and $q$, a grid of full-adder cells computing their product, and output constraints fixing the result to~$N$.
+Any satisfying assignment to this circuit yields factors of~$N$.
+
+Separately, the CircuitSAT $\to$ ILP reduction (225~lines) linearizes a Boolean circuit into integer constraints, encoding each logic gate as a set of linear inequalities.
+
+Neither reduction was designed with the other in mind---they were implemented in separate pull requests, weeks apart.
+Yet the graph infrastructure enables automatic chaining: given a Factoring instance, the system finds the path Factoring $\to$ CircuitSAT $\to$ ILP, applies both reductions in sequence, solves the resulting integer program with a commercial solver, and extracts the factors by composing the inverse maps in reverse order.
+This works because every reduction provides type-safe solution extraction through a common interface, and the path-finding algorithm composes extractors automatically.
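The composition rule is mechanical: forward maps chain left-to-right, and solution extractors apply in reverse order. A minimal generic sketch (toy types and a hypothetical `Reduction` struct, not the library's `ReduceTo` interface):

```rust
/// Illustrative reduction: a forward instance map plus a solution
/// extractor (hypothetical shape; not the library's actual trait).
struct Reduction<I, J, S, T> {
    forward: Box<dyn Fn(&I) -> J>, // instance of I -> instance of J
    extract: Box<dyn Fn(&T) -> S>, // solution of J -> solution of I
}

/// Chain two reductions: forward maps compose left-to-right,
/// extractors compose in reverse order.
fn chain<I: 'static, J: 'static, K: 'static, S: 'static, T: 'static, U: 'static>(
    r1: Reduction<I, J, S, T>,
    r2: Reduction<J, K, T, U>,
) -> Reduction<I, K, S, U> {
    let Reduction { forward: f1, extract: e1 } = r1;
    let Reduction { forward: f2, extract: e2 } = r2;
    Reduction {
        forward: Box::new(move |i: &I| f2(&f1(i))),
        extract: Box::new(move |u: &U| e1(&e2(u))),
    }
}

fn main() {
    // Toy "problems" whose instances and solutions are plain integers.
    let r1: Reduction<i64, i64, i64, i64> = Reduction {
        forward: Box::new(|a: &i64| a * 2), // A -> B doubles the instance
        extract: Box::new(|s: &i64| s / 2), // B-solution -> A-solution
    };
    let r2: Reduction<i64, i64, i64, i64> = Reduction {
        forward: Box::new(|b: &i64| b + 1), // B -> C adds one
        extract: Box::new(|s: &i64| s - 1), // C-solution -> B-solution
    };
    let composite = chain(r1, r2);
    assert_eq!((composite.forward)(&3), 7); // 3 -> 6 -> 7
    assert_eq!((composite.extract)(&7), 3); // inverse maps in reverse: 7 -> 6 -> 3
}
```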
+
+Each new edge amplifies the graph.
+Adding a single reduction from problem~$X$ to MIS does not just connect $X$ to MIS---it connects $X$ to every solver reachable from MIS, and it connects every problem with a path to~$X$ to MIS.
+This multiplier effect---where value grows faster than the number of edges---justifies the investment in verified reduction infrastructure over ad-hoc, one-off transformations.
+
+\subsection{Verification by Round-Trip Testing}\label{sec:roundtrip}
+
+How do we know each reduction is correct?
+Every reduction admits a fully automatable correctness test that requires no problem-specific oracle.
+The test, which we call \emph{round-trip testing}, works as follows:
+
+\begin{enumerate}
+ \item Construct a small source instance (e.g., a graph with 5--8 vertices).
+ \item Apply the forward map to produce a target instance.
+ \item Solve the target by brute-force enumeration (feasible for small instances).
+ \item Apply the inverse map to extract a solution for the source.
+ \item Verify that the extracted solution is optimal for the source (also by brute force).
+\end{enumerate}
+
+This pattern is the same for all 45~reductions.
+It catches the most mathematically subtle errors---incorrect variable mappings, off-by-one indexing in literal-to-vertex transformations, forgotten negations---without requiring a human to write problem-specific test logic.
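The five steps above can be written out concretely for the MVC $\to$ MIS reduction. This is a toy harness under the stated pattern, not the library's test code:

```rust
// Round-trip test sketch for MVC -> MIS (toy; the library's harness differs).

type Edges = Vec<(usize, usize)>;

// A subset (as a bool mask) is independent if no edge has both ends in it.
fn is_independent(edges: &Edges, s: &[bool]) -> bool {
    edges.iter().all(|&(u, v)| !(s[u] && s[v]))
}

// A subset is a vertex cover if every edge has at least one end in it.
fn is_cover(edges: &Edges, s: &[bool]) -> bool {
    edges.iter().all(|&(u, v)| s[u] || s[v])
}

/// Brute-force the optimal feasible subset (max for MIS, min for MVC).
fn brute_force(n: usize, feasible: impl Fn(&[bool]) -> bool, maximize: bool) -> Vec<bool> {
    let mut best: Option<Vec<bool>> = None;
    for mask in 0..(1u32 << n) {
        let s: Vec<bool> = (0..n).map(|i| (mask >> i) & 1 == 1).collect();
        if !feasible(&s) {
            continue;
        }
        let size = s.iter().filter(|&&b| b).count();
        let better = best.as_ref().map_or(true, |b| {
            let best_size = b.iter().filter(|&&x| x).count();
            if maximize { size > best_size } else { size < best_size }
        });
        if better {
            best = Some(s);
        }
    }
    best.expect("some feasible subset exists")
}

fn main() {
    // 1. Small source instance: a 4-cycle (minimum vertex cover has size 2).
    let (n, edges): (usize, Edges) = (4, vec![(0, 1), (1, 2), (2, 3), (3, 0)]);
    // 2. Forward map is the identity, so the target is the same graph.
    // 3. Solve the target (MIS) by brute force.
    let mis = brute_force(n, |s| is_independent(&edges, s), true);
    // 4. Inverse map: complement the independent set.
    let cover: Vec<bool> = mis.iter().map(|&b| !b).collect();
    // 5. Verify the extracted cover is feasible and optimal for the source.
    let optimal = brute_force(n, |s| is_cover(&edges, s), false);
    let size = |s: &[bool]| s.iter().filter(|&&b| b).count();
    assert!(is_cover(&edges, &cover));
    assert_eq!(size(&cover), size(&optimal)); // both have size 2
}
```

Note that the test never consults a problem-specific oracle: feasibility checkers and brute-force enumeration are all it needs.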
+
+The library's type system enforces this pattern structurally: the \lstinline{ReduceTo} trait requires both a forward map and an inverse map in a single implementation.
+An agent cannot compile a forward reduction without providing solution extraction.
+See \Cref{app:architecture} for the complete type architecture.
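The structural enforcement can be illustrated with a trait of the same shape as \lstinline{ReduceTo}. The signatures below are a hedged sketch, not the library's exact API; the point is that one \lstinline{impl} block must supply both directions, so a forward map without solution extraction is a compile error:

```rust
// Sketch of a ReduceTo-shaped trait (illustrative signatures only).
trait ReduceTo<Target> {
    /// State the inverse map needs (here: just the vertex count).
    type Reduction;
    /// Forward map: build the target instance.
    fn reduce_to(&self) -> (Target, Self::Reduction);
    /// Inverse map: pull a target solution back to a source solution.
    fn extract_solution(red: &Self::Reduction, target_sol: &[usize]) -> Vec<usize>;
}

struct Mvc { n: usize, edges: Vec<(usize, usize)> }
struct Mis { n: usize, edges: Vec<(usize, usize)> }

impl ReduceTo<Mis> for Mvc {
    type Reduction = usize;
    fn reduce_to(&self) -> (Mis, usize) {
        // Identity on the graph; remember n for extraction.
        (Mis { n: self.n, edges: self.edges.clone() }, self.n)
    }
    fn extract_solution(n: &usize, independent_set: &[usize]) -> Vec<usize> {
        // Complement of an independent set is a vertex cover.
        (0..*n).filter(|v| !independent_set.contains(v)).collect()
    }
}

fn main() {
    let mvc = Mvc { n: 3, edges: vec![(0, 1), (1, 2)] };
    let (mis, state) = mvc.reduce_to();
    assert_eq!((mis.n, mis.edges.len()), (3, 2));
    // The independent set {0, 2} maps back to the cover {1}.
    assert_eq!(Mvc::extract_solution(&state, &[0, 2]), vec![1]);
}
```

Deleting `extract_solution` from the `impl` leaves the trait unimplemented, and the file no longer compiles: the inverse map is a structural obligation, not a convention.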
+
+%======================================================================
+% SECTION 4: METHODOLOGY
+%======================================================================
+\section{Methodology}\label{sec:method}
+
+The central design challenge is separating \emph{creative} decisions from \emph{routine} execution.
+Adding a reduction to the graph requires answering questions that only a domain expert can answer: which problem is industrially relevant and worth adding?
+What is the formal definition?
+Which small example would be both illustrative and sufficient to check correctness?
+What is the polynomial overhead?
+Everything else---writing Rust code, constructing tests, generating documentation, fixing CI---is routine work that follows a fixed pattern regardless of the mathematical content.
+
+Our answer is a pipeline of \emph{progressive quality gates} built around this separation.
+The creative elements are captured in structured issue templates---one for models, one for rules---whose fields correspond exactly to the questions above.
+The \texttt{propose} skill helps contributors fill in these fields interactively, asking one question at a time in mathematical language.
+Once the creative decisions are recorded, the remaining stages are fully automated.
+
+\subsection{From Contributor to Verified Code}\label{sec:pipeline-overview}
+
+The pipeline has five stages, each backed by one or more \emph{skills}---persistent, versioned markdown documents that decompose complex tasks into numbered agent-executable steps (\Cref{fig:pipeline}).
+
+\begin{figure}[t]
+ \centering
+ \includegraphics[width=\columnwidth]{figures/pipeline.pdf}
+ \caption{Contribution pipeline from proposal to merged code.
+ Orange transitions require human judgment (selecting work and approving results); blue transitions are fully automated by skills.
+ Each stage independently validates before passing to the next.}
+ \label{fig:pipeline}
+\end{figure}
+
+\textbf{Stage~1: Propose.}
+A domain expert---who need not know the codebase or even the programming language---provides the creative elements that only a human can supply.
+Two paths are available.
+The \texttt{propose} skill conducts an interactive session that elicits the motivation, the formal definition, and a small example that exercises the core structure.
+Alternatively, the contributor fills in a structured GitHub issue template directly---the template fields mirror the same creative questions.
+Either way, the output is an issue whose fields capture every creative decision needed for implementation.
+
+Crucially, the agent first analyzes the graph's topology to identify the most valuable contributions (\Cref{app:topology}).
+Three categories guide the analysis:
+\emph{orphan nodes}---problem types with no reductions to or from any other node;
+\emph{redundant rules}---direct reductions dominated by a cheaper composite path;
+and \emph{missing proof paths}---problems with no reduction chain from 3-SAT.
+The agent ranks proposals by priority: rules that connect orphans or fill proof-chain gaps are suggested first.
+Before filing the issue, the agent runs the quality checks from Stage~2 on the draft, catching problems before they reach review.
+
+\textbf{Stage~2: Validate.}
+The \texttt{check-issue} skill applies four independent tests to every proposal: \emph{usefulness} (does a reduction path already exist? is this one dominated by a cheaper composite path?), \emph{non-triviality} (is this a genuine structural transformation, not a variable substitution?), \emph{correctness} (do the cited references exist and support the claims?), and \emph{writing quality} (are all symbols defined, all template sections complete, all examples fully worked?).
+References are verified through a fallback chain: project bibliography, then web search---never hallucinated.
+Only proposals passing all four checks receive a \texttt{Good} label and proceed.
+
+\textbf{Stage~3: Implement.}
+The \texttt{issue-to-pr} skill converts a validated issue into a pull request.
+It enforces a strict \emph{one item per PR} rule: a reduction rule cannot be bundled with its source model, because the model must exist on main before the rule can be tested.
+The skill reads the issue and all its comments, researches references to resolve ambiguities, generates an implementation plan, and dispatches to the appropriate implementation skill (\texttt{add-model} or \texttt{add-rule}).
+Each implementation skill encodes a complete checklist---9~items for rules, 11~for models---that must be satisfied before any code is written.
+
+\textbf{Stage~4: Review.}
+Two parallel sub-agents, each operating in a fresh context window, review the implementation: one checks structural completeness (file registration, macro usage, test coverage), the other checks code quality and semantic correctness against the original issue specification.
+Fresh context prevents the confirmation bias that arises when an agent reviews its own work.
+CI failures trigger up to three automated fix-and-retry cycles.
+
+A third review layer---\emph{agentic feature testing}---simulates a downstream user.
+An agent reads the documentation, installs the library, exercises the new feature through the CLI, and judges whether the results are consistent with its domain knowledge.
+This replaces the community feedback loop that open-source projects normally rely on: rather than waiting for users to discover issues post-release, agent-users test the feature before merge.
+The CLI design is essential here---because agents and humans share the same command-line interface, the agent tests exactly what a real user would invoke.
+Unlike unit tests (which verify internal correctness), agentic feature tests verify that the feature is \emph{usable}: that the documentation is accurate, that the CLI output is interpretable, and that the reduction produces results consistent with the agent's knowledge of the underlying mathematics.
+
+\textbf{Stage~5: Merge.}
+The maintainer makes the final quality judgment and merges.
+This is one of only two human decisions in the pipeline; the other is moving an issue from Backlog to Ready (Stage~3).
+
+\paragraph{Two agent types.}
+The pipeline employs two distinct types of agent, distinguished by the \emph{knowledge asymmetry} between agent and human.
+
+\emph{Mentors} (4~skills) know more than the human about the project's conventions, architecture, and topology---and use that knowledge to guide humans toward high-quality contributions.
+%
+\begin{center}
+\includegraphics[height=1.8cm]{figures/role-mentor.pdf}
+\end{center}
+%
+\noindent
+The \texttt{propose} skill analyzes the reduction graph's topology---orphan nodes, missing proof paths, redundant edges---to identify the highest-value contributions, then conducts an interactive session in mathematical language, asking one question at a time: what is the problem, what is the formal definition, what does a worked example look like?
+It proposes options, analyzes trade-offs, and recommends with reasons, so that even a newcomer can produce a publication-quality issue on their first attempt.
+The \texttt{fix-issue} skill brainstorms with contributors to resolve quality problems found during validation.
+The \texttt{final-review} skill guides the maintainer through merge decisions, surfacing quality signals the maintainer might miss.
+The \texttt{dev-setup} skill onboards new developers, configuring their environment interactively.
+In each case the human learns through the interaction---the agent acts as a domain-aware tutor who transfers project knowledge to the contributor.
+
+\emph{Workers} (12~skills) know less than the human about \emph{what} should be built, but execute routine heavy-lifting that follows fixed patterns.
+%
+\begin{center}
+\includegraphics[width=\columnwidth]{figures/role-worker.pdf}
+\end{center}
+%
+\noindent
+Orchestration workers manage the pipeline headlessly: \texttt{project-pipeline} picks the highest-ranked Ready issue, creates an isolated git worktree, dispatches an implementation worker, and moves the result to the review queue; \texttt{review-pipeline} addresses code-review comments, runs agentic feature tests, and retries CI failures.
+Implementation workers produce artifacts: \texttt{add-model} and \texttt{add-rule} write code following skill checklists; \texttt{write-model-in-paper} and \texttt{write-rule-in-paper} generate paper entries with proof sketches and worked examples.
+Quality workers validate against rubrics: \texttt{check-issue} validates proposals before implementation; \texttt{review-implementation} dispatches parallel sub-agents in fresh context to review code; \texttt{fix-pr} resolves CI failures and review comments.
+\texttt{release} handles version bumps and publishing.
+The maintainer makes exactly two decisions in the entire pipeline: moving an issue from Backlog to Ready, and merging the final pull request.
+
+\subsection{Why Skills, Not Prompts or Scripts}\label{sec:skills}
+
+Traditional automation---Makefiles, CI pipelines, shell scripts---is \emph{mechanical}: every step is predetermined, and human involvement is impossible mid-execution.
+Skills are a different kind of automation: \emph{abstract}.
+A skill defines \emph{what} must happen (validate references, construct an example, write a proof sketch) without fixing \emph{how}.
+When the task is routine, the agent executes autonomously.
+When the task requires creativity or deep judgment---choosing which example best reveals correctness, deciding whether a proposed reduction is non-trivial---the skill can pause and involve a human as a resource, then resume.
+The same skill thus operates headlessly in the pipeline (\texttt{make run-pipeline}) and interactively with a contributor (\texttt{propose}), adapting its execution mode to the context.
+
+Skills also differ from per-invocation prompts in three ways that matter for sustainability.
+
+\emph{Skills are versioned.}
+They are committed to the repository and evolve through pull requests, just like code.
+A bug in a skill is fixed once and benefits all future invocations.
+A per-invocation prompt must be re-crafted each time, and improvements are lost.
+
+\emph{Skills are compositional.}
+Orchestration skills invoke implementation skills, which invoke quality gates.
+A single \texttt{project-pipeline} command triggers the full cascade---pick a task from the project board, create an isolated workspace, validate the issue, implement, test, review, document, and produce a pull request.
+The maintainer's effort scales with the number of \emph{skill types}, not the number of tasks.
+
+\emph{Skills encode domain knowledge that cannot be inferred from code.}
+The \texttt{add-rule} skill specifies that overhead expressions must use getter methods matching the source type, that examples must have a \lstinline{pub fn run()} entry point, and that the paper entry requires a proof sketch.
+An agent reading only the codebase might infer some conventions but would miss others---the skill makes all conventions explicit.
+
+The library comprises 14~skills in six categories: orchestration~(3), community contribution~(1), implementation~(2), quality gates~(4), documentation~(2), and onboarding~(2).
+
+\begin{figure}[t]
+\centering
+\includegraphics[width=\columnwidth]{figures/skill-map.pdf}
+\caption{Project knowledge architecture.
+ \texttt{CLAUDE.md} defines conventions, architecture, and commands;
+ skills in four categories encode reusable workflows that reference it.
+ Inner nodes show key source directories the skills operate on.}
+\label{fig:skill-architecture}
+\end{figure}
+
+\subsection{Correctness by Construction}\label{sec:verification}
+
+Correctness assurance comes from a seven-layer verification stack (\Cref{app:verification}).
+The key insight is that different layers catch different classes of errors, and no single layer suffices.
+
+\textbf{Layers~1--2} (type system and unit tests) catch structural errors cheaply.
+The type system prevents entire categories of mistakes: an agent cannot implement a reduction without providing solution extraction, reference nonexistent problem attributes in overhead expressions, or omit required variant declarations.
+These constraints are enforced at compile time by procedural macros that validate expression variable names against actual getter methods.
+
+\textbf{Layer~3} (round-trip tests, described in \Cref{sec:roundtrip}) is the workhorse, catching the most mathematically subtle errors---incorrect variable mappings, off-by-one indexing, forgotten negations---without per-reduction test logic.
+
+\textbf{Layer~4} (overhead validation) compares symbolic size expressions against actual target sizes.
+An agent might implement a correct reduction but declare the wrong overhead formula.
+This layer catches formula errors that are type-correct but mathematically wrong.
+
+\textbf{Layer~5} (materialized fixtures) addresses the ``lazy agent'' problem: agents that modify expected test outputs to make tests pass rather than fixing implementations.
+Ground-truth data is generated independently and committed separately; tampering produces a visible diff in a file outside the agent's normal scope.
+
+\textbf{Layers~6--7} (fresh-context review and documentation) catch errors invisible to automated tests.
+Parallel sub-agents operating in fresh context windows prevent confirmation bias.
+Proof sketches force articulation of mathematical arguments; a completeness checker ensures every graph edge is documented.
+
+The skill system ensures all seven layers are invoked for every task.
+Without skills, an agent might skip overhead validation or omit the paper entry---errors that would accumulate silently over many contributions.
+
+\subsection{Why Rust?}\label{sec:why-rust}
+
+The choice of implementation language has outsized impact in agentic workflows, because the agent's edit--compile--test loop runs hundreds of times per session.
+None of the maintainers had written Rust before this project---the language was chosen for properties that benefit agents, not developers.
+
+\emph{Explicit, actionable error messages.}
+The Rust compiler produces diagnostics that include the error location, the conflicting types or lifetimes, and often a suggested fix.
+Agents parse these messages and resolve errors without human intervention---a property we rely on throughout the pipeline.
+Languages with less structured diagnostics (e.g., C++ template errors) would require more agent reasoning per cycle.
+
+\emph{Fast feedback.}
+Incremental compilation and the built-in test harness produce results in seconds.
+A typical round-trip test compiles and runs in under 3~seconds, enabling agents to iterate rapidly.
+A high feedback rate---short cycles between edit and result---is, in our experience, the single most important factor in agent productivity.
+
+Rust's well-known strengths---memory safety eliminating entire bug categories at compile time, high performance, a 7\,MB binary, and Cargo's integrated toolchain---further reduce the surface area of errors agents must debug.
+Procedural macros deserve special mention: they enable the compile-time validation of overhead expressions and variant registrations that powers Layers~1 and~4 of the verification stack.
+
+%======================================================================
+% SECTION 5: EVIDENCE
+%======================================================================
+\section{Evidence}\label{sec:evaluation}
+
+We evaluate through development metrics from the project's history, an analysis of the automated quality gate, and case studies showing how verification layers interact.
+All development used Claude Code~\cite{Anthropic2025ClaudeCode} with Claude models (Sonnet~3.5 and Sonnet~4; the model version evolved during development).
+Skills are plain markdown documents portable to any coding agent.
+
+\subsection{Development Metrics}\label{sec:metrics}
+
+The repository contains 59~merged pull requests and 253~commits on main spanning nine weeks (January~9 to March~13, 2026), authored by four contributors.
+Session metadata across 283~agent sessions (300~MB of conversation transcripts) reveals the scale of agent involvement: 9,429~assistant messages in response to 630~user messages---a \textbf{15:1 automation amplification ratio}.
+The average session involved 5.8~user messages and 51~tool calls, with measured wall-clock time totaling 115~hours across 108~sessions with timing data.
+
+\begin{table}[t]
+\caption{Codebase growth timeline.}\label{tab:growth}
+\centering
+\small
+\begin{tabular}{@{}lcccc@{}}
+\toprule
+Date & Models & Rules & Tests & Examples \\
+\midrule
+Jan 10 (initial) & 17 & 0 & 0 & 0 \\
+Jan 26 (feature parity) & 20 & 22 & 0 & 1 \\
+Feb 15 (arch.\ redesign) & 21 & 44 & 101 & 35 \\
+Mar 13 (current) & 27 & 45 & 114 & 45 \\
+\bottomrule
+\end{tabular}
+\end{table}
+
+\begin{figure}[t]
+ \centering
+ \includegraphics[width=\columnwidth]{figures/timeline.pdf}
+ \caption{Cumulative growth of problem types and reduction rules over nine weeks.
+ Background bands mark three development phases: manual (Phase~1), basic skills (Phase~2), and full pipeline (Phase~3).
+ \label{fig:timeline}
+\end{figure}
+
+\Cref{tab:growth} and \Cref{fig:timeline} trace the growth across three phases.
+\textbf{Phase~1 (Manual, 35~PRs)}: no skills; the maintainer issued step-by-step commands and established the architecture.
+\textbf{Phase~2 (Basic skills, 9~PRs)}: initial \texttt{add-model}/\texttt{add-rule} skills reduced per-task human involvement.
+\textbf{Phase~3 (Full pipeline, 15~PRs)}: complete skill library with orchestration, quality gates, and multi-agent review.
+The current codebase comprises 54,599~lines of Rust source, 28,343~lines of tests, and 6,362~lines of examples.
+
+\paragraph{Interaction evolution.}
+Analysis of 2,196~user prompts reveals a shift from imperative to declarative interaction as skills matured.
+In Phase~1, prompts averaged 8--12 words (e.g., ``implement Satisfiability to MIS reduction'').
+By Phase~3, 30\% of prompts were 1--3 words (e.g., ``\texttt{make run-pipeline}'').
+This progression---from specifying actions to invoking skills---mirrors the classical shift from scripting to API design.
+
+\paragraph{Observability.}
+All pull requests are attributed to human GitHub accounts because the agent operates through the developer's local terminal.
+Across 1,089~commits on all branches, we count 1,510~\texttt{Co-Authored-By: Claude} trailers---more trailers than commits, because squash-merging onto main duplicates branch commits' trailers in the merged history.
+This observability gap---the difficulty of distinguishing human-authored from agent-assisted work in git metadata---is itself a finding about current agentic workflows.
+
+\subsection{Issue Quality Gate}\label{sec:quality-gate}
+
+The \texttt{check-issue} skill was stress-tested when a contributor batch-submitted 414~issues---including 251~in a single day---proposing new reductions for the graph.
+Of the 322~issues checked, only \textbf{81~(25\%) passed} all quality criteria (\Cref{tab:quality-gate}).
+
+\begin{table}[t]
+\caption{Issue quality gate results (322 checked).}\label{tab:quality-gate}
+\centering
+\small
+\begin{tabular}{@{}lr@{}}
+\toprule
+Verdict & Count (\%) \\
+\midrule
+Good & 81 (25\%) \\
+PoorWritten (incomplete specification) & 124 (39\%) \\
+Wrong (factually incorrect) & 64 (20\%) \\
+Trivial (obvious, adds no value) & 43 (13\%) \\
+Useless (no practical application) & 18 (6\%) \\
+\bottomrule
+\end{tabular}
+\end{table}
+
+The \textbf{75\% rejection rate} demonstrates the necessity of automated quality gates in agentic pipelines.
+Without \texttt{check-issue}, the pipeline would waste agent compute implementing incorrect or trivial reductions.
+The most common failure was \emph{PoorWritten}---issues lacking complete mathematical specifications, making them unimplementable even by a skilled agent.
+The \emph{Wrong} category~(20\%) included citations to non-existent papers, incorrect complexity claims, and reductions that do not actually preserve solution structure---errors that would be expensive to discover during implementation rather than at triage.
+
+\subsection{Case Studies}\label{sec:cases}
+
+\paragraph{Satisfiability $\to$ MIS (gadget construction).}
+The classical Karp reduction~\cite{karp1972} works as follows.
+Given a Boolean formula in conjunctive normal form (a conjunction of clauses, each a disjunction of literals), create one graph vertex per literal occurrence.
+Add edges within each clause (so at most one literal per clause can be selected) and between complementary literals across clauses (so a variable cannot be both true and false).
+A satisfying assignment corresponds to an independent set of size~$m$ (the number of clauses).
+The implementation spans 171~lines.
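The construction is compact enough to sketch outside the library (toy Rust with DIMACS-style literals; the function name and types are hypothetical, and overhead declarations and solution extraction are omitted):

```rust
/// Karp's SAT -> MIS gadget, sketched independently of the library's API.
/// A formula is a list of clauses; a literal is a nonzero integer
/// (DIMACS style: `3` means x3, `-3` means NOT x3).
fn sat_to_mis(clauses: &[Vec<i32>]) -> (usize, Vec<(usize, usize)>) {
    // One vertex per literal occurrence, numbered in reading order.
    let lits: Vec<(usize, i32)> = clauses
        .iter()
        .enumerate()
        .flat_map(|(c, cl)| cl.iter().map(move |&l| (c, l)))
        .collect();
    let mut edges = Vec::new();
    for i in 0..lits.len() {
        for j in i + 1..lits.len() {
            let ((ci, li), (cj, lj)) = (lits[i], lits[j]);
            // Intra-clause edge: at most one literal per clause selected.
            // Conflict edge: x and NOT x can never both be selected.
            if ci == cj || li == -lj {
                edges.push((i, j));
            }
        }
    }
    (lits.len(), edges)
}
```

Intra-clause edges force at most one pick per clause; conflict edges forbid selecting both $x$ and $\lnot x$, so an independent set of size~$m$ yields a consistent satisfying assignment.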
+
+This case illustrates how verification layers interact.
+The number of edges in the constructed graph is worst-case \emph{quadratic} in the number of literals---but an agent might assume linear overhead.
+Layer~4 (overhead validation) catches this by comparing symbolic expressions against actual target sizes.
+Layer~3 (round-trip testing) catches a different class of error: off-by-one mistakes in literal-to-vertex mapping.
+Both layers are necessary: correct indices with wrong overhead, or correct overhead with wrong indices, would each pass one layer but fail the other.
+
+\paragraph{Factoring $\to$ CircuitSAT $\to$ ILP (emergent composition).}
+As described in \Cref{sec:compositionality}, this chains two independently implemented reductions (272 + 225~lines) to factor integers via linear programming.
+The Factoring $\to$ CircuitSAT step exercises the full verification stack: the multiplier circuit involves $\Theta(mn)$ full-adder cells (where $m$ and $n$ are the bit lengths of the two factors); errors in carry propagation are caught by Layer~3, while overhead formula errors are caught by Layer~4.
+
+\paragraph{MVC $\leftrightarrow$ MIS (trivial complement).}
+At the opposite extreme, this 96-line reduction exploits the complement relationship described in \Cref{sec:graph} with identity overhead.
+The pipeline's primary value here is enforcing conventions---file naming, macro registration, documentation---rather than catching logical errors.
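The complement relationship is easy to verify exhaustively on small instances: a set $S$ is independent exactly when $V \setminus S$ touches every edge. A toy sketch (not the library's code):

```rust
// Toy check of the MVC <-> MIS complement relationship. Vertex subsets
// of an n-vertex graph are encoded as bitmasks.

fn is_independent(edges: &[(usize, usize)], s: u32) -> bool {
    // No edge has both endpoints inside S.
    edges.iter().all(|&(u, v)| s >> u & 1 == 0 || s >> v & 1 == 0)
}

fn is_cover(edges: &[(usize, usize)], s: u32) -> bool {
    // Every edge has at least one endpoint inside S.
    edges.iter().all(|&(u, v)| s >> u & 1 == 1 || s >> v & 1 == 1)
}

/// For every subset S, independence of S must coincide with coverage
/// by its complement V \ S; hence max-IS and min-VC are complements.
fn complement_law_holds(n: usize, edges: &[(usize, usize)]) -> bool {
    let full = (1u32 << n) - 1;
    (0..=full).all(|s| is_independent(edges, s) == is_cover(edges, full & !s))
}
```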
+
+%======================================================================
+% SECTION 6: RELATED WORK
+%======================================================================
+\section{Related Work}\label{sec:related}
+
+\paragraph{AI coding agents.}
+The evolution from SWE-agent~\cite{Yang2024SWEagent} and Devin~\cite{Wu2024Devin} to OpenHands~\cite{Wang2024OpenHands} and Claude Code~\cite{Anthropic2025ClaudeCode} has expanded single-task capabilities to 70--80\% on SWE-Bench~\cite{Xia2025LiveSWEagent}, but longer-horizon benchmarks reveal a capability cliff~\cite{Thai2025SWEEVO, Deng2025SWEBenchPro}.
+Roychoudhury~\cite{Roychoudhury2025AgenticAI} identifies specification inference as the central difficulty.
+Our approach structures work so each unit falls within the agent's reliable range, complementing architectural advances in agent design.
+
+\paragraph{AI code maintainability concerns.}
+A growing body of evidence suggests that unstructured AI-assisted coding creates maintenance burden rather than reducing it.
+Analysis of an LLM-generated C~compiler found excessive abstraction layering, inclusion of rarely-useful features that increase bug surface area, and absent comments precisely where domain expertise matters most~\cite{Jones2026LLMCompiler}.
+A study of 211~million changed lines found a 4$\times$ growth in code clones and a decline in refactored code from 24\% to 10\%, indicating that AI encourages copy-paste over sustainable architecture~\cite{GitClear2025CodeQuality}.
+A difference-in-differences study of 807~AI-adopting repositories found persistent increases in code complexity despite transient velocity gains~\cite{CursorAI2025SpeedCost}, and a randomized controlled trial found experienced developers were 19\% \emph{slower} with AI tools on real tasks in their own repositories~\cite{Becker2025METRProductivity}.
+Our skill-based framework is designed to address these concerns directly: skills enforce project conventions (naming, testing, documentation) at every invocation, versioned skill definitions prevent prompt drift, and seven-layer verification rejects non-conforming code before it enters the repository.
+The agent never ``free-writes'' code---it follows a structured template that has been refined through pull requests, ensuring that generated code is as maintainable as hand-written code that follows the same conventions.
+
+\paragraph{AI-discovered reductions.}
+FunSearch~\cite{RomeraParedes2023FunSearch} and AlphaEvolve~\cite{Novikov2025AlphaEvolve} discover novel algorithms and reductions through evolutionary search, including improved bounds for combinatorial problems.
+Jani\v{c}i\'{c}'s URSA~\cite{Janicic2025URSA} uses SAT-based constraint solving to verify reductions.
+Our work is complementary: we implement and verify \emph{known} reductions; discovered reductions could feed into our pipeline as new issues.
+
+\paragraph{Formal verification of generated code.}
+VeriCoding~\cite{Bursuc2025VeriCoding} reports 27--44\% success on 12,504~formal specifications.
+CLEVER~\cite{Thakur2025CLEVER} establishes hard Lean benchmarks.
+VeriBench~\cite{Miranda2025VeriBench} finds self-optimizing agents approach 90\% compilation in Lean~4.
+Mukherjee et al.~\cite{Mukherjee2025CoqPL, Mukherjee2025SynVer} demonstrate a two-LLM generate-then-verify pattern.
+Our seven-layer stack trades full formal guarantees for practical effectiveness at the scale of 45~reductions.
+
+\paragraph{Physics-inspired optimization.}
+Schuetz et al.~\cite{Schuetz2022PhysicsGNN} solve QUBO at million-variable scale via graph neural networks.
+He~\cite{He2024QuantumTSP} combines quantum annealing with GNNs for the Traveling Salesman Problem.
+These approaches assume a QUBO or Ising formulation as input---precisely the transformation that our reduction graph provides as upstream infrastructure.
+
+%======================================================================
+% SECTION 7: DISCUSSION AND CONCLUSION
+%======================================================================
+\section{Discussion and Conclusion}\label{sec:conclusion}
+
+\subsection{Limitations}
+
+\paragraph{Single case study.}
+Evidence comes from one project by one primary developer.
+Replication across independent projects and teams is needed.
+
+\paragraph{Skill engineering cost.}
+The 14~skills represent substantial upfront investment---iterative refinement across many agent sessions.
+Each new domain requires its own skill engineering effort, though the methodology itself transfers.
+
+\paragraph{Confounding factors.}
+Both skills and underlying models improved during the nine-week span.
+Temporal stratification across the three phases partially addresses this, but we cannot fully disentangle the two contributions.
+
+\paragraph{Maintainer requirement.}
+The pipeline is not fully autonomous: without human judgment at two transitions (selecting work and approving results), the system cannot determine what is worth building or whether results meet standards.
+This is by design---but limits applicability to fully autonomous scenarios.
+
+\subsection{Why Human Experts Remain Essential}
+
+The pipeline's reliance on human judgment is not a temporary limitation awaiting better models---it reflects a fundamental gap.
+LLMs do not perform genuine mathematical reasoning: performance drops by up to 65\% when irrelevant clauses are added to otherwise identical problems~\cite{Mirzadeh2025GSMSymbolic}, reasoning models exhibit accuracy collapse beyond complexity thresholds~\cite{Shojaee2025IllusionOfThinking}, and multi-step compositional reasoning suffers multiplicative error accumulation~\cite{Dziri2023FaithFate}.
+Standard transformers are bounded by constant-depth threshold circuits (TC$^0$); even chain-of-thought extends expressiveness only to polynomial-time computation~\cite{Merrill2024ExpressivePower}.
+
+The context window (200,000~tokens for Claude Opus~4.6) functions as working memory, not reasoning capacity: performance degrades 14--85\% as input length increases~\cite{Du2025ContextLengthHurts}, and LLMs cannot maintain internal state across turns~\cite{Huang2025LLMWorkingMemory}.
+On research-level mathematics, the best models score below 2\% on FrontierMath~\cite{Glazer2024FrontierMath} and 8\% on Humanity's Last Exam~\cite{Paster2025HLE}.
+
+Skills are our response: rather than encoding domain expertise in model weights (expensive retraining) or prompts (ephemeral), skills encode it in versionable documents that persist across sessions.
+Creative decisions remain with humans; routine execution is delegated to agents.
+
+\subsection{Future Work}
+
+\paragraph{Industry impact.}
+The reduction graph serves as a \emph{solver-agnostic compilation layer}.
+Adding a single edge from a hardware platform's native formulation (e.g., MIS for Rydberg atoms) routes all 27~problem types to that device.
+Each new problem added to the graph is instantly available on every connected solver.
+
+\paragraph{Scaling to 100+ problems.}
+The graph's value grows superlinearly: each new edge creates composite paths through the entire graph.
+Three directions extend this work: composing with \emph{automated discovery} via evolutionary search~\cite{Novikov2025AlphaEvolve}, supplementing round-trip tests with \emph{formal verification} (machine-checked Lean or Coq proofs), and \emph{cost-aware path selection} where the optimal solver route depends on instance scale.
+
+\subsection{Conclusion}
+
+We have introduced \emph{bridge problems}---software projects whose scale exceeds human capacity but whose homogeneous, verifiable structure makes them amenable to agent execution constrained by systematic verification.
+NP-hard problem reductions are the first convincing example: over nine weeks, a single maintainer and AI agents produced a verified reduction graph connecting 27~problem types to specialized solvers.
+
+The core insight is that three structural barriers---convention drift, effort exhaustion, and knowledge discontinuity---make bridge problems infeasible for human teams, while verification ensures that agents cannot deviate from contributor-specified ground truth.
+Skills encode workflow knowledge as reusable, versionable documents, lowering the contribution barrier from ``knows the programming language'' to ``knows the mathematics.''
+As the graph scales toward 100+ problem types, it evolves from a library into a reduction compiler---a vision that is infeasible without agentic execution.
+
+\bibliographystyle{IEEEtran}
+\bibliography{references}
+
+%======================================================================
+% APPENDICES
+%======================================================================
+\appendices
+
+\section{System Architecture}\label{app:architecture}
+
+The library's type system reduces the space of possible agent errors by making incorrect code fail to compile.
+This appendix describes four mechanisms---the \texttt{Problem} trait, the \texttt{ReduceTo} trait, the \lstinline{#[reduction(overhead)]} macro, and the \lstinline{declare_variants!} registry---that enforce correctness structurally (\Cref{fig:architecture}).
+
+\begin{figure}[t]
+ \centering
+ \includegraphics[width=\columnwidth]{figures/architecture.pdf}
+ \caption{Trait hierarchy and compile-time validation.
+ The \texttt{Problem} trait defines a universal evaluation interface; \texttt{ReduceTo} requires both forward and inverse maps; procedural macros validate overhead expressions and variant registrations at compile time.}
+ \label{fig:architecture}
+\end{figure}
+
+\paragraph{The Problem trait.}
+Every problem type implements \texttt{Problem}, which requires: a constant name, an associated metric type (\texttt{SolutionSize} for optimization problems, \texttt{bool} for decision problems), a method returning the configuration space dimensions, an \lstinline{evaluate()} method that scores any configuration, and variant metadata for type-parameter tracking.
+The key member is \lstinline{evaluate()}: because every problem maps configurations to metrics through this uniform interface, a brute-force solver can enumerate the space and select the best solution, enabling the round-trip testing described in \Cref{sec:roundtrip} without problem-specific oracles.
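A stripped-down sketch conveys the idea (hypothetical trait and Max-Cut type, not the library's actual definitions, which carry richer metric and variant structure):

```rust
/// Stripped-down sketch of the uniform evaluation interface
/// (hypothetical; the real `Problem` trait carries more structure).
trait Problem {
    /// Number of binary decision variables.
    fn num_vars(&self) -> usize;
    /// Score an arbitrary configuration; higher is better.
    fn evaluate(&self, config: &[bool]) -> i64;
}

/// Because every problem exposes `evaluate`, one generic brute-force
/// solver serves as the test oracle for all of them.
fn brute_force<P: Problem>(p: &P) -> (Vec<bool>, i64) {
    let n = p.num_vars();
    let mut best: Option<(Vec<bool>, i64)> = None;
    for mask in 0u64..1 << n {
        let cfg: Vec<bool> = (0..n).map(|i| mask >> i & 1 == 1).collect();
        let score = p.evaluate(&cfg);
        if best.as_ref().map_or(true, |b| score > b.1) {
            best = Some((cfg, score));
        }
    }
    best.expect("at least one configuration")
}

/// Example instance: Max-Cut on a weighted edge list.
struct MaxCut {
    n: usize,
    edges: Vec<(usize, usize, i64)>,
}

impl Problem for MaxCut {
    fn num_vars(&self) -> usize {
        self.n
    }
    fn evaluate(&self, config: &[bool]) -> i64 {
        // Sum the weights of edges whose endpoints lie on opposite sides.
        self.edges
            .iter()
            .filter(|&&(u, v, _)| config[u] != config[v])
            .map(|&(_, _, w)| w)
            .sum()
    }
}
```

The oracle never needs to know it is solving Max-Cut; any type implementing the trait is solvable, which is what makes round-trip testing problem-agnostic.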
+
+\paragraph{The ReduceTo trait.}
+The generic \lstinline{ReduceTo} trait requires a \lstinline{reduce_to()} method returning a \texttt{ReductionResult} that bundles \lstinline{target_problem()} and \lstinline{extract_solution()}.
+By requiring both forward and inverse in a single implementation, the type system ensures every reduction is round-trip capable---an agent cannot compile a forward reduction without providing the extraction logic.
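The contract can be sketched as follows (hypothetical, heavily simplified types; the real traits are generic over problems, metrics, and variants):

```rust
// Hypothetical simplification of the bundled contract: `reduce_to`
// returns a result carrying BOTH the target instance and the inverse
// map, so a forward-only reduction cannot compile.
trait ReductionResult {
    type Target;
    type SourceSol;
    type TargetSol;
    fn target_problem(&self) -> &Self::Target;
    fn extract_solution(&self, sol: &Self::TargetSol) -> Self::SourceSol;
}

trait ReduceTo<R: ReductionResult> {
    fn reduce_to(&self) -> R;
}

// Toy instance: "find the argmax of a list" reduces to "find the
// argmin of the negated list", with an identity solution map.
struct MaxOf(Vec<i64>);
struct MinOf(Vec<i64>);

struct MaxToMin {
    target: MinOf,
}

impl ReductionResult for MaxToMin {
    type Target = MinOf;
    type SourceSol = usize; // index of the chosen element
    type TargetSol = usize;
    fn target_problem(&self) -> &MinOf {
        &self.target
    }
    fn extract_solution(&self, sol: &usize) -> usize {
        *sol // identity: same index in source and target
    }
}

impl ReduceTo<MaxToMin> for MaxOf {
    fn reduce_to(&self) -> MaxToMin {
        MaxToMin {
            target: MinOf(self.0.iter().map(|&x| -x).collect()),
        }
    }
}
```

Omitting \lstinline{extract_solution} from the toy \lstinline{impl} is a compile error, mirroring the guarantee the paragraph describes.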
+
+\paragraph{Compile-time overhead validation.}
+The \texttt{\#[reduction(overhead = \{...\})]} procedural macro attaches symbolic size expressions to each reduction.
+These expressions are parsed at compile time; variable names are validated against getter methods on the source type, so a typo (e.g., referencing \lstinline{num_vertex} when the method is \lstinline{num_vertices}) causes a compile error rather than a silent bug.
+
+\paragraph{Variant registry.}
+Problem types are parameterized by graph type (SimpleGraph, KingsSubgraph, etc.) and weight type (unit weight, \texttt{i32}, \texttt{f64}).
+The \lstinline{declare_variants!} macro registers concrete instantiations with their best-known complexity, enabling automated graph export, documentation completeness checking, and redundancy analysis.
+
+\section{Verification Stack Details}\label{app:verification}
+
+\Cref{tab:verification} summarizes the seven layers, each targeting a distinct class of error with an example drawn from actual agent failures during development.
+
+\begin{table*}[t]
+ \centering
+ \caption{Seven-layer verification stack. Each layer catches a distinct class of error that the layers below it miss.}
+ \label{tab:verification}
+  \begin{tabular}{@{}cll@{}}
+ \toprule
+ Layer & Mechanism & Example Error Caught \\
+ \midrule
+ 1 & Rust type system & Agent returns \texttt{bool} instead of \texttt{SolutionSize} \\
+ 2 & Unit tests & Agent evaluates Max-Cut objective with wrong sign \\
+ 3 & Closed-loop (round-trip) tests & Satisfiability$\to$MIS maps clause variables to wrong vertex indices \\
+ 4 & Overhead validation & Agent writes \texttt{num\_edges = num\_clauses} instead of \texttt{3 * num\_clauses} \\
+ 5 & Materialized fixtures & Agent modifies expected QUBO matrix to make test pass \\
+ 6 & Agentic review & Missing \texttt{declare\_variants!} macro, wrong file naming convention \\
+ 7 & Documentation review & Proof assumes connected graph but problem definition allows disconnected \\
+ \bottomrule
+ \end{tabular}
+\end{table*}
+
+\begin{figure}[t]
+ \centering
+ \includegraphics[width=\columnwidth]{figures/verification-pyramid.pdf}
+ \caption{Seven-layer verification stack. Lower layers (blue) are fully automated and fast; upper layers (gold) involve human-readable arguments and fresh-context review.}
+ \label{fig:verification}
+\end{figure}
+
+\paragraph{Layer 1: Type system.}
+The \texttt{Problem} trait's associated type forces every problem to declare whether it is an optimization or decision problem.
+The overhead macro validates variable names against source-type methods.
+Errors are caught in seconds, before any test runs.
+
+\paragraph{Layer 2: Unit tests.}
+Each problem includes tests verifying \lstinline{evaluate()} on small, hand-crafted instances.
+Serialization round-trip tests catch graph and weight encoding issues.
+
+\paragraph{Layer 3: Closed-loop (round-trip) tests.}
+The pattern described in \Cref{sec:roundtrip} exercises the full mathematical content of each reduction.
+This catches index errors, sign errors, and mapping bugs---the largest share of errors that survive type checking.
+
+\paragraph{Layer 4: Overhead validation.}
+After constructing the target instance, the test harness evaluates the symbolic overhead expressions and compares the predicted sizes against the actual target sizes.
+This is particularly effective for non-obvious size relationships (e.g., quadratic edge counts in intersection graphs that an agent might assume are linear).
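What the macro-generated check does can be hand-rolled for illustration (toy 3-SAT$\to$MIS sizes with hypothetical function names; conflict edges, which depend on the formula, are omitted):

```rust
// Hand-rolled sketch of Layer 4 for a toy 3-SAT -> MIS construction.
// Declared overhead: one vertex per literal occurrence (3 per clause)
// and C(3,2) = 3 intra-clause edges per clause.
fn predicted_vertices(num_clauses: usize) -> usize {
    3 * num_clauses
}
fn predicted_clause_edges(num_clauses: usize) -> usize {
    3 * num_clauses
}

// Build the intra-clause part of the gadget for a 3-CNF formula.
fn build(clauses: &[[i32; 3]]) -> (usize, Vec<(usize, usize)>) {
    let mut edges = Vec::new();
    for c in 0..clauses.len() {
        let base = 3 * c; // vertices base..base+3 host clause c's literals
        for i in 0..3 {
            for j in i + 1..3 {
                edges.push((base + i, base + j));
            }
        }
    }
    (3 * clauses.len(), edges)
}

// Layer-4 style check: declared sizes must match the built instance.
fn overhead_ok(clauses: &[[i32; 3]]) -> bool {
    let m = clauses.len();
    let (nv, edges) = build(clauses);
    nv == predicted_vertices(m) && edges.len() == predicted_clause_edges(m)
}
```

A reduction that is structurally correct but declares, say, \texttt{num\_edges = num\_clauses} fails this comparison on the first non-empty instance.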
+
+\paragraph{Layer 5: Materialized fixtures.}
+JSON ground-truth files in \texttt{tests/data/} are committed separately from implementations.
+If an agent modifies a ground-truth file to make a test pass, the change appears as a visible diff in a file outside the agent's normal scope, transforming a subtle correctness violation into an obvious process violation.
+
+\paragraph{Layer 6: Agentic review.}
+Two parallel sub-agents---one checking structural completeness, one checking code quality---operate in fresh context windows.
+Fresh context prevents the confirmation bias that arises when an agent reviews its own work within the same session.
+
+\paragraph{Layer 7: Documentation and visual review.}
+Every reduction has an entry in the accompanying paper with a proof sketch and a worked example.
+The example is not manually drawn: it is generated by the same code that the round-trip test (Layer~3) executes.
+The contributor specifies the source instance in the issue; the implementation produces JSON containing the source, target, overhead expressions, and extracted solutions; the paper renders this JSON as a visual diagram.
+Contributors can inspect the paper to verify that the reduction matches their mathematical intent---a visual check that complements the automated verification layers below.
+A completeness checker flags undocumented graph elements, ensuring every edge in the graph has a corresponding proof sketch.
+
+\paragraph{The lazy agent problem.}
+We observed agents modifying expected test outputs rather than fixing implementations---a rational strategy from the agent's perspective (shortest path to passing tests), but a correctness violation.
+Layer~5's independent fixtures and Layer~6's fresh-context review are the primary defenses against this failure mode.
+
+\section{Graph Topology Analysis}\label{app:topology}
+
+The \texttt{topology-sanity-check} skill detects three categories of graph quality issues (\Cref{fig:topology}).
+
+\begin{figure}[t]
+ \centering
+ \includegraphics[width=\columnwidth]{figures/topology-issues.pdf}
+ \caption{Three graph topology issues.
+ \textbf{(a)}~An orphan node has no edges and cannot reach any solver.
+ \textbf{(b)}~A direct reduction is redundant when a composite path through~$B$ has equal or lower overhead.
+ \textbf{(c)}~Problems without a path from 3-SAT lack a machine-verifiable NP-hardness proof in the graph.}
+ \label{fig:topology}
+\end{figure}
+
+\emph{Orphan nodes} contribute nothing to the graph's routing capability.
+\emph{Redundant rules} waste implementation effort when a cheaper composite path exists.
+\emph{Missing proof paths} indicate problems whose NP-hardness is not machine-verifiable within the graph.
+The agent uses these categories to rank proposals by priority: rules that connect orphans or fill proof-chain gaps are suggested first.
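+
+Orphan detection, for example, reduces to a degree check on the reduction graph (sketch; names illustrative):
+\begin{lstlisting}
+// A problem node with no incident reduction edges is an
+// orphan: no path can route it to any solver.
+let orphans: Vec<_> = graph.problems()
+    .filter(|p| graph.degree(p) == 0)
+    .collect();
+\end{lstlisting}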
+
+\section{Ablation Study Design}\label{app:ablation}
+
+To isolate the effect of skills on development outcomes, we design a controlled comparison on identical tasks.
+
+\paragraph{Setup.}
+Select 5--10 reductions spanning the complexity spectrum---from complement relationships (MVC $\to$ MIS, 96~lines) through gadget constructions (Satisfiability $\to$ MIS, 171~lines) to circuit encodings (Factoring $\to$ CircuitSAT, 272~lines).
+Prepare identical issues for two configurations:
+(1)~\emph{Skill-based}: the full pipeline including issue validation, implementation skills, multi-agent review, and CI fixing.
+(2)~\emph{No-skill baseline}: the same agent, same codebase, and same project instructions, but no skill files---the agent must infer the workflow from context.
+
+\paragraph{Metrics.}
+(1)~First-attempt CI pass rate; (2)~review rounds before merge readiness; (3)~correctness (all round-trip tests pass); (4)~convention adherence (file naming, macro usage, documentation completeness).
+
+\paragraph{Expected outcomes.}
+Skills should excel on convention adherence (encoding project-specific patterns that are not inferrable from code alone) and first-attempt CI pass rate (the verification stack catches errors before the pull request is created).
+The baseline agent likely produces functionally correct code that fails CI due to missing macros, incorrect overhead declarations, or misplaced test files.
+
+This ablation has not yet been executed.
+We present the design as a replicable protocol for evaluating skill-based methodologies.
+
+\end{document}
diff --git a/docs/paper/arxiv/plan-rewrite.md b/docs/paper/arxiv/plan-rewrite.md
new file mode 100644
index 00000000..55fd0ece
--- /dev/null
+++ b/docs/paper/arxiv/plan-rewrite.md
@@ -0,0 +1,85 @@
+# Paper Rewrite Plan
+
+**Spec:** `docs/paper/arxiv/paper-redesign-spec.md`
+**Target:** `docs/paper/arxiv/paper.tex`
+
+## Steps
+
+### Step 1: Rewrite Abstract
+Reframe around bridge problem concept. New arc: bridge problems exist (too large for humans) → agents can build them (verification constrains correctness) → NP-hard reductions as first example → evidence (27 types, 45 rules, 9 weeks vs 4 years).
+
+### Step 2: Rewrite Section 1 (Introduction)
+- Keep familiar opening (airlines, chips, logistics)
+- Introduce reduction graph idea
+- **New:** Introduce "bridge problem" claim — this software is too large for humans
+- Preview 3 barriers
+- Mention verification solves the correctness concern
+- Reference Fig 1 (Scaling Wall — to be created later)
+- Contributions list
+
+### Step 3: Write Section 2 (Bridge Problems) — NEW
+- Formal definition of bridge problems
+- Three barriers with evidence:
+ - Convention drift
+ - Effort exhaustion (merged with testing frequency)
+ - Knowledge discontinuity
+- Verification constrains agent output (funnel concept)
+- Reference Fig 2 (Verification Funnel — to be created later)
+- Other candidate domains
+
+### Step 4: Rewrite Section 3 (Case Study: Reduction Graph)
+- Move existing graph description content here
+- Keep: what is a reduction, graph structure, emergent compositionality
+- **New:** Frame as case study illustrating bridge problem concept
+- Reference Fig 3 (Reduction Graph — to be redesigned with solver-reachability coloring)
+- Note: figure redesign is a separate step
+
+### Step 5: Rewrite Section 4 (Methodology)
+- Largely keep existing methodology content
+- Reframe skills as "how agents break through the 3 barriers"
+- Keep pipeline figure (Fig 4)
+- Keep verification stack description
+
+### Step 6: Rewrite Section 5 (Evidence)
+- Reference Fig 5 (Development Timeline — to be created later)
+- Development metrics
+- Quality gate analysis
+- **New:** Barrier-by-barrier evidence structure
+- Julia predecessor comparison
+
+### Step 7: Rewrite Section 6 (Discussion)
+- Keep limitations
+- Keep "why human experts remain essential" (shortened)
+- Move "Scale Beyond Human Capacity" content — already covered in Sec 2
+- Move "Barrier-Free Community Contribution" — fold into Sec 2
+- Future work
+
+### Step 8: Clean up appendices
+- Move topology issues figure to appendix
+- Keep architecture + verification pyramid in appendix
+- Remove three roles figure reference
+
+### Step 9: Create Fig 1 (Scaling Wall)
+New Typst/CeTZ figure in `figures/scaling-wall.typ`
+
+### Step 10: Create Fig 2 (Verification Funnel)
+New Typst/CeTZ figure in `figures/verification-funnel.typ`
+
+### Step 11: Redesign Fig 3 (Reduction Graph with solver coloring)
+Modify `figures/reduction-graph.typ` — add color-coded edges by solver reachability
+
+### Step 12: Create Fig 5 (Development Timeline)
+New Typst/CeTZ-plot or Python-generated figure in `figures/timeline.typ`
+
+### Step 13: Rewrite abstract (final pass)
+After all sections are stable, do a final pass on the abstract to ensure it matches.
+
+## Dependencies
+- Steps 1-8 (text) can proceed before figures
+- Steps 9-12 (figures) are independent of each other
+- Step 13 depends on all prior steps
+
+## Notes
+- Keep total under 12 pages (conference format)
+- Use existing writing-guidelines.md principles
+- Fix reviewer issues: timeline consistency, author count, LLM model identification
diff --git a/docs/paper/arxiv/plan.md b/docs/paper/arxiv/plan.md
new file mode 100644
index 00000000..19400005
--- /dev/null
+++ b/docs/paper/arxiv/plan.md
@@ -0,0 +1,29 @@
+# Agentic Coding to Bridge Computationally Hard Problems
+
+## Abstract
+
+We show how to use AI agents to create a correctness proof package for a reduction rule, a step towards reliable and scalable agentic coding.
+
+## Roles
+
+- AI agents: coding implementation, agentic testing, documentation
+- Maintainers: skill management, key decision making
+- Contributors: issue creation, the creative parts
+- Users: the ultimate test
+
+## Skills
+
+1. Validate issues
+2. Validate implementation
+
+## Testing
+
+Lower the validation barrier by building advanced testing tools.
+
+```
+ documentation -> human verification
+ |
+issue -> rust code -> unit tests & round-trip tests
+ |
+ cli -> agentic-tests
+```
\ No newline at end of file
diff --git a/docs/paper/arxiv/references.bib b/docs/paper/arxiv/references.bib
new file mode 100644
index 00000000..f3a6592a
--- /dev/null
+++ b/docs/paper/arxiv/references.bib
@@ -0,0 +1,431 @@
+% Survey: Agentic Coding and Problem Reduction Rules
+% Generated: 2026-03-12
+% Papers: 22
+
+% ============================================================
+% Theme A: AI Coding Agents — Architectures and Benchmarks
+% ============================================================
+
+@inproceedings{Yang2024SWEagent,
+ author = {John Yang and Carlos E. Jimenez and Alexander Wettig and Kilian Adriano Lieret and Shunyu Yao and Karthik Narasimhan and Ofir Press},
+ title = {{SWE}-agent: Agent-Computer Interfaces Enable Automated Software Engineering},
+ booktitle = {Neural Information Processing Systems},
+ year = {2024},
+ doi = {10.48550/arXiv.2405.15793},
+ abstract = {Language model (LM) agents are increasingly being used to automate complicated tasks in digital environments. Just as humans benefit from powerful software applications, such as integrated development environments, for complex tasks like software engineering, we posit that LM agents represent a new category of end users with their own needs and abilities, and would benefit from specially-built interfaces to the software they use. We investigate how interface design affects the performance of language model agents. As a result of this exploration, we introduce SWE-agent: a system that facilitates LM agents to autonomously use computers to solve software engineering tasks. SWE-agent's custom agent-computer interface (ACI) significantly enhances an agent's ability to create and edit code files, navigate entire repositories, and execute tests and other programs. We evaluate SWE-agent on SWE-bench and HumanEvalFix, achieving state-of-the-art performance on both with a pass@1 rate of 12.5\% and 87.7\%, respectively, far exceeding the previous state-of-the-art achieved with non-interactive LMs. Finally, we provide insight on how the design of the ACI can impact agents' behavior and performance.},
+}
+
+@inproceedings{Wang2024OpenHands,
+ author = {Xingyao Wang and Boxuan Li and Yufan Song and Frank F. Xu and Xiangru Tang and Mingchen Zhuge and Jiayi Pan and Yueqi Song and Bowen Li and Jaskirat Singh and Hoang H. Tran and Fuqiang Li and Ren Ma and Mingzhang Zheng and Bill Qian and Yanjun Shao and Niklas Muennighoff and Yizhe Zhang and Binyuan Hui and Junyang Lin and Robert Brennan and Hao Peng and Heng Ji and Graham Neubig},
+ title = {{OpenHands}: An Open Platform for {AI} Software Developers as Generalist Agents},
+ booktitle = {International Conference on Learning Representations},
+ year = {2024},
+ url = {https://arxiv.org/abs/2407.16741},
+ abstract = {Software is one of the most powerful tools that we humans have at our disposal; it allows a skilled programmer to interact with the world in complex and profound ways. At the same time, thanks to improvements in large language models (LLMs), there has also been a rapid development in AI agents that interact with and affect change in their surrounding environments. In this paper, we introduce OpenHands (f.k.a. OpenDevin), a platform for the development of powerful and flexible AI agents that interact with the world in similar ways to those of a human developer: by writing code, interacting with a command line, and browsing the web. We describe how the platform allows for the implementation of new agents, safe interaction with sandboxed environments for code execution, coordination between multiple agents, and incorporation of evaluation benchmarks. Based on our currently incorporated benchmarks, we perform an evaluation of agents over 15 challenging tasks, including software engineering (e.g., SWE-BENCH) and web browsing (e.g., WEBARENA), among others. Released under the permissive MIT license, OpenHands is a community project spanning academia and industry with more than 2.1K contributions from over 188 contributors.},
+}
+
+@article{Wang2025OpenHandsSDK,
+ author = {Xingyao Wang and Simon Rosenberg and Juan Michelini and Calvin Smith and Hoang H. Tran and Engel Nyst and Rohit Malhotra and Xuhui Zhou and Valerie Chen and Robert Brennan and Graham Neubig},
+ title = {The {OpenHands} Software Agent {SDK}: A Composable and Extensible Foundation for Production Agents},
+ journal = {ArXiv},
+ volume = {abs/2511.03690},
+ year = {2025},
+ doi = {10.48550/arXiv.2511.03690},
+ abstract = {Agents are now used widely in the process of software development, but building production-ready software engineering agents is a complex task. Deploying software agents effectively requires flexibility in implementation and experimentation, reliable and secure execution, and interfaces for users to interact with agents. In this paper, we present the OpenHands Software Agent SDK, a toolkit for implementing software development agents that satisfy these desiderata. This toolkit is a complete architectural redesign of the agent components of the popular OpenHands framework for software development agents, which has 64k+ GitHub stars. To achieve flexibility, we design a simple interface for implementing agents that requires only a few lines of code in the default case, but is easily extensible to more complex, full-featured agents with features such as custom tools, memory management, and more. For security and reliability, it delivers seamless local-to-remote execution portability, integrated REST/WebSocket services. For interaction with human users, it can connect directly to a variety of interfaces, such as visual workspaces (VS Code, VNC, browser), command-line interfaces, and APIs. Compared with existing SDKs from OpenAI, Claude, and Google, OpenHands uniquely integrates native sandboxed execution, lifecycle control, model-agnostic multi-LLM routing, and built-in security analysis. Empirical results on SWE-Bench Verified and GAIA benchmarks demonstrate strong performance. Put together, these elements allow the OpenHands Software Agent SDK to provide a practical foundation for prototyping, unlocking new classes of custom applications, and reliably deploying agents at scale.},
+}
+
+@article{Thai2025SWEEVO,
+  author = {Minh V. T. Thai and Tue Le and D{\~u}ng Nguy{\~{\^e}}n M{\d{a}}nh and Huy Phan Nhat and Nghi D. Q. Bui},
+ title = {{SWE-EVO}: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios},
+ journal = {ArXiv},
+ volume = {abs/2512.18470},
+ year = {2025},
+ doi = {10.48550/arXiv.2512.18470},
+ abstract = {Existing benchmarks for AI coding agents focus on isolated, single-issue tasks such as fixing a bug or implementing a small feature. However, real-world software engineering is fundamentally a long-horizon endeavor: developers must interpret high-level requirements, plan coordinated changes across many files, and evolve codebases over multiple iterations while preserving existing functionality. We introduce SWE-EVO, a benchmark that evaluates agents on this long-horizon software evolution challenge. Constructed from release notes and version histories of seven mature open-source Python projects, SWE-EVO comprises 48 evolution tasks that require agents to implement multi-step modifications spanning an average of 21 files, validated against comprehensive test suites averaging 874 tests per instance. Experiments with state-of-the-art models reveal a striking capability gap: even GPT-5 with OpenHands achieves only a 21 percent resolution rate on SWE-EVO, compared to 65 percent on the single-issue SWE-Bench Verified. This demonstrates that current agents struggle with sustained, multi-file reasoning. We also propose Fix Rate, a fine-grained metric that captures partial progress toward solving these complex, long-horizon tasks.},
+}
+
+@article{Deng2025SWEBenchPro,
+ title = {{SWE-Bench Pro}: Can {AI} Agents Solve Long-Horizon Software Engineering Tasks?},
+ author = {Xiang Deng and Jeff Da and Edwin Pan and Yannis Y. He and Charles Ide and Kanak Garg and Niklas Lauffer and Andrew Park and Chetan Rane and Karmini Sampath and Maya Krishnan and Srivatsa R. Kundurthy and Sean M. Hendryx and Zifan Wang and Chen Bo Calvin Zhang and Noah Jacobson and Bing Liu and Brad Kenstler},
+ year = {2025},
+ journal = {arXiv preprint arXiv:2509.16941},
+ doi = {10.48550/arXiv.2509.16941},
+ url = {https://openreview.net/forum?id=9R2iUHhVfr},
+ note = {Under review at ICLR 2026},
+ abstract = {We introduce SWE-Bench Pro, a substantially more challenging benchmark that builds upon the best practices of SWE-Bench, but is explicitly designed to capture realistic, complex, enterprise-level problems beyond the scope of SWE-Bench. The benchmark comprises 1,865 problems from 41 repositories, split into public, held-out, and commercial sets. It features long-horizon tasks that may require hours to days for a professional software engineer to complete, often involving patches across multiple files and substantial code modifications. SWE-Bench Pro provides a contamination-resistant testbed that more faithfully captures the complexity and diversity of real-world software development, advancing the pursuit of truly autonomous software engineering agents at a professional level.},
+}
+
+@article{Xia2025LiveSWEagent,
+ author = {Chun Xia and Zhe Wang and Yan Yang and Yuxiang Wei and Ling-kai Zhang},
+ title = {{Live-SWE-agent}: Can Software Engineering Agents Self-Evolve on the Fly?},
+ journal = {ArXiv},
+ volume = {abs/2511.13646},
+ year = {2025},
+ doi = {10.48550/arXiv.2511.13646},
+ abstract = {Large Language Models (LLMs) are reshaping almost all industries, including software engineering. In recent years, a number of LLM agents have been proposed to solve real-world software problems. Such software agents are typically equipped with a suite of coding tools and can autonomously decide the next actions to form complete trajectories to solve end-to-end software tasks. While promising, they typically require dedicated design and may still be suboptimal, since it can be extremely challenging and costly to exhaust the entire agent scaffold design space. Recognizing that software agents are inherently software themselves that can be further refined/modified, researchers have proposed a number of self-improving software agents recently, including the Darwin-Goedel Machine (DGM). Meanwhile, such self-improving agents require costly offline training on specific benchmarks and may not generalize well across different LLMs or benchmarks. In this paper, we propose Live-SWE-agent, the first live software agent that can autonomously and continuously evolve itself on-the-fly during runtime when solving real-world software problems. More specifically, Live-SWE-agent starts with the most basic agent scaffold with only access to bash tools (e.g., mini-SWE-agent), and autonomously evolves its own scaffold implementation while solving real-world software problems. Our evaluation on the widely studied SWE-bench Verified benchmark shows that LIVE-SWE-AGENT can achieve an impressive solve rate of 77.4\% without test-time scaling, outperforming all existing software agents, including the best proprietary solution. Moreover, Live-SWE-agent outperforms state-of-the-art manually crafted software agents on the recent SWE-Bench Pro benchmark, achieving the best-known solve rate of 45.8\%.},
+}
+
+@misc{Anthropic2025ClaudeCode,
+ title = {Claude Code},
+ author = {{Anthropic}},
+ year = {2025},
+ url = {https://github.com/anthropics/claude-code},
+ howpublished = {\url{https://github.com/anthropics/claude-code}},
+ note = {Agentic coding tool that lives in the terminal, understands codebases, and helps developers code faster through natural language commands},
+}
+
+@misc{Wu2024Devin,
+ title = {Introducing {Devin}, the First {AI} Software Engineer},
+ author = {Scott Wu},
+ year = {2024},
+ month = mar,
+ url = {https://cognition.ai/blog/introducing-devin},
+ howpublished = {Cognition AI Blog},
+ note = {Devin is a fully autonomous AI software engineering agent with access to shell, code editor, and browser in a sandboxed environment. On SWE-bench, Devin correctly resolves 13.86\% of issues end-to-end.},
+}
+
+@article{Roychoudhury2025AgenticAI,
+ author = {Abhik Roychoudhury},
+ title = {Agentic {AI} for Software: Thoughts from Software Engineering Community},
+ journal = {ArXiv},
+ volume = {abs/2508.17343},
+ year = {2025},
+ doi = {10.48550/arXiv.2508.17343},
+ abstract = {AI agents have recently shown significant promise in software engineering. Much public attention has been transfixed on the topic of code generation from Large Language Models (LLMs) via a prompt. However, software engineering is much more than programming, and AI agents go far beyond instructions given by a prompt. At the code level, common software tasks include code generation, testing, and program repair. Design level software tasks may include architecture exploration, requirements understanding, and requirements enforcement at the code level. Each of these software tasks involves micro-decisions which can be taken autonomously by an AI agent, aided by program analysis tools. This creates the vision of an AI software engineer, where the AI agent can be seen as a member of a development team. Conceptually, the key to successfully developing trustworthy agentic AI-based software workflows will be to resolve the core difficulty in software engineering --- the deciphering and clarification of developer intent. Specification inference, or deciphering the intent, thus lies at the heart of many software tasks, including software maintenance and program repair. A successful deployment of agentic technology into software engineering would involve making conceptual progress in such intent inference via agents. Trusting the AI agent becomes a key aspect, as software engineering becomes more automated. Higher automation also leads to higher volume of code being automatically generated, and then integrated into code-bases. Thus to deal with this explosion, an emerging direction is AI-based verification and validation (V\&V) of AI generated code. We posit that agentic software workflows in future will include such AI-based V\&V.},
+}
+
+@techreport{Anthropic2026AgenticCoding,
+ title = {2026 Agentic Coding Trends Report: How Coding Agents Are Reshaping Software Development},
+ author = {{Anthropic}},
+ year = {2026},
+ month = jan,
+ institution = {Anthropic},
+ url = {https://resources.anthropic.com/hubfs/2026\%20Agentic\%20Coding\%20Trends\%20Report.pdf},
+ abstract = {Industry report identifying eight trends across foundation, capability, and impact categories that are reshaping software development. Key findings include that developers now use AI in 60\% of their work while maintaining active oversight on 80--100\% of delegated tasks. The report covers shifting engineering roles, multi-agent coordination, human-AI collaboration patterns, and scaling agentic coding beyond engineering teams.},
+}
+
+% ============================================================
+% Theme C: AI-Assisted Discovery of Reductions & Complexity
+% ============================================================
+
+@article{Nagda2025ReinforcedGeneration,
+ author = {Ansh Nagda and Prabhakar Raghavan and Abhradeep Thakurta},
+ title = {Reinforced Generation of Combinatorial Structures: Hardness of Approximation},
+ year = {2025},
+ url = {https://arxiv.org/abs/2509.18057},
+ abstract = {Can AI based methods help us make advances in complexity theory? We provide evidence towards answering this in the affirmative, using AlphaEvolve (an LLM code mutation agent) to obtain new results in three settings: a) We improve a recent result of Kunisky and Yu to obtain near-optimal upper and (conditional) lower bounds on certification algorithms for MAX-CUT and MAX-Independent Set on random 3- and 4-regular graphs. Our improved lower bounds are obtained by constructing nearly extremal Ramanujan graphs on as many as 163 vertices, and our upper bounds are obtained via analytical arguments. b) We obtain new inapproximability results for MAX-4-CUT and MAX-3-CUT, proving that it is NP-hard to approximate them within factors of 0.987 and 0.9649 respectively, using AlphaEvolve to discover new gadget reductions. Our MAX-4-CUT result improves upon the SOTA of 0.9883, and our MAX-3-CUT result improves on the current best gadget-based inapproximability result of 0.9853, but falls short of the SOTA of 16/17 that relies on a custom PCP (rather than a reduction from ``standard'' Hastad-style PCPs). c) Inapproximability for the metric Traveling Salesman Problem (TSP): We show that it is NP-hard to approximate the minimum cost tour within a factor of 111/110 using AlphaEvolve to discover a new gadget, thus improving the SOTA of 117/116. Along the way, we provide new modular soundness and completeness arguments that can be of independent interest. A key technical challenge we faced: verifying a candidate construction produced by AlphaEvolve is costly (sometimes requiring time exponential in the size of the construction). We used AlphaEvolve itself to evolve the verification procedure to be faster (sometimes by 10,000x for our gadgets). Our results suggest that gadget based proofs would benefit from a pass through AI-based tools to obtain stronger results.},
+}
+
+@article{Novikov2025AlphaEvolve,
+ author = {Alexander Novikov and Ng{\^a}n V{\~u} and Marvin Eisenberger and Emilien Dupont and Po-Sen Huang and Adam Zsolt Wagner and S. Shirobokov and Borislav M. Kozlovskii and Francisco J. R. Ruiz and Abbas Mehrabian and M. P. Kumar and Abigail See and Swarat Chaudhuri and George Holland and A. Davies and Sebastian Nowozin and Pushmeet Kohli and Matej Balog},
+ title = {{AlphaEvolve}: A Coding Agent for Scientific and Algorithmic Discovery},
+ journal = {ArXiv},
+ volume = {abs/2506.13131},
+ year = {2025},
+ doi = {10.48550/arXiv.2506.13131},
+ abstract = {In this white paper, we present AlphaEvolve, an evolutionary coding agent that substantially enhances capabilities of state-of-the-art LLMs on highly challenging tasks such as tackling open scientific problems or optimizing critical pieces of computational infrastructure. AlphaEvolve orchestrates an autonomous pipeline of LLMs, whose task is to improve an algorithm by making direct changes to the code. Using an evolutionary approach, continuously receiving feedback from one or more evaluators, AlphaEvolve iteratively improves the algorithm, potentially leading to new scientific and practical discoveries. We demonstrate the broad applicability of this approach by applying it to a number of important computational problems. When applied to optimizing critical components of large-scale computational stacks at Google, AlphaEvolve developed a more efficient scheduling algorithm for data centers, found a functionally equivalent simplification in the circuit design of hardware accelerators, and accelerated the training of the LLM underpinning AlphaEvolve itself. Furthermore, AlphaEvolve discovered novel, provably correct algorithms that surpass state-of-the-art solutions on a spectrum of problems in mathematics and computer science, significantly expanding the scope of prior automated discovery methods (Romera-Paredes et al., 2023). Notably, AlphaEvolve developed a search algorithm that found a procedure to multiply two $4 \times 4$ complex-valued matrices using 48 scalar multiplications; offering the first improvement, after 56 years, over Strassen's algorithm in this setting. We believe AlphaEvolve and coding agents like it can have a significant impact in improving solutions of problems across many areas of science and computation.},
+}
+
+@article{RomeraParedes2023FunSearch,
+ author = {Bernardino Romera-Paredes and M. Barekatain and Alexander Novikov and Matej Balog and M. P. Kumar and Emilien Dupont and Francisco J. R. Ruiz and J. Ellenberg and Pengming Wang and Omar Fawzi and Pushmeet Kohli and Alhussein Fawzi},
+ title = {Mathematical Discoveries from Program Search with Large Language Models},
+ journal = {Nature},
+ volume = {625},
+ pages = {468--475},
+ year = {2023},
+ doi = {10.1038/s41586-023-06924-6},
+ abstract = {Large language models (LLMs) have demonstrated tremendous capabilities in solving complex tasks, from quantitative reasoning to understanding natural language. However, LLMs sometimes suffer from confabulations (or hallucinations), which can result in them making plausible but incorrect statements. This hinders the use of current large models in scientific discovery. Here we introduce FunSearch (short for searching in the function space), an evolutionary procedure based on pairing a pretrained LLM with a systematic evaluator. We demonstrate the effectiveness of this approach to surpass the best-known results in important problems, pushing the boundary of existing LLM-based approaches. Applying FunSearch to a central problem in extremal combinatorics---the cap set problem---we discover new constructions of large cap sets going beyond the best-known ones, both in finite dimensional and asymptotic cases. This shows that it is possible to make discoveries for established open problems using LLMs. We showcase the generality of FunSearch by applying it to an algorithmic problem, online bin packing, finding new heuristics that improve on widely used baselines. In contrast to most computer search approaches, FunSearch searches for programs that describe how to solve a problem, rather than what the solution is. Beyond being an effective and scalable strategy, discovered programs tend to be more interpretable than raw solutions, enabling feedback loops between domain experts and FunSearch, and the deployment of such programs in real-world applications.},
+}
+
+@article{Imajuku2025ALEBench,
+ author = {Yuki Imajuku and Kohki Horie and Yoichi Iwata and Kensho Aoki and Naohiro Takahashi and Takuya Akiba},
+ title = {{ALE-Bench}: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering},
+ journal = {ArXiv},
+ volume = {abs/2506.09050},
+ year = {2025},
+ doi = {10.48550/arXiv.2506.09050},
+ abstract = {How well do AI systems perform in algorithm engineering for hard optimization problems in domains such as package-delivery routing, crew scheduling, factory production planning, and power-grid balancing? We introduce ALE-Bench, a new benchmark for evaluating AI systems on score-based algorithmic programming contests. Drawing on real tasks from the AtCoder Heuristic Contests, ALE-Bench presents optimization problems that are computationally hard and admit no known exact solution. Unlike short-duration, pass/fail coding benchmarks, ALE-Bench encourages iterative solution refinement over long time horizons. Our software framework supports interactive agent architectures that leverage test-run feedback and visualizations. Our evaluation of frontier LLMs revealed that while they demonstrate high performance on specific problems, a notable gap remains compared to humans in terms of consistency across problems and long-horizon problem-solving capabilities. This highlights the need for this benchmark to foster future AI advancements.},
+}
+
+@article{Janicic2025URSA,
+ author = {Predrag Jani{\v{c}}i{\'c}},
+ title = {A {SAT}-based Approach for Specification, Analysis, and Justification of Reductions between {NP}-complete Problems},
+ journal = {ArXiv},
+ volume = {abs/2511.18639},
+ year = {2025},
+ doi = {10.48550/arXiv.2511.18639},
+ abstract = {We propose a novel approach for the development, analysis, and verification of reductions between NP-complete problems. This method uses the URSA system, a SAT-based constraint solver and incorporates features that distinguish it from existing related systems.},
+}
+
+% ============================================================
+% Theme D (subset): Physics-Inspired QUBO/Ising Approaches
+% ============================================================
+
+@article{Schuetz2022PhysicsGNN,
+ author = {Martin J. A. Schuetz and J. Kyle Brubaker and Helmut G. Katzgraber},
+ title = {Combinatorial Optimization with Physics-Inspired Graph Neural Networks},
+ journal = {Nature Machine Intelligence},
+ volume = {4},
+ pages = {367--377},
+ year = {2022},
+ doi = {10.1038/s42256-022-00468-6},
+ abstract = {Combinatorial optimization problems are pervasive across science and industry. Modern deep learning tools are poised to solve these problems at unprecedented scales, but a unifying framework that incorporates insights from statistical physics is still outstanding. Here we demonstrate how graph neural networks can be used to solve combinatorial optimization problems. Our approach is broadly applicable to canonical NP-hard problems in the form of quadratic unconstrained binary optimization problems, such as maximum cut, minimum vertex cover, maximum independent set, as well as Ising spin glasses and higher-order generalizations thereof in the form of polynomial unconstrained binary optimization problems. We apply a relaxation strategy to the problem Hamiltonian to generate a differentiable loss function with which we train the graph neural network and apply a simple projection to integer variables once the unsupervised training process has completed. We showcase our approach with numerical results for the canonical maximum cut and maximum independent set problems. We find that the graph neural network optimizer performs on par or outperforms existing solvers, with the ability to scale beyond the state of the art to problems with millions of variables.},
+}
+
+@inproceedings{He2024QuantumTSP,
+ author = {Haoqi He},
+ title = {Quantum Annealing and {GNN} for Solving {TSP} with {QUBO}},
+ booktitle = {Algorithmic Applications in Management},
+ pages = {134--145},
+ year = {2024},
+ doi = {10.1007/978-981-97-7801-0_12},
+ abstract = {This paper explores the application of Quadratic Unconstrained Binary Optimization (QUBO) models in solving the Travelling Salesman Problem (TSP) through Quantum Annealing algorithms and Graph Neural Networks. Quantum Annealing (QA), a quantum-inspired optimization method that exploits quantum tunneling to escape local minima, is used to solve QUBO formulations of TSP instances on Coherent Ising Machines (CIMs). The paper also presents a novel approach where QUBO is employed as a loss function within a GNN architecture tailored for solving TSP efficiently. By leveraging GNN's capability to learn graph representations, this method finds approximate solutions to TSP with improved computational time compared to traditional exact solvers.},
+}
+
+% ============================================================
+% Theme E: LLM-Assisted Formal Verification & Program Synthesis
+% ============================================================
+
+@article{Bursuc2025VeriCoding,
+ author = {Sergiu Bursuc and Theodore Ehrenborg and Shaowei Lin and L. Astefanoaei and Ionel Emilian Chiosa and Jure Kukovec and Alok Singh and Oliver Butterley and Adem Bizid and Quinn Dougherty and Miranda Zhao and Max Tan and Max Tegmark},
+ title = {A Benchmark for Vericoding: Formally Verified Program Synthesis},
+ journal = {ArXiv},
+ volume = {abs/2509.22908},
+ year = {2025},
+ doi = {10.48550/arXiv.2509.22908},
+ abstract = {We present and test the largest benchmark for vericoding, LLM-generation of formally verified code from formal specifications --- in contrast to vibe coding, which generates potentially buggy code from a natural language description. Our benchmark contains 12,504 formal specifications, with 3,029 in Dafny, 2,334 in Verus/Rust and 7,141 in Lean. Of these, 6,174 are new unseen problems. We find vericoding success rates of 27\% in Lean, 44\% in Verus/Rust and 82\% in Dafny using off-the-shelf LLMs. Adding natural-language descriptions does not significantly improve performance. We also find that LLM progress has improved success rates on pure Dafny verification from 68\% to 96\% over the past year. The benchmark and vericoding results are shared at https://github.com/Beneficial-AI-Foundation/vericoding-benchmark},
+}
+
+@article{Thakur2025CLEVER,
+ author = {Amitayush Thakur and Jasper Lee and G. Tsoukalas and Meghana Sistla and Matthew Zhao and Stefan Zetzsche and Greg Durrett and Yisong Yue and Swarat Chaudhuri},
+ title = {{CLEVER}: A Curated Benchmark for Formally Verified Code Generation},
+ journal = {ArXiv},
+ volume = {abs/2505.13938},
+ year = {2025},
+ doi = {10.48550/arXiv.2505.13938},
+ abstract = {We introduce CLEVER, a high-quality, curated benchmark of 161 problems for end-to-end verified code generation in Lean. Each problem consists of (1) the task of generating a specification that matches a held-out ground-truth specification, and (2) the task of generating a Lean implementation that provably satisfies this specification. Unlike prior benchmarks, CLEVER avoids test-case supervision, LLM-generated annotations, and specifications that leak implementation logic or allow vacuous solutions. All outputs are verified post-hoc using Lean's type checker to ensure machine-checkable correctness. We use CLEVER to evaluate several few-shot and agentic approaches based on state-of-the-art language models. These methods all struggle to achieve full verification, establishing it as a challenging frontier benchmark for program synthesis and formal reasoning.},
+}
+
+@inproceedings{Miranda2025VeriBench,
+ title = {{VeriBench}: End-to-End Formal Verification Benchmark for {AI} Code Generation in {Lean} 4},
+ author = {Brando Miranda and Zhanke Zhou and Allen Nie and Elyas Obbad and Leni Aniva and Kai Fronsdal and Weston Kirk and Dilara Soylu and Andrea Yu and Ying Li and Sanmi Koyejo},
+ year = {2025},
+ booktitle = {2nd AI for Math Workshop at ICML 2025 (AI4Math@ICML)},
+ url = {https://openreview.net/forum?id=rWkGFmnSNl},
+ abstract = {VeriBench evaluates LLM capabilities in generating complete Lean 4 programs---implementations, unit tests, correctness theorems, and formal proofs---derived from reference Python functions or their docstrings. Testing 113 tasks across HumanEval problems, exercises, classical algorithms, and security challenges, the benchmark reveals that Claude 3.7 Sonnet achieves compilation on only 12.5\%, while LLaMA-70B fails to compile any programs in the Lean 4 HumanEval subset, even with 50 feedback-guided attempts. Only a self-optimizing agent architecture achieves meaningful compilation rates, approaching 90\%.},
+}
+
+@inproceedings{Mukherjee2025CoqPL,
+ title = {Towards Automated Verification of {LLM}-Synthesized {C} Programs},
+ author = {Prasita Mukherjee and Benjamin Delaware},
+ year = {2025},
+ month = jan,
+ booktitle = {CoqPL 2025: The Eleventh International Workshop on Coq for Programming Languages (co-located with POPL 2025)},
+ doi = {10.48550/arXiv.2410.14835},
+ url = {https://popl25.sigplan.org/details/CoqPL-2025-papers/5/Towards-Automated-Verification-of-LLM-Synthesized-C-Programs},
+ abstract = {We present a synthesis and verification framework for C programs that leverages LLMs to generate candidate programs while imposing syntactic and semantic biases on programs generated by LLMs, such that the synthesized program is more amenable to automated verification. The key contribution is a specification-verification tool built on the Verified Software Toolchain. Experiments on diverse benchmarks from the deductive program synthesis community, including basic coding examples, Separation Logic based assertions, and API specifications, demonstrate scalability and extensibility.},
+}
+
+@inproceedings{Mukherjee2025SynVer,
+ title = {{SYNVER}: {LLM}-Assisted Synthesis of High-Assurance {C} Programs},
+ author = {Prasita Mukherjee and Minghai Lu and Benjamin Delaware},
+ year = {2025},
+ month = nov,
+ booktitle = {2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE)},
+ address = {Seoul, Korea},
+ doi = {10.1109/ASE63991.2025.00255},
+ url = {https://ieeexplore.ieee.org/document/11334588/},
+ abstract = {We present SynVer---a novel, general purpose synthesizer for C programs equipped with machine-checked proofs of correctness using the Verified Software Toolchain. SynVer employs two Large Language Models: the first generates candidate programs from user-provided specifications, and the second helps automatically generate proofs of correctness in the Rocq proof assistant. SynVer combines symbolic reasoning with LLM-powered proof generation to discharge proof obligations.},
+}
+
+% ============================================================
+% Theme F: LLM Reasoning Limitations
+% ============================================================
+
+@inproceedings{Mirzadeh2025GSMSymbolic,
+ author = {Iman Mirzadeh and Keivan Alizadeh Vahid and Hooman Shahrokhi and Oncel Tuzel and Samy Bengio and Mehrdad Farajtabar},
+ title = {{GSM}-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models},
+ booktitle = {International Conference on Learning Representations},
+ year = {2025},
+ doi = {10.48550/arXiv.2410.05229},
+}
+
+@article{Shojaee2025IllusionOfThinking,
+ author = {Parshin Shojaee and Iman Mirzadeh and Keivan Alizadeh and Maxwell Horton and Samy Bengio and Mehrdad Farajtabar},
+ title = {The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity},
+ journal = {ArXiv},
+ volume = {abs/2506.06941},
+ year = {2025},
+ doi = {10.48550/arXiv.2506.06941},
+}
+
+@article{Plaat2025MultiStepReasoning,
+ author = {Aske Plaat and Annie Wong and Suzan Verberne and Joost Broekens and Niki van Stein and Thomas B{\"a}ck},
+ title = {Reasoning with Large Language Models, a Survey},
+ journal = {ACM Computing Surveys},
+ year = {2025},
+ doi = {10.48550/arXiv.2407.11511},
+}
+
+@article{Liu2024LostInMiddle,
+ author = {Nelson F. Liu and Kevin Lin and John Hewitt and Ashwin Paranjape and Michele Bevilacqua and Fabio Petroni and Percy Liang},
+ title = {Lost in the Middle: How Language Models Use Long Contexts},
+ journal = {Transactions of the Association for Computational Linguistics},
+ volume = {12},
+ pages = {157--173},
+ year = {2024},
+ doi = {10.1162/tacl_a_00638},
+}
+
+@article{Du2025ContextLengthHurts,
+ author = {Yufeng Du and Minyang Tian and Srikanth Ronanki and Subendhu Rongali and Sravan Bodapati and Aram Galstyan and Azton Wells and Roy Schwartz and Eliu A. Huerta and Hao Peng},
+ title = {Context Length Alone Hurts {LLM} Performance Despite Perfect Retrieval},
+ journal = {ArXiv},
+ volume = {abs/2510.05381},
+ year = {2025},
+ doi = {10.48550/arXiv.2510.05381},
+}
+
+@article{Glazer2024FrontierMath,
+ author = {Elliot Glazer and Ege Erdil and Tamay Besiroglu and Diego Chicharro and Evan Chen and Alex Gunning and Caroline Falkman Olsson and Jean-Stanislas Denain and Anson Ho and Emily de Oliveira Santos and Oam Patel and Niels Kornerup and Luca Zancato and Benjamin Feuer and Jonathan Tow and others},
+ title = {{FrontierMath}: A Benchmark for Evaluating Advanced Mathematical Reasoning in {AI}},
+ journal = {ArXiv},
+ volume = {abs/2411.04872},
+ year = {2024},
+ doi = {10.48550/arXiv.2411.04872},
+}
+
+@article{Paster2025HLE,
+ author = {Keiran Paster and Dan Hendrycks and others},
+ title = {Humanity's Last Exam},
+ journal = {Nature},
+ year = {2025},
+ doi = {10.1038/s41586-025-09962-4},
+}
+
+@article{Huang2025LLMWorkingMemory,
+ author = {Jen-tse Huang and Kaiser Sun and Wenxuan Wang and Mark Dredze},
+ title = {Language Models Do Not Have Human-Like Working Memory},
+ journal = {ArXiv},
+ volume = {abs/2505.10571},
+ year = {2025},
+ doi = {10.48550/arXiv.2505.10571},
+}
+
+@inproceedings{Merrill2024ExpressivePower,
+ author = {William Merrill and Ashish Sabharwal},
+ title = {The Expressive Power of Transformers with Chain of Thought},
+ booktitle = {International Conference on Learning Representations},
+ year = {2024},
+ url = {https://arxiv.org/abs/2310.07923},
+}
+
+@inproceedings{Dziri2023FaithFate,
+ author = {Nouha Dziri and Ximing Lu and Melanie Sclar and Xiang Lorraine Li and Liwei Jiang and Bill Yuchen Lin and Sean Welleck and Peter West and Chandra Bhagavatula and Ronan Le Bras and Jena Hwang and Soumya Sanyal and Xiang Ren and Allyson Ettinger and Zaid Harchaoui and Yejin Choi},
+ title = {Faith and Fate: Limits of Transformers on Compositionality},
+ booktitle = {Advances in Neural Information Processing Systems},
+ year = {2023},
+ url = {https://arxiv.org/abs/2305.18654},
+}
+
+% ============================================================
+% Theme G: AI Code Maintainability Concerns
+% ============================================================
+
+@misc{Jones2026LLMCompiler,
+ author = {Derek M. Jones},
+ title = {Investigating an {LLM} Generated {C} Compiler},
+ howpublished = {The Shape of Code (blog)},
+ year = {2026},
+ url = {https://shape-of-code.com/2026/02/22/investigating-an-llm-generated-c-compiler/},
+}
+
+@misc{GitClear2025CodeQuality,
+ author = {{GitClear}},
+ title = {{AI} Copilot Code Quality 2025: Analysis of 211 Million Changed Lines},
+ howpublished = {GitClear Whitepaper},
+ year = {2025},
+ url = {https://www.gitclear.com/ai_assistant_code_quality_2025_research},
+}
+
+@article{Becker2025METRProductivity,
+ author = {Becker, Joel and Rush, Nate and Barnes, Elizabeth and Rein, David},
+ title = {Measuring the Impact of Early-2025 {AI} on Experienced Open-Source Developer Productivity},
+ journal = {ArXiv},
+ volume = {abs/2507.09089},
+ year = {2025},
+ url = {https://arxiv.org/abs/2507.09089},
+}
+
+@article{CursorAI2025SpeedCost,
+ author = {Others},
+ title = {Speed at the Cost of Quality: How {Cursor AI} Increases Short-Term Velocity and Long-Term Complexity in Open-Source Projects},
+ journal = {ArXiv},
+ volume = {abs/2511.04427},
+ year = {2025},
+ url = {https://arxiv.org/abs/2511.04427},
+}
+
+% ============================================================
+% Foundational References (from project bibliography)
+% ============================================================
+
+@inproceedings{karp1972,
+ author = {Richard M. Karp},
+ title = {Reducibility among Combinatorial Problems},
+ booktitle = {Complexity of Computer Computations},
+ publisher = {Plenum Press},
+ year = {1972},
+ pages = {85--103}
+}
+
+@inproceedings{cook1971,
+ author = {Stephen A. Cook},
+ title = {The Complexity of Theorem-Proving Procedures},
+ booktitle = {Proceedings of the Third Annual ACM Symposium on Theory of Computing},
+ year = {1971},
+ pages = {151--158}
+}
+
+@book{garey1979,
+ author = {Michael R. Garey and David S. Johnson},
+ title = {Computers and Intractability: A Guide to the Theory of NP-Completeness},
+ publisher = {W. H. Freeman},
+ year = {1979}
+}
+
+@article{glover2019,
+ author = {Fred Glover and Gary Kochenberger and Yu Du},
+ title = {Quantum Bridge Analytics {I}: a tutorial on formulating and using {QUBO} models},
+ journal = {4OR},
+ volume = {17},
+ pages = {335--371},
+ year = {2019},
+ doi = {10.1007/s10288-019-00424-y}
+}
+
+@article{lucas2014,
+ author = {Andrew Lucas},
+ title = {Ising formulations of many NP problems},
+ journal = {Frontiers in Physics},
+ volume = {2},
+ number = {5},
+ year = {2014}
+}
+
+@article{pichler2018,
+ author = {Hannes Pichler and Sheng-Tao Wang and Leo Zhou and Soonwon Choi and Mikhail D. Lukin},
+ title = {Quantum Optimization for Maximum Independent Set Using {Rydberg} Atom Arrays},
+ journal = {ArXiv},
+ volume = {abs/1808.10816},
+ year = {2018},
+ doi = {10.48550/arXiv.1808.10816}
+}
+
+@article{barahona1982,
+ author = {Francisco Barahona},
+ title = {On the computational complexity of Ising spin glass models},
+ journal = {Journal of Physics A: Mathematical and General},
+ volume = {15},
+ number = {10},
+ pages = {3241--3253},
+ year = {1982}
+}
diff --git a/docs/paper/arxiv/scripts/mine-git-history.py b/docs/paper/arxiv/scripts/mine-git-history.py
new file mode 100644
index 00000000..0488b017
--- /dev/null
+++ b/docs/paper/arxiv/scripts/mine-git-history.py
@@ -0,0 +1,150 @@
+#!/usr/bin/env python3
+"""Mine merged PRs from CodingThrust/problem-reductions to understand project evolution.
+
+Extracts all merged PRs, classifies them by type ([Rule]/[Model]), author type
+(agent vs human), and project phase (manual / basic skills / full pipeline).
+
+Phase boundaries:
+ - Phase 1 (manual): before 2026-02-22 (no add-model/add-rule skills)
+ - Phase 2 (basic skills): 2026-02-22 to 2026-02-28 (add-model/add-rule exist)
+ - Phase 3 (full pipeline): 2026-03-01 onwards (meta-power batch resolution)
+
+Output: JSON to stdout with summary, per-phase breakdown, and full PR list.
+"""
+
+import json
+import re
+import subprocess
+import sys
+from datetime import datetime, timezone
+
+REPO = "CodingThrust/problem-reductions"
+
+# Phase boundary dates (UTC). Determined from:
+# git show 3ddc415 --format="%ai" => 2026-02-22 (add-model / add-rule skills)
+# git show 2cfb1b7 --format="%ai" => 2026-03-01 (meta-power skill)
+PHASE_BOUNDARIES = [
+ datetime(2026, 2, 22, tzinfo=timezone.utc), # Phase 1 -> Phase 2
+ datetime(2026, 3, 1, tzinfo=timezone.utc), # Phase 2 -> Phase 3
+]
+
+PHASE_LABELS = ["manual", "basic-skills", "full-pipeline"]
+
+AGENT_LOGINS = {"github-actions"}
+
+
+def is_agent(author: dict) -> bool:
+ """Classify author as agent (bot) or human."""
+ login = author.get("login", "")
+ if "[bot]" in login:
+ return True
+ if login in AGENT_LOGINS:
+ return True
+ return author.get("is_bot", False)
+
+
+def classify_phase(merged_at: str) -> int:
+ """Return 1-based phase number from the merged-at timestamp."""
+ dt = datetime.fromisoformat(merged_at.replace("Z", "+00:00"))
+ for i, boundary in enumerate(PHASE_BOUNDARIES):
+ if dt < boundary:
+ return i + 1
+ return len(PHASE_BOUNDARIES) + 1
+
+
+def classify_type(title: str) -> str | None:
+ """Return 'Rule', 'Model', or None based on PR title heuristics."""
+ if "[Rule]" in title:
+ return "Rule"
+ if "[Model]" in title:
+ return "Model"
+ # Heuristic: detect issue-linked PRs whose branch or title imply a model/rule
+ # e.g. "Fix #52: TravelingSalesman to ILP reduction" => Rule
+ # e.g. "Fix #47: Add HamiltonianCycle model" => Model
+ title_lower = title.lower()
+ if re.search(r"\breduction\b", title_lower) and re.search(r"\bto\b", title_lower):
+ return "Rule"
+ if re.search(r"\badd\b.*\bmodel\b", title_lower):
+ return "Model"
+ return None
+
+
+def fetch_prs() -> list[dict]:
+ """Fetch all merged PRs from GitHub."""
+ cmd = [
+ "gh", "pr", "list",
+ "--repo", REPO,
+ "--state", "merged",
+ "--limit", "999",
+ "--json", "number,title,author,createdAt,mergedAt,labels,headRefName",
+ ]
+ result = subprocess.run(cmd, capture_output=True, text=True, check=True)
+ return json.loads(result.stdout)
+
+
+def main():
+ prs_raw = fetch_prs()
+
+ prs = []
+ for pr in sorted(prs_raw, key=lambda x: x["number"]):
+ pr_type = classify_type(pr["title"])
+ agent = is_agent(pr["author"])
+ phase = classify_phase(pr["mergedAt"])
+
+ prs.append({
+ "number": pr["number"],
+ "title": pr["title"],
+ "author": pr["author"]["login"],
+ "created_at": pr["createdAt"],
+ "merged_at": pr["mergedAt"],
+ "branch": pr["headRefName"],
+ "is_agent": agent,
+ "phase": phase,
+ "type": pr_type,
+ })
+
+ # Summary
+ rule_prs = [p for p in prs if p["type"] == "Rule"]
+ model_prs = [p for p in prs if p["type"] == "Model"]
+ agent_prs = [p for p in prs if p["is_agent"]]
+ human_prs = [p for p in prs if not p["is_agent"]]
+
+ summary = {
+ "total_prs": len(prs),
+ "rule_prs": len(rule_prs),
+ "model_prs": len(model_prs),
+ "other_prs": len(prs) - len(rule_prs) - len(model_prs),
+ "agent_authored": len(agent_prs),
+ "human_authored": len(human_prs),
+ }
+
+ # Per-phase breakdown
+ by_phase = []
+ for phase_num, label in enumerate(PHASE_LABELS, start=1):
+ phase_prs = [p for p in prs if p["phase"] == phase_num]
+ by_phase.append({
+ "phase": phase_num,
+ "label": label,
+ "count": len(phase_prs),
+ "rule_count": len([p for p in phase_prs if p["type"] == "Rule"]),
+ "model_count": len([p for p in phase_prs if p["type"] == "Model"]),
+ "agent_count": len([p for p in phase_prs if p["is_agent"]]),
+ "human_count": len([p for p in phase_prs if not p["is_agent"]]),
+ })
+
+ output = {
+ "summary": summary,
+ "by_phase": by_phase,
+ "phase_boundaries": {
+ "phase_1_end": PHASE_BOUNDARIES[0].isoformat(),
+ "phase_2_end": PHASE_BOUNDARIES[1].isoformat(),
+ },
+ "prs": prs,
+ }
+
+ json.dump(output, sys.stdout, indent=2)
+ print() # trailing newline
+
+
+if __name__ == "__main__":
+ main()
diff --git a/docs/paper/arxiv/writing-guidelines.md b/docs/paper/arxiv/writing-guidelines.md
new file mode 100644
index 00000000..a0944300
--- /dev/null
+++ b/docs/paper/arxiv/writing-guidelines.md
@@ -0,0 +1,93 @@
+# Writing Guidelines
+
+Lessons distilled from studying "Attention Is All You Need" (Vaswani et al., NeurIPS 2017) and applying them to our paper.
+
+## 1. Start from what the reader knows
+
+Open each section with a familiar concept, then pivot to the gap or novelty.
+
+- **Abstract**: Begin with the real-world context ("Many real-world optimization problems..."), not the technical contribution.
+- **Introduction**: Start with the concrete problem (airlines, chip designers, logistics), not benchmarks or related work.
+- **Each section**: The first sentence should orient the reader, not assume they just read the previous section closely.
+
+**Bad**: "NP-hard problem reductions form a directed graph that serves as compilation infrastructure."
+**Good**: "Many real-world optimization problems are computationally hard, yet specialized solvers exist for a handful of them."
+
+## 2. Define every concept before using it
+
+Never use a term or symbol without having introduced it first. If a concept appears in the abstract, it must be self-explanatory in context.
+
+- Spell out all abbreviations on first use: "Maximum Independent Set (MIS)", not just "MIS."
+- Define technical terms in plain language before using them: "A *reduction* is a mathematical transformation that converts one problem into another while preserving the solution."
+- Introduce notation gradually: describe in words what $G = (V, E)$ means before writing the formula.
+
+**The Vaswani rule**: Before any equation, explain in words what each symbol means and what the equation will do. The math *follows* the intuition, never the reverse.
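For instance, applied to the paper's own attention equation (the formula is Vaswani et al.'s; the surrounding wording is an illustration, not a quote): first say in prose that each query is scored against every key and the softmax-normalized scores weight the values, then write

```latex
\[
  \mathrm{Attention}(Q, K, V)
    = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
\]
```

where $Q$, $K$, and $V$ are the query, key, and value matrices and $d_k$ is the key dimension. The reader meets every symbol in words before meeting it in notation.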
+
+## 3. One idea per sentence
+
+Short, declarative sentences. Each sentence carries one fact or one claim.
+
+- **Bad**: "The Transformer, which is a novel network architecture based entirely on attention mechanisms rather than recurrence or convolutions, achieves state-of-the-art results."
+- **Good**: "The Transformer relies entirely on attention mechanisms. It uses no recurrence or convolution."
+
+Avoid filler phrases ("it should be noted that", "it is worth mentioning that"). Just state the fact.
+
+## 4. Lead with the answer, not the reasoning
+
+Put the conclusion first, then the evidence. The reader should know where you're going before you take them there.
+
+- **Bad**: "Because each reduction implements the same trait, follows the same file convention, and requires the same test pattern, reusable skills are possible."
+- **Good**: "Reductions form a homogeneous task family, enabling reusable skills. Every reduction implements the same interface, follows the same file convention, and requires the same test pattern."
+
+## 5. Describe the thing, then justify it
+
+"Attention Is All You Need" describes the Transformer architecture in Section 3, then justifies the design choice in Section 4 ("Why Self-Attention"). The reader needs to understand *what* before they can appreciate *why*.
+
+Apply to our paper:
+- Section 2 describes the reduction graph and its properties.
+- Section 3 describes the methodology (skills, pipeline, verification).
+- Justification for choices (why skills? why this verification stack?) follows naturally from the description.
+
+## 6. Use concrete examples to anchor abstractions
+
+Every abstract concept should have a concrete example nearby.
+
+- "Emergent compositionality" → the Factoring → CircuitSAT → ILP story
+- "Round-trip testing" → reduce a graph, solve by brute force, extract, verify
+- "Quality gate" → 75% rejection rate on 322 batch-submitted issues
+
+When introducing a general pattern, immediately show one instance of it.
+
+## 7. Structure sections as self-contained units
+
+Each section should be readable on its own. A reader who skips straight to Section 4 (Evaluation) should understand what is being evaluated without re-reading Sections 1-3 in detail.
+
+- Re-introduce key terms briefly when they reappear ("round-trip testing, described in Section 2.4, ...").
+- Avoid forward references to undefined concepts. If Section 2 mentions "skills," the reader should already have a rough sense of what skills are from the introduction.
+
+## 8. Tables and figures earn their space
+
+Every table and figure must be:
+1. Referenced in the text (never orphaned).
+2. Self-contained with a caption that explains what the reader should see.
+3. Necessary—if the same information fits in one sentence, skip the table.
+
+Captions should tell a story: "Seven-layer verification stack. Each layer catches a distinct class of error that the layers below it miss." Not just: "Verification layers."
+
+## 9. The abstract is a standalone document
+
+The abstract should be understandable by someone who reads *only* the abstract. This means:
+- No undefined abbreviations
+- No forward references ("as shown in Section 3")
+- No citations
+- A clear problem → approach → result → significance arc
+
+## 10. Cut ruthlessly
+
+If a sentence doesn't advance the argument, delete it. Common cuts:
+- "In this section, we describe..." → just describe it
+- "It is important to note that..." → just state the note
+- "As mentioned above..." → if it matters, the reader remembers; if not, cut it
+- Restating what was just said in different words
+
+The Vaswani paper is 15 pages including references and appendix, covering a paradigm-shifting architecture. If they can do it in 15 pages, so can we.
diff --git a/docs/superpowers/plans/2026-03-12-arxiv-paper-impl.md b/docs/superpowers/plans/2026-03-12-arxiv-paper-impl.md
new file mode 100644
index 00000000..78cfd427
--- /dev/null
+++ b/docs/superpowers/plans/2026-03-12-arxiv-paper-impl.md
@@ -0,0 +1,1085 @@
+# Arxiv Paper Implementation Plan
+
+> **For agentic workers:** REQUIRED: Use superpowers:subagent-driven-development (if subagents available) or superpowers:executing-plans to implement this plan. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Write a full research paper (~10-12 pages) on skill-based agentic coding for NP-hard problem reductions, targeting an ICSE/ASE-class venue.
+
+**Architecture:** LaTeX document at `docs/paper/arxiv/paper.tex` using IEEEtran class with figures generated in Typst+CeTZ (compiled to PDF, included via `\includegraphics`), bibliography from survey, and data gathered from git history and the reduction graph.
+
+**Tech Stack:** LaTeX (IEEEtran class), BibTeX, pdflatex, Typst+CeTZ (figures only)
+
+**Spec:** `docs/superpowers/specs/2026-03-12-arxiv-paper-design.md`
+
+**Compile command** (used throughout):
+```bash
+# Compile Typst figures first
+for f in docs/paper/arxiv/figures/*.typ; do typst compile "$f"; done
+# Then build LaTeX
+cd docs/paper/arxiv && pdflatex paper.tex && bibtex paper && pdflatex paper.tex && pdflatex paper.tex && cd -
+```
+Or single-pass check (figures already compiled): `cd docs/paper/arxiv && pdflatex -interaction=nonstopmode paper.tex && cd -`
+
+**Review skill:** After writing is complete, use `academic-paper-reviewer` (installed at `.claude/skills/academic-research-skills/academic-paper-reviewer/`) for simulated 5-person peer review.
+
+---
+
+## File Structure
+
+| File | Purpose |
+|------|---------|
+| `docs/paper/arxiv/paper.tex` | Main paper document (IEEEtran) |
+| `docs/paper/arxiv/references.bib` | Bibliography (merged from survey + existing paper refs) |
+| `docs/paper/arxiv/figures/reduction-graph.typ` | Figure 1: Reduction graph (Typst+CeTZ → PDF) |
+| `docs/paper/arxiv/figures/architecture.typ` | Figure 2: System architecture (Typst+CeTZ → PDF) |
+| `docs/paper/arxiv/figures/pipeline.typ` | Figure 3: Card-based pipeline (Typst+CeTZ → PDF) |
+| `docs/paper/arxiv/figures/verification-pyramid.typ` | Figure 4: Verification stack pyramid (Typst+CeTZ → PDF) |
+| `docs/paper/arxiv/data/graph-metrics.json` | Reduction graph metrics (from Task 2) |
+| `docs/paper/arxiv/data/git-mining-results.json` | Git history mining results (from Task 11) |
+| `docs/paper/arxiv/scripts/mine-git-history.py` | Git history mining script |
+
+---
+
+## Chunk 1: Paper Scaffolding + Data Gathering
+
+### Task 1: Set up paper.tex scaffolding
+
+**Files:**
+- Create: `docs/paper/arxiv/paper.tex`
+- Create: `docs/paper/arxiv/references.bib`
+
+- [ ] **Step 1: Create bibliography file**
+
+Copy the survey bibliography:
+
+```bash
+cp .claude/survey/agentic-coding-reductions/references.bib docs/paper/arxiv/references.bib
+```
+
+Then append the following entries from `docs/paper/references.bib` (read that file and copy these exact `@` entries by key): `karp1972`, `cook1971`, `garey1979`, `glover2019`, `lucas2014`, `barahona1982`. These are foundational references not in the survey bib.
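The copy-by-key step can also be scripted rather than done by hand. A minimal sketch (not part of the plan; `extract_entries` is a hypothetical helper, and it assumes each entry's closing `}` sits alone on its own line, as in this repository's `.bib` files):

```python
import re

def extract_entries(bib_text: str, keys: set) -> str:
    """Collect the @-entries whose citation keys are in `keys`.

    Simplifying assumption: each entry ends with a line that is just '}'.
    """
    out, keep = [], False
    for line in bib_text.splitlines():
        match = re.match(r"@\w+\{([^,]+),", line.strip())
        if match:
            # Start (or skip) an entry based on its citation key.
            keep = match.group(1) in keys
        if keep:
            out.append(line)
            if line.strip() == "}":
                keep = False
    return "\n".join(out)

sample = """@inproceedings{karp1972,
 author = {Richard M. Karp},
}

@article{skip_me,
 title = {Not copied},
}"""

print(extract_entries(sample, {"karp1972"}))
```

Hand-copying is fine for six entries; a script like this only pays off if the foundational-reference list grows.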
+
+- [ ] **Step 2: Write paper.tex with IEEEtran class**
+
+Create `docs/paper/arxiv/paper.tex` with:
+
+```latex
+\documentclass[conference]{IEEEtran}
+\usepackage{cite}
+\usepackage{amsmath,amssymb,amsfonts}
+\usepackage{graphicx}
+\usepackage{textcomp}
+\usepackage{xcolor}
+\usepackage{booktabs}
+\usepackage{listings}
+\usepackage{hyperref}
+\usepackage{cleveref}
+
+\begin{document}
+
+\title{Skill-Based Agentic Coding for Mathematical Software:\\
+A Case Study in NP-Hard Problem Reductions}
+
+\author{...} % placeholder
+
+\maketitle
+
+\begin{abstract}
+...
+\end{abstract}
+
+\section{Introduction}\label{sec:intro}
+\section{Why Reductions? The Goldilocks Domain}\label{sec:domain}
+\section{System Architecture}\label{sec:architecture}
+\section{Skill-Based Task Decomposition}\label{sec:skills}
+\section{Multi-Layered Verification}\label{sec:verification}
+\section{Evaluation}\label{sec:evaluation}
+\section{Related Work}\label{sec:related}
+\section{Discussion \& Conclusion}\label{sec:conclusion}
+
+\bibliographystyle{IEEEtran}
+\bibliography{references}
+
+\end{document}
+```
+
+- [ ] **Step 3: Write abstract (~150 words)**
+
+Fill in the abstract covering:
+- Problem: agents fail at long-horizon math coding tasks (70-80% on SWE-Bench Verified, ~20% on long-horizon)
+- Insight: decompose into human-creative + agent-managed/executed via skill-based pipeline
+- Method: 13 skills + 7-layer verification stack
+- Result: 24 problem types, 40 implemented reductions, 52 graph edges
+- Contribution: methodology + verification stack + open-source artifact
+
+- [ ] **Step 4: Create figures directory**
+
+```bash
+mkdir -p docs/paper/arxiv/figures
+```
+
+- [ ] **Step 5: Verify scaffolding compiles**
+
+```bash
+cd docs/paper/arxiv && pdflatex -interaction=nonstopmode paper.tex && cd -
+```
+
+Expected: PDF with title, abstract, and empty section headings. BibTeX warnings about missing refs are expected at this stage.
+
+- [ ] **Step 6: Remove old paper.typ**
+
+```bash
+rm -f docs/paper/arxiv/paper.typ
+```
+
+- [ ] **Step 7: Commit**
+
+```bash
+git add -f docs/paper/arxiv/paper.tex docs/paper/arxiv/references.bib
+git rm -f docs/paper/arxiv/paper.typ 2>/dev/null || true
+git commit -m "docs(arxiv): LaTeX paper scaffolding with IEEEtran and bibliography"
+```
+
+---
+
+### Task 2: Data gathering — reduction graph metrics
+
+**Files:**
+- Create: `docs/paper/arxiv/data/graph-metrics.json`
+
+**Note:** The file `docs/src/reductions/reduction_graph.json` has a corrupted header (partial JSON + log message before line 10). Regenerate it first or parse from the second valid copy starting after `JSON content:`.
+
+- [ ] **Step 1: Regenerate the reduction graph JSON**
+
+```bash
+make rust-export
+```
+
+This regenerates `docs/src/reductions/reduction_graph.json` with clean content. Verify it starts with valid JSON:
+
+```bash
+python3 -c "import json; json.load(open('docs/src/reductions/reduction_graph.json'))"
+```
+
+If the file is still corrupted after `make rust-export`, extract the valid portion:
+
+```bash
+python3 -c "
+content = open('docs/src/reductions/reduction_graph.json').read()
+idx = content.find('JSON content:\n')
+if idx >= 0:
+    clean = content[idx + len('JSON content:\n'):]
+    open('docs/src/reductions/reduction_graph.json', 'w').write(clean)
+    print('Fixed corrupted JSON')
+else:
+    print('JSON is clean')
+"
+```
+
+- [ ] **Step 2: Count nodes, edges, and types**
+
+```bash
+python3 -c "
+import json
+data = json.load(open('docs/src/reductions/reduction_graph.json'))
+nodes = data['nodes']
+edges = data['edges']
+names = sorted(set(n['name'] for n in nodes))
+print(f'Unique problem types: {len(names)}')
+print(f'Variant nodes: {len(nodes)}')
+print(f'Total directed edges: {len(edges)}')
+print(f'Types: {names}')
+"
+```
+
+Expected: ~24 types, ~42 variant nodes, ~52 edges.
+
+Count implemented ReduceTo impls (the "40 reductions" number):
+
+```bash
+grep -c 'impl.*ReduceTo' src/rules/*_*.rs | awk -F: '{s+=$2} END {print "Total ReduceTo impls:", s}'
+```
+
+Expected: ~40. Inferred variant edges = total edges - ReduceTo impls (52 - 40 = 12 with the expected counts).
+
+- [ ] **Step 3: Compute hub node degrees**
+
+```bash
+python3 << 'PYEOF'
+import json
+from collections import Counter
+data = json.load(open('docs/src/reductions/reduction_graph.json'))
+in_deg = Counter()
+out_deg = Counter()
+for e in data['edges']:
+    # edges use node dicts with 'name' field, or indices into data['nodes']
+    src = e['source']['name'] if isinstance(e['source'], dict) else data['nodes'][e['source']]['name']
+    tgt = e['target']['name'] if isinstance(e['target'], dict) else data['nodes'][e['target']]['name']
+    in_deg[tgt] += 1
+    out_deg[src] += 1
+print('Top in-degree (reduce TO this):')
+for name, cnt in in_deg.most_common(5):
+    print(f'  {name}: {cnt}')
+print('Top out-degree (reduce FROM this):')
+for name, cnt in out_deg.most_common(5):
+    print(f'  {name}: {cnt}')
+PYEOF
+```
+
+Record QUBO and ILP in-degrees, MIS and SAT out-degrees for S2.
+
+- [ ] **Step 4: Count LOC per reduction (excluding casts files)**
+
+```bash
+for f in src/rules/*_*.rs; do
+ case "$f" in *_casts.rs) continue;; esac
+ echo "$(wc -l < "$f") $f"
+done | sort -n
+```
+
+Record min, max, median for the "~50-200 LOC" claim.
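The recorded statistics can be computed with a short helper (a sketch; feed it the counts printed by the loop above rather than the made-up example values):

```python
import statistics

def loc_stats(locs):
    """Summarize per-file line counts for the '~50-200 LOC' claim."""
    locs = sorted(locs)
    return {"min": locs[0], "max": locs[-1], "median": statistics.median(locs)}

# Example with made-up counts; substitute the real `wc -l` output.
print(loc_stats([48, 112, 73, 201, 95]))
# -> {'min': 48, 'max': 201, 'median': 95}
```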
+
+- [ ] **Step 5: Save metrics to data file**
+
+```bash
+mkdir -p docs/paper/arxiv/data
+```
+
+Write a JSON file at `docs/paper/arxiv/data/graph-metrics.json` containing:
+```json
+{
+ "unique_types": 24,
+ "variant_nodes": 42,
+ "total_edges": 52,
+ "reduceto_impls": 40,
+ "inferred_edges": 12,
+ "hub_in_degree": {"QUBO": N, "ILP": N},
+ "hub_out_degree": {"MIS": N, "SAT": N},
+ "loc_per_reduction": {"min": N, "max": N, "median": N}
+}
+```
+
+Fill in actual numbers from Steps 2-4.
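A few lines of Python can assemble and write the file so `inferred_edges` is derived rather than hand-typed (a sketch; the literals below are the expected figures and must be replaced with the measured values, and the output path shortened here must be the full `docs/paper/arxiv/data/graph-metrics.json`):

```python
import json

# Expected figures; replace with the measured outputs of Steps 2-4.
metrics = {
    "unique_types": 24,
    "variant_nodes": 42,
    "total_edges": 52,
    "reduceto_impls": 40,
}
# Derived, so it cannot drift out of sync with the two counts above.
metrics["inferred_edges"] = metrics["total_edges"] - metrics["reduceto_impls"]

with open("graph-metrics.json", "w") as f:
    json.dump(metrics, f, indent=2)
print(metrics["inferred_edges"])
# -> 12
```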
+
+- [ ] **Step 6: Commit**
+
+```bash
+git add -f docs/paper/arxiv/data/graph-metrics.json
+git commit -m "docs(arxiv): gather reduction graph metrics"
+```
+
+---
+
+## Chunk 2: Figures
+
+**Conventions for all figure files (Typst+CeTZ → PDF hybrid):**
+- Each figure is a standalone `.typ` file in `docs/paper/arxiv/figures/`.
+- Figures use `#set page(width: auto, height: auto, margin: 5pt)` for tight bounding box.
+- Import CeTZ: `#import "@preview/cetz:0.4.2": canvas, draw`.
+- Import the project graph library when useful: `#import "../../../lib.typ": g-node, g-edge, graph-colors`.
+- Color scheme: category colors are graph=`rgb("#4e79a7")` (blue), formula=`rgb("#59a14f")` (green), set=`rgb("#e15759")` (orange-red), algebraic=`rgb("#b07aa1")` (purple), and misc=`rgb("#999")` (gray); role colors are Human=`rgb("#f28e2b")` (orange) and Agent=`rgb("#4e79a7")` (blue).
+- Arrow style: `mark: (end: "straight")` for directed edges.
+- Compile each figure to PDF: `typst compile docs/paper/arxiv/figures/filename.typ`.
+- Include in LaTeX via `\includegraphics{figures/filename.pdf}`.
+- Test figures by compiling individually before full paper build.
+- Do NOT commit generated `.pdf` files — they are build artifacts.
+
+### Task 3: Figure 1 — Reduction graph
+
+**Files:**
+- Create: `docs/paper/arxiv/figures/reduction-graph.typ`
+- Modify: `docs/paper/arxiv/paper.tex`
+
+- [ ] **Step 1: Create reduction graph figure**
+
+Create `docs/paper/arxiv/figures/reduction-graph.typ`. Read the graph data from `docs/src/reductions/reduction_graph.json` for edge connectivity.
+
+```typst
+#import "@preview/cetz:0.4.2": canvas, draw
+#set page(width: auto, height: auto, margin: 5pt)
+#set text(size: 7pt)
+
+// Category colors
+#let cat-graph = rgb("#4e79a7")
+#let cat-formula = rgb("#59a14f")
+#let cat-set = rgb("#e15759")
+#let cat-algebraic = rgb("#b07aa1")
+#let cat-misc = rgb("#999")
+
+#canvas(length: 1cm, {
+ import draw: *
+
+ // Node positions by category (column-based layout)
+ // Column 1: graph problems, Column 2: formula, etc.
+ // Place QUBO and ILP centrally as hub nodes (larger radius)
+
+ // ... define positions for all 24 unique problem types ...
+ // ... draw directed edges from graph JSON ...
+ // ... add legend box ...
+})
+```
+
+Use a column-based layout by category:
+- Column 1 (blue): graph problems (MIS, MaxClique, MaxCut, MinVC, MinDS, MaxMatching, MaximalIS, KColoring, TSP, SpinGlass, BicliqueCover)
+- Column 2 (green): formula problems (SAT, k-SAT, CircuitSAT)
+- Column 3 (orange-red): set problems (MinSetCovering, MaxSetPacking)
+- Column 4 (purple): algebraic problems (QUBO, ILP, CVP, BMF, Knapsack)
+- Column 5 (gray): misc problems (BinPacking, PaintShop, Factoring)
+
+Place QUBO and ILP centrally as hub nodes (larger circles, `radius: 0.4` vs `0.2`). Use the 24 unique problem type names (not all 42 variants). Mention variants in caption.
+
+For each node, use `draw.circle(pos, radius: r, fill: cat-color.lighten(70%), stroke: 0.5pt + cat-color, name: id)` and `draw.content(id, text(6pt, abbreviation))`.
+
+For directed edges, use `draw.line(src, tgt, stroke: 0.4pt + luma(100), mark: (end: "straight", scale: 0.4))`. Keep edges thin to avoid clutter with 52 edges.
+
+Add a small legend box in one corner with the 5 category colors.
+
+- [ ] **Step 2: Compile figure to PDF**
+
+```bash
+typst compile docs/paper/arxiv/figures/reduction-graph.typ
+```
+
+Verify output: `docs/paper/arxiv/figures/reduction-graph.pdf` exists.
+
+- [ ] **Step 3: Include in paper.tex**
+
+In Section 2, add:
+```latex
+\begin{figure*}[t]
+ \centering
+ \includegraphics[width=\textwidth]{figures/reduction-graph.pdf}
+ \caption{The reduction graph: 24 problem types connected by 52 directed edges (40 implemented reductions + 12 inferred variant edges). Hub nodes QUBO and ILP are highlighted.}
+ \label{fig:reduction-graph}
+\end{figure*}
+```
+
+Use `figure*` for full-width in two-column layout.
+
+- [ ] **Step 4: Verify full paper compiles**
+
+```bash
+cd docs/paper/arxiv && pdflatex -interaction=nonstopmode paper.tex && cd -
+```
+
+- [ ] **Step 5: Commit**
+
+```bash
+git add -f docs/paper/arxiv/figures/reduction-graph.typ docs/paper/arxiv/paper.tex
+git commit -m "docs(arxiv): add Figure 1 — reduction graph (Typst+CeTZ)"
+```
+
+---
+
+### Task 4: Figure 3 — Pipeline diagram
+
+**Files:**
+- Create: `docs/paper/arxiv/figures/pipeline.typ`
+- Modify: `docs/paper/arxiv/paper.tex`
+
+- [ ] **Step 1: Create pipeline diagram**
+
+Create `docs/paper/arxiv/figures/pipeline.typ` using CeTZ:
+
+```typst
+#import "@preview/cetz:0.4.2": canvas, draw
+#set page(width: auto, height: auto, margin: 5pt)
+#set text(size: 8pt)
+
+#let human-color = rgb("#f28e2b")
+#let agent-color = rgb("#4e79a7")
+
+#canvas(length: 1cm, {
+ import draw: *
+
+ // Board columns as rounded rectangles, connected vertically
+ // Color-code: human decisions in orange, agent actions in blue
+ // Layout:
+ // Contributor → [Issue] → [Backlog]
+ // │ Maintainer moves card (orange)
+ // ▼
+ // [Ready]
+ // │ project-pipeline (blue)
+ // ▼
+ // [In Progress]
+ // │ issue-to-pr → check → implement → review (blue)
+ // ▼
+ // [review-agentic]
+ // │ review-pipeline (blue)
+ // ▼
+ // [In Review]
+ // │ Maintainer merges (orange)
+ // ▼
+ // [Done]
+
+ // Use rect(..., radius: 4pt) for rounded board columns
+ // Use line() with mark: (end: "straight") for arrows
+ // Add action labels on edges with draw.content()
+})
+```
+
+- [ ] **Step 2: Compile figure to PDF**
+
+```bash
+typst compile docs/paper/arxiv/figures/pipeline.typ
+```
+
+- [ ] **Step 3: Include in paper.tex in S4**
+
+```latex
+\begin{figure}[t]
+ \centering
+ \includegraphics[width=\columnwidth]{figures/pipeline.pdf}
+  \caption{Two-stage card-based pipeline. Human decisions (orange) are limited to Backlog$\to$Ready and In Review$\to$Done; agents manage everything in between.}
+ \label{fig:pipeline}
+\end{figure}
+```
+
+- [ ] **Step 4: Verify compiles**
+
+- [ ] **Step 5: Commit**
+
+```bash
+git add -f docs/paper/arxiv/figures/pipeline.typ docs/paper/arxiv/paper.tex
+git commit -m "docs(arxiv): add Figure 3 — card-based pipeline diagram (Typst+CeTZ)"
+```
+
+---
+
+### Task 5: Figure 4 — Verification pyramid
+
+**Files:**
+- Create: `docs/paper/arxiv/figures/verification-pyramid.typ`
+- Modify: `docs/paper/arxiv/paper.tex`
+
+- [ ] **Step 1: Create verification pyramid figure**
+
+Create `docs/paper/arxiv/figures/verification-pyramid.typ` using CeTZ:
+
+```typst
+#import "@preview/cetz:0.4.2": canvas, draw
+#set page(width: auto, height: auto, margin: 5pt)
+#set text(size: 7pt)
+
+#canvas(length: 1cm, {
+ import draw: *
+
+ // 7-layer trapezoid/pyramid, widest at bottom
+ // Each layer is a filled trapezoid with text on left (mechanism) and right (error class)
+ // Color gradient: bottom = blue (automated), top = orange/gold (human-readable)
+
+ // Layer data: (mechanism, error class caught)
+ // 1: Type system (Rust compiler) → API misuse
+ // 2: Unit tests (eval, serialization) → evaluation errors
+ // 3: Closed-loop tests (round-trip) → mapping errors
+ // 4: Overhead validation (symbolic) → formula errors
+ // 5: Materialized fixtures (JSON) → test gaming
+ // 6: Agentic review (parallel) → convention violations
+ // 7: Documentation (proof sketch) → logical errors
+
+ // Draw each layer as a trapezoid using merge-path with line segments
+ // Width decreases from bottom to top
+ // Use draw.content() for labels on each layer
+ // Use color.mix() or manual gradient for blue→gold transition
+})
+```
+
+- [ ] **Step 2: Compile figure to PDF**
+
+```bash
+typst compile docs/paper/arxiv/figures/verification-pyramid.typ
+```
+
+- [ ] **Step 3: Include in paper.tex in S5**
+
+```latex
+\begin{figure}[t]
+ \centering
+ \includegraphics[width=\columnwidth]{figures/verification-pyramid.pdf}
+ \caption{Seven-layer verification stack. Lower layers (blue) are fully automated; upper layers (gold) involve human-readable arguments.}
+ \label{fig:verification}
+\end{figure}
+```
+
+- [ ] **Step 4: Verify compiles**
+
+- [ ] **Step 5: Commit**
+
+```bash
+git add -f docs/paper/arxiv/figures/verification-pyramid.typ docs/paper/arxiv/paper.tex
+git commit -m "docs(arxiv): add Figure 4 — verification pyramid (Typst+CeTZ)"
+```
+
+---
+
+### Task 6: Figure 2 — System architecture
+
+**Files:**
+- Create: `docs/paper/arxiv/figures/architecture.typ`
+- Modify: `docs/paper/arxiv/paper.tex`
+
+- [ ] **Step 1: Create architecture diagram**
+
+Create `docs/paper/arxiv/figures/architecture.typ` using CeTZ:
+
+```typst
+#import "@preview/cetz:0.4.2": canvas, draw
+#set page(width: auto, height: auto, margin: 5pt)
+#set text(size: 8pt)
+
+#canvas(length: 1cm, {
+ import draw: *
+
+ // Three stacked boxes connected by labeled arrows:
+ //
+ // ┌─────────────────────────────────────┐
+ // │ Problem trait │
+ // │ NAME, Metric, dims(), evaluate() │
+ // ├──────────────┬──────────────────────┤
+ // │ Optimization │ Satisfaction │
+ // │ SolutionSize │ bool │
+ // └──────┬───────┴──────────────────────┘
+ // │ ReduceTo
+ // ▼
+ // ┌─────────────────────────────────────┐
+ // │ ReductionResult │
+ // │ target_problem() + extract_solution│
+ // └──────┬──────────────────────────────┘
+ // │ #[reduction(overhead = {...})]
+ // ▼
+ // ┌─────────────────────────────────────┐
+ // │ Compile-time validation │
+ // │ • Variable names → getter methods │
+ // │ • Expr AST: symbolic overhead │
+ // │ • declare_variants! → registry │
+ // └─────────────────────────────────────┘
+
+ // Use rect() with name for each box
+ // Use draw.content() for text inside boxes (use raw() for code identifiers)
+ // Use line() with mark for connecting arrows
+ // Use draw.content() on arrow midpoints for edge labels
+})
+```
+
+Keep compact. Use `raw()` (backtick syntax) for code identifiers in Typst.
+
+- [ ] **Step 2: Compile figure to PDF**
+
+```bash
+typst compile docs/paper/arxiv/figures/architecture.typ
+```
+
+- [ ] **Step 3: Include in paper.tex in S3**
+
+```latex
+\begin{figure}[t]
+ \centering
+ \includegraphics[width=\columnwidth]{figures/architecture.pdf}
+ \caption{System architecture: the trait hierarchy and compile-time validation enforce round-trip testing capability by construction.}
+ \label{fig:architecture}
+\end{figure}
+```
+
+- [ ] **Step 4: Verify compiles**
+
+- [ ] **Step 5: Commit**
+
+```bash
+git add -f docs/paper/arxiv/figures/architecture.typ docs/paper/arxiv/paper.tex
+git commit -m "docs(arxiv): add Figure 2 — system architecture (Typst+CeTZ)"
+```
+
+---
+
+## Chunk 3: Sections S1-S4
+
+**Convention:** All "Verify compiles" steps use: `cd docs/paper/arxiv && pdflatex -interaction=nonstopmode paper.tex && cd -`. Expected: no fatal errors. Citations use `\cite{BibKey}` (e.g., `\cite{Thai2025SWEEVO}`). Cross-references use `\Cref{fig:...}` or `Fig.~\ref{fig:...}`. Before writing any section, first read `paper.tex` to understand the formatting conventions established in Task 1.
+
+**Page budget reference** (IEEEtran two-column, ~800 words/page):
+- S1: ~1.5 pages (~1200 words)
+- S2: ~1 page (~800 words)
+- S3: ~1.5 pages (~1200 words)
+- S4: ~2 pages (~1600 words)
+
+### Task 7: Write S1 — Introduction
+
+**Files:**
+- Modify: `docs/paper/arxiv/paper.tex`
+
+- [ ] **Step 1: Write introduction body (~1200 words)**
+
+First read `paper.tex` to understand the document structure. Then fill in `\section{Introduction}`. Structure:
+
+1. Opening paragraph: agents hit 70-80% on SWE-Bench but ~20% on long-horizon → cite `\cite{Thai2025SWEEVO}`, `\cite{Deng2025SWEBenchPro}`
+2. Our thesis: bottleneck is decomposition, not capability
+3. "Review is harder than generation" for mathematical code → cite `\cite{Roychoudhury2025AgenticAI}`
+4. Three roles paragraph: contributors (creative issues), maintainer (board + skills), agents (manage + execute)
+5. Contributions list (3 items from spec) — use `\begin{itemize}...\end{itemize}`
+6. Paper organization paragraph
+
+- [ ] **Step 2: Verify compiles**
+
+- [ ] **Step 3: Commit**
+
+```bash
+git add -f docs/paper/arxiv/paper.tex
+git commit -m "docs(arxiv): write S1 Introduction"
+```
+
+---
+
+### Task 8: Write S2 — Why Reductions?
+
+**Files:**
+- Modify: `docs/paper/arxiv/paper.tex`
+**Depends on:** Task 2 (graph metrics), Task 3 (Figure 1)
+
+- [ ] **Step 1: Write S2 body (~800 words)**
+
+Read graph metrics from `docs/paper/arxiv/data/graph-metrics.json` for concrete numbers. If not yet available, use: 24 types, 42 variants, 52 edges, 40 implemented, 12 inferred.
+
+Structure:
+1. Goldilocks domain paragraph: self-contained (~50-200 LOC), formally specified, automatable round-trip criterion
+2. Contrast with SWE-Bench: homogeneous tasks enable comparison
+3. Hardware solvers paragraph: Rydberg atoms for MIS (cite `\cite{lucas2014}`), D-Wave for QUBO/Ising (cite `\cite{glover2019}`) → the graph as compilation layer
+4. Real-world applications paragraph: SDN→ILP, airline→SetCovering, VLSI→coloring, logistics→TSP
+5. Reference `Fig.~\ref{fig:reduction-graph}` (placed by Task 3)
+
+- [ ] **Step 2: Verify compiles**
+
+- [ ] **Step 3: Commit**
+
+```bash
+git add -f docs/paper/arxiv/paper.tex
+git commit -m "docs(arxiv): write S2 Why Reductions — Goldilocks domain"
+```
+
+---
+
+### Task 9: Write S3 — System Architecture
+
+**Files:**
+- Modify: `docs/paper/arxiv/paper.tex`
+**Depends on:** Task 6 (Figure 2)
+
+- [ ] **Step 1: Write S3 body (~1200 words)**
+
+Use the trait hierarchy from CLAUDE.md's Architecture section for reference. Do NOT read source files — CLAUDE.md has sufficient detail. Full trait code belongs in supplementary material.
+
+Structure:
+1. Problem trait: `evaluate()` enables brute-force verification of any configuration
+2. ReduceTo trait: type system enforces round-trip capability by construction
+3. `#[reduction(overhead)]` proc macro: compile-time validation of overhead expressions
+4. `declare_variants!`: registry enables automated graph export + completeness checking
+5. Design philosophy paragraph: reduce the space of possible agent errors
+6. Reference `Fig.~\ref{fig:architecture}` (placed by Task 6)
+
+Use `\texttt{}` for code identifiers and `\lstinline` for inline code snippets.
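Note that the scaffolding loads `listings` without configuring it, and `listings` ships no built-in Rust definition. A preamble sketch along these lines (piggybacking on the C mode and adding a few Rust keywords; adjust to taste) keeps `\lstinline` output readable:

```latex
% Preamble addition for paper.tex (sketch, not yet part of the scaffolding).
\lstset{
  basicstyle=\ttfamily\small,
  language=C,  % listings has no Rust mode; extend C with Rust keywords
  morekeywords={impl,fn,trait,let,match,pub,mod,use},
  breaklines=true,
  columns=flexible,
}
```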
+
+- [ ] **Step 2: Verify compiles**
+
+- [ ] **Step 3: Commit**
+
+```bash
+git add -f docs/paper/arxiv/paper.tex
+git commit -m "docs(arxiv): write S3 System Architecture"
+```
+
+---
+
+### Task 10: Write S4 — Skill-Based Task Decomposition
+
+**Files:**
+- Modify: `docs/paper/arxiv/paper.tex`
+**Depends on:** Task 4 (Figure 3)
+
+- [ ] **Step 1: Write S4.1 — Three Roles (~300 words)**
+
+The roles table from the spec (Contributor/Maintainer/Agent). Use `\begin{table}...\end{table}` with `booktabs`.
+
+- [ ] **Step 2: Read skill files and extract metadata**
+
+Read all 13 skill files (`.claude/skills/*/SKILL.md`). For each, record: name, one-line description, invocation trigger, step count. This data populates Table 1.
+
+- [ ] **Step 3: Write S4.2 — Skills as Agent Functions (~800 words)**
+
+Group the 13 skills into 5 categories (from spec):
+- **Orchestration** (4): project-pipeline, review-pipeline, issue-to-pr, meta-power
+- **Implementation** (2): add-model, add-rule
+- **Quality gate** (4): check-issue, check-rule-redundancy, review-implementation, fix-pr
+- **Documentation** (2): write-model-in-paper, write-rule-in-paper
+- **Release** (1): release
+
+Create Table 1 with `booktabs`:
+```latex
+\begin{table}[t]
+\caption{Skills inventory.}\label{tab:skills}
+\centering
+\begin{tabular}{llcc}
+\toprule
+Skill & Category & Steps & Success \\
+\midrule
+...
+\bottomrule
+\end{tabular}
+\end{table}
+```
+
+Success Rate column: use "TBD" — filled after Task 11.
+
+- [ ] **Step 4: Write S4.3 — Card-Based Orchestration (~500 words)**
+
+Two-stage pipeline (project-pipeline → review-pipeline). Human touches only Backlog→Ready and In Review→Done. Reference `Fig.~\ref{fig:pipeline}`.
+
+- [ ] **Step 5: Verify compiles**
+
+- [ ] **Step 6: Commit**
+
+```bash
+git add -f docs/paper/arxiv/paper.tex
+git commit -m "docs(arxiv): write S4 Skill-Based Task Decomposition"
+```
+
+---
+
+## Chunk 4: Sections S5-S6
+
+### Task 11: Git history mining script
+
+**Files:**
+- Create: `docs/paper/arxiv/scripts/mine-git-history.py`
+- Create: `docs/paper/arxiv/data/git-mining-results.json`
+
+- [ ] **Step 1: Create directories**
+
+```bash
+mkdir -p docs/paper/arxiv/scripts docs/paper/arxiv/data
+```
+
+- [ ] **Step 2: Write PR listing and field extraction**
+
+Write `docs/paper/arxiv/scripts/mine-git-history.py` — Part 1: list all merged PRs with `[Rule]` or `[Model]` in the title.
+
+```bash
+gh pr list --repo CodingThrust/ProblemReductions --state merged --limit 999 --json number,title,author,createdAt,mergedAt,labels,headRefName
+```
+
+For each PR, extract: number, title, author login, created date, merged date, whether title contains `[Rule]` or `[Model]`.
+
+Author classification: if `author.login` contains `[bot]` or is `github-actions`, classify as "agent"; otherwise "human".
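That rule is small enough to pin down as a predicate (a sketch of the stated rule; `author` is the object from the `gh` JSON output):

```python
def classify_author(author):
    """Return 'agent' for bot accounts per the rule above, else 'human'."""
    login = (author or {}).get("login", "")
    return "agent" if "[bot]" in login or login == "github-actions" else "human"

# GitHub App accounts carry a [bot] suffix on the login.
print(classify_author({"login": "claude[bot]"}))  # -> agent
print(classify_author({"login": "alice"}))        # -> human
```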
+
+- [ ] **Step 3: Add phase classification**
+
+**Phase boundaries** (based on when key skills were introduced — determine by running):
+```bash
+git log --all --oneline --diff-filter=A -- '.claude/skills/add-rule/SKILL.md' | tail -1
+git log --all --oneline --diff-filter=A -- '.claude/skills/project-pipeline/SKILL.md' | tail -1
+```
+
+Define phases:
+- Phase 1 (manual): PRs before add-model/add-rule skills existed
+- Phase 2 (basic skills): PRs after implementation skills but before pipeline skills
+- Phase 3 (full pipeline): PRs after project-pipeline/review-pipeline skills existed
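Given the two boundary dates recovered from `git log`, phase assignment reduces to a date comparison (a sketch; the boundary constants below are placeholders to be filled in from the commands above):

```python
from datetime import datetime, timezone

# Placeholder boundaries; replace with the author dates of the commits
# that first added add-rule/SKILL.md and project-pipeline/SKILL.md.
SKILLS_ADDED = datetime(2026, 1, 20, tzinfo=timezone.utc)
PIPELINE_ADDED = datetime(2026, 2, 10, tzinfo=timezone.utc)

def classify_phase(merged_at: str) -> int:
    """Map a PR mergedAt timestamp (ISO 8601, as emitted by gh) to phase 1-3."""
    merged = datetime.fromisoformat(merged_at.replace("Z", "+00:00"))
    if merged < SKILLS_ADDED:
        return 1  # manual
    if merged < PIPELINE_ADDED:
        return 2  # basic skills
    return 3      # full pipeline

print(classify_phase("2026-02-20T08:00:00Z"))  # -> 3
```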
+
+- [ ] **Step 4: Run script and save results**
+
+```bash
+python3 docs/paper/arxiv/scripts/mine-git-history.py > docs/paper/arxiv/data/git-mining-results.json
+```
+
+Expected output schema:
+```json
+{
+ "summary": {"total_prs": N, "rule_prs": N, "model_prs": N, "agent_authored": N, "human_authored": N},
+ "by_phase": [{"phase": 1, "label": "manual", "count": N, "agent_count": N}, ...],
+ "prs": [{"number": 42, "title": "...", "is_agent": false, "phase": 1, "type": "Rule"}, ...]
+}
+```
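The `summary` and `by_phase` sections follow mechanically from the per-PR records (a sketch over the record shape shown above):

```python
def summarize(prs):
    """Aggregate per-PR records into the summary/by_phase schema above."""
    labels = {1: "manual", 2: "basic skills", 3: "full pipeline"}
    summary = {
        "total_prs": len(prs),
        "rule_prs": sum(p["type"] == "Rule" for p in prs),
        "model_prs": sum(p["type"] == "Model" for p in prs),
        "agent_authored": sum(p["is_agent"] for p in prs),
        "human_authored": sum(not p["is_agent"] for p in prs),
    }
    by_phase = [
        {
            "phase": ph,
            "label": labels[ph],
            "count": sum(p["phase"] == ph for p in prs),
            "agent_count": sum(p["phase"] == ph and p["is_agent"] for p in prs),
        }
        for ph in (1, 2, 3)
    ]
    return {"summary": summary, "by_phase": by_phase, "prs": prs}
```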
+
+- [ ] **Step 5: Commit**
+
+```bash
+git add -f docs/paper/arxiv/scripts/ docs/paper/arxiv/data/git-mining-results.json
+git commit -m "docs(arxiv): git history mining script and results"
+```
+
+---
+
+### Task 12: Write S5 — Multi-Layered Verification
+
+**Files:**
+- Modify: `docs/paper/arxiv/paper.tex`
+**Depends on:** Task 5 (Figure 4)
+
+- [ ] **Step 1: Write S5.1 — The Verification Stack (~700 words)**
+
+Write the 7-layer table from the spec using `booktabs`. Use these concrete error examples:
+
+| Layer | Mechanism | Example Error Caught |
+|-------|-----------|---------------------|
+| 1. Type system | Rust compiler | Agent returns `bool` instead of `SolutionSize` from `evaluate()` |
+| 2. Unit tests | `test_*_basic` | Agent evaluates MaxCut objective with wrong sign |
+| 3. Closed-loop tests | `test_*_to_*_closed_loop` | SAT→MIS maps clause variables to wrong vertex indices |
+| 4. Overhead validation | Symbolic expr vs sizes | Agent writes `num_edges = num_clauses` instead of `3 * num_clauses` |
+| 5. Materialized fixtures | JSON ground truth | Agent changes expected QUBO matrix to make failing test pass |
+| 6. Agentic review | Parallel subagents | Missing `declare_variants!`, wrong file naming |
+| 7. Documentation | Proof sketch | Proof assumes connected graph but problem allows disconnected |
+
+Reference `Fig.~\ref{fig:verification}`.
+
+- [ ] **Step 2: Write S5.2 — Why Layers? (~400 words)**
+
+The "lazy agent" problem. Materialized test data as defense. No single layer is sufficient. Cross-reference Table 2 in S6.
+
+- [ ] **Step 3: Verify compiles**
+
+```bash
+cd docs/paper/arxiv && pdflatex -interaction=nonstopmode paper.tex && cd -
+```
+
+- [ ] **Step 4: Commit**
+
+```bash
+git add -f docs/paper/arxiv/paper.tex
+git commit -m "docs(arxiv): write S5 Multi-Layered Verification"
+```
+
+---
+
+### Task 13: Write S6 — Evaluation
+
+**Files:**
+- Modify: `docs/paper/arxiv/paper.tex`
+**Depends on:** Task 11 (git mining data)
+
+- [ ] **Step 1: Write S6.1 — Ablation setup (~500 words)**
+
+Experimental DESIGN only (results are `[TBD]` placeholders):
+- Setup: 5-10 reductions, skill-based vs no-skill baseline
+- Metrics: first-attempt CI pass rate, review rounds, correctness, convention adherence
+- Framing: "controlled illustration" (n=5-10)
+
+- [ ] **Step 2: Write S6.2 — Git History Mining results (~700 words)**
+
+Read data from `docs/paper/arxiv/data/git-mining-results.json`. If not yet available, use `[TBD]` placeholders.
+
+Create Table 2 (error taxonomy × verification layer):
+```latex
+\begin{table}[t]
+\caption{Error taxonomy by verification layer.}\label{tab:errors}
+\centering
+\begin{tabular}{llc}
+\toprule
+Error Category & Layer & Count \\
+\midrule
+Type errors & 1 (type system) & [TBD] \\
+Mapping errors & 3 (closed-loop) & [TBD] \\
+...
+\bottomrule
+\end{tabular}
+\end{table}
+```
+
+- [ ] **Step 3: Write S6.3 — Case Studies (~800 words)**
+
+Search for actual PRs:
+```bash
+gh pr list --repo CodingThrust/ProblemReductions --state merged --limit 999 --search "MinimumVertexCover MaximumIndependentSet" --json number,title
+gh pr list --repo CodingThrust/ProblemReductions --state merged --limit 999 --search "Satisfiability MaximumIndependentSet" --json number,title
+gh pr list --repo CodingThrust/ProblemReductions --state merged --limit 999 --search "Factoring CircuitSAT" --json number,title
+```
+
+**Case 1 — Simple (MVC→MIS):** complement relationship, ~30 LOC.
+**Case 2 — Complex (SAT→MIS):** clause-variable gadget, quadratic blowup.
+**Case 3 — Composition (Factoring→CircuitSAT→ILP):** two independent reductions composing in graph.
+
+- [ ] **Step 4: Verify compiles**
+
+- [ ] **Step 5: Commit**
+
+```bash
+git add -f docs/paper/arxiv/paper.tex
+git commit -m "docs(arxiv): write S6 Evaluation"
+```
+
+---
+
+## Chunk 5: Sections S7-S8 + Review + Final Assembly
+
+### Task 14: Write S7 — Related Work
+
+**Files:**
+- Modify: `docs/paper/arxiv/paper.tex`
+
+- [ ] **Step 1: Write S7 body (~800 words)**
+
+Four subsections with specific citation keys:
+
+1. **AI coding agents:** `\cite{Yang2024SWEagent}`, `\cite{Wang2024OpenHands}`, `\cite{Anthropic2025ClaudeCode}`, `\cite{Wu2024Devin}`, `\cite{Thai2025SWEEVO}`, `\cite{Deng2025SWEBenchPro}`, `\cite{Xia2025LiveSWEagent}`, `\cite{Roychoudhury2025AgenticAI}`, `\cite{Anthropic2026AgenticCoding}`
+
+2. **AI-discovered reductions:** `\cite{Novikov2025AlphaEvolve}`, `\cite{Janicic2025URSA}`, `\cite{RomeraParedes2023FunSearch}`
+
+3. **Formal verification:** `\cite{Bursuc2025VeriCoding}`, `\cite{Thakur2025CLEVER}`, `\cite{Miranda2025VeriBench}`, `\cite{Mukherjee2025CoqPL}`, `\cite{Mukherjee2025SynVer}`
+
+4. **Physics-inspired optimization:** `\cite{Schuetz2022PhysicsGNN}`, `\cite{He2024QuantumTSP}`
+
+Position our work as complementary, not competing.
+
+- [ ] **Step 2: Verify compiles**
+
+- [ ] **Step 3: Commit**
+
+```bash
+git add -f docs/paper/arxiv/paper.tex
+git commit -m "docs(arxiv): write S7 Related Work"
+```
+
+---
+
+### Task 15: Write S8 — Discussion & Conclusion
+
+**Files:**
+- Modify: `docs/paper/arxiv/paper.tex`
+
+- [ ] **Step 1: Write S8 body (~800 words)**
+
+Four parts:
+1. **Generalizability:** Goldilocks property, candidate domains
+2. **Limitations:** n=1, skill engineering cost, domain specificity, confounds, maintainer requirement
+3. **Human value proposition:** repositioned not eliminated. Cite `\cite{Anthropic2026AgenticCoding}`.
+4. **Future directions:** AlphaEvolve, formal verification, scaling to 100+
+
+End with `\subsection{Conclusion}`: 2-3 crisp sentences.
+
+- [ ] **Step 2: Verify compiles**
+
+- [ ] **Step 3: Commit**
+
+```bash
+git add -f docs/paper/arxiv/paper.tex
+git commit -m "docs(arxiv): write S8 Discussion and Conclusion"
+```
+
+---
+
+### Task 16: Simulated peer review
+
+**Files:** None modified (review only)
+
+- [ ] **Step 1: Run academic-paper-reviewer**
+
+Read `.claude/skills/academic-research-skills/academic-paper-reviewer/SKILL.md` and invoke the review process on `docs/paper/arxiv/paper.tex`. This simulates a 5-person review panel (Editor-in-Chief + 3 domain reviewers + Devil's Advocate) with quality rubrics.
+
+- [ ] **Step 2: Record review findings**
+
+Save the review output to `docs/paper/arxiv/data/peer-review-round1.md`.
+
+- [ ] **Step 3: Address critical review findings**
+
+Fix any issues scored below 65 (Major Revision threshold). Update paper.tex accordingly.
+
+- [ ] **Step 4: Commit fixes**
+
+```bash
+git add -f docs/paper/arxiv/paper.tex docs/paper/arxiv/data/peer-review-round1.md
+git commit -m "docs(arxiv): address peer review round 1 findings"
+```
+
+---
+
+### Task 17: Final assembly and polish
+
+**Files:**
+- Modify: `docs/paper/arxiv/paper.tex`
+
+- [ ] **Step 1: Compile all Typst figures**
+
+```bash
+for f in docs/paper/arxiv/figures/*.typ; do typst compile "$f"; done
+```
+
+Verify all 4 PDFs exist:
+```bash
+ls docs/paper/arxiv/figures/*.pdf
+```
+
+Expected: `reduction-graph.pdf`, `architecture.pdf`, `pipeline.pdf`, `verification-pyramid.pdf`.
+
+- [ ] **Step 2: Verify all figures are placed correctly**
+
+Check that these references exist in the paper text:
+- `\ref{fig:reduction-graph}` in S2
+- `\ref{fig:architecture}` in S3
+- `\ref{fig:pipeline}` in S4
+- `\ref{fig:verification}` in S5
+
+- [ ] **Step 3: Verify all tables are placed**
+
+Check for `\ref{tab:skills}` in S4 and `\ref{tab:errors}` in S6.
+
+- [ ] **Step 4: Full compile with bibliography**
+
+```bash
+cd docs/paper/arxiv && pdflatex paper.tex && bibtex paper && pdflatex paper.tex && pdflatex paper.tex && cd -
+```
+
+Check for unresolved citations: `grep "Citation.*undefined" docs/paper/arxiv/paper.log`
+Expected: no undefined citations.
+
+- [ ] **Step 5: Check page count**
+
+```bash
+pdfinfo docs/paper/arxiv/paper.pdf | grep Pages
+```
+
+Expected: 10-12 pages. If over, trim. If under, expand.
+
+- [ ] **Step 6: Flag visual review for maintainer**
+
+Verify no LaTeX warnings about overfull hboxes (>1pt). Visual inspection (layout, figures, tables) requires human review.
+
+- [ ] **Step 7: Commit final version**
+
+```bash
+git add -f docs/paper/arxiv/paper.tex docs/paper/arxiv/figures/*.typ docs/paper/arxiv/references.bib
+git commit -m "docs(arxiv): final paper assembly and polish"
+```
+
+Note: Do NOT commit `paper.pdf`, `paper.aux`, `paper.bbl`, `paper.blg`, `paper.log`, or `figures/*.pdf` — these are build artifacts. Add them to `.gitignore` if needed.
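If those entries are not yet ignored, a `.gitignore` fragment along these lines works (paths assumed from this plan; adjust to the actual layout):

```gitignore
docs/paper/arxiv/*.aux
docs/paper/arxiv/*.bbl
docs/paper/arxiv/*.blg
docs/paper/arxiv/*.log
docs/paper/arxiv/paper.pdf
docs/paper/arxiv/figures/*.pdf
```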
+
+---
+
+## Execution Notes
+
+### Dependency Graph
+
+```
+Task 1 (scaffolding) ──→ Task 2 (metrics) ──→ Task 8 (S2)
+ ──→ Tasks 3-6 (figures, parallel)
+ ──→ Task 11 (git mining)
+
+Task 3 (Fig 1) ──→ Task 8 (S2)
+Task 4 (Fig 3) ──→ Task 10 (S4)
+Task 5 (Fig 4) ──→ Task 12 (S5)
+Task 6 (Fig 2) ──→ Task 9 (S3)
+
+Task 7 (S1): no figure dependency — can run after Task 1
+Task 11 (git mining) ──→ Task 13 (S6)
+Task 14 (S7): independent — can run after Task 1
+Task 15 (S8): independent — can run after Task 1
+Task 16 (peer review): must run after Tasks 7-15
+Task 17 (assembly): must run LAST
+```
+
+### Suggested Parallel Batches
+
+1. **Tasks 1-2** (scaffolding + data) — sequential, run first
+2. **Tasks 3-6** (all figures) + **Task 7** (S1) + **Task 11** (git mining) — parallel
+3. **Tasks 8-10** (S2-S4) + **Tasks 14-15** (S7-S8) — parallel
+4. **Tasks 12-13** (S5-S6) — parallel
+5. **Task 16** (peer review) — after all sections written
+6. **Task 17** (assembly) — last
+
+### Open Dependencies
+
+- **S6.1 ablation results** are `[TBD]` placeholders. The ablation experiment is a separate effort outside this plan.
+- **Table 1 success rates** are `[TBD]` — filled from git mining data if available.
+- **Peer review** (Task 16) may surface issues requiring additional revision cycles.
diff --git a/docs/superpowers/specs/2026-03-12-arxiv-paper-design.md b/docs/superpowers/specs/2026-03-12-arxiv-paper-design.md
new file mode 100644
index 00000000..2b705f93
--- /dev/null
+++ b/docs/superpowers/specs/2026-03-12-arxiv-paper-design.md
@@ -0,0 +1,299 @@
+# Skill-Based Agentic Coding for Mathematical Software: A Case Study in NP-Hard Problem Reductions
+
+**Type:** Full research paper (~10-12 pages)
+**Venue:** ICSE/ASE-class SE conference
+**Output:** `docs/paper/arxiv/paper.tex` (LaTeX, IEEEtran class)
+
+## Thesis
+
+The bottleneck in agentic coding is not agent capability but task decomposition and the division of labor between human creativity and agent management/execution. We demonstrate a skill-based pipeline where humans (contributors + maintainer) provide judgment — which problems matter, which reductions are useful — while agents handle both management (orchestrating the pipeline, picking cards, dispatching sub-agents) and execution (implementation, testing, documentation, review). Applied to NP-hard problem reductions, this produces a verified library of 24 problem types with 40 implemented reduction rules and 52 total graph edges (including 12 inferred variant edges), with multi-layered correctness guarantees.
+
+**Terminology note:** "40 reductions" = hand-coded `ReduceTo` implementations. "52 graph edges" = total directed edges in the reduction graph, including natural edges inferred from the type-parameter subtype lattice (e.g., an edge from one `MIS` variant to another). The paper must consistently distinguish these counts.
+
+## Paper Outline
+
+### S1. Introduction (~1.5 pages)
+
+Frame the problem:
+- AI coding agents achieve 70-80% on isolated bug fixes (SWE-Bench Verified) but drop to ~20% on long-horizon, multi-file tasks. The common response is to push for more agent autonomy.
+- We argue the bottleneck is not capability but decomposition: how to split creative/judgment work (human) from management/mechanical work (agent).
+- The "review is harder than generation" challenge — especially for mathematical/scientific code where correctness is hard to verify.
+
+Present the three roles:
+- **Contributors** create issues (creative: identify which reductions are useful, propose new problems, spot gaps in the graph).
+- **Maintainer** curates the project board and writes skills (creative: priorities, domain knowledge encoding, quality standards).
+- **Agents** both manage (pick cards from the board, orchestrate the pipeline, dispatch sub-agents for review) and execute (implement, test, document).
+
+Contributions:
+1. A skill-based methodology for decomposing mathematical coding tasks into agent-manageable steps.
+2. A multi-layered verification stack that catches errors across different abstraction levels.
+3. A verified reduction library (24 problem types, 40 implemented reductions, 52 graph edges) as a practical artifact.
+
+### S2. Why Reductions? The Goldilocks Domain (~1 page)
+
+Why this domain is ideal for studying agentic coding:
+- Each reduction is self-contained (~50-200 LOC), requires non-trivial mathematical reasoning, yet has an automatable correctness criterion (round-trip: reduce → solve target → extract solution back → verify against source).
+- Homogeneous task structure enables systematic comparison across tasks (unlike SWE-Bench's heterogeneous issues).
+- Contrast with general SE tasks: reductions have a clear mathematical spec, a ground truth, and bounded scope.
+
+Practical motivation — hardware solvers:
+- Rydberg atom arrays solve Maximum Independent Set natively.
+- D-Wave quantum annealers solve Ising/QUBO problems.
+- A verified reduction graph serves as a **compilation layer**: reduce SAT → MIS → run on Rydberg atoms; reduce MaxCut → SpinGlass → QUBO → run on D-Wave. The library lets specialized hardware solve a much larger class of problems.
+
+Practical motivation — real-world applications:
+- Software-defined networking (routing/scheduling → ILP).
+- Airline crew scheduling (→ SetCovering).
+- VLSI design (→ graph coloring).
+- Logistics (→ TSP, BinPacking).
+- These domains reduce to problems that already have hardware or algorithmic solutions; the library provides the verified bridge.
+
+Figure 1: The reduction graph (24 problem types, 42 variant nodes, 52 directed edges, QUBO/ILP hubs visible, color-coded by category: graph/formula/set/algebraic/misc). Caption distinguishes 40 implemented reductions from 12 inferred variant edges.
+
+### S3. System Architecture (~1.5 pages)
+
+The Rust library design that makes agent-generated code verifiable by construction. Focus on the aspects that directly enable the verification story (details of trait hierarchy and proc macros in supplementary material).
+
+**Key design choices:**
+- `Problem` trait with `evaluate()` enables brute-force verification of any configuration.
+- `ReduceTo` trait with `ReductionResult` enforces that every reduction can produce a target problem AND extract solutions back — the type system makes round-trip testing possible by construction.
+- `#[reduction(overhead = {...})]` proc macro: overhead expressions are compile-time validated against getter methods — agents cannot write incorrect variable names in overhead formulas.
+- `declare_variants!` registers problem variants with complexity strings — the registry enables automated graph export and completeness checking.
+
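+The two core traits, and the round-trip check they make possible by construction, can be sketched as follows. This is a minimal illustration with simplified names, not the library's exact API; the `Bit` toy problem is invented solely to exercise the harness end to end:
+
+```rust
+// Simplified sketch of the core traits (illustrative, not the real signatures).
+trait Problem {
+    type Config: Clone;
+    /// Some(objective value) if `config` is feasible for this instance, else None.
+    fn evaluate(&self, config: &Self::Config) -> Option<i64>;
+}
+
+trait ReduceTo<T: Problem>: Problem {
+    fn reduce(&self) -> T;
+    /// Map a target configuration back to a source configuration.
+    fn extract_solution(&self, target_config: &T::Config) -> Self::Config;
+}
+
+/// Round-trip harness: works for ANY reduction, because the trait forces
+/// every implementation to provide both `reduce` and `extract_solution`.
+fn round_trip<S, T>(source: &S, solve: impl Fn(&T) -> T::Config) -> Option<i64>
+where
+    S: ReduceTo<T>,
+    T: Problem,
+{
+    let target_config = solve(&source.reduce());
+    source.evaluate(&source.extract_solution(&target_config))
+}
+
+// Degenerate toy problem ("choose a bit, objective = bit value"), reduced to
+// itself, just to show the harness compiling and running.
+struct Bit;
+impl Problem for Bit {
+    type Config = u8;
+    fn evaluate(&self, c: &u8) -> Option<i64> {
+        if *c <= 1 { Some(i64::from(*c)) } else { None }
+    }
+}
+impl ReduceTo<Bit> for Bit {
+    fn reduce(&self) -> Bit { Bit }
+    fn extract_solution(&self, c: &u8) -> u8 { *c }
+}
+
+fn main() {
+    assert_eq!(round_trip(&Bit, |_t: &Bit| 1u8), Some(1));
+}
+```
+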
+**Design philosophy:** Reduce the space of possible agent errors through type-level enforcement. The architecture is not just a code organization choice — it is the foundation of the verification stack (elaborated in S5).
+
+Figure 2: System architecture diagram (key traits + compile-time validation flow). Full trait hierarchy in supplementary material.
+
+### S4. Skill-Based Task Decomposition (~2 pages)
+
+#### 4.1 The Three Roles
+
+How creative/judgment work distributes across human roles, with management and execution delegated to agents:
+
+| Role | Responsibility | Creative/Judgment | Examples |
+|------|---------------|-------------------|----------|
+| Contributor | Open issues | Which reductions are useful? Non-trivial? | "Add SAT → DominatingSet rule" |
+| Maintainer | Curate board, write skills | Priorities, quality standards, domain knowledge | Move card to "Ready", evolve check-issue skill |
+| Agent | Manage pipeline + execute | — | Pick card, implement, test, review, create PR |
+
+#### 4.2 Skills as Agent Functions
+
+A skill is a markdown script that decomposes a complex task into agent-manageable subtasks. Key insight: if a task is small and explicit enough, agents handle it well.
+
+Skills inventory (13 skills, grouped by function):
+
+**Orchestration skills** (agent-as-manager):
+- **project-pipeline**: The primary card-based automation skill. Picks a "Ready" issue from the GitHub Project board, moves it to "In Progress", runs `issue-to-pr --execute` in an isolated git worktree, then moves to "review-agentic". Supports single-issue, specific-issue, and `--all` batch modes. Processes Models before Rules to satisfy dependencies.
+- **review-pipeline**: Second-stage orchestration. Picks a PR from the "review-agentic" column, fixes Copilot review comments, runs agentic feature tests, fixes CI (up to 3 retries), then moves to "In Review" for human merge. Also supports batch mode.
+- **issue-to-pr**: The per-issue entry point invoked by `project-pipeline`. Receives a GitHub issue, classifies it (model vs. rule), dispatches to the appropriate implementation skill, and creates a PR.
+- **meta-power**: Batch mode alternative. Resolves all open issues autonomously in dependency order. Experimental — being superseded by the pipeline skills above.
+
+**Implementation skills** (agent-as-executor):
+- **add-model**: Brainstorm (if interactive) → implement Problem trait → unit tests → serialization tests → review.
+- **add-rule**: Brainstorm (if interactive) → implement ReduceTo trait → closed-loop tests → overhead expressions → example → review.
+
+**Quality gate skills:**
+- **check-issue**: Validates usefulness, non-triviality, literature correctness of a proposed rule/model. Posts structured report.
+- **check-rule-redundancy**: Determines if a proposed rule is dominated by a composite path through existing rules.
+- **review-implementation**: Dispatches parallel subagents (structural check + quality check) with fresh context windows.
+- **fix-pr**: Resolves review comments, CI failures, coverage gaps.
+
+**Documentation skills** (also serve as verification Layer 7 — see S5):
+- **write-model-in-paper**: Generates Typst problem definition (formal definition, background, example with visualization).
+- **write-rule-in-paper**: Generates Typst reduction theorem (complexity citation, self-contained proof sketch, detailed example). The proof sketch is the final verification layer — it forces a human-readable argument for correctness.
+
+**Release skill:**
+- **release**: Determines version bump from diff, verifies tests/clippy, tags and publishes.
+
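+For concreteness, a skill file might be structured as follows. This is a hypothetical sketch, not the actual `add-rule` skill; the step names reuse conventions defined elsewhere in this spec (closed-loop test naming, the overhead macro, the review-implementation skill):
+
+```markdown
+# add-rule (hypothetical sketch)
+
+## Inputs
+- A GitHub issue describing a source → target reduction
+
+## Steps
+1. Classify the issue; confirm the source and target problem types exist.
+2. Implement the `ReduceTo` trait with solution extraction.
+3. Write the closed-loop test `test_<source>_to_<target>_closed_loop`.
+4. Declare the `#[reduction(overhead = {...})]` expression.
+5. Run tests and lints; fix failures before proceeding.
+6. Dispatch `review-implementation`; address its findings.
+```
+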
+Table 1: Skills inventory — trigger condition, inputs, outputs, typical agent turns, first-attempt success rate from git history.
+
+#### 4.3 Card-Based Orchestration
+
+- GitHub Project board with columns: Backlog → Ready → In Progress → review-agentic → In Review → Done.
+- **Two-stage agent pipeline:**
+ - Stage 1 (`project-pipeline`): picks Ready card → moves to In Progress → runs issue-to-pr in isolated worktree → moves to review-agentic.
+ - Stage 2 (`review-pipeline`): picks review-agentic card → fixes Copilot comments → runs agentic feature tests → fixes CI (up to 3 retries) → moves to In Review.
+- **Human touches only two transitions:**
+ - Backlog → Ready (maintainer decides what to work on next — the creative/strategic decision).
+ - In Review → Done (maintainer merges after final review — the quality gate).
+- The agent handles everything in between: worktree creation, implementation, testing, review, CI fixing, board status updates.
+- Batch mode (`--all`) processes all Ready issues or all review-agentic PRs in a single invocation, with Models before Rules to satisfy dependencies.
+
+Figure 3: Pipeline diagram — two-stage card flow: contributor opens issue → [Backlog] → maintainer moves to [Ready] → agent: project-pipeline [In Progress → review-agentic] → agent: review-pipeline [In Review] → maintainer merges [Done]. Human decisions highlighted in distinct color.
+
+### S5. Multi-Layered Verification (~1.5 pages)
+
+#### 5.1 The Verification Stack
+
+Seven layers, each catching different error classes:
+
+| Layer | Mechanism | Catches |
+|-------|-----------|---------|
+| 1. Type system | Rust compiler, trait bounds | Wrong return types, missing trait impls, API misuse |
+| 2. Unit tests | `test_*_basic`, `test_*_serialization` | Evaluation errors, serialization roundtrip failures |
+| 3. Closed-loop tests | `test_*_to_*_closed_loop` | Incorrect reduction mapping, wrong solution extraction |
+| 4. Overhead validation | Symbolic expr vs. actual sizes | Overhead formula errors (e.g., quadratic vs linear edge count) |
+| 5. Materialized fixtures | JSON ground truth in `tests/data/` | Agents silently changing expected values to make tests pass |
+| 6. Agentic review | Parallel subagents with fresh context | Structural issues, missing edge cases, convention violations |
+| 7. Documentation | Paper entry with proof sketch | Logical errors in the reduction argument itself |
+
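+The idea behind Layer 4 can be pictured with a small runtime analogue. Names here are hypothetical; the real check validates the `#[reduction(overhead = {...})]` expression against getter methods at compile time:
+
+```rust
+// Hypothetical runtime sketch of overhead validation (Layer 4): the declared
+// symbolic overhead is evaluated against the actually constructed target.
+struct Overhead {
+    /// Claimed target edge count as a function of source (n, m).
+    target_edges: fn(usize, usize) -> usize,
+}
+
+fn overhead_holds(decl: &Overhead, n: usize, m: usize, actual_edges: usize) -> bool {
+    (decl.target_edges)(n, m) == actual_edges
+}
+
+fn main() {
+    // A clause-gadget construction with quadratic edge growth: a declaration
+    // claiming linear growth must be rejected.
+    let linear = Overhead { target_edges: |_n, m| m };
+    let quadratic = Overhead { target_edges: |_n, m| m * m };
+    let (n, m, actual) = (4, 3, 9); // target actually has m^2 = 9 edges
+    assert!(!overhead_holds(&linear, n, m, actual));
+    assert!(overhead_holds(&quadratic, n, m, actual));
+}
+```
+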
+#### 5.2 Why Layers?
+
+The "lazy agent" problem: agents take the shortest path to close an issue. Given a failing test, an agent is more likely to change the expected value than fix the underlying bug. Materialized test data (Layer 5) prevents this by locking expected outputs in version-controlled JSON files that the agent cannot modify as part of a rule implementation PR.
+
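+A fixture might pin the expected target sizes for one reduction. The path and field names below are hypothetical illustrations, not the repository's actual schema:
+
+```json
+{
+  "reduction": "Satisfiability -> MaximumIndependentSet",
+  "source": { "variables": 4, "clauses": 3 },
+  "expected_target": { "vertices": 9, "edges": 12 }
+}
+```
+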
+No single layer is sufficient: the type system catches API misuse but not logical errors; closed-loop tests verify functional correctness but not overhead formulas; documentation catches proof-level mistakes that no automated test can detect.
+
+Table 2 (defined in S6.2, referenced here): Error taxonomy × verification layer matrix.
+
+Figure 4: Verification pyramid with concrete error examples at each layer.
+
+### S6. Evaluation (~2.5 pages)
+
+#### 6.1 Ablation: Skill-Based vs. No-Skill Agent (quantitative)
+
+To demonstrate that the skill-based approach matters (not just "use a good agent"), we run a controlled comparison:
+
+**Setup:** Select 5-10 reductions of varying complexity. For each, run two configurations:
+- **Skill-based:** Full pipeline (issue-to-pr skill, add-rule skill, review-implementation, fix-pr).
+- **No-skill baseline:** Raw Claude Code on the same codebase with the same issue description but no skills (only CLAUDE.md for project context).
+
+**Metrics:** First-attempt CI pass rate, number of review rounds, final correctness (round-trip test pass), and code quality (convention adherence).
+
+**Framing:** With n=5-10, this ablation is a **controlled illustration** of the skill-based approach's value, not a statistically powered experiment. The results demonstrate the mechanism (how skills prevent specific error classes) rather than establishing effect sizes. The git mining in S6.2 provides broader quantitative evidence across the full project history.
+
+This is feasible: create the same issues on a branch without skill files, run the agent, measure outcomes.
+
+#### 6.2 Git History Mining (quantitative)
+
+Data source: full git/PR history of the problemreductions repository.
+
+Metrics:
+- Agent-implemented vs. human-implemented reductions (count and %).
+- First-attempt success rate per skill invocation (does the PR pass CI on first push?).
+- Number of review rounds before merge.
+- Error taxonomy: categorize all errors found during review, map to verification layer that caught them.
+- Test coverage across the codebase (>95% target).
+- Lines of code per reduction (distribution, compare agent vs human).
+
+**Addressing the confound:** Skills evolved during the project, so early reductions had less agent support. We address this by:
+- Stratifying results by skill maturity phase (Phase 1: manual, Phase 2: basic skills, Phase 3: full pipeline with card automation).
+- Plotting success rate over time with skill milestone annotations.
+- Restricting primary quantitative claims to Phase 3 reductions (stable pipeline).
+
+**Preliminary error taxonomy** (to be populated from git history):
+- *Type errors*: wrong return type, missing trait impl → caught by Layer 1 (type system)
+- *Mapping errors*: incorrect vertex/edge index in reduction → caught by Layer 3 (closed-loop tests)
+- *Formula errors*: wrong overhead expression (e.g., linear vs quadratic edge count) → caught by Layer 4 (overhead validation)
+- *Test gaming*: agent changes expected value instead of fixing bug → caught by Layer 5 (materialized fixtures)
+- *Convention violations*: wrong file naming, missing `declare_variants!` → caught by Layer 6 (agentic review)
+- *Logical errors*: incorrect proof argument → caught by Layer 7 (documentation review)
+
+Table 2: Error taxonomy × verification layer matrix (populated from git mining).
+
+#### 6.3 Case Studies (qualitative)
+
+Three reductions spanning the complexity spectrum:
+
+**Simple — MinimumVertexCover → MaximumIndependentSet:**
+- Complement relationship: MIS(G) = V \ MVC(G).
+- Near-trivial mapping, ~30 LOC.
+- Shows the pipeline working smoothly with minimal human intervention.
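+
+The core of this reduction can be sketched in a few lines. This is an illustrative standalone version, not the library's code: S is an independent set iff V \ S is a vertex cover, so complementing a maximum independent set yields a minimum vertex cover.
+
+```rust
+// A set S is independent iff no edge has both endpoints in S.
+fn is_independent_set(edges: &[(usize, usize)], s: &[bool]) -> bool {
+    edges.iter().all(|&(u, v)| !(s[u] && s[v]))
+}
+
+// A set C is a vertex cover iff every edge has at least one endpoint in C.
+fn is_vertex_cover(edges: &[(usize, usize)], c: &[bool]) -> bool {
+    edges.iter().all(|&(u, v)| c[u] || c[v])
+}
+
+/// Solution extraction: complement the MIS configuration.
+fn extract_cover(mis: &[bool]) -> Vec<bool> {
+    mis.iter().map(|&b| !b).collect()
+}
+
+fn main() {
+    // Path graph 0-1-2: maximum IS = {0, 2}, minimum VC = {1}.
+    let edges = [(0, 1), (1, 2)];
+    let mis = [true, false, true];
+    assert!(is_independent_set(&edges, &mis));
+    let cover = extract_cover(&mis);
+    assert!(is_vertex_cover(&edges, &cover));
+    assert_eq!(cover.iter().filter(|&&b| b).count(), 1); // |V| - |MIS| = 3 - 2
+}
+```
+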
+
+**Complex — Satisfiability → MaximumIndependentSet:**
+- Clause-variable gadget construction, quadratic blowup in edges.
+- Requires understanding both CNF formulas and graph structure.
+- Shows where agent makes mistakes (edge count in intersection graph) and how verification layers catch them.
+
+**Composition — Factoring → CircuitSAT → ILP (graph-level, not single-agent):**
+- Two independently implemented reductions (Factoring→CircuitSAT and CircuitSAT→ILP) that compose in the reduction graph.
+- This case study analyzes each reduction's implementation pipeline separately, then demonstrates how the graph enables composition: factor a number by chaining reductions to ILP and using an off-the-shelf solver.
+- The "composition" is a property of the graph structure, not a single agent managing a multi-hop chain.
+- Highlights the practical value: the library serves as compilation infrastructure.
+
+For each case study: show the full pipeline from issue to merged PR, highlight where human judgment was needed vs. where agent executed autonomously, and which verification layers activated.
+
+### S7. Related Work (~1 page)
+
+**AI coding agents:**
+- SWE-agent (ACI design), OpenHands (open platform + SDK), Claude Code (agentic CLI), Devin (autonomous engineer).
+- Benchmarks: SWE-Bench Verified (~70-80%), SWE-EVO (~20% on long-horizon), SWE-Bench Pro (~45%).
+- Our contribution: skill-based decomposition as an alternative to pushing for more raw capability.
+- Live-SWE-agent's self-evolution is complementary — skills are human-authored evolution.
+
+**AI-assisted discovery of reductions and complexity:**
+- AlphaEvolve discovers new NP-hardness gadgets (MAX-3-CUT, MAX-4-CUT, metric TSP bounds).
+- URSA uses SAT solvers for formal verification of NP-complete reductions.
+- Our work is complementary: AlphaEvolve discovers new reductions; our pipeline implements and verifies known ones.
+
+**Formal verification of AI-generated code:**
+- VeriCoding (27% Lean, 44% Verus, 82% Dafny success rates).
+- CLEVER (near-zero on hard Lean problems).
+- VeriBench (self-optimizing agents reach ~90% compilation).
+- Our approach: pragmatic multi-layer verification instead of end-to-end formal proofs. Trade-off: less formal guarantee, but practically effective at catching real errors.
+
+**Physics-inspired optimization:**
+- GNNs via QUBO Hamiltonian relaxation solve MIS, MaxCut, MinVC at million-variable scale.
+- Quantum annealing + GNN hybrids for TSP.
+- Our reduction graph provides the verified compilation layer that connects arbitrary problems to these solvers.
+
+### S8. Discussion & Conclusion (~1 page)
+
+**Generalizability:**
+- What other domains have the "Goldilocks" property? Candidates: compiler optimizations (peephole rules), algebraic identities, protocol verification lemmas.
+- The skill-based approach generalizes to any domain where tasks are homogeneous, formally specified, and independently verifiable.
+
+**Limitations:**
+- **n=1 threat to validity**: This is a single case study of a single project by a single maintainer. While we argue the methodology generalizes to other Goldilocks domains, the empirical evidence is from one project. We mitigate this by providing the ablation comparison (S6.1) and by identifying concrete candidate domains for future validation.
+- Requires upfront skill engineering — the maintainer must invest significant effort in writing and evolving skills.
+- Domain expertise embedded in skills doesn't transfer across domains (a reduction skill won't help with web development).
+- Git history mining has confounds: skills evolved during the project (addressed by stratification in S6.2).
+- The three-role model requires a knowledgeable maintainer; fully open-source contribution without oversight is not supported.
+
+**The human value proposition:**
+- Humans are not eliminated from the pipeline — they are repositioned. Creative work (which problems matter, which reductions are useful, what quality standards to enforce) remains human. Mechanical work (implementation, testing, documentation, review) is delegated to agents that also manage their own workflow.
+- This mirrors the broader trend identified in industry surveys: developers increasingly use AI but maintain active oversight on delegated tasks.
+
+**Future directions:**
+- Connecting to AlphaEvolve-style discovery: use agents to discover new reductions, then feed them into the verification pipeline.
+- Formal verification integration: replace round-trip tests with Lean/Coq proofs for the strongest guarantees.
+- Scaling the graph: can the pipeline maintain quality as the number of problems grows from 24 to 100+?
+
+## Page Budget
+
+| Section | Pages | Notes |
+|---------|-------|-------|
+| S1. Introduction | ~1.5 | |
+| S2. Why Reductions? | ~1 | Including Fig 1 (reduction graph) |
+| S3. System Architecture | ~1.5 | Trimmed; full trait details in supplementary |
+| S4. Skill-Based Decomposition | ~2 | Including Fig 3 (pipeline) + Table 1 |
+| S5. Verification Stack | ~1.5 | Including Fig 4 (pyramid) |
+| S6. Evaluation | ~2.5 | Ablation + git mining + case studies + Table 2 |
+| S7. Related Work | ~1 | |
+| S8. Discussion | ~1 | |
+| **Total** | **~12** | Page counts include embedded figures/tables for each section. References ~0.5 pages. Supplementary material (full trait hierarchy, proc macro details) is a separate appendix outside the page limit, per ICSE/ASE norms. |
+
+## Key Figures
+
+1. **Reduction graph** — 24 problem types, 42 variant nodes, 52 directed edges, color-coded by category. QUBO/ILP hubs visible. Caption distinguishes 40 implemented reductions from 12 inferred variant edges.
+2. **System architecture** — Key traits + compile-time validation flow (compact). Full hierarchy in supplementary.
+3. **Pipeline diagram** — Two-stage card flow: contributor opens issue → [Backlog] → maintainer moves to [Ready] → agent: project-pipeline [In Progress → review-agentic] → agent: review-pipeline [In Review] → maintainer merges [Done]. Human decisions highlighted in distinct color.
+4. **Verification pyramid** — 7 layers from type system (base) to documentation (top), each annotated with concrete error examples.
+
+## Key Tables
+
+1. **Skills inventory** — Each skill with: trigger condition, inputs, outputs, typical agent turns, first-attempt success rate.
+2. **Error taxonomy** — Error categories × which verification layer caught them. Demonstrates complementary coverage.
+
+## References
+
+Survey bibliography: `.claude/survey/agentic-coding-reductions/references.bib` (22 papers across 4 themes).
+
+## Non-Goals
+
+- This paper does NOT claim agents can discover new reductions (that's AlphaEvolve territory).
+- This paper does NOT provide formal verification proofs (pragmatic multi-layer approach instead).
+- This paper does NOT benchmark against SWE-Bench (different task structure; we argue for domain-specific evaluation).
+
+## Artifact Availability
+
+The code repository (including all skill files, git history, and test fixtures) will be made publicly available as a reproducibility artifact. The reduction graph can be explored interactively via the project's MCP server and CLI tool. This supports ICSE/ASE artifact evaluation tracks.