diff --git a/.deepreview b/.deepreview index 22e2296e..2ca3c797 100644 --- a/.deepreview +++ b/.deepreview @@ -194,6 +194,47 @@ requirements_traceability: Produce a structured review with Coverage Gaps, Test Stability Violations, Traceability Issues, and a Summary with PASS/FAIL verdicts. +requirement_file_format: + description: "Validate RFC 2119 compliance, unique IDs, and sequential numbering in requirement spec files." + match: + include: + - "specs/**/*-REQ-*.md" + review: + strategy: individual + instructions: | + Review this requirements specification file for format correctness. + + Check the following: + + 1. **RFC 2119 keywords**: Every requirement statement MUST use at least one + RFC 2119 keyword (MUST, MUST NOT, SHALL, SHALL NOT, SHOULD, SHOULD NOT, + MAY, REQUIRED, RECOMMENDED, OPTIONAL). Flag any numbered requirement + that lacks an RFC 2119 keyword — e.g., "The system generates a UUID" + should be "The system MUST generate a UUID." + + 2. **Unique requirement IDs**: Each section heading must follow the pattern + `### {PREFIX}-REQ-NNN.M: Title` where PREFIX matches the filename prefix + (e.g., JOBS-REQ for JOBS-REQ-001-*.md). Within each section, requirements + are numbered lists (1., 2., 3., ...). Flag any duplicate section IDs. + + 3. **Sequential numbering**: Within each section, numbered requirements + should be sequential without gaps (1, 2, 3 — not 1, 2, 4). Flag gaps + or out-of-order numbers. + + 4. **Section ID consistency**: The section ID prefix must match the file's + naming convention. For example, in `JOBS-REQ-001-mcp-workflow-tools.md`, + all sections should use `JOBS-REQ-001.X` (not `JOBS-REQ-002.X`). + + 5. **Testability**: Each requirement should be specific enough to be + verifiable — either by an automated test or a review rule. Flag vague + requirements that cannot be objectively evaluated (e.g., "The system + SHOULD be fast" — fast compared to what?). + + Output Format: + - PASS: All requirements are properly formatted. 
+ - FAIL: Issues found. List each with the section ID, requirement number, + and a concise description of the issue. + update_documents_relating_to_src_deepwork: description: "Ensure project documentation stays current when DeepWork source files, plugins, or platform content change." match: @@ -388,36 +429,33 @@ deepreview_config_quality: and a specific recommendation. job_schema_instruction_compatibility: - description: "Verify deepwork_jobs instruction files, templates, and examples are compatible with the job schema." + description: "Verify deepwork_jobs job.yml inline instructions are compatible with the job schema." match: include: - "src/deepwork/jobs/job.schema.json" - - "src/deepwork/standard_jobs/deepwork_jobs/steps/*.md" - - "src/deepwork/standard_jobs/deepwork_jobs/templates/*" - "src/deepwork/standard_jobs/deepwork_jobs/job.yml" + - "src/deepwork/standard_jobs/deepwork_reviews/job.yml" review: strategy: matches_together additional_context: unchanged_matching_files: true instructions: | - When the job schema or deepwork_jobs instruction files change, verify they + When the job schema or standard job definitions change, verify they are still compatible with each other. Read src/deepwork/jobs/job.schema.json to understand the current schema. - Then read each instruction file, template, and example in - src/deepwork/standard_jobs/deepwork_jobs/ and check: + Then read each standard job's job.yml and check: - 1. **Field references**: Every field name mentioned in prose instructions, - templates, or examples must exist in the schema at the correct level. - Pay special attention to root-level vs step-level fields — a field - that exists on steps may not exist at the root, and vice versa. + 1. **Field references**: Every field name referenced in inline step + instructions must exist in the schema at the correct level. + Pay special attention to step_arguments vs workflow vs step fields. 2. 
**Required vs optional**: If instructions say a field is required, verify the schema agrees. If instructions say a field is optional, verify the schema doesn't require it. - 3. **Schema structure**: Template files and examples that show YAML - structure must match the schema's property names and nesting. + 3. **Schema structure**: Any YAML examples shown in inline instructions + must match the schema's property names and nesting. 4. **Terminology consistency**: Instructions should use the same field names as the schema (e.g., if the schema uses @@ -425,6 +463,38 @@ job_schema_instruction_compatibility: should not call it "description" or "job_description"). Output Format: - - PASS: All instruction files are compatible with the schema. + - PASS: All job definitions are compatible with the schema. - FAIL: Incompatibilities found. List each with the file path, line reference, the incompatible content, and what the schema actually says. + +nix_claude_wrapper: + description: "Ensure flake.nix always wraps the claude command with the required plugin dirs." + match: + include: + - "flake.nix" + - ".envrc" + review: + strategy: matches_together + instructions: | + The nix dev shell must ensure that running `claude` locally automatically + loads the project's plugin directories via `--plugin-dir` flags. Verify: + + 1. **Wrapper exists**: flake.nix creates a wrapper (script or function) + that invokes the real `claude` binary with extra arguments. + + 2. **Required plugin dirs**: The wrapper MUST pass both of these + `--plugin-dir` flags: + - `--plugin-dir "$REPO_ROOT/plugins/claude"` + - `--plugin-dir "$REPO_ROOT/learning_agents"` + + 3. **PATH setup**: The wrapper must be discoverable — either via a + script placed on PATH (e.g. `.venv/bin/claude`) with `.envrc` + adding that directory to PATH, or via a shell function/alias. + + 4. **Real binary resolution**: The wrapper must resolve the real + `claude` binary correctly, avoiding infinite recursion (e.g. 
by + stripping the wrapper's directory from PATH before lookup). + + Output Format: + - PASS: The claude wrapper is correctly configured with both plugin dirs. + - FAIL: Describe what is missing or broken. diff --git a/.envrc b/.envrc index 3550a30f..f85094c8 100644 --- a/.envrc +++ b/.envrc @@ -1 +1,2 @@ use flake +PATH_add .venv/bin diff --git a/README.md b/README.md index b4dc77ec..f8f0db6e 100644 --- a/README.md +++ b/README.md @@ -167,13 +167,6 @@ For workflows that need to interact with websites, you can use any browser autom These known issues affect some early users. We're working on fixes, but in the meantime the workarounds below can help. -### Stop hooks firing unexpectedly - -Occasionally, especially after updating a job or running the `deepwork_jobs learn` process after completing a task, Claude will get confused about which workflow it's running checks for. For now, if stop hooks fire when they shouldn't, you can either: -- Ask claude `do we need to address any of these stop hooks or can we ignore them for now?` -- Ignore the stop hooks and keep going until the workflow steps are complete -- Run the `/clear` command to start a new context window (you'll have to re-run the job after this) - ### Claude "just does the task" instead of using DeepWork If Claude attempts to bypass the workflow and do the task on its own, tell it explicitly to use the skill.
You can also manually run the step command: @@ -198,8 +191,7 @@ your-project/ │ ├── tmp/ # Session state (created lazily) │ └── jobs/ # Job definitions │ └── job_name/ -│ ├── job.yml # Job metadata -│ └── steps/ # Step instructions +│ └── job.yml # Job definition (self-contained with inline instructions) ``` diff --git a/claude.md b/claude.md index 5a30d2a3..c2df9de1 100644 --- a/claude.md +++ b/claude.md @@ -57,7 +57,7 @@ deepwork/ │ │ │ ├── deepwork/SKILL.md │ │ │ ├── review/SKILL.md │ │ │ └── configure_reviews/SKILL.md -│ │ ├── hooks/ # hooks.json, post_commit_reminder.sh, post_compact.sh +│ │ ├── hooks/ # hooks.json, post_commit_reminder.sh, post_compact.sh, startup_context.sh │ │ └── .mcp.json # MCP server config │ └── gemini/ # Gemini CLI extension │ └── skills/deepwork/SKILL.md diff --git a/doc/architecture.md b/doc/architecture.md index 0a20b4b9..30eafeab 100644 --- a/doc/architecture.md +++ b/doc/architecture.md @@ -56,8 +56,7 @@ deepwork/ # DeepWork tool repository │ │ ├── tools.py # MCP tool implementations │ │ ├── state.py # Workflow session state management │ │ ├── schemas.py # Pydantic models for I/O -│ │ ├── quality_gate.py # Quality gate with review agent -│ │ └── claude_cli.py # Claude CLI subprocess wrapper +│ │ └── quality_gate.py # Quality gate via DeepWork Reviews │ ├── hooks/ # Hook system and cross-platform wrappers │ │ ├── wrapper.py # Cross-platform input/output normalization │ │ ├── claude_hook.sh # Shell wrapper for Claude Code @@ -94,7 +93,7 @@ deepwork/ # DeepWork tool repository │ │ │ ├── deepwork/SKILL.md │ │ │ ├── review/SKILL.md │ │ │ └── configure_reviews/SKILL.md -│ │ ├── hooks/ # hooks.json, post_commit_reminder.sh, post_compact.sh +│ │ ├── hooks/ # hooks.json, post_commit_reminder.sh, post_compact.sh, startup_context.sh │ │ └── .mcp.json # MCP server config │ └── gemini/ # Gemini CLI extension │ └── skills/deepwork/SKILL.md @@ -114,7 +113,7 @@ The CLI has four active commands: `serve`, `hook`, `review`, and `jobs`. 
Depreca Starts the MCP server for workflow management: ```bash -deepwork serve --path . --external-runner claude +deepwork serve --path . ``` The serve command: @@ -185,13 +184,11 @@ my-project/ # User's project (target) │ ├── tmp/ # Temporary session state (gitignored, created lazily) │ └── jobs/ # Job definitions │ ├── deepwork_jobs/ # Core job (auto-discovered from package) -│ │ ├── job.yml -│ │ └── steps/ +│ │ └── job.yml │ ├── competitive_research/ -│ │ ├── job.yml # Job metadata -│ │ └── steps/ +│ │ └── job.yml # Job definition (steps are inline) │ └── ad_campaign/ -│ └── ... +│ └── job.yml ├── (rest of user's project files) └── README.md ``` @@ -200,213 +197,138 @@ my-project/ # User's project (target) **Note**: Work outputs are created directly in the project on dedicated Git branches (e.g., `deepwork/competitive_research-acme-2026-01-11`). The branch naming convention is `deepwork/[job_name]-[instance]-[date]`. -## Job Definition Example +## Job Definition Format + +Job definitions use `step_arguments` to declare data that flows between steps, and `workflows` to define step sequences with inline instructions. There are no separate step instruction files, no root-level `steps[]`, and no `version`, `dependencies`, `hooks`, or `exposed/hidden` fields. + +### Key Concepts + +- **`step_arguments`**: Named data items (strings or file paths) passed between steps. Each argument has a `name`, `description`, `type` (`string` or `file_path`), optional `review` block, and optional `json_schema`. +- **`workflows`**: Named sequences of steps with inline instructions. Each workflow has a `summary`, optional `agent`, optional `common_job_info_provided_to_all_steps_at_runtime`, `steps`, and optional `post_workflow_instructions`. +- **Steps**: Each step has `inputs` and `outputs` that reference `step_arguments` by name. Step logic is defined via `instructions` (inline string) or `sub_workflow` (delegates to another workflow). 
+- **Reviews on outputs**: The `review` block on step arguments or step outputs uses the same format as `.deepreview` review rules. These are applied *in addition to* any `.deepreview` file-defined rules. +- **`process_quality_attributes`**: Optional per-step object where keys are attribute names and values are descriptions. These review the *process and work* done (not individual output files). + +### Example: `job.yml` `.deepwork/jobs/competitive_research/job.yml`: ```yaml name: competitive_research -version: "1.0.0" summary: "Systematic competitive analysis workflow" -common_job_info_provided_to_all_steps_at_runtime: | - A comprehensive workflow for analyzing competitors in your market segment. - Designed for product teams conducting quarterly competitive analysis. - -# Workflows define named sequences of steps that form complete processes. -# Steps not in any workflow are "standalone skills" that can be run anytime. -# Steps can be listed as simple strings (sequential) or arrays (concurrent execution). -# -# Concurrent step patterns: -# 1. Multiple different steps: [step_a, step_b] - run both in parallel -# 2. Single step with multiple instances: [fetch_campaign_data] - indicates this -# step should be run in parallel for each instance (e.g., each ad campaign) -# -# Use a single-item array when a step needs multiple parallel instances, like -# "fetch performance data" that runs once per campaign in an ad reporting job. + +step_arguments: + - name: market_segment + description: "The market segment to analyze" + type: string + - name: competitors_list + description: "List of competitors with descriptions" + type: file_path + review: + instructions: "Verify at least 5 direct and 3 indirect competitors are listed with descriptions." 
+ strategy: individual + - name: primary_findings + description: "Primary research findings document" + type: file_path + - name: secondary_findings + description: "Secondary research findings document" + type: file_path + - name: comparison_matrix + description: "Detailed comparison matrix" + type: file_path + - name: positioning_strategy + description: "Market positioning strategy" + type: file_path + workflows: - - name: full_analysis + full_analysis: summary: "Complete competitive analysis from identification through positioning" + common_job_info_provided_to_all_steps_at_runtime: | + A comprehensive workflow for analyzing competitors in your market segment. + Designed for product teams conducting quarterly competitive analysis. steps: - - identify_competitors - # Steps in an array execute concurrently (as "Background Tasks") - - [primary_research, secondary_research] - - comparative_report - - positioning - -steps: - - id: identify_competitors - name: "Identify Competitors" - description: "Research and list direct and indirect competitors" - instructions_file: steps/identify_competitors.md - inputs: - - name: market_segment - description: "The market segment to analyze" - - name: product_category - description: "Product category" - outputs: - competitors.md: - type: file - description: "List of competitors with descriptions" - required: true - dependencies: [] - - - id: primary_research - name: "Primary Research" - description: "Analyze competitors' self-presentation" - instructions_file: steps/primary_research.md - inputs: - - file: competitors.md - from_step: identify_competitors - outputs: - primary_research.md: - type: file - description: "Primary research findings" - required: true - dependencies: - - identify_competitors - - - id: secondary_research - name: "Secondary Research" - description: "Research third-party perspectives on competitors" - instructions_file: steps/secondary_research.md - inputs: - - file: competitors.md - from_step: 
identify_competitors - - file: primary_research.md - from_step: primary_research - outputs: - - secondary_research.md - dependencies: - - primary_research - - - id: comparative_report - name: "Comparative Report" - description: "Create detailed comparison matrix" - instructions_file: steps/comparative_report.md - inputs: - - file: primary_research.md - from_step: primary_research - - file: secondary_research.md - from_step: secondary_research - outputs: - - comparison_matrix.md - - strengths_weaknesses.md - dependencies: - - primary_research - - secondary_research - - - id: positioning - name: "Market Positioning" - description: "Define positioning strategy against competitors" - instructions_file: steps/positioning.md - inputs: - - file: comparison_matrix.md - from_step: comparative_report - outputs: - - positioning_strategy.md - dependencies: - - comparative_report + - name: identify_competitors + instructions: | + Research and list direct and indirect competitors in the given market segment. + Create a document listing 5-10 direct competitors and 3-5 indirect competitors, + each with website, description, and value proposition. + inputs: + market_segment: + required: true + outputs: + competitors_list: + required: true + process_quality_attributes: + research_thoroughness: "Research used multiple sources (web search, analyst reports, review sites)" + + - name: primary_research + instructions: | + Analyze each competitor's self-presentation: website messaging, product pages, + pricing, and positioning. Document findings for each competitor. + inputs: + competitors_list: + required: true + outputs: + primary_findings: + required: true + + - name: secondary_research + instructions: | + Research third-party perspectives on competitors: analyst reports, reviews, + press coverage, and community sentiment. 
+ inputs: + competitors_list: + required: true + primary_findings: + required: true + outputs: + secondary_findings: + required: true + + - name: comparative_report + instructions: | + Create a detailed comparison matrix and strengths/weaknesses analysis + based on all research gathered. + inputs: + primary_findings: + required: true + secondary_findings: + required: true + outputs: + comparison_matrix: + required: true + + - name: positioning + instructions: | + Define a positioning strategy based on the competitive landscape analysis. + inputs: + comparison_matrix: + required: true + outputs: + positioning_strategy: + required: true + + post_workflow_instructions: | + The competitive analysis is complete. Create a PR with all artifacts + for team review. ``` -### Lifecycle Hooks in Job Definitions +### Sub-Workflow References -Steps can define lifecycle hooks that trigger at specific points during execution. Hooks are defined using generic event names that are mapped to platform-specific names by adapters: +Steps can delegate to other workflows instead of providing inline instructions: ```yaml steps: - - id: build_report - name: "Build Report" - description: "Generate the final report" - instructions_file: steps/build_report.md + - name: run_deep_analysis + sub_workflow: + workflow_name: deep_analysis + workflow_job: competitive_research # optional, defaults to current job + inputs: + competitors_list: + required: true outputs: - - report.md - hooks: - after_agent: # Triggers after agent finishes (Claude: "Stop") - - prompt: | - Verify the report includes all required sections: - - Executive summary - - Data analysis - - Recommendations - - script: hooks/validate_report.sh - before_tool: # Triggers before tool use (Claude: "PreToolUse") - - prompt: "Confirm tool execution is appropriate" -``` - -**Supported Lifecycle Events**: -- `after_agent` - Triggered after the agent finishes responding (quality validation) -- `before_tool` - Triggered before the agent uses a 
tool -- `before_prompt` - Triggered when user submits a new prompt - -**Hook Action Types**: -- `prompt` - Inline prompt text -- `prompt_file` - Path to a file containing the prompt -- `script` - Path to a shell script - -**Note**: The deprecated `stop_hooks` field is still supported for backward compatibility but maps to `hooks.after_agent`. - -### Step Instructions Example - -`.deepwork/jobs/competitive_research/steps/identify_competitors.md`: - -```markdown -# Identify Competitors - -## Objective -Research and create a comprehensive list of direct and indirect competitors in the specified market segment. - -## Task Description -You will identify companies that compete with us in {{market_segment}} for {{product_category}}. - -### Direct Competitors -Companies offering similar products/services to the same customer base: -- List 5-10 companies -- Include company name, website, and brief description -- Note their primary value proposition - -### Indirect Competitors -Companies solving the same problem with different approaches: -- List 3-5 companies -- Explain how they're indirect competitors - -## Output Format -Create `competitors.md` with this structure: - -```markdown -# Competitor Analysis: {{market_segment}} - -## Direct Competitors - -### [Company Name] -- **Website**: [URL] -- **Description**: [Brief description] -- **Value Proposition**: [What they claim] -- **Target Market**: [Who they serve] - -[Repeat for each direct competitor] - -## Indirect Competitors - -### [Company Name] -- **Website**: [URL] -- **Alternative Approach**: [How they differ] -- **Why Relevant**: [Why they compete with us] - -[Repeat for each indirect competitor] -``` - -## Research Tips -1. Start with web searches for "[product category] companies" -2. Check industry analyst reports (Gartner, Forrester) -3. Look at review sites (G2, Capterra) -4. Check LinkedIn for similar companies -5. 
Use Crunchbase or similar databases - -## Quality Checklist -- [ ] At least 5 direct competitors identified -- [ ] At least 3 indirect competitors identified -- [ ] Each competitor has website and description -- [ ] Value propositions are clearly stated -- [ ] No duplicate entries + primary_findings: + required: true ``` ## Workflow Execution via MCP @@ -464,88 +386,32 @@ This section describes how AI agents (like Claude Code) actually execute jobs us PR created: https://github.com/user/project/pull/123 ``` -## How Claude Code Executes Skills - -When user types `/competitive_research.identify_competitors`: +## How Agents Execute Workflows -1. **Skill Discovery**: - - Claude Code scans `.claude/skills/` directory - - Finds `competitive_research.identify_competitors.md` - - Loads the skill definition +Agents use the `/deepwork` skill which instructs them to interact with MCP tools: -2. **Context Loading**: - - Skill file contains embedded instructions - - References to job definition and step files - - Claude reads these files to understand the full context +1. **Workflow Discovery**: Agent calls `get_workflows` to list available jobs and workflows +2. **Workflow Start**: Agent calls `start_workflow` with goal, job name, workflow name, and optional inputs +3. **Step Execution**: Agent follows the inline instructions returned by the MCP server +4. **Checkpoint**: Agent calls `finished_step` with outputs and work summary +5. **Quality Gate**: MCP server runs DeepWork Reviews on outputs, returns feedback or advances +6. **Repeat**: Agent continues until `workflow_complete` -3. **Execution**: - - Claude follows the instructions in the skill - - Uses its tools (Read, Write, WebSearch, WebFetch, etc.) - - Creates outputs in the specified format - -4. **State Management** (via filesystem): - - Work branch name encodes the job instance - - Output files track progress - - Git provides version control and resumability - -5. 
**No DeepWork Runtime**: - - DeepWork CLI is NOT running during execution - - Everything happens through Claude Code's native execution - - Skills are just markdown instruction files that Claude interprets +All state is managed by the MCP server in `.deepwork/tmp/sessions/`. The agent never reads session files directly. ## Context Passing Between Steps -Since there's no DeepWork runtime process, context is passed through: - ### 1. Filesystem (Primary Mechanism) -On a work branch like `deepwork/competitive_research-acme-2026-01-11`, outputs are created in the project: +On a work branch like `deepwork/competitive_research-acme-2026-01-11`, outputs are created in the project. Step arguments with `type: file_path` reference files on disk; `type: string` values are passed inline through the MCP server. -``` -(project root on work branch) -├── competitors.md ← Step 1 output -├── primary_research.md ← Step 2 output -├── competitor_profiles/ ← Step 2 output -│ ├── acme_corp.md -│ ├── widgets_inc.md -│ └── ... -├── secondary_research.md ← Step 3 output -├── comparison_matrix.md ← Step 4 output -└── positioning_strategy.md ← Step 5 output -``` - -Each command instructs Claude to: -- Read specific input files from previous steps -- Write specific output files for this step -- All on the same work branch +### 2. Step Instructions -### 2. Skill Instructions - -Each skill file explicitly states its dependencies: - -```markdown -### Prerequisites -This step requires outputs from: -- Step 1 (identify_competitors): competitors.md -- Step 2 (primary_research): primary_research.md - -### Your Task -Conduct web research on secondary sources for each competitor identified in competitors.md. -``` +Each step's instructions (inline in job.yml) describe what inputs to read and what outputs to produce. The MCP server automatically includes input values/references when returning step instructions. ### 3. 
Git History -When working on similar jobs: -- User: "Do competitive research for Acme Corp, similar to our Widget Corp analysis" -- Claude can read old existing branches like `deepwork/competitive_research-widget-corp-2024-01-05` from git history -- Uses it as a template for style, depth, format - -### 4. No Environment Variables Needed - -Unlike the original architecture, we don't need special environment variables because: -- The work branch name encodes the job instance -- File paths are explicit in skill instructions -- Git provides all the state management +When working on similar jobs, agents can read old branches from git history to use as templates for style, depth, and format. ## Branching Strategy @@ -652,8 +518,8 @@ Claude: I'll analyze this conversation for DeepWork job executions... 2. Output format for competitor_profiles/ not specified Improvements made: - ✓ Updated steps/primary_research.md with source prioritization guidance - ✓ Added output format example to steps/primary_research.md + ✓ Updated job.yml step instructions with source prioritization guidance + ✓ Added output format example to primary_research step instructions Bespoke learnings captured: ✓ Created AGENTS.md with project-specific notes about this competitive research instance @@ -663,48 +529,16 @@ Claude: I'll analyze this conversation for DeepWork job executions... This standalone skill can be run anytime after executing a job to capture learnings and improve instructions. 
-### Template System - -Templates are Markdown files with variable interpolation: +### Step Instructions at Runtime -```markdown -# {{STEP_NAME}} +When `start_workflow` or `finished_step` returns step instructions to the agent, the MCP server assembles them from the job definition: -## Objective -{{STEP_DESCRIPTION}} +- **Common job info**: The `common_job_info_provided_to_all_steps_at_runtime` block from the workflow +- **Inline instructions**: The `instructions` string from the step definition +- **Inputs**: The values/file paths for all declared step inputs, resolved from previous step outputs +- **Expected outputs**: The list of outputs the step must produce, with descriptions from `step_arguments` -## Context -You are working on: {{JOB_NAME}} -Current step: {{STEP_ID}} ({{STEP_NUMBER}}/{{TOTAL_STEPS}}) - -## Inputs -{% for input in INPUTS %} -- Read `{{input.file}}` for {{input.description}} -{% endfor %} - -## Your Task -[Detailed instructions for the AI agent...] - -## Output Format -Create the following files: -{% for output in OUTPUTS %} -### {{output.file}} -{{output.template}} -{% endfor %} - -## Quality Checklist -- [ ] Criterion 1 -- [ ] Criterion 2 - -## Examples -{{EXAMPLES}} -``` - -Variables populated by runtime: -- Job metadata: `{{JOB_NAME}}`, `{{JOB_DESCRIPTION}}` -- Step metadata: `{{STEP_ID}}`, `{{STEP_NAME}}`, `{{STEP_NUMBER}}` -- Context: `{{INPUTS}}`, `{{OUTPUTS}}`, `{{DEPENDENCIES}}` -- Examples: `{{EXAMPLES}}` (loaded from `examples/` directory if present) +There is no separate template engine or Jinja2 rendering. Instructions are composed directly from the job.yml data at runtime. 
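+A minimal sketch of how this composition could look. This is illustrative only — `Step` and `assemble_step_instructions` are hypothetical names, not the server's actual classes or functions — but it follows the assembly order described above (common job info, then inline instructions, then resolved inputs, then expected outputs):
+
+```python
+from dataclasses import dataclass, field
+
+
+@dataclass
+class Step:
+    """A step as declared inline in job.yml (simplified)."""
+    name: str
+    instructions: str
+    inputs: dict[str, dict] = field(default_factory=dict)
+    outputs: dict[str, dict] = field(default_factory=dict)
+
+
+def assemble_step_instructions(
+    common_info: str,
+    step: Step,
+    resolved_inputs: dict[str, str],
+    output_descriptions: dict[str, str],
+) -> str:
+    """Compose the text returned to the agent directly from job.yml data."""
+    parts = [common_info.strip(), step.instructions.strip()]
+    if resolved_inputs:
+        # Input values or file paths, resolved from previous step outputs.
+        lines = [f"- {name}: {value}" for name, value in resolved_inputs.items()]
+        parts.append("Inputs:\n" + "\n".join(lines))
+    if step.outputs:
+        # Output names with their descriptions from step_arguments.
+        lines = [
+            f"- {name}: {output_descriptions.get(name, '')}"
+            for name in step.outputs
+        ]
+        parts.append("Expected outputs:\n" + "\n".join(lines))
+    return "\n\n".join(parts)
+```
+
+Because every piece comes straight from the parsed job.yml, no template engine is needed — string concatenation over the job definition suffices.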
--- @@ -715,26 +549,30 @@ Variables populated by runtime: ``` tests/ ├── unit/ # Unit tests for core components -│ ├── test_job_parser.py -│ ├── test_registry.py -│ ├── test_runtime_engine.py -│ └── test_template_renderer.py +│ ├── jobs/ +│ │ ├── test_parser.py # Job parser and dataclasses +│ │ ├── test_discovery.py # Job folder discovery +│ │ └── mcp/ +│ │ ├── test_tools.py # MCP tool implementations +│ │ ├── test_state.py # State management +│ │ ├── test_quality_gate.py # Quality gate (DeepWork Reviews) +│ │ ├── test_schemas.py # Pydantic models +│ │ ├── test_server.py # Server creation +│ │ └── test_async_interface.py +│ ├── cli/ +│ │ └── test_jobs_get_stack.py +│ ├── review/ # DeepWork Reviews tests +│ └── test_validation.py # Schema validation ├── integration/ # Integration tests -│ ├── test_job_import.py -│ ├── test_workflow_execution.py -│ └── test_git_integration.py +│ └── test_quality_gate_integration.py ├── e2e/ # End-to-end tests -│ ├── test_full_workflow.py -│ └── test_multi_platform.py -├── fixtures/ # Test data -│ ├── jobs/ -│ │ ├── simple_job/ -│ │ └── complex_job/ -│ ├── templates/ -│ └── mock_responses/ -└── mocks/ # Mock AI agent responses - ├── claude_mock.py - └── gemini_mock.py +│ └── test_claude_code_integration.py +└── fixtures/ # Test data + └── jobs/ + ├── simple_job/ + ├── complex_job/ + ├── fruits/ + └── job_with_doc_spec/ ``` ### Test Strategy @@ -761,9 +599,6 @@ Use fixtures to provide test data. 
def test_large_job_parsing(): """Ensure parser handles jobs with 50+ steps""" -def test_template_rendering_performance(): - """Benchmark template rendering with large datasets""" - def test_git_operations_at_scale(): """Test with repositories containing 100+ work branches""" ``` @@ -839,12 +674,16 @@ quality_criteria: ### Using Doc Specs in Jobs -Reference doc specs in job.yml outputs: +Reference doc specs in job.yml step arguments: ```yaml -outputs: - - file: reports/monthly_spending.md - doc_spec: .deepwork/doc_specs/monthly_aws_report.md +step_arguments: + - name: monthly_spending_report + description: "Monthly AWS spending report" + type: file_path + json_schema: .deepwork/doc_specs/monthly_aws_report_schema.json + review: + instructions: "Verify the report meets the doc spec quality criteria." ``` ### How Doc Specs Are Used at Runtime @@ -896,10 +735,6 @@ See `doc/doc-specs.md` for complete documentation. - **Rationale**: Transparent, auditable, reviewable, collaborative - **Alternatives**: Database (opaque), JSON files (no versioning) -### Template Engine: Jinja2 (dev dependency only) -- **Rationale**: Industry standard, powerful, well-documented; used in development tooling, not at runtime -- **Alternatives**: Mustache (too simple), custom (NIH syndrome) - ### Validation: JSON Schema + Custom Scripts - **Rationale**: Flexible, extensible, supports both structure and semantics - **Alternatives**: Only custom scripts (inconsistent), only schemas (limited) @@ -979,31 +814,35 @@ Begins a new workflow session. 
- `goal: str` - What the user wants to accomplish - `job_name: str` - Name of the job - `workflow_name: str` - Name of the workflow within the job -- `instance_id: str | None` - Optional identifier (e.g., "acme", "q1-2026") +- `inputs: dict[str, str | list[str]] | None` - Inputs for the first step (file paths for `file_path` type, strings for `string` type) +- `session_id: str` - Claude Code session ID (required) +- `agent_id: str | None` - Claude Code agent ID for sub-agent scoping -**Returns**: Session ID, branch name, first step instructions +**Returns**: Session ID, branch name, first step instructions (including resolved inputs) #### 3. `finished_step` Reports step completion and gets next instructions. **Parameters**: -- `outputs: dict[str, str | list[str]]` - Map of output names to file path(s) -- `notes: str | None` - Optional notes about work done +- `outputs: dict[str, str | list[str]]` - Map of output names to file path(s) or string values +- `work_summary: str | None` - Summary of the work done in the step - `quality_review_override_reason: str | None` - If provided, skips quality review -- `session_id: str | None` - Target a specific workflow session +- `session_id: str` - Claude Code session ID (required) +- `agent_id: str | None` - Claude Code agent ID for sub-agent scoping **Returns**: -- `status: "needs_work" | "next_step" | "workflow_complete"` -- If `needs_work`: feedback from quality gate, failed criteria -- If `next_step`: next step instructions -- If `workflow_complete`: summary of all outputs +- `status: "needs_review" | "next_step" | "workflow_complete"` +- If `needs_review`: review instructions in the same format as the `/review` skill, with guidance on running reviews and calling `mark_review_as_passed` or fixing issues +- If `next_step`: next step instructions (with resolved inputs) +- If `workflow_complete`: summary of all outputs, plus `post_workflow_instructions` from the workflow definition #### 4. 
`abort_workflow` Aborts the current workflow and returns to the parent (if nested). **Parameters**: - `explanation: str` - Why the workflow is being aborted -- `session_id: str | None` - Target a specific workflow session +- `session_id: str` - Claude Code session ID (required) +- `agent_id: str | None` - Claude Code agent ID for sub-agent scoping **Returns**: Aborted workflow info, resumed parent info (if any), current stack @@ -1012,15 +851,15 @@ Navigates back to a prior step, clearing progress from that step onward. **Parameters**: - `step_id: str` - ID of the step to navigate back to -- `session_id: str | None` - Target a specific workflow session +- `session_id: str` - Claude Code session ID (required) +- `agent_id: str | None` - Claude Code agent ID for sub-agent scoping **Returns**: `begin_step` (step info for the target step), `invalidated_steps` (step IDs whose progress was cleared), `stack` (current workflow stack) **Behavior**: - Validates the target step exists in the workflow -- Rejects forward navigation (target entry index > current entry index) +- Rejects forward navigation (target step index > current step index) - Clears session tracking state for all steps from target onward (files on disk are not deleted) -- For concurrent entries, navigates to the first step in the entry - Marks the target step as started #### 6. `get_review_instructions` @@ -1059,65 +898,67 @@ Cleanup between runs deletes only `.md` files, preserving `.passed` markers acro ### State Management (`jobs/mcp/state.py`) -Manages workflow session state persisted to `.deepwork/tmp/session_[id].json`: +Manages workflow session state persisted to `.deepwork/tmp/sessions//session-/state.json`. Sub-agents get isolated stacks in `agent_.json` alongside the main state file. ```python class StateManager: - async def create_session(...) 
-> WorkflowSession - def resolve_session(session_id=None) -> WorkflowSession - async def start_step(step_id, session_id=None) -> None - async def complete_step(step_id, outputs, notes, session_id=None) -> None - async def advance_to_step(step_id, entry_index, session_id=None) -> None - async def go_to_step(step_id, entry_index, invalidate_step_ids, session_id=None) -> None - async def complete_workflow(session_id=None) -> None - async def abort_workflow(explanation, session_id=None) -> tuple - async def record_quality_attempt(step_id, session_id=None) -> int - def get_all_outputs(session_id=None) -> dict - def get_stack() -> list[StackEntry] + def __init__(self, project_root: Path, platform: str) + async def create_session(session_id, ..., agent_id=None) -> WorkflowSession + def resolve_session(session_id, agent_id=None) -> WorkflowSession + async def start_step(session_id, step_id, agent_id=None) -> None + async def complete_step(session_id, step_id, outputs, work_summary, agent_id=None) -> None + async def advance_to_step(session_id, step_id, step_index, agent_id=None) -> None + async def go_to_step(session_id, step_id, step_index, invalidate_step_ids, agent_id=None) -> None + async def complete_workflow(session_id, agent_id=None) -> None + async def abort_workflow(session_id, explanation, agent_id=None) -> tuple + async def record_quality_attempt(session_id, step_id, agent_id=None) -> int + def get_all_outputs(session_id, agent_id=None) -> dict + def get_stack(session_id, agent_id=None) -> list[StackEntry] + def get_stack_depth(session_id, agent_id=None) -> int ``` Session state includes: - Session ID and timestamps - Job/workflow/instance identification -- Current step and entry index -- Per-step progress (started_at, completed_at, outputs, quality_attempts) +- Current step and step index +- Per-step progress (started_at, completed_at, outputs, work_summary, quality_attempts) ### Quality Gate (`jobs/mcp/quality_gate.py`) -Evaluates step outputs against quality 
criteria: +The quality gate integrates with the DeepWork Reviews infrastructure rather than invoking a separate Claude CLI subprocess. When `finished_step` is called: -```python -class QualityGate: - async def evaluate_reviews( - reviews: list[dict], - outputs: dict[str, str | list[str]], - output_specs: dict[str, str], - project_root: Path, - notes: str | None = None, - ) -> list[ReviewResult] - - async def build_review_instructions_file( - reviews: list[dict], - outputs: dict[str, str | list[str]], - output_specs: dict[str, str], - project_root: Path, - notes: str | None = None, - ) -> str -``` +1. **JSON schema validation**: If a `json_schema` is defined for any file argument, the output file is validated against it first. Validation errors cause immediate failure before any reviews run. + +2. **Build dynamic review rules**: For each `review` block defined on step outputs (either inline on the step or on the `step_argument`), a `ReviewRule` object is constructed dynamically. The `common_job_info_provided_to_all_steps_at_runtime` is included in review instructions for context. + +3. **Process quality attributes**: If the step defines `process_quality_attributes`, a review is created that evaluates the `work_summary` against those criteria. The reviewer is instructed to tell the agent to fix its work or the `work_summary` if issues are found. + +4. **Merge with `.deepreview` rules**: The dynamically built rules are merged with any `.deepreview` file-defined rules that match the output files. The changed file list comes from the `outputs` parameter (not git diff). -The quality gate supports two modes: -- **External runner** (`evaluate_reviews`): Invokes Claude Code via subprocess to evaluate each review, returns list of failed `ReviewResult` objects -- **Self-review** (`build_review_instructions_file`): Generates a review instructions file for the agent to spawn a subagent for self-review +5. 
**Apply review strategies**: Review strategies (`individual`, `matches_together`, etc.) work normally on the merged rule set. + +6. **Honor pass caching**: Reviews that have already passed (via `mark_review_as_passed`) are skipped. + +7. **Return review instructions**: If there are reviews to run, they are returned to the agent in the same format as the `/review` skill, along with instructions on how to run them and call `mark_review_as_passed` or fix issues. The agent then runs reviews itself until all pass. ### Schemas (`jobs/mcp/schemas.py`) Pydantic models for all tool inputs and outputs: - `StartWorkflowInput`, `FinishedStepInput`, `AbortWorkflowInput`, `GoToStepInput` - `GetWorkflowsResponse`, `StartWorkflowResponse`, `FinishedStepResponse`, `AbortWorkflowResponse`, `GoToStepResponse` -- `ActiveStepInfo`, `ExpectedOutput`, `ReviewInfo`, `ReviewResult`, `StackEntry` +- `ActiveStepInfo`, `StepInputInfo`, `ExpectedOutput`, `StackEntry` - `JobInfo`, `WorkflowInfo`, `JobLoadErrorInfo` - `WorkflowSession`, `StepProgress` -- `QualityGateResult`, `QualityCriteriaResult` + +### Parser Dataclasses (`jobs/parser.py`) + +The parser produces these dataclasses from `job.yml`: +- `ReviewBlock` - Review instructions (same format as `.deepreview` rules) +- `StepArgument` - Named data item with type, description, optional review and json_schema +- `StepInputRef` - Reference to a step argument as an input (with `required` flag) +- `StepOutputRef` - Reference to a step argument as an output (with `required` flag, optional `review`) +- `SubWorkflowRef` - Reference to another workflow (with `workflow_name`, optional `workflow_job`) +- `WorkflowStep` - A step within a workflow (name, instructions or sub_workflow, inputs, outputs, process_quality_attributes) ## MCP Server Registration @@ -1128,7 +969,7 @@ The plugin's `.mcp.json` registers the MCP server automatically: "mcpServers": { "deepwork": { "command": "uvx", - "args": ["deepwork", "serve", "--path", ".", "--external-runner", 
"claude"] + "args": ["deepwork", "serve", "--path", "."] } } } @@ -1148,7 +989,7 @@ Execute multi-step workflows with quality gate checkpoints. 2. Start a workflow: Call `start_workflow` with your goal 3. Execute steps: Follow the instructions returned 4. Checkpoint: Call `finished_step` with your outputs -5. Iterate or continue: Handle needs_work, next_step, or workflow_complete +5. Iterate or continue: Handle needs_review, next_step, or workflow_complete ``` ## MCP Execution Flow @@ -1159,43 +1000,28 @@ Execute multi-step workflows with quality gate checkpoints. 2. **Agent calls `start_workflow`** - MCP server creates session, generates branch name - - Returns first step instructions and expected outputs + - Returns first step instructions with resolved inputs and expected outputs 3. **Agent executes step** - - Follows step instructions - - Creates output files + - Follows inline step instructions + - Creates output files / produces string values 4. **Agent calls `finished_step`** - - MCP server evaluates outputs against quality criteria (if configured) - - If `needs_work`: returns feedback for agent to fix issues - - If `next_step`: returns next step instructions - - If `workflow_complete`: workflow finished + - MCP server validates outputs, runs json_schema checks, then runs DeepWork Reviews + - If `needs_review`: returns review instructions for agent to run reviews (same format as `/review` skill) + - If `next_step`: returns next step instructions with resolved inputs + - If `workflow_complete`: returns summary and `post_workflow_instructions` 5. **Loop continues until workflow complete** -## Quality Gate - -Quality gate is enabled by default and uses Claude Code to evaluate step outputs -against quality criteria. The command is constructed internally with proper flag -ordering (see `doc/reference/calling_claude_in_print_mode.md`). 
- -To disable quality gate: - -```bash -deepwork serve --no-quality-gate -``` - ## Serve Command Start the MCP server manually: ```bash -# Basic usage (quality gate enabled by default) +# Basic usage deepwork serve -# With quality gate disabled -deepwork serve --no-quality-gate - # For a specific project deepwork serve --path /path/to/project @@ -1220,7 +1046,6 @@ deepwork serve --transport sse --port 8000 - [Claude Code Documentation](https://claude.com/claude-code) - [Git Workflows](https://www.atlassian.com/git/tutorials/comparing-workflows) - [JSON Schema](https://json-schema.org/) -- [Jinja2 Documentation](https://jinja.palletsprojects.com/) - [Model Context Protocol](https://modelcontextprotocol.io/) - [FastMCP Documentation](https://github.com/jlowin/fastmcp) diff --git a/doc/job_yml_guidance.md b/doc/job_yml_guidance.md new file mode 100644 index 00000000..025d179f --- /dev/null +++ b/doc/job_yml_guidance.md @@ -0,0 +1,297 @@ +# job.yml Field Guidance + +This document explains what each `job.yml` field *does* at runtime. It is not a schema reference -- it describes behavioral impact so you can make informed authoring decisions. For the authoritative schema, see `src/deepwork/jobs/job.schema.json`. + +--- + +## Root Fields + +### `name` + +The job's unique identifier (pattern: `^[a-z][a-z0-9_]*$`). This is the value agents pass as `job_name` to `start_workflow`. It also determines the directory name under `.deepwork/jobs/` and appears in `get_workflows` output. + +### `summary` + +A one-line description (max 200 chars). Shown in `get_workflows` output so agents can decide which job to use. Write it as an action -- "Analyze competitors and produce a positioning report" -- not as a label. + +### `step_arguments` + +The shared data vocabulary. Every piece of data that flows between steps must be declared here. Steps reference these by name in their `inputs` and `outputs` maps. Think of step_arguments as the schema for the pipeline's data bus. 
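As a sketch, a minimal `step_arguments` declaration using the three required fields might look like this (the argument names and descriptions are hypothetical):

```yaml
step_arguments:
  - name: research_notes
    description: "Raw notes gathered during the research step"
    type: file_path
  - name: report_audience
    description: "Who the final report is written for"
    type: string
```

Steps then refer to `research_notes` and `report_audience` by name in their `inputs` and `outputs` maps.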
+ +### `workflows` + +A map of named workflows, each defining a sequence of steps. A job can have multiple workflows (e.g., `create` and `repair`). Workflow names are the `workflow_name` parameter in `start_workflow`. + +--- + +## step_arguments: The Data Contract + +Each step_argument defines a named piece of data with three required fields (`name`, `description`, `type`) and two optional fields (`review`, `json_schema`). + +### `name` + +Unique identifier that steps reference in their `inputs` and `outputs` maps. Can contain letters, numbers, dots, slashes, hyphens, and underscores -- so you can use file-like names like `job.yml` or `.deepwork/tmp/test_feedback.md`. + +### `description` + +Shown to the agent when it needs to produce or consume this argument. Be specific -- "The job.yml definition file for the new job" is far more useful than "A YAML file". + +### `type`: string vs file_path + +This controls **output validation** in `finished_step`: + +- **`file_path`**: The agent provides a path (or list of paths). The framework checks that every referenced file exists on disk. If any file is missing, `finished_step` returns an error immediately. When shown as a step input, file paths appear as backtick-quoted references (e.g., `` `path/to/file.md` ``). Reviews examine the file contents. + +- **`string`**: The agent provides inline text. No file existence check -- the value is accepted as-is. When shown as a step input, the actual string content is included inline in the step instructions. + +Rule of thumb: if the data would be committed to Git (a report, a config file), use `file_path`. If it is transient context (a user's answer, a job name), use `string`. + +### `review` + +An optional review block that applies **whenever this argument is produced as a step output, in any step, in any workflow**. This is a default review for this piece of data. 
+ +```yaml +- name: step_instruction_files + description: "Instruction Markdown files for each step" + type: file_path + review: + strategy: individual + instructions: | + Evaluate: Complete instructions, specific & actionable, output examples shown. +``` + +You define quality criteria once, and they apply everywhere. If three workflows all produce `step_instruction_files`, they all get this review. Steps can add additional scrutiny with output-level reviews (see "Review Cascade" below). + +### `json_schema` + +Only applies to `file_path` arguments. When set, the framework parses each output file as JSON and validates it against the schema **before any reviews run**. If validation fails, `finished_step` returns the error immediately -- reviews are skipped entirely. This is a hard gate, not a soft review. Use for structured outputs where format correctness is non-negotiable. + +--- + +## Workflows + +Workflows are the primary execution unit. Agents start workflows, not individual steps. + +### `summary` + +A one-line description (max 200 chars) shown alongside the workflow name in `get_workflows` output. Helps the agent pick the right workflow when a job has multiple. + +### `agent` + +Changes how the workflow appears in `get_workflows`. Without `agent`, the response tells the caller to invoke `start_workflow` directly: + +> Call `start_workflow` with job_name="X" and workflow_name="Y", then follow the step instructions it returns. + +With `agent` set (e.g., `"general-purpose"`), the response tells the caller to spawn a **Task sub-agent** of that type: + +> Invoke as a Task using subagent_type="general-purpose" with a prompt giving full context and instructions to call `start_workflow`... + +If the agent does not have the Task tool available, the instructions fall back to direct invocation. + +Use `agent` for workflows that should execute autonomously without blocking the main conversation. 
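For instance, a workflow meant to run autonomously might be declared like this sketch (the workflow name, summary, and step are illustrative; `general-purpose` is the agent type mentioned above):

```yaml
workflows:
  competitor_scan:
    summary: "Scan competitor sites and produce a findings report"
    agent: "general-purpose"
    steps:
      - name: gather_findings
        instructions: |
          Visit each competitor site listed in the inputs and record findings.
```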
+ +### `common_job_info_provided_to_all_steps_at_runtime` + +This text has **two runtime effects**: + +1. **Step instructions**: Delivered as a separate `common_job_info` field in the response when a step starts. The agent sees it alongside the step instructions. +2. **Review prompts**: Included as a "Job Context" preamble in every dynamic review built from this workflow's step outputs. Reviewers see it automatically. + +Use it for shared knowledge every step (and every reviewer) needs: project background, key terminology, constraints, conventions, schema references. This avoids duplicating the same context in every step's `instructions` field and every review's `instructions` field. + +### `post_workflow_instructions` + +Returned to the agent when the **last step completes successfully** (in the `workflow_complete` response from `finished_step`). Use for guidance on what to do after the workflow finishes -- "Create a PR", "Run the test suite", "Notify the user". + +This text is only delivered once, at the end. It is not included in any step instructions or reviews. + +### `steps` + +An ordered array of step definitions. Steps execute sequentially -- the agent completes one step (via `finished_step`) before receiving the next step's instructions. + +--- + +## Steps + +Each step must have a `name` and exactly one of `instructions` or `sub_workflow`. Having both or neither is a parse error. + +### `name` + +Unique step identifier within the workflow (pattern: `^[a-z][a-z0-9_]*$`). Used as `step_id` in MCP responses and for tracking progress. + +### `instructions` + +Inline markdown telling the agent what to do. At runtime, the framework builds the final instructions by prepending resolved input values (file paths as backtick references, string values inline), then appending the step's instructions. The `common_job_info` is delivered as a separate field in the response. + +### `sub_workflow` + +Instead of inline instructions, delegate this step to another workflow. 
The framework auto-generates instructions: + +> Call `start_workflow` with job_name="current_job" and workflow_name="target_workflow", then follow the instructions it returns until the sub-workflow completes. + +See "Sub-workflows" below for details on same-job vs cross-job references and stack behavior. + +### `inputs` + +A map of step_argument names to input configuration. Input values are resolved at runtime from two sources, checked in order: + +1. **Provided inputs** from `start_workflow`'s `inputs` parameter (first step only) +2. **Previous step outputs** accumulated in the session + +Each input has a `required` flag (default `true`). Missing required inputs show as "not yet available" in the step instructions rather than causing an error. Optional inputs (`required: false`) behave the same way but signal intent that the value may not exist. + +Resolved input values are formatted and prepended to the step instructions: +- `file_path` inputs: `` - **name** (required): `path/to/file.md` `` +- `string` inputs: `- **name** (required): the actual value` + +These same input values are also included in review prompts as "Step Inputs" context. + +### `outputs` + +A map of step_argument names to output configuration. When the agent calls `finished_step`, validation runs in this order: + +1. **Completeness**: All required outputs must be present. No unknown output names allowed. +2. **Type validation**: `file_path` values must point to existing files. `string` values must be strings. +3. **JSON schema**: If the step_argument has `json_schema`, file contents are parsed and validated. Failures are returned immediately; reviews are skipped. +4. **Quality reviews**: Dynamic reviews from the output ref and step_argument, plus .deepreview rules. + +**Important**: The agent must provide ALL required outputs on every `finished_step` call, even outputs whose files have not changed since a previous attempt. The framework re-validates everything each time. 
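Putting the two maps together, a step declaration might look like this sketch (argument names are hypothetical and must be declared in `step_arguments`; the exact per-input and per-output configuration keys follow the job schema):

```yaml
steps:
  - name: write_report
    instructions: |
      Draft the report from the research notes provided above.
    inputs:
      research_notes: {}        # required by default
      style_guide:
        required: false         # shown as "not yet available" if missing
    outputs:
      final_report: {}          # validated according to its step_argument type
```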
+ +The `review` field on an output is step-specific and **supplements** (does not replace) any review on the step_argument. See "Review Cascade" below. + +### `process_quality_attributes` + +A map of attribute names to **statements that should be true** if the work was done correctly: + +```yaml +process_quality_attributes: + tests_written: "Unit tests were written before implementation code." + user_consulted: "The user was asked to confirm the approach." +``` + +At runtime, this creates a synthetic review with `matches_together` strategy that evaluates the agent's `work_summary` (provided in `finished_step`) against these criteria. The review prompt includes: +- The workflow's `common_job_info` +- The step's input values +- All quality criteria as a bulleted list +- The `work_summary` text +- References to all output files (so the reviewer can cross-check claims) + +The reviewer checks whether the work described in `work_summary` satisfies each criterion. If the work_summary is incomplete or inaccurate, the reviewer tells the agent to fix its work or its work_summary. + +This is for **process quality** -- did the agent follow the right process? -- not for output quality, which is handled by output reviews. + +--- + +## The Review Cascade + +Reviews on step outputs come from **three independent sources** that are merged at runtime. Understanding their interaction is essential. + +### Source 1: Step output review + +A `review` block on a specific step's output ref. Created as a dynamic `ReviewRule` named `step_{step_name}_output_{arg_name}`. + +### Source 2: Step_argument review + +A `review` block on the step_argument itself. Created as a dynamic `ReviewRule` named `step_{step_name}_output_{arg_name}_arg` (note the `_arg` suffix). + +### Source 3: .deepreview rules + +Project-wide review rules from `.deepreview` files. These match output files by glob pattern and are loaded independently of the job definition. 
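A sketch of how sources 1 and 2 coexist for the same argument (all names here are illustrative):

```yaml
step_arguments:
  - name: job.yml
    description: "The job definition file"
    type: file_path
    review:                     # source 2: applies wherever job.yml is produced
      strategy: matches_together
      instructions: "Check the YAML parses and all step names are unique."

workflows:
  create:
    steps:
      - name: define_job
        instructions: "Write the job.yml for the new job."
        outputs:
          job.yml:
            review:             # source 1: step-specific scrutiny
              strategy: matches_together
              instructions: "Confirm the workflow matches the user's stated goal."
```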
+
+### How they merge
+
+All three sources produce `ReviewRule` objects that are matched against the output file paths. They run as **separate, independent reviews** -- they do not replace each other.
+
+The ordering matters: for each output, the step output review (source 1) is added first, then the step_argument review (source 2) with the `_arg` suffix. Both run as separate review tasks. Then .deepreview rules are matched and added after all dynamic rules.
+
+```
+Step output review:   step_define_output_job.yml      -> runs
+Step_argument review: step_define_output_job.yml_arg  -> runs (separately)
+.deepreview rule:     yaml_standards                  -> runs (if pattern matches)
+```
+
+The practical effect: a step_argument review provides a baseline quality check that applies everywhere, a step output review adds step-specific scrutiny, and .deepreview rules add project-wide standards. They stack.
+
+### Review context
+
+Every dynamic review (from sources 1 and 2) automatically receives a preamble containing:
+- The workflow's `common_job_info` as "Job Context" (if set)
+- The step's resolved input values as "Step Inputs"
+
+This is prepended to the review's own `instructions`. You do not need to repeat domain context in review instructions.
+
+### After reviews
+
+If any reviews need to run, `finished_step` returns `needs_review` status with instructions for the agent to launch review tasks. After fixing issues (or marking reviews as passed), the agent calls `finished_step` again. Previously passing reviews are skipped via `.passed` marker files.
+
+---
+
+## Sub-workflows
+
+### Same-job references
+
+```yaml
+sub_workflow:
+  workflow_name: code_review
+```
+
+References another workflow in the same job. Validated at parse time -- the parser checks that the target workflow exists.
+
+### Cross-job references
+
+```yaml
+sub_workflow:
+  workflow_name: quality_check
+  workflow_job: shared_tools
+```
+
+References a workflow in a different job.
**Not validated at parse time** because the other job may not be loaded. Validated at runtime when `start_workflow` is called. + +### Stack behavior + +When a step has `sub_workflow`, the agent calls `start_workflow` for the sub-workflow. This **pushes onto the session stack**. The sub-workflow runs its steps normally. When its last step completes, `finished_step` returns `workflow_complete` and the sub-workflow **pops off the stack**, returning control to the parent workflow. + +The agent still needs to call `finished_step` on the parent step after the sub-workflow completes -- the sub-workflow's completion does not automatically advance the parent. + +The `abort_workflow` tool can unwind the stack, aborting the current sub-workflow and resuming the parent. + +--- + +## review_block Fields + +Both step_argument reviews and step output reviews use the same shape: + +### `strategy` + +- **`individual`**: One review per output file. Each file gets its own review agent call. Use when multiple files should be evaluated independently. Many files do NOT cause timeout accumulation -- each is a separate MCP call. +- **`matches_together`**: All matched output files reviewed in one call. Use when files form a coherent set that must be evaluated together. + +Note: `all_changed_files` (available in `.deepreview` rules) is not available in job.yml review blocks. + +### `instructions` + +What to tell the reviewer. Be specific and actionable -- "Verify the YAML has at least 3 steps and each step has both inputs and outputs" is better than "Check if the job looks good." The framework prepends job context and step inputs automatically. + +### `agent` + +Routes the review to a specific agent persona. A map of platform names to persona identifiers: + +```yaml +agent: + claude: "security-expert" +``` + +When not set, reviews use the default reviewer. 
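Combining the fields above, a complete review block might look like this sketch (the persona identifier is hypothetical):

```yaml
review:
  strategy: individual
  instructions: |
    Verify each instruction file names its expected outputs and includes
    at least one concrete example of the output format.
  agent:
    claude: "docs-reviewer"
```

With `individual`, each matched output file gets its own review call, as described above.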
+ +### `additional_context` + +Flags controlling extra information in the review prompt: + +- **`all_changed_filenames: true`**: Include a list of all output files, even if the review strategy only examines a subset. Useful when reviewing one file but needing awareness of the full change set. +- **`unchanged_matching_files: true`**: Include files that match the include patterns but were not produced as outputs. Useful for freshness reviews where the reviewer needs to see existing documents alongside new ones. + +--- + +## Data Flow Summary + +Input values are resolved in order: (1) `start_workflow` provided inputs, then (2) accumulated outputs from previous steps. All required outputs must be provided on every `finished_step` call, even unchanged ones. When the last step completes, all accumulated outputs are returned alongside `post_workflow_instructions`. diff --git a/doc/mcp_interface.md b/doc/mcp_interface.md index 5e1623c7..2ec6e8a9 100644 --- a/doc/mcp_interface.md +++ b/doc/mcp_interface.md @@ -64,7 +64,8 @@ Start a new workflow session. Creates a git branch, initializes state tracking, | `goal` | `string` | Yes | What the user wants to accomplish | | `job_name` | `string` | Yes | Name of the job | | `workflow_name` | `string` | Yes | Name of the workflow within the job. If the name doesn't match but the job has only one workflow, that workflow is selected automatically. If the job has multiple workflows, an error is returned listing the available workflow names. | -| `instance_id` | `string \| null` | No | Optional identifier for naming (e.g., 'acme', 'q1-2026') | +| `session_id` | `string` | Yes | The Claude Code session ID (CLAUDE_CODE_SESSION_ID from startup context). Identifies the persistent state storage for this agent session. | +| `agent_id` | `string \| null` | No | The Claude Code agent ID (CLAUDE_CODE_AGENT_ID from startup context), if running as a sub-agent. When set, this workflow is scoped to this agent. 
|

#### Returns

@@ -88,7 +89,8 @@ Report that you've finished a workflow step. Validates outputs against quality c

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `outputs` | `Record<string, string \| string[]>` | Yes | Map of output names to file path(s). For outputs declared as type `file`: pass a single string path (e.g. `"report.md"`). For outputs declared as type `files`: pass a list of string paths (e.g. `["a.md", "b.md"]`). Outputs with `required: false` can be omitted. Check `step_expected_outputs` to see each output's declared type and required status. |
| `notes` | `string \| null` | No | Optional notes about work done |
| `quality_review_override_reason` | `string \| null` | No | If provided, skips quality review (must explain why) |
-| `session_id` | `string \| null` | No | Target a specific workflow session by ID. Use when multiple workflows are active concurrently. If omitted, operates on the top-of-stack session. The session_id is returned in `ActiveStepInfo` from `start_workflow` and `finished_step`. |
+| `session_id` | `string` | Yes | The Claude Code session ID (CLAUDE_CODE_SESSION_ID from startup context). Identifies the persistent state storage for this agent session. |
+| `agent_id` | `string \| null` | No | The Claude Code agent ID (CLAUDE_CODE_AGENT_ID from startup context), if running as a sub-agent. When set, operates on this agent's scoped workflow stack. |

@@ -125,7 +127,8 @@ Abort the current workflow and return to the parent workflow (if nested). Use th

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `explanation` | `string` | Yes | Why the workflow is being aborted |
-| `session_id` | `string \| null` | No | Target a specific workflow session by ID. Use when multiple workflows are active concurrently. If omitted, aborts the top-of-stack session.
| +| `agent_id` | `string \| null` | No | The Claude Code agent ID (CLAUDE_CODE_AGENT_ID from startup context), if running as a sub-agent. When set, operates on this agent's scoped workflow stack. | #### Returns @@ -151,7 +154,8 @@ Navigate back to a prior step in the current workflow. Clears all progress from | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | `step_id` | `string` | Yes | ID of the step to navigate back to. Must exist in the current workflow. | -| `session_id` | `string \| null` | No | Target a specific workflow session by ID. Use when multiple workflows are active concurrently. If omitted, operates on the top-of-stack session. | +| `session_id` | `string` | Yes | The Claude Code session ID (CLAUDE_CODE_SESSION_ID from startup context). Identifies the persistent state storage for this agent session. | +| `agent_id` | `string \| null` | No | The Claude Code agent ID (CLAUDE_CODE_AGENT_ID from startup context), if running as a sub-agent. When set, operates on this agent's scoped workflow stack. | #### Returns @@ -306,13 +310,13 @@ The `finished_step` tool returns one of three statuses: | Discover available jobs and workflows | -2. start_workflow(goal, job_name, workflow_name) +2. start_workflow(goal, job_name, workflow_name, session_id) | Get session_id, first step instructions | 3. Execute step instructions, create outputs | -4. finished_step(outputs) +4. finished_step(outputs, session_id) | +-- status = "needs_work" -> Fix issues, goto 4 +-- status = "next_step" -> Execute new instructions, goto 4 @@ -420,6 +424,7 @@ Add to your `.mcp.json`: | Version | Changes | |---------|---------| +| 2.0.0 | **Breaking**: `session_id` is now a required `string` parameter on all mutation tools (`start_workflow`, `finished_step`, `abort_workflow`, `go_to_step`). Added `agent_id` optional parameter for sub-agent scoping — sub-agents get their own isolated workflow stacks. 
State persistence path changed to `.deepwork/tmp/sessions//session-/state.json` (with sub-agent state in `agent_.json`). | | 1.9.0 | Added `go_to_step` tool for navigating back to prior steps. Clears all step progress from the target step onward, forcing re-execution of subsequent steps. Supports `session_id` for concurrent workflow safety. | | 1.8.0 | Added `how_to_invoke` field to `WorkflowInfo` in `get_workflows` response. Always populated with invocation instructions: when a workflow's `agent` field is set, directs callers to delegate via the Task tool; otherwise, directs callers to use the `start_workflow` MCP tool directly. Also added optional `agent` field to workflow definitions in job.yml. | | 1.7.0 | Added `mark_review_as_passed` tool for review pass caching. Instruction files now include an "After Review" section with the review ID. Reviews with a `.passed` marker are automatically skipped by `get_review_instructions`. | diff --git a/flake.nix b/flake.nix index 8f67dece..ef92359f 100644 --- a/flake.nix +++ b/flake.nix @@ -84,6 +84,19 @@ # Also register as a uv tool so `uvx deepwork serve` uses local source uv tool install -e "$REPO_ROOT" --quiet 2>/dev/null || true + # Create claude wrapper script so direnv (which can't export functions) works + _claude_real=$(PATH="$(echo "$PATH" | sed "s|$REPO_ROOT/.venv/bin:||g")" command -v claude) + if [ -n "$_claude_real" ]; then + cat > "$REPO_ROOT/.venv/bin/claude" <