Summary
Add configurable retry policies at the agent level so transient failures (API errors, rate limits, timeouts) don't kill entire workflows.
Motivation
Production agentic AI demands resilience. Today, a single transient API failure kills the entire Conductor workflow, and the only recovery path is checkpoint/resume, which is all-or-nothing. Production frameworks such as LangGraph are positioned as leaders partly because of configurable retry, and real-world incidents (Amazon retail outages from AI agents, "two AIs talked for 2 hours booking nothing") underscore the need for fault tolerance.
Proposed Design
    agents:
      - name: analyzer
        model: gpt-5.2
        retry:
          max_attempts: 3        # default: 1 (no retry)
          backoff: exponential   # or "fixed"
          delay_seconds: 2       # base delay
          retry_on:
            - provider_error     # API 500s, rate limits
            - timeout            # agent-level timeout exceeded
          # NOT: validation_error; output schema mismatches indicate logic bugs, not transience
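To make the shape of the `retry:` block concrete, here is a minimal Python sketch of how it might be parsed and validated. `RetryPolicy` and `parse_retry` are hypothetical names, not Conductor's actual API, and every default other than `max_attempts: 1` is an assumption rather than something the proposal specifies.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Tuple

# Hypothetical names for illustration; not Conductor's actual API.
RETRYABLE_CATEGORIES = frozenset({"provider_error", "timeout"})
BACKOFF_MODES = ("exponential", "fixed")


@dataclass(frozen=True)
class RetryPolicy:
    max_attempts: int = 1           # total attempts; 1 means no retry (from the proposal)
    backoff: str = "exponential"    # assumed default
    delay_seconds: float = 2.0      # assumed default
    retry_on: Tuple[str, ...] = ()  # assumed default: nothing retried unless listed


def parse_retry(raw: Optional[Dict[str, Any]]) -> RetryPolicy:
    """Build a RetryPolicy from an agent's `retry:` mapping, rejecting bad values."""
    raw = raw or {}
    policy = RetryPolicy(
        max_attempts=int(raw.get("max_attempts", 1)),
        backoff=str(raw.get("backoff", "exponential")),
        delay_seconds=float(raw.get("delay_seconds", 2.0)),
        retry_on=tuple(raw.get("retry_on", ())),
    )
    if policy.max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    if policy.backoff not in BACKOFF_MODES:
        raise ValueError(f"backoff must be one of {BACKOFF_MODES}")
    if policy.delay_seconds < 0:
        raise ValueError("delay_seconds must be >= 0")
    unknown = set(policy.retry_on) - RETRYABLE_CATEGORIES
    if unknown:
        # validation_error and anything else non-transient is rejected up front
        raise ValueError(f"unsupported retry_on values: {sorted(unknown)}")
    return policy
```

Rejecting `validation_error` at load time, rather than silently ignoring it, is one possible design choice: it surfaces a misconfigured policy before the workflow starts.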
Behavior
- Retry counter resets per agent execution (not per workflow)
- Exponential backoff: delay * 2^attempt with jitter
- Events emitted on each retry attempt (agent_retry event)
- Final failure after exhausting retries follows existing error handling (fail_fast, continue_on_error, etc.)
- Works in both sequential and parallel execution contexts
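The behavior listed above can be sketched as a small retry loop. This is an illustration under stated assumptions, not the actual implementation: `AgentError`, `execute_with_retry`, the `emit` callback, and the event payload fields are hypothetical, `attempt` is assumed to be 0-based in the backoff formula, and the ±25% jitter range is an arbitrary choice since the proposal does not specify one.

```python
import random
import time
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


class AgentError(Exception):
    """Hypothetical failure type carrying a category such as "provider_error"."""

    def __init__(self, category: str, message: str = ""):
        super().__init__(message or category)
        self.category = category


def execute_with_retry(
    execute: Callable[[], T],
    *,
    max_attempts: int = 1,
    delay_seconds: float = 2.0,
    backoff: str = "exponential",
    retry_on: Iterable[str] = ("provider_error", "timeout"),
    jitter: float = 0.25,  # assumed: scale each delay by a factor in [1 - jitter, 1 + jitter]
    emit: Callable[[str, dict], None] = lambda name, payload: None,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    allowed = set(retry_on)
    # The counter is local to this call, so it resets for every agent execution.
    for attempt in range(max_attempts):
        try:
            return execute()
        except AgentError as exc:
            out_of_attempts = attempt == max_attempts - 1
            if out_of_attempts or exc.category not in allowed:
                # Re-raise so existing fail_fast / continue_on_error handling takes over.
                raise
            factor = 2 ** attempt if backoff == "exponential" else 1
            delay = delay_seconds * factor * (1 + random.uniform(-jitter, jitter))
            emit("agent_retry", {"attempt": attempt + 1, "delay": delay, "error": exc.category})
            sleep(delay)
    raise AssertionError("unreachable")
```

Under these assumptions, `max_attempts: 3` with `delay_seconds: 2` allows at most two retries, with nominal waits of 2 s and 4 s before jitter. Passing `sleep` and `emit` as parameters keeps the loop testable without real delays.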
What Does NOT Retry
- Output validation errors (schema mismatches) — these indicate prompt/schema issues, not transience
- Human gate timeouts — these are intentional pauses
- Workflow-level timeout exceeded — hard stop
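One way to enforce these exclusions is a small decision function that checks the failure category before any retry is scheduled. The sketch below is hypothetical: `should_retry` is an illustrative name, and the category strings `human_gate_timeout` and `workflow_timeout` are invented here, since the proposal only names `provider_error`, `timeout`, and `validation_error`.

```python
from typing import Iterable

# Categories the proposal treats as transient.
RETRYABLE = frozenset({"provider_error", "timeout"})

# Categories that must never be retried; the last two names are assumptions.
NEVER_RETRY = frozenset({"validation_error", "human_gate_timeout", "workflow_timeout"})


def should_retry(category: str, retry_on: Iterable[str]) -> bool:
    """Return True only for transient categories the agent has opted into."""
    if category in NEVER_RETRY:
        return False  # excluded even if a config lists it by mistake
    return category in RETRYABLE and category in set(retry_on)
```

Checking `NEVER_RETRY` first means a misconfigured `retry_on` list cannot turn a schema mismatch or a hard workflow stop into a retry.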
Why It Fits Conductor
- Fits the existing fail_fast/continue_on_error parallel semantics
Effort Estimate
Low — wraps existing AgentExecutor.execute() in a retry loop with backoff logic.