feat: A2A platform — gateway, directory, routing, certification, observability#104
khaliqgant merged 17 commits into main
…rvability

Implements the full A2A platform for Relaycast:

**Gateway (Phase 1)**
- A2A agent registration, removal, listing
- Relay ↔ A2A JSON-RPC message translation
- Webhook handling for external agent callbacks
- Agent Card serving (/.well-known/agent-card.json)
- Periodic health checking for external agents
- DM intercept for transparent A2A routing

**Observability (Phase 2)**
- Message logging pipeline hooked into send path
- Console API: messages, stats, agent stats, costs
- Observer dashboard: ConsoleFeed, AgentMetrics, CostBreakdown

**Directory (Phase 3)**
- Publish, search (FTS5), browse, rate agents
- Skill indexing for agents
- CRUD routes for directory entries

**Certification (Phase 4)**
- 3-level certification test runner
- Badge SVG generation
- Continuous monitoring support

**Smart Routing (Phase 5)**
- Skill-based agent matching with configurable weights
- Circuit breaker for failing agents
- Route feedback (success/failure tracking)

**SDK**
- New methods: registerA2a, listA2aAgents, removeA2aAgent, getA2aAgentCard
- route(), searchDirectory(), publishToDirectory(), importSkills()
- getRoutingConfig(), updateRoutingConfig()

**E2E**
- 24 new E2E tests covering all A2A endpoints

Depends on #103 for migrations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
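For readers unfamiliar with the "circuit breaker for failing agents" item, here is a minimal sketch of the general pattern. It is not the code from this PR: the class name, the failure threshold, the cooldown, and the injected clock are all assumptions chosen for illustration.

```typescript
// Hypothetical circuit breaker sketch; names and thresholds are illustrative,
// not taken from the Relaycast routing engine.
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(failureThreshold = 3, cooldownMs = 30000, now: () => number = () => Date.now()) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  // "open" blocks routing; after the cooldown one trial call is allowed ("half-open").
  state(): BreakerState {
    if (this.openedAt === null) return "closed";
    return this.now() - this.openedAt >= this.cooldownMs ? "half-open" : "open";
  }

  canRoute(): boolean {
    return this.state() !== "open";
  }

  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null;
  }

  recordFailure(): void {
    this.failures += 1;
    // A failed trial call while half-open re-opens the breaker immediately.
    if (this.state() === "half-open" || this.failures >= this.failureThreshold) {
      this.openedAt = this.now();
    }
  }
}
```

A router using this would skip any agent whose breaker reports `canRoute() === false` and feed route feedback into `recordSuccess()` / `recordFailure()`.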
a1cde03 to
228c25e
- Update a2a-health test to expect origin-level agent card URL
- Add getAgentByName mock for PATCH /v1/agents/:name test
- Add directoryEngine mock for agent route tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Inline ternaries inside sql`` template literals caused Drizzle to render duplicate/truncated CTE blocks. Extract conditional fragments into variables before interpolating. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ring bug Drizzle's sql`` template corrupts output when nested sql`` fragments are interpolated inside complex CTE queries. Split searchDirectory into two branches (FTS with CTEs vs tags-only) using only plain value interpolation, matching the safe pattern in routing.ts. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…cation Drizzle's sql`` template on D1 duplicates the entire query text for complex CTE queries, even with plain value interpolation. Switch to db.$client.prepare() with positional bind params, matching the pattern used in a2a-health.ts. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Revert db.$client.prepare() back to Drizzle sql`` with two query branches (FTS vs tags-only) using plain value interpolation. Need to investigate the actual D1 failure root cause before bypassing Drizzle. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Append error.cause to the error message so the actual D1/SQLite error is visible in e2e test output instead of just "Failed query: <sql>". Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
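A minimal sketch of what appending the cause to the message can look like. The helper name `describeError` and the output format are assumptions; the actual change in this commit may format the message differently.

```typescript
// Hypothetical helper: surfaces the underlying cause (e.g. the D1/SQLite error)
// alongside the wrapper message such as "Failed query: <sql>".
function describeError(err: unknown): string {
  if (!(err instanceof Error)) return String(err);
  // Read `cause` defensively so this also compiles against libs older than ES2022.
  const cause = (err as { cause?: unknown }).cause;
  if (cause === undefined) return err.message;
  const detail = cause instanceof Error ? cause.message : String(cause);
  return `${err.message} (cause: ${detail})`;
}
```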
…nctions in CTE context Split the CTE-based FTS queries in directory search and routing into separate direct queries where bm25() is called on the FTS table directly (which D1 supports), then merge the rank maps in JS. Error was: D1_ERROR: unable to use function bm25 in the requested context: SQLITE_ERROR Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
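The "merge the rank maps in JS" step could be sketched as follows. This assumes each direct FTS query has already been reduced to a map from entity id to bm25 rank, and that the best rank wins on conflict (in SQLite FTS5, a lower bm25() value means a better match). The function names are hypothetical.

```typescript
// Hypothetical merge of per-table rank maps (e.g. agent ranks and skill ranks).
// Lower bm25 values are better, so the minimum is kept for each id.
function mergeRankMaps(...maps: Array<Map<string, number>>): Map<string, number> {
  const merged = new Map<string, number>();
  for (const map of maps) {
    map.forEach((rank, id) => {
      const current = merged.get(id);
      if (current === undefined || rank < current) merged.set(id, rank);
    });
  }
  return merged;
}

// Order ids best-first (ascending bm25 rank).
function rankedIds(ranks: Map<string, number>): string[] {
  const entries: Array<[string, number]> = [];
  ranks.forEach((rank, id) => entries.push([id, rank]));
  return entries.sort((a, b) => a[1] - b[1]).map((entry) => entry[0]);
}
```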
…gate context D1 errors with "unable to use function bm25 in the requested context" when bm25() is wrapped in MIN() with GROUP BY. Fetch raw FTS rows and compute min rank per entity in JS instead. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
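Computing the minimum rank per entity in JS, instead of `MIN(bm25(...))` with `GROUP BY`, might look like the sketch below. The row shape (`entityId`, `rank`) is an assumption; the real query columns may be named differently.

```typescript
// Hypothetical shape of a raw FTS row: one row per matching document,
// with bm25() selected directly on the FTS table.
interface FtsRow {
  entityId: string;
  rank: number;
}

// JS replacement for MIN(rank) ... GROUP BY entityId.
function minRankByEntity(rows: FtsRow[]): Map<string, number> {
  const best = new Map<string, number>();
  for (const row of rows) {
    const current = best.get(row.entityId);
    if (current === undefined || row.rank < current) {
      best.set(row.entityId, row.rank);
    }
  }
  return best;
}
```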
…2e agent
- D1 rejects bm25() in aggregate context (MIN + GROUP BY), so fetch raw FTS rows and compute min rank per entity in JS
- Add sourceAgentName to routing e2e publish so the directory entry links to the registered agent (required by INNER JOIN in routing query)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Preview deployed!
This preview shares the staging database and will be cleaned up when the PR is merged or closed.
Run E2E tests: npm run e2e -- https://pr104-api.relaycast.dev --ci
- Remove unused imports (and, sql, asc, lt, gt, channels, messages, reactions, directoryAgents, directorySkills, DirectorySkillRow, toIso)
- Prefix unused destructured vars with _ (url, workspace_id, channel_id, timestamp)
- Change let → const for agentRankMap/skillRankMap
- Replace no-explicit-any with proper types across routes, engine, and middleware files

Co-Authored-By: codex-linter <noreply@anthropic.com>
Co-Authored-By: claude-linter <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…e race The AgentDO webSocketClose handler was fire-and-forgetting the disconnect call to PresenceDO. An in-flight heartbeat could settle at PresenceDO after the disconnect, leaving the agent stuck as "online". Awaiting the disconnect ensures it completes before the handler returns. Also increased e2e disconnect polling from 10 to 15 attempts to give more time for the async DO close callback to settle. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…gent.test.ts Finding: The new mock (lines 14-16) stubs syncSourceAgentDirectoryEntry to always resolve, and the 'updates agent' test (lines 200-208) adds a getAgentByName mock return. These are correctness improvements but PR: #104 Auto-fixed by msd fix (claude)
- sendToExternalAgent: only retry 5xx/network errors, not 4xx or JSON-RPC errors; remove dead retry delay entry
- setRoutingConfig: merge partial weights on top of current values instead of discarding them
- findWebhookAgentByName: add workspace_id scope to prevent cross-tenant collisions; update webhook URL to include workspace_id in path
- openapi.yaml: add servers override for root-level A2A endpoints
- local daemon: route A2A well-known/rpc/webhook at root (not /v1/)

Co-Authored-By: codex-a2a-fix <noreply@anthropic.com>
Co-Authored-By: claude-spec-fix <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
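To illustrate the setRoutingConfig fix, here is a rough sketch of merging a partial weights update on top of the current values. The weight names (`skillMatch`, `rating`, `latency`) and the helper name are invented for the example and are not the PR's actual config keys.

```typescript
// Hypothetical weight keys; the real routing config may use different names.
interface RoutingWeights {
  skillMatch: number;
  rating: number;
  latency: number;
}

// Apply only the keys present in the patch; everything else keeps its current value.
function mergeWeights(current: RoutingWeights, patch: Partial<RoutingWeights>): RoutingWeights {
  const next: RoutingWeights = { ...current };
  (Object.keys(patch) as Array<keyof RoutingWeights>).forEach((key) => {
    const value = patch[key];
    // An explicit `undefined` in the patch is treated as "not provided".
    if (value !== undefined) next[key] = value;
  });
  return next;
}
```

With this shape, a request that only sets one weight leaves the other weights untouched rather than resetting them.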
…lope
- Update parity checker to treat /.well-known/agent*, /a2a/rpc, and
/a2a/webhook* as root-level routes (no /v1 prefix)
- Wrap /v1/a2a/agents/:name/card response in standard { ok, data } envelope
- Switch SDK from getRaw to standard get for agent card endpoint
- Remove unused requestRaw/getRaw methods from SDK client
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
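The root-level route rule described in this commit could be expressed roughly as below. The function name and the split into prefix versus exact matches are assumptions based on the patterns listed (`/.well-known/agent*`, `/a2a/rpc`, `/a2a/webhook*`); the real parity checker may be structured differently.

```typescript
// Hypothetical classifier for routes that live at the root (no /v1 prefix).
const ROOT_LEVEL_PREFIXES = ["/.well-known/agent", "/a2a/webhook"];
const ROOT_LEVEL_EXACT = ["/a2a/rpc"];

function isRootLevelRoute(path: string): boolean {
  if (ROOT_LEVEL_EXACT.indexOf(path) !== -1) return true;
  return ROOT_LEVEL_PREFIXES.some((prefix) => path.startsWith(prefix));
}
```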
- Track pending auto heartbeats in SDK and await them during disconnect to prevent stale heartbeats from re-creating agent presence after the authoritative HTTP disconnect
- Only retry errors explicitly marked retryable (5xx/network), not ZodErrors or JSON-RPC parse failures
- Remove non-null assertion in webhook handler; guard against agent deletion mid-request
- Add A2A gateway and directory endpoints to README.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
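A sketch of the "track pending heartbeats and await them during disconnect" idea, under the assumption that each auto heartbeat is represented by a promise. `PendingHeartbeats` and `disconnect` are hypothetical names, not the SDK's actual internals.

```typescript
// Hypothetical tracker for in-flight heartbeat requests.
class PendingHeartbeats {
  private readonly pending = new Set<Promise<unknown>>();

  // Register a heartbeat and forget it once it settles, whether it succeeds or fails.
  track<T>(heartbeat: Promise<T>): Promise<T> {
    this.pending.add(heartbeat);
    const forget = (): void => {
      this.pending.delete(heartbeat);
    };
    heartbeat.then(forget, forget);
    return heartbeat;
  }

  // Wait for every heartbeat that is in flight right now; failures are ignored here.
  async drain(): Promise<void> {
    const settled: Array<Promise<void>> = [];
    this.pending.forEach((p) => {
      settled.push(p.then(() => undefined, () => undefined));
    });
    await Promise.all(settled);
  }

  get size(): number {
    return this.pending.size;
  }
}

// Drain first so no stale heartbeat can land after the authoritative disconnect.
async function disconnect(
  heartbeats: PendingHeartbeats,
  sendDisconnect: () => Promise<void>,
): Promise<void> {
  await heartbeats.drain();
  await sendDisconnect();
}
```

Note that `drain()` only waits for heartbeats registered before it is called, so a real client would also need to stop scheduling new heartbeats before disconnecting.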
The a2a_register stub was formatting the webhook URL as
/a2a/webhook/{name} but the route expects /a2a/webhook/{workspace_id}/{name}.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
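For reference, a small sketch of building the webhook path in the shape the route expects. The helper name is hypothetical, and URL-encoding the segments is an extra precaution assumed here rather than something this commit states.

```typescript
// Hypothetical helper producing /a2a/webhook/{workspace_id}/{name}.
function buildWebhookPath(workspaceId: string, agentName: string): string {
  // Encoding each segment is an assumption; it keeps unusual names from breaking the path.
  return `/a2a/webhook/${encodeURIComponent(workspaceId)}/${encodeURIComponent(agentName)}`;
}
```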
await a2aEngine.sendToExternalAgent(a2aTarget.external_url, payload, {
  scheme: a2aTarget.auth_scheme,
  credential: a2aTarget.auth_credential,
});
await a2aEngine.incrementA2aMessagesSent(db, a2aTarget.id);
🔴 DM message is permanently lost when A2A external agent delivery fails
In sendDm, the A2A outbound call at line 274 (sendToExternalAgent) is awaited before the message is persisted to the database at line 281 (persistDmMessage). If the external agent returns a 4xx, exhausts 5xx retries, returns a JSON-RPC error, or is unreachable, sendToExternalAgent throws and the entire sendDm function throws — the DM is never written to the messages table. From the caller's perspective the message simply vanishes. This breaks the Relaycast invariant that sent DMs are always persisted in the conversation history, and it makes A2A agent reliability failures silently destructive to user data.
Prompt for agents
In packages/server/src/engine/dm.ts, inside the sendDm function, move the persistDmMessage call (currently at line 281) to BEFORE the sendToExternalAgent call (currently at line 274). The message must be written to the database regardless of whether the external A2A agent delivery succeeds or fails. After persisting, the A2A send can be attempted, and its failure can be logged or surfaced without losing the message. Alternatively, wrap the sendToExternalAgent call in a try/catch so that a delivery failure does not prevent persistDmMessage from executing, and attach an error status to the console log entry instead.
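One possible shape for the suggested fix, sketched with injected dependencies so it stands alone. The types, the `sendDmPersistFirst` name, and the `delivery` status field are assumptions; the real `sendDm`, `persistDmMessage`, and `sendToExternalAgent` signatures in packages/server/src/engine/dm.ts may differ.

```typescript
// Hypothetical message and result shapes for illustration only.
interface DmMessage {
  id: string;
  to: string;
  text: string;
}

interface SendResult {
  message: DmMessage;
  delivery: "delivered" | "failed";
  error?: string;
}

interface SendDeps {
  persist: (msg: DmMessage) => Promise<void>; // stands in for persistDmMessage
  deliver: (msg: DmMessage) => Promise<void>; // stands in for sendToExternalAgent
}

async function sendDmPersistFirst(msg: DmMessage, deps: SendDeps): Promise<SendResult> {
  // 1. Persist first, so the DM is in the conversation history no matter what happens next.
  await deps.persist(msg);
  try {
    // 2. Attempt external A2A delivery.
    await deps.deliver(msg);
    return { message: msg, delivery: "delivered" };
  } catch (err) {
    // 3. Surface the delivery failure as a status instead of throwing away the message.
    const error = err instanceof Error ? err.message : String(err);
    return { message: msg, delivery: "failed", error };
  }
}
```

Whether a failed delivery should be returned as a status, logged to the console pipeline, or rethrown after persisting is a design choice left to the PR author.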
Summary
Full A2A platform implementation for Relaycast, adding 27 new endpoints across 5 feature areas.
Depends on #103 (migrations must land first).
Gateway
/.well-known/agent-card.json
(auth header, query param, or path param)
Directory
Smart Routing
POST /v1/route: skill-based agent matching with configurable weights
Certification
Observability Console
SDK
registerA2a(), listA2aAgents(), removeA2aAgent(), getA2aAgentCard()
route(), searchDirectory(), publishToDirectory(), importSkills()
getRoutingConfig(), updateRoutingConfig()
E2E
Test plan
npm run e2e) against local server
npx turbo build passes
🤖 Generated with Claude Code