
feat: A2A platform — gateway, directory, routing, certification, observability #104

Merged
khaliqgant merged 17 commits into main from a2a-implementation on Mar 25, 2026


@khaliqgant khaliqgant commented Mar 24, 2026

Summary

Full A2A platform implementation for Relaycast, adding 27 new endpoints across 5 feature areas.

Depends on #103 (migrations must land first).

Gateway

  • Register/remove/list external A2A agents
  • Relay ↔ A2A JSON-RPC translation (DM intercept + webhook)
  • /.well-known/agent-card.json (auth header, query param, or path param)
  • Health checking (cron-triggered, 3-strike suspension)
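
A minimal sketch of how 3-strike suspension could work as a pure state transition. The field names, the reset-on-success rule, and the `applyHealthCheck` helper are assumptions for illustration, not the PR's actual schema or code.

```typescript
// Hypothetical health record; field names are illustrative only.
type AgentHealth = {
  consecutiveFailures: number;
  status: "active" | "suspended";
};

const MAX_STRIKES = 3; // the "3-strike" threshold from the summary

// Applied after each cron-triggered probe of an external agent.
function applyHealthCheck(agent: AgentHealth, probeOk: boolean): AgentHealth {
  if (probeOk) {
    // Assumption: one successful probe resets the counter and reactivates the agent.
    return { consecutiveFailures: 0, status: "active" };
  }
  const consecutiveFailures = agent.consecutiveFailures + 1;
  return {
    consecutiveFailures,
    status: consecutiveFailures >= MAX_STRIKES ? "suspended" : agent.status,
  };
}
```

Keeping the transition pure makes it easy to unit test separately from the cron handler and the database write.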

Directory

  • Publish, search (FTS5), browse, update, delete agents
  • Skill indexing, ratings system

Smart Routing

  • POST /v1/route — skill-based agent matching with configurable weights
  • Circuit breaker for failing agents
  • Route feedback loop (success/failure)
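
For readers unfamiliar with the pattern, here is a rough sketch of a circuit breaker driven by route feedback. The class, the thresholds, and the half-open behaviour are assumptions; the PR's actual breaker may differ.

```typescript
// Hypothetical circuit breaker; thresholds and method names are illustrative.
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;

  constructor(failureThreshold = 3, cooldownMs = 30_000) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
  }

  // True when the agent may be offered as a routing candidate.
  canRoute(now: number): boolean {
    if (this.openedAt === null) return true;
    // After the cooldown, allow a trial request (half-open state).
    return now - this.openedAt >= this.cooldownMs;
  }

  // Feed route feedback (success/failure) back into the breaker.
  record(success: boolean, now: number): void {
    if (success) {
      this.failures = 0;
      this.openedAt = null;
      return;
    }
    this.failures += 1;
    if (this.failures >= this.failureThreshold) this.openedAt = now;
  }
}
```

Passing `now` in explicitly keeps the breaker deterministic in tests; a production version would also need to persist this state somewhere shared between requests.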

Certification

  • 3-level test runner against external agent URLs
  • Badge SVG generation
  • Continuous monitoring

Observability Console

  • Message logging pipeline (hooked into send path)
  • Stats, agent metrics, cost breakdown APIs
  • Observer dashboard components (ConsoleFeed, AgentMetrics, CostBreakdown)

SDK

  • registerA2a(), listA2aAgents(), removeA2aAgent(), getA2aAgentCard()
  • route(), searchDirectory(), publishToDirectory(), importSkills()
  • getRoutingConfig(), updateRoutingConfig()

E2E

  • 24 new tests covering all A2A endpoints (agent cards, directory CRUD, routing, certification, console)

Test plan

  • Schema tests pass (26/26)
  • SDK unit tests pass
  • Local server starts and endpoints respond correctly
  • Agent card resolves via auth header and query param
  • Full E2E suite (npm run e2e) against local server
  • npx turbo build passes

🤖 Generated with Claude Code



devin-ai-integration[bot]

This comment was marked as resolved.


…rvability

Implements the full A2A platform for Relaycast:

**Gateway (Phase 1)**
- A2A agent registration, removal, listing
- Relay ↔ A2A JSON-RPC message translation
- Webhook handling for external agent callbacks
- Agent Card serving (/.well-known/agent-card.json)
- Periodic health checking for external agents
- DM intercept for transparent A2A routing

**Observability (Phase 2)**
- Message logging pipeline hooked into send path
- Console API: messages, stats, agent stats, costs
- Observer dashboard: ConsoleFeed, AgentMetrics, CostBreakdown

**Directory (Phase 3)**
- Publish, search (FTS5), browse, rate agents
- Skill indexing for agents
- CRUD routes for directory entries

**Certification (Phase 4)**
- 3-level certification test runner
- Badge SVG generation
- Continuous monitoring support

**Smart Routing (Phase 5)**
- Skill-based agent matching with configurable weights
- Circuit breaker for failing agents
- Route feedback (success/failure tracking)
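
One way to picture "skill-based agent matching with configurable weights" is a normalized weighted sum over per-agent signals. The signal names (`skillMatch`, `rating`, `health`) and the formula below are assumptions for illustration; the commit does not specify the real scoring function.

```typescript
// Hypothetical signal and weight names; each signal is assumed to lie in [0, 1].
type Signals = { skillMatch: number; rating: number; health: number };
type Weights = Signals;

// Weighted average of the signals, so the score stays in [0, 1].
function score(signals: Signals, weights: Weights): number {
  const total = weights.skillMatch + weights.rating + weights.health;
  if (total <= 0) return 0; // avoid dividing by zero when all weights are off
  const weighted =
    signals.skillMatch * weights.skillMatch +
    signals.rating * weights.rating +
    signals.health * weights.health;
  return weighted / total;
}

// Sort a copy so the caller's array is untouched; highest score first.
function rankCandidates<T extends { signals: Signals }>(candidates: T[], weights: Weights): T[] {
  return [...candidates].sort((a, b) => score(b.signals, weights) - score(a.signals, weights));
}
```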

**SDK**
- New methods: registerA2a, listA2aAgents, removeA2aAgent, getA2aAgentCard
- route(), searchDirectory(), publishToDirectory(), importSkills()
- getRoutingConfig(), updateRoutingConfig()

**E2E**
- 24 new E2E tests covering all A2A endpoints

Depends on #103 for migrations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Update a2a-health test to expect origin-level agent card URL
- Add getAgentByName mock for PATCH /v1/agents/:name test
- Add directoryEngine mock for agent route tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Inline ternaries inside sql`` template literals caused Drizzle to
render duplicate/truncated CTE blocks. Extract conditional fragments
into variables before interpolating.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ring bug

Drizzle's sql`` template corrupts output when nested sql`` fragments
are interpolated inside complex CTE queries. Split searchDirectory into
two branches (FTS with CTEs vs tags-only) using only plain value
interpolation, matching the safe pattern in routing.ts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

khaliqgant and others added 6 commits March 25, 2026 09:51
…cation

Drizzle's sql`` template on D1 duplicates the entire query text for
complex CTE queries, even with plain value interpolation. Switch to
db.$client.prepare() with positional bind params, matching the pattern
used in a2a-health.ts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Revert db.$client.prepare() back to Drizzle sql`` with two query
branches (FTS vs tags-only) using plain value interpolation. Need to
investigate the actual D1 failure root cause before bypassing Drizzle.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Append error.cause to the error message so the actual D1/SQLite error
is visible in e2e test output instead of just "Failed query: <sql>".

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
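
A small sketch of what appending `error.cause` to a message might look like. The helper name `withCause` and the output format are hypothetical; the commit's exact formatting is not shown in this thread.

```typescript
// Hypothetical helper: surfaces the underlying cause next to the wrapper message.
function withCause(err: unknown): string {
  if (!(err instanceof Error)) return String(err);
  // `cause` is read through a widened type so this compiles on older lib targets too.
  const cause = (err as Error & { cause?: unknown }).cause;
  if (cause === undefined) return err.message;
  const detail = cause instanceof Error ? cause.message : String(cause);
  return `${err.message} (cause: ${detail})`;
}
```

With a wrapper such as "Failed query: <sql>", this would let the underlying D1/SQLite text appear in the same log line.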
…nctions in CTE context

Split the CTE-based FTS queries in directory search and routing into
separate direct queries where bm25() is called on the FTS table directly
(which D1 supports), then merge the rank maps in JS.

Error was: D1_ERROR: unable to use function bm25 in the requested context: SQLITE_ERROR

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…gate context

D1 errors with "unable to use function bm25 in the requested context"
when bm25() is wrapped in MIN() with GROUP BY. Fetch raw FTS rows and
compute min rank per entity in JS instead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…2e agent

- D1 rejects bm25() in aggregate context (MIN + GROUP BY), so fetch raw
  FTS rows and compute min rank per entity in JS
- Add sourceAgentName to routing e2e publish so the directory entry links
  to the registered agent (required by INNER JOIN in routing query)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
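
The "compute min rank per entity in JS" step described above can be sketched as follows. The row shape and function name are assumptions; only the idea (replace `MIN(bm25(...)) ... GROUP BY` with a reduction over raw FTS rows) comes from the commit messages.

```typescript
// Hypothetical shape of one raw FTS row: an entity id plus its bm25() rank.
type FtsRow = { entityId: string; rank: number };

// SQLite's bm25() returns numerically smaller values for better matches,
// so the best rank per entity is the minimum.
function minRankByEntity(rows: FtsRow[]): Map<string, number> {
  const best = new Map<string, number>();
  for (const { entityId, rank } of rows) {
    const prev = best.get(entityId);
    if (prev === undefined || rank < prev) best.set(entityId, rank);
  }
  return best;
}
```

The resulting map can then be merged with other rank maps (for example agent-level and skill-level matches) before sorting results.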
@github-actions

Preview deployed!

Environment URL
API https://pr104-api.relaycast.dev
Health https://pr104-api.relaycast.dev/health
Observer https://pr104-observer.relaycast.dev/observer

This preview shares the staging database and will be cleaned up when the PR is merged or closed.

Run E2E tests

npm run e2e -- https://pr104-api.relaycast.dev --ci

Open observer dashboard

https://pr104-observer.relaycast.dev/observer

khaliqgant and others added 4 commits March 25, 2026 10:54
- Remove unused imports (and, sql, asc, lt, gt, channels, messages,
  reactions, directoryAgents, directorySkills, DirectorySkillRow, toIso)
- Prefix unused destructured vars with _ (url, workspace_id, channel_id,
  timestamp)
- Change let → const for agentRankMap/skillRankMap
- Replace no-explicit-any with proper types across routes, engine, and
  middleware files

Co-Authored-By: codex-linter <noreply@anthropic.com>
Co-Authored-By: claude-linter <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…e race

The AgentDO webSocketClose handler was fire-and-forgetting the disconnect
call to PresenceDO. An in-flight heartbeat could settle at PresenceDO
after the disconnect, leaving the agent stuck as "online". Awaiting the
disconnect ensures it completes before the handler returns.

Also increased e2e disconnect polling from 10 to 15 attempts to give
more time for the async DO close callback to settle.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…gent.test.ts

Finding: The new mock (lines 14-16) stubs syncSourceAgentDirectoryEntry to always resolve, and the 'updates agent' test (lines 200-208) adds a getAgentByName mock return. These are correctness improvements but
PR: #104

Auto-fixed by msd fix (claude)

- sendToExternalAgent: only retry 5xx/network errors, not 4xx or
  JSON-RPC errors; remove dead retry delay entry
- setRoutingConfig: merge partial weights on top of current values
  instead of discarding them
- findWebhookAgentByName: add workspace_id scope to prevent cross-tenant
  collisions; update webhook URL to include workspace_id in path
- openapi.yaml: add servers override for root-level A2A endpoints
- local daemon: route A2A well-known/rpc/webhook at root (not /v1/)

Co-Authored-By: codex-a2a-fix <noreply@anthropic.com>
Co-Authored-By: claude-spec-fix <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
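
The setRoutingConfig fix (merging partial weights on top of current values) might look roughly like this. The `RoutingWeights` keys and the `mergeWeights` helper are hypothetical stand-ins, not the actual config shape.

```typescript
// Hypothetical weight keys; the real routing config may use different names.
type RoutingWeights = { skill: number; rating: number; health: number };

// Overlay only the keys the caller actually supplied; keep the rest as-is.
function mergeWeights(current: RoutingWeights, patch: Partial<RoutingWeights>): RoutingWeights {
  const merged: RoutingWeights = { ...current };
  for (const key of Object.keys(patch) as (keyof RoutingWeights)[]) {
    const value = patch[key];
    // Skip explicit `undefined` so it cannot wipe an existing weight.
    if (value !== undefined) merged[key] = value;
  }
  return merged;
}
```

A plain object spread would also merge the two objects, but it lets an explicit `undefined` in the patch overwrite a stored value, which is why the sketch filters those out.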

…lope

- Update parity checker to treat /.well-known/agent*, /a2a/rpc, and
  /a2a/webhook* as root-level routes (no /v1 prefix)
- Wrap /v1/a2a/agents/:name/card response in standard { ok, data } envelope
- Switch SDK from getRaw to standard get for agent card endpoint
- Remove unused requestRaw/getRaw methods from SDK client

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Track pending auto heartbeats in SDK and await them during disconnect
  to prevent stale heartbeats from re-creating agent presence after
  the authoritative HTTP disconnect
- Only retry errors explicitly marked retryable (5xx/network), not
  ZodErrors or JSON-RPC parse failures
- Remove non-null assertion in webhook handler; guard against agent
  deletion mid-request
- Add A2A gateway and directory endpoints to README.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
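
To illustrate "only retry errors explicitly marked retryable", here is a sketch of an opt-in retry loop. The `markRetryable` and `withRetry` helpers and the `retryable` flag are assumptions; the real sendToExternalAgent also applies retry delays, which are omitted here.

```typescript
// Hypothetical marker: 5xx and network failures would be tagged by the caller.
function markRetryable<E extends Error>(err: E): E & { retryable: true } {
  return Object.assign(err, { retryable: true as const });
}

function isRetryable(err: unknown): boolean {
  return typeof err === "object" && err !== null &&
    (err as { retryable?: unknown }).retryable === true;
}

// Retries only when the thrown error opts in; anything else (4xx, validation,
// JSON-RPC errors) propagates on the first failure.
async function withRetry<T>(attempt: () => Promise<T>, maxAttempts = 3): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < maxAttempts; i++) {
    try {
      return await attempt();
    } catch (err) {
      lastError = err;
      if (!isRetryable(err)) throw err;
    }
  }
  throw lastError;
}
```

Making retryability opt-in means a newly introduced error type fails fast by default instead of being retried by accident.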

The a2a_register stub was formatting the webhook URL as
/a2a/webhook/{name} but the route expects /a2a/webhook/{workspace_id}/{name}.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
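
A tiny sketch of building the workspace-scoped webhook path in one place, so a stub and the route cannot drift apart. The helper name and the URL-encoding of both segments are assumptions; only the `/a2a/webhook/{workspace_id}/{name}` shape comes from the commit message.

```typescript
// Hypothetical helper; encoding each segment is an assumption made for safety.
function buildWebhookPath(workspaceId: string, agentName: string): string {
  return `/a2a/webhook/${encodeURIComponent(workspaceId)}/${encodeURIComponent(agentName)}`;
}
```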
@khaliqgant khaliqgant merged commit fc375dc into main Mar 25, 2026
4 checks passed
@khaliqgant khaliqgant deleted the a2a-implementation branch March 25, 2026 11:32

@devin-ai-integration devin-ai-integration bot left a comment


Devin Review found 1 new potential issue.

View 23 additional findings in Devin Review.


Comment on lines +274 to +278
await a2aEngine.sendToExternalAgent(a2aTarget.external_url, payload, {
  scheme: a2aTarget.auth_scheme,
  credential: a2aTarget.auth_credential,
});
await a2aEngine.incrementA2aMessagesSent(db, a2aTarget.id);


🔴 DM message is permanently lost when A2A external agent delivery fails

In sendDm, the A2A outbound call at line 274 (sendToExternalAgent) is awaited before the message is persisted to the database at line 281 (persistDmMessage). If the external agent returns a 4xx, exhausts 5xx retries, returns a JSON-RPC error, or is unreachable, sendToExternalAgent throws and the entire sendDm function throws — the DM is never written to the messages table. From the caller's perspective the message simply vanishes. This breaks the Relaycast invariant that sent DMs are always persisted in the conversation history, and it makes A2A agent reliability failures silently destructive to user data.

Prompt for agents
In packages/server/src/engine/dm.ts, inside the sendDm function, move the persistDmMessage call (currently at line 281) to BEFORE the sendToExternalAgent call (currently at line 274). The message must be written to the database regardless of whether the external A2A agent delivery succeeds or fails. After persisting, the A2A send can be attempted, and its failure can be logged or surfaced without losing the message. Alternatively, wrap the sendToExternalAgent call in a try/catch so that a delivery failure does not prevent persistDmMessage from executing, and attach an error status to the console log entry instead.
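
A rough sketch of the persist-first ordering suggested above. The dependency signatures are simplified stand-ins (the real persistDmMessage and sendToExternalAgent take different arguments), and `logDeliveryFailure` and the `delivered` flag are hypothetical.

```typescript
// Simplified, hypothetical dependency signatures for illustration only.
type DmDeps = {
  persistDmMessage: (text: string) => Promise<string>; // resolves to a message id
  sendToExternalAgent: (text: string) => Promise<void>;
  logDeliveryFailure: (messageId: string, error: unknown) => void;
};

async function sendDmSketch(
  deps: DmDeps,
  text: string,
): Promise<{ messageId: string; delivered: boolean }> {
  // Persist first so the DM survives any external delivery failure.
  const messageId = await deps.persistDmMessage(text);
  try {
    await deps.sendToExternalAgent(text);
    return { messageId, delivered: true };
  } catch (error) {
    // Surface the failure without losing the stored message.
    deps.logDeliveryFailure(messageId, error);
    return { messageId, delivered: false };
  }
}
```

Whether a failed delivery should also be reported to the caller as an error is a product decision; the sketch only shows that persistence no longer depends on the external call.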

