Skip to content

feat(compute): add ComputeStrategy interface with ECS Fargate backend#8

Open
MichaelWalker-git wants to merge 12 commits intomainfrom
feat/compute-strategy
Open

feat(compute): add ComputeStrategy interface with ECS Fargate backend#8
MichaelWalker-git wants to merge 12 commits intomainfrom
feat/compute-strategy

Conversation

@MichaelWalker-git
Copy link
Copy Markdown

@MichaelWalker-git MichaelWalker-git commented Apr 7, 2026

Summary

Introduces a ComputeStrategy abstraction that decouples the orchestrator from any specific compute backend. Implements two strategies behind this interface:

  • AgentCoreComputeStrategy — extracted from the existing hardcoded AgentCore logic (zero behavior change for existing deployments)
  • EcsComputeStrategy — new ECS Fargate backend that runs agent tasks as standalone containers, selected via compute_type: 'ecs' in blueprint config

Key changes

  • ComputeStrategy interface (compute-strategy.ts): startSession, pollSession, stopSession with SessionHandle for durable serialization
  • EcsComputeStrategy (strategies/ecs-strategy.ts): RunTask with container command override to invoke entrypoint.run_task() directly in batch mode (bypasses the uvicorn server that would otherwise sit idle)
  • EcsAgentCluster CDK construct (ecs-agent-cluster.ts): ECS cluster, Fargate task definition (2 vCPU / 4 GB ARM64), security group (egress 443 only), CloudWatch log group, scoped IAM roles
  • Orchestrator adaptation (orchestrate-task.ts): Returns full SessionHandle from start-session step; adds ECS-level crash detection in the polling loop alongside existing DDB polling
  • TaskOrchestrator construct: Conditional ECS env vars, IAM policies (ecs:RunTask/DescribeTasks/StopTask, scoped iam:PassRole), all-or-nothing prop validation
  • AgentStack wiring: ECS infrastructure gated behind ABCA_ENABLE_ECS=true env flag — zero impact on existing deployments
  • CI: Re-enabled gitleaks and osv-scanner in build workflow

Testing

  • Unit tests for both strategies, compute-strategy resolver, ECS construct, orchestrator construct
  • Integration-tested end-to-end: deployed with ABCA_ENABLE_ECS=true, submitted a task with compute_type: 'ecs' blueprint, agent completed in ~5 min and opened a PR via ECS Fargate

Test plan

  • mise run build passes (compile, test, synth, lint, docs)
  • CDK synth with ABCA_ENABLE_ECS=true produces valid template
  • Existing AgentCore path unchanged (no ECS resources created without flag)
  • ECS task starts, agent runs in batch mode, writes terminal status, exits cleanly
  • Orchestrator detects ECS container completion/failure via poll loop

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Introduces a compute strategy abstraction for orchestrating agent sessions, extracting AgentCore-specific session invocation logic out of the shared orchestrator and updating the durable “start-session” step to delegate compute-specific work to the selected strategy.

Changes:

  • Added a ComputeStrategy interface plus resolveComputeStrategy() factory keyed off blueprintConfig.compute_type.
  • Extracted AgentCore session invocation/stop logic into AgentCoreComputeStrategy, and refactored orchestration to use it.
  • Added/updated unit tests for the new strategy + factory, and adjusted existing orchestrator-related tests accordingly.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
cdk/src/handlers/shared/compute-strategy.ts Adds strategy interface/types and a factory for selecting a compute backend.
cdk/src/handlers/shared/strategies/agentcore-strategy.ts Implements AgentCore-specific session start/poll/stop behind the strategy interface.
cdk/src/handlers/shared/orchestrator.ts Removes AgentCore session start logic from the shared orchestrator.
cdk/src/handlers/orchestrate-task.ts Updates the durable start-session step to resolve and use a compute strategy.
cdk/test/handlers/shared/compute-strategy.test.ts Tests strategy resolution and unknown compute type behavior.
cdk/test/handlers/shared/strategies/agentcore-strategy.test.ts Tests AgentCore strategy session start/poll/stop behavior with SDK mocking.
cdk/test/handlers/orchestrate-task.test.ts Removes old startSession tests and adjusts mocks (now mostly orchestrator-focused).
.github/workflows/build.yml Updates CI environment variables and disables several security/scanning tools via MISE_DISABLE_TOOLS.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@krokoko
Copy link
Copy Markdown
Contributor

krokoko commented Apr 8, 2026

Thank you ! Couple of findings, will add each one of them as a separate comment. Regarding the CI failures, don't worry about it in this PR, i'll fix it separately

@krokoko
Copy link
Copy Markdown
Contributor

krokoko commented Apr 8, 2026

Finding 1 — Dual source of truth for runtimeArn

Category: Architecture
Severity: Medium
Location: agentcore-strategy.ts:31 (constructor) and agentcore-strategy.ts:47 (startSession)

What happens:
The AgentCoreComputeStrategy constructor accepts runtimeArn and stores it as this.runtimeArn:

constructor(options: { runtimeArn: string }) {
this.runtimeArn = options.runtimeArn; // source 1: from factory
this.client = new BedrockAgentCoreClient({});
}

Then in startSession, the method immediately overrides it from blueprintConfig:

const runtimeArn = input.blueprintConfig.runtime_arn ?? this.runtimeArn; // source 2

Why it matters:
The factory in compute-strategy.ts:49 already passes blueprintConfig.runtime_arn to the constructor:

return new AgentCoreComputeStrategy({ runtimeArn: blueprintConfig.runtime_arn });

So this.runtimeArn and input.blueprintConfig.runtime_arn are always the same value at construction time. The ?? fallback can never fire. This creates the false impression that two
different ARNs could be in play, which is confusing for future maintainers.

Recommendation:
Either pass the ARN only through the constructor (remove it from the startSession input), or only read it from blueprintConfig at call time (remove the constructor parameter). One
source of truth, not two.

@krokoko
Copy link
Copy Markdown
Contributor

krokoko commented Apr 8, 2026

Finding 2 — Lost SDK client singleton

Category: Architecture / Performance
Severity: Medium
Location: agentcore-strategy.ts:33 vs the deleted orchestrator.ts:35

Before (main branch):
// orchestrator.ts — module-level singleton, reused across Lambda invocations
const agentCoreClient = new BedrockAgentCoreClient({});

After (PR):
// agentcore-strategy.ts — new instance every time resolveComputeStrategy() is called
constructor(options: { runtimeArn: string }) {
this.runtimeArn = options.runtimeArn;
this.client = new BedrockAgentCoreClient({}); // new client per call
}

The factory resolveComputeStrategy() is called inside the durable execution step 'start-session' at orchestrate-task.ts:120. In durable execution, steps can replay on Lambda
re-invocations, and the factory runs on each replay. Even without replay, every new task invocation goes through the factory.

Why it matters:
AWS SDK clients hold HTTP connection pools and TLS sessions. A module-level singleton amortizes the TLS handshake across all invocations on the same Lambda container. Creating a new
client per call forces a fresh TLS negotiation on every task. Under moderate concurrency, this adds latency and garbage-collection pressure.

Recommendation:
Hoist the BedrockAgentCoreClient to module scope or use a lazy singleton pattern:

let _client: BedrockAgentCoreClient | undefined;
function getClient(): BedrockAgentCoreClient {
return (_client ??= new BedrockAgentCoreClient({}));
}

@krokoko
Copy link
Copy Markdown
Contributor

krokoko commented Apr 8, 2026

Finding 3 — compute_type is an untyped string

Category: Types
Severity: Medium
Location: repo-config.ts:49, compute-strategy.ts:46-52

The CDK construct already defines a strict union at blueprint.ts:50:

readonly type?: 'agentcore' | 'ecs';

But the runtime types in BlueprintConfig and RepoConfig declare it as a bare string:

// repo-config.ts:49
readonly compute_type: string;

The factory switch in compute-strategy.ts then does:

switch (computeType) {
case 'agentcore':
return new AgentCoreComputeStrategy({ ... });
default:
throw new Error(Unknown compute_type: '${computeType}');
}

Why it matters:

  1. The default branch is a runtime bomb. If someone adds 'ecs' to the CDK construct but forgets to add a case 'ecs' here, the factory throws at runtime with no compile-time warning.
  2. TypeScript's exhaustiveness checking only works on discriminated unions, not string. With a proper union type, adding a new variant without handling it would be a compile error.

Recommendation:
Define a shared type and use the never exhaustiveness pattern:

// repo-config.ts
export type ComputeType = 'agentcore' | 'ecs';

export interface BlueprintConfig {
readonly compute_type: ComputeType;
// ...
}

// compute-strategy.ts
switch (computeType) {
case 'agentcore':
return new AgentCoreComputeStrategy({ ... });
// case 'ecs': return new EcsComputeStrategy({ ... });
default: {
const _exhaustive: never = computeType;
throw new Error(Unknown compute_type: '${_exhaustive}');
}
}

@krokoko
Copy link
Copy Markdown
Contributor

krokoko commented Apr 8, 2026

Finding 4 — stopSession swallows all errors

Category: Error handling
Severity: High
Location: agentcore-strategy.ts:78-87

async stopSession(handle: SessionHandle): Promise {
// ...
try {
await this.client.send(new StopRuntimeSessionCommand({ ... }));
logger.info('AgentCore session stopped', { session_id: handle.sessionId });
} catch (err) {
logger.warn('Failed to stop AgentCore session (best-effort)', {
session_id: handle.sessionId,
error: err instanceof Error ? err.message : String(err),
});
}
}

The entire catch is a logger.warn — the error is swallowed unconditionally.

Why it matters:
Not all stop failures are equal:

  • ResourceNotFoundException — the session already ended. Swallowing this is correct.
  • ThrottlingException — rate limited. The session is still running, consuming compute, and accumulating cost. Swallowing this means the session runs until it self-terminates or times
    out.
  • AccessDeniedException — a permissions misconfiguration. This will never self-heal and affects all future stop attempts. Swallowing it means the operator never learns about it except
    by reviewing warn-level logs.
  • ServiceUnavailableException — transient. A retry would likely succeed.

Logging at warn for all of these treats a billing leak the same as a benign no-op.

Recommendation:
At minimum, distinguish retriable/transient errors from permanent ones. Consider re-throwing non-retriable errors (or a subset like auth/config errors), and retrying transient ones. If
the intent is truly "best-effort fire-and-forget", at least log AccessDeniedException at error level so alarms can fire.

@krokoko
Copy link
Copy Markdown
Contributor

krokoko commented Apr 8, 2026

Finding 5 — Missing logger.info('Session started') after full sequence

Category: Observability
Severity: Low
Location: orchestrate-task.ts:120-133 (the new start-session step)

Before (main branch — orchestrator.ts:340):
await agentCoreClient.send(command);
await transitionTask(task.task_id, TaskStatus.HYDRATING, TaskStatus.RUNNING, { ... });
await emitTaskEvent(task.task_id, 'session_started', { session_id: sessionId });
logger.info('Session started', { task_id: task.task_id, session_id: sessionId }); // ← present

After (PR — orchestrate-task.ts:120-133):
const strategy = resolveComputeStrategy(blueprintConfig);
const handle = await strategy.startSession({ taskId, payload, blueprintConfig });
await transitionTask(taskId, TaskStatus.HYDRATING, TaskStatus.RUNNING, { ... });
await emitTaskEvent(taskId, 'session_started', { ... });
return handle.sessionId;
// ← no logger.info after the full sequence

The strategy itself logs 'AgentCore session invoked' at agentcore-strategy.ts:61, but that fires before the DDB transition and event emit. The original code logged after the entire
sequence completed (invoke + transition + event), confirming the step succeeded end-to-end.

Why it matters:
In a durable execution context, the invoke can succeed but the DDB transition can fail (and the step retries). The strategy-level log gives a false positive — "session invoked" appears
in logs even when the step ultimately fails. The old post-sequence log confirmed the step completed atomically.

Recommendation:
Add a logger.info after emitTaskEvent in the start-session step:

logger.info('Session started', { task_id: taskId, session_id: handle.sessionId, strategy: handle.strategyType });

@krokoko
Copy link
Copy Markdown
Contributor

krokoko commented Apr 8, 2026

Finding 6 — No integration tests for the rewritten start-session step

Category: Testing
Severity: High
Location: orchestrate-task.test.ts — deleted startSession tests at lines 194-206 and 375-387 (old), no replacements for the orchestrator-level flow

What was removed:
// Deleted: tested startSession() directly
describe('startSession', () => {
test('invokes agent runtime and transitions to RUNNING', async () => { ... });
});
describe('startSession with blueprint config', () => {
test('uses blueprint runtime ARN override', async () => { ... });
});

What was added:

  • compute-strategy.test.ts — 2 tests: factory returns correct type, factory throws on unknown type
  • agentcore-strategy.test.ts — 5 tests: type check, startSession invoke + handle, ARN override, pollSession stub, stopSession (success, failure, missing ARN)

What's missing:
The new strategy tests validate the strategy class in isolation. But the orchestrator's start-session step in orchestrate-task.ts:117-133 is itself a composition of three operations:

  1. resolveComputeStrategy(blueprintConfig)
  2. strategy.startSession({ taskId, payload, blueprintConfig })
  3. transitionTask(...) + emitTaskEvent(...)

No test validates this composition. Specifically:

  • Happy path: strategy starts → DDB transitions → event emits → returns sessionId
  • Error path: strategy.startSession throws → failTask is called → error re-thrown
  • Partial failure: strategy starts successfully → transitionTask throws (e.g., conditional check fails because task was cancelled concurrently)

The old startSession function was monolithic and tested end-to-end. The refactored version split the logic across two layers but only tested the bottom layer.

Recommendation:
Add integration-level tests for the start-session step that mock the strategy and DDB, verifying the full composition, especially the error path where failTask must be called before
re-throwing.

@krokoko
Copy link
Copy Markdown
Contributor

krokoko commented Apr 8, 2026

Finding 7 — pollSession() silently returns 'running' forever

Category: Interface contract
Severity: Medium
Location: agentcore-strategy.ts:67-70

async pollSession(_handle: SessionHandle): Promise {
// Polling is currently done at the orchestrator level via DDB reads.
// This method exists for PR 2 where different strategies may poll differently.
return { status: 'running' };
}

The ComputeStrategy interface at compute-strategy.ts:38 declares:

pollSession(handle: SessionHandle): Promise;

This returns SessionStatus, a discriminated union of 'running' | 'completed' | 'failed'. The current implementation always returns 'running'.

Why it matters:
If any code path ever calls strategy.pollSession(handle) to check whether a session has finished, it will get { status: 'running' } forever, and the polling loop will never terminate.
The caller has no way to know this method is a stub — the type system says it returns a real answer.

The existing orchestrator polling in orchestrator.ts:352-366 (pollTaskStatus) reads from DDB, which is the actual source of truth. So currently pollSession is dead code. But:

  1. A future developer seeing the ComputeStrategy interface will reasonably assume pollSession works and use it.
  2. The test at agentcore-strategy.test.ts:85-92 asserts result.status is 'running', which locks in the stub behavior and will pass even when someone intends to implement it.

Recommendation:
Either:

  • Throw new Error('Not implemented') — makes it immediately clear this is a stub and any caller will fail loudly, or
  • Remove the method from the interface entirely until it's needed (YAGNI), or
  • Add a @deprecated / JSDoc warning and name the test clearly (e.g., 'pollSession is a stub — returns running (NOT IMPLEMENTED)')

@MichaelWalker-git
Copy link
Copy Markdown
Author

@krokoko All 7 findings addressed in d68edd6. Summary:

Finding 1 — Dual source of truth for runtimeArn: Removed runtimeArn from the constructor. AgentCoreComputeStrategy is now stateless — reads runtime_arn exclusively from blueprintConfig in startSession(). Factory no longer passes constructor args.

Finding 2 — Lost SDK client singleton: Hoisted BedrockAgentCoreClient to a module-level lazy singleton via getClient(). Created once on first use, reused across all strategy instances within the same Lambda execution context.

Finding 3 — compute_type is untyped string: Added ComputeType = 'agentcore' | 'ecs' union type in repo-config.ts. Updated BlueprintConfig.compute_type and RepoConfig.compute_type to use it. Added exhaustiveness checking with never pattern in resolveComputeStrategy switch — adding a new compute type without a case arm is now a compile error.

Finding 4 — stopSession swallows all errors: Distinguished error types:

  • ResourceNotFoundExceptionlogger.info (session already gone, benign)
  • ThrottlingException / AccessDeniedExceptionlogger.error (actionable, potential billing leak or config error)
  • Other errors → logger.warn (best-effort, as before)

Finding 5 — Missing logger.info after full sequence: Added logger.info('Session started', { task_id, session_id, strategy_type }) in orchestrate-task.ts after the invoke + transition + event sequence completes.

Finding 6 — No integration tests: Created start-session-composition.test.ts with 3 tests:

  1. Happy path: strategy.startSession → transitionTask → emitTaskEvent (all succeed)
  2. Error path: strategy.startSession throws → failTask is called
  3. Partial failure: strategy succeeds but transitionTask throws → failTask is called

Finding 7 — pollSession returns 'running' forever: Now throws Error('pollSession is not implemented for AgentCore — use orchestrator-level DDB polling'). Updated test to verify it throws. Clear signal for future developers; future strategies (ECS/Fargate) will implement real compute-level polling.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 19 out of 20 changed files in this pull request and generated 5 comments.

Comments suppressed due to low confidence (1)

cdk/src/stacks/agent.ts:257

  • hasEcsBlueprint is currently hard-coded to always be false (.some(() => false)), so needsEcs effectively depends only on ABCA_ENABLE_ECS. This makes the preceding comment (“Wire ECS infrastructure when any blueprint uses compute.type: 'ecs'”) inaccurate and will silently skip provisioning ECS even if a Blueprint is later configured with compute.type: 'ecs'. Consider either removing hasEcsBlueprint entirely and documenting that ECS infra is env-flag driven, or implementing a real detection mechanism (e.g., tracking blueprint compute type in code or exposing it from the Blueprint construct).
    new CfnOutput(this, 'RepoTableName', {
      value: repoTable.table.tableName,
      description: 'Name of the DynamoDB repo config table',
    });


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

MichaelWalker-git added a commit that referenced this pull request Apr 9, 2026
- Keep gitleaks/osv-scanner enabled in CI (only disable trivy/grype/semgrep)
- Type ComputeStrategy.type and SessionHandle.strategyType as ComputeType
- Trim/filter ECS_SUBNETS to handle whitespace and trailing commas
- Handle undefined exit code in ECS pollSession (container never started)
- Scope iam:PassRole to specific ECS task/execution role ARNs
- Validate all-or-nothing ECS props in TaskOrchestrator constructor
- Remove dead hasEcsBlueprint detection; document env-flag driven approach
- Add comment noting strategy_type as additive event field
…AgentCore logic

Introduce ComputeStrategy interface with SessionHandle/SessionStatus types
and resolveComputeStrategy factory. Extract AgentCoreComputeStrategy from
orchestrator.ts. Refactor orchestrate-task handler to use strategy pattern
for session lifecycle (start/poll/stop). Pure refactor — no behavior change,
identical CloudFormation output.
The mise install step downloads tools (trivy) from GitHub releases.
Without GITHUB_TOKEN, unauthenticated requests hit the 60 req/hr
rate limit, causing flaky CI failures.
Mise uses GITHUB_API_TOKEN (not GITHUB_TOKEN) for authenticated
GitHub API requests when downloading aqua tools like trivy.
Trivy, grype, semgrep, osv-scanner, and gitleaks are only needed for
security scanning tasks, not for the build/test/synth pipeline. Disable
them via MISE_DISABLE_TOOLS to avoid GitHub API rate limits when mise
tries to download them on every PR build.
- Keep gitleaks and osv-scanner enabled in CI build (only disable
  trivy/grype/semgrep which need GitHub API downloads)
- Remove unused @aws-sdk/client-bedrock-agentcore mock from
  orchestrate-task.test.ts (SDK is no longer imported by orchestrator)
- Update PR description to note additive strategy_type event field
1. Single source of truth for runtimeArn — removed constructor param,
   strategy now reads exclusively from blueprintConfig.runtime_arn
2. Lazy singleton for BedrockAgentCoreClient — module-level shared
   client avoids creating new TLS sessions per invocation
3. ComputeType union type ('agentcore' | 'ecs') with exhaustive switch
   and never-pattern in resolveComputeStrategy
4. Differentiated error handling in stopSession — ResourceNotFoundException
   (info), ThrottlingException/AccessDeniedException (error), others (warn)
5. Added logger.info('Session started') after full invoke+transition+event
   sequence in orchestrate-task.ts
6. Added start-session-composition.test.ts with integration tests for
   happy path, error path (failTask), and partial failure (transitionTask throws)
7. pollSession now throws NotImplementedError instead of returning stale
   'running' status — clear signal for future developers
- Replace require() with ES import for BedrockAgentCoreClient mock
- Fix import ordering in start-session-composition test
Wire ECS Fargate as a compute backend behind the existing ComputeStrategy
interface, using the existing durable Lambda orchestrator. No separate
stacks or Step Functions — ECS is a strategy option alongside AgentCore.

Changes:
- EcsComputeStrategy: startSession (RunTask), pollSession (DescribeTasks
  state mapping), stopSession (StopTask with graceful error handling)
- EcsAgentCluster construct: ECS Cluster (container insights), Fargate
  task def (2 vCPU/4GB/ARM64), security group (TCP 443 egress only),
  CloudWatch log group, task role (DynamoDB, SecretsManager, Bedrock)
- TaskOrchestrator: optional ECS props for env vars and IAM policies
  (ecs:RunTask/DescribeTasks/StopTask conditioned on cluster ARN,
  iam:PassRole conditioned on ecs-tasks.amazonaws.com)
- Orchestrator polling: ECS compute-level crash detection alongside
  existing DDB polling (non-fatal, wrapped in try/catch)
- AgentStack: conditional ECS infrastructure (ABCA_ENABLE_ECS env var)
- Full test coverage: 15 ECS strategy tests, 9 construct tests,
  5 orchestrator ECS tests. All 563 tests pass.

Deployed and verified: stack deploys cleanly, CDK synth passes cdk-nag,
agent task running on AgentCore path unaffected.
- Keep gitleaks/osv-scanner enabled in CI (only disable trivy/grype/semgrep)
- Type ComputeStrategy.type and SessionHandle.strategyType as ComputeType
- Trim/filter ECS_SUBNETS to handle whitespace and trailing commas
- Handle undefined exit code in ECS pollSession (container never started)
- Scope iam:PassRole to specific ECS task/execution role ARNs
- Validate all-or-nothing ECS props in TaskOrchestrator constructor
- Remove dead hasEcsBlueprint detection; document env-flag driven approach
- Add comment noting strategy_type as additive event field
The ECS container's default CMD starts uvicorn server:app which waits
for HTTP POST to /invocations — but in standalone ECS nobody sends that
request, leaving the agent idle. Override the container command to invoke
entrypoint.run_task() directly with the full orchestrator payload via
AGENT_PAYLOAD env var. Also add GITHUB_TOKEN_SECRET_ARN to the ECS task
definition base environment.
@MichaelWalker-git MichaelWalker-git changed the title feat: introduce ComputeStrategy interface and extract AgentCoreComputeStrategy feat(compute): add ComputeStrategy interface with ECS Fargate backend Apr 9, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants