feat(compute): add ComputeStrategy interface with ECS Fargate backend #8
MichaelWalker-git wants to merge 12 commits into main
236d7bd to
bad8867
Pull request overview
Introduces a compute strategy abstraction for orchestrating agent sessions, extracting AgentCore-specific session invocation logic out of the shared orchestrator and updating the durable “start-session” step to delegate compute-specific work to the selected strategy.
Changes:
- Added a `ComputeStrategy` interface plus `resolveComputeStrategy()` factory keyed off `blueprintConfig.compute_type`.
- Extracted AgentCore session invocation/stop logic into `AgentCoreComputeStrategy`, and refactored orchestration to use it.
- Added/updated unit tests for the new strategy + factory, and adjusted existing orchestrator-related tests accordingly.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| cdk/src/handlers/shared/compute-strategy.ts | Adds strategy interface/types and a factory for selecting a compute backend. |
| cdk/src/handlers/shared/strategies/agentcore-strategy.ts | Implements AgentCore-specific session start/poll/stop behind the strategy interface. |
| cdk/src/handlers/shared/orchestrator.ts | Removes AgentCore session start logic from the shared orchestrator. |
| cdk/src/handlers/orchestrate-task.ts | Updates the durable start-session step to resolve and use a compute strategy. |
| cdk/test/handlers/shared/compute-strategy.test.ts | Tests strategy resolution and unknown compute type behavior. |
| cdk/test/handlers/shared/strategies/agentcore-strategy.test.ts | Tests AgentCore strategy session start/poll/stop behavior with SDK mocking. |
| cdk/test/handlers/orchestrate-task.test.ts | Removes old startSession tests and adjusts mocks (now mostly orchestrator-focused). |
| .github/workflows/build.yml | Updates CI environment variables and disables several security/scanning tools via MISE_DISABLE_TOOLS. |
Thank you! A couple of findings; I will add each one as a separate comment. Regarding the CI failures, don't worry about them in this PR; I'll fix them separately.
Finding 1: Dual source of truth for runtimeArn
Category: Architecture
What happens: `constructor(options: { runtimeArn: string }) {` Then in `startSession`, the method immediately overrides it from `blueprintConfig`: `const runtimeArn = input.blueprintConfig.runtime_arn ?? this.runtimeArn; // source 2`
Why it matters: `return new AgentCoreComputeStrategy({ runtimeArn: blueprintConfig.runtime_arn });` So `this.runtimeArn` and `input.blueprintConfig.runtime_arn` are always the same value at construction time. The `??` fallback can never fire. This creates the false impression that two
Recommendation:
Finding 2: Lost SDK client singleton
Category: Architecture / Performance
Before (main branch):
After (PR): The factory `resolveComputeStrategy()` is called inside the durable execution step 'start-session' at orchestrate-task.ts:120. In durable execution, steps can replay on Lambda
Why it matters:
Recommendation: `let _client: BedrockAgentCoreClient | undefined;`
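A minimal sketch of the lazy module-level singleton that this recommendation points at. `FakeAgentClient` is a hypothetical stand-in for the real SDK client, and `getClient` is an illustrative name; neither is taken from the PR.

```typescript
// Hypothetical stand-in for an SDK client such as BedrockAgentCoreClient.
// A real client would be imported from its SDK package instead.
class FakeAgentClient {
  static constructed = 0;
  readonly region: string;

  constructor(region: string) {
    this.region = region;
    FakeAgentClient.constructed += 1;
  }
}

// Module-level cache: it lives as long as the module stays loaded,
// so repeated calls in the same process reuse one client.
let cachedClient: FakeAgentClient | undefined;

// Returns the shared client, constructing it on first use only.
// Note that the region argument is ignored once a client exists.
function getClient(region: string = 'us-east-1'): FakeAgentClient {
  if (cachedClient === undefined) {
    cachedClient = new FakeAgentClient(region);
  }
  return cachedClient;
}

const first = getClient();
const second = getClient();
console.log(first === second, FakeAgentClient.constructed); // true 1
```

The trade-off in this shape is that configuration is fixed by the first caller, which is usually acceptable when the client depends only on process-wide settings.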
Finding 3: compute_type is an untyped string
Category: Types
The CDK construct already defines a strict union at blueprint.ts:50: `readonly type?: 'agentcore' | 'ecs';` But the runtime types in `BlueprintConfig` and `RepoConfig` declare it as a bare string: `// repo-config.ts:49` The factory switch in compute-strategy.ts then does: `switch (computeType) {`
Why it matters:
Recommendation: `// repo-config.ts` `export interface BlueprintConfig {` `// compute-strategy.ts`
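One way the typed union and an exhaustive switch could fit together, as a sketch. The union members come from the `blueprint.ts` line quoted above; `StrategySketch`, `resolveStrategySketch`, and `isComputeType` are hypothetical names used only for illustration.

```typescript
// Union mirroring the 'agentcore' | 'ecs' literal type quoted from blueprint.ts.
type ComputeType = 'agentcore' | 'ecs';

// Hypothetical, trimmed-down strategy shape for illustration only.
interface StrategySketch {
  readonly type: ComputeType;
}

function resolveStrategySketch(computeType: ComputeType): StrategySketch {
  switch (computeType) {
    case 'agentcore':
      return { type: 'agentcore' };
    case 'ecs':
      return { type: 'ecs' };
    default: {
      // If a member is added to ComputeType without a matching case,
      // this assignment no longer compiles.
      const unreachable: never = computeType;
      throw new Error(`Unknown compute_type: ${String(unreachable)}`);
    }
  }
}

// Runtime guard for values read from storage, where the static type is not enforced.
function isComputeType(value: unknown): value is ComputeType {
  return value === 'agentcore' || value === 'ecs';
}

console.log(resolveStrategySketch('ecs').type); // ecs
console.log(isComputeType('lambda')); // false
```

The guard matters because a record loaded from a table is only typed by assertion; the `never` check protects the switch at compile time, while the guard protects it at runtime.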
Finding 4: stopSession swallows all errors
Category: Error handling
`async stopSession(handle: SessionHandle): Promise {`
The entire catch is a `logger.warn`; the error is swallowed unconditionally.
Why it matters:
Logging at warn for all of these treats a billing leak the same as a benign no-op.
Recommendation:
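A sketch of one way to differentiate stop failures by error name. The name-to-level mapping follows the one listed in the follow-up commit message (ResourceNotFoundException as info, ThrottlingException/AccessDeniedException as error, everything else as warn); `classifyStopError` and `namedError` are hypothetical helpers, not code from the PR.

```typescript
type LogLevel = 'info' | 'warn' | 'error';

// Maps an error thrown by a stop call to a log level, keyed on the error's name.
function classifyStopError(err: unknown): LogLevel {
  const errorName = err instanceof Error ? err.name : '';
  switch (errorName) {
    case 'ResourceNotFoundException':
      return 'info'; // the session is already gone, so stopping is a no-op
    case 'ThrottlingException':
    case 'AccessDeniedException':
      return 'error'; // the session may still be running and accruing cost
    default:
      return 'warn';
  }
}

// Test helper: builds an Error whose name field carries the exception name.
function namedError(errorName: string): Error {
  const e = new Error(`simulated ${errorName}`);
  e.name = errorName;
  return e;
}

console.log(classifyStopError(namedError('ResourceNotFoundException'))); // info
console.log(classifyStopError(namedError('ThrottlingException'))); // error
console.log(classifyStopError('not an Error object')); // warn
```

Keeping the classification in a pure function makes it easy to unit test separately from the SDK call that produces the error.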
Finding 5: Missing logger.info('Session started') after full sequence
Category: Observability
Before (main branch, orchestrator.ts:340):
After (PR, orchestrate-task.ts:120-133): The strategy itself logs 'AgentCore session invoked' at agentcore-strategy.ts:61, but that fires before the DDB transition and event emit. The original code logged after the entire
Why it matters:
Recommendation: `logger.info('Session started', { task_id: taskId, session_id: handle.sessionId, strategy: handle.strategyType });`
Finding 6: No integration tests for the rewritten start-session step
Category: Testing
What was removed:
What was added:
What's missing:
No test validates this composition. Specifically:
The old `startSession` function was monolithic and tested end-to-end. The refactored version split the logic across two layers but only tested the bottom layer.
Recommendation:
Finding 7: pollSession() silently returns 'running' forever
Category: Interface contract
`async pollSession(_handle: SessionHandle): Promise<SessionStatus> {`
The `ComputeStrategy` interface at compute-strategy.ts:38 declares: `pollSession(handle: SessionHandle): Promise<SessionStatus>;` This returns `SessionStatus`, a discriminated union of 'running' | 'completed' | 'failed'. The current implementation always returns 'running'.
Why it matters: The existing orchestrator polling in orchestrator.ts:352-366 (`pollTaskStatus`) reads from DDB, which is the actual source of truth. So currently `pollSession` is dead code. But:
Recommendation:
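A sketch of the fail-loudly alternative that the follow-up commit describes (throwing `NotImplementedError` instead of returning a stale 'running'). The class shapes and the error message here are assumptions for illustration; only the error name and the three status values come from the thread.

```typescript
type SessionStatusSketch = 'running' | 'completed' | 'failed';

class NotImplementedError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'NotImplementedError';
  }
}

// Hypothetical strategy that cannot observe session state directly.
class AgentCoreStrategySketch {
  // Rejects rather than reporting a status it has no way to verify.
  async pollSession(_sessionId: string): Promise<SessionStatusSketch> {
    throw new NotImplementedError(
      'pollSession is not supported here; read task status from the task table instead',
    );
  }
}

new AgentCoreStrategySketch().pollSession('session-1').catch((err: unknown) => {
  const label = err instanceof Error ? err.name : 'unknown';
  console.log(label); // NotImplementedError
});
```

A caller that starts depending on `pollSession` then fails on the first call, instead of looping on a status that never changes.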
@krokoko All 7 findings addressed in d68edd6. Summary:
Finding 1 (Dual source of truth for runtimeArn): Removed
Finding 2 (Lost SDK client singleton): Hoisted
Finding 3 (compute_type is untyped string): Added
Finding 4 (stopSession swallows all errors): Distinguished error types:
Finding 5 (Missing logger.info after full sequence): Added
Finding 6 (No integration tests): Created
Finding 7 (pollSession returns 'running' forever): Now throws
Pull request overview
Copilot reviewed 19 out of 20 changed files in this pull request and generated 5 comments.
Comments suppressed due to low confidence (1)
cdk/src/stacks/agent.ts:257
`hasEcsBlueprint` is currently hard-coded to always be false (`.some(() => false)`), so `needsEcs` effectively depends only on `ABCA_ENABLE_ECS`. This makes the preceding comment (“Wire ECS infrastructure when any blueprint uses compute.type: 'ecs'”) inaccurate and will silently skip provisioning ECS even if a Blueprint is later configured with `compute.type: 'ecs'`. Consider either removing `hasEcsBlueprint` entirely and documenting that ECS infra is env-flag driven, or implementing a real detection mechanism (e.g., tracking blueprint compute type in code or exposing it from the `Blueprint` construct).
new CfnOutput(this, 'RepoTableName', {
value: repoTable.table.tableName,
description: 'Name of the DynamoDB repo config table',
});
- Keep gitleaks/osv-scanner enabled in CI (only disable trivy/grype/semgrep)
- Type ComputeStrategy.type and SessionHandle.strategyType as ComputeType
- Trim/filter ECS_SUBNETS to handle whitespace and trailing commas
- Handle undefined exit code in ECS pollSession (container never started)
- Scope iam:PassRole to specific ECS task/execution role ARNs
- Validate all-or-nothing ECS props in TaskOrchestrator constructor
- Remove dead hasEcsBlueprint detection; document env-flag driven approach
- Add comment noting strategy_type as additive event field
…AgentCore logic
Introduce ComputeStrategy interface with SessionHandle/SessionStatus types and resolveComputeStrategy factory. Extract AgentCoreComputeStrategy from orchestrator.ts. Refactor orchestrate-task handler to use strategy pattern for session lifecycle (start/poll/stop). Pure refactor: no behavior change, identical CloudFormation output.
The mise install step downloads tools (trivy) from GitHub releases. Without GITHUB_TOKEN, unauthenticated requests hit the 60 req/hr rate limit, causing flaky CI failures.
Mise uses GITHUB_API_TOKEN (not GITHUB_TOKEN) for authenticated GitHub API requests when downloading aqua tools like trivy.
Trivy, grype, semgrep, osv-scanner, and gitleaks are only needed for security scanning tasks, not for the build/test/synth pipeline. Disable them via MISE_DISABLE_TOOLS to avoid GitHub API rate limits when mise tries to download them on every PR build.
- Keep gitleaks and osv-scanner enabled in CI build (only disable trivy/grype/semgrep which need GitHub API downloads)
- Remove unused @aws-sdk/client-bedrock-agentcore mock from orchestrate-task.test.ts (SDK is no longer imported by orchestrator)
- Update PR description to note additive strategy_type event field
1. Single source of truth for runtimeArn — removed constructor param,
strategy now reads exclusively from blueprintConfig.runtime_arn
2. Lazy singleton for BedrockAgentCoreClient — module-level shared
client avoids creating new TLS sessions per invocation
3. ComputeType union type ('agentcore' | 'ecs') with exhaustive switch
and never-pattern in resolveComputeStrategy
4. Differentiated error handling in stopSession — ResourceNotFoundException
(info), ThrottlingException/AccessDeniedException (error), others (warn)
5. Added logger.info('Session started') after full invoke+transition+event
sequence in orchestrate-task.ts
6. Added start-session-composition.test.ts with integration tests for
happy path, error path (failTask), and partial failure (transitionTask throws)
7. pollSession now throws NotImplementedError instead of returning stale
'running' status — clear signal for future developers
- Replace require() with ES import for BedrockAgentCoreClient mock
- Fix import ordering in start-session-composition test
Wire ECS Fargate as a compute backend behind the existing ComputeStrategy interface, using the existing durable Lambda orchestrator. No separate stacks or Step Functions; ECS is a strategy option alongside AgentCore.
Changes:
- EcsComputeStrategy: startSession (RunTask), pollSession (DescribeTasks state mapping), stopSession (StopTask with graceful error handling)
- EcsAgentCluster construct: ECS Cluster (container insights), Fargate task def (2 vCPU/4GB/ARM64), security group (TCP 443 egress only), CloudWatch log group, task role (DynamoDB, SecretsManager, Bedrock)
- TaskOrchestrator: optional ECS props for env vars and IAM policies (ecs:RunTask/DescribeTasks/StopTask conditioned on cluster ARN, iam:PassRole conditioned on ecs-tasks.amazonaws.com)
- Orchestrator polling: ECS compute-level crash detection alongside existing DDB polling (non-fatal, wrapped in try/catch)
- AgentStack: conditional ECS infrastructure (ABCA_ENABLE_ECS env var)
- Full test coverage: 15 ECS strategy tests, 9 construct tests, 5 orchestrator ECS tests. All 563 tests pass.
Deployed and verified: stack deploys cleanly, CDK synth passes cdk-nag, agent task running on AgentCore path unaffected.
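The "DescribeTasks state mapping" mentioned above could look roughly like the following. This is a sketch under assumptions: the PR's actual mapping is not shown in this thread, `TaskSnapshot` is a hand-written subset of the DescribeTasks response fields (`lastStatus`, `containers[].exitCode`), and treating a missing exit code on a stopped task as a failure is one plausible reading of "container never started".

```typescript
type PollStatus = 'running' | 'completed' | 'failed';

// Minimal subset of the fields returned by ECS DescribeTasks that this sketch reads.
interface TaskSnapshot {
  lastStatus?: string;
  containers?: { exitCode?: number }[];
}

// Assumed mapping: any state before STOPPED counts as running;
// a stopped task is completed only when its first container exited with code 0.
function mapEcsTaskStatus(task: TaskSnapshot): PollStatus {
  if (task.lastStatus !== 'STOPPED') {
    return 'running';
  }
  // exitCode is undefined when the container never started; treat that as a failure.
  const exitCode = task.containers?.[0]?.exitCode;
  return exitCode === 0 ? 'completed' : 'failed';
}

console.log(mapEcsTaskStatus({ lastStatus: 'RUNNING' })); // running
console.log(mapEcsTaskStatus({ lastStatus: 'STOPPED', containers: [{ exitCode: 0 }] })); // completed
console.log(mapEcsTaskStatus({ lastStatus: 'STOPPED', containers: [{}] })); // failed
```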
- Keep gitleaks/osv-scanner enabled in CI (only disable trivy/grype/semgrep)
- Type ComputeStrategy.type and SessionHandle.strategyType as ComputeType
- Trim/filter ECS_SUBNETS to handle whitespace and trailing commas
- Handle undefined exit code in ECS pollSession (container never started)
- Scope iam:PassRole to specific ECS task/execution role ARNs
- Validate all-or-nothing ECS props in TaskOrchestrator constructor
- Remove dead hasEcsBlueprint detection; document env-flag driven approach
- Add comment noting strategy_type as additive event field
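Two of the items above (trimming `ECS_SUBNETS` and all-or-nothing prop validation) can be sketched as small pure helpers. `parseSubnets`, `assertAllOrNothing`, and the prop names in the usage lines are hypothetical; the PR's actual helpers and prop names are not shown in this thread.

```typescript
// Splits a comma-separated list, tolerating whitespace and trailing commas.
function parseSubnets(raw: string | undefined): string[] {
  return (raw ?? '')
    .split(',')
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}

// Throws when some, but not all, of the given props are defined.
function assertAllOrNothing(props: Record<string, unknown>): void {
  const keys = Object.keys(props);
  const missing = keys.filter((k) => props[k] === undefined);
  if (missing.length !== 0 && missing.length !== keys.length) {
    throw new Error(`ECS props must be provided together; missing: ${missing.join(', ')}`);
  }
}

console.log(parseSubnets(' subnet-a, subnet-b ,')); // [ 'subnet-a', 'subnet-b' ]

// All defined or all undefined passes; a partial set throws.
assertAllOrNothing({ clusterArn: 'arn:example:cluster', taskDefinitionArn: 'arn:example:taskdef' });
assertAllOrNothing({ clusterArn: undefined, taskDefinitionArn: undefined });
```

Validating in one place keeps a half-configured backend from surfacing later as a runtime failure when the first ECS task is launched.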
926778b to
eb0bfa9
The ECS container's default CMD starts uvicorn server:app which waits for HTTP POST to /invocations — but in standalone ECS nobody sends that request, leaving the agent idle. Override the container command to invoke entrypoint.run_task() directly with the full orchestrator payload via AGENT_PAYLOAD env var. Also add GITHUB_TOKEN_SECRET_ARN to the ECS task definition base environment.
Summary
Introduces a `ComputeStrategy` abstraction that decouples the orchestrator from any specific compute backend. Implements two strategies behind this interface:
`compute_type: 'ecs'` in blueprint config
Key changes
- `ComputeStrategy` interface (compute-strategy.ts): `startSession`, `pollSession`, `stopSession` with `SessionHandle` for durable serialization
- `EcsComputeStrategy` (strategies/ecs-strategy.ts): RunTask with container command override to invoke `entrypoint.run_task()` directly in batch mode (bypasses the uvicorn server that would otherwise sit idle)
- `EcsAgentCluster` CDK construct (ecs-agent-cluster.ts): ECS cluster, Fargate task definition (2 vCPU / 4 GB ARM64), security group (egress 443 only), CloudWatch log group, scoped IAM roles
- orchestrate-task.ts: Returns full `SessionHandle` from start-session step; adds ECS-level crash detection in the polling loop alongside existing DDB polling
- `TaskOrchestrator` construct: Conditional ECS env vars, IAM policies (`ecs:RunTask`/`DescribeTasks`/`StopTask`, scoped `iam:PassRole`), all-or-nothing prop validation
- `AgentStack` wiring: ECS infrastructure gated behind `ABCA_ENABLE_ECS=true` env flag; zero impact on existing deployments
Testing
- `ABCA_ENABLE_ECS=true`, submitted a task with `compute_type: 'ecs'` blueprint, agent completed in ~5 min and opened a PR via ECS Fargate
Test plan
- `mise run build` passes (compile, test, synth, lint, docs)
- `ABCA_ENABLE_ECS=true` produces valid template
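To make the interface described in this summary concrete, here is a minimal sketch of a strategy contract with a serializable handle. The method names, `sessionId`, and `strategyType` appear in the thread; everything else is an assumption, including the `taskId` input (the real `startSession` takes a richer input) and `InMemoryStrategy`, a hypothetical backend that exists only to exercise the contract.

```typescript
type ComputeType = 'agentcore' | 'ecs';
type SessionStatus = 'running' | 'completed' | 'failed';

// Plain data only, so the handle can be JSON-serialized between durable steps.
interface SessionHandle {
  sessionId: string;
  strategyType: ComputeType;
}

interface ComputeStrategy {
  readonly type: ComputeType;
  startSession(taskId: string): Promise<SessionHandle>;
  pollSession(handle: SessionHandle): Promise<SessionStatus>;
  stopSession(handle: SessionHandle): Promise<void>;
}

// Hypothetical in-memory backend used only to exercise the interface.
class InMemoryStrategy implements ComputeStrategy {
  readonly type: ComputeType = 'ecs';
  private readonly sessions = new Map<string, SessionStatus>();

  async startSession(taskId: string): Promise<SessionHandle> {
    const sessionId = `session-${taskId}`;
    this.sessions.set(sessionId, 'running');
    return { sessionId, strategyType: this.type };
  }

  async pollSession(handle: SessionHandle): Promise<SessionStatus> {
    // Unknown sessions are reported as failed in this sketch.
    return this.sessions.get(handle.sessionId) ?? 'failed';
  }

  async stopSession(handle: SessionHandle): Promise<void> {
    // This sketch records a stopped session as completed.
    this.sessions.set(handle.sessionId, 'completed');
  }
}

async function demo(): Promise<SessionStatus[]> {
  const strategy: ComputeStrategy = new InMemoryStrategy();
  const handle = await strategy.startSession('42');
  // Simulate a durable checkpoint: the handle survives a JSON round trip.
  const restored = JSON.parse(JSON.stringify(handle)) as SessionHandle;
  const before = await strategy.pollSession(restored);
  await strategy.stopSession(restored);
  const after = await strategy.pollSession(restored);
  return [before, after];
}

demo().then((statuses) => console.log(statuses)); // [ 'running', 'completed' ]
```

Because the handle carries only strings, a later step can rebuild the strategy from `strategyType` and continue polling without holding any live object across the checkpoint.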