Conversation
- Add SessionHealth enum (Healthy, Unhealthy, Starting, Unknown) and SessionExitInfo struct to the ExecutionBackend trait for container liveness detection and failure diagnostics
- Implement session_health(), session_exit_info(), and shutdown_all_sessions() on DockerBackend using bollard container inspection (health check status, exit codes, OOM detection)
- Update AgentManager::reconcile() to use exit info for smarter status transitions: exit code 0 → Stopped, non-zero → Failed
- Add orphaned session cleanup during reconciliation (sessions with backend prefix but no matching DB record are removed)
- Add AgentManager::shutdown_all() for graceful shutdown with configurable leave-running behavior (AGENTD_SHUTDOWN_LEAVE_RUNNING)
- Add HEALTHCHECK directive to Dockerfile (claude --version every 30s)
- Log container lifecycle events (start, stop, remove) for observability

Closes #289

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

@@ Coverage Diff @@
## main #324 +/- ##
==========================================
- Coverage 44.58% 43.99% -0.60%
==========================================
Files 80 80
Lines 7354 7472 +118
==========================================
+ Hits 3279 3287 +8
- Misses 4075 4185 +110
geoffjay left a comment
Clean, well-scoped PR. The trait extension is backward-compatible (all new methods have sensible defaults), the reconcile improvements are a genuine correctness upgrade (exit code 0 → Stopped vs Failed), and the orphan cleanup closes an important gap. No blocking issues. One consistency concern and a few minor notes below.
```rust
if !state.running.unwrap_or(false) {
    return Ok(SessionHealth::Unknown);
}
```
`format!("{:?}", status).to_lowercase()` uses the Debug representation of `bollard::models::HealthStatusEnum` for string matching. This is fragile — a bollard version bump that changes the Debug format would silently break health detection. Notably, the existing `session_exists()` in this same file already matches `ContainerStateStatusEnum` by variant directly (`== Some(ContainerStateStatusEnum::CREATED)`). The same pattern should work here:
```rust
use bollard::models::HealthStatusEnum;

match status {
    HealthStatusEnum::HEALTHY => Ok(SessionHealth::Healthy),
    HealthStatusEnum::UNHEALTHY => Ok(SessionHealth::Unhealthy),
    HealthStatusEnum::STARTING => Ok(SessionHealth::Starting),
    _ => Ok(SessionHealth::Unknown),
}
```

The same applies to `container_state()` at line 320, though that method is diagnostics-only, so the risk is lower.
docker/claude-code/Dockerfile (Outdated)
```dockerfile
# ── Health check ────────────────────────────────────────────────────
# Docker HEALTHCHECK used by the orchestrator to detect container liveness.
# Checks that the claude CLI binary is still accessible and functional.
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
```
Consider adding `--start-period` to the HEALTHCHECK. Without it, the default is 0s, so Docker starts counting retries immediately from container start. If the agent container ever takes a moment to become ready (e.g., slow node.js startup), it could be marked unhealthy before it has had a chance to become healthy. A small grace window like `--start-period=10s` would be more defensive.
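Concretely, the amended directive could look something like this. A sketch only: the `CMD` line is an assumption based on the commit message's "claude --version every 30s"; the actual check command may differ.

```dockerfile
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
    CMD claude --version || exit 1
```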
```rust
    let url_default = backend.agent_ws_url("test-prefix-abc123", None);
    assert_eq!(url_default, Some("ws://host.docker.internal:7006/ws/abc123".to_string()));
}
```
This test only holds a reference to `backend` without calling `container_state` or asserting any return value — the compiler would catch a missing method at build time anyway. Consider replacing it with an actual unit-testable assertion (e.g., confirming the 404-not-found path returns `Ok(None)` using the existing `is_not_found` path logic), or removing it in favour of the equivalent tests already in `backend.rs`.
```rust
    if let Err(e) = self.storage.update(&agent).await {
        error!(agent_id = %agent.id, %e, "Failed to update agent status");
    }
} else if !self.registry.is_connected(&agent.id).await {
```
The health value is fetched and logged here but does not influence the restart decision — the agent is always restarted when disconnected from the registry regardless of health status. This is the right behaviour (stale WS must be replaced), but a brief comment explaining why health is checked (observability/logging only, not gating the restart) would help future readers.
- Use bollard HealthStatusEnum variant matching instead of fragile Debug string formatting in session_health()
- Use Display (to_string) instead of Debug format in container_state()
- Add --start-period=10s to Dockerfile HEALTHCHECK for startup grace
- Remove useless container_state_method_exists test
- Clarify that health check in reconcile is for observability only, not gating the restart decision

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary

- Add `SessionHealth` enum (`Healthy`, `Unhealthy`, `Starting`, `Unknown`) and `SessionExitInfo` struct to the `ExecutionBackend` trait for container liveness detection and failure diagnostics
- Implement `session_health()`, `session_exit_info()`, and `shutdown_all_sessions()` on `DockerBackend` using bollard container inspection (health check status, exit codes, OOM detection)
- Update `AgentManager::reconcile()` to use exit info for smarter status transitions: exit code 0 → `Stopped`, non-zero → `Failed`
- Add `AgentManager::shutdown_all()` for graceful shutdown with configurable leave-running behavior (`AGENTD_SHUTDOWN_LEAVE_RUNNING`)
- Add `HEALTHCHECK` directive to Dockerfile (`claude --version` every 30s)

Closes #289
Test plan

- `cargo build` clean
- `AGENTD_SHUTDOWN_LEAVE_RUNNING=true` leaves containers running
- `docker inspect`

🤖 Generated with Claude Code