
Add reconciliation and health check support for Docker backend #324

Merged
geoffjay merged 4 commits into main from issue-289
Mar 11, 2026

@geoffjay
Owner

Summary

  • Add SessionHealth enum (Healthy, Unhealthy, Starting, Unknown) and SessionExitInfo struct to the ExecutionBackend trait for container liveness detection and failure diagnostics
  • Implement session_health(), session_exit_info(), and shutdown_all_sessions() on DockerBackend using bollard container inspection (health check status, exit codes, OOM detection)
  • Update AgentManager::reconcile() to use exit info for smarter status transitions: exit code 0 → Stopped, non-zero → Failed
  • Add orphaned session cleanup during reconciliation (sessions with backend prefix but no matching DB record are removed)
  • Add AgentManager::shutdown_all() for graceful shutdown with configurable leave-running behavior (AGENTD_SHUTDOWN_LEAVE_RUNNING)
  • Add HEALTHCHECK directive to Dockerfile (claude --version every 30s)
  • Log container lifecycle events (start, stop, remove) for observability
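
For readers skimming the summary, the shape of the trait extension can be sketched roughly as follows. This is a std-only illustration, not the code in crates/wrap: the field names on SessionExitInfo, the synchronous signatures, the String error type, the lowercase Display output, the default return values, and the NoopBackend type are all assumptions made for the example.

```rust
use std::fmt;

/// Liveness states named in the summary.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SessionHealth {
    Healthy,
    Unhealthy,
    Starting,
    Unknown,
}

impl fmt::Display for SessionHealth {
    // Lowercase output is an assumption of this sketch.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let s = match self {
            SessionHealth::Healthy => "healthy",
            SessionHealth::Unhealthy => "unhealthy",
            SessionHealth::Starting => "starting",
            SessionHealth::Unknown => "unknown",
        };
        f.write_str(s)
    }
}

/// Exit diagnostics; the field names are assumed for illustration.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SessionExitInfo {
    pub exit_code: Option<i64>,
    pub oom_killed: bool,
}

/// Simplified, assumed form of the trait. Default bodies mean existing
/// backends keep compiling without implementing the new methods.
pub trait ExecutionBackend {
    fn session_health(&self, _session: &str) -> Result<SessionHealth, String> {
        Ok(SessionHealth::Unknown)
    }

    fn session_exit_info(&self, _session: &str) -> Result<Option<SessionExitInfo>, String> {
        Ok(None)
    }
}

/// Hypothetical backend that overrides nothing and relies on the defaults.
pub struct NoopBackend;

impl ExecutionBackend for NoopBackend {}
```

A backend that cannot report health simply inherits the Unknown / None defaults, which is what keeps the change backward-compatible in this sketch.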

Closes #289

Test plan

  • All 77 wrap lib tests pass (13 new: SessionHealth serde/display, SessionExitInfo, default trait method behavior)
  • All 102 orchestrator lib tests pass
  • Full workspace cargo build clean
  • Manual: Docker backend reconciliation with exited containers (exit code 0 vs non-zero)
  • Manual: Orphaned container cleanup on orchestrator restart
  • Manual: Graceful shutdown stops all Docker containers
  • Manual: AGENTD_SHUTDOWN_LEAVE_RUNNING=true leaves containers running
  • Manual: Docker HEALTHCHECK visible via docker inspect
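
The orphan-cleanup case in the manual plan boils down to a set difference between backend sessions and DB records. A minimal sketch of that selection step is below; the function name find_orphans and the `<prefix>-<agent_id>` session naming are assumptions for illustration, and the real reconciliation code may identify sessions differently.

```rust
use std::collections::HashSet;

/// Returns the backend sessions that carry `prefix` but whose agent ID has
/// no record in `known_ids`. Sessions without the prefix are ignored, since
/// they are not ours to remove.
/// The `<prefix>-<agent_id>` naming scheme is an assumption of this sketch.
pub fn find_orphans<'a>(
    sessions: &'a [String],
    prefix: &str,
    known_ids: &HashSet<String>,
) -> Vec<&'a str> {
    let marker = format!("{prefix}-");
    sessions
        .iter()
        .filter_map(|name| {
            // Skip sessions that do not start with "<prefix>-".
            let id = name.strip_prefix(marker.as_str())?;
            // Keep only sessions whose ID is missing from the DB set.
            (!known_ids.contains(id)).then_some(name.as_str())
        })
        .collect()
}
```

Each returned name would then be handed to the backend for removal.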

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@geoffjay added the review-agent label (used to invoke a review by an agent tracking this label) on Mar 11, 2026
@codecov

codecov bot commented Mar 11, 2026

Codecov Report

❌ Patch coverage is 10.44776% with 120 lines in your changes missing coverage. Please review.
✅ Project coverage is 43.99%. Comparing base (6f93362) to head (cac3736).
⚠️ Report is 5 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| crates/wrap/src/docker.rs | 0.00% | 55 Missing ⚠️ |
| crates/orchestrator/src/manager.rs | 0.00% | 54 Missing ⚠️ |
| crates/wrap/src/backend.rs | 62.50% | 6 Missing ⚠️ |
| crates/orchestrator/src/main.rs | 0.00% | 3 Missing ⚠️ |
| crates/cli/src/commands/apply.rs | 0.00% | 1 Missing ⚠️ |
| crates/cli/src/commands/orchestrator.rs | 50.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #324      +/-   ##
==========================================
- Coverage   44.58%   43.99%   -0.60%     
==========================================
  Files          80       80              
  Lines        7354     7472     +118     
==========================================
+ Hits         3279     3287       +8     
- Misses       4075     4185     +110     


Owner Author

@geoffjay geoffjay left a comment

Clean, well-scoped PR. The trait extension is backward-compatible (all new methods have sensible defaults), the reconcile improvements are a genuine correctness upgrade (exit code 0 → Stopped vs Failed), and the orphan cleanup closes an important gap. No blocking issues. One consistency concern and a few minor notes below.
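
To make the Stopped-vs-Failed rule concrete, here is a small sketch of the mapping. The type names AgentStatus and ExitInfo, the field names, and the decision to treat an OOM kill or missing exit info as Failed are assumptions for illustration; the PR only states that exit code 0 maps to Stopped and non-zero to Failed.

```rust
/// Subset of agent statuses relevant to reconciliation (names assumed).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AgentStatus {
    Stopped,
    Failed,
}

/// Assumed shape of the exit diagnostics returned by the backend.
#[derive(Debug, Clone, Copy)]
pub struct ExitInfo {
    pub exit_code: Option<i64>,
    pub oom_killed: bool,
}

/// Exit code 0 is a clean stop; anything else is a failure.
/// Treating an OOM kill or absent exit info as `Failed` is a choice made
/// for this sketch, not something the PR states.
pub fn status_for_exit(info: Option<ExitInfo>) -> AgentStatus {
    match info {
        Some(ExitInfo { exit_code: Some(0), oom_killed: false }) => AgentStatus::Stopped,
        _ => AgentStatus::Failed,
    }
}
```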

if !state.running.unwrap_or(false) {
    return Ok(SessionHealth::Unknown);
}

Owner Author

format!("{:?}", status).to_lowercase() uses the Debug representation of bollard::models::HealthStatusEnum for string matching. This is fragile — a bollard version bump that changes the Debug format would silently break health detection. Notably, the existing session_exists() in this same file already matches ContainerStateStatusEnum by variant directly (== Some(ContainerStateStatusEnum::CREATED)). The same pattern should work here:

use bollard::models::HealthStatusEnum;

match status {
    HealthStatusEnum::HEALTHY => Ok(SessionHealth::Healthy),
    HealthStatusEnum::UNHEALTHY => Ok(SessionHealth::Unhealthy),
    HealthStatusEnum::STARTING => Ok(SessionHealth::Starting),
    _ => Ok(SessionHealth::Unknown),
}

Same applies to container_state() at line 320, though that method is diagnostics-only so the risk is lower.
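
For comparison, here are the two approaches side by side in a std-only sketch. UpstreamHealth is a local stand-in, not bollard's actual type, and the Option wrapper on the status is assumed; the point is only that the first function depends on the text Debug prints, while the second is checked by the compiler.

```rust
/// Local stand-in for an upstream status enum; not bollard's actual type.
#[allow(non_camel_case_types)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum UpstreamHealth {
    NONE,
    STARTING,
    HEALTHY,
    UNHEALTHY,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SessionHealth {
    Healthy,
    Unhealthy,
    Starting,
    Unknown,
}

/// Fragile: relies on the exact text that `Debug` happens to print.
/// If upstream changed that text, every arm would silently fall through
/// to `Unknown`.
pub fn health_via_debug(status: UpstreamHealth) -> SessionHealth {
    match format!("{:?}", status).to_lowercase().as_str() {
        "healthy" => SessionHealth::Healthy,
        "unhealthy" => SessionHealth::Unhealthy,
        "starting" => SessionHealth::Starting,
        _ => SessionHealth::Unknown,
    }
}

/// Robust: a renamed or removed variant becomes a compile error instead of
/// a silent behaviour change.
pub fn health_via_variant(status: Option<UpstreamHealth>) -> SessionHealth {
    match status {
        Some(UpstreamHealth::HEALTHY) => SessionHealth::Healthy,
        Some(UpstreamHealth::UNHEALTHY) => SessionHealth::Unhealthy,
        Some(UpstreamHealth::STARTING) => SessionHealth::Starting,
        Some(UpstreamHealth::NONE) | None => SessionHealth::Unknown,
    }
}
```

Both functions agree today; only the second keeps agreeing if the Debug output ever changes.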

# ── Health check ────────────────────────────────────────────────────
# Docker HEALTHCHECK used by the orchestrator to detect container liveness.
# Checks that the claude CLI binary is still accessible and functional.
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
Owner Author

Consider adding --start-period to the HEALTHCHECK. Without it the default is 0s, so failed probes count toward --retries from the moment the container starts. If the agent container ever takes a moment to become ready (e.g., slow Node.js startup), it could be marked unhealthy before it has had a chance to pass a check. A small grace window like --start-period=10s would be more defensive.
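
To illustrate the effect of the grace window, here is a simplified model of Docker's documented rule: a failed probe inside the start period is not counted toward the retry limit as long as the container has never reported healthy. The simulate function and the probe timings are hypothetical and chosen only to show the rule; they do not reflect the 30s interval in this Dockerfile or Docker's actual scheduler.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Health {
    Starting,
    Healthy,
    Unhealthy,
}

/// One probe: seconds since container start, and whether it passed.
pub type Probe = (u64, bool);

/// Replays `probes` in order and returns the resulting health status.
/// Simplified model: real Docker also applies intervals and timeouts.
pub fn simulate(probes: &[Probe], start_period_secs: u64, retries: u32) -> Health {
    let mut status = Health::Starting;
    let mut failing_streak = 0u32;

    for &(at_secs, passed) in probes {
        if passed {
            // Any success marks the container healthy and resets the streak.
            status = Health::Healthy;
            failing_streak = 0;
        } else {
            // Failures inside the start period are ignored only while the
            // container has never left the Starting state.
            let in_grace = status == Health::Starting && at_secs < start_period_secs;
            if !in_grace {
                failing_streak += 1;
                if failing_streak >= retries {
                    status = Health::Unhealthy;
                }
            }
        }
    }
    status
}
```

With three early failures and retries set to 3, a zero start period ends in Unhealthy, while a start period that covers those probes leaves the container in Starting.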

let url_default = backend.agent_ws_url("test-prefix-abc123", None);
assert_eq!(url_default, Some("ws://host.docker.internal:7006/ws/abc123".to_string()));
}

Owner Author

This test only holds a reference to backend without calling container_state or asserting any return value — the compiler would catch a missing method at build time anyway. Consider replacing with an actual unit-testable assertion (e.g., confirming the 404-not-found path returns Ok(None) using the existing is_not_found path logic), or removing it in favour of the equivalent tests already in backend.rs.
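
One possible shape for such a test is to pull the error-to-Option mapping into a pure function that can be exercised without a Docker daemon. The ClientError type, the state_from_inspect function, and this version of is_not_found are hypothetical stand-ins; the real helper in docker.rs works on bollard's error type and may look different.

```rust
/// Stand-in for a Docker client error; only the HTTP status matters here.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ClientError {
    pub status_code: u16,
    pub message: String,
}

/// Hypothetical equivalent of the `is_not_found` check mentioned above.
pub fn is_not_found(err: &ClientError) -> bool {
    err.status_code == 404
}

/// Maps an inspect result to an optional state string: a missing container
/// becomes `Ok(None)`, and every other error is propagated unchanged.
pub fn state_from_inspect(
    result: Result<String, ClientError>,
) -> Result<Option<String>, ClientError> {
    match result {
        Ok(state) => Ok(Some(state)),
        Err(e) if is_not_found(&e) => Ok(None),
        Err(e) => Err(e),
    }
}
```

A unit test can then assert on all three branches directly instead of only checking that the method exists.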

    if let Err(e) = self.storage.update(&agent).await {
        error!(agent_id = %agent.id, %e, "Failed to update agent status");
    }
} else if !self.registry.is_connected(&agent.id).await {
Owner Author

The health value is fetched and logged here but does not influence the restart decision — the agent is always restarted when disconnected from the registry regardless of health status. This is the right behaviour (stale WS must be replaced), but a brief comment explaining why health is checked (observability/logging only, not gating the restart) would help future readers.
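
A sketch of the intent, with the decision separated from the logging: health appears in the log note but never in the condition. The should_restart function and its return shape are hypothetical, and what happens to a session that is still connected is outside the quoted snippet, so the sketch simply leaves it alone.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SessionHealth {
    Healthy,
    Unhealthy,
    Starting,
    Unknown,
}

/// Decides whether a session should be restarted and builds a log note.
/// `health` is recorded for observability only; the decision depends solely
/// on `connected`, because a stale WebSocket must be replaced either way.
pub fn should_restart(connected: bool, health: SessionHealth) -> (bool, String) {
    let restart = !connected;
    let note = format!("connected={connected} health={health:?} restart={restart}");
    (restart, note)
}
```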

@geoffjay removed the review-agent label (used to invoke a review by an agent tracking this label) on Mar 11, 2026
geoffjay and others added 2 commits March 11, 2026 14:58
- Use bollard HealthStatusEnum variant matching instead of fragile
  Debug string formatting in session_health()
- Use Display (to_string) instead of Debug format in container_state()
- Add --start-period=10s to Dockerfile HEALTHCHECK for startup grace
- Remove useless container_state_method_exists test
- Clarify that health check in reconcile is for observability only,
  not gating the restart decision

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@geoffjay geoffjay merged commit 833e18d into main Mar 11, 2026
8 of 11 checks passed
@geoffjay geoffjay deleted the issue-289 branch March 11, 2026 22:08