Add reconciliation and health check support for Docker backend by geoffjay · Pull Request #324 · geoffjay/agentd

geoffjay · 2026-03-11T21:18:13Z

Summary

Add SessionHealth enum (Healthy, Unhealthy, Starting, Unknown) and SessionExitInfo struct to the ExecutionBackend trait for container liveness detection and failure diagnostics
Implement session_health(), session_exit_info(), and shutdown_all_sessions() on DockerBackend using bollard container inspection (health check status, exit codes, OOM detection)
Update AgentManager::reconcile() to use exit info for smarter status transitions: exit code 0 → Stopped, non-zero → Failed
Add orphaned session cleanup during reconciliation (sessions with backend prefix but no matching DB record are removed)
Add AgentManager::shutdown_all() for graceful shutdown with configurable leave-running behavior (AGENTD_SHUTDOWN_LEAVE_RUNNING)
Add HEALTHCHECK directive to Dockerfile (claude --version every 30s)
Log container lifecycle events (start, stop, remove) for observability
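
The exit-code rule in the summary can be sketched as a small pure function. This is only an illustration, not the code from the PR: the `SessionExitInfo` field names, the `AgentStatus` enum, and the handling of a missing exit code or an OOM kill (both treated as Failed here) are assumptions.

```rust
/// Exit diagnostics for a finished session.
/// Field names are assumptions made for this sketch.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SessionExitInfo {
    /// Exit code reported by the backend, if one is available.
    pub exit_code: Option<i64>,
    /// Whether the backend reported an out-of-memory kill.
    pub oom_killed: bool,
}

/// Hypothetical agent status, reduced to the two outcomes discussed in the PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AgentStatus {
    Stopped,
    Failed,
}

/// Maps exit info to a status: exit code 0 becomes Stopped, anything else Failed.
/// Treating an OOM kill or a missing exit code as Failed is an assumption.
pub fn status_after_exit(info: &SessionExitInfo) -> AgentStatus {
    match (info.oom_killed, info.exit_code) {
        (false, Some(0)) => AgentStatus::Stopped,
        _ => AgentStatus::Failed,
    }
}
```

Keeping the mapping free of I/O means it can be covered by unit tests without a Docker daemon.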

Closes #289

Test plan

All 77 wrap lib tests pass (13 new: SessionHealth serde/display, SessionExitInfo, default trait method behavior)
All 102 orchestrator lib tests pass
Full workspace cargo build clean
Manual: Docker backend reconciliation with exited containers (exit code 0 vs non-zero)
Manual: Orphaned container cleanup on orchestrator restart
Manual: Graceful shutdown stops all Docker containers
Manual: AGENTD_SHUTDOWN_LEAVE_RUNNING=true leaves containers running
Manual: Docker HEALTHCHECK visible via docker inspect
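
For the AGENTD_SHUTDOWN_LEAVE_RUNNING check, a rough sketch of how the flag could be interpreted is shown below. The function names are hypothetical, and accepting `1` as well as a case-insensitive `true` is an assumption; the PR only shows the value `true`.

```rust
/// Interprets a raw AGENTD_SHUTDOWN_LEAVE_RUNNING value.
/// Accepting "1" and ignoring case and surrounding whitespace are assumptions.
pub fn leave_running(raw: Option<&str>) -> bool {
    matches!(
        raw.map(|v| v.trim().to_ascii_lowercase()).as_deref(),
        Some("true") | Some("1")
    )
}

/// Reads the flag from the process environment; an unset variable means "stop everything".
pub fn leave_running_from_env() -> bool {
    leave_running(std::env::var("AGENTD_SHUTDOWN_LEAVE_RUNNING").ok().as_deref())
}
```

Splitting the parsing from the environment lookup keeps the rule testable without mutating process-wide state.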

🤖 Generated with Claude Code

- Add SessionHealth enum (Healthy, Unhealthy, Starting, Unknown) and SessionExitInfo struct to the ExecutionBackend trait for container liveness detection and failure diagnostics
- Implement session_health(), session_exit_info(), and shutdown_all_sessions() on DockerBackend using bollard container inspection (health check status, exit codes, OOM detection)
- Update AgentManager::reconcile() to use exit info for smarter status transitions: exit code 0 → Stopped, non-zero → Failed
- Add orphaned session cleanup during reconciliation (sessions with backend prefix but no matching DB record are removed)
- Add AgentManager::shutdown_all() for graceful shutdown with configurable leave-running behavior (AGENTD_SHUTDOWN_LEAVE_RUNNING)
- Add HEALTHCHECK directive to Dockerfile (claude --version every 30s)
- Log container lifecycle events (start, stop, remove) for observability

Closes #289

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

codecov · 2026-03-11T21:25:59Z

Codecov Report

❌ Patch coverage is 10.44776% with 120 lines in your changes missing coverage. Please review.
✅ Project coverage is 43.99%. Comparing base (6f93362) to head (cac3736).
⚠️ Report is 5 commits behind head on main.

Files with missing lines	Patch %	Lines
crates/wrap/src/docker.rs	0.00%	55 Missing ⚠️
crates/orchestrator/src/manager.rs	0.00%	54 Missing ⚠️
crates/wrap/src/backend.rs	62.50%	6 Missing ⚠️
crates/orchestrator/src/main.rs	0.00%	3 Missing ⚠️
crates/cli/src/commands/apply.rs	0.00%	1 Missing ⚠️
crates/cli/src/commands/orchestrator.rs	50.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #324      +/-   ##
==========================================
- Coverage   44.58%   43.99%   -0.60%     
==========================================
  Files          80       80              
  Lines        7354     7472     +118     
==========================================
+ Hits         3279     3287       +8     
- Misses       4075     4185     +110

geoffjay

Clean, well-scoped PR. The trait extension is backward-compatible (all new methods have sensible defaults), the reconcile improvements are a genuine correctness upgrade (exit code 0 → Stopped vs Failed), and the orphan cleanup closes an important gap. No blocking issues. One consistency concern and a few minor notes below.
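
The orphan cleanup mentioned above comes down to a set difference between backend sessions and database records. The sketch below is not the PR's implementation: `find_orphans` is a hypothetical helper, and the `{prefix}-{agent_id}` session naming is an assumption inferred from the `test-prefix-abc123` session name in the existing docker.rs test.

```rust
use std::collections::HashSet;

/// Returns the backend session names that carry `prefix` but whose agent ID
/// has no matching record in `known_ids`.
/// Assumes sessions are named `{prefix}-{agent_id}`.
pub fn find_orphans<'a>(
    sessions: &'a [String],
    prefix: &str,
    known_ids: &HashSet<String>,
) -> Vec<&'a str> {
    let marker = format!("{prefix}-");
    sessions
        .iter()
        .filter_map(|name| {
            // Sessions without the backend prefix belong to someone else; skip them.
            let id = name.strip_prefix(marker.as_str())?;
            // Keep only sessions whose ID is unknown to the database.
            (!known_ids.contains(id)).then_some(name.as_str())
        })
        .collect()
}
```

Sessions that lack the prefix are never returned, so containers started by other tools would be left untouched under this scheme.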

geoffjay · 2026-03-11T21:30:32Z

crates/wrap/src/docker.rs

+                if !state.running.unwrap_or(false) {
+                    return Ok(SessionHealth::Unknown);
+                }
+


format!("{:?}", status).to_lowercase() uses the Debug representation of bollard::models::HealthStatusEnum for string matching. This is fragile — a bollard version bump that changes the Debug format would silently break health detection. Notably, the existing session_exists() in this same file already matches ContainerStateStatusEnum by variant directly (== Some(ContainerStateStatusEnum::CREATED)). The same pattern should work here:

    use bollard::models::HealthStatusEnum;

    match status {
        HealthStatusEnum::HEALTHY => Ok(SessionHealth::Healthy),
        HealthStatusEnum::UNHEALTHY => Ok(SessionHealth::Unhealthy),
        HealthStatusEnum::STARTING => Ok(SessionHealth::Starting),
        _ => Ok(SessionHealth::Unknown),
    }

Same applies to container_state() at line 320, though that method is diagnostics-only so the risk is lower.
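
To make the suggestion testable without a Docker daemon, the mapping could be pulled into a pure function. The following is a sketch only: the local `HealthStatusEnum` is a stand-in for `bollard::models::HealthStatusEnum` so the example compiles without bollard, and `health_from_state` is a hypothetical helper name.

```rust
/// Session liveness as described in the PR summary.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SessionHealth {
    Healthy,
    Unhealthy,
    Starting,
    Unknown,
}

/// Stand-in for `bollard::models::HealthStatusEnum`, declared locally so the
/// sketch has no external dependencies. The variant list is an assumption.
#[allow(non_camel_case_types)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HealthStatusEnum {
    EMPTY,
    NONE,
    STARTING,
    HEALTHY,
    UNHEALTHY,
}

/// Maps the inspected container state to a `SessionHealth` by matching on
/// variants rather than on a formatted string.
pub fn health_from_state(
    running: Option<bool>,
    status: Option<HealthStatusEnum>,
) -> SessionHealth {
    // A container that is not running has no meaningful health status.
    if !running.unwrap_or(false) {
        return SessionHealth::Unknown;
    }
    match status {
        Some(HealthStatusEnum::HEALTHY) => SessionHealth::Healthy,
        Some(HealthStatusEnum::UNHEALTHY) => SessionHealth::Unhealthy,
        Some(HealthStatusEnum::STARTING) => SessionHealth::Starting,
        // No health check configured, or a status this code does not know about.
        _ => SessionHealth::Unknown,
    }
}
```

Because the match is on variants, a renamed or removed variant in a future bollard release would surface as a compile error instead of a silent fallthrough to Unknown.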

geoffjay · 2026-03-11T21:30:32Z

docker/claude-code/Dockerfile

+# ── Health check ────────────────────────────────────────────────────
+# Docker HEALTHCHECK used by the orchestrator to detect container liveness.
+# Checks that the claude CLI binary is still accessible and functional.
+HEALTHCHECK --interval=30s --timeout=5s --retries=3 \


Consider adding --start-period to the HEALTHCHECK. Without it the default is 0s, so Docker starts counting retries immediately from container start. If the agent container ever takes a moment to become ready (e.g., slow Node.js startup), it could be marked unhealthy before it has had a chance to become healthy. A small grace window like --start-period=10s would be more defensive.

geoffjay · 2026-03-11T21:30:32Z

crates/wrap/src/docker.rs

        let url_default = backend.agent_ws_url("test-prefix-abc123", None);
        assert_eq!(url_default, Some("ws://host.docker.internal:7006/ws/abc123".to_string()));
    }
+


This test only holds a reference to backend without calling container_state or asserting any return value — the compiler would catch a missing method at build time anyway. Consider replacing with an actual unit-testable assertion (e.g., confirming the 404-not-found path returns Ok(None) using the existing is_not_found path logic), or removing it in favour of the equivalent tests already in backend.rs.

geoffjay · 2026-03-11T21:30:32Z

crates/orchestrator/src/manager.rs

                if let Err(e) = self.storage.update(&agent).await {
                    error!(agent_id = %agent.id, %e, "Failed to update agent status");
                }
            } else if !self.registry.is_connected(&agent.id).await {


The health value is fetched and logged here but does not influence the restart decision — the agent is always restarted when disconnected from the registry regardless of health status. This is the right behaviour (stale WS must be replaced), but a brief comment explaining why health is checked (observability/logging only, not gating the restart) would help future readers.

- Use bollard HealthStatusEnum variant matching instead of fragile Debug string formatting in session_health()
- Use Display (to_string) instead of Debug format in container_state()
- Add --start-period=10s to Dockerfile HEALTHCHECK for startup grace
- Remove useless container_state_method_exists test
- Clarify that health check in reconcile is for observability only, not gating the restart decision

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

geoffjay added the review-agent label (used to invoke a review by an agent tracking this label) Mar 11, 2026

fix: run cargo fmt

ae07a96

geoffjay removed the review-agent label Mar 11, 2026

geoffjay and others added 2 commits March 11, 2026 14:58

fix: run cargo fmt

cac3736

geoffjay merged commit 833e18d into main Mar 11, 2026
8 of 11 checks passed

geoffjay deleted the issue-289 branch March 11, 2026 22:08

geoffjay merged 4 commits into main from issue-289