Handle operation client recovery when monitor client is healthy #10754
Open
Description
Objective
Clarify and improve the relationship between _operation_client and _monitor_client in MonitoringValkeyClient. The operation client can time out even when the monitor client is responding to pings normally. Define how to detect this state and recover the operation client.
Problem
- The operation client has a user-specified request timeout and may perform long-running operations (e.g., stream reads), making it structurally prone to timeouts independent of actual connection health
- The monitor client has a fixed 3-second timeout and is used only for pings, so it can remain healthy while the operation client is effectively broken
- The current _check_connection() only pings the monitor client, so it cannot detect operation client failures
- When the operation client is broken but the monitor client is fine, the current _reconnect() logic (which tears down both) may not even be triggered
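The blind spot described above can be reproduced with a toy model. This is only a sketch: `StubClient`, `check_monitor_only`, and `check_both` are hypothetical names invented for illustration and do not reflect the actual `MonitoringValkeyClient` code or any Valkey client API. The 3-second timeout mirrors the fixed monitor timeout mentioned in this issue.

```python
import asyncio


class StubClient:
    """Hypothetical stand-in for a Valkey client; not the real client API."""

    def __init__(self, name: str, healthy: bool = True) -> None:
        self.name = name
        self.healthy = healthy

    async def ping(self) -> str:
        # A broken client is modelled as one whose requests time out.
        if not self.healthy:
            raise asyncio.TimeoutError(f"{self.name} timed out")
        return "PONG"


async def _responds(client: StubClient) -> bool:
    """Return True if the client answers a ping within 3 seconds."""
    try:
        await asyncio.wait_for(client.ping(), timeout=3.0)
        return True
    except (asyncio.TimeoutError, ConnectionError):
        return False


async def check_monitor_only(monitor: StubClient) -> bool:
    """Models the behaviour described above: only the monitor client is pinged."""
    return await _responds(monitor)


async def check_both(operation: StubClient, monitor: StubClient) -> dict[str, bool]:
    """One possible alternative: report the status of each client separately."""
    return {
        "operation": await _responds(operation),
        "monitor": await _responds(monitor),
    }
```

With a broken operation client and a healthy monitor client, `check_monitor_only` still reports success, while `check_both` exposes the mismatch. Pinging the operation client directly is only one option; a ping could itself be delayed behind a long-running operation, which is part of what this issue needs to decide.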
Details
- Define the recovery strategy for the case where the operation client is broken but the monitor client is healthy: reconnect only the operation client, reconnect both, or take another approach
- Determine how to detect operation client failure — via the internal health API, via usage-time failure tracking (see BA-5578), or a combination
- Expose the operation client health status through the internal port (default 8081) health API so external monitoring systems can detect this state
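As a starting point for discussion, the sketch below combines usage-time failure tracking with an operation-client-only reconnect and a per-client health report. Everything in it is an assumption made for illustration: `FailureTracker`, `DualClientSketch`, the `threshold` parameter, and the shape of `health_report()` are hypothetical and are not taken from the existing code or from BA-5578.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


@dataclass
class FailureTracker:
    """Hypothetical helper that counts consecutive usage-time failures."""

    threshold: int = 3
    consecutive_failures: int = 0

    def record_success(self) -> None:
        self.consecutive_failures = 0

    def record_failure(self) -> None:
        self.consecutive_failures += 1

    @property
    def healthy(self) -> bool:
        return self.consecutive_failures < self.threshold


class DualClientSketch:
    """Hypothetical wrapper; not the real MonitoringValkeyClient."""

    def __init__(
        self,
        connect_operation: Callable[[], Awaitable[object]],
        threshold: int = 3,
    ) -> None:
        self._connect_operation = connect_operation
        self._tracker = FailureTracker(threshold=threshold)
        self.operation_reconnects = 0
        # Assumed to be kept up to date by the existing monitor ping loop.
        self.monitor_healthy = True

    async def execute(self, op: Callable[[], Awaitable[T]]) -> T:
        """Run an operation and track its outcome at usage time."""
        try:
            result = await op()
        except (asyncio.TimeoutError, ConnectionError):
            self._tracker.record_failure()
            # Recover only the operation client while the monitor is healthy.
            if not self._tracker.healthy and self.monitor_healthy:
                await self._reconnect_operation_only()
            raise
        self._tracker.record_success()
        return result

    async def _reconnect_operation_only(self) -> None:
        await self._connect_operation()
        self.operation_reconnects += 1
        self._tracker.record_success()  # start counting from zero again

    def health_report(self) -> dict[str, bool]:
        """Per-client status that a health endpoint could expose."""
        return {
            "operation": self._tracker.healthy,
            "monitor": self.monitor_healthy,
        }
```

A threshold of consecutive failures is used here because a single timeout on a long-running operation (such as a stream read) does not necessarily mean the connection is broken. Whether timeouts on such operations should count at all is an open design question for this issue.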
Code References
- MonitoringValkeyClient: src/ai/backend/common/clients/valkey_client/client.py (~line 413)
- _operation_client (user-specified timeout) vs _monitor_client (fixed 3s timeout): same file (~line 425-426)
- _check_connection() — currently only pings monitor client: same file (~line 457)
- _reconnect() — tears down and reconnects both clients: same file (~line 551)
- ValkeyHealthChecker: src/ai/backend/common/health_checker/checkers/valkey.py
- InternalHealthHandler: src/ai/backend/manager/api/rest/internal/health/handler.py
Acceptance Criteria
- When the operation client is broken but the monitor client is healthy, the system detects this and recovers the operation client
- The internal health API distinguishes operation and monitor client status
- Verified in both Standalone and Sentinel modes
- Tests covering the operation-client-only failure scenario are included
JIRA Issue: BA-5577