Handle operation client recovery when monitor client is healthy

## Objective

Clarify and improve the relationship between `_operation_client` and `_monitor_client` in `MonitoringValkeyClient`. The operation client can time out even when the monitor client is responding to pings normally. Define how to detect this state and recover the operation client.

## Problem

- The operation client has a user-specified request timeout and may perform long-running operations (e.g., stream reads), making it structurally prone to timeouts independent of actual connection health
- The monitor client has a fixed 3-second timeout and is used only for pings, so it can remain healthy while the operation client is effectively broken
- The current `_check_connection()` only pings the monitor client, so it cannot detect operation client failures
- When the operation client is broken but the monitor client is fine, the current `_reconnect()` logic (which tears down both) may not even be triggered
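
The gap can be reproduced with a small simulation. The sketch below uses hypothetical stand-ins (`FakeClient`, `check_connection`, `demo`), not the real `MonitoringValkeyClient` or any Valkey client API, and the timeout values are shortened for illustration. It only shows why a monitor-only ping can report a healthy connection while an operation bounded by a shorter request timeout fails.

```python
import asyncio


class FakeClient:
    """Hypothetical stand-in for a Valkey client; not the real client API."""

    def __init__(self, request_timeout: float) -> None:
        self.request_timeout = request_timeout

    async def _call(self, duration: float) -> str:
        # Simulate a round trip that takes `duration` seconds,
        # bounded by this client's request timeout.
        async def work() -> str:
            await asyncio.sleep(duration)
            return "OK"

        return await asyncio.wait_for(work(), timeout=self.request_timeout)

    async def ping(self) -> str:
        return await self._call(0.001)

    async def blocking_read(self, block_for: float) -> str:
        return await self._call(block_for)


async def check_connection(monitor: FakeClient) -> bool:
    """Mirrors the behavior described above: only the monitor client is pinged."""
    try:
        await monitor.ping()
        return True
    except asyncio.TimeoutError:
        return False


async def demo() -> tuple:
    monitor = FakeClient(request_timeout=3.0)     # fixed, generous timeout
    operation = FakeClient(request_timeout=0.05)  # user-specified, short timeout
    monitor_ok = await check_connection(monitor)
    try:
        # A long-running read that outlasts the operation client's timeout.
        await operation.blocking_read(block_for=0.5)
        operation_ok = True
    except asyncio.TimeoutError:
        operation_ok = False
    return monitor_ok, operation_ok


result = asyncio.run(demo())
print(result)
```

In this simulation the connection check passes while the operation fails, which is the state this issue asks to detect.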

## Details

- Define the recovery strategy for the operation client when it is broken but the monitor client is healthy: reconnect only the operation client, reconnect both clients, or take another approach
- Determine how to detect operation client failure: via the internal health API, via usage-time failure tracking (see BA-5578), or a combination of the two
- Expose the operation client health status through the internal port (default 8081) health API so external monitoring systems can detect this state
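
One of the options above, usage-time failure tracking combined with an operation-only reconnect, could look roughly like the following. This is a sketch under assumptions, not a proposed final design: `OperationClientGuard`, `ClientHealth`, `failure_threshold`, and the `reconnect_operation` callback are all hypothetical names, and the real reconnect logic in `MonitoringValkeyClient` is not modeled here.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


@dataclass
class ClientHealth:
    """Hypothetical per-client status record."""

    consecutive_failures: int = 0
    reconnect_count: int = 0

    @property
    def healthy(self) -> bool:
        return self.consecutive_failures == 0


class OperationClientGuard:
    """Hypothetical wrapper that counts usage-time timeouts and reconnects
    only the operation client once a threshold is reached."""

    def __init__(
        self,
        reconnect_operation: Callable[[], Awaitable[None]],
        failure_threshold: int = 3,
    ) -> None:
        self._reconnect_operation = reconnect_operation
        self._failure_threshold = failure_threshold
        self.health = ClientHealth()

    async def run(self, op: Callable[[], Awaitable[T]]) -> T:
        try:
            result = await op()
        except asyncio.TimeoutError:
            self.health.consecutive_failures += 1
            if self.health.consecutive_failures >= self._failure_threshold:
                # Recover the operation client only; the monitor client is untouched.
                await self._reconnect_operation()
                self.health.reconnect_count += 1
                self.health.consecutive_failures = 0
            raise  # the caller still sees the original timeout
        # Any success clears the failure streak.
        self.health.consecutive_failures = 0
        return result


async def demo() -> ClientHealth:
    async def reconnect_operation() -> None:
        # Placeholder for re-creating the operation client.
        pass

    async def failing_op() -> str:
        raise asyncio.TimeoutError

    guard = OperationClientGuard(reconnect_operation, failure_threshold=3)
    for _ in range(3):
        try:
            await guard.run(failing_op)
        except asyncio.TimeoutError:
            pass
    return guard.health


health = asyncio.run(demo())
```

The threshold is there because a single timeout on a long-running operation does not necessarily mean the connection is broken. Whether a threshold, a time window, or a follow-up probe is the right signal is one of the questions this issue leaves open.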

## Code References

- `MonitoringValkeyClient`: `src/ai/backend/common/clients/valkey_client/client.py` (~line 413)
- `_operation_client` (user-specified timeout) vs. `_monitor_client` (fixed 3s timeout): same file (~lines 425-426)
- `_check_connection()`, which currently pings only the monitor client: same file (~line 457)
- `_reconnect()`, which tears down and reconnects both clients: same file (~line 551)
- `ValkeyHealthChecker`: `src/ai/backend/common/health_checker/checkers/valkey.py`
- `InternalHealthHandler`: `src/ai/backend/manager/api/rest/internal/health/handler.py`

## Acceptance Criteria

- When the operation client is broken but the monitor client is healthy, the system detects this and recovers the operation client
- The internal health API distinguishes operation and monitor client status
- Verified in both Standalone and Sentinel modes
- Tests covering the operation-client-only failure scenario are included
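
To make the second criterion concrete, here is one possible shape for a health report that separates the two clients. The field names, the `degraded` state, and the `build_health_report` helper are assumptions for illustration; the actual response schema of `InternalHealthHandler` and `ValkeyHealthChecker` is not shown in this issue and may differ.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class ValkeyClientStatus:
    """Hypothetical status for one underlying client."""

    is_healthy: bool
    error: Optional[str] = None


def build_health_report(
    operation: ValkeyClientStatus,
    monitor: ValkeyClientStatus,
) -> Dict[str, Any]:
    """Combine both statuses without hiding which client is failing."""
    if operation.is_healthy and monitor.is_healthy:
        overall = "healthy"
    elif monitor.is_healthy:
        # Operation client broken, monitor client fine: the case in this issue.
        overall = "degraded"
    else:
        # Monitor client failing, regardless of the operation client.
        overall = "unhealthy"
    return {
        "status": overall,
        "operation_client": asdict(operation),
        "monitor_client": asdict(monitor),
    }


report = build_health_report(
    ValkeyClientStatus(is_healthy=False, error="request timed out"),
    ValkeyClientStatus(is_healthy=True),
)
```

A test for the operation-client-only failure scenario could then assert that the report marks `operation_client` as unhealthy while `monitor_client` stays healthy.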


JIRA Issue: BA-5577

GitHub Issue: #10754
