Harden Standalone/Sentinel reconnection logic #10753

@jopemachine

Description

Objective

Strengthen the reconnection logic for both Standalone and Sentinel modes so that recovery succeeds even in abnormal termination scenarios such as socket deletion or container restarts.

Details

  • ValkeyStandaloneClient.need_reconnect() currently only checks whether client is None; it should verify the actual connection state
  • Accurately reflect the GlideClient's internal connection state
  • In Sentinel mode, detect connection loss beyond just master address changes
  • Ensure reconnection succeeds after socket file deletion under /tmp
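
One possible shape for a stricter check is sketched below. It is not the project's implementation: it assumes the wrapped client exposes an awaitable ping() (declared here as the hypothetical Pingable protocol), that need_reconnect() may be a coroutine, and that a short timeout is acceptable. StandaloneClientSketch and ping_timeout are illustrative names only.

```python
import asyncio
from typing import Optional, Protocol


class Pingable(Protocol):
    """Hypothetical minimal interface: anything with an awaitable ping()."""

    async def ping(self) -> object: ...


class StandaloneClientSketch:
    """Illustrative stand-in, not the real ValkeyStandaloneClient."""

    def __init__(self, client: Optional[Pingable], ping_timeout: float = 1.0) -> None:
        self._client = client
        self._ping_timeout = ping_timeout

    async def need_reconnect(self) -> bool:
        # Case already covered today: no client object at all.
        if self._client is None:
            return True
        # Additional check: the object exists, but does the connection answer?
        try:
            await asyncio.wait_for(self._client.ping(), timeout=self._ping_timeout)
        except Exception:
            # Connection errors and timeouts both count as "needs reconnect".
            # Task cancellation is a BaseException and still propagates.
            return True
        return False
```

The timeout matters because a dead socket can leave a request hanging instead of failing fast, so a probe without one may never return.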

Reproduction Scenarios

  • Stop and restart the Redis/Valkey container
  • Delete the Valkey socket file under /tmp, then restart the container
  • Lose only the operation client connection
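
The socket-deletion scenario can be imitated without a container. The sketch below uses only the Python standard library and assumes the file under /tmp is a Unix domain socket through which new connections are opened. The asyncio server is a stand-in for Valkey, and can_connect and demo are hypothetical helper names. It requires a Unix-like OS.

```python
import asyncio
import os
import tempfile


async def can_connect(path: str) -> bool:
    """Return True if a new Unix-socket connection to `path` can be opened."""
    try:
        _reader, writer = await asyncio.open_unix_connection(path)
    except OSError:
        return False
    writer.close()
    await writer.wait_closed()
    return True


async def _handle(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
    # The stand-in server accepts and immediately closes each connection.
    writer.close()


async def demo() -> list[bool]:
    results: list[bool] = []
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "valkey.sock")

        server = await asyncio.start_unix_server(_handle, path=path)
        results.append(await can_connect(path))  # server up, socket file present

        os.unlink(path)  # delete the socket file while the server keeps running
        results.append(await can_connect(path))  # new connections cannot find it

        server.close()  # "restart": stop the old server, start a new one
        server = await asyncio.start_unix_server(_handle, path=path)
        results.append(await can_connect(path))  # socket file recreated

        server.close()
    return results


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Connections opened before the deletion are not affected by the unlink itself; only new connection attempts fail until the server recreates the file. That is one reason a client is None check alone cannot tell whether a reconnect would succeed.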

Code References

  • ValkeyStandaloneClient: src/ai/backend/common/clients/valkey_client/client.py (~line 133)
  • ValkeyStandaloneClient.need_reconnect(): same file — currently returns client is None
  • ValkeySentinelClient: same file (~line 226)
  • ValkeySentinelClient.need_reconnect() and _get_master_address(): same file (~line 327, 348)
  • MonitoringValkeyClient._reconnect(): same file (~line 551)
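
For the Sentinel entries above, a combined check might look like the following sketch. It assumes the master address is available through an awaitable callback (a stand-in for _get_master_address(), whose real signature is not shown in this issue) and that the client exposes an awaitable ping(). The names sentinel_need_reconnect, Pingable, and Address are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Protocol, Tuple

Address = Tuple[str, int]


class Pingable(Protocol):
    """Hypothetical minimal interface: anything with an awaitable ping()."""

    async def ping(self) -> object: ...


async def sentinel_need_reconnect(
    client: Optional[Pingable],
    known_master: Optional[Address],
    get_master_address: Callable[[], Awaitable[Optional[Address]]],
    ping_timeout: float = 1.0,
) -> bool:
    # No client, or no recorded master: a (re)connect is needed.
    if client is None or known_master is None:
        return True
    # Existing style of check: has the master moved?
    # Errors raised by the lookup itself are left to the caller in this sketch.
    current = await get_master_address()
    if current is None or current != known_master:
        return True
    # Extra check: same master address, but is the connection still alive?
    try:
        await asyncio.wait_for(client.ping(), timeout=ping_timeout)
    except Exception:
        return True
    return False
```

The address comparison runs first so that a failover is handled the way it is today; the liveness probe only adds the case where the address is unchanged but the connection is gone, for example after a container restart.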

Acceptance Criteria

  • need_reconnect() accurately detects broken connections beyond client is None
  • Reconnection succeeds after socket deletion + container restart
  • Verified in both Standalone and Sentinel modes
  • Tests covering each reproduction scenario are included

JIRA Issue: BA-5576
