Skip to content

feat(BA-5578): refactor ValkeyClient to async with acquire() pattern#10766

Open
jopemachine wants to merge 9 commits intomainfrom
feat/BA-5578-valkey-client-acquire-pattern
Open

feat(BA-5578): refactor ValkeyClient to async with acquire() pattern#10766
jopemachine wants to merge 9 commits intomainfrom
feat/BA-5578-valkey-client-acquire-pattern

Conversation

@jopemachine
Copy link
Copy Markdown
Member

@jopemachine jopemachine commented Apr 3, 2026

Resolves BA-5578.

Summary

  • Add acquire() async context manager to AbstractValkeyClient and MonitoringValkeyClient that wraps operations and tracks consecutive connection failures
  • When the failure threshold is exceeded, the connection is torn down and reconnected rather than continuing to retry on a broken connection
  • Migrate all 13 domain-specific Valkey clients to use the new async with self._client.acquire() as conn: pattern
  • Add tests covering the retry-based disconnection logic

Changes

  • Core (client.py): Define _VALKEY_CONNECTION_ERRORS tuple, add acquire() base implementation to AbstractValkeyClient, override in MonitoringValkeyClient with _operation_failure_count tracking and threshold-based _reconnect() trigger
  • Domain clients (13 files): Replace all self._client.client.xxx() calls with async with self._client.acquire() as conn: pattern
  • Tests: Add TestMonitoringValkeyClientAcquire class with 8 test cases covering success/failure counting, threshold detection, reconnection, and error type tracking

Test plan

  • pants fmt/fix/lint --changed-since=HEAD passes
  • pants check --changed-since=HEAD (mypy) passes
  • pants test --changed-since=HEAD passes
  • Verify in Standalone mode
  • Verify in Sentinel mode

🤖 Generated with Claude Code

…ttern

Add an acquire() async context manager to AbstractValkeyClient that
wraps operations and tracks consecutive connection failures. When the
failure threshold is exceeded, the connection is torn down and
reconnected rather than continuing to retry on a broken connection.

- Add _VALKEY_CONNECTION_ERRORS tuple for connection error detection
- Add acquire() base implementation to AbstractValkeyClient
- Override acquire() in MonitoringValkeyClient with failure tracking
  and threshold-based reconnection
- Migrate all 13 domain-specific Valkey clients to use the new pattern
- Add tests for retry-based disconnection logic

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 3, 2026 02:39
@github-actions github-actions bot added size:XL 500~ LoC comp:common Related to Common component labels Apr 3, 2026
@jopemachine jopemachine marked this pull request as draft April 3, 2026 02:39
@jopemachine jopemachine added this to the 26.4 milestone Apr 3, 2026
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Refactors the Valkey client abstraction to standardize operations behind an async with ... acquire() context manager, enabling operation-level failure tracking and reconnect-on-threshold behavior, and migrates all domain Valkey clients to use the new pattern.

Changes:

  • Added AbstractValkeyClient.acquire() async context manager and a MonitoringValkeyClient.acquire() override that counts consecutive operation connection failures and triggers _reconnect() at a threshold.
  • Migrated domain-specific Valkey clients to use async with self._client.acquire() as conn: for all Valkey operations.
  • Added unit tests validating failure counting, threshold behavior, and reconnection outcomes for MonitoringValkeyClient.acquire().

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
src/ai/backend/common/clients/valkey_client/client.py Introduces acquire() and implements operation failure tracking + threshold-based reconnection.
tests/unit/common/clients/valkey_client/test_monitoring_valkey_client.py Adds test coverage for the new acquire() behavior and reconnection triggering.
src/ai/backend/common/clients/valkey_client/valkey_volume_stats/client.py Migrates volume stats operations to the acquire() pattern.
src/ai/backend/common/clients/valkey_client/valkey_stream/client.py Migrates stream operations (xgroup/xreadgroup/etc.) to the acquire() pattern.
src/ai/backend/common/clients/valkey_client/valkey_stat/client.py Migrates stat/cache operations and batched execs to the acquire() pattern.
src/ai/backend/common/clients/valkey_client/valkey_session/client.py Migrates session CRUD and time/ping to the acquire() pattern.
src/ai/backend/common/clients/valkey_client/valkey_schedule/client.py Migrates scheduling state operations and batch execs to the acquire() pattern.
src/ai/backend/common/clients/valkey_client/valkey_rate_limit/client.py Migrates rate-limit scripts and sorted-set operations to the acquire() pattern.
src/ai/backend/common/clients/valkey_client/valkey_live/client.py Migrates live data operations, scans, and pipelines to the acquire() pattern.
src/ai/backend/common/clients/valkey_client/valkey_leader/client.py Migrates leadership Lua script invocations to the acquire() pattern.
src/ai/backend/common/clients/valkey_client/valkey_image/client.py Migrates image membership operations and scans to the acquire() pattern.
src/ai/backend/common/clients/valkey_client/valkey_container_log/client.py Migrates list/log operations and batch execs to the acquire() pattern.
src/ai/backend/common/clients/valkey_client/valkey_bgtask/client.py Migrates task metadata operations, scripts, and execs to the acquire() pattern.
src/ai/backend/common/clients/valkey_client/valkey_artifact/client.py Migrates artifact progress tracking (get/exec/hincrby/hset/scan) to the acquire() pattern.
src/ai/backend/common/clients/valkey_client/valkey_artifact_registries/client.py Migrates registry CRUD operations to the acquire() pattern.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

jopemachine and others added 5 commits April 3, 2026 16:11
The ValkeyLeaderClient was refactored to use `async with self._client.acquire()`,
but the test mock didn't support the async context manager protocol.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add MonitoringValkeyClientSpec dataclass to group configuration fields
  (reconnectable_exceptions, consecutive_failure_threshold, monitor_interval)
- Make AbstractValkeyClient.acquire() a proper abstract method
- Add acquire() implementations to ValkeyStandaloneClient and ValkeySentinelClient

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Merge test_acquire_resets_counter_and_reconnects_at_threshold and
test_acquire_reconnect_restores_connectivity into
test_operation_client_recovery_while_monitor_healthy, which verifies
the core BA-5577 scenario: operation client broken while monitor
client is healthy, with counter reset and recovery assertions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ingValkeyClient

Serialize _reconnect() calls from acquire() and monitor loop with
asyncio.Lock to prevent interleaving disconnect/connect operations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jopemachine jopemachine requested review from a team and HyeockJinKim April 3, 2026 08:40
@jopemachine jopemachine marked this pull request as ready for review April 3, 2026 08:40
- Rename AbstractValkeyClient.client property to _raw_client (internal)
- Rename acquire() context manager to client() (public API)
- Replace _reconnect_lock with _needs_reconnect flag; monitor loop
  handles actual reconnection to avoid concurrent reconnect races
- Update all 14 subdomain clients, ValkeyCache, plugins, and tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added the comp:manager Related to Manager component label Apr 3, 2026
Instead of waiting up to monitor_interval (10s) for the next cycle,
client() now sets a _reconnect_event that immediately wakes the
monitor loop. Reconnection is still performed solely by the monitor
loop, avoiding concurrent reconnect races.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp:common Related to Common component comp:manager Related to Manager component size:XL 500~ LoC

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants