Fix race condition: prevent selection of dead connection pools during node failures#3

Open
sagarrohankar-bsft wants to merge 1 commit into bsft-main from fix/race-condition-dead-node-selection

Conversation

@sagarrohankar-bsft
Collaborator

Problem

When a Scylla node fails, a connection pool that has already been terminated can still be selected during the asynchronous shutdown window. This happens because:

  1. A StatusChange DOWN event triggers an asynchronous terminate_child/2 call
  2. The pool remains in ConnectionRegistry while termination is in progress
  3. The retry logic queries the registry and can select the dying pool
  4. The connection attempt against that pool fails

Solution

Add synchronous Process.alive?/1 checks to filter out dead pools before selection:

  • Added filter_alive_pools/1 in lib/xandra/cluster.ex (pool selection)
  • Added alive check in random_connections/1 in lib/xandra/clusters/cluster.ex (token ring sync)

This ensures a pool is selected only if its process is actually running, even if it is still registered.
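
For reviewers, here is a minimal sketch of what such a filter could look like. It is not the code from this PR: the module name `PoolFilterSketch` and the `{pid, value}` entry shape are assumptions made for illustration, and only `Process.alive?/1`, `Process.monitor/1`, and `Enum.filter/2` are real standard library calls.

```elixir
defmodule PoolFilterSketch do
  # Hypothetical stand-in for the filter_alive_pools/1 named in this PR.
  # Assumes each registry entry is a {pid, value} tuple with a local pid;
  # Process.alive?/1 raises ArgumentError for pids on other nodes.
  def filter_alive_pools(pools) when is_list(pools) do
    Enum.filter(pools, fn {pid, _value} ->
      is_pid(pid) and Process.alive?(pid)
    end)
  end
end

# A long-lived process standing in for a healthy pool.
alive = spawn(fn -> Process.sleep(:infinity) end)

# A process that exits immediately, standing in for a terminated pool.
dead = spawn(fn -> :ok end)

# Wait for the exit so the demo does not depend on timing.
ref = Process.monitor(dead)

receive do
  {:DOWN, ^ref, :process, ^dead, _reason} -> :ok
end

# Both entries are still "registered" in this stale snapshot.
pools = [{alive, :node_a}, {dead, :node_b}]

# Only the {alive, :node_a} entry remains after filtering.
IO.inspect(PoolFilterSketch.filter_alive_pools(pools))
```

A check like this narrows the window rather than closing it: a pool can still exit between the `Process.alive?/1` call and the request being sent, so callers would still need to handle a failed attempt.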

…failures

Add synchronous alive checks to filter out terminated pools before selection.
This fixes a race condition where pools could be selected during the async
termination window after a StatusChange DOWN event.

Changes:
- Add filter_alive_pools/1 to check Process.alive? before pool selection
- Add alive check to random_connections/1 for token ring sync
- Ensures retry logic only selects from healthy, running pools

Fixes race condition observed when Scylla nodes fail under load, where
terminated pools remained in ConnectionRegistry during async shutdown.