Fix race condition: prevent selection of dead connection pools during node failures by sagarrohankar-bsft · Pull Request #3 · blueshift-labs/xandra

sagarrohankar-bsft · 2025-11-06T12:25:21Z

Problem

When a Scylla node fails, there is a race condition in which connection pools that are being terminated can still be selected during the asynchronous shutdown window. The sequence is:

1. A StatusChange DOWN event triggers an asynchronous terminate_child/2.
2. The pool remains in ConnectionRegistry while it is being terminated.
3. The retry logic queries the registry and can select the dying pool.
4. The connection attempt fails.
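The following small, self-contained Elixir illustration makes the window concrete. It does not use Xandra. A plain map stands in for ConnectionRegistry, and `StaleEntryDemo` and its helpers are hypothetical. It shows that a pid can still be listed as a selection candidate after its process has exited, which is the condition a `Process.alive?/1` check detects.

```elixir
defmodule StaleEntryDemo do
  # Hypothetical helpers; a plain map of host => pid stands in for the registry.

  # Starts a process that exits immediately and blocks until its :DOWN
  # message arrives, so the returned pid is guaranteed to be dead.
  def dead_pid do
    {pid, ref} = spawn_monitor(fn -> :ok end)

    receive do
      {:DOWN, ^ref, :process, ^pid, _reason} -> pid
    end
  end

  # Starts a process that stays alive until it receives :stop.
  def live_pid do
    spawn(fn ->
      receive do
        :stop -> :ok
      end
    end)
  end

  # Mirrors a selection that trusts registration alone: every registered
  # pid is a candidate, whether or not its process is still running.
  def candidates(registry), do: Map.values(registry)
end

registry = %{
  "10.0.0.1" => StaleEntryDemo.dead_pid(),
  "10.0.0.2" => StaleEntryDemo.live_pid()
}

# Both pids are candidates, but only one of them can serve a request.
StaleEntryDemo.candidates(registry) |> Enum.map(&Process.alive?/1) |> Enum.sort()
# => [false, true]
```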

Solution

Add synchronous Process.alive?/1 checks to filter out dead pools before selection:

- Added filter_alive_pools/1 in lib/xandra/cluster.ex (pool selection)
- Added an alive check in random_connections/1 in lib/xandra/clusters/cluster.ex (token ring sync)

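This ensures that a pool is selected only if its process is alive at selection time, even if the pool is still registered. The check narrows the race window but does not eliminate it. A pool can still exit between the alive check and its use, so a connection attempt can still fail within that smaller window.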
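Here is a minimal sketch of the filtering step. It assumes each pool is a local pid paired with its host; `Process.alive?/1` raises `ArgumentError` for pids on other nodes. Only the name filter_alive_pools/1 comes from this PR. Its real signature, the `AlivePoolSelection` module, the `{host, pid}` shape, and `pick_pool/1` are assumptions for illustration.

```elixir
defmodule AlivePoolSelection do
  # Hypothetical shape: a list of {host, pool_pid} tuples, as a registry
  # lookup might return. Process.alive?/1 only accepts local pids.
  @spec filter_alive_pools([{term(), pid()}]) :: [{term(), pid()}]
  def filter_alive_pools(pools) do
    Enum.filter(pools, fn {_host, pid} -> Process.alive?(pid) end)
  end

  # Picks a random pool among those still running. Returns :error when
  # none survive the filter, so the caller can retry or fail fast.
  @spec pick_pool([{term(), pid()}]) :: {:ok, {term(), pid()}} | :error
  def pick_pool(pools) do
    case filter_alive_pools(pools) do
      [] -> :error
      alive -> {:ok, Enum.random(alive)}
    end
  end
end

# Build one dead pid (waiting for its :DOWN) and one live pid.
{dead, ref} = spawn_monitor(fn -> :ok end)

receive do
  {:DOWN, ^ref, :process, ^dead, _reason} -> :ok
end

live =
  spawn(fn ->
    receive do
      :stop -> :ok
    end
  end)

pools = [{"10.0.0.1", dead}, {"10.0.0.2", live}]

# Only the live pool can be returned.
AlivePoolSelection.pick_pool(pools)
```

When every pool has been filtered out, `pick_pool/1` returns `:error`, which leaves the decision to retry or fail to the caller.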

…failures

Add synchronous alive checks to filter out terminated pools before selection. This fixes a race condition where pools could be selected during the async termination window after a StatusChange DOWN event.

Changes:
- Add filter_alive_pools/1 to check Process.alive? before pool selection
- Add alive check to random_connections/1 for token ring sync
- Ensure retry logic only selects from healthy, running pools

Fixes a race condition observed when Scylla nodes fail under load, where terminated pools remained in ConnectionRegistry during async shutdown.
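`Process.alive?/1` reports a process's state only at the moment of the call, so a pool that passes the filter can still exit before it is used. The hedged sketch below shows one way to guard the call site as well. An `Agent` stands in for a pool, and `GuardedPoolCall` is hypothetical and not part of Xandra.

```elixir
defmodule GuardedPoolCall do
  # Runs `fun` against the pool process. An exit (for example :noproc when
  # the pool died after passing the alive check) becomes an error tuple
  # instead of crashing the caller.
  def get(pid, fun) do
    {:ok, Agent.get(pid, fun)}
  catch
    :exit, reason -> {:error, {:pool_down, reason}}
  end
end

# An Agent stands in for a connection pool.
{:ok, pool} = Agent.start(fn -> :connected end)

# The pool passes the alive filter...
true = Process.alive?(pool)

# ...then exits before it is used.
:ok = Agent.stop(pool)

# Returns {:error, {:pool_down, reason}} rather than raising.
GuardedPoolCall.get(pool, & &1)
```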

merajblueshift approved these changes Dec 8, 2025

sagarrohankar-bsft wants to merge 1 commit into bsft-main from fix/race-condition-dead-node-selection
