Problem
The fix-split-brain.sh sidecar script (defined in _configs.tpl, lines ~454-539) does not handle the case where Sentinel quorum is broken and no master can be determined.
When identify_redis_master() returns an empty string — because Sentinels disagree on who the master is or cannot reach each other — the script's if/elif branches both evaluate false and execution falls through silently. No log message, no alert, no recovery attempt.
This is the most dangerous failure mode: the cluster is genuinely in a split-brain or no-master state, and the script designed to fix it does nothing.
Root Cause
The current logic in fix-split-brain.sh:

```shell
MASTER="$(identify_redis_master)"
if [ "$MASTER" = "$ANNOUNCE_IP" ]; then
    # Branch A: this pod should be master — verify local role matches
    :
elif [ -n "$MASTER" ]; then
    # Branch B: another pod is master — ensure this node replicates to it
    :
fi
# No else branch — when MASTER is empty, nothing happens
```
When MASTER is empty (quorum broken), neither condition matches. The script loops back and keeps getting empty results indefinitely with no visibility into the failure.
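For context, here is a minimal reconstruction of how the master lookup can come back empty. This is a hypothetical sketch, not the chart's exact helper: the function name, `SENTINEL_HOST`/`SENTINEL_PORT` defaults, and master name `mymaster` are assumptions, but any Sentinel query failure reduces to the same empty string.

```shell
# Hypothetical sketch of identify_redis_master (not the chart's exact code).
# It asks one Sentinel for the current master address; any failure (Sentinel
# unreachable, quorum broken, unknown master name) yields an empty string.
identify_redis_master() {
    redis-cli -h "${SENTINEL_HOST:-localhost}" -p "${SENTINEL_PORT:-26379}" \
        sentinel get-master-addr-by-name mymaster 2>/dev/null | head -n 1
}

MASTER="$(identify_redis_master)"
echo "resolved master: '$MASTER'"   # empty quotes when no Sentinel answers
```

On a host with no reachable Sentinel this prints empty quotes, which is exactly the value that falls through both branches above.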
How to Reproduce
- Deploy redis-ha with 3 Redis + 3 Sentinel pods and `splitBrainDetection.enabled: true`
- Simulate quorum loss by killing or network-partitioning 2 of 3 Sentinel pods
- Observe that the fix-split-brain.sh logs show no output — it silently skips the broken state
- The cluster remains stuck with no master election and no recovery attempt
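The fall-through itself can be demonstrated without a cluster by stubbing `identify_redis_master()` to return an empty string. The stub and the `ANNOUNCE_IP` value are illustrative; the branch structure matches the script excerpt above.

```shell
# Stub simulating broken quorum: the Sentinel query yields nothing.
identify_redis_master() { printf ''; }
ANNOUNCE_IP="10.0.0.1"   # illustrative pod IP

check_once() {
    MASTER="$(identify_redis_master)"
    if [ "$MASTER" = "$ANNOUNCE_IP" ]; then
        echo "branch A: verify local role"
    elif [ -n "$MASTER" ]; then
        echo "branch B: replicate to $MASTER"
    fi
    # No else: with MASTER empty, the check produces no output at all.
}
check_once   # prints nothing
```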
Related Issues
- `==` vs `=` comparison bug that silently disabled the fix entirely (closed)
- identify_redis_master() returns empty on newly promoted master, causing unnecessary shutdown (open)

All three share a common pattern: the script fails silently in edge cases rather than logging or recovering.
Suggested Fix
Add an else branch with a consecutive-failure counter and sentinel reset recovery after a configurable threshold:

```shell
QUORUM_FAIL_COUNT=0
MAX_QUORUM_FAILURES=${MAX_QUORUM_FAILURES:-5}

# Inside the loop:
MASTER="$(identify_redis_master)"
if [ "$MASTER" = "$ANNOUNCE_IP" ]; then
    QUORUM_FAIL_COUNT=0
    # existing branch A logic
elif [ -n "$MASTER" ]; then
    QUORUM_FAIL_COUNT=0
    # existing branch B logic
else
    QUORUM_FAIL_COUNT=$((QUORUM_FAIL_COUNT + 1))
    echo "WARNING: Sentinel returned no master (quorum may be broken). Failure count: $QUORUM_FAIL_COUNT/$MAX_QUORUM_FAILURES"
    if [ "$QUORUM_FAIL_COUNT" -ge "$MAX_QUORUM_FAILURES" ]; then
        echo "ERROR: Quorum broken for $MAX_QUORUM_FAILURES consecutive checks. Attempting sentinel reset..."
        for SENTINEL in $SENTINEL_HOSTS; do
            redis-cli -h "$SENTINEL" -p "${SENTINEL_PORT}" sentinel reset mymaster || true
        done
        QUORUM_FAIL_COUNT=0
    fi
fi
```
This change:
- Makes the failure visible in logs immediately
- Uses a configurable threshold (`MAX_QUORUM_FAILURES`, default 5) to avoid reacting to transient blips
- Attempts automated recovery via `sentinel reset`, which forces Sentinels to re-discover the topology
- Resets the counter after the recovery attempt to allow another cycle
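The counting behaviour of the proposed else branch can be checked in isolation. In this sketch, `identify_redis_master` is stubbed to always fail and the `sentinel reset` call is replaced by an echo; the six-iteration loop and all names other than those in the proposed fix are illustrative.

```shell
identify_redis_master() { printf ''; }   # stub: quorum broken on every check
ANNOUNCE_IP="10.0.0.1"                   # illustrative
MAX_QUORUM_FAILURES=${MAX_QUORUM_FAILURES:-5}

run_checks() {
    QUORUM_FAIL_COUNT=0
    for CHECK in 1 2 3 4 5 6; do          # stands in for the daemon loop
        MASTER="$(identify_redis_master)"
        if [ "$MASTER" = "$ANNOUNCE_IP" ]; then
            QUORUM_FAIL_COUNT=0           # branch A: healthy, reset counter
        elif [ -n "$MASTER" ]; then
            QUORUM_FAIL_COUNT=0           # branch B: healthy, reset counter
        else
            QUORUM_FAIL_COUNT=$((QUORUM_FAIL_COUNT + 1))
            echo "WARNING: no master ($QUORUM_FAIL_COUNT/$MAX_QUORUM_FAILURES)"
            if [ "$QUORUM_FAIL_COUNT" -ge "$MAX_QUORUM_FAILURES" ]; then
                echo "ERROR: threshold reached, would run: sentinel reset mymaster"
                QUORUM_FAIL_COUNT=0       # allow another recovery cycle
            fi
        fi
    done
}
run_checks
```

With the stub, checks 1 through 5 each log a warning, check 5 crosses the threshold and triggers the reset path, and check 6 starts counting again from 1.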
Environment
- Chart: redis-ha
- Relevant file: templates/_configs.tpl (fix-split-brain.sh section)
- Affects all versions with `splitBrainDetection` enabled