dandydev/redis-ha: HA testing is failing with big dataset/or replica is not coming up after resilience testing #399

@akshaysharama

Description

Describe the bug
The instance is unable to reload the dataset (2 GB and more).
The cluster does not form a quorum cleanly.

To Reproduce
Steps to reproduce the behavior:

  1. Deploy the dandydev/redis-ha chart
    using "helm install redis-new dandydev/redis-ha -n pod-services -f values.yaml"

  2. Once deployed, push random data into the Redis master using the command below. Note the single quotes: the RESP length prefixes such as $3 and $262144 must reach redis-cli literally; inside double quotes the shell would expand them as positional parameters and corrupt the protocol.
    for i in $(seq 1 8200); do
    key="randkey:$i"
    keylen=${#key}
    {
    printf '*3\r\n'
    printf '$3\r\nSET\r\n'
    printf '$%d\r\n%s\r\n' "$keylen" "$key"
    printf '$262144\r\n'
    head -c 262144 /dev/urandom
    printf '\r\n'
    }
    done | kubectl exec -i -n services redis-server-new-0 -c redis -- redis-cli -a redis --pipe

  3. Once all data is copied and the cluster is stable, delete the master instance and verify that all instances rejoin the cluster.
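As a back-of-the-envelope check (an illustration, not part of the original report), the loop in step 2 writes 8200 values of 262144 bytes (256 KiB) each, which puts the payload at roughly 2 GiB and matches the "2 GB and more" dataset described above:

```shell
# Rough payload size generated by the step-2 loop:
# 8200 keys, each holding a 262144-byte (256 KiB) random value.
keys=8200
value_bytes=262144
total_bytes=$((keys * value_bytes))                        # 2149580800 bytes
echo "approx payload: $((total_bytes / 1024 / 1024)) MiB"  # prints "approx payload: 2050 MiB"
```

Redis's in-memory footprint is larger than the raw payload because of per-key overhead, which is consistent with the ~2831 MB RDB memory usage reported in the replica log further down.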

Expected behavior
All instances should rejoin the cluster and replication should recover cleanly after the master is deleted.

Additional context

I have used the following values.yaml:

fullnameOverride: "redis-new"

replicas: 3

priorityClassName: "critical-services"

global:
  priorityClassName: "critical-services"

image:
  repository: public.ecr.aws/docker/library/redis
  tag: 7.2.4-alpine
  pullPolicy: IfNotPresent

imagePullSecrets:
  - name: artifactory

##################################################
# AUTH
##################################################
auth: true
existingSecret: "redis-secret"
authKey: "redis-password"

##################################################
# PERSISTENCE
##################################################
persistentVolume:
  enabled: true
  size: 8Gi
  accessModes:
    - ReadWriteOnce
  storageClass: null

##################################################
# REDIS CONFIGURATION
##################################################
redis:
  port: 6379
  masterGroupName: "mymaster"
  terminationGracePeriodSeconds: 120

  resources:
    requests:
      memory: "300Mi"
      cpu: "1"
    limits:
      memory: "8000Mi"
      cpu: "8"

  config:
    appendonly: "yes"
    appendfsync: "everysec"
    save: ""
    notify-keyspace-events: "KEA"
    min-replicas-to-write: 1
    min-replicas-max-lag: 5
    repl-backlog-size: 512mb
    repl-diskless-sync: "yes"
    rdbcompression: "yes"
    rdbchecksum: "yes"

  ##################################################
  # Probes 
  ##################################################
  livenessProbe:
    enabled: true
    initialDelaySeconds: 20
    periodSeconds: 5
    timeoutSeconds: 6
    successThreshold: 1
    failureThreshold: 5

  readinessProbe:
    enabled: true
    initialDelaySeconds: 20
    periodSeconds: 5
    timeoutSeconds: 2
    successThreshold: 1
    failureThreshold: 5

  startupProbe:
    enabled: true
    initialDelaySeconds: 30
    periodSeconds: 10
    timeoutSeconds: 5
    successThreshold: 1
    failureThreshold: 3
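One thing worth noting about the startupProbe above (a rough model, not something the chart documents): the kubelet restarts the container after it has failed the probe failureThreshold times, so the container gets roughly initialDelaySeconds + periodSeconds * failureThreshold to become ready:

```shell
# Approximate window the redis container has to pass its startup probe,
# assuming the kubelet restarts it after initialDelay + period * failureThreshold
# (a simplification; probe timeouts add a little slack).
initial_delay=30
period=10
failure_threshold=3
echo "startup window: ~$((initial_delay + period * failure_threshold))s"  # ~60s
```

The replica log further down shows ~25 s spent just loading the AOF, with a full resync still streaming when SIGTERM arrives about a minute after start, so with multi-GB datasets this ~60 s window may simply be too short.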

##################################################
# SENTINEL CONFIGURATION
##################################################
sentinel:
  port: 26379
  quorum: 2

  auth: true
  existingSecret: "redis-secret"
  authKey: "redis-password"

  config:
    down-after-milliseconds: 10000
    failover-timeout: 180000
    parallel-syncs: 5
    maxclients: 10000

  livenessProbe:
    enabled: true
    initialDelaySeconds: 20
    periodSeconds: 5
    timeoutSeconds: 6
    successThreshold: 1
    failureThreshold: 5

  readinessProbe:
    enabled: true
    initialDelaySeconds: 20
    periodSeconds: 5
    timeoutSeconds: 2
    successThreshold: 3
    failureThreshold: 5

  startupProbe:
    enabled: true
    initialDelaySeconds: 30
    periodSeconds: 10
    timeoutSeconds: 15
    successThreshold: 1
    failureThreshold: 3

  resources:
    requests:
      memory: "300Mi"
      cpu: "50m"
    limits:
      memory: "1000Mi"
      cpu: "1"

##################################################
# AOF REPAIR SIDECAR
##################################################
extraContainers:

  - name: aof-repair
    image: public.ecr.aws/docker/library/redis:7.2.4-alpine
    command:
      - /bin/sh
      - -c
      - |
        echo "Starting AOF check sidecar..."
        while true; do
          if ls /data/appendonlydir/*.aof* 1> /dev/null 2>&1; then
            echo "AOF files detected. Attempting repair...";
            for file in /data/appendonlydir/*.aof*; do
              yes | redis-check-aof --fix "$file" && echo "redis-check-aof --fix success for $file";
            done
          else
            echo "No AOF files found in /data/appendonlydir/, skipping repair.";
          fi;
          sleep 300
        done
    volumeMounts:
      - name: data
        mountPath: /data

##################################################
# SECURITY CONTEXT 
##################################################
securityContext:
  runAsUser: 1001
  fsGroup: 1001
  runAsNonRoot: true

containerSecurityContext:
  runAsUser: 1001
  runAsNonRoot: true
  allowPrivilegeEscalation: false
  seccompProfile:
    type: RuntimeDefault
  capabilities:
    drop:
      - ALL

##################################################
# DISABLE HAPROXY 
##################################################
haproxy:
  enabled: false

##################################################
# HARD ANTI AFFINITY
##################################################
hardAntiAffinity: true

Logs

 sudo kubectl logs -f pao-redis-new-server-1 -n pod-services
Defaulted container "redis" out of: redis, sentinel, split-brain-fix, aof-repair, config-init (init)
1:C 06 Apr 2026 12:00:42.127 * oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:C 06 Apr 2026 12:00:42.127 * Redis version=7.2.4, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 06 Apr 2026 12:00:42.127 * Configuration loaded
1:S 06 Apr 2026 12:00:42.128 * monotonic clock: POSIX clock_gettime
1:S 06 Apr 2026 12:00:42.128 * Running mode=standalone, port=6379.
1:S 06 Apr 2026 12:00:42.128 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
1:S 06 Apr 2026 12:00:42.128 * Server initialized
1:S 06 Apr 2026 12:00:42.134 * Reading RDB base file on AOF loading...
1:S 06 Apr 2026 12:00:42.134 * Loading RDB produced by version 7.2.4
1:S 06 Apr 2026 12:00:42.134 * RDB age 302 seconds
1:S 06 Apr 2026 12:00:42.134 * RDB memory usage when created 2831.30 Mb
1:S 06 Apr 2026 12:00:42.134 * RDB is base AOF
1:S 06 Apr 2026 12:01:05.564 * Done loading RDB, keys loaded: 7422, keys expired: 0.
1:S 06 Apr 2026 12:01:05.564 * DB loaded from base file appendonly.aof.8.base.rdb: 23.433 seconds
1:S 06 Apr 2026 12:01:07.469 * DB loaded from incr file appendonly.aof.8.incr.aof: 1.905 seconds
1:S 06 Apr 2026 12:01:07.469 * DB loaded from append only file: 25.338 seconds
1:S 06 Apr 2026 12:01:07.469 * Opening AOF incr file appendonly.aof.8.incr.aof on server start
1:S 06 Apr 2026 12:01:07.469 * Ready to accept connections tcp
1:S 06 Apr 2026 12:01:07.977 * Connecting to MASTER 192.168.254.227:6379
1:S 06 Apr 2026 12:01:07.977 * MASTER <-> REPLICA sync started
1:S 06 Apr 2026 12:01:07.978 * Non blocking connect for SYNC fired the event.
1:S 06 Apr 2026 12:01:07.978 * Master replied to PING, replication can continue...
1:S 06 Apr 2026 12:01:07.978 * Partial resynchronization not possible (no cached master)
1:S 06 Apr 2026 12:01:12.677 * Full resync from master: 469d4f3b5479b4e010bdf9509f32661966fba63c:4148981196
1:S 06 Apr 2026 12:01:12.780 * MASTER <-> REPLICA sync: receiving streamed RDB from master with EOF to disk
1:signal-handler (1775476900) Received SIGTERM scheduling shutdown...
1:S 06 Apr 2026 12:01:41.059 * User requested shutdown...
1:S 06 Apr 2026 12:01:41.059 * Calling fsync() on the AOF file.
1:S 06 Apr 2026 12:01:41.059 # Redis is now ready to exit, bye bye...

~ # sudo kubectl describe po redis-new-server-1 -n services
Name:                 redis-new-server-1
Namespace:            services
Priority:             4000
Priority Class Name:  critical-services
Service Account:      redis-new
Node:                 server3/192.168.11.3
Start Time:           Mon, 06 Apr 2026 12:00:34 +0000
Labels:               app=redis-ha
                      apps.kubernetes.io/pod-index=1
                      controller-revision-hash=redis-new-server-67744877f9
                      redis-new=replica
                      release=redis-new
                      statefulset.kubernetes.io/pod-name=redis-new-server-1
Annotations:          checksum/init-config: dab9fb7c0beaa52722cd1ff7ccad151f8ab6bb8c7bd9830141a28c9e7f269c8e
Status:               Running
IP:                   192.168.250.219
IPs:
  IP:           192.168.250.219
Controlled By:  StatefulSet/redis-new-server
Init Containers:
  config-init:
    Container ID:    containerd://82b4727be07282a7c78e78ef6f12d7cf8dff6115dc14010711c9738933cb9829
    Image:           public.ecr.aws/docker/library/redis:7.2.4-alpine
    Image ID:        public.ecr.aws/docker/library/redis@sha256:c8bb255c3559b3e458766db810aa7b3c7af1235b204cfdb304e79ff388fe1a5a
    Port:            <none>
    Host Port:       <none>
    SeccompProfile:  RuntimeDefault
    Command:
      sh
    Args:
      /readonly-config/init.sh
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 06 Apr 2026 12:00:41 +0000
      Finished:     Mon, 06 Apr 2026 12:00:41 +0000
    Ready:          True
    Restart Count:  0
    Environment:
      SENTINEL_ID_0:  4ec322c9018270bd52453c319fc8868df0f8727b
      SENTINEL_ID_1:  fcf3a9e314980721aa93e8ca6314b03a2143682d
      SENTINEL_ID_2:  5c4bc6e744473d9b673b8e9a89867a8334a3f023
      AUTH:           <set to the key 'redis-password' in secret 'pao-redis-secret'>  Optional: false
      SENTINELAUTH:   <set to the key 'redis-password' in secret 'pao-redis-secret'>  Optional: false
    Mounts:
      /data from data (rw)
      /readonly-config from config (ro)
Containers:
  redis:
    Container ID:    containerd://e816ef254f85aa1634cbed9a62973e2d12b6a4973e59a6f8d426a050ed89ef45
    Image:           public.ecr.aws/docker/library/redis:7.2.4-alpine
    Image ID:        public.ecr.aws/docker/library/redis@sha256:c8bb255c3559b3e458766db810aa7b3c7af1235b204cfdb304e79ff388fe1a5a
    Port:            6379/TCP
    Host Port:       0/TCP
    SeccompProfile:  RuntimeDefault
    Command:
      redis-server
    Args:
      /data/conf/redis.conf
    State:          Running
      Started:      Mon, 06 Apr 2026 12:01:41 +0000
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 06 Apr 2026 12:00:42 +0000
      Finished:     Mon, 06 Apr 2026 12:01:41 +0000
    Ready:          False
    Restart Count:  1
    Limits:
      cpu:     8
      memory:  8000Mi
    Requests:
      cpu:      1
      memory:   300Mi
    Liveness:   exec [sh -c /health/redis_liveness.sh] delay=20s timeout=6s period=5s #success=1 #failure=5
    Readiness:  exec [sh -c /health/redis_readiness.sh] delay=20s timeout=2s period=5s #success=1 #failure=5
    Startup:    exec [sh -c /health/redis_readiness.sh] delay=30s timeout=5s period=10s #success=1 #failure=3
    Environment:
      AUTH:  <set to the key 'redis-password' in secret 'pao-redis-secret'>  Optional: false
    Mounts:
      /data from data (rw)
      /health from health (rw)
      /readonly-config from config (ro)
  sentinel:
    Container ID:    containerd://6dd56f036cc0ca82ddc7a813e315eb9d51bb7d872b3bee1ff3bebc2cb18c21d0
    Image:           public.ecr.aws/docker/library/redis:7.2.4-alpine
    Image ID:        public.ecr.aws/docker/library/redis@sha256:c8bb255c3559b3e458766db810aa7b3c7af1235b204cfdb304e79ff388fe1a5a
    Port:            26379/TCP
    Host Port:       0/TCP
    SeccompProfile:  RuntimeDefault
    Command:
      redis-sentinel
    Args:
      /data/conf/sentinel.conf
    State:          Running
      Started:      Mon, 06 Apr 2026 12:00:42 +0000
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  1000Mi
    Requests:
      cpu:      50m
      memory:   300Mi
    Liveness:   exec [sh -c /health/sentinel_liveness.sh] delay=20s timeout=6s period=5s #success=1 #failure=5
    Readiness:  exec [sh -c /health/sentinel_liveness.sh] delay=20s timeout=2s period=5s #success=3 #failure=5
    Startup:    exec [sh -c /health/sentinel_liveness.sh] delay=30s timeout=15s period=10s #success=1 #failure=3
    Environment:
      AUTH:          <set to the key 'redis-password' in secret 'pao-redis-secret'>  Optional: false
      SENTINELAUTH:  <set to the key 'redis-password' in secret 'pao-redis-secret'>  Optional: false
    Mounts:
      /data from data (rw)
      /health from health (rw)
  split-brain-fix:
    Container ID:    containerd://261f112415883e9d5fe28c98e16c089d63ae402e2aab734252c5960a9c80b25a
    Image:           public.ecr.aws/docker/library/redis:7.2.4-alpine
    Image ID:        public.ecr.aws/docker/library/redis@sha256:c8bb255c3559b3e458766db810aa7b3c7af1235b204cfdb304e79ff388fe1a5a
    Port:            <none>
    Host Port:       <none>
    SeccompProfile:  RuntimeDefault
    Command:
      sh
    Args:
      /readonly-config/fix-split-brain.sh
    State:          Running
      Started:      Mon, 06 Apr 2026 12:00:42 +0000
    Ready:          True
    Restart Count:  0
    Liveness:       exec [cat /readonly-config/redis.conf] delay=30s timeout=15s period=15s #success=1 #failure=5
    Readiness:      exec [sh -c test -d /proc/1] delay=30s timeout=15s period=15s #success=1 #failure=5
    Environment:
      SENTINEL_ID_0:  4ec322c9018270bd52453c319fc8868df0f8727b
      SENTINEL_ID_1:  fcf3a9e314980721aa93e8ca6314b03a2143682d
      SENTINEL_ID_2:  5c4bc6e744473d9b673b8e9a89867a8334a3f023
      AUTH:           <set to the key 'redis-password' in secret 'pao-redis-secret'>  Optional: false
      SENTINELAUTH:   <set to the key 'redis-password' in secret 'pao-redis-secret'>  Optional: false
    Mounts:
      /data from data (rw)
      /readonly-config from config (ro)
  aof-repair:
    Container ID:  containerd://96d948fe8dca892d0ff929822e54b50790e843d327afd22335e59d001ca511b3
    Image:         public.ecr.aws/docker/library/redis:7.2.4-alpine
    Image ID:      public.ecr.aws/docker/library/redis@sha256:c8bb255c3559b3e458766db810aa7b3c7af1235b204cfdb304e79ff388fe1a5a
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
      -c
      echo "Starting AOF check sidecar..."
      while true; do
        if ls /data/appendonlydir/*.aof* 1> /dev/null 2>&1; then
          echo "AOF files detected. Attempting repair...";
          for file in /data/appendonlydir/*.aof*; do
            yes | redis-check-aof --fix "$file" && echo "redis-check-aof --fix success for $file";
          done
        else
          echo "No AOF files found in /data/appendonlydir/, skipping repair.";
        fi;
        sleep 300
      done

    State:          Running
      Started:      Mon, 06 Apr 2026 12:00:42 +0000
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /data from data (rw)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 True
  Ready                       False
  ContainersReady             False
  PodScheduled                True
Volumes:
  data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  data-redis-new-server-1
    ReadOnly:   false
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      redis-new-configmap
    Optional:  false
  health:
    Type:        ConfigMap (a volume populated by a ConfigMap)
    Name:        redis-new-health-configmap
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  82s                default-scheduler  Successfully assigned services/redis-new-server-1 to pod-server3-49-6155-502-7kd2
  Normal   Pulled     77s                kubelet            Container image "public.ecr.aws/docker/library/redis:7.2.4-alpine" already present on machine
  Normal   Created    77s                kubelet            Created container config-init
  Normal   Started    76s                kubelet            Started container config-init
  Normal   Started    75s                kubelet            Started container split-brain-fix
  Normal   Created    75s                kubelet            Created container split-brain-fix
  Normal   Started    75s                kubelet            Started container aof-repair
  Normal   Pulled     75s                kubelet            Container image "public.ecr.aws/docker/library/redis:7.2.4-alpine" already present on machine
  Normal   Created    75s                kubelet            Created container sentinel
  Normal   Started    75s                kubelet            Started container sentinel
  Normal   Pulled     75s                kubelet            Container image "public.ecr.aws/docker/library/redis:7.2.4-alpine" already present on machine
  Normal   Created    75s                kubelet            Created container aof-repair
  Normal   Pulled     75s                kubelet            Container image "public.ecr.aws/docker/library/redis:7.2.4-alpine" already present on machine
  Warning  Unhealthy  17s (x3 over 37s)  kubelet            Startup probe failed: role=slave; repl=sync
  Normal   Killing    17s                kubelet            Container redis failed startup probe, will be restarted
  Normal   Pulled     16s (x2 over 76s)  kubelet            Container image "public.ecr.aws/docker/library/redis:7.2.4-alpine" already present on machine
  Normal   Created    16s (x2 over 76s)  kubelet            Created container redis
  Normal   Started    16s (x2 over 75s)  kubelet            Started container redis

 # sudo kubectl get po -A | grep redis
services                 redis-new-server-0                                            4/4     Running             2 (13m ago)        17m
services                 redis-new-server-1                                            3/4     Running             1 (27s ago)        94s
services                 redis-new-server-2                                            4/4     Running             0                  53m
Chart.yaml:

apiVersion: v2
appVersion: 7.2.4
description: This Helm chart provides a highly available Redis implementation with
  a master/slave configuration and uses Sentinel sidecars for failover management
name: redis-dandy
# ✅ Use local folder for dependency to avoid helm repo issues
dependencies:
  - name: redis-ha
    version: 4.32.0
    repository: "https://dandydeveloper.github.io/charts"

Labels

bug (Something isn't working)
