OCPBUGS-77773: fix backend server health check if DCM is enabled by jcmoraisjr · Pull Request #747 · openshift/router

jcmoraisjr · 2026-03-04T19:05:36Z

Without DCM, the router configures single-replica backends without a health check. This saves a bit of CPU and network I/O, since a single failing replica keeps disrupting the service whether or not it is health-checked. This works without DCM because whenever a scale-out happens, HAProxy is reloaded with a new configuration that enables health checks on all replicas, including the first one.

This does not work with DCM because the code simply adds the new replica to an empty slot, ignoring the status of the other ones. In the end, the new replica has health check enabled and the first one continues with health check disabled.

The approach used here is to skip DCM when this scenario is detected. This has the advantage of not changing any current behavior; on the other hand, a DCM scenario is disabled for now. This is going to be revisited after 4.22 via https://issues.redhat.com/browse/NE-2496

That said, there are two distinct changes in this PR:

- Skipping the dynamic change when scaling out from just one replica.
- Adding the `inter` keyword to the server-template, so that a dynamically added server uses the same health check interval as the other replicas.
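
The mismatch described above can be illustrated with a small sketch. This is not the router's code: `renderServers` and `addDynamically` are hypothetical names, and the rule they encode (no `check` for a single replica, `check inter` for two or more, dynamic additions leaving existing servers untouched) is taken only from the description in this PR.

```go
package main

import (
	"fmt"
	"strings"
)

// renderServers is a hypothetical stand-in for what a full reload produces:
// a single replica is rendered without a health check, while two or more
// replicas are all rendered with "check inter <interval>".
func renderServers(endpoints []string, interval string) []string {
	lines := make([]string, 0, len(endpoints))
	for _, ep := range endpoints {
		line := "server " + ep
		if len(endpoints) > 1 {
			line += " check inter " + interval
		}
		lines = append(lines, line)
	}
	return lines
}

// addDynamically mimics the problematic DCM path: the new replica fills an
// empty slot with health check enabled, and existing lines are not revisited.
func addDynamically(current []string, endpoint, interval string) []string {
	out := append([]string{}, current...)
	return append(out, "server "+endpoint+" check inter "+interval)
}

func main() {
	single := renderServers([]string{"10.129.2.18:8080"}, "5000ms")

	fmt.Println("dynamic add (first server keeps no health check):")
	fmt.Println(strings.Join(addDynamically(single, "10.128.2.10:8080", "5000ms"), "\n"))

	fmt.Println("full reload (both servers are checked):")
	fmt.Println(strings.Join(renderServers([]string{"10.129.2.18:8080", "10.128.2.10:8080"}, "5000ms"), "\n"))
}
```

Comparing the two outputs shows why the PR prefers a reload for the one-to-many transition: only the reload path re-renders the first server.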

https://issues.redhat.com/browse/OCPBUGS-77773

Summary by CodeRabbit

Refactor
- Simplified HAProxy health check configuration by consolidating repeated calculations into a single variable.
Improvements
- Enhanced backend endpoint addition logic to validate configuration and provide guidance when system reload is required for health check functionality.

openshift-ci-robot · 2026-03-04T19:05:42Z

@jcmoraisjr: This pull request references Jira Issue OCPBUGS-77773, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.22.0) matches configured target version for branch (4.22.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @ShudiLi

The bug has been updated to refer to the pull request using the external bug tracker.


openshift-ci · 2026-03-04T19:06:04Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign grzpiotrowski for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


coderabbitai · 2026-03-09T20:09:33Z

Walkthrough

Refactors HAProxy configuration template to compute and reuse health_check_interval variables across backend definitions, eliminating repeated inline calculations. Adds validation in the endpoint manager to detect single-backend scenarios requiring full reload before enabling dynamic health checks.

Changes

| Cohort / File(s) | Summary |
|---|---|
| HAProxy Configuration Template `images/router/haproxy/conf/haproxy-config.template` | Introduces per-backend `health_check_interval` variable computed from annotations/env and consolidates repeated interval calculations into a single variable reference across "check inter" directives. |
| Endpoint Manager Validation `pkg/router/template/configmanager/haproxy/manager.go` | Adds conditional check in ReplaceRouteEndpoints to detect when new endpoints are being added to a backend with only one static server, returning an error that suggests a reload is needed to enable health checks, with extensive inline documentation. |
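
As a rough illustration of the "computed from annotations/env" step, the sketch below picks the first candidate that looks like a valid time specification and falls back to a default. Everything here is an assumption made for illustration: the function name `healthCheckInterval`, the simplified `timeSpec` pattern, and the `5000ms` fallback (chosen to match the value seen in the verification output in this thread) are not taken from the template itself.

```go
package main

import (
	"fmt"
	"regexp"
)

// timeSpec is a simplified, assumed pattern for HAProxy-style durations
// such as "5000ms" or "10s". The real template may use a different one.
var timeSpec = regexp.MustCompile(`^[1-9][0-9]*(us|ms|s|m|h|d)?$`)

// healthCheckInterval returns the first candidate that matches timeSpec,
// trying the per-route annotation value before the router-wide env value,
// and returns fallback when neither is valid.
func healthCheckInterval(annotation, env, fallback string) string {
	for _, candidate := range []string{annotation, env} {
		if timeSpec.MatchString(candidate) {
			return candidate
		}
	}
	return fallback
}

func main() {
	// Computed once per backend, then reused for every "check inter" directive,
	// including the one on the server-template line.
	interval := healthCheckInterval("", "", "5000ms")
	fmt.Println("server pod-a 10.0.0.1:8080 check inter " + interval)
	fmt.Println("server-template _dynamic-pod- 1-1 172.4.0.4:8765 check inter " + interval + " disabled")
}
```

Computing the value once is what lets statically rendered servers and dynamically added ones share the same interval.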

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly references the specific issue (OCPBUGS-77773) and accurately describes the main purpose: fixing backend server health check behavior when DCM is enabled. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Stable And Deterministic Test Names | ✅ Passed | The pull request does not contain any Ginkgo test files or modifications to existing tests; changes are limited to configuration template and manager logic files with no test code. |
| Test Structure And Quality | ✅ Passed | This pull request does not contain any Ginkgo test code changes. Modifications are limited to a HAProxy configuration template and a Go manager source file, neither of which are test files. |


coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/router/template/configmanager/haproxy/manager.go`:
- Around line 513-526: The current branch that forces a reload when
len(newEndpoints) > len(oldEndpoints) incorrectly triggers on 0→1 recoveries;
change the condition so it only considers additions when there was at least one
previous endpoint (e.g., require len(oldEndpoints) > 0) before iterating servers
and applying the staticCount/isDynamicBackendServer check for backendName; this
limits the single-endpoint health-check reload logic to true single→multi
transitions rather than 0→1 recoveries.
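
A minimal sketch of the condition this comment asks for, under stated assumptions: `backendServer` and `needsReloadForHealthCheck` are hypothetical names, and the real logic in `manager.go` (which the comment says uses `staticCount` and `isDynamicBackendServer`) may be structured differently.

```go
package main

import "fmt"

// backendServer is a hypothetical, reduced view of one server in a backend.
type backendServer struct {
	Name    string
	Dynamic bool // true for slots that come from the server-template
}

// needsReloadForHealthCheck reports whether a dynamic endpoint addition should
// be skipped in favor of a reload. It only fires on a true single-to-multi
// transition: endpoints are being added, at least one endpoint existed before,
// and the backend currently has exactly one static server.
func needsReloadForHealthCheck(oldCount, newCount int, servers []backendServer) bool {
	if oldCount == 0 || newCount <= oldCount {
		return false // 0→1 recoveries and non-growing updates stay dynamic
	}
	staticCount := 0
	for _, s := range servers {
		if !s.Dynamic {
			staticCount++
		}
	}
	return staticCount == 1
}

func main() {
	servers := []backendServer{
		{Name: "pod-a", Dynamic: false},
		{Name: "_dynamic-pod-1", Dynamic: true},
	}
	fmt.Println(needsReloadForHealthCheck(1, 2, servers)) // single→multi
	fmt.Println(needsReloadForHealthCheck(0, 1, servers)) // 0→1 recovery
}
```

The `oldCount == 0` guard is the piece the review comment adds; without it, a backend recovering from zero endpoints would also be forced through a reload.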

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6ba5be08-7415-484f-a7bf-d842c835c717

📥 Commits

Reviewing files that changed from the base of the PR and between b3414b2 and 1a6e021.

📒 Files selected for processing (2)

images/router/haproxy/conf/haproxy-config.template
pkg/router/template/configmanager/haproxy/manager.go


openshift-ci · 2026-03-10T00:51:47Z

@jcmoraisjr: all tests passed!


alebedev87 · 2026-03-11T14:49:07Z

/assign @Thealisyed
/assign @gcs278

Thealisyed

Does the current integration test coverage exercise the 1 → 2 scale-out path with DCM? Is it worth adding that test coverage anyway?

jcmoraisjr · 2026-03-12T11:48:05Z

Hi @Thealisyed, this is a good point. I still have openshift/origin#30741 pending because some PRs it depends on are not merged yet; I'll add a specific test for this scenario there as well. I just cc'd you there.

ShudiLi · 2026-03-13T08:11:34Z

Tested with 4.22.0-0-2026-03-13-061337-test-ci-ln-2qjtwd2-latest. When the replicas were scaled up from 1 to 2, `check inter 5000ms` was added to the server pod lines in haproxy.config as expected. It was also added to the server-template _dynamic-pod line.

1. When replicas was 1
  cookie b94bb237dc742029fe83e6d395082b86 insert indirect nocache httponly dynamic
  server pod:appach-server-59c457d9d4-sf4zq:unsec-apach:unsec-apach:10.129.2.18:8080 10.129.2.18:8080 cookie fcd6b1516c6d1198f367e4a9286c9c51 weight 1
  dynamic-cookie-key b94bb237dc742029fe83e6d395082b86
  server-template _dynamic-pod- 1-1 172.4.0.4:8765 check inter 5000ms disabled

2. When replicas was 2
  cookie b94bb237dc742029fe83e6d395082b86 insert indirect nocache httponly dynamic
  server pod:appach-server-59c457d9d4-p84sz:unsec-apach:unsec-apach:10.128.2.10:8080 10.128.2.10:8080 cookie 64fb594ed721f91956579b9b337ae61f weight 1 check inter 5000ms
  server pod:appach-server-59c457d9d4-sf4zq:unsec-apach:unsec-apach:10.129.2.18:8080 10.129.2.18:8080 cookie fcd6b1516c6d1198f367e4a9286c9c51 weight 1 check inter 5000ms
  dynamic-cookie-key b94bb237dc742029fe83e6d395082b86
  server-template _dynamic-pod- 1-1 172.4.0.4:8765 check inter 5000ms disabled
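
The manual check above could be automated along these lines. This is only a sketch: `serversWithoutCheck` is a hypothetical helper, not part of the router or its test suite, and the shortened server names in `main` are illustrative. It is meaningful only for backends with two or more replicas, since a single-replica backend is expected to have no `check`.

```go
package main

import (
	"fmt"
	"strings"
)

// serversWithoutCheck returns the names of plain "server" lines that lack the
// "check" keyword. "server-template" lines are ignored on purpose, because
// their first field is not exactly "server".
func serversWithoutCheck(backend string) []string {
	var missing []string
	for _, line := range strings.Split(backend, "\n") {
		fields := strings.Fields(line)
		if len(fields) < 2 || fields[0] != "server" {
			continue
		}
		hasCheck := false
		for _, f := range fields[2:] {
			if f == "check" {
				hasCheck = true
				break
			}
		}
		if !hasCheck {
			missing = append(missing, fields[1])
		}
	}
	return missing
}

func main() {
	backend := `  server pod:a 10.128.2.10:8080 cookie x weight 1 check inter 5000ms
  server pod:b 10.129.2.18:8080 cookie y weight 1
  server-template _dynamic-pod- 1-1 172.4.0.4:8765 check inter 5000ms disabled`

	// Any name reported here is a replica that would not be health-checked.
	fmt.Println(serversWithoutCheck(backend))
}
```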

ShudiLi · 2026-03-13T08:14:20Z

/verified by @ShudiLi

openshift-ci-robot · 2026-03-13T08:14:33Z

@ShudiLi: This PR has been marked as verified by @ShudiLi.


Thealisyed

/lgtm
Thanks! Left the approval tag for Grant :)

openshift-ci-robot added the jira/severity-moderate (referenced Jira bug's severity is moderate for the branch this PR is targeting) and jira/valid-reference (this PR references a valid Jira ticket of any type) labels Mar 4, 2026

openshift-ci-robot added the jira/valid-bug label (a referenced Jira bug is valid for the branch this PR is targeting) Mar 4, 2026

openshift-ci bot requested review from ShudiLi, bentito and davidesalerno March 4, 2026 19:05

jcmoraisjr force-pushed the OCPBUGS-77773-single-replica-reload branch from 9f0feb6 to 1a6e021 March 9, 2026 20:09

jcmoraisjr mentioned this pull request Mar 9, 2026

OCPBUGS-77773: fix backend server health check for DCM enabled #746

Closed

coderabbitai bot reviewed Mar 9, 2026


openshift-ci bot assigned gcs278 and Thealisyed Mar 11, 2026

Thealisyed reviewed Mar 12, 2026


openshift-ci-robot added the verified label (the PR passed pre-merge verification criteria) Mar 13, 2026

Thealisyed reviewed Mar 16, 2026


openshift-ci bot added the lgtm label (the PR is ready to be merged) Mar 16, 2026

jcmoraisjr wants to merge 1 commit into openshift:master from jcmoraisjr:OCPBUGS-77773-single-replica-reload
