
CORENET-6572: only report Progressing for active network rollouts#2937

Open
jluhrsen wants to merge 1 commit into openshift:master from jluhrsen:CORENET-6572-codex

Conversation

@jluhrsen
Contributor

Keep pod-based Progressing tied to an actual CNO rollout instead of temporary unavailability during node reboot churn. Persist the rollout generation in status manager state so Progressing stays true until the rollout is both observed and fully available.

For machine config status, stop treating generic MCP node convergence as a CNO rollout signal. Check whether the CNO machine config is still present in the pool's rendered source list so routine MCO updates do not flip network Progressing to true.

Co-Authored-by: Claude Code and Codex

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 17, 2026
@openshift-ci-robot
Contributor

openshift-ci-robot commented Mar 17, 2026

@jluhrsen: This pull request references CORENET-6572 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Keep pod-based Progressing tied to an actual CNO rollout instead of temporary unavailability during node reboot churn. Persist the rollout generation in status manager state so Progressing stays true until the rollout is both observed and fully available.

For machine config status, stop treating generic MCP node convergence as a CNO rollout signal. Check whether the CNO machine config is still present in the pool's rendered source list so routine MCO updates do not flip network Progressing to true.

Co-Authored-by: Claude Code and Codex

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai bot commented Mar 17, 2026

Walkthrough

Reworks MachineConfigPool checks to use new internal helpers and removes the pool-progressing detection. Adds per-object RolloutGeneration and InstallComplete persistence to pod status logic, gating certain status messages on active rollouts. Adds/updates tests for reboot churn, rollout behavior, and install-complete restore.

Changes

Cohort / File(s) Summary
MachineConfig status refactor
pkg/controller/statusmanager/machineconfig_status.go
Removed mcutil usage and the isAnyMachineConfigPoolProgressing path. Added two internal helpers that evaluate pool membership against pool.Status.Configuration.Source: areMachineConfigsRenderedOnPoolSource and areMachineConfigsRemovedFromPoolSource.
Pod status rollout & install-complete
pkg/controller/statusmanager/pod_status.go
Persisted InstallComplete in podState; added RolloutGeneration int64 to daemonset/deployment/statefulset state structs and rolloutGeneration(...). Gated availability/scheduling messages and some progressing logic on per-object rollout activity; adjusted last-seen state write semantics.
Tests and expectations
pkg/controller/statusmanager/status_manager_test.go
Added tests TestStatusManagerSetFromMachineConfigPoolIgnoresNodeRebootChurn and TestStatusManagerRestoresInstallCompleteAfterRestart. Updated existing tests to reflect rollout-tracking and install-complete behavior, adjusted fixtures (Generation/ObservedGeneration) and some expected Progressing conditions; ignored an unused return value in one call site.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes


Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.11.3)

level=error msg="Running error: context loading failed: failed to load packages: failed to load packages: failed to load with go/packages: err: exit status 1: stderr: go: inconsistent vendoring in :\n\tgithub.com/Masterminds/semver@v1.5.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/Masterminds/sprig/v3@v3.2.3: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/containernetworking/cni@v0.8.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/ghodss/yaml@v1.0.1-0.20190212211648-25d852aebe32: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/go-bindata/go-bindata@v3.1.2+incompatible: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/onsi/gomega@v1.38.1: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/ope

... [truncated 17231 characters] ...

ired in go.mod, but not marked as explicit in vendor/modules.txt\n\tk8s.io/gengo/v2@v2.0.0-20250922181213-ec3ebc5fd46b: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tk8s.io/kms@v0.34.1: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tk8s.io/kube-aggregator@v0.34.1: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tsigs.k8s.io/randfill@v1.0.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tsigs.k8s.io/structured-merge-diff/v6@v6.3.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\n\tTo ignore the vendor directory, use -mod=readonly or -mod=mod.\n\tTo sync the vendor directory, run:\n\t\tgo mod vendor\n"


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci bot requested review from arghosh93 and arkadeepsen March 17, 2026 21:48
@openshift-ci
Contributor

openshift-ci bot commented Mar 17, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jluhrsen
Once this PR has been reviewed and has the lgtm label, please assign danwinship for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
pkg/controller/statusmanager/status_manager_test.go (1)

1700-1736: Please add a cold-start mid-rollout regression case.

Current tests validate rollout continuation with existing state and reboot churn without rollout, but they don’t cover first-observation behavior when last-seen state is empty and workload is already at observedGeneration==generation, updated==replicas, unavailable>0. That scenario should be pinned to avoid false-negative Progressing.

As per coding guidelines: "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."

Also applies to: 1826-1856

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/controller/statusmanager/status_manager_test.go` around lines 1700 -
1736, Add a new unit test that covers the cold-start mid-rollout regression:
create a Deployment (reuse depB or name it depColdStart) with
Status.ObservedGeneration == Generation, Status.UpdatedReplicas ==
Status.Replicas, and Status.UnavailableReplicas > 0 but simulate an empty
last-seen state (i.e., do not pre-populate any prior status in the status
manager). Then call setStatus(t, client, depColdStart), run
status.SetFromPods(), and call getStatuses(client, "testing"); finally assert
via conditionsInclude on oc.Status.Conditions that OperatorStatusTypeProgressing
is ConditionTrue (and other expected conditions mirror the other tests). Use the
same helpers referenced in the diff (setStatus, status.SetFromPods, getStatuses,
conditionsInclude) and place the test near the existing rollout tests so it
executes in the same context.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/controller/statusmanager/pod_status.go`:
- Around line 93-110: The dsState.RolloutGeneration must be initialized when the
controller first sees a DaemonSet mid-convergence even if
generation==observedGeneration; change the initialization logic in the DaemonSet
handling (variables: dsState, RolloutGeneration, currentRolloutGeneration) so
that when a state entry is new or RolloutGeneration==0 and the DaemonSet has
in-flight convergence (ds.Status.NumberUnavailable > 0 OR
(ds.Status.DesiredNumberScheduled > 0 && ds.Status.UpdatedNumberScheduled <
ds.Status.DesiredNumberScheduled)), you set dsState.RolloutGeneration =
currentRolloutGeneration (respecting the existing status.installComplete guard
if needed); apply the same initialization pattern to the analogous StatefulSet
and Deployment blocks (the corresponding variables and checks around
observedGeneration, UpdatedNumberScheduled/Ready/DesiredNumberScheduled, and
NumberUnavailable/NotReady).


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c8f95521-c9ef-41f3-bf7d-f168f710f1c4

📥 Commits

Reviewing files that changed from the base of the PR and between 3b5ef2d and da9acf2.

📒 Files selected for processing (3)
  • pkg/controller/statusmanager/machineconfig_status.go
  • pkg/controller/statusmanager/pod_status.go
  • pkg/controller/statusmanager/status_manager_test.go

@jluhrsen jluhrsen force-pushed the CORENET-6572-codex branch from da9acf2 to 2e2113f Compare March 18, 2026 22:30
@jluhrsen
Contributor Author

@danwinship, if you have a chance, please check this one out. Looks like the network operator will hit this issue in 90% of the 4.21->4.22 aws-ovn-upgrade jobs, so running it here a few times will be enough to validate the fix from that high level.

/payload-job-aggregate periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade 5

@jluhrsen
Contributor Author

/retest

Keep pod-based Progressing tied to an actual CNO rollout instead of
temporary unavailability during node reboot churn. Persist the rollout
generation in status manager state so Progressing stays true until the
rollout is both observed and fully available.

For machine config status, stop treating generic MCP node convergence as
a CNO rollout signal. Check whether the CNO machine config is still
present in the pool's rendered source list so routine MCO updates do not
flip network Progressing to true.

Also persist install completion in the same annotation so a restarted CNO
does not fall back into startup-only Progressing paths during a cluster
upgrade. That prevents noncritical networking pods from briefly setting
Progressing=true after the actual network rollout is already complete.

Signed-off-by: Jamo Luhrsen <jluhrsen@gmail.com>
Co-Authored-by: Claude Code and Codex
@jluhrsen jluhrsen force-pushed the CORENET-6572-codex branch from 2e2113f to bb1965f Compare March 19, 2026 16:07
@jluhrsen
Contributor Author

/payload-aggregate periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade 5

@openshift-ci
Contributor

openshift-ci bot commented Mar 19, 2026

@jluhrsen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/0c24b550-23ae-11f1-97df-f16d2b83fd79-0

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/controller/statusmanager/status_manager_test.go`:
- Line 520: The test is asserting co.Status.Versions from a stale
ClusterOperator object after calling getStatuses; update the test to re-fetch or
assign the returned updated ClusterOperator before asserting Versions (e.g., use
the returned oc from getStatuses to set co = oc or call getStatuses again) so
that assertions against Status.Versions use the latest ClusterOperator state;
specifically adjust the places around the getStatuses call and the subsequent
assertions that reference co.Status.Versions (occurrences around getStatuses, co
variable, and Status.Versions) to use the refreshed object.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e3654515-9f48-47fd-8ed3-357ead25e962

📥 Commits

Reviewing files that changed from the base of the PR and between 2e2113f and bb1965f.

📒 Files selected for processing (3)
  • pkg/controller/statusmanager/machineconfig_status.go
  • pkg/controller/statusmanager/pod_status.go
  • pkg/controller/statusmanager/status_manager_test.go
✅ Files skipped from review due to trivial changes (1)
  • pkg/controller/statusmanager/pod_status.go

 	t.Fatalf("error processing machine config pools: %v", err)
 }
-	co, oc, err = getStatuses(client, "testing")
+	_, oc, err = getStatuses(client, "testing")

⚠️ Potential issue | 🟠 Major

Refresh co before asserting Status.Versions to avoid stale-state checks.

At Line 520 and Line 564, the updated ClusterOperator result is discarded, but later assertions still read co.Status.Versions (Lines 540 and 576). That validates an older object and can hide regressions.

Proposed fix
-	_, oc, err = getStatuses(client, "testing")
+	co, oc, err = getStatuses(client, "testing")
@@
-	_, oc, err = getStatuses(client, "testing")
+	co, oc, err = getStatuses(client, "testing")

Also applies to: 540-542, 564-564, 576-578


@petr-muller
Member

sorry about the /payload breakage, that was unfortunately me

@petr-muller
Member

I hope it should work now

/payload-aggregate periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade 5

@openshift-ci
Contributor

openshift-ci bot commented Mar 19, 2026

@petr-muller: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/5e2eeb00-23b2-11f1-9a0c-ae4d6f10dbf5-0

@jluhrsen
Contributor Author

I hope it should work now

/payload-aggregate periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade 5

no worries @petr-muller . thanks for kicking it off again

@openshift-ci
Contributor

openshift-ci bot commented Mar 19, 2026

@jluhrsen: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

All failed tests ran against commit bb1965f:

  • ci/prow/4.22-upgrade-from-stable-4.21-e2e-gcp-ovn-upgrade (optional): /test 4.22-upgrade-from-stable-4.21-e2e-gcp-ovn-upgrade
  • ci/prow/e2e-aws-ovn-hypershift-conformance (required): /test e2e-aws-ovn-hypershift-conformance
  • ci/prow/security (optional): /test security
  • ci/prow/e2e-aws-ovn-rhcos10-techpreview (optional): /test e2e-aws-ovn-rhcos10-techpreview
  • ci/prow/4.22-upgrade-from-stable-4.21-e2e-azure-ovn-upgrade (optional): /test 4.22-upgrade-from-stable-4.21-e2e-azure-ovn-upgrade
  • ci/prow/e2e-azure-ovn-upgrade (required): /test e2e-azure-ovn-upgrade
  • ci/prow/e2e-metal-ipi-ovn-dualstack-bgp-local-gw (required): /test e2e-metal-ipi-ovn-dualstack-bgp-local-gw
  • ci/prow/e2e-metal-ipi-ovn-dualstack-bgp (required): /test e2e-metal-ipi-ovn-dualstack-bgp

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
