
CORENET-6572: only report Progressing for active network rollouts#2937

Open
jluhrsen wants to merge 1 commit into openshift:master from jluhrsen:CORENET-6572-codex

Conversation

@jluhrsen
Contributor

Keep pod-based Progressing tied to an actual CNO rollout instead of temporary unavailability during node reboot churn. Persist the rollout generation in status manager state so Progressing stays true until the rollout is both observed and fully available.

For machine config status, stop treating generic MCP node convergence as a CNO rollout signal. Check whether the CNO machine config is still present in the pool's rendered source list so routine MCO updates do not flip network Progressing to true.

Co-Authored-by: Claude Code and Codex

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 17, 2026
@openshift-ci-robot
Contributor

openshift-ci-robot commented Mar 17, 2026

@jluhrsen: This pull request references CORENET-6572 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Keep pod-based Progressing tied to an actual CNO rollout instead of temporary unavailability during node reboot churn. Persist the rollout generation in status manager state so Progressing stays true until the rollout is both observed and fully available.

For machine config status, stop treating generic MCP node convergence as a CNO rollout signal. Check whether the CNO machine config is still present in the pool's rendered source list so routine MCO updates do not flip network Progressing to true.

Co-Authored-by: Claude Code and Codex

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai bot commented Mar 17, 2026

Walkthrough

Reworks MachineConfigPool checks to use new internal helpers and removes the pool-progressing detection. Adds per-object RolloutGeneration and InstallComplete persistence to pod status logic, gating certain status messages on active rollouts. Adds/updates tests for reboot churn, rollout behavior, and install-complete restore.

Changes

Cohort / File(s) Summary
MachineConfig status refactor
pkg/controller/statusmanager/machineconfig_status.go
Removed mcutil usage and the isAnyMachineConfigPoolProgressing path. Added two internal helpers that evaluate pool membership against pool.Status.Configuration.Source: areMachineConfigsRenderedOnPoolSource and areMachineConfigsRemovedFromPoolSource.
Pod status rollout & install-complete
pkg/controller/statusmanager/pod_status.go
Persisted InstallComplete in podState; added RolloutGeneration int64 to daemonset/deployment/statefulset state structs and rolloutGeneration(...). Gated availability/scheduling messages and some progressing logic on per-object rollout activity; adjusted last-seen state write semantics.
Tests and expectations
pkg/controller/statusmanager/status_manager_test.go
Added tests TestStatusManagerSetFromMachineConfigPoolIgnoresNodeRebootChurn and TestStatusManagerRestoresInstallCompleteAfterRestart. Updated existing tests to reflect rollout-tracking and install-complete behavior, adjusted fixtures (Generation/ObservedGeneration) and some expected Progressing conditions; ignored an unused return value in one call site.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes


Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.11.3)

level=error msg="Running error: context loading failed: failed to load packages: failed to load packages: failed to load with go/packages: err: exit status 1: stderr: go: inconsistent vendoring in :\n\tgithub.com/Masterminds/semver@v1.5.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/Masterminds/sprig/v3@v3.2.3: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/containernetworking/cni@v0.8.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/ghodss/yaml@v1.0.1-0.20190212211648-25d852aebe32: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/go-bindata/go-bindata@v3.1.2+incompatible: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/onsi/gomega@v1.38.1: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/ope

... [truncated 17231 characters] ...

ired in go.mod, but not marked as explicit in vendor/modules.txt\n\tk8s.io/gengo/v2@v2.0.0-20250922181213-ec3ebc5fd46b: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tk8s.io/kms@v0.34.1: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tk8s.io/kube-aggregator@v0.34.1: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tsigs.k8s.io/randfill@v1.0.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tsigs.k8s.io/structured-merge-diff/v6@v6.3.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\n\tTo ignore the vendor directory, use -mod=readonly or -mod=mod.\n\tTo sync the vendor directory, run:\n\t\tgo mod vendor\n"


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci bot requested review from arghosh93 and arkadeepsen March 17, 2026 21:48
@openshift-ci
Contributor

openshift-ci bot commented Mar 17, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jluhrsen
Once this PR has been reviewed and has the lgtm label, please assign danwinship for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
pkg/controller/statusmanager/status_manager_test.go (1)

1700-1736: Please add a cold-start mid-rollout regression case.

Current tests validate rollout continuation with existing state and reboot churn without rollout, but they don’t cover first-observation behavior when last-seen state is empty and workload is already at observedGeneration==generation, updated==replicas, unavailable>0. That scenario should be pinned to avoid false-negative Progressing.

As per coding guidelines: "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."

Also applies to: 1826-1856

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/controller/statusmanager/status_manager_test.go` around lines 1700 -
1736, Add a new unit test that covers the cold-start mid-rollout regression:
create a Deployment (reuse depB or name it depColdStart) with
Status.ObservedGeneration == Generation, Status.UpdatedReplicas ==
Status.Replicas, and Status.UnavailableReplicas > 0 but simulate an empty
last-seen state (i.e., do not pre-populate any prior status in the status
manager). Then call setStatus(t, client, depColdStart), run
status.SetFromPods(), and call getStatuses(client, "testing"); finally assert
via conditionsInclude on oc.Status.Conditions that OperatorStatusTypeProgressing
is ConditionTrue (and other expected conditions mirror the other tests). Use the
same helpers referenced in the diff (setStatus, status.SetFromPods, getStatuses,
conditionsInclude) and place the test near the existing rollout tests so it
executes in the same context.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/controller/statusmanager/pod_status.go`:
- Around line 93-110: The dsState.RolloutGeneration must be initialized when the
controller first sees a DaemonSet mid-convergence even if
generation==observedGeneration; change the initialization logic in the DaemonSet
handling (variables: dsState, RolloutGeneration, currentRolloutGeneration) so
that when a state entry is new or RolloutGeneration==0 and the DaemonSet has
in-flight convergence (ds.Status.NumberUnavailable > 0 OR
(ds.Status.DesiredNumberScheduled > 0 && ds.Status.UpdatedNumberScheduled <
ds.Status.DesiredNumberScheduled)), you set dsState.RolloutGeneration =
currentRolloutGeneration (respecting the existing status.installComplete guard
if needed); apply the same initialization pattern to the analogous StatefulSet
and Deployment blocks (the corresponding variables and checks around
observedGeneration, UpdatedNumberScheduled/Ready/DesiredNumberScheduled, and
NumberUnavailable/NotReady).


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c8f95521-c9ef-41f3-bf7d-f168f710f1c4

📥 Commits

Reviewing files that changed from the base of the PR and between 3b5ef2d and da9acf2.

📒 Files selected for processing (3)
  • pkg/controller/statusmanager/machineconfig_status.go
  • pkg/controller/statusmanager/pod_status.go
  • pkg/controller/statusmanager/status_manager_test.go

@jluhrsen jluhrsen force-pushed the CORENET-6572-codex branch from da9acf2 to 2e2113f Compare March 18, 2026 22:30
@jluhrsen
Contributor Author

@danwinship, if you have a chance, please check this one out. Looks like the network operator will hit this issue in 90% of the 4.21->4.22 aws-ovn-upgrade jobs, so running it here a few times will be enough to validate the fix from that high level.

/payload-job-aggregate periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade 5

@jluhrsen
Contributor Author

/retest

Keep pod-based Progressing tied to an actual CNO rollout instead of
temporary unavailability during node reboot churn. Persist the rollout
generation in status manager state so Progressing stays true until the
rollout is both observed and fully available.

For machine config status, stop treating generic MCP node convergence as
a CNO rollout signal. Check whether the CNO machine config is still
present in the pool's rendered source list so routine MCO updates do not
flip network Progressing to true.

Also persist install completion in the same annotation so a restarted CNO
does not fall back into startup-only Progressing paths during a cluster
upgrade. That prevents noncritical networking pods from briefly setting
Progressing=true after the actual network rollout is already complete.

Signed-off-by: Jamo Luhrsen <jluhrsen@gmail.com>
Co-Authored-by: Claude Code and Codex
@jluhrsen jluhrsen force-pushed the CORENET-6572-codex branch from 2e2113f to bb1965f Compare March 19, 2026 16:07
@jluhrsen
Contributor Author

/payload-aggregate periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade 5

@openshift-ci
Contributor

openshift-ci bot commented Mar 19, 2026

@jluhrsen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/0c24b550-23ae-11f1-97df-f16d2b83fd79-0

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/controller/statusmanager/status_manager_test.go`:
- Line 520: The test is asserting co.Status.Versions from a stale
ClusterOperator object after calling getStatuses; update the test to re-fetch or
assign the returned updated ClusterOperator before asserting Versions (e.g., use
the returned oc from getStatuses to set co = oc or call getStatuses again) so
that assertions against Status.Versions use the latest ClusterOperator state;
specifically adjust the places around the getStatuses call and the subsequent
assertions that reference co.Status.Versions (occurrences around getStatuses, co
variable, and Status.Versions) to use the refreshed object.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e3654515-9f48-47fd-8ed3-357ead25e962

📥 Commits

Reviewing files that changed from the base of the PR and between 2e2113f and bb1965f.

📒 Files selected for processing (3)
  • pkg/controller/statusmanager/machineconfig_status.go
  • pkg/controller/statusmanager/pod_status.go
  • pkg/controller/statusmanager/status_manager_test.go
✅ Files skipped from review due to trivial changes (1)
  • pkg/controller/statusmanager/pod_status.go

 	t.Fatalf("error processing machine config pools: %v", err)
 }
-	co, oc, err = getStatuses(client, "testing")
+	_, oc, err = getStatuses(client, "testing")

⚠️ Potential issue | 🟠 Major

Refresh co before asserting Status.Versions to avoid stale-state checks.

At Line 520 and Line 564, the updated ClusterOperator result is discarded, but later assertions still read co.Status.Versions (Lines 540 and 576). That validates an older object and can hide regressions.

Proposed fix
-	_, oc, err = getStatuses(client, "testing")
+	co, oc, err = getStatuses(client, "testing")
@@
-	_, oc, err = getStatuses(client, "testing")
+	co, oc, err = getStatuses(client, "testing")

Also applies to: 540-542, 564-564, 576-578


@petr-muller
Member

sorry about the /payload breakage, that was unfortunately me

@petr-muller
Member

I hope it should work now

/payload-aggregate periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade 5

@openshift-ci
Contributor

openshift-ci bot commented Mar 19, 2026

@petr-muller: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/5e2eeb00-23b2-11f1-9a0c-ae4d6f10dbf5-0

@jluhrsen
Contributor Author

I hope it should work now

/payload-aggregate periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade 5

no worries @petr-muller . thanks for kicking it off again

@openshift-ci
Contributor

openshift-ci bot commented Mar 19, 2026

@jluhrsen: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

All failed tests ran against commit bb1965f:

  • ci/prow/4.22-upgrade-from-stable-4.21-e2e-gcp-ovn-upgrade (optional): /test 4.22-upgrade-from-stable-4.21-e2e-gcp-ovn-upgrade
  • ci/prow/e2e-aws-ovn-hypershift-conformance (required): /test e2e-aws-ovn-hypershift-conformance
  • ci/prow/security (optional): /test security
  • ci/prow/e2e-aws-ovn-rhcos10-techpreview (optional): /test e2e-aws-ovn-rhcos10-techpreview
  • ci/prow/4.22-upgrade-from-stable-4.21-e2e-azure-ovn-upgrade (optional): /test 4.22-upgrade-from-stable-4.21-e2e-azure-ovn-upgrade
  • ci/prow/e2e-azure-ovn-upgrade (required): /test e2e-azure-ovn-upgrade
  • ci/prow/e2e-metal-ipi-ovn-dualstack-bgp-local-gw (required): /test e2e-metal-ipi-ovn-dualstack-bgp-local-gw
  • ci/prow/e2e-metal-ipi-ovn-dualstack-bgp (required): /test e2e-metal-ipi-ovn-dualstack-bgp

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
