CORENET-6572: only report Progressing for active network rollouts#2937
CORENET-6572: only report Progressing for active network rollouts#2937jluhrsen wants to merge 1 commit intoopenshift:masterfrom
Conversation
|
@jluhrsen: This pull request references CORENET-6572 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
WalkthroughReworks MachineConfigPool checks to use new internal helpers and removes the pool-progressing detection. Adds per-object RolloutGeneration and InstallComplete persistence to pod status logic, gating certain status messages on active rollouts. Adds/updates tests for reboot churn, rollout behavior, and install-complete restore. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes ✨ Finishing Touches🧪 Generate unit tests (beta)
📝 Coding Plan
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (2.11.3)level=error msg="Running error: context loading failed: failed to load packages: failed to load packages: failed to load with go/packages: err: exit status 1: stderr: go: inconsistent vendoring in :\n\tgithub.com/Masterminds/semver@v1.5.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/Masterminds/sprig/v3@v3.2.3: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/containernetworking/cni@v0.8.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/ghodss/yaml@v1.0.1-0.20190212211648-25d852aebe32: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/go-bindata/go-bindata@v3.1.2+incompatible: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/onsi/gomega@v1.38.1: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/ope ... [truncated 17231 characters] ... ired in go.mod, but not marked as explicit in vendor/modules.txt\n\tk8s.io/gengo/v2@v2.0.0-20250922181213-ec3ebc5fd46b: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tk8s.io/kms@v0.34.1: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tk8s.io/kube-aggregator@v0.34.1: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tsigs.k8s.io/randfill@v1.0.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tsigs.k8s.io/structured-merge-diff/v6@v6.3.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\n\tTo ignore the vendor directory, use -mod=readonly or -mod=mod.\n\tTo sync the vendor directory, run:\n\t\tgo mod vendor\n" Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: jluhrsen The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
pkg/controller/statusmanager/status_manager_test.go (1)
1700-1736: Please add a cold-start mid-rollout regression case.Current tests validate rollout continuation with existing state and reboot churn without rollout, but they don’t cover first-observation behavior when last-seen state is empty and workload is already at
observedGeneration==generation,updated==replicas,unavailable>0. That scenario should be pinned to avoid false-negative Progressing.As per coding guidelines, "**: Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
Also applies to: 1826-1856
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/controller/statusmanager/status_manager_test.go` around lines 1700 - 1736, Add a new unit test that covers the cold-start mid-rollout regression: create a Deployment (reuse depB or name it depColdStart) with Status.ObservedGeneration == Generation, Status.UpdatedReplicas == Status.Replicas, and Status.UnavailableReplicas > 0 but simulate an empty last-seen state (i.e., do not pre-populate any prior status in the status manager). Then call setStatus(t, client, depColdStart), run status.SetFromPods(), and call getStatuses(client, "testing"); finally assert via conditionsInclude on oc.Status.Conditions that OperatorStatusTypeProgressing is ConditionTrue (and other expected conditions mirror the other tests). Use the same helpers referenced in the diff (setStatus, status.SetFromPods, getStatuses, conditionsInclude) and place the test near the existing rollout tests so it executes in the same context.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/controller/statusmanager/pod_status.go`:
- Around line 93-110: The dsState.RolloutGeneration must be initialized when the
controller first sees a DaemonSet mid-convergence even if
generation==observedGeneration; change the initialization logic in the DaemonSet
handling (variables: dsState, RolloutGeneration, currentRolloutGeneration) so
that when a state entry is new or RolloutGeneration==0 and the DaemonSet has
in-flight convergence (ds.Status.NumberUnavailable > 0 OR
(ds.Status.DesiredNumberScheduled > 0 && ds.Status.UpdatedNumberScheduled <
ds.Status.DesiredNumberScheduled)), you set dsState.RolloutGeneration =
currentRolloutGeneration (respecting the existing status.installComplete guard
if needed); apply the same initialization pattern to the analogous StatefulSet
and Deployment blocks (the corresponding variables and checks around
observedGeneration, UpdatedNumberScheduled/Ready/DesiredNumberScheduled, and
NumberUnavailable/NotReady).
---
Nitpick comments:
In `@pkg/controller/statusmanager/status_manager_test.go`:
- Around line 1700-1736: Add a new unit test that covers the cold-start
mid-rollout regression: create a Deployment (reuse depB or name it depColdStart)
with Status.ObservedGeneration == Generation, Status.UpdatedReplicas ==
Status.Replicas, and Status.UnavailableReplicas > 0 but simulate an empty
last-seen state (i.e., do not pre-populate any prior status in the status
manager). Then call setStatus(t, client, depColdStart), run
status.SetFromPods(), and call getStatuses(client, "testing"); finally assert
via conditionsInclude on oc.Status.Conditions that OperatorStatusTypeProgressing
is ConditionTrue (and other expected conditions mirror the other tests). Use the
same helpers referenced in the diff (setStatus, status.SetFromPods, getStatuses,
conditionsInclude) and place the test near the existing rollout tests so it
executes in the same context.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: c8f95521-c9ef-41f3-bf7d-f168f710f1c4
📒 Files selected for processing (3)
pkg/controller/statusmanager/machineconfig_status.gopkg/controller/statusmanager/pod_status.gopkg/controller/statusmanager/status_manager_test.go
da9acf2 to
2e2113f
Compare
|
@danwinship , if you have a chance, please check this one out. looks like network operator will hit this issue in 90% of the 4.21->4.22 aws-ovn-upgrade jobs so running it here a few times will be enough to validate the fix from that high level. /payload-job-aggregate periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade 5 |
|
/retest |
Keep pod-based Progressing tied to an actual CNO rollout instead of temporary unavailability during node reboot churn. Persist the rollout generation in status manager state so Progressing stays true until the rollout is both observed and fully available. For machine config status, stop treating generic MCP node convergence as a CNO rollout signal. Check whether the CNO machine config is still present in the pool's rendered source list so routine MCO updates do not flip network Progressing to true. Also persist install completion in the same annotation so a restarted CNO does not fall back into startup-only Progressing paths during a cluster upgrade. That prevents noncritical networking pods from briefly setting Progressing=true after the actual network rollout is already complete. Signed-off-by: Jamo Luhrsen <jluhrsen@gmail.com> Co-Authored-by: Claude Code and Codex
2e2113f to
bb1965f
Compare
|
/payload-aggregate periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade 5 |
|
@jluhrsen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/0c24b550-23ae-11f1-97df-f16d2b83fd79-0 |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/controller/statusmanager/status_manager_test.go`:
- Line 520: The test is asserting co.Status.Versions from a stale
ClusterOperator object after calling getStatuses; update the test to re-fetch or
assign the returned updated ClusterOperator before asserting Versions (e.g., use
the returned oc from getStatuses to set co = oc or call getStatuses again) so
that assertions against Status.Versions use the latest ClusterOperator state;
specifically adjust the places around the getStatuses call and the subsequent
assertions that reference co.Status.Versions (occurrences around getStatuses, co
variable, and Status.Versions) to use the refreshed object.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: e3654515-9f48-47fd-8ed3-357ead25e962
📒 Files selected for processing (3)
pkg/controller/statusmanager/machineconfig_status.gopkg/controller/statusmanager/pod_status.gopkg/controller/statusmanager/status_manager_test.go
✅ Files skipped from review due to trivial changes (1)
- pkg/controller/statusmanager/pod_status.go
| t.Fatalf("error processing machine config pools: %v", err) | ||
| } | ||
| co, oc, err = getStatuses(client, "testing") | ||
| _, oc, err = getStatuses(client, "testing") |
There was a problem hiding this comment.
Refresh co before asserting Status.Versions to avoid stale-state checks.
At Line 520 and Line 564, the updated ClusterOperator result is discarded, but later assertions still read co.Status.Versions (Lines 540 and 576). That validates an older object and can hide regressions.
Proposed fix
- _, oc, err = getStatuses(client, "testing")
+ co, oc, err = getStatuses(client, "testing")
@@
- _, oc, err = getStatuses(client, "testing")
+ co, oc, err = getStatuses(client, "testing")Also applies to: 540-542, 564-564, 576-578
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/controller/statusmanager/status_manager_test.go` at line 520, The test is
asserting co.Status.Versions from a stale ClusterOperator object after calling
getStatuses; update the test to re-fetch or assign the returned updated
ClusterOperator before asserting Versions (e.g., use the returned oc from
getStatuses to set co = oc or call getStatuses again) so that assertions against
Status.Versions use the latest ClusterOperator state; specifically adjust the
places around the getStatuses call and the subsequent assertions that reference
co.Status.Versions (occurrences around getStatuses, co variable, and
Status.Versions) to use the refreshed object.
|
sorry about the |
|
I hope it shoudl work now /payload-aggregate periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade 5 |
|
@petr-muller: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/5e2eeb00-23b2-11f1-9a0c-ae4d6f10dbf5-0 |
no worries @petr-muller . thanks for kicking it off again |
|
@jluhrsen: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Keep pod-based Progressing tied to an actual CNO rollout instead of temporary unavailability during node reboot churn. Persist the rollout generation in status manager state so Progressing stays true until the rollout is both observed and fully available.
For machine config status, stop treating generic MCP node convergence as a CNO rollout signal. Check whether the CNO machine config is still present in the pool's rendered source list so routine MCO updates do not flip network Progressing to true.
Co-Authored-by: Claude Code and Codex