CORENET-6714: Enable Network Observability on Day 0 #2925
OlivierCazade wants to merge 9 commits into openshift:master from
Conversation
…pport

Implements a new controller to automatically install and manage the Network Observability Operator via OLM. The controller handles the complete lifecycle including operator installation, readiness checking, and FlowCollector creation.

Key features:
- Opt-out installation model: Network Observability is installed by default unless explicitly disabled via `spec.installNetworkObservability`
- SNO (Single Node OpenShift) detection: automatically skips installation on SNO clusters unless explicitly enabled, to reduce resource consumption
- Comprehensive status reporting: sets degraded status with detailed error messages for all failure scenarios (operator installation, readiness timeouts, FlowCollector creation)
- Idempotent reconciliation: safely handles multiple invocations and concurrent reconciliations

Implementation details:
- Added `shouldInstallNetworkObservability()` function with SNO topology check via `Infrastructure.Status.ControlPlaneTopology`
- Created `StatusReporter` interface for testability and status management
- Added `ObservabilityConfig` status level to `StatusManager`
- Updated RBAC to allow management of OLM resources (Subscriptions, ClusterServiceVersions, OperatorGroups)
- Renamed `observabilityEnabled` to `installNetworkObservability` in the Network spec for clarity and consistency with API conventions

Testing:
- 43 comprehensive unit tests covering all scenarios
- 80.3% code coverage including error paths
- Tests for SNO detection, status updates, and reconciliation flows
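The opt-out plus SNO decision described above can be sketched as follows. This is a hedged illustration, not the PR's actual code: the real controller reads `Infrastructure.Status.ControlPlaneTopology` from the openshift/api types, while this standalone sketch models the spec flag and topology as plain values (`installFlag` and `topology` are illustrative names).

```go
package main

import "fmt"

// Illustrative stand-in for configv1.SingleReplicaTopologyMode; the actual
// controller reads configv1.Infrastructure.Status.ControlPlaneTopology.
const singleReplicaTopology = "SingleReplica"

// shouldInstallNetworkObservability sketches the opt-out decision described
// in the PR: install by default, skip on SNO unless explicitly enabled,
// and honor an explicit disable. installFlag is nil when the field is unset.
func shouldInstallNetworkObservability(installFlag *bool, topology string) bool {
	if installFlag != nil && !*installFlag {
		return false // explicitly disabled
	}
	if topology == singleReplicaTopology {
		// On SNO, install only when explicitly enabled.
		return installFlag != nil && *installFlag
	}
	return true // opt-out model: install by default
}

func main() {
	enabled, disabled := true, false
	fmt.Println(shouldInstallNetworkObservability(nil, "HighlyAvailable"))          // default: install
	fmt.Println(shouldInstallNetworkObservability(nil, singleReplicaTopology))      // SNO default: skip
	fmt.Println(shouldInstallNetworkObservability(&enabled, singleReplicaTopology)) // SNO, explicit enable
	fmt.Println(shouldInstallNetworkObservability(&disabled, "HighlyAvailable"))    // explicit disable
}
```

A pointer-to-bool models the tri-state spec field (unset / true / false), which is why the SNO branch can distinguish "user said install" from "field left at default".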
@OlivierCazade: This pull request references CORENET-6714 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
Walkthrough

Adds a Network Observability controller and tests, RBAC ClusterRole/Binding, OLM Operator and FlowCollector manifests, a new StatusLevel constant, two go.mod replace directives, and a sample-config field for `networkObservability.installationPolicy`.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Controller as Observability Controller
    participant K8sAPI as Kubernetes API
    participant Operator as NetObserv Operator
    participant FlowCollector as FlowCollector Resource
    Controller->>K8sAPI: Reconcile Network CR (cluster)
    activate Controller
    Controller->>Controller: Check if should install<br/>(based on spec & topology)
    alt Should Install
        Controller->>K8sAPI: Create namespaces<br/>(openshift-netobserv-operator, netobserv)
        Controller->>K8sAPI: Apply operator manifest<br/>(OperatorGroup, Subscription)
        activate Operator
        loop Poll for Readiness
            Controller->>K8sAPI: Check ClusterServiceVersion status
            K8sAPI-->>Controller: CSV status
            Note over Controller: Wait for Succeeded state
        end
        Operator-->>Controller: Operator ready
        deactivate Operator
        Controller->>K8sAPI: Check if FlowCollector exists
        alt FlowCollector Missing
            Controller->>K8sAPI: Apply FlowCollector manifest
            K8sAPI->>FlowCollector: Create resource
        end
        Controller->>K8sAPI: Update Network status<br/>(NetworkObservabilityDeployed condition)
    end
    deactivate Controller
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks: ✅ 3 passed | ❌ 2 failed (2 warnings)
Warning: There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.11.3)

```text
level=error msg="Running error: context loading failed: failed to load packages: failed to load packages: failed to load with go/packages: err: exit status 1: stderr: go: inconsistent vendoring in :
	github.com/Masterminds/semver@v1.5.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt
	github.com/Masterminds/sprig/v3@v3.2.3: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt
	github.com/containernetworking/cni@v0.8.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt
	github.com/ghodss/yaml@v1.0.1-0.20190212211648-25d852aebe32: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt
	github.com/go-bindata/go-bindata@v3.1.2+incompatible: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt
	github.com/onsi/gomega@v1.38.1: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt
	github.com/ope ... [truncated 17367 characters] ... quired in go.mod, but not marked as explicit in vendor/modules.txt
	k8s.io/kms@v0.34.1: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt
	k8s.io/kube-aggregator@v0.34.1: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt
	sigs.k8s.io/randfill@v1.0.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt
	sigs.k8s.io/structured-merge-diff/v6@v6.3.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt
	github.com/openshift/api@v0.0.0-20260116192047-6fb7fdae95fd: is replaced in go.mod, but not marked as replaced in vendor/modules.txt

	To ignore the vendor directory, use -mod=readonly or -mod=mod.
	To sync the vendor directory, run:
		go mod vendor
"
```
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: OlivierCazade. The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
Actionable comments posted: 4
🧹 Nitpick comments (4)
manifests/0000_70_cluster-network-operator_02_rbac_observability.yaml (1)

11-14: Consider adding `watch` and `delete` verbs for OLM resources. The controller may need `watch` to properly track subscription and CSV state changes via informers. Additionally, `delete` permission on subscriptions might be needed if you ever want to support uninstallation or cleanup scenarios.

🔧 Suggested addition

```diff
   # Manage OLM resources for operator installation
   - apiGroups: ["operators.coreos.com"]
     resources: ["subscriptions", "clusterserviceversions", "operatorgroups"]
-    verbs: ["get", "list", "create", "update", "patch"]
+    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@manifests/0000_70_cluster-network-operator_02_rbac_observability.yaml` around lines 11-14: the RBAC rule for apiGroups "operators.coreos.com" covering resources ["subscriptions", "clusterserviceversions", "operatorgroups"] is missing the "watch" and "delete" verbs; update the verbs array for that rule to include "watch" (so informers can track state changes) and "delete" (to allow cleanup/uninstallation) in addition to the existing verbs.

pkg/controller/observability/observability_controller_test.go (1)
1512-1518: Test creates files in the working directory, which may cause side effects. This test creates a `manifests/` directory and writes `FlowCollectorYAML` in the working directory. While `defer os.Remove(FlowCollectorYAML)` removes the file, it doesn't remove the `manifests` directory, which could persist across test runs.

🔧 Suggested fix for proper cleanup

```diff
 err := os.MkdirAll("manifests", 0755)
 g.Expect(err).NotTo(HaveOccurred())
+defer os.RemoveAll("manifests") // Clean up directory

 // Create the FlowCollector manifest at the expected path
 err = os.WriteFile(FlowCollectorYAML, []byte(flowCollectorManifest), 0644)
 g.Expect(err).NotTo(HaveOccurred())
-defer os.Remove(FlowCollectorYAML)
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pkg/controller/observability/observability_controller_test.go` around lines 1512-1518: the test creates a "manifests" directory and a file at FlowCollectorYAML using os.MkdirAll and os.WriteFile but only defers removal of the file, leaving the manifests directory behind; update the test to always clean up the directory by deferring a call to remove the directory (e.g. `defer os.RemoveAll("manifests")`) immediately after os.MkdirAll succeeds (and keep or replace the existing `defer os.Remove(FlowCollectorYAML)`), so that both the file created from flowCollectorManifest and the manifests directory are removed after the test.

pkg/controller/observability/observability_controller.go (2)
356-392: Potential race condition in `markNetworkObservabilityDeployed`. The function has a TOCTOU (time-of-check-time-of-use) pattern: it reads the latest Network CR, modifies conditions, then updates. If another controller or user modifies the Network CR between Get and Update, the update will fail with a conflict error (which would be retried by the caller).
This is acceptable since Kubernetes retries on conflicts, but consider using a retry loop here for better resilience.
🔧 Suggested: Add retry for conflict resilience

```diff
 func (r *ReconcileObservability) markNetworkObservabilityDeployed(ctx context.Context, network *configv1.Network) error {
+	return retry.RetryOnConflict(retry.DefaultBackoff, func() error {
+		return r.doMarkNetworkObservabilityDeployed(ctx)
+	})
+}
+
+func (r *ReconcileObservability) doMarkNetworkObservabilityDeployed(ctx context.Context) error {
 	// Check if condition already exists and is true
-	for _, condition := range network.Status.Conditions {
+	latest := &configv1.Network{}
+	if err := r.client.Get(ctx, types.NamespacedName{Name: "cluster"}, latest); err != nil {
+		return err
+	}
+
+	for _, condition := range latest.Status.Conditions {
 		if condition.Type == NetworkObservabilityDeployed && condition.Status == metav1.ConditionTrue {
 			return nil // Already marked as deployed
 		}
 	}
-
-	// Get the latest version of the Network CR to avoid conflicts
-	latest := &configv1.Network{}
-	if err := r.client.Get(ctx, types.NamespacedName{Name: "cluster"}, latest); err != nil {
-		return err
-	}
 	// ... rest of the function
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/controller/observability/observability_controller.go` around lines 356 - 392, markNetworkObservabilityDeployed currently does a Get -> modify -> Status().Update which can fail on a concurrent update; wrap the Get/modify/Status().Update sequence in a retry loop using k8s.io/apimachinery/pkg/util/wait.RetryOnConflict (or retry.RetryOnConflict) to retry on conflicts, re-fetching the latest Network (using r.client.Get) inside the loop before applying the condition and calling r.client.Status().Update so transient conflicts are retried safely; keep the same condition logic and return the final error from the retry.
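The retry-on-conflict pattern recommended in this comment can be illustrated with a self-contained sketch. This is not the controller's code and does not use client-go: `retryOnConflict`, `errConflict`, and the fake update below are hypothetical stand-ins for `retry.RetryOnConflict`, `apierrors.IsConflict`, and the client's Get/Update calls, showing only the get-modify-update loop that re-runs on conflict.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict stands in for a Kubernetes 409 Conflict (resource version mismatch).
var errConflict = errors.New("conflict: resource version mismatch")

// retryOnConflict sketches client-go's retry.RetryOnConflict: re-run fn while
// it fails with a conflict, up to maxRetries attempts, returning the last error.
func retryOnConflict(maxRetries int, fn func() error) error {
	var err error
	for i := 0; i < maxRetries; i++ {
		err = fn()
		if err == nil || !errors.Is(err, errConflict) {
			return err
		}
	}
	return err
}

func main() {
	// Fake update that rejects the first two attempts with a conflict,
	// mimicking another writer bumping the resource version in between.
	failures := 2
	attempts := 0
	err := retryOnConflict(5, func() error {
		attempts++ // in the real pattern, each attempt re-fetches the latest object here
		if failures > 0 {
			failures--
			return errConflict
		}
		return nil // update succeeded
	})
	fmt.Println(err, attempts) // <nil> 3
}
```

The essential point the review makes is that the re-fetch must happen inside the retried closure, so each attempt modifies the newest resource version rather than the stale one.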
288-319: Polling uses the parent context, which may already have timeout pressure.

`waitForNetObservOperator` uses the passed context for polling, but also defines its own `checkTimeout` (10 minutes). If the caller's context has a shorter deadline, the poll will exit early with the caller's context error rather than `context.DeadlineExceeded` from `wait.PollUntilContextTimeout`. The current implementation may work correctly in practice since the controller-runtime context typically doesn't have a deadline, but this could cause unexpected behavior in tests or if the calling context changes.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/controller/observability/observability_controller.go` around lines 288 - 319, The polling uses the caller's ctx which can have a shorter deadline; instead create a new timeout-bound context for the poll so it always runs up to checkTimeout: inside waitForNetObservOperator, call context.WithTimeout(context.Background(), checkTimeout) (defer cancel) and pass that new ctx to wait.PollUntilContextTimeout (keep checkInterval, checkTimeout, condition as before); reference function waitForNetObservOperator and variables checkInterval/checkTimeout to locate where to replace the ctx usage.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@go.mod`:
- Line 166: The go.mod contains a temporary replace directive pointing to a
personal fork for openshift/api#2752; do not merge until the upstream PR is
merged. Add a CI/merge-block gate that fails or flags the PR when the replace
directive "replace github.com/openshift/api" is present, add a clear TODO
comment adjacent to the replace line referencing openshift/api#2752 and stating
it must be removed when that PR merges, and add repository-level tracking (e.g.,
in the PR description or a maintainer checklist) to actively monitor
openshift/api#2752 and remove the replace directive immediately once the
upstream PR is approved and merged.
In `@pkg/controller/observability/observability_controller_test.go`:
- Around line 1250-1265: The test spawns goroutines that call r.Reconcile and
uses g.Expect inside those goroutines, which can panic because Gomega's fail
handler (t.FailNow) must run in the main test goroutine; fix by removing
g.Expect from the goroutines and collecting errors via a channel (e.g., make
errCh chan error), have each goroutine send its err to errCh after calling
r.Reconcile(req), then in the main goroutine close/iterate the channel and
assert with g.Expect that all received errors are nil; alternatively create a
per-goroutine Gomega tied to a testing.T (NewWithT) if you need per-goroutine
assertions—reference r.Reconcile, req, and the done/err channel approach to
locate where to change code.
In `@pkg/controller/observability/observability_controller.go`:
- Around line 169-173: The call to markNetworkObservabilityDeployed(err) is only
logged on failure which prevents the controller from retrying and can lead to
repeated FlowCollector creation; modify the reconciler to return the error
instead of just logging it so the reconciliation is requeued—locate the call to
r.markNetworkObservabilityDeployed(ctx, &network) in the reconciler function
(the surrounding code that currently does klog.Warningf on error) and replace
the log-without-return with returning a wrapped/annotated error (or simply
return err) so the controller will retry the reconcile loop; keep any logging
but ensure the function exits with an error when
markNetworkObservabilityDeployed fails.
- Around line 144-152: When waitForNetObservOperator(ctx) times out you
currently return ctrl.Result{RequeueAfter: 0}, nil which prevents any automatic
retry; update the timeout branch in the reconciler to either return a non-zero
RequeueAfter (e.g., RequeueAfter: time.Minute*5) so the controller will re-check
later, or return a wrapped error instead of nil to trigger error-based
requeueing; adjust the block around waitForNetObservOperator, the
r.status.SetDegraded call, and the return statement so the controller will retry
(refer to waitForNetObservOperator,
r.status.SetDegraded(statusmanager.ObservabilityConfig, "OperatorNotReady",
...), and the existing return ctrl.Result{RequeueAfter: 0}, nil).
---
Nitpick comments:
In `@manifests/0000_70_cluster-network-operator_02_rbac_observability.yaml`:
- Around line 11-14: The RBAC rule for apiGroups "operators.coreos.com" covering
resources ["subscriptions", "clusterserviceversions", "operatorgroups"] is
missing the "watch" and "delete" verbs; update the verbs array for that rule
(for the resources "subscriptions", "clusterserviceversions", and
"operatorgroups") to include "watch" (so informers can track state changes) and
"delete" (to allow cleanup/uninstallation) in addition to the existing verbs.
In `@pkg/controller/observability/observability_controller_test.go`:
- Around line 1512-1518: The test creates a "manifests" directory and a file at
FlowCollectorYAML using os.MkdirAll and os.WriteFile but only defers removal of
the file, leaving the manifests directory behind; update the test to always
clean up the directory by deferring a call to remove the directory (e.g. defer
os.RemoveAll("manifests")) immediately after os.MkdirAll succeeds (and keep or
replace the existing defer os.Remove(FlowCollectorYAML)), so that both the file
created from flowCollectorManifest and the manifests directory are removed after
the test.
In `@pkg/controller/observability/observability_controller.go`:
- Around line 356-392: markNetworkObservabilityDeployed currently does a Get ->
modify -> Status().Update which can fail on a concurrent update; wrap the
Get/modify/Status().Update sequence in a retry loop using
k8s.io/apimachinery/pkg/util/wait.RetryOnConflict (or retry.RetryOnConflict) to
retry on conflicts, re-fetching the latest Network (using r.client.Get) inside
the loop before applying the condition and calling r.client.Status().Update so
transient conflicts are retried safely; keep the same condition logic and return
the final error from the retry.
- Around line 288-319: The polling uses the caller's ctx which can have a
shorter deadline; instead create a new timeout-bound context for the poll so it
always runs up to checkTimeout: inside waitForNetObservOperator, call
context.WithTimeout(context.Background(), checkTimeout) (defer cancel) and pass
that new ctx to wait.PollUntilContextTimeout (keep checkInterval, checkTimeout,
condition as before); reference function waitForNetObservOperator and variables
checkInterval/checkTimeout to locate where to replace the ctx usage.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 826708e4-9aa4-4a6f-941f-be6e5a7c5d63
⛔ Files ignored due to path filters (36)
- `go.sum` is excluded by `!**/*.sum`
- `vendor/github.com/openshift/api/config/v1/types_network.go` is excluded by `!**/vendor/**`, `!vendor/**`
- `vendor/github.com/openshift/api/config/v1/zz_generated.deepcopy.go` is excluded by `!**/vendor/**`, `!vendor/**`, `!**/zz_generated*`
- `vendor/github.com/openshift/api/config/v1/zz_generated.swagger_doc_generated.go` is excluded by `!**/vendor/**`, `!vendor/**`, `!**/zz_generated*`
- `vendor/modules.txt` is excluded by `!**/vendor/**`, `!vendor/**`
- `vendor/sigs.k8s.io/controller-runtime/.gitignore` is excluded by `!**/vendor/**`, `!vendor/**`
- `vendor/sigs.k8s.io/controller-runtime/.golangci.yml` is excluded by `!**/vendor/**`, `!vendor/**`
- `vendor/sigs.k8s.io/controller-runtime/.gomodcheck.yaml` is excluded by `!**/vendor/**`, `!vendor/**`
- `vendor/sigs.k8s.io/controller-runtime/CONTRIBUTING.md` is excluded by `!**/vendor/**`, `!vendor/**`
- `vendor/sigs.k8s.io/controller-runtime/FAQ.md` is excluded by `!**/vendor/**`, `!vendor/**`
- `vendor/sigs.k8s.io/controller-runtime/Makefile` is excluded by `!**/vendor/**`, `!vendor/**`
- `vendor/sigs.k8s.io/controller-runtime/OWNERS` is excluded by `!**/vendor/**`, `!vendor/**`
- `vendor/sigs.k8s.io/controller-runtime/OWNERS_ALIASES` is excluded by `!**/vendor/**`, `!vendor/**`
- `vendor/sigs.k8s.io/controller-runtime/README.md` is excluded by `!**/vendor/**`, `!vendor/**`
- `vendor/sigs.k8s.io/controller-runtime/RELEASE.md` is excluded by `!**/vendor/**`, `!vendor/**`
- `vendor/sigs.k8s.io/controller-runtime/SECURITY_CONTACTS` is excluded by `!**/vendor/**`, `!vendor/**`
- `vendor/sigs.k8s.io/controller-runtime/TMP-LOGGING.md` is excluded by `!**/vendor/**`, `!vendor/**`
- `vendor/sigs.k8s.io/controller-runtime/VERSIONING.md` is excluded by `!**/vendor/**`, `!vendor/**`
- `vendor/sigs.k8s.io/controller-runtime/alias.go` is excluded by `!**/vendor/**`, `!vendor/**`
- `vendor/sigs.k8s.io/controller-runtime/code-of-conduct.md` is excluded by `!**/vendor/**`, `!vendor/**`
- `vendor/sigs.k8s.io/controller-runtime/doc.go` is excluded by `!**/vendor/**`, `!vendor/**`
- `vendor/sigs.k8s.io/controller-runtime/pkg/builder/controller.go` is excluded by `!**/vendor/**`, `!vendor/**`
- `vendor/sigs.k8s.io/controller-runtime/pkg/builder/doc.go` is excluded by `!**/vendor/**`, `!vendor/**`
- `vendor/sigs.k8s.io/controller-runtime/pkg/builder/options.go` is excluded by `!**/vendor/**`, `!vendor/**`
- `vendor/sigs.k8s.io/controller-runtime/pkg/builder/webhook.go` is excluded by `!**/vendor/**`, `!vendor/**`
- `vendor/sigs.k8s.io/controller-runtime/pkg/client/config/config.go` is excluded by `!**/vendor/**`, `!vendor/**`
- `vendor/sigs.k8s.io/controller-runtime/pkg/client/config/doc.go` is excluded by `!**/vendor/**`, `!vendor/**`
- `vendor/sigs.k8s.io/controller-runtime/pkg/conversion/conversion.go` is excluded by `!**/vendor/**`, `!vendor/**`
- `vendor/sigs.k8s.io/controller-runtime/pkg/manager/signals/doc.go` is excluded by `!**/vendor/**`, `!vendor/**`
- `vendor/sigs.k8s.io/controller-runtime/pkg/manager/signals/signal.go` is excluded by `!**/vendor/**`, `!vendor/**`
- `vendor/sigs.k8s.io/controller-runtime/pkg/manager/signals/signal_posix.go` is excluded by `!**/vendor/**`, `!vendor/**`
- `vendor/sigs.k8s.io/controller-runtime/pkg/manager/signals/signal_windows.go` is excluded by `!**/vendor/**`, `!vendor/**`
- `vendor/sigs.k8s.io/controller-runtime/pkg/scheme/scheme.go` is excluded by `!**/vendor/**`, `!vendor/**`
- `vendor/sigs.k8s.io/controller-runtime/pkg/webhook/conversion/conversion.go` is excluded by `!**/vendor/**`, `!vendor/**`
- `vendor/sigs.k8s.io/controller-runtime/pkg/webhook/conversion/decoder.go` is excluded by `!**/vendor/**`, `!vendor/**`
- `vendor/sigs.k8s.io/controller-runtime/pkg/webhook/conversion/metrics/metrics.go` is excluded by `!**/vendor/**`, `!vendor/**`
📒 Files selected for processing (9)
- go.mod
- manifests/0000_70_cluster-network-operator_02_rbac_observability.yaml
- manifests/07-observability-operator.yaml
- manifests/08-flowcollector.yaml
- pkg/controller/add_networkconfig.go
- pkg/controller/observability/observability_controller.go
- pkg/controller/observability/observability_controller_test.go
- pkg/controller/statusmanager/status_manager.go
- sample-config.yaml
go.mod
Outdated
```
	sigs.k8s.io/structured-merge-diff/v6 v6.3.0 // indirect
)

replace github.com/openshift/api v0.0.0-20260116192047-6fb7fdae95fd => github.com/OlivierCazade/api v0.0.0-20260310123814-0d1d1592146d
```
🧩 Analysis chain
🏁 Script executed:

```bash
#!/bin/bash
# Description: Verify the fork repository and commit hashes are valid
echo "Checking if fork repository exists..."
gh api repos/OlivierCazade/api --jq '.full_name' 2>/dev/null || echo "Fork repository not found or not accessible"
echo -e "\nChecking if fork commit exists..."
gh api repos/OlivierCazade/api/commits/0d1d1592146d --jq '.sha' 2>/dev/null || echo "Fork commit not found"
echo -e "\nChecking if original commit exists in upstream..."
gh api repos/openshift/api/commits/6fb7fdae95fd --jq '.sha' 2>/dev/null || echo "Original commit not found"
echo -e "\nChecking status of related API PR..."
gh pr view 2752 --repo openshift/api --json state,title,url --jq '{state,title,url}' 2>/dev/null || echo "API PR #2752 not found or not accessible"
```

Repository: openshift/cluster-network-operator

Length of output: 708
🏁 Script executed:

```bash
#!/bin/bash
# Try public GitHub API access without authentication
echo "Checking fork repository (public API)..."
curl -s "https://api.github.com/repos/OlivierCazade/api" | jq -r '.full_name // .message' 2>/dev/null || echo "Unable to access fork repository"
echo -e "\nChecking upstream openshift/api repository..."
curl -s "https://api.github.com/repos/openshift/api" | jq -r '.full_name // .message' 2>/dev/null || echo "Unable to access upstream repository"
```

Repository: openshift/cluster-network-operator

Length of output: 497
🌐 Web query:
openshift/api#2752 pull request status 2026
💡 Result:
As of March 10, 2026, openshift/api#2752 (“CORENET-6714: Enable Network Observability on Day 0”) is OPEN (not merged/closed). [1]
Current notable status items:
- Not approved yet (openshift-ci bot: "This PR is NOT APPROVED"; needs `/approve` from an approver). [2]
- Awaiting requested review from @stleerh. [1]
- Automated pre-merge checks reported 4 passed / 1 failed (inconclusive), with the inconclusive failure being a “Description check” mismatch vs the actual implementation details. [2]
Sources: [1], [2]
🌐 Web query:
OlivierCazade/api github fork openshift
💡 Result:
OlivierCazade/api is a GitHub repository that is a fork of openshift/api (the canonical OpenShift API type definitions / serialization code). [1][2]
Repo (fork): https://github.com/OlivierCazade/api [1]
Upstream: https://github.com/openshift/api [2]
Do not merge until openshift/api#2752 is merged; establish blocking gate.
This temporary replace directive points to a personal fork to test unmerged API changes from openshift/api#2752. The fork repository exists and is legitimate; however, as of March 10, 2026, the upstream PR is OPEN and NOT APPROVED, meaning the replace directive could remain in production code indefinitely if this PR merges first.
Required actions:

- Add a CI check or merge-blocking comment to prevent merge while the replace directive is present
- Add a TODO comment referencing openshift/api#2752 (CORENET-6714: Enable Network Observability on Day 0) to make removal explicit
- Track the upstream PR status and remove this directive immediately when openshift/api#2752 merges
- Do not merge this PR unless openshift/api#2752 is approved and merged, or the replace directive is removed
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@go.mod` at line 166, The go.mod contains a temporary replace directive
pointing to a personal fork for openshift/api#2752; do not merge until the
upstream PR is merged. Add a CI/merge-block gate that fails or flags the PR when
the replace directive "replace github.com/openshift/api" is present, add a clear
TODO comment adjacent to the replace line referencing openshift/api#2752 and
stating it must be removed when that PR merges, and add repository-level
tracking (e.g., in the PR description or a maintainer checklist) to actively
monitor openshift/api#2752 and remove the replace directive immediately once the
upstream PR is approved and merged.
```go
// Run 5 concurrent reconciliations
done := make(chan bool, 5)
for i := 0; i < 5; i++ {
	go func() {
		_, err := r.Reconcile(context.TODO(), req)
		// All should complete without error (idempotent)
		g.Expect(err).NotTo(HaveOccurred())
		done <- true
	}()
}

// Wait for all to complete
for i := 0; i < 5; i++ {
	<-done
}
```
Potential race condition in concurrent reconciliation test.
Calling g.Expect() inside goroutines can cause test panics if an assertion fails, because Gomega's default fail handler calls t.FailNow() which cannot be called from a goroutine other than the one running the test. Additionally, errors are not properly collected across goroutines.
🔧 Suggested fix using error channels

```diff
 // Run 5 concurrent reconciliations
 done := make(chan bool, 5)
+errChan := make(chan error, 5)
 for i := 0; i < 5; i++ {
 	go func() {
 		_, err := r.Reconcile(context.TODO(), req)
-		// All should complete without error (idempotent)
-		g.Expect(err).NotTo(HaveOccurred())
+		errChan <- err
 		done <- true
 	}()
 }

 // Wait for all to complete
 for i := 0; i < 5; i++ {
 	<-done
+	err := <-errChan
+	g.Expect(err).NotTo(HaveOccurred())
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/controller/observability/observability_controller_test.go` around lines
1250 - 1265, The test spawns goroutines that call r.Reconcile and uses g.Expect
inside those goroutines, which can panic because Gomega's fail handler
(t.FailNow) must run in the main test goroutine; fix by removing g.Expect from
the goroutines and collecting errors via a channel (e.g., make errCh chan
error), have each goroutine send its err to errCh after calling
r.Reconcile(req), then in the main goroutine close/iterate the channel and
assert with g.Expect that all received errors are nil; alternatively create a
per-goroutine Gomega tied to a testing.T (NewWithT) if you need per-goroutine
assertions—reference r.Reconcile, req, and the done/err channel approach to
locate where to change code.
```go
if err := r.waitForNetObservOperator(ctx); err != nil {
	if err == context.DeadlineExceeded {
		klog.Errorf("Timed out waiting for Network Observability Operator to be ready after %v. Stopping reconciliation.", checkTimeout)
		r.status.SetDegraded(statusmanager.ObservabilityConfig, "OperatorNotReady", fmt.Sprintf("Timed out waiting for Network Observability Operator to be ready after %v", checkTimeout))
		return ctrl.Result{RequeueAfter: 0}, nil // Don't requeue
	}
	r.status.SetDegraded(statusmanager.ObservabilityConfig, "WaitOperatorError", fmt.Sprintf("Failed waiting for Network Observability Operator: %v", err))
	return ctrl.Result{}, err
}
```
Timeout handling stops reconciliation without requeue.
When waitForNetObservOperator times out (after 10 minutes), the controller returns RequeueAfter: 0 without an error, which means no automatic retry will occur. The operator installation may eventually succeed, but the controller won't notice and won't create the FlowCollector.
Consider returning a non-zero RequeueAfter to check again later, or returning an error to trigger retry via the error-based requeue mechanism.
🔧 Suggested fix to enable retry

```diff
 if err := r.waitForNetObservOperator(ctx); err != nil {
 	if err == context.DeadlineExceeded {
 		klog.Errorf("Timed out waiting for Network Observability Operator to be ready after %v. Stopping reconciliation.", checkTimeout)
 		r.status.SetDegraded(statusmanager.ObservabilityConfig, "OperatorNotReady", fmt.Sprintf("Timed out waiting for Network Observability Operator to be ready after %v", checkTimeout))
-		return ctrl.Result{RequeueAfter: 0}, nil // Don't requeue
+		return ctrl.Result{RequeueAfter: 5 * time.Minute}, nil // Retry later to check if operator eventually succeeds
 	}
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```go
if err := r.waitForNetObservOperator(ctx); err != nil {
	if err == context.DeadlineExceeded {
		klog.Errorf("Timed out waiting for Network Observability Operator to be ready after %v. Stopping reconciliation.", checkTimeout)
		r.status.SetDegraded(statusmanager.ObservabilityConfig, "OperatorNotReady", fmt.Sprintf("Timed out waiting for Network Observability Operator to be ready after %v", checkTimeout))
		return ctrl.Result{RequeueAfter: 5 * time.Minute}, nil // Retry later to check if operator eventually succeeds
	}
	r.status.SetDegraded(statusmanager.ObservabilityConfig, "WaitOperatorError", fmt.Sprintf("Failed waiting for Network Observability Operator: %v", err))
	return ctrl.Result{}, err
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/controller/observability/observability_controller.go` around lines 144 -
152, When waitForNetObservOperator(ctx) times out you currently return
ctrl.Result{RequeueAfter: 0}, nil which prevents any automatic retry; update the
timeout branch in the reconciler to either return a non-zero RequeueAfter (e.g.,
RequeueAfter: time.Minute*5) so the controller will re-check later, or return a
wrapped error instead of nil to trigger error-based requeueing; adjust the block
around waitForNetObservOperator, the r.status.SetDegraded call, and the return
statement so the controller will retry (refer to waitForNetObservOperator,
r.status.SetDegraded(statusmanager.ObservabilityConfig, "OperatorNotReady",
...), and the existing return ctrl.Result{RequeueAfter: 0}, nil).
```go
	// Mark as deployed in Network CR status
	if err := r.markNetworkObservabilityDeployed(ctx, &network); err != nil {
		klog.Warningf("Failed to update Network Observability deployment status: %v", err)
	}
```
Failure to mark deployment status could cause repeated FlowCollector creation attempts.
The error from markNetworkObservabilityDeployed is only logged, not returned. If this fails, subsequent reconciliations will not see the NetworkObservabilityDeployed condition and may attempt to re-create the FlowCollector, potentially causing conflicts.
Consider either returning this error or implementing a retry mechanism.
🔧 Option: Return error to ensure retry

```diff
 // Mark as deployed in Network CR status
 if err := r.markNetworkObservabilityDeployed(ctx, &network); err != nil {
-	klog.Warningf("Failed to update Network Observability deployment status: %v", err)
+	klog.Warningf("Failed to update Network Observability deployment status: %v", err)
+	// Continue anyway - the deployment itself succeeded, and we'll retry on next reconciliation
 }
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```go
// Mark as deployed in Network CR status
if err := r.markNetworkObservabilityDeployed(ctx, &network); err != nil {
	klog.Warningf("Failed to update Network Observability deployment status: %v", err)
	// Continue anyway - the deployment itself succeeded, and we'll retry on next reconciliation
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/controller/observability/observability_controller.go` around lines 169 -
173, The call to markNetworkObservabilityDeployed(err) is only logged on failure
which prevents the controller from retrying and can lead to repeated
FlowCollector creation; modify the reconciler to return the error instead of
just logging it so the reconciliation is requeued—locate the call to
r.markNetworkObservabilityDeployed(ctx, &network) in the reconciler function
(the surrounding code that currently does klog.Warningf on error) and replace
the log-without-return with returning a wrapped/annotated error (or simply
return err) so the controller will retry the reconcile loop; keep any logging
but ensure the function exits with an error when
markNetworkObservabilityDeployed fails.
♻️ Duplicate comments (2)

pkg/controller/observability/observability_controller_test.go (1)

1268-1283: ⚠️ Potential issue | 🟡 Minor

Race condition risk: `g.Expect` called inside goroutines.

Gomega's fail handler calls `t.FailNow()`, which must only be called from the test goroutine. If an assertion fails inside a spawned goroutine, this can cause a test panic.

Suggested fix using error channel

```diff
 // Run 5 concurrent reconciliations
 done := make(chan bool, 5)
+errChan := make(chan error, 5)
 for i := 0; i < 5; i++ {
 	go func() {
 		_, err := r.Reconcile(context.TODO(), req)
-		// All should complete without error (idempotent)
-		g.Expect(err).NotTo(HaveOccurred())
+		errChan <- err
 		done <- true
 	}()
 }
 // Wait for all to complete
 for i := 0; i < 5; i++ {
 	<-done
+	err := <-errChan
+	g.Expect(err).NotTo(HaveOccurred())
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/controller/observability/observability_controller_test.go` around lines 1268-1283: The test currently calls g.Expect inside spawned goroutines (see r.Reconcile, req, and done), which risks calling Gomega's FailNow from non-test goroutines; instead capture errors in a channel from each goroutine and perform the assertion in the main test goroutine. Spawn the 5 goroutines to call r.Reconcile and send any returned error (or nil) into an errs channel, then after reading all results from errs in the main goroutine use g.Expect(err).NotTo(HaveOccurred()) for each entry, removing g.Expect from inside the goroutines.

pkg/controller/observability/observability_controller.go (1)
144-152: ⚠️ Potential issue | 🟠 Major

Timeout stops reconciliation permanently without requeue.

When `waitForNetObservOperator` times out, returning `RequeueAfter: 0` with no error means the controller stops trying. If the operator eventually succeeds, this controller won't notice and won't create the FlowCollector.

Consider returning a non-zero `RequeueAfter` to periodically check if the operator eventually becomes ready:

Suggested fix

```diff
 if err := r.waitForNetObservOperator(ctx); err != nil {
 	if err == context.DeadlineExceeded {
 		klog.Errorf("Timed out waiting for Network Observability Operator to be ready after %v. Stopping reconciliation.", checkTimeout)
 		r.status.SetDegraded(statusmanager.ObservabilityConfig, "OperatorNotReady", fmt.Sprintf("Timed out waiting for Network Observability Operator to be ready after %v", checkTimeout))
-		return ctrl.Result{RequeueAfter: 0}, nil // Don't requeue
+		return ctrl.Result{RequeueAfter: 5 * time.Minute}, nil // Retry later
 	}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/controller/observability/observability_controller.go` around lines 144-152: When waitForNetObservOperator(ctx) returns context.DeadlineExceeded the current code returns ctrl.Result{RequeueAfter: 0} which stops reconciliation permanently; change that return to schedule a future requeue (e.g., ctrl.Result{RequeueAfter: someNonZeroDuration}) so the controller will periodically re-check for the operator and eventually create the FlowCollector. Update the branch handling DeadlineExceeded in the Reconcile logic (the block around waitForNetObservOperator and r.status.SetDegraded) to return a non-zero RequeueAfter (choose a sensible poll interval constant or reuse checkTimeout) instead of 0 and keep returning nil error.
🧹 Nitpick comments (1)

pkg/controller/observability/observability_controller_test.go (1)

1507-1572: Test creates manifest file in working directory.

This test writes to `manifests/08-flowcollector.yaml` in the working directory. While the cleanup with `defer os.Remove()` handles the file, consider:

- If the test panics before the `defer` is registered (lines 1534-1536), the file persists
- Could conflict with parallel tests using the same path

Using a temp directory and temporarily overriding the `FlowCollectorYAML` constant (if possible) would be more robust, but this is acceptable for an integration-style test.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/controller/observability/observability_controller_test.go` around lines 1507-1572: TestReconcile_FirstTimeDeploymentSetsCondition writes to a fixed manifests path (FlowCollectorYAML) which can leak or conflict; fix by creating a temporary directory with os.MkdirTemp at test start, write the FlowCollector manifest into that temp dir (use os.Create or os.WriteFile), set or inject the FlowCollectorYAML path used by the reconciler to point to the temp file (or refactor the code to accept a manifestPath parameter) before calling Reconcile, and register defer os.RemoveAll(tempDir) immediately after creating the temp directory to guarantee cleanup even on panic; update references to FlowCollectorYAML in the test to use the temp file path.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: b46e56ed-85bd-48c6-8a11-d0576b268eb9
⛔ Files ignored due to path filters (8)
- `go.sum` is excluded by `!**/*.sum`
- `vendor/github.com/openshift/api/config/v1/types_network.go` is excluded by `!vendor/**`, `!**/vendor/**`
- `vendor/github.com/openshift/api/config/v1/zz_generated.deepcopy.go` is excluded by `!vendor/**`, `!**/vendor/**`
- `vendor/github.com/openshift/api/config/v1/zz_generated.featuregated-crd-manifests.yaml` is excluded by `!vendor/**`, `!**/vendor/**`
- `vendor/github.com/openshift/api/config/v1/zz_generated.swagger_doc_generated.go` is excluded by `!vendor/**`, `!**/vendor/**`
- `vendor/github.com/openshift/api/features.md` is excluded by `!vendor/**`, `!**/vendor/**`
- `vendor/github.com/openshift/api/features/features.go` is excluded by `!vendor/**`, `!**/vendor/**`
- `vendor/modules.txt` is excluded by `!vendor/**`, `!**/vendor/**`
📒 Files selected for processing (4)
- go.mod
- pkg/controller/observability/observability_controller.go
- pkg/controller/observability/observability_controller_test.go
- sample-config.yaml
🚧 Files skipped from review as they are similar to previous changes (2)
- go.mod
- sample-config.yaml
Actionable comments posted: 1
♻️ Duplicate comments (2)
pkg/controller/observability/observability_controller.go (2)

157-162: ⚠️ Potential issue | 🟠 Major

Timeout path disables automatic recovery

Line 161 returns `ctrl.Result{RequeueAfter: 0}, nil` on timeout, so reconciliation stops and may never create the FlowCollector even if the operator becomes ready later.

Suggested fix

```diff
 if err := r.waitForNetObservOperator(ctx); err != nil {
 	if err == context.DeadlineExceeded {
 		klog.Errorf("Timed out waiting for Network Observability Operator to be ready after %v. Stopping reconciliation.", checkTimeout)
 		r.status.SetDegraded(statusmanager.ObservabilityConfig, "OperatorNotReady", fmt.Sprintf("Timed out waiting for Network Observability Operator to be ready after %v", checkTimeout))
-		return ctrl.Result{RequeueAfter: 0}, nil // Don't requeue
+		return ctrl.Result{RequeueAfter: 5 * time.Minute}, nil
 	}
```

183-185: ⚠️ Potential issue | 🟠 Major

Deployment-condition update errors are swallowed

At Line 183-Line 185, a `markNetworkObservabilityDeployed` failure is only logged. That can cause repeated deploy-path execution on later reconciles instead of retrying the status write immediately.

Suggested fix

```diff
 // Mark as deployed in Network CR status
 if err := r.markNetworkObservabilityDeployed(ctx, &network); err != nil {
-	klog.Warningf("Failed to update Network Observability deployment status: %v", err)
+	r.status.SetDegraded(statusmanager.ObservabilityConfig, "MarkDeployedError", fmt.Sprintf("Failed to update Network Observability deployment status: %v", err))
+	return ctrl.Result{}, err
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/controller/observability/observability_controller.go` around lines 183-185: The call to markNetworkObservabilityDeployed is currently logging failures and swallowing the error, which prevents the reconcile loop from retrying status writes; change the code in observability_controller.go so that if r.markNetworkObservabilityDeployed(ctx, &network) returns an error you propagate that error (e.g., return fmt.Errorf("markNetworkObservabilityDeployed: %w", err) or wrap with kerrors) instead of only calling klog.Warningf so the reconciler will requeue and retry the status update; locate the markNetworkObservabilityDeployed call in the reconcile flow and replace the swallowed-log branch with an error return (or requeue result) so failures are retried immediately.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 63c37a51-4f01-487a-9005-0b35a6728d9f
📒 Files selected for processing (1)
pkg/controller/observability/observability_controller.go
```go
	for _, csv := range csvs.Items {
		name := csv.GetName()
		// CSV names are typically like "netobserv-operator.v1.2.3"
		if strings.HasPrefix(name, "netobserv-operator") {
			phase, found, err := unstructured.NestedString(csv.Object, "status", "phase")
			if err != nil {
				return false, err
			}
			if !found {
				return false, nil
			}
			return phase == "Succeeded", nil
		}
```
CSV readiness check can pick the wrong CSV
waitForNetObservOperator returns after the first netobserv-operator* CSV (Line 326-Line 338). During upgrades there can be multiple CSVs; first item ordering is not stable, so readiness can be misdetected and timeout incorrectly.
Suggested fix

```diff
-	// Find the netobserv operator CSV
+	// Find any netobserv operator CSV in Succeeded phase
 	for _, csv := range csvs.Items {
 		name := csv.GetName()
 		// CSV names are typically like "netobserv-operator.v1.2.3"
 		if strings.HasPrefix(name, "netobserv-operator") {
 			phase, found, err := unstructured.NestedString(csv.Object, "status", "phase")
 			if err != nil {
 				return false, err
 			}
-			if !found {
-				return false, nil
-			}
-			return phase == "Succeeded", nil
+			if found && phase == "Succeeded" {
+				return true, nil
+			}
 		}
 	}
 	return false, nil
```

📝 Committable suggestion
‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```go
	for _, csv := range csvs.Items {
		name := csv.GetName()
		// CSV names are typically like "netobserv-operator.v1.2.3"
		if strings.HasPrefix(name, "netobserv-operator") {
			phase, found, err := unstructured.NestedString(csv.Object, "status", "phase")
			if err != nil {
				return false, err
			}
			if found && phase == "Succeeded" {
				return true, nil
			}
		}
	}
	return false, nil
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/controller/observability/observability_controller.go` around lines 326 -
338, The readiness check in waitForNetObservOperator currently returns after
inspecting the first matching csv in csvs.Items which can be non-deterministic;
instead collect all CSVs whose name has the "netobserv-operator" prefix,
determine the intended CSV by selecting the one with the highest semantic
version parsed from the name (e.g., parse the version suffix from csv.GetName()
like "netobserv-operator.vX.Y.Z" and compare using semver semantics), and then
call unstructured.NestedString on that selected CSV's Object to check
status.phase; update the logic around csvs.Items, csv.GetName(), and
unstructured.NestedString to implement this deterministic selection before
returning phase == "Succeeded".
…ered Check if the feature gate exists before calling Enabled() to prevent panics on clusters running older API versions that don't have the feature gate registered yet. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
🧹 Nitpick comments (1)

pkg/controller/observability/observability_controller.go (1)

133-146: Namespace creation handles not-found but not already-exists race.

The Get-then-Create pattern at lines 136-146 could encounter an `AlreadyExists` error if another process creates the namespace between the Get and Create calls. This would fail the reconciliation, though it will succeed on retry.

♻️ Optional: handle AlreadyExists gracefully

```diff
 if errors.IsNotFound(err) {
 	if err := r.client.Create(ctx, ns); err != nil {
+		if errors.IsAlreadyExists(err) {
+			klog.V(4).Infof("Namespace %s was created by another process", OperatorNamespace)
+		} else {
 			r.status.SetDegraded(statusmanager.ObservabilityConfig, "CreateNamespaceError", fmt.Sprintf("Failed to create namespace %s: %v", OperatorNamespace, err))
 			return ctrl.Result{}, err
+		}
 	}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/controller/observability/observability_controller.go` around lines 133-146: The Get-then-Create namespace flow in Reconcile (observability_controller.go) can race: after r.client.Get succeeds with NotFound, r.client.Create may return AlreadyExists if another actor created the namespace; update the Create error handling in the block that constructs ns (using OperatorNamespace) to treat errors.IsAlreadyExists(err) as non-fatal (do not call r.status.SetDegraded or return error) and only set degraded/return for other errors; keep using r.status.SetDegraded(statusmanager.ObservabilityConfig, ...) for real create failures so reconciliation continues cleanly on races.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: c15118e0-7f89-43e5-a75e-3ba878565507
📒 Files selected for processing (1)
pkg/controller/observability/observability_controller.go
Align struct field colons to match project code style standards. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
♻️ Duplicate comments (1)

pkg/controller/observability/observability_controller_test.go (1)

1268-1283: ⚠️ Potential issue | 🟡 Minor

Race condition in concurrent test remains unaddressed.

Calling `g.Expect()` inside goroutines can cause test panics because Gomega's fail handler invokes `t.FailNow()`, which must only be called from the test's main goroutine.

🔧 Suggested fix using error channel

```diff
 // Run 5 concurrent reconciliations
 done := make(chan bool, 5)
+errChan := make(chan error, 5)
 for i := 0; i < 5; i++ {
 	go func() {
 		_, err := r.Reconcile(context.TODO(), req)
-		// All should complete without error (idempotent)
-		g.Expect(err).NotTo(HaveOccurred())
+		errChan <- err
 		done <- true
 	}()
 }
 // Wait for all to complete
 for i := 0; i < 5; i++ {
 	<-done
+	err := <-errChan
+	g.Expect(err).NotTo(HaveOccurred())
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/controller/observability/observability_controller_test.go` around lines 1268-1283: The test invokes g.Expect inside goroutines which can call t.FailNow from non-test goroutines; instead collect errors from r.Reconcile in a buffered channel and perform Gomega assertions in the main test goroutine. Replace the done channel with an errChan (buffered to 5), have each goroutine call r.Reconcile(context.TODO(), req) and send the returned error into errChan, then loop 5 times in the main goroutine to receive err := <-errChan and call g.Expect(err).NotTo(HaveOccurred()). This keeps the Reconcile calls concurrent but ensures assertions (using g.Expect) run only on the main test goroutine.
🧹 Nitpick comments (2)

pkg/controller/observability/observability_controller_test.go (2)

936-938: Misleading test name and comment.

The test is named `TestReconcile_FlowCollectorDeleted` with comment "tests that reconciliation recreates FlowCollector if it gets deleted," but the actual assertion verifies that reconciliation skips reinstallation when the deployed condition is set. The test logic is correct per the PR's design (avoid reinstallation once deployed), but the name/comment should reflect this behavior.

♻️ Suggested rename for clarity

```diff
-// TestReconcile_FlowCollectorDeleted tests that reconciliation recreates
-// FlowCollector if it gets deleted
-func TestReconcile_FlowCollectorDeleted(t *testing.T) {
+// TestReconcile_SkipsReinstallWhenFlowCollectorDeleted tests that reconciliation
+// does NOT recreate FlowCollector after deletion if the deployed condition is set
+func TestReconcile_SkipsReinstallWhenFlowCollectorDeleted(t *testing.T) {
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/controller/observability/observability_controller_test.go` around lines 936-938: Rename the test function TestReconcile_FlowCollectorDeleted and update its comment to reflect the actual behavior being asserted: that reconciliation skips reinstallation when the FlowCollector has the "deployed" condition set. Specifically, change the function name (e.g., to TestReconcile_SkipsReinstallWhenFlowCollectorDeployed) and update the preceding comment to state "tests that reconciliation skips reinstallation of FlowCollector when the deployed condition is present"; also update any references to the old test name in the file (including test registration or helper calls) so they remain consistent.

1530-1536: Test creates files in working directory without full cleanup.

The test creates the `manifests/` directory and writes a file but only removes the file on cleanup, not the directory. While `manifests/` likely already exists in the repository, consider using `t.TempDir()` for full isolation, especially if tests run in parallel.

♻️ Suggested fix using temp directory

```diff
-	err := os.MkdirAll("manifests", 0755)
-	g.Expect(err).NotTo(HaveOccurred())
-
-	// Create the FlowCollector manifest at the expected path
-	err = os.WriteFile(FlowCollectorYAML, []byte(flowCollectorManifest), 0644)
-	g.Expect(err).NotTo(HaveOccurred())
-	defer os.Remove(FlowCollectorYAML)
+	// Use temp directory and override the manifest path for this test
+	tmpDir := t.TempDir()
+	manifestPath := filepath.Join(tmpDir, "flowcollector.yaml")
+	err := os.WriteFile(manifestPath, []byte(flowCollectorManifest), 0644)
+	g.Expect(err).NotTo(HaveOccurred())
+
+	// Note: This requires the controller to accept a configurable manifest path
+	// or use a test-specific approach to override FlowCollectorYAML
```

Alternatively, if modifying the controller isn't feasible, ensure the `manifests` directory cleanup:

```diff
+	defer os.RemoveAll("manifests")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/controller/observability/observability_controller_test.go` around lines 1530-1536: The test creates a real manifests/ directory and only removes the file, leaving the directory behind; update the test to use an isolated temp directory so artifacts are fully cleaned up: create a temp dir via t.TempDir() (or os.MkdirTemp if t is not available), write the FlowCollector manifest into filepath.Join(tempDir, filepath.Base(FlowCollectorYAML)) using the existing flowCollectorManifest, and set the test to use that path (or defer os.RemoveAll(tempDir)) instead of writing to the repository-level "manifests" directory; alternatively ensure the test defers os.RemoveAll("manifests") after writing—reference FlowCollectorYAML and flowCollectorManifest to locate the write site in observability_controller_test.go.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 394e355a-3990-4ada-aaac-e4a5ee489446
📒 Files selected for processing (1)
pkg/controller/observability/observability_controller_test.go
@OlivierCazade: The following tests failed:

Full PR test history. Your PR dashboard.

Details: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Add Network Observability Day-0 Installation Support
The replace instruction in the go.mod file needs to be removed before merging this branch; it can be removed once the PR on the API repo is merged.
This PR introduces automatic Day-0 installation of the Network Observability operator, controlled by a new `installNetworkObservability` field in the Network CR.

Summary
The Cluster Network Operator (CNO) now handles initial deployment of the Network Observability operator and FlowCollector, enabling network flow monitoring out-of-the-box for OpenShift clusters while respecting user preferences and cluster topology.
Key Features
Opt-out Model with SNO Detection
- Controlled via the `spec.installNetworkObservability` field

Deployment Tracking
- Once deployed, the `NetworkObservabilityDeployed` condition is set

Status Reporting
- New `ObservabilityConfig` status level for degraded state reporting

Configuration
The `spec.installNetworkObservability` field accepts three values:

- `""` (empty/nil): Default behavior - install on multi-node clusters, skip on SNO
- `"Enable"`: Explicitly enable installation, even on SNO clusters
- `"Disable"`: Explicitly disable installation
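As a sketch, the field would be set on the cluster-scoped Network config like this. The surrounding CR shape assumes the standard `config.openshift.io/v1` Network resource named `cluster`; only `installNetworkObservability` is the field added by this PR:

```yaml
apiVersion: config.openshift.io/v1
kind: Network
metadata:
  name: cluster
spec:
  # New field from this PR. Omit it (or leave it empty) for the default
  # behavior: install on multi-node clusters, skip on SNO.
  installNetworkObservability: "Disable"
```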
Added permissions for the CNO to manage:
Related
/cc @stleerh