NVIDIA-596: Enable dpu healthcheck #2941

Open
tsorya wants to merge 2 commits into openshift:master from tsorya:jkary-dpu-health-check

tsorya commented Mar 19, 2026

NVIDIA-596: pass DPU lease config via env vars on dpu-host/dpu DaemonSets

Add configurable DPU node lease renew interval and duration as env vars on ovnkube-controller, gated to dpu-host/dpu modes. Script-lib builds CLI flags from env vars. Values read from hardware-offload-config ConfigMap with defaults 10s/40s. Setting either to 0 disables the health check. Lease namespace derived via fieldRef.

Jira: https://issues.redhat.com/browse/NVIDIA-596
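
For readers following along, the ConfigMap handling described above could look roughly like the sketch below. It is a minimal illustration, not the PR's code: `parseDpuLeaseConfig`, `dpuLeaseConfig`, and the rule that an enabled lease duration must exceed the renew interval are assumptions. Only the key names, the 10s/40s defaults, and the "0 disables" behaviour are taken from this PR thread.

```go
package main

import (
	"fmt"
	"strconv"
)

// Key names as listed for the hardware-offload-config ConfigMap in this PR.
const (
	renewIntervalKey = "dpu-node-lease-renew-interval"
	durationKey      = "dpu-node-lease-duration"
)

// dpuLeaseConfig (hypothetical type) holds both lease settings in seconds.
type dpuLeaseConfig struct {
	RenewInterval int
	Duration      int
}

// Enabled reports whether the health check is on; 0 in either field disables it.
func (c dpuLeaseConfig) Enabled() bool {
	return c.RenewInterval != 0 && c.Duration != 0
}

// parseSeconds returns def when key is absent, otherwise a non-negative integer.
func parseSeconds(data map[string]string, key string, def int) (int, error) {
	raw, ok := data[key]
	if !ok {
		return def, nil
	}
	v, err := strconv.Atoi(raw)
	if err != nil {
		return 0, fmt.Errorf("%s: %q is not an integer: %w", key, raw, err)
	}
	if v < 0 {
		return 0, fmt.Errorf("%s: %d must not be negative", key, v)
	}
	return v, nil
}

// parseDpuLeaseConfig (hypothetical helper) applies the 10s/40s defaults.
func parseDpuLeaseConfig(data map[string]string) (dpuLeaseConfig, error) {
	renew, err := parseSeconds(data, renewIntervalKey, 10)
	if err != nil {
		return dpuLeaseConfig{}, err
	}
	duration, err := parseSeconds(data, durationKey, 40)
	if err != nil {
		return dpuLeaseConfig{}, err
	}
	cfg := dpuLeaseConfig{RenewInterval: renew, Duration: duration}
	// Assumed cross-field rule: an enabled lease must outlive its renew interval.
	if cfg.Enabled() && cfg.Duration <= cfg.RenewInterval {
		return dpuLeaseConfig{}, fmt.Errorf("%s (%d) must be greater than %s (%d)",
			durationKey, duration, renewIntervalKey, renew)
	}
	return cfg, nil
}

func main() {
	cfg, err := parseDpuLeaseConfig(map[string]string{renewIntervalKey: "5"})
	fmt.Println(cfg, cfg.Enabled(), err)
}
```

Keeping the defaults inside the parser means an empty or missing ConfigMap still yields a working configuration, while an explicit "0" remains a deliberate opt-out.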

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 19, 2026

openshift-ci-robot commented Mar 19, 2026

@tsorya: This pull request references NVIDIA-596 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the sub-task to target the "4.22.0" version, but no target version was set.

coderabbitai bot commented Mar 19, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: bbfcc2d0-594b-48cd-a9d0-bba809bd1983

📥 Commits

Reviewing files that changed from the base of the PR and between 62c31b1 and b5a3d66.

📒 Files selected for processing (9)
  • bindata/network/ovn-kubernetes/common/008-script-lib.yaml
  • bindata/network/ovn-kubernetes/managed/ovnkube-node.yaml
  • bindata/network/ovn-kubernetes/self-hosted/ovnkube-node.yaml
  • hack/hardware-offload-config.yaml
  • pkg/bootstrap/types.go
  • pkg/network/kube_proxy_test.go
  • pkg/network/ovn_kubernetes.go
  • pkg/network/ovn_kubernetes_dpu_host_test.go
  • pkg/network/ovn_kubernetes_test.go
✅ Files skipped from review due to trivial changes (5)
  • hack/hardware-offload-config.yaml
  • pkg/network/kube_proxy_test.go
  • bindata/network/ovn-kubernetes/managed/ovnkube-node.yaml
  • bindata/network/ovn-kubernetes/common/008-script-lib.yaml
  • pkg/network/ovn_kubernetes.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • pkg/bootstrap/types.go
  • pkg/network/ovn_kubernetes_dpu_host_test.go

Walkthrough

Adds DPU node lease configuration: new fields for renew interval and duration, reads/validates them from ConfigMap, exposes values to renderer and manifests, injects env vars and ovnkube CLI flags for DPU node modes, and adds tests and defaults.
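
The mode gating in the walkthrough can be pictured with Go's `text/template`. The sketch below is illustrative only: the template keys (`.OVN_NODE_MODE`, `.DpuNodeLeaseRenewInterval`, `.DpuNodeLeaseDuration`), the `renderEnv` helper, and the non-DPU mode name `full` are assumptions, and the real manifests may express the condition differently. The env var names and the "dpu-host or dpu, renew interval non-zero" condition come from this PR thread.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// envSnippet imitates the env section of a container spec.
// Template keys are hypothetical stand-ins for the render data.
const envSnippet = `env:
- name: OVN_NODE_MODE
  value: "{{.OVN_NODE_MODE}}"
{{- if and (or (eq .OVN_NODE_MODE "dpu-host") (eq .OVN_NODE_MODE "dpu")) (ne .DpuNodeLeaseRenewInterval "0") }}
- name: OVNKUBE_NODE_LEASE_RENEW_INTERVAL
  value: "{{.DpuNodeLeaseRenewInterval}}"
- name: OVNKUBE_NODE_LEASE_DURATION
  value: "{{.DpuNodeLeaseDuration}}"
{{- end }}
`

var envTmpl = template.Must(template.New("env").Parse(envSnippet))

// renderEnv (hypothetical helper) renders the snippet with stringified values.
func renderEnv(mode, renew, duration string) (string, error) {
	var b strings.Builder
	err := envTmpl.Execute(&b, map[string]string{
		"OVN_NODE_MODE":             mode,
		"DpuNodeLeaseRenewInterval": renew,
		"DpuNodeLeaseDuration":      duration,
	})
	return b.String(), err
}

func main() {
	// "full" is an assumed non-DPU mode; the lease env vars are omitted for it.
	for _, mode := range []string{"full", "dpu-host"} {
		out, err := renderEnv(mode, "10", "40")
		if err != nil {
			panic(err)
		}
		fmt.Printf("--- %s ---\n%s", mode, out)
	}
}
```

Passing the values as strings keeps the zero check a plain string comparison inside the template.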

Changes

| Cohort / File(s) | Summary |
|---|---|
| Core config & rendering: pkg/network/ovn_kubernetes.go, pkg/bootstrap/types.go | Add DpuNodeLeaseRenewInterval and DpuNodeLeaseDuration fields, read/parse values from hardware-offload-config ConfigMap with defaults and cross-field validation, and pass stringified values into template render data. |
| Default ConfigMap: hack/hardware-offload-config.yaml | Add dpu-node-lease-renew-interval: "10" and dpu-node-lease-duration: "40" to the hardware-offload ConfigMap data. |
| Kubernetes manifests (templates): bindata/network/ovn-kubernetes/managed/ovnkube-node.yaml, bindata/network/ovn-kubernetes/self-hosted/ovnkube-node.yaml | Conditionally inject OVNKUBE_NODE_LEASE_RENEW_INTERVAL and OVNKUBE_NODE_LEASE_DURATION env vars into the ovnkube-controller container when OVN_NODE_MODE is dpu-host or dpu and renewal interval is non-zero. |
| Shell script (ovnkube CLI flags): bindata/network/ovn-kubernetes/common/008-script-lib.yaml | Initialize dpu_lease_flags and append --dpu-node-lease-renew-interval / --dpu-node-lease-duration to ovnkube arguments for dpu-host/dpu modes when corresponding env vars are set. Also adjust block/control flow to scope init_ovnkube_controller separately. |
| Tests & fixtures: pkg/network/kube_proxy_test.go, pkg/network/ovn_kubernetes_test.go, pkg/network/ovn_kubernetes_dpu_host_test.go | Update test fixtures to include new lease fields with defaults; add tests (TestOVNKubernetesLeaseEnvVars, TestDpuLeaseConfig) and helpers to verify presence/absence and values of rendered env vars across node modes. |
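
The script-lib row above describes shell logic; the same decision can be sketched in Go for clarity. This is not the script itself: `dpuLeaseFlags` is a hypothetical function, and joining flag and value with `=` is an assumption about how the arguments are formatted. The env var names, flag names, and mode gating are as summarized in the table.

```go
package main

import (
	"fmt"
	"os"
)

// dpuLeaseFlags (hypothetical) returns the extra ovnkube arguments for DPU modes.
// getenv is injected so the logic can be exercised without touching the process env.
func dpuLeaseFlags(mode string, getenv func(string) string) []string {
	// Flags are only built for dpu-host and dpu modes.
	if mode != "dpu-host" && mode != "dpu" {
		return nil
	}
	var flags []string
	// Each flag is appended only when its env var is set and non-empty.
	if v := getenv("OVNKUBE_NODE_LEASE_RENEW_INTERVAL"); v != "" {
		flags = append(flags, "--dpu-node-lease-renew-interval="+v)
	}
	if v := getenv("OVNKUBE_NODE_LEASE_DURATION"); v != "" {
		flags = append(flags, "--dpu-node-lease-duration="+v)
	}
	return flags
}

func main() {
	// Reads the real environment; prints an empty list outside DPU modes.
	fmt.Println(dpuLeaseFlags(os.Getenv("OVN_NODE_MODE"), os.Getenv))
}
```

Because the manifests omit the env vars when the renew interval is zero, an unset variable here simply means no flag is passed and ovnkube falls back to its own behaviour.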

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.11.3)

level=error msg="Running error: context loading failed: failed to load packages: failed to load packages: failed to load with go/packages: err: exit status 1: stderr: go: inconsistent vendoring in :\n\tgithub.com/Masterminds/semver@v1.5.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/Masterminds/sprig/v3@v3.2.3: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/containernetworking/cni@v0.8.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/ghodss/yaml@v1.0.1-0.20190212211648-25d852aebe32: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/go-bindata/go-bindata@v3.1.2+incompatible: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/onsi/gomega@v1.38.1: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/ope

... [truncated 17231 characters] ...

ired in go.mod, but not marked as explicit in vendor/modules.txt\n\tk8s.io/gengo/v2@v2.0.0-20250922181213-ec3ebc5fd46b: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tk8s.io/kms@v0.34.1: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tk8s.io/kube-aggregator@v0.34.1: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tsigs.k8s.io/randfill@v1.0.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tsigs.k8s.io/structured-merge-diff/v6@v6.3.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\n\tTo ignore the vendor directory, use -mod=readonly or -mod=mod.\n\tTo sync the vendor directory, run:\n\t\tgo mod vendor\n"

@openshift-ci openshift-ci bot requested review from jcaamano and pperiyasamy March 19, 2026 04:15

openshift-ci bot commented Mar 19, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: tsorya
Once this PR has been reviewed and has the lgtm label, please assign pliurh for approval. For more information see the Code Review Process.

@tsorya tsorya marked this pull request as draft March 19, 2026 12:04
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 19, 2026
…Sets

Add configurable DPU node lease renew interval and duration as env
vars on ovnkube-controller, gated to dpu-host/dpu modes. Script-lib
builds CLI flags from env vars. Values read from hardware-offload-config
ConfigMap with defaults 10s/40s. Setting either to 0 disables the
health check. Lease namespace derived via fieldRef.

Jira: https://issues.redhat.com/browse/NVIDIA-596
@tsorya tsorya force-pushed the jkary-dpu-health-check branch from 62c31b1 to b5a3d66 Compare March 20, 2026 03:45
@tsorya tsorya marked this pull request as ready for review March 20, 2026 03:46
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 20, 2026
@openshift-ci openshift-ci bot requested review from danwinship and pliurh March 20, 2026 03:46

openshift-ci bot commented Mar 20, 2026

@tsorya: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade | 1eb0381 | link | false | /test 4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade |
| ci/prow/security | 1eb0381 | link | false | /test security |
| ci/prow/e2e-aws-ovn-serial-2of2 | 1eb0381 | link | true | /test e2e-aws-ovn-serial-2of2 |
| ci/prow/e2e-metal-ipi-ovn-dualstack-bgp-local-gw | 1eb0381 | link | true | /test e2e-metal-ipi-ovn-dualstack-bgp-local-gw |
| ci/prow/e2e-metal-ipi-ovn-dualstack-bgp | 1eb0381 | link | true | /test e2e-metal-ipi-ovn-dualstack-bgp |
| ci/prow/e2e-aws-ovn-windows | 1eb0381 | link | true | /test e2e-aws-ovn-windows |
| ci/prow/hypershift-e2e-aks | 1eb0381 | link | true | /test hypershift-e2e-aks |
| ci/prow/4.22-upgrade-from-stable-4.21-e2e-gcp-ovn-upgrade | 1eb0381 | link | false | /test 4.22-upgrade-from-stable-4.21-e2e-gcp-ovn-upgrade |
| ci/prow/e2e-metal-ipi-ovn-ipv6 | 1eb0381 | link | true | /test e2e-metal-ipi-ovn-ipv6 |
| ci/prow/e2e-aws-ovn-rhcos10-techpreview | 1eb0381 | link | false | /test e2e-aws-ovn-rhcos10-techpreview |
| ci/prow/e2e-aws-ovn-serial-1of2 | 1eb0381 | link | true | /test e2e-aws-ovn-serial-1of2 |
| ci/prow/e2e-aws-ovn-hypershift-conformance | 1eb0381 | link | true | /test e2e-aws-ovn-hypershift-conformance |
| ci/prow/e2e-azure-ovn-upgrade | 1eb0381 | link | true | /test e2e-azure-ovn-upgrade |
| ci/prow/e2e-metal-ipi-ovn-ipv6-ipsec | 1eb0381 | link | true | /test e2e-metal-ipi-ovn-ipv6-ipsec |
| ci/prow/4.22-upgrade-from-stable-4.21-e2e-azure-ovn-upgrade | 1eb0381 | link | false | /test 4.22-upgrade-from-stable-4.21-e2e-azure-ovn-upgrade |
| ci/prow/e2e-ovn-ipsec-step-registry | 1eb0381 | link | true | /test e2e-ovn-ipsec-step-registry |
| ci/prow/e2e-gcp-ovn | 1eb0381 | link | true | /test e2e-gcp-ovn |

tsorya commented Mar 20, 2026

/retest-required