CNTRLPLANE-2685: integrate HCPEtcdBackup lifecycle into OADP backup flow by jparrill · Pull Request #238 · openshift/hypershift-oadp-plugin

jparrill · 2026-04-07T20:54:28Z

Summary

Adds etcdSnapshot as a new etcd backup method for HyperShift HostedClusters. This PR covers only the backup flow; the restore flow (with pre-signed URL generation) will follow in a separate PR.

- Introduces the etcdBackupMethod configuration key (volumeSnapshot is the default; etcdSnapshot selects native etcd snapshots)
- During backup, the plugin creates an HCPEtcdBackup CR that triggers a native etcdctl snapshot save and uploads the snapshot to the Velero BSL object store
- Bridges Velero BSL credentials to the HCPEtcdBackup controller by copying and remapping the credential Secret
- Excludes etcd pods from the backup when using etcdSnapshot to prevent CSI VolumeSnapshots
- Stores the snapshot under Velero's backup directory structure ({prefix}/backups/{backup-name}/etcd-backup/)
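
As a rough illustration of the storage layout named in the last bullet, the key prefix could be assembled as below. This is a minimal sketch, not the plugin's code: the function name `etcdBackupKeyPrefix` and its signature are hypothetical, and only the `{prefix}/backups/{backup-name}/etcd-backup/` layout comes from the description above.

```go
package main

import (
	"fmt"
	"path"
)

// etcdBackupKeyPrefix builds the object-store key prefix for an etcd snapshot,
// following the layout {prefix}/backups/{backup-name}/etcd-backup/.
// The name and signature are hypothetical; they are not taken from the plugin.
func etcdBackupKeyPrefix(bslPrefix, backupName string) string {
	// path.Join drops empty elements, so a BSL without a prefix yields
	// "backups/<name>/etcd-backup/" with no leading slash.
	return path.Join(bslPrefix, "backups", backupName, "etcd-backup") + "/"
}

func main() {
	fmt.Println(etcdBackupKeyPrefix("velero", "hc-backup-1"))
	fmt.Println(etcdBackupKeyPrefix("", "hc-backup-1"))
}
```

Using `path.Join` rather than string concatenation also normalizes a BSL prefix that already ends in a slash.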

Note: The restore flow (reading lastSuccessfulEtcdBackupURL from HC status and generating a pre-signed HTTPS URL for restoreSnapshotURL) will be implemented in a follow-up PR.

Dependencies

This PR depends 100% on the following HyperShift PRs being merged:

openshift/hypershift#8139 — HCPEtcdBackup controller (CNTRLPLANE-2678)
openshift/hypershift#8179 — LastSuccessfulEtcdBackupURL field in HostedClusterStatus (CNTRLPLANE-3173)

Once both are merged, a vendor update is required to remove local constants (BackupInProgressReason, BackupRejectedReason) in favor of API-defined ones.

Changes

| Commit | Description |
|---|---|
| `fix(deps)` | Update HyperShift API vendor to latest main (brings in `HCPEtcdBackup` types) |
| `feat(backup)` | Integrate HCPEtcdBackup lifecycle: BSL mapping, credential copy, CR creation, polling, pod exclusion |
| `docs` | Full implementation reference with architecture, flows, configuration, and troubleshooting |

Test plan

- Unit tests for orchestrator: BSL-to-storage mapping (AWS, Azure, unsupported), credential Secret copy and key remapping, encryption field injection, CRD existence check
- Unit tests for validation: etcdBackupMethod and hoNamespace recognized as valid config keys
- go build ./... passes
- go test ./... passes
- E2E: deploy plugin with etcdBackupMethod=etcdSnapshot, run OADP backup, verify HCPEtcdBackup CR created and snapshot uploaded
- E2E: verify volumeSnapshot method (default) behavior is unchanged
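
The BSL-to-storage mapping exercised by the first unit-test item (AWS, Azure, unsupported) might look roughly like the sketch below. It is an illustration under assumptions: `storageType`, its two values, and `mapBSLProvider` are hypothetical names, not the orchestrator's API. The only outside fact relied on is that Velero accepts both the short and the `velero.io/`-prefixed provider names.

```go
package main

import (
	"fmt"
	"strings"
)

// storageType is a hypothetical stand-in for the storage kind set on the
// HCPEtcdBackup spec; the real API type and values may differ.
type storageType string

const (
	storageS3        storageType = "S3"        // assumed value
	storageAzureBlob storageType = "AzureBlob" // assumed value
)

// mapBSLProvider translates a Velero BSL provider name into a storage type
// and rejects providers that the etcdSnapshot method cannot use.
func mapBSLProvider(provider string) (storageType, error) {
	// Velero accepts both "aws" and the namespaced "velero.io/aws" form.
	switch strings.TrimPrefix(provider, "velero.io/") {
	case "aws":
		return storageS3, nil
	case "azure":
		return storageAzureBlob, nil
	default:
		return "", fmt.Errorf("unsupported BSL provider %q for etcdSnapshot", provider)
	}
}

func main() {
	for _, p := range []string{"aws", "velero.io/azure", "gcp"} {
		st, err := mapBSLProvider(p)
		fmt.Printf("provider=%s storage=%q err=%v\n", p, st, err)
	}
}
```

Returning an explicit error for unknown providers keeps the unsupported case testable instead of silently falling back to volumeSnapshot.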

🤖 Generated with Claude Code

Update github.com/openshift/hypershift/api to v0.0.0-20260406110001-bcf6adaf131f.

This brings in the HCPEtcdBackup CRD types, HCPEtcdBackupConfig in ManagedEtcdSpec.Backup, and related condition/reason constants needed for CNTRLPLANE-2685 integration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Juan Manuel Parrilla Madrid <jparrill@redhat.com>

openshift-ci · 2026-04-07T20:54:45Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jparrill

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [jparrill]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai · 2026-04-07T20:54:49Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Walkthrough

Adds HCPEtcdBackup-based etcd snapshot orchestration: new orchestrator creates/monitors HCPEtcdBackup CRs, maps Velero BSL to storage, copies/remaps credentials, injects snapshot URLs for restores, removes legacy config keys, updates types/scheme, and includes docs and tests.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Documentation `docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.md` | New implementation and troubleshooting documentation for HCPEtcdBackup integration, backup/restore flow, BSL mapping, credential lifecycle, config keys, and testing. |
| Module deps `go.mod` | Updated pinned commits for `github.com/openshift/hypershift/api` and indirect `github.com/openshift/api`. |
| Common: constants & scheme `pkg/common/types.go`, `pkg/common/scheme.go` | Added exported constants for etcd snapshot integration, HONamespace and config keys; registered `apiextensions.k8s.io/v1` types into the scheme. |
| Common: utilities & tests `pkg/common/utils.go`, `pkg/common/utils_test.go` | Added `GetHostedCluster` helper to discover a HostedCluster by HCP namespace and unit tests covering positive and empty-client cases. |
| Backup plugin core `pkg/core/backup.go` | Extended BackupPlugin to support `etcdSnapshot` mode: read config keys, create/monitor HCPEtcdBackup CRs, wait for completion, annotate HostedCluster with snapshot URL, and exclude etcd pods/PVCs from volume-based backups. |
| Core types & validation `pkg/core/types/types.go`, `pkg/core/validation/backup.go`, `pkg/core/validation/backup_test.go`, `pkg/core/validation/restore.go` | Added `hcpetcdbackups` to common resources; removed `ReadoptNodes`/`ManagedServices` fields from BackupOptions/RestoreOptions; stopped parsing those keys in validators; adjusted tests and removed unused AWS restore gating. |
| Etcd backup orchestrator & tests `pkg/etcdbackup/orchestrator.go`, `pkg/etcdbackup/orchestrator_test.go` | New exported Orchestrator: maps Velero BSL to HCPEtcdBackup storage (S3/Azure), computes key prefixes, copies/remaps BSL credentials into HostedCluster namespace, sets encryption/credential refs, creates/verifies/waits on HCPEtcdBackup, and cleans up credentials; comprehensive unit tests added. |
| Platform AWS cleanup `pkg/platform/aws/aws.go` | Removed exported `RestoreTasks` function and its now-unused imports. |
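
The credential copy and key remapping attributed to the orchestrator in the table could be sketched as follows. Everything here is illustrative: `remapCredentialData` is a hypothetical helper, and the key names `cloud` (source) and `credentials` (destination) are assumptions made for the example, not keys confirmed by this PR.

```go
package main

import "fmt"

// remapCredentialData returns new Secret data in which the value stored under
// srcKey is exposed under dstKey. The helper and both key names used in main
// are assumptions for illustration only.
func remapCredentialData(data map[string][]byte, srcKey, dstKey string) (map[string][]byte, error) {
	val, ok := data[srcKey]
	if !ok {
		return nil, fmt.Errorf("credential key %q not found in source Secret", srcKey)
	}
	// Copy the bytes so the new Secret does not alias the source Secret's data.
	return map[string][]byte{dstKey: append([]byte(nil), val...)}, nil
}

func main() {
	src := map[string][]byte{"cloud": []byte("[default]\naws_access_key_id=EXAMPLE\n")}
	dst, err := remapCredentialData(src, "cloud", "credentials")
	if err != nil {
		panic(err)
	}
	fmt.Printf("copied %d bytes under key %q\n", len(dst["credentials"]), "credentials")
}
```

Failing loudly on a missing source key surfaces a misconfigured BSL credential before the HCPEtcdBackup CR is created.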

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

openshift-ci-robot · 2026-04-07T20:55:16Z

@jparrill: This pull request references CNTRLPLANE-2685 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Summary

Adds etcdSnapshot as a new etcd backup method alongside the existing volumeSnapshot (default), controlled via the etcdBackupMethod key in the plugin ConfigMap

During backup, the plugin creates an HCPEtcdBackup CR that triggers a native etcdctl snapshot save and uploads the snapshot to the Velero BSL object store

During restore, the plugin reads status.lastSuccessfulEtcdBackupURL from the HostedCluster and injects it into spec.etcd.managed.storage.restoreSnapshotURL so the HyperShift Operator restores etcd from the snapshot

Bridges Velero BSL credentials to the HCPEtcdBackup controller by copying and remapping the credential Secret

Excludes etcd pods and PVCs from the backup when using etcdSnapshot to prevent CSI VolumeSnapshots

Dependencies

This PR depends 100% on the following HyperShift PRs being merged:

openshift/hypershift#8139 — HCPEtcdBackup controller (CNTRLPLANE-2678)

openshift/hypershift#8179 — LastSuccessfulEtcdBackupURL field in HostedClusterStatus (CNTRLPLANE-3173)

Once both are merged, a vendor update is required to:

Replace the unstructured getLastSuccessfulEtcdBackupURL() helper with direct field access

Remove local constants (BackupInProgressReason, BackupRejectedReason) in favor of API-defined ones
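
The restore step described in this quoted summary (read `status.lastSuccessfulEtcdBackupURL`, inject it into `spec.etcd.managed.storage.restoreSnapshotURL`) can be illustrated with plain nested maps standing in for an unstructured object. This is a sketch under assumptions: the helper names are hypothetical, the list shape of `restoreSnapshotURL` is assumed, and it does not reproduce the `getLastSuccessfulEtcdBackupURL()` helper mentioned above.

```go
package main

import "fmt"

// nestedString walks obj along fields and returns the non-empty string found
// at the end of the path, if any.
func nestedString(obj map[string]any, fields ...string) (string, bool) {
	var cur any = obj
	for _, f := range fields {
		m, ok := cur.(map[string]any)
		if !ok {
			return "", false
		}
		if cur, ok = m[f]; !ok {
			return "", false
		}
	}
	s, ok := cur.(string)
	return s, ok && s != ""
}

// setNested stores value at the given path, creating intermediate maps.
func setNested(obj map[string]any, value any, fields ...string) {
	cur := obj
	for _, f := range fields[:len(fields)-1] {
		next, ok := cur[f].(map[string]any)
		if !ok {
			next = map[string]any{}
			cur[f] = next
		}
		cur = next
	}
	cur[fields[len(fields)-1]] = value
}

// injectRestoreSnapshotURL copies the last successful backup URL from status
// into the restore field. It reports false when the status has no URL.
// Assumption: restoreSnapshotURL is modeled here as a one-element list.
func injectRestoreSnapshotURL(hc map[string]any) bool {
	url, ok := nestedString(hc, "status", "lastSuccessfulEtcdBackupURL")
	if !ok {
		return false
	}
	setNested(hc, []any{url}, "spec", "etcd", "managed", "storage", "restoreSnapshotURL")
	return true
}

func main() {
	hc := map[string]any{
		"status": map[string]any{"lastSuccessfulEtcdBackupURL": "s3://example-bucket/backups/b1/etcd-backup/snap.db"},
	}
	fmt.Println(injectRestoreSnapshotURL(hc), hc["spec"])
}
```
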

Related

Enhancement: openshift/enhancements#1945

Jira: CNTRLPLANE-2685

Changes

Commit Description

fix(deps) Update HyperShift API vendor to latest main (brings in HCPEtcdBackup types)

feat(backup) Integrate HCPEtcdBackup lifecycle: BSL mapping, credential copy, CR creation, polling, pod/PVC exclusion

feat(restore) Read lastSuccessfulEtcdBackupURL from HC status and inject into restoreSnapshotURL

docs Full implementation reference with architecture, flows, configuration, and troubleshooting

Test plan

Unit tests for orchestrator: BSL-to-storage mapping (AWS, Azure, unsupported), credential Secret copy and key remapping, encryption field injection, CRD existence check

Unit tests for restore: getLastSuccessfulEtcdBackupURL with various status shapes

Unit tests for validation: etcdBackupMethod and hoNamespace recognized as valid config keys

go build ./... passes

go test ./... passes

E2E: deploy plugin with etcdBackupMethod=etcdSnapshot, run OADP backup, verify HCPEtcdBackup CR created and snapshot uploaded

E2E: restore HostedCluster from backup, verify restoreSnapshotURL is injected

E2E: verify volumeSnapshot method (default) behavior is unchanged

🤖 Generated with Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (5)

pkg/core/validation/backup_test.go (1)

24-27: Cover the new config keys in this table

ValidatePluginConfig now special-cases etcdBackupMethod and hoNamespace, but this suite only exercises migration. Adding a happy-path entry for those keys would keep them from regressing back to the unknown-config path.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@pkg/core/validation/backup_test.go` around lines 24 - 27, Add table-driven
test entries to the existing backup_test.go table that exercise the new
special-cased keys so they don't regress: add a happy-path case with config key
"etcdBackupMethod" set to a valid value and another with "hoNamespace" set to a
valid value, and assert ValidatePluginConfig accepts them (same expectations as
the existing "migration" case). Locate the test that calls ValidatePluginConfig
in this file (the table-driven cases around the "migration" entry) and append
two entries mirroring that structure to cover those keys.
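
A table-driven check along the lines this comment asks for might look like the sketch below. `validateConfig` is a hypothetical stand-in for ValidatePluginConfig, and its behavior (rejecting unknown keys and invalid method values with an error) is an assumption made so the example runs on its own; the real validator may differ.

```go
package main

import "fmt"

// validateConfig is a hypothetical stand-in for ValidatePluginConfig.
// Assumption: unknown keys and invalid etcdBackupMethod values are errors.
func validateConfig(cfg map[string]string) error {
	for key, value := range cfg {
		switch key {
		case "migration", "hoNamespace":
			// Accepted as-is in this sketch.
		case "etcdBackupMethod":
			if value != "volumeSnapshot" && value != "etcdSnapshot" {
				return fmt.Errorf("invalid etcdBackupMethod %q", value)
			}
		default:
			return fmt.Errorf("unknown config key %q", key)
		}
	}
	return nil
}

type configCase struct {
	name    string
	cfg     map[string]string
	wantErr bool
}

// configCases lists the existing "migration" case plus the new keys.
func configCases() []configCase {
	return []configCase{
		{name: "migration", cfg: map[string]string{"migration": "true"}},
		{name: "etcdBackupMethod etcdSnapshot", cfg: map[string]string{"etcdBackupMethod": "etcdSnapshot"}},
		{name: "etcdBackupMethod volumeSnapshot", cfg: map[string]string{"etcdBackupMethod": "volumeSnapshot"}},
		{name: "hoNamespace", cfg: map[string]string{"hoNamespace": "hypershift"}},
		{name: "invalid method", cfg: map[string]string{"etcdBackupMethod": "tarball"}, wantErr: true},
		{name: "unknown key", cfg: map[string]string{"bogus": "x"}, wantErr: true},
	}
}

// runConfigCases returns one message per case whose outcome was unexpected.
func runConfigCases() []string {
	var failures []string
	for _, tc := range configCases() {
		err := validateConfig(tc.cfg)
		if (err != nil) != tc.wantErr {
			failures = append(failures, fmt.Sprintf("%s: err=%v, wantErr=%v", tc.name, err, tc.wantErr))
		}
	}
	return failures
}

func main() {
	if failures := runConfigCases(); len(failures) > 0 {
		for _, f := range failures {
			fmt.Println("FAIL:", f)
		}
		return
	}
	fmt.Println("all config cases passed")
}
```

In the repository the same table would normally live in a `_test.go` file and use `t.Run` per case; the standalone form here is only to keep the sketch runnable.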

docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.md (2)

39-67: Consider adding language specifier to the ASCII diagram code block.

Static analysis flagged this as missing a language specifier. Using text or plaintext would satisfy the linter while preserving the diagram formatting.

📝 Suggested fix

-```
+```text
                     OADP Plugin (BackupPlugin)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.md` around lines
39 - 67, The ASCII diagram code block is missing a language specifier; update
the opening triple-backtick that begins the diagram (the block containing the
OADP Plugin / HCPEtcdBackup ASCII art) to include a plain text specifier such as
```text or ```plaintext so the linter stops flagging it and the diagram
formatting is preserved.

218-225: Add language specifier to storage path code blocks.

The storage layout examples at lines 218-225 are flagged by markdownlint for missing language specifiers. Using text would resolve this.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.md` around lines
218 - 225, Update the two fenced code blocks showing the S3 storage layout /
example so they include a language specifier (use "text") after the opening
backticks; locate the blocks containing the lines that start with
"s3://<bucket>/<bsl-prefix>/backups/<backup-name>/etcd-backup/<timestamp>.db"
and the subsequent "Example:" block in HCPEtcdBackup-implementation.md and
change the opening "```" to "```text" for both blocks to satisfy markdownlint.

pkg/core/backup.go (1)

275-293: Consider caching completion state to avoid redundant polling.

waitForEtcdBackupCompletion may be called from both the HCP and HC cases. The second call re-polls the HCPEtcdBackup CR even though it has already succeeded. This works correctly (the poll returns immediately), but a boolean completion flag (etcdBackupDone in the diff below) would eliminate the redundant API call.

♻️ Optional: Add completion state tracking

 type BackupPlugin struct {
     // ...
     etcdOrchestrator  *etcdbackup.Orchestrator
+    etcdBackupDone    bool
     // ...
 }

 func (p *BackupPlugin) waitForEtcdBackupCompletion(ctx context.Context) error {
-    if p.etcdOrchestrator == nil || !p.etcdOrchestrator.IsCreated() {
+    if p.etcdOrchestrator == nil || !p.etcdOrchestrator.IsCreated() || p.etcdBackupDone {
         return nil
     }

     snapshotURL, err := p.etcdOrchestrator.WaitForCompletion(ctx)
     if err != nil {
         return fmt.Errorf("HCPEtcdBackup failed: %v", err)
     }
     p.log.Infof("HCPEtcdBackup completed, snapshotURL: %s", snapshotURL)
+    p.etcdBackupDone = true

     // ...
 }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@pkg/core/backup.go` around lines 275 - 293, Add a cached completion flag on
the BackupPlugin (e.g., a bool field completionWaited) and update
waitForEtcdBackupCompletion to return early if that flag is true; when calling
etcdOrchestrator.WaitForCompletion and it succeeds, set completionWaited = true
and still call CleanupCredentialSecret as before. If waitForEtcdBackupCompletion
can be invoked concurrently, protect completionWaited with a mutex or use an
atomic to avoid races; reference the BackupPlugin struct,
waitForEtcdBackupCompletion, etcdOrchestrator.WaitForCompletion and
etcdOrchestrator.CleanupCredentialSecret when locating where to add the flag and
guard.
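
If concurrent invocation is a concern, the guarded flag described in this prompt could take roughly the following shape. It is a sketch only: `completionWaiter`, `waitOnce`, and `demo` are hypothetical names, and the `wait` field stands in for etcdOrchestrator.WaitForCompletion rather than calling the real orchestrator.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// completionWaiter caches a successful wait so later callers skip the poll.
// The wait field is a stand-in for Orchestrator.WaitForCompletion.
type completionWaiter struct {
	mu   sync.Mutex
	done bool
	wait func(ctx context.Context) (string, error)
}

// waitOnce runs the underlying wait at most once per success. The mutex is
// held for the whole wait, so a concurrent caller blocks until the first
// finishes and then returns early. Failures are not cached.
func (w *completionWaiter) waitOnce(ctx context.Context) error {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.done {
		return nil
	}
	if _, err := w.wait(ctx); err != nil {
		return fmt.Errorf("HCPEtcdBackup failed: %w", err)
	}
	w.done = true
	return nil
}

// demo calls waitOnce twice (as the HCP and HC cases would) and reports how
// many times the underlying wait actually ran.
func demo() (calls int, err error) {
	w := &completionWaiter{wait: func(context.Context) (string, error) {
		calls++
		return "s3://example-bucket/snapshot.db", nil
	}}
	for i := 0; i < 2; i++ {
		if err = w.waitOnce(context.Background()); err != nil {
			return calls, err
		}
	}
	return calls, nil
}

func main() {
	calls, err := demo()
	fmt.Println("underlying waits:", calls, "err:", err)
}
```

A plain bool without the mutex, as in the suggested diff, is enough when the plugin only ever calls the function sequentially.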

pkg/common/types.go (1)

47-51: Remove the unused EtcdBackupSucceeded constant.

While BackupInProgressReason (line 49) and BackupRejectedReason (line 50) are actively used in the orchestrator logic, EtcdBackupSucceeded is defined but never referenced. Removing this unused constant will reduce clutter.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/common/types.go` around lines 47 - 51, Remove the unused
EtcdBackupSucceeded constant from the local constants block so only the actively
used BackupInProgressReason and BackupRejectedReason remain; delete the line
defining EtcdBackupSucceeded and ensure no other code references that symbol
(search for "EtcdBackupSucceeded") and remove or update any usages if found,
leaving the TODO comment and the two used constants intact.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/common/utils.go`:
- Around line 174-175: The code currently swallows errors from the Kubernetes
client List call (c.List(ctx, hcList, crclient.InNamespace(ns))) by continuing
on any error, which hides API/RBAC/transient failures; change the behavior so
that if c.List returns a non-nil error you return/propagate that error (wrap
with context mentioning the namespace) instead of continuing, while keeping the
existing logic that continues only when the list is empty; apply the same change
to the other identical c.List occurrence at the referenced location (both places
should return the error rather than falling through to nil,nil).
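
The behavior requested here, propagating list errors while still skipping empty namespaces, can be sketched without a Kubernetes client. `listFunc`, `findHostedCluster`, and `errForbidden` are hypothetical names for illustration; in the plugin the call in question is `c.List(ctx, hcList, crclient.InNamespace(ns))`.

```go
package main

import (
	"errors"
	"fmt"
)

// errForbidden simulates an RBAC failure from the API server in this sketch.
var errForbidden = errors.New("forbidden")

// listFunc is a hypothetical stand-in for c.List scoped to one namespace; it
// returns the names of the HostedClusters found there.
type listFunc func(namespace string) ([]string, error)

// findHostedCluster returns the first HostedCluster found across namespaces.
func findHostedCluster(namespaces []string, list listFunc) (string, error) {
	for _, ns := range namespaces {
		names, err := list(ns)
		if err != nil {
			// Propagate instead of continuing, so API, RBAC, and transient
			// failures are not mistaken for "no HostedCluster found".
			return "", fmt.Errorf("listing HostedClusters in namespace %q: %w", ns, err)
		}
		if len(names) == 0 {
			continue // an empty namespace is not an error; try the next one
		}
		return names[0], nil
	}
	return "", nil // nothing found in any namespace
}

func main() {
	_, err := findHostedCluster([]string{"clusters"}, func(string) ([]string, error) {
		return nil, errForbidden
	})
	fmt.Println(err, errors.Is(err, errForbidden))
}
```

Wrapping with `%w` keeps the original error inspectable through `errors.Is` while adding the namespace for context.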

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 58ba492b-4bde-4887-b6d3-61a9323151c4

📥 Commits

Reviewing files that changed from the base of the PR and between 3c34d79 and 9a0a887.

⛔ Files ignored due to path filters (25)

go.sum is excluded by !**/*.sum
vendor/github.com/openshift/api/config/v1/types_apiserver.go is excluded by !vendor/**, !**/vendor/**
vendor/github.com/openshift/api/config/v1/types_authentication.go is excluded by !vendor/**, !**/vendor/**
vendor/github.com/openshift/api/config/v1/types_infrastructure.go is excluded by !vendor/**, !**/vendor/**
vendor/github.com/openshift/api/config/v1/types_insights.go is excluded by !vendor/**, !**/vendor/**
vendor/github.com/openshift/api/config/v1/types_network.go is excluded by !vendor/**, !**/vendor/**
vendor/github.com/openshift/api/config/v1/types_tlssecurityprofile.go is excluded by !vendor/**, !**/vendor/**
vendor/github.com/openshift/api/config/v1/zz_generated.featuregated-crd-manifests.yaml is excluded by !vendor/**, !**/vendor/**
vendor/github.com/openshift/api/config/v1/zz_generated.swagger_doc_generated.go is excluded by !vendor/**, !**/vendor/**
vendor/github.com/openshift/api/operator/v1/types_network.go is excluded by !vendor/**, !**/vendor/**
vendor/github.com/openshift/api/operator/v1/zz_generated.featuregated-crd-manifests.yaml is excluded by !vendor/**, !**/vendor/**
vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/azure.go is excluded by !vendor/**, !**/vendor/**
vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/azureprivatelinkservice_types.go is excluded by !vendor/**, !**/vendor/**
vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/controlplaneversion_types.go is excluded by !vendor/**, !**/vendor/**
vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/etcdbackup_types.go is excluded by !vendor/**, !**/vendor/**
vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/gcp.go is excluded by !vendor/**, !**/vendor/**
vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/gcpprivateserviceconnect_types.go is excluded by !vendor/**, !**/vendor/**
vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/hosted_controlplane.go is excluded by !vendor/**, !**/vendor/**
vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/hostedcluster_conditions.go is excluded by !vendor/**, !**/vendor/**
vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/hostedcluster_types.go is excluded by !vendor/**, !**/vendor/**
vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/nodepool_types.go is excluded by !vendor/**, !**/vendor/**
vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/operator.go is excluded by !vendor/**, !**/vendor/**
vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/zz_generated.deepcopy.go is excluded by !vendor/**, !**/vendor/**
vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests.yaml is excluded by !vendor/**, !**/vendor/**
vendor/modules.txt is excluded by !vendor/**, !**/vendor/**

📒 Files selected for processing (16)

docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.md
go.mod
pkg/common/scheme.go
pkg/common/types.go
pkg/common/utils.go
pkg/common/utils_test.go
pkg/core/backup.go
pkg/core/restore.go
pkg/core/restore_test.go
pkg/core/types/types.go
pkg/core/validation/backup.go
pkg/core/validation/backup_test.go
pkg/core/validation/restore.go
pkg/etcdbackup/orchestrator.go
pkg/etcdbackup/orchestrator_test.go
pkg/platform/aws/aws.go

💤 Files with no reviewable changes (1)

pkg/platform/aws/aws.go

pkg/common/utils.go

jparrill · 2026-04-07T21:10:45Z

/hold until the dependency PRs are merged

jparrill · 2026-04-07T21:11:40Z

/label tide/merge-method-squash

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

pkg/core/backup.go (1)
145-151: Consider guarding against nil p.hcp before calling createEtcdBackup.

If the p.hcp lookup at lines 133-143 returns nil (e.g., on IsNotFound), the code returns early. However, between the early return and line 147, there's no explicit check that p.hcp is non-nil. Currently the flow is safe because the early return handles this, but the logic would be clearer and more defensive if the createEtcdBackup call were inside the p.hcp != nil block or had an explicit guard.

This is a minor readability point given the current flow is correct.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/core/backup.go` around lines 145 - 151, Move or guard the etcd backup
creation so it only runs when p.hcp is non-nil: ensure the call to
p.createEtcdBackup(ctx, backup) is executed inside the same block that verifies
p.hcp (or add an explicit if p.hcp != nil check) before checking
p.etcdBackupMethod == common.EtcdBackupMethodEtcdSnapshot; this keeps the logic
defensive and makes it clear createEtcdBackup is only invoked when p.hcp is
present.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/etcdbackup/orchestrator.go`:
- Around line 109-131: In VerifyInProgress (used by pollCondition) replace the
incorrect local constant check common.BackupInProgressReason with the vendored
hyperv1.BackupAlreadyInProgressReason so the switch will match the controller's
"BackupAlreadyInProgress" reason; also audit the case referencing
common.BackupRejectedReason (which may not exist in the vendored HyperShift
constants) and either remove that case or replace it with the correct vendored
constant if the controller actually emits a rejection reason, ensuring all
switch cases use vendored hyperv1 reason constants.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ce55502d-e414-402b-b509-59247817656b

📥 Commits

Reviewing files that changed from the base of the PR and between 9a0a887 and 4052ecd.

📒 Files selected for processing (15)

docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.md
pkg/common/scheme.go
pkg/common/types.go
pkg/common/utils.go
pkg/common/utils_test.go
pkg/core/backup.go
pkg/core/restore.go
pkg/core/restore_test.go
pkg/core/types/types.go
pkg/core/validation/backup.go
pkg/core/validation/backup_test.go
pkg/core/validation/restore.go
pkg/etcdbackup/orchestrator.go
pkg/etcdbackup/orchestrator_test.go
pkg/platform/aws/aws.go

💤 Files with no reviewable changes (1)

pkg/platform/aws/aws.go

✅ Files skipped from review due to trivial changes (7)

pkg/common/scheme.go
pkg/core/validation/backup_test.go
pkg/core/restore.go
pkg/core/restore_test.go
pkg/common/utils_test.go
pkg/common/types.go
pkg/etcdbackup/orchestrator_test.go

🚧 Files skipped from review as they are similar to previous changes (3)

pkg/core/validation/backup.go
pkg/core/types/types.go
pkg/common/utils.go

coderabbitai · 2026-04-08T09:27:48Z

pkg/etcdbackup/orchestrator.go

+// VerifyInProgress polls the HCPEtcdBackup until the controller acknowledges it.
+func (o *Orchestrator) VerifyInProgress(ctx context.Context) error {
+	return o.pollCondition(ctx, verifyTimeout, func(cond *metav1.Condition) (bool, error) {
+		if cond == nil {
+			return false, nil // no condition yet, keep polling
+		}
+		switch cond.Reason {
+		case common.BackupInProgressReason:
+			o.log.Info("HCPEtcdBackup is in progress")
+			return true, nil
+		case hyperv1.BackupSucceededReason:
+			o.log.Info("HCPEtcdBackup already succeeded")
+			return true, nil
+		case hyperv1.BackupFailedReason:
+			return false, fmt.Errorf("HCPEtcdBackup failed: %s", cond.Message)
+		case common.BackupRejectedReason:
+			return false, fmt.Errorf("HCPEtcdBackup rejected: %s", cond.Message)
+		case hyperv1.EtcdUnhealthyReason:
+			return false, fmt.Errorf("etcd unhealthy: %s", cond.Message)
+		}
+		return false, nil
+	})
+}


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify what condition reasons the HCPEtcdBackup controller actually sets
echo "=== Vendored HyperShift condition constants ==="
rg -n "Reason\s*=\s*" vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/etcdbackup_types.go
echo ""
echo "=== Local constants in common/types.go ==="
rg -n "BackupInProgressReason|BackupRejectedReason|EtcdBackupSucceeded" pkg/common/types.go
echo ""
echo "=== How VerifyInProgress uses these constants ==="
rg -nA2 "case.*BackupInProgressReason|case.*BackupRejectedReason" pkg/etcdbackup/orchestrator.go

Repository: openshift/hypershift-oadp-plugin

Length of output: 786

🏁 Script executed:

# First, verify the vendored file exists
find vendor -name "*etcdbackup*" -type f | head -20

Repository: openshift/hypershift-oadp-plugin

Length of output: 157

🏁 Script executed:

# Search for reason constants in the entire vendor/github.com/openshift/hypershift directory
rg "BackupAlreadyInProgress|BackupInProgress" vendor/github.com/openshift/hypershift/ -n | head -30

Repository: openshift/hypershift-oadp-plugin

Length of output: 226

🏁 Script executed:

# Also check what file structure exists for hypershift
ls -la vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/ 2>/dev/null | grep -i backup || echo "Directory not found, searching for backup files:"
find vendor/github.com/openshift -name "*backup*" -type f

Repository: openshift/hypershift-oadp-plugin

Length of output: 228

🏁 Script executed:

# Read the relevant section of vendored etcdbackup_types.go
sed -n '20,80p' vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/etcdbackup_types.go | cat -n

Repository: openshift/hypershift-oadp-plugin

Length of output: 3723

🏁 Script executed:

# Search for all *Reason constants in the vendored file
rg "Reason\s*string\s*=" vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/etcdbackup_types.go -n

Repository: openshift/hypershift-oadp-plugin

Length of output: 322

🏁 Script executed:

# Check orchestrator.go to see all the cases it handles
sed -n '109,131p' pkg/etcdbackup/orchestrator.go | cat -n

Repository: openshift/hypershift-oadp-plugin

Length of output: 1131

Fix condition reason mismatch: use hyperv1.BackupAlreadyInProgressReason instead of local constant.

The local constant common.BackupInProgressReason = "BackupInProgress" doesn't match the vendored HyperShift controller's BackupAlreadyInProgressReason = "BackupAlreadyInProgress". When the controller sets the condition with Reason: "BackupAlreadyInProgress", this switch case won't match, so the poll never reports success and keeps running until the verifyTimeout passed to pollCondition expires.

Additionally, common.BackupRejectedReason = "BackupRejected" is not defined in the vendored HyperShift constants—verify whether the controller actually sets this reason or remove the dead code.
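
The mismatch can be reproduced in isolation with a small classifier. This is a sketch, not the orchestrator: `classifyReason` is a hypothetical function, and only the two in-progress strings are quoted in this review; the other reason values are assumed from their constant names.

```go
package main

import "fmt"

// Reason strings used by this sketch. Only the two in-progress values are
// quoted in the review; the remaining values are assumptions.
const (
	reasonAlreadyInProgress = "BackupAlreadyInProgress" // vendored value per the review
	reasonLocalInProgress   = "BackupInProgress"        // local constant's value per the review
	reasonSucceeded         = "BackupSucceeded"         // assumed value
	reasonFailed            = "BackupFailed"            // assumed value
	reasonEtcdUnhealthy     = "EtcdUnhealthy"           // assumed value
)

// classifyReason reports whether polling can stop (done) or must abort (err).
// Returning (false, nil) means "keep polling".
func classifyReason(reason, message string) (done bool, err error) {
	switch reason {
	case reasonAlreadyInProgress, reasonSucceeded:
		return true, nil
	case reasonFailed:
		return false, fmt.Errorf("HCPEtcdBackup failed: %s", message)
	case reasonEtcdUnhealthy:
		return false, fmt.Errorf("etcd unhealthy: %s", message)
	default:
		return false, nil // unknown or empty reason
	}
}

func main() {
	// The controller's reason is recognized; the local string is not, which
	// is the situation that keeps the poll running until its timeout.
	for _, r := range []string{reasonAlreadyInProgress, reasonLocalInProgress} {
		done, err := classifyReason(r, "")
		fmt.Printf("reason=%s done=%v err=%v\n", r, done, err)
	}
}
```

Switching on the vendored constants rather than local copies removes this class of drift once the vendor update lands.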

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@pkg/etcdbackup/orchestrator.go` around lines 109 - 131, In VerifyInProgress (used by pollCondition) replace the incorrect local constant check common.BackupInProgressReason with the vendored hyperv1.BackupAlreadyInProgressReason so the switch will match the controller's "BackupAlreadyInProgress" reason; also audit the case referencing common.BackupRejectedReason (which may not exist in the vendored HyperShift constants) and either remove that case or replace it with the correct vendored constant if the controller actually emits a rejection reason, ensuring all switch cases use vendored hyperv1 reason constants.

openshift-ci-robot · 2026-04-08T10:38:30Z

@jparrill: This pull request references CNTRLPLANE-2685 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Summary

Adds etcdSnapshot as a new etcd backup method for HyperShift HostedClusters. This PR covers only the backup flow; the restore flow (with pre-signed URL generation) will follow in a separate PR.

Introduces etcdBackupMethod configuration key (volumeSnapshot default, etcdSnapshot for native etcd snapshots)

During backup, the plugin creates an HCPEtcdBackup CR that triggers a native etcdctl snapshot save and uploads the snapshot to the Velero BSL object store

Bridges Velero BSL credentials to the HCPEtcdBackup controller by copying and remapping the credential Secret

Excludes etcd pods from the backup when using etcdSnapshot to prevent CSI VolumeSnapshots

Stores the snapshot under Velero's backup directory structure ({prefix}/backups/{backup-name}/etcd-backup/)

Note: The restore flow (reading lastSuccessfulEtcdBackupURL from HC status and generating a pre-signed HTTPS URL for restoreSnapshotURL) will be implemented in a follow-up PR.

Dependencies

This PR depends 100% on the following HyperShift PRs being merged:

openshift/hypershift#8139 — HCPEtcdBackup controller (CNTRLPLANE-2678)

openshift/hypershift#8179 — LastSuccessfulEtcdBackupURL field in HostedClusterStatus (CNTRLPLANE-3173)

Once both are merged, a vendor update is required to remove local constants (BackupInProgressReason, BackupRejectedReason) in favor of API-defined ones.

Related

Enhancement: openshift/enhancements#1945

Jira: CNTRLPLANE-2685

Changes

| Commit | Description |
| --- | --- |
| `fix(deps)` | Update HyperShift API vendor to latest main (brings in `HCPEtcdBackup` types) |
| `feat(backup)` | Integrate HCPEtcdBackup lifecycle: BSL mapping, credential copy, CR creation, polling, pod exclusion |
| `docs` | Full implementation reference with architecture, flows, configuration, and troubleshooting |

Test plan

Unit tests for orchestrator: BSL-to-storage mapping (AWS, Azure, unsupported), credential Secret copy and key remapping, encryption field injection, CRD existence check

Unit tests for validation: etcdBackupMethod and hoNamespace recognized as valid config keys

go build ./... passes

go test ./... passes

E2E: deploy plugin with etcdBackupMethod=etcdSnapshot, run OADP backup, verify HCPEtcdBackup CR created and snapshot uploaded

E2E: verify volumeSnapshot method (default) behavior is unchanged
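For the validation tests in the plan above, the behavior under test might look roughly like the following. This is a sketch under stated assumptions: only the key name `etcdBackupMethod` and its two values come from the PR description; `resolveBackupMethod` is a hypothetical helper, and rejecting unknown values is an assumed policy rather than confirmed plugin behavior.

```go
package main

import "fmt"

const (
	methodVolumeSnapshot = "volumeSnapshot" // default per the PR description
	methodEtcdSnapshot   = "etcdSnapshot"
)

// resolveBackupMethod reads etcdBackupMethod from a plugin config map and
// falls back to volumeSnapshot when the key is absent or empty.
// Returning an error for unknown values is an assumption of this sketch.
func resolveBackupMethod(config map[string]string) (string, error) {
	v := config["etcdBackupMethod"]
	switch v {
	case "":
		return methodVolumeSnapshot, nil
	case methodVolumeSnapshot, methodEtcdSnapshot:
		return v, nil
	default:
		return "", fmt.Errorf("unsupported etcdBackupMethod %q", v)
	}
}

func main() {
	m, _ := resolveBackupMethod(map[string]string{})
	fmt.Println(m) // volumeSnapshot
	m, _ = resolveBackupMethod(map[string]string{"etcdBackupMethod": "etcdSnapshot"})
	fmt.Println(m) // etcdSnapshot
}
```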

🤖 Generated with Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

coderabbitai

🧹 Nitpick comments (1)

docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.md (1)
40-40: Consider adding language identifiers to fenced code blocks.

While not impacting functionality, adding language identifiers would resolve linter warnings and improve syntax highlighting:

Line 40: ASCII diagram (consider text or no language)

Lines 219, 224: Path structures (consider text or plaintext)

Line 456: JSON output example (consider json)

Also applies to: 219-219, 224-224, 456-456
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.md` at line 40,
Add explicit language identifiers to the fenced code blocks: mark the ASCII
diagram block as ```text (or leave blank) to satisfy the linter, mark the two
path-structure blocks as ```text or ```plaintext, and mark the JSON output
example as ```json so syntax highlighting and linter warnings are resolved;
update the corresponding fenced code block delimiters for the ASCII diagram, the
path structure examples, and the JSON output example accordingly.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ab1e9bd0-2870-462b-83b6-33b72b9cd059

📥 Commits

Reviewing files that changed from the base of the PR and between 4052ecd and 3cf9dfe.

📒 Files selected for processing (1)

docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.md

coderabbitai

🧹 Nitpick comments (1)

pkg/etcdbackup/orchestrator.go (1)

173-190: Credential Secret cleanup should occur on failure paths too.

Based on the integration in pkg/core/backup.go, CleanupCredentialSecret is only called after WaitForCompletion succeeds. If the backup fails or times out, the copied credential Secret remains in the HO namespace. While the Secret contains an STS IAM Role ARN (not rotatable keys), accumulating orphaned Secrets is undesirable.

Consider calling cleanup in a defer or ensuring it's called on all exit paths.

♻️ Suggested approach in backup.go

// In waitForEtcdBackupCompletion, defer cleanup regardless of outcome:
func (p *BackupPlugin) waitForEtcdBackupCompletion(ctx context.Context) error {
	if p.etcdOrchestrator == nil || !p.etcdOrchestrator.IsCreated() {
		return nil
	}

+	// Always attempt cleanup, even on failure
+	defer func() {
+		if cleanupErr := p.etcdOrchestrator.CleanupCredentialSecret(ctx); cleanupErr != nil {
+			p.log.Warnf("Failed to cleanup etcd backup credential Secret: %v", cleanupErr)
+		}
+	}()

	if p.etcdSnapshotURL != "" {
		return nil
	}

	snapshotURL, err := p.etcdOrchestrator.WaitForCompletion(ctx)
	if err != nil {
		return fmt.Errorf("HCPEtcdBackup failed: %v", err)
	}
	p.etcdSnapshotURL = snapshotURL
	p.log.Infof("HCPEtcdBackup completed, snapshotURL: %s", snapshotURL)

-	if cleanupErr := p.etcdOrchestrator.CleanupCredentialSecret(ctx); cleanupErr != nil {
-		p.log.Warnf("Failed to cleanup etcd backup credential Secret: %v", cleanupErr)
-	}

	return nil
}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@pkg/etcdbackup/orchestrator.go` around lines 173 - 190,
CleanupCredentialSecret is only invoked after WaitForCompletion succeeds,
leaving orphaned Secrets on failure/timeouts; update waitForEtcdBackupCompletion
(method on BackupPlugin) to always attempt cleanup by deferring a call to
etcdOrchestrator.CleanupCredentialSecret(ctx) right after verifying
etcdOrchestrator.IsCreated(), so the Secret is removed on all exit paths
(success, error, timeout); ensure the deferred call logs any cleanup error (use
p.log.Warnf or similar) and continues to return the original error from
WaitForCompletion.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d116e2dc-774f-4faa-9b51-39b2dd034bae

📥 Commits

Reviewing files that changed from the base of the PR and between 3cf9dfe and d96df42.

📒 Files selected for processing (13)

docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.md
pkg/common/scheme.go
pkg/common/types.go
pkg/common/utils.go
pkg/common/utils_test.go
pkg/core/backup.go
pkg/core/types/types.go
pkg/core/validation/backup.go
pkg/core/validation/backup_test.go
pkg/core/validation/restore.go
pkg/etcdbackup/orchestrator.go
pkg/etcdbackup/orchestrator_test.go
pkg/platform/aws/aws.go

💤 Files with no reviewable changes (1)

pkg/platform/aws/aws.go

✅ Files skipped from review due to trivial changes (4)

pkg/common/utils_test.go
pkg/core/validation/backup_test.go
pkg/common/types.go
pkg/etcdbackup/orchestrator_test.go

🚧 Files skipped from review as they are similar to previous changes (6)

pkg/common/scheme.go
pkg/core/types/types.go
pkg/common/utils.go
pkg/core/validation/restore.go
pkg/core/validation/backup.go
pkg/core/backup.go

Add etcdSnapshot backup method that creates and monitors HCPEtcdBackup CRs during Velero backup.

When etcdBackupMethod=etcdSnapshot is configured in the plugin ConfigMap, the plugin:
- Creates an HCPEtcdBackup CR in the HCP namespace using BSL storage config
- Copies BSL credentials to the HO namespace (remapping key for controller)
- Polls the CR until backup completes or fails
- Excludes etcd pods and PVCs from Velero backup (no CSI/FS backup needed)
- Stores the etcd snapshot alongside the Velero backup data in the BSL

The default method remains volumeSnapshot (unchanged behavior).

Also cleans up dead config parameters (readoptNodes, managedServices, awsRegenPrivateLink) and registers apiextensionsv1 in the scheme for CRD existence checks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Juan Manuel Parrilla Madrid <jparrill@redhat.com>
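The credential copy with key remapping described in this commit could be sketched as below. The key names `cloud` and `credentials` are placeholders chosen for illustration (the commit message does not name the keys), and `remapSecretData` is a hypothetical helper operating on plain maps rather than on `corev1.Secret` objects.

```go
package main

import "fmt"

// remapSecretData returns a deep copy of src in which the value stored under
// fromKey is exposed under toKey instead. If toKey already exists in src it
// is overwritten. Hypothetical helper; not the plugin's implementation.
func remapSecretData(src map[string][]byte, fromKey, toKey string) (map[string][]byte, error) {
	val, ok := src[fromKey]
	if !ok {
		return nil, fmt.Errorf("source Secret has no key %q", fromKey)
	}
	dst := make(map[string][]byte, len(src))
	for k, v := range src {
		if k == fromKey {
			continue // dropped here, re-added under toKey below
		}
		dst[k] = append([]byte(nil), v...) // copy so the source stays untouched
	}
	dst[toKey] = append([]byte(nil), val...)
	return dst, nil
}

func main() {
	src := map[string][]byte{"cloud": []byte("role_arn = <placeholder>")}
	dst, err := remapSecretData(src, "cloud", "credentials") // key names are assumptions
	fmt.Println(len(dst), err)                                // 1 <nil>
}
```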

Document the full HCPEtcdBackup integration including architecture, backup/restore flows, configuration, credential handling, storage layout, dependency chain (PRs #8139, #8010, #8017, #8040, #8114, enhancement #1945), and troubleshooting guide.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Juan Manuel Parrilla Madrid <jparrill@redhat.com>

openshift-ci · 2026-04-08T15:29:46Z

@jparrill: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

coderabbitai

Actionable comments posted: 4

♻️ Duplicate comments (1)

pkg/etcdbackup/orchestrator.go (1)

115-126: ⚠️ Potential issue | 🟠 Major

Poll the vendored HyperShift reasons, not local placeholders.

The current switches still look for common.BackupInProgressReason/common.BackupRejectedReason, while the vendored API exposes BackupAlreadyInProgress, BackupFailed, BackupSucceeded, and EtcdUnhealthy. WaitForCompletion() also drops EtcdUnhealthyReason, so those states can sit in the poll loop until timeout.

Run this to compare the vendored constants with the current polling branches:

#!/bin/bash
set -euo pipefail

echo "=== vendored HCPEtcdBackup condition reasons ==="
sed -n '1,20p' vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/etcdbackup_types.go | nl -ba

echo
echo "=== current polling branches ==="
sed -n '109,165p' pkg/etcdbackup/orchestrator.go | nl -ba

echo
echo "=== local reason constants referenced here ==="
rg -n 'BackupInProgressReason|BackupRejectedReason|BackupAlreadyInProgressReason|EtcdUnhealthyReason' pkg/common/types.go pkg/etcdbackup/orchestrator.go

Also applies to: 155-160
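One way to make the poll loop exhaustive over the reasons listed above is to classify each reason into an explicit outcome. The sketch below uses local string constants with the values named in this comment; in the real code these would be the vendored `hyperv1` constants, whose exact Go identifiers are not confirmed here. Treating `EtcdUnhealthy` as a terminal failure and `BackupAlreadyInProgress` as "keep polling" are assumptions of the sketch.

```go
package main

import "fmt"

// Reason values as listed in the review comment. The Go identifiers are
// hypothetical; real code would reference the vendored hyperv1 constants.
const (
	reasonAlreadyInProgress = "BackupAlreadyInProgress"
	reasonSucceeded         = "BackupSucceeded"
	reasonFailed            = "BackupFailed"
	reasonEtcdUnhealthy     = "EtcdUnhealthy"
)

type pollOutcome int

const (
	keepPolling pollOutcome = iota
	succeeded
	failed
)

// classify maps a condition reason to a poll outcome. Unknown or empty
// reasons keep polling so the caller's timeout remains the final guard.
func classify(reason string) pollOutcome {
	switch reason {
	case reasonSucceeded:
		return succeeded
	case reasonFailed, reasonEtcdUnhealthy: // assumed terminal
		return failed
	case reasonAlreadyInProgress:
		return keepPolling
	default:
		return keepPolling
	}
}

func main() {
	fmt.Println(classify(reasonEtcdUnhealthy) == failed) // true
}
```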

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@pkg/etcdbackup/orchestrator.go` around lines 115 - 126, The switch over
cond.Reason in the poll loop is using local placeholders
(common.BackupInProgressReason, common.BackupRejectedReason) instead of the
vendored HyperShift condition reason constants; update the cases to use the
vendored names (hyperv1.BackupAlreadyInProgress, hyperv1.BackupSucceeded,
hyperv1.BackupFailed, hyperv1.EtcdUnhealthy) so the poll loop and
WaitForCompletion() observe the same semantics, and remove/replace any
references to common.BackupInProgressReason/common.BackupRejectedReason (also
adjust the similar branch around lines 155-160) so all checks consistently
reference hyperv1 constants.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e6bca3fb-03c5-479b-9731-d1f5ead6e10a

📥 Commits

Reviewing files that changed from the base of the PR and between d96df42 and eb942f0.

📒 Files selected for processing (13)

docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.md
pkg/common/scheme.go
pkg/common/types.go
pkg/common/utils.go
pkg/common/utils_test.go
pkg/core/backup.go
pkg/core/types/types.go
pkg/core/validation/backup.go
pkg/core/validation/backup_test.go
pkg/core/validation/restore.go
pkg/etcdbackup/orchestrator.go
pkg/etcdbackup/orchestrator_test.go
pkg/platform/aws/aws.go

💤 Files with no reviewable changes (1)

pkg/platform/aws/aws.go

✅ Files skipped from review due to trivial changes (3)

pkg/core/validation/backup_test.go
pkg/common/utils_test.go
pkg/common/types.go

🚧 Files skipped from review as they are similar to previous changes (5)

pkg/core/types/types.go
pkg/common/utils.go
pkg/core/validation/restore.go
pkg/common/scheme.go
pkg/core/validation/backup.go

coderabbitai · 2026-04-08T15:39:52Z

docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.md

+1. When the `HostedCluster` item is processed during restore, the plugin reads `status.lastSuccessfulEtcdBackupURL` from the HC's unstructured content.
+
+2. If the URL is present and the HC has managed etcd (`spec.etcd.managed != nil`), the plugin injects the URL into `spec.etcd.managed.storage.restoreSnapshotURL`.
+
+3. The modified HC is written back to Velero's output, so when the HC is created in the target cluster, the HyperShift Operator uses the snapshot URL to restore etcd from the snapshot.
+
+### No Bidirectional Tracking
+
+The previous approach required tracking both the `HCPEtcdBackup` CR and the `HostedCluster` item arrival order. With `lastSuccessfulEtcdBackupURL` persisted in the HC status by the HCPEtcdBackup controller, the restore flow is stateless — everything needed is in the HC object itself.
+
+> **Note**: The `lastSuccessfulEtcdBackupURL` field is read via unstructured map access until the HyperShift API vendor is updated to include it (tracked in CNTRLPLANE-3173).


⚠️ Potential issue | 🟠 Major

Document the restore handoff as the annotation, not status.

This section says restore reads status.lastSuccessfulEtcdBackupURL, but pkg/core/backup.go Lines 179-199 explicitly persist the URL in hypershift.openshift.io/etcd-snapshot-url because Velero strips status during restore. As written, this documents the wrong contract for the follow-on restore flow.
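If the restore handoff follows the annotation contract described in this comment, the restore-side step might look like the sketch below. It works on a plain `map[string]interface{}` as a stand-in for the HostedCluster's unstructured content; `nested` and `injectRestoreURL` are hypothetical helpers. Writing `restoreSnapshotURL` as a single-element list reflects how I understand the field to be typed and should be verified against the vendored API.

```go
package main

import "fmt"

const snapshotURLAnnotation = "hypershift.openshift.io/etcd-snapshot-url"

// nested walks obj along keys and returns the map found there, or false if
// any step is missing or is not a map.
func nested(obj map[string]interface{}, keys ...string) (map[string]interface{}, bool) {
	cur := obj
	for _, k := range keys {
		next, ok := cur[k].(map[string]interface{})
		if !ok {
			return nil, false
		}
		cur = next
	}
	return cur, true
}

// injectRestoreURL copies the snapshot URL annotation into
// spec.etcd.managed.storage.restoreSnapshotURL. It reports whether it wrote
// anything; HostedClusters without managed etcd are left untouched.
func injectRestoreURL(hc map[string]interface{}) bool {
	annotations, ok := nested(hc, "metadata", "annotations")
	if !ok {
		return false
	}
	url, _ := annotations[snapshotURLAnnotation].(string)
	if url == "" {
		return false
	}
	managed, ok := nested(hc, "spec", "etcd", "managed")
	if !ok {
		return false // unmanaged etcd: nothing to restore into
	}
	storage, ok := managed["storage"].(map[string]interface{})
	if !ok {
		storage = map[string]interface{}{}
		managed["storage"] = storage
	}
	storage["restoreSnapshotURL"] = []interface{}{url} // list shape is an assumption
	return true
}

func main() {
	hc := map[string]interface{}{
		"metadata": map[string]interface{}{"annotations": map[string]interface{}{
			snapshotURLAnnotation: "https://example.invalid/snapshot.db", // placeholder
		}},
		"spec": map[string]interface{}{"etcd": map[string]interface{}{
			"managed": map[string]interface{}{},
		}},
	}
	fmt.Println(injectRestoreURL(hc)) // true
}
```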

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.md` around lines 117 - 127, Update the paragraph to say the restore reads the etcd snapshot URL from the HostedCluster annotation hypershift.openshift.io/etcd-snapshot-url (not status.lastSuccessfulEtcdBackupURL); explain that pkg/core/backup.go persists the URL into that annotation because Velero strips status on restore, and update the note about unstructured access to clarify we read the annotation until the HyperShift API vendor adds a formal status field.

coderabbitai · 2026-04-08T15:39:53Z

pkg/core/backup.go

+	// Etcd backup: create HCPEtcdBackup CR as early as possible (once).
+	// Only when etcdBackupMethod is "etcdSnapshot".
+	if p.etcdBackupMethod == common.EtcdBackupMethodEtcdSnapshot {
+		if err := p.createEtcdBackup(ctx, backup); err != nil {
+			return nil, nil, fmt.Errorf("error creating HCPEtcdBackup: %v", err)
+		}


⚠️ Potential issue | 🟠 Major

Don't start the snapshot workflow before validation and a guaranteed wait path.

This fires as soon as any item resolves an HCP, but platform validation only happens later in the HostedControlPlane branch and completion/cleanup only happen in the HostedControlPlane/HostedCluster branches. That means an invalid or partial backup can still launch HCPEtcdBackup and copy credentials with no later path that waits for it or cleans it up.

Also applies to: 162-185

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pkg/core/backup.go` around lines 146 - 151, The current logic calls p.createEtcdBackup as soon as any HCP resolves (guarded only by p.etcdBackupMethod), which can start HCPEtcdBackup before platform validation and without a guaranteed wait/cleanup path; change this so HCPEtcdBackup is only created after platform validation and in a branch that implements completion/cleanup (move or gate the call into the HostedControlPlane/HostedCluster handling after validation succeeds), or add an explicit validation check and registration of a wait/cleanup handler before invoking p.createEtcdBackup; reference p.etcdBackupMethod, p.createEtcdBackup, HCPEtcdBackup, HostedControlPlane and HostedCluster to locate and update the code paths accordingly.

coderabbitai · 2026-04-08T15:39:53Z

pkg/core/backup.go

+	if err := p.etcdOrchestrator.CreateEtcdBackup(ctx, backup, p.hcp.Namespace, hc); err != nil {
+		return err
+	}
+
+	if err := p.etcdOrchestrator.VerifyInProgress(ctx); err != nil {
+		return err


⚠️ Potential issue | 🟠 Major

Clean up the copied Secret on create/verify failures.

The only cleanup today is in waitForEtcdBackupCompletion(). If either call here fails after the credentials were copied, the remapped BSL Secret is left behind in hoNamespace.

Possible fix

	if err := p.etcdOrchestrator.CreateEtcdBackup(ctx, backup, p.hcp.Namespace, hc); err != nil {
+		if cleanupErr := p.etcdOrchestrator.CleanupCredentialSecret(ctx); cleanupErr != nil {
+			p.log.Warnf("Failed to cleanup etcd backup credential Secret after create error: %v", cleanupErr)
+		}
		return err
	}

	if err := p.etcdOrchestrator.VerifyInProgress(ctx); err != nil {
+		if cleanupErr := p.etcdOrchestrator.CleanupCredentialSecret(ctx); cleanupErr != nil {
+			p.log.Warnf("Failed to cleanup etcd backup credential Secret after verify error: %v", cleanupErr)
+		}
		return err
	}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pkg/core/backup.go` around lines 284 - 289, When CreateEtcdBackup or VerifyInProgress fails after copying credentials, ensure you delete the remapped BSL Secret in the host operator namespace before returning; update the error paths after p.etcdOrchestrator.CreateEtcdBackup(ctx, backup, p.hcp.Namespace, hc) and p.etcdOrchestrator.VerifyInProgress(ctx) so they call the same cleanup that waitForEtcdBackupCompletion() uses (or explicitly delete the remapped secret by name in hoNamespace/p.hcp.Namespace) and then return the original error; make the deletion idempotent (ignore NotFound) and log any deletion failure but still return the original create/verify error.

coderabbitai · 2026-04-08T15:39:53Z

pkg/etcdbackup/orchestrator.go

+	crName := fmt.Sprintf("oadp-%s-%s", backup.Name, utilrand.String(4))
+	etcdBackup := &hyperv1.HCPEtcdBackup{
+		ObjectMeta: metav1.ObjectMeta{
+			Name:      crName,
+			Namespace: hcpNamespace,
+		},
+		Spec: hyperv1.HCPEtcdBackupSpec{
+			Storage: *storage,
+		},
+	}
+
+	if err := o.client.Create(ctx, etcdBackup); err != nil {
+		if apierrors.IsAlreadyExists(err) {
+			o.log.Infof("HCPEtcdBackup %s/%s already exists, reusing", hcpNamespace, crName)
+		} else {
+			return fmt.Errorf("failed to create HCPEtcdBackup: %w", err)
+		}
+	} else {
+		o.log.Infof("Created HCPEtcdBackup %s/%s", hcpNamespace, crName)
+	}
+
+	o.BackupName = crName


⚠️ Potential issue | 🟠 Major

Random CR names make this create path non-idempotent.

crName changes on every call, so the AlreadyExists branch only helps on a name collision. If this method is re-entered with a fresh in-memory state, the same Velero backup can create multiple HCPEtcdBackup CRs instead of reusing the first one.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pkg/etcdbackup/orchestrator.go` around lines 83 - 104, The code generates a random crName causing non-idempotent creates; instead derive crName deterministically from the Velero backup (e.g., use backup.Name or backup.UID to build the CR name) and before calling o.client.Create check for an existing HCPEtcdBackup by that deterministic name via o.client.Get (or list by owner/label) so you reuse the existing CR on re-entry; update crName, use o.client.Get/Create accordingly, and set o.BackupName = crName when reusing or after successful create (referencing variables crName, backup.Name, backup.UID, o.client.Get/Create, and o.BackupName).

jparrill requested review from enxebre and sjenning as code owners April 7, 2026 20:54

openshift-ci bot requested review from celebdor and muraee April 7, 2026 20:54

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 7, 2026

jparrill changed the title ~~feat: integrate HCPEtcdBackup lifecycle into OADP backup/restore~~ CNTRLPLANE-2685: integrate HCPEtcdBackup lifecycle into OADP backup/restore Apr 7, 2026

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 7, 2026

coderabbitai bot reviewed Apr 7, 2026



openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 7, 2026

openshift-ci bot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Apr 7, 2026

jparrill force-pushed the CNTRLPLANE-2685 branch from 9a0a887 to 4052ecd Compare April 8, 2026 09:21

coderabbitai bot reviewed Apr 8, 2026


jparrill changed the title ~~CNTRLPLANE-2685: integrate HCPEtcdBackup lifecycle into OADP backup/restore~~ CNTRLPLANE-2685: integrate HCPEtcdBackup lifecycle into OADP backup flow Apr 8, 2026

jparrill force-pushed the CNTRLPLANE-2685 branch from 4052ecd to 3cf9dfe Compare April 8, 2026 10:39

coderabbitai bot reviewed Apr 8, 2026


jparrill force-pushed the CNTRLPLANE-2685 branch from 3cf9dfe to d96df42 Compare April 8, 2026 14:16

coderabbitai bot reviewed Apr 8, 2026


jparrill and others added 2 commits April 8, 2026 16:50

jparrill force-pushed the CNTRLPLANE-2685 branch from d96df42 to eb942f0 Compare April 8, 2026 15:24

coderabbitai bot reviewed Apr 8, 2026

