Skip to content

CNTRLPLANE-2685: integrate HCPEtcdBackup lifecycle into OADP backup flow#238

Open
jparrill wants to merge 3 commits intoopenshift:mainfrom
jparrill:CNTRLPLANE-2685
Open

CNTRLPLANE-2685: integrate HCPEtcdBackup lifecycle into OADP backup flow#238
jparrill wants to merge 3 commits intoopenshift:mainfrom
jparrill:CNTRLPLANE-2685

Conversation

@jparrill
Copy link
Copy Markdown
Contributor

@jparrill jparrill commented Apr 7, 2026

Summary

Adds etcdSnapshot as a new etcd backup method for HyperShift HostedClusters. This PR covers only the backup flow; the restore flow (with pre-signed URL generation) will follow in a separate PR.

  • Introduces etcdBackupMethod configuration key (volumeSnapshot default, etcdSnapshot for native etcd snapshots)
  • During backup, the plugin creates an HCPEtcdBackup CR that triggers a native etcdctl snapshot save and uploads the snapshot to the Velero BSL object store
  • Bridges Velero BSL credentials to the HCPEtcdBackup controller by copying and remapping the credential Secret
  • Excludes etcd pods from the backup when using etcdSnapshot to prevent CSI VolumeSnapshots
  • Stores the snapshot under Velero's backup directory structure ({prefix}/backups/{backup-name}/etcd-backup/)

Note: The restore flow (reading lastSuccessfulEtcdBackupURL from HC status and generating a pre-signed HTTPS URL for restoreSnapshotURL) will be implemented in a follow-up PR.

Dependencies

This PR depends 100% on the following HyperShift PRs being merged:

Once both are merged, a vendor update is required to remove local constants (BackupInProgressReason, BackupRejectedReason) in favor of API-defined ones.

Related

Changes

Commit Description
fix(deps) Update HyperShift API vendor to latest main (brings in HCPEtcdBackup types)
feat(backup) Integrate HCPEtcdBackup lifecycle: BSL mapping, credential copy, CR creation, polling, pod exclusion
docs Full implementation reference with architecture, flows, configuration, and troubleshooting

Test plan

  • Unit tests for orchestrator: BSL-to-storage mapping (AWS, Azure, unsupported), credential Secret copy and key remapping, encryption field injection, CRD existence check
  • Unit tests for validation: etcdBackupMethod and hoNamespace recognized as valid config keys
  • go build ./... passes
  • go test ./... passes
  • E2E: deploy plugin with etcdBackupMethod=etcdSnapshot, run OADP backup, verify HCPEtcdBackup CR created and snapshot uploaded
  • E2E: verify volumeSnapshot method (default) behavior is unchanged

🤖 Generated with Claude Code

Update github.com/openshift/hypershift/api to v0.0.0-20260406110001-bcf6adaf131f.
This brings in the HCPEtcdBackup CRD types, HCPEtcdBackupConfig in
ManagedEtcdSpec.Backup, and related condition/reason constants needed
for CNTRLPLANE-2685 integration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Juan Manuel Parrilla Madrid <jparrill@redhat.com>
@openshift-ci
Copy link
Copy Markdown

openshift-ci bot commented Apr 7, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jparrill

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot requested review from celebdor and muraee April 7, 2026 20:54
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 7, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Apr 7, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Adds HCPEtcdBackup-based etcd snapshot orchestration: new orchestrator creates/monitors HCPEtcdBackup CRs, maps Velero BSL to storage, copies/remaps credentials, injects snapshot URLs for restores, removes legacy config keys, updates types/scheme, and includes docs and tests.

Changes

Cohort / File(s) Summary
Documentation
docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.md
New implementation and troubleshooting documentation for HCPEtcdBackup integration, backup/restore flow, BSL mapping, credential lifecycle, config keys, and testing.
Module deps
go.mod
Updated pinned commits for github.com/openshift/hypershift/api and indirect github.com/openshift/api.
Common: constants & scheme
pkg/common/types.go, pkg/common/scheme.go
Added exported constants for etcd snapshot integration, HONamespace and config keys; registered apiextensions.k8s.io/v1 types into the scheme.
Common: utilities & tests
pkg/common/utils.go, pkg/common/utils_test.go
Added GetHostedCluster helper to discover a HostedCluster by HCP namespace and unit tests covering positive and empty-client cases.
Backup plugin core
pkg/core/backup.go
Extended BackupPlugin to support etcdSnapshot mode: read config keys, create/monitor HCPEtcdBackup CRs, wait for completion, annotate HostedCluster with snapshot URL, and exclude etcd pods/PVCs from volume-based backups.
Core types & validation
pkg/core/types/types.go, pkg/core/validation/backup.go, pkg/core/validation/backup_test.go, pkg/core/validation/restore.go
Added hcpetcdbackups to common resources; removed ReadoptNodes/ManagedServices fields from BackupOptions/RestoreOptions; stopped parsing those keys in validators; adjusted tests and removed unused AWS restore gating.
Etcd backup orchestrator & tests
pkg/etcdbackup/orchestrator.go, pkg/etcdbackup/orchestrator_test.go
New exported Orchestrator: maps Velero BSL to HCPEtcdBackup storage (S3/Azure), computes key prefixes, copies/remaps BSL credentials into HostedCluster namespace, sets encryption/credential refs, creates/verifies/waits on HCPEtcdBackup, and cleans up credentials; comprehensive unit tests added.
Platform AWS cleanup
pkg/platform/aws/aws.go
Removed exported RestoreTasks function and its now-unused imports.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@jparrill jparrill changed the title feat: integrate HCPEtcdBackup lifecycle into OADP backup/restore CNTRLPLANE-2685: integrate HCPEtcdBackup lifecycle into OADP backup/restore Apr 7, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 7, 2026
@openshift-ci-robot
Copy link
Copy Markdown

openshift-ci-robot commented Apr 7, 2026

@jparrill: This pull request references CNTRLPLANE-2685 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Summary

  • Adds etcdSnapshot as a new etcd backup method alongside the existing volumeSnapshot (default), controlled via the etcdBackupMethod key in the plugin ConfigMap
  • During backup, the plugin creates an HCPEtcdBackup CR that triggers a native etcdctl snapshot save and uploads the snapshot to the Velero BSL object store
  • During restore, the plugin reads status.lastSuccessfulEtcdBackupURL from the HostedCluster and injects it into spec.etcd.managed.storage.restoreSnapshotURL so the HyperShift Operator restores etcd from the snapshot
  • Bridges Velero BSL credentials to the HCPEtcdBackup controller by copying and remapping the credential Secret
  • Excludes etcd pods and PVCs from the backup when using etcdSnapshot to prevent CSI VolumeSnapshots

Dependencies

This PR depends 100% on the following HyperShift PRs being merged:

Once both are merged, a vendor update is required to:

  1. Replace the unstructured getLastSuccessfulEtcdBackupURL() helper with direct field access
  2. Remove local constants (BackupInProgressReason, BackupRejectedReason) in favor of API-defined ones

Related

Changes

Commit Description
fix(deps) Update HyperShift API vendor to latest main (brings in HCPEtcdBackup types)
feat(backup) Integrate HCPEtcdBackup lifecycle: BSL mapping, credential copy, CR creation, polling, pod/PVC exclusion
feat(restore) Read lastSuccessfulEtcdBackupURL from HC status and inject into restoreSnapshotURL
docs Full implementation reference with architecture, flows, configuration, and troubleshooting

Test plan

  • Unit tests for orchestrator: BSL-to-storage mapping (AWS, Azure, unsupported), credential Secret copy and key remapping, encryption field injection, CRD existence check
  • Unit tests for restore: getLastSuccessfulEtcdBackupURL with various status shapes
  • Unit tests for validation: etcdBackupMethod and hoNamespace recognized as valid config keys
  • go build ./... passes
  • go test ./... passes
  • E2E: deploy plugin with etcdBackupMethod=etcdSnapshot, run OADP backup, verify HCPEtcdBackup CR created and snapshot uploaded
  • E2E: restore HostedCluster from backup, verify restoreSnapshotURL is injected
  • E2E: verify volumeSnapshot method (default) behavior is unchanged

🤖 Generated with Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (5)
pkg/core/validation/backup_test.go (1)

24-27: Cover the new config keys in this table

ValidatePluginConfig now special-cases etcdBackupMethod and hoNamespace, but this suite only exercises migration. Adding a happy-path entry for those keys would keep them from regressing back to the unknown-config path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/core/validation/backup_test.go` around lines 24 - 27, Add table-driven
test entries to the existing backup_test.go table that exercise the new
special-cased keys so they don't regress: add a happy-path case with config key
"etcdBackupMethod" set to a valid value and another with "hoNamespace" set to a
valid value, and assert ValidatePluginConfig accepts them (same expectations as
the existing "migration" case). Locate the test that calls ValidatePluginConfig
in this file (the table-driven cases around the "migration" entry) and append
two entries mirroring that structure to cover those keys.
docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.md (2)

39-67: Consider adding language specifier to the ASCII diagram code block.

Static analysis flagged this as missing a language specifier. Using text or plaintext would satisfy the linter while preserving the diagram formatting.

📝 Suggested fix
-```
+```text
                     OADP Plugin (BackupPlugin)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.md` around lines
39 - 67, The ASCII diagram code block is missing a language specifier; update
the opening triple-backtick that begins the diagram (the block containing the
OADP Plugin / HCPEtcdBackup ASCII art) to include a plain text specifier such as
```text or ```plaintext so the linter stops flagging it and the diagram
formatting is preserved.

218-225: Add language specifier to storage path code blocks.

The storage layout examples at lines 218-225 are flagged by markdownlint for missing language specifiers. Using text would resolve this.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.md` around lines
218 - 225, Update the two fenced code blocks showing the S3 storage layout /
example so they include a language specifier (use "text") after the opening
backticks; locate the blocks containing the lines that start with
"s3://<bucket>/<bsl-prefix>/backups/<backup-name>/etcd-backup/<timestamp>.db"
and the subsequent "Example:" block in HCPEtcdBackup-implementation.md and
change the opening "```" to "```text" for both blocks to satisfy markdownlint.
pkg/core/backup.go (1)

275-293: Consider caching completion state to avoid redundant polling.

waitForEtcdBackupCompletion may be called from both HCP and HC cases. The second call will re-poll the HCPEtcdBackup CR even though it already succeeded. While this works correctly (the poll returns immediately), adding a completionWaited bool flag would eliminate the redundant API call.

♻️ Optional: Add completion state tracking
 type BackupPlugin struct {
     // ...
     etcdOrchestrator  *etcdbackup.Orchestrator
+    etcdBackupDone    bool
     // ...
 }

 func (p *BackupPlugin) waitForEtcdBackupCompletion(ctx context.Context) error {
-    if p.etcdOrchestrator == nil || !p.etcdOrchestrator.IsCreated() {
+    if p.etcdOrchestrator == nil || !p.etcdOrchestrator.IsCreated() || p.etcdBackupDone {
         return nil
     }

     snapshotURL, err := p.etcdOrchestrator.WaitForCompletion(ctx)
     if err != nil {
         return fmt.Errorf("HCPEtcdBackup failed: %v", err)
     }
     p.log.Infof("HCPEtcdBackup completed, snapshotURL: %s", snapshotURL)
+    p.etcdBackupDone = true

     // ...
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/core/backup.go` around lines 275 - 293, Add a cached completion flag on
the BackupPlugin (e.g., a bool field completionWaited) and update
waitForEtcdBackupCompletion to return early if that flag is true; when calling
etcdOrchestrator.WaitForCompletion and it succeeds, set completionWaited = true
and still call CleanupCredentialSecret as before. If waitForEtcdBackupCompletion
can be invoked concurrently, protect completionWaited with a mutex or use an
atomic to avoid races; reference the BackupPlugin struct,
waitForEtcdBackupCompletion, etcdOrchestrator.WaitForCompletion and
etdcOrchestrator.CleanupCredentialSecret when locating where to add the flag and
guard.
pkg/common/types.go (1)

47-51: Remove the unused EtcdBackupSucceeded constant.

While BackupInProgressReason (line 49) and BackupRejectedReason (line 50) are actively used in the orchestrator logic, EtcdBackupSucceeded is defined but never referenced. Removing this unused constant will reduce clutter.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/common/types.go` around lines 47 - 51, Remove the unused
EtcdBackupSucceeded constant from the local constants block so only the actively
used BackupInProgressReason and BackupRejectedReason remain; delete the line
defining EtcdBackupSucceeded and ensure no other code references that symbol
(search for "EtcdBackupSucceeded") and remove or update any usages if found,
leaving the TODO comment and the two used constants intact.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/common/utils.go`:
- Around line 174-175: The code currently swallows errors from the Kubernetes
client List call (c.List(ctx, hcList, crclient.InNamespace(ns))) by continuing
on any error, which hides API/RBAC/transient failures; change the behavior so
that if c.List returns a non-nil error you return/propagate that error (wrap
with context mentioning the namespace) instead of continuing, while keeping the
existing logic that continues only when the list is empty; apply the same change
to the other identical c.List occurrence at the referenced location (both places
should return the error rather than falling through to nil,nil).

---

Nitpick comments:
In `@docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.md`:
- Around line 39-67: The ASCII diagram code block is missing a language
specifier; update the opening triple-backtick that begins the diagram (the block
containing the OADP Plugin / HCPEtcdBackup ASCII art) to include a plain text
specifier such as ```text or ```plaintext so the linter stops flagging it and
the diagram formatting is preserved.
- Around line 218-225: Update the two fenced code blocks showing the S3 storage
layout / example so they include a language specifier (use "text") after the
opening backticks; locate the blocks containing the lines that start with
"s3://<bucket>/<bsl-prefix>/backups/<backup-name>/etcd-backup/<timestamp>.db"
and the subsequent "Example:" block in HCPEtcdBackup-implementation.md and
change the opening "```" to "```text" for both blocks to satisfy markdownlint.

In `@pkg/common/types.go`:
- Around line 47-51: Remove the unused EtcdBackupSucceeded constant from the
local constants block so only the actively used BackupInProgressReason and
BackupRejectedReason remain; delete the line defining EtcdBackupSucceeded and
ensure no other code references that symbol (search for "EtcdBackupSucceeded")
and remove or update any usages if found, leaving the TODO comment and the two
used constants intact.

In `@pkg/core/backup.go`:
- Around line 275-293: Add a cached completion flag on the BackupPlugin (e.g., a
bool field completionWaited) and update waitForEtcdBackupCompletion to return
early if that flag is true; when calling etcdOrchestrator.WaitForCompletion and
it succeeds, set completionWaited = true and still call CleanupCredentialSecret
as before. If waitForEtcdBackupCompletion can be invoked concurrently, protect
completionWaited with a mutex or use an atomic to avoid races; reference the
BackupPlugin struct, waitForEtcdBackupCompletion,
etcdOrchestrator.WaitForCompletion and etdcOrchestrator.CleanupCredentialSecret
when locating where to add the flag and guard.

In `@pkg/core/validation/backup_test.go`:
- Around line 24-27: Add table-driven test entries to the existing
backup_test.go table that exercise the new special-cased keys so they don't
regress: add a happy-path case with config key "etcdBackupMethod" set to a valid
value and another with "hoNamespace" set to a valid value, and assert
ValidatePluginConfig accepts them (same expectations as the existing "migration"
case). Locate the test that calls ValidatePluginConfig in this file (the
table-driven cases around the "migration" entry) and append two entries
mirroring that structure to cover those keys.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 58ba492b-4bde-4887-b6d3-61a9323151c4

📥 Commits

Reviewing files that changed from the base of the PR and between 3c34d79 and 9a0a887.

⛔ Files ignored due to path filters (25)
  • go.sum is excluded by !**/*.sum
  • vendor/github.com/openshift/api/config/v1/types_apiserver.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/api/config/v1/types_authentication.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/api/config/v1/types_infrastructure.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/api/config/v1/types_insights.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/api/config/v1/types_network.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/api/config/v1/types_tlssecurityprofile.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/api/config/v1/zz_generated.featuregated-crd-manifests.yaml is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/api/config/v1/zz_generated.swagger_doc_generated.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/api/operator/v1/types_network.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/api/operator/v1/zz_generated.featuregated-crd-manifests.yaml is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/azure.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/azureprivatelinkservice_types.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/controlplaneversion_types.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/etcdbackup_types.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/gcp.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/gcpprivateserviceconnect_types.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/hosted_controlplane.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/hostedcluster_conditions.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/hostedcluster_types.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/nodepool_types.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/operator.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/zz_generated.deepcopy.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests.yaml is excluded by !vendor/**, !**/vendor/**
  • vendor/modules.txt is excluded by !vendor/**, !**/vendor/**
📒 Files selected for processing (16)
  • docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.md
  • go.mod
  • pkg/common/scheme.go
  • pkg/common/types.go
  • pkg/common/utils.go
  • pkg/common/utils_test.go
  • pkg/core/backup.go
  • pkg/core/restore.go
  • pkg/core/restore_test.go
  • pkg/core/types/types.go
  • pkg/core/validation/backup.go
  • pkg/core/validation/backup_test.go
  • pkg/core/validation/restore.go
  • pkg/etcdbackup/orchestrator.go
  • pkg/etcdbackup/orchestrator_test.go
  • pkg/platform/aws/aws.go
💤 Files with no reviewable changes (1)
  • pkg/platform/aws/aws.go

@jparrill
Copy link
Copy Markdown
Contributor Author

jparrill commented Apr 7, 2026

/hold until Dep PRs got merged

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 7, 2026
@jparrill
Copy link
Copy Markdown
Contributor Author

jparrill commented Apr 7, 2026

/label tide/merge-method-squash

@openshift-ci openshift-ci bot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Apr 7, 2026
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
pkg/core/backup.go (1)

145-151: Consider guarding against nil p.hcp before calling createEtcdBackup.

If the p.hcp lookup at lines 133-143 returns nil (e.g., on IsNotFound), the code returns early. However, between the early return and line 147, there's no explicit check that p.hcp is non-nil. Currently the flow is safe because the early return handles this, but the logic would be clearer and more defensive if the createEtcdBackup call were inside the p.hcp != nil block or had an explicit guard.

This is a minor readability point given the current flow is correct.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/core/backup.go` around lines 145 - 151, Move or guard the etcd backup
creation so it only runs when p.hcp is non-nil: ensure the call to
p.createEtcdBackup(ctx, backup) is executed inside the same block that verifies
p.hcp (or add an explicit if p.hcp != nil check) before checking
p.etcdBackupMethod == common.EtcdBackupMethodEtcdSnapshot; this keeps the logic
defensive and makes it clear createEtcdBackup is only invoked when p.hcp is
present.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/etcdbackup/orchestrator.go`:
- Around line 109-131: In VerifyInProgress (used by pollCondition) replace the
incorrect local constant check common.BackupInProgressReason with the vendored
hyperv1.BackupAlreadyInProgressReason so the switch will match the controller's
"BackupAlreadyInProgress" reason; also audit the case referencing
common.BackupRejectedReason (which may not exist in the vendored HyperShift
constants) and either remove that case or replace it with the correct vendored
constant if the controller actually emits a rejection reason, ensuring all
switch cases use vendored hyperv1 reason constants.

---

Nitpick comments:
In `@pkg/core/backup.go`:
- Around line 145-151: Move or guard the etcd backup creation so it only runs
when p.hcp is non-nil: ensure the call to p.createEtcdBackup(ctx, backup) is
executed inside the same block that verifies p.hcp (or add an explicit if p.hcp
!= nil check) before checking p.etcdBackupMethod ==
common.EtcdBackupMethodEtcdSnapshot; this keeps the logic defensive and makes it
clear createEtcdBackup is only invoked when p.hcp is present.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ce55502d-e414-402b-b509-59247817656b

📥 Commits

Reviewing files that changed from the base of the PR and between 9a0a887 and 4052ecd.

📒 Files selected for processing (15)
  • docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.md
  • pkg/common/scheme.go
  • pkg/common/types.go
  • pkg/common/utils.go
  • pkg/common/utils_test.go
  • pkg/core/backup.go
  • pkg/core/restore.go
  • pkg/core/restore_test.go
  • pkg/core/types/types.go
  • pkg/core/validation/backup.go
  • pkg/core/validation/backup_test.go
  • pkg/core/validation/restore.go
  • pkg/etcdbackup/orchestrator.go
  • pkg/etcdbackup/orchestrator_test.go
  • pkg/platform/aws/aws.go
💤 Files with no reviewable changes (1)
  • pkg/platform/aws/aws.go
✅ Files skipped from review due to trivial changes (7)
  • pkg/common/scheme.go
  • pkg/core/validation/backup_test.go
  • pkg/core/restore.go
  • pkg/core/restore_test.go
  • pkg/common/utils_test.go
  • pkg/common/types.go
  • pkg/etcdbackup/orchestrator_test.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • pkg/core/validation/backup.go
  • pkg/core/types/types.go
  • pkg/common/utils.go

Comment on lines +109 to +131
// VerifyInProgress polls the HCPEtcdBackup until the controller acknowledges it.
func (o *Orchestrator) VerifyInProgress(ctx context.Context) error {
return o.pollCondition(ctx, verifyTimeout, func(cond *metav1.Condition) (bool, error) {
if cond == nil {
return false, nil // no condition yet, keep polling
}
switch cond.Reason {
case common.BackupInProgressReason:
o.log.Info("HCPEtcdBackup is in progress")
return true, nil
case hyperv1.BackupSucceededReason:
o.log.Info("HCPEtcdBackup already succeeded")
return true, nil
case hyperv1.BackupFailedReason:
return false, fmt.Errorf("HCPEtcdBackup failed: %s", cond.Message)
case common.BackupRejectedReason:
return false, fmt.Errorf("HCPEtcdBackup rejected: %s", cond.Message)
case hyperv1.EtcdUnhealthyReason:
return false, fmt.Errorf("etcd unhealthy: %s", cond.Message)
}
return false, nil
})
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify what condition reasons the HCPEtcdBackup controller actually sets

echo "=== Vendored HyperShift condition constants ==="
rg -n "Reason\s*=\s*" vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/etcdbackup_types.go

echo ""
echo "=== Local constants in common/types.go ==="
rg -n "BackupInProgressReason|BackupRejectedReason|EtcdBackupSucceeded" pkg/common/types.go

echo ""
echo "=== How VerifyInProgress uses these constants ==="
rg -nA2 "case.*BackupInProgressReason|case.*BackupRejectedReason" pkg/etcdbackup/orchestrator.go

Repository: openshift/hypershift-oadp-plugin

Length of output: 786


🏁 Script executed:

# First, verify the vendored file exists
find vendor -name "*etcdbackup*" -type f | head -20

Repository: openshift/hypershift-oadp-plugin

Length of output: 157


🏁 Script executed:

# Search for reason constants in the entire vendor/github.com/openshift/hypershift directory
rg "BackupAlreadyInProgress|BackupInProgress" vendor/github.com/openshift/hypershift/ -n | head -30

Repository: openshift/hypershift-oadp-plugin

Length of output: 226


🏁 Script executed:

# Also check what file structure exists for hypershift
ls -la vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/ 2>/dev/null | grep -i backup || echo "Directory not found, searching for backup files:"
find vendor/github.com/openshift -name "*backup*" -type f

Repository: openshift/hypershift-oadp-plugin

Length of output: 228


🏁 Script executed:

# Read the relevant section of vendored etcdbackup_types.go
sed -n '20,80p' vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/etcdbackup_types.go | cat -n

Repository: openshift/hypershift-oadp-plugin

Length of output: 3723


🏁 Script executed:

# Search for all *Reason constants in the vendored file
rg "Reason\s*string\s*=" vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/etcdbackup_types.go -n

Repository: openshift/hypershift-oadp-plugin

Length of output: 322


🏁 Script executed:

# Check orchestrator.go to see all the cases it handles
sed -n '109,131p' pkg/etcdbackup/orchestrator.go | cat -n

Repository: openshift/hypershift-oadp-plugin

Length of output: 1131


Fix condition reason mismatch: use hyperv1.BackupAlreadyInProgressReason instead of local constant.

The local constant common.BackupInProgressReason = "BackupInProgress" doesn't match the vendored HyperShift controller's BackupAlreadyInProgressReason = "BackupAlreadyInProgress". When the controller sets the condition with Reason: "BackupAlreadyInProgress", this switch case won't match and polling will continue indefinitely.

Additionally, common.BackupRejectedReason = "BackupRejected" is not defined in the vendored HyperShift constants—verify whether the controller actually sets this reason or remove the dead code.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/etcdbackup/orchestrator.go` around lines 109 - 131, In VerifyInProgress
(used by pollCondition) replace the incorrect local constant check
common.BackupInProgressReason with the vendored
hyperv1.BackupAlreadyInProgressReason so the switch will match the controller's
"BackupAlreadyInProgress" reason; also audit the case referencing
common.BackupRejectedReason (which may not exist in the vendored HyperShift
constants) and either remove that case or replace it with the correct vendored
constant if the controller actually emits a rejection reason, ensuring all
switch cases use vendored hyperv1 reason constants.

@jparrill jparrill changed the title CNTRLPLANE-2685: integrate HCPEtcdBackup lifecycle into OADP backup/restore CNTRLPLANE-2685: integrate HCPEtcdBackup lifecycle into OADP backup flow Apr 8, 2026
@openshift-ci-robot
Copy link
Copy Markdown

openshift-ci-robot commented Apr 8, 2026

@jparrill: This pull request references CNTRLPLANE-2685 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Summary

Adds etcdSnapshot as a new etcd backup method for HyperShift HostedClusters. This PR covers only the backup flow; the restore flow (with pre-signed URL generation) will follow in a separate PR.

  • Introduces etcdBackupMethod configuration key (volumeSnapshot default, etcdSnapshot for native etcd snapshots)
  • During backup, the plugin creates an HCPEtcdBackup CR that triggers a native etcdctl snapshot save and uploads the snapshot to the Velero BSL object store
  • Bridges Velero BSL credentials to the HCPEtcdBackup controller by copying and remapping the credential Secret
  • Excludes etcd pods from the backup when using etcdSnapshot to prevent CSI VolumeSnapshots
  • Stores the snapshot under Velero's backup directory structure ({prefix}/backups/{backup-name}/etcd-backup/)

Note: The restore flow (reading lastSuccessfulEtcdBackupURL from HC status and generating a pre-signed HTTPS URL for restoreSnapshotURL) will be implemented in a follow-up PR.

Dependencies

This PR depends 100% on the following HyperShift PRs being merged:

Once both are merged, a vendor update is required to remove local constants (BackupInProgressReason, BackupRejectedReason) in favor of API-defined ones.

Related

Changes

Commit Description
fix(deps) Update HyperShift API vendor to latest main (brings in HCPEtcdBackup types)
feat(backup) Integrate HCPEtcdBackup lifecycle: BSL mapping, credential copy, CR creation, polling, pod exclusion
docs Full implementation reference with architecture, flows, configuration, and troubleshooting

Test plan

  • Unit tests for orchestrator: BSL-to-storage mapping (AWS, Azure, unsupported), credential Secret copy and key remapping, encryption field injection, CRD existence check
  • Unit tests for validation: etcdBackupMethod and hoNamespace recognized as valid config keys
  • go build ./... passes
  • go test ./... passes
  • E2E: deploy plugin with etcdBackupMethod=etcdSnapshot, run OADP backup, verify HCPEtcdBackup CR created and snapshot uploaded
  • E2E: verify volumeSnapshot method (default) behavior is unchanged

🤖 Generated with Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.md (1)

40-40: Consider adding language identifiers to fenced code blocks.

While not impacting functionality, adding language identifiers would resolve linter warnings and improve syntax highlighting:

  • Line 40: ASCII diagram (consider text or no language)
  • Lines 219, 224: Path structures (consider text or plaintext)
  • Line 456: JSON output example (consider json)

Also applies to: 219-219, 224-224, 456-456

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.md` at line 40,
Add explicit language identifiers to the fenced code blocks: mark the ASCII
diagram block as ```text (or leave blank) to satisfy the linter, mark the two
path-structure blocks as ```text or ```plaintext, and mark the JSON output
example as ```json so syntax highlighting and linter warnings are resolved;
update the corresponding fenced code block delimiters for the ASCII diagram, the
path structure examples, and the JSON output example accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.md`:
- Line 40: Add explicit language identifiers to the fenced code blocks: mark the
ASCII diagram block as ```text (or leave blank) to satisfy the linter, mark the
two path-structure blocks as ```text or ```plaintext, and mark the JSON output
example as ```json so syntax highlighting and linter warnings are resolved;
update the corresponding fenced code block delimiters for the ASCII diagram, the
path structure examples, and the JSON output example accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ab1e9bd0-2870-462b-83b6-33b72b9cd059

📥 Commits

Reviewing files that changed from the base of the PR and between 4052ecd and 3cf9dfe.

📒 Files selected for processing (1)
  • docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.md

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
pkg/etcdbackup/orchestrator.go (1)

173-190: Credential Secret cleanup should occur on failure paths too.

Based on the integration in pkg/core/backup.go, CleanupCredentialSecret is only called after WaitForCompletion succeeds. If the backup fails or times out, the copied credential Secret remains in the HO namespace. While the Secret contains an STS IAM Role ARN (not rotatable keys), accumulating orphaned Secrets is undesirable.

Consider calling cleanup in a defer or ensuring it's called on all exit paths.

♻️ Suggested approach in backup.go
// In waitForEtcdBackupCompletion, defer cleanup regardless of outcome:
func (p *BackupPlugin) waitForEtcdBackupCompletion(ctx context.Context) error {
	if p.etcdOrchestrator == nil || !p.etcdOrchestrator.IsCreated() {
		return nil
	}

+	// Always attempt cleanup, even on failure
+	defer func() {
+		if cleanupErr := p.etcdOrchestrator.CleanupCredentialSecret(ctx); cleanupErr != nil {
+			p.log.Warnf("Failed to cleanup etcd backup credential Secret: %v", cleanupErr)
+		}
+	}()

	if p.etcdSnapshotURL != "" {
		return nil
	}

	snapshotURL, err := p.etcdOrchestrator.WaitForCompletion(ctx)
	if err != nil {
		return fmt.Errorf("HCPEtcdBackup failed: %v", err)
	}
	p.etcdSnapshotURL = snapshotURL
	p.log.Infof("HCPEtcdBackup completed, snapshotURL: %s", snapshotURL)

-	if cleanupErr := p.etcdOrchestrator.CleanupCredentialSecret(ctx); cleanupErr != nil {
-		p.log.Warnf("Failed to cleanup etcd backup credential Secret: %v", cleanupErr)
-	}

	return nil
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/etcdbackup/orchestrator.go` around lines 173 - 190,
CleanupCredentialSecret is only invoked after WaitForCompletion succeeds,
leaving orphaned Secrets on failure/timeouts; update waitForEtcdBackupCompletion
(method on BackupPlugin) to always attempt cleanup by deferring a call to
etcdOrchestrator.CleanupCredentialSecret(ctx) right after verifying
etcdOrchestrator.IsCreated(), so the Secret is removed on all exit paths
(success, error, timeout); ensure the deferred call logs any cleanup error (use
p.log.Warnf or similar) and continues to return the original error from
WaitForCompletion.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@pkg/etcdbackup/orchestrator.go`:
- Around line 173-190: CleanupCredentialSecret is only invoked after
WaitForCompletion succeeds, leaving orphaned Secrets on failure/timeouts; update
waitForEtcdBackupCompletion (method on BackupPlugin) to always attempt cleanup
by deferring a call to etcdOrchestrator.CleanupCredentialSecret(ctx) right after
verifying etcdOrchestrator.IsCreated(), so the Secret is removed on all exit
paths (success, error, timeout); ensure the deferred call logs any cleanup error
(use p.log.Warnf or similar) and continues to return the original error from
WaitForCompletion.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d116e2dc-774f-4faa-9b51-39b2dd034bae

📥 Commits

Reviewing files that changed from the base of the PR and between 3cf9dfe and d96df42.

📒 Files selected for processing (13)
  • docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.md
  • pkg/common/scheme.go
  • pkg/common/types.go
  • pkg/common/utils.go
  • pkg/common/utils_test.go
  • pkg/core/backup.go
  • pkg/core/types/types.go
  • pkg/core/validation/backup.go
  • pkg/core/validation/backup_test.go
  • pkg/core/validation/restore.go
  • pkg/etcdbackup/orchestrator.go
  • pkg/etcdbackup/orchestrator_test.go
  • pkg/platform/aws/aws.go
💤 Files with no reviewable changes (1)
  • pkg/platform/aws/aws.go
✅ Files skipped from review due to trivial changes (4)
  • pkg/common/utils_test.go
  • pkg/core/validation/backup_test.go
  • pkg/common/types.go
  • pkg/etcdbackup/orchestrator_test.go
🚧 Files skipped from review as they are similar to previous changes (6)
  • pkg/common/scheme.go
  • pkg/core/types/types.go
  • pkg/common/utils.go
  • pkg/core/validation/restore.go
  • pkg/core/validation/backup.go
  • pkg/core/backup.go

jparrill and others added 2 commits April 8, 2026 16:50
Add etcdSnapshot backup method that creates and monitors HCPEtcdBackup CRs
during Velero backup. When etcdBackupMethod=etcdSnapshot is configured in the
plugin ConfigMap, the plugin:

- Creates an HCPEtcdBackup CR in the HCP namespace using BSL storage config
- Copies BSL credentials to the HO namespace (remapping key for controller)
- Polls the CR until backup completes or fails
- Excludes etcd pods and PVCs from Velero backup (no CSI/FS backup needed)
- Stores the etcd snapshot alongside the Velero backup data in the BSL

The default method remains volumeSnapshot (unchanged behavior).

Also cleans up dead config parameters (readoptNodes, managedServices,
awsRegenPrivateLink) and registers apiextensionsv1 in the scheme for
CRD existence checks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Juan Manuel Parrilla Madrid <jparrill@redhat.com>
Document the full HCPEtcdBackup integration including architecture,
backup/restore flows, configuration, credential handling, storage layout,
dependency chain (PRs #8139, #8010, #8017, #8040, #8114, enhancement #1945),
and troubleshooting guide.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Juan Manuel Parrilla Madrid <jparrill@redhat.com>
@openshift-ci
Copy link
Copy Markdown

openshift-ci bot commented Apr 8, 2026

@jparrill: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

♻️ Duplicate comments (1)
pkg/etcdbackup/orchestrator.go (1)

115-126: ⚠️ Potential issue | 🟠 Major

Poll the vendored HyperShift reasons, not local placeholders.

The current switches still look for common.BackupInProgressReason/common.BackupRejectedReason, while the vendored API exposes BackupAlreadyInProgress, BackupFailed, BackupSucceeded, and EtcdUnhealthy. WaitForCompletion() also drops EtcdUnhealthyReason, so those states can sit in the poll loop until timeout.

Run this to compare the vendored constants with the current polling branches:

#!/bin/bash
set -euo pipefail

echo "=== vendored HCPEtcdBackup condition reasons ==="
sed -n '1,20p' vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/etcdbackup_types.go | nl -ba

echo
echo "=== current polling branches ==="
sed -n '109,165p' pkg/etcdbackup/orchestrator.go | nl -ba

echo
echo "=== local reason constants referenced here ==="
rg -n 'BackupInProgressReason|BackupRejectedReason|BackupAlreadyInProgressReason|EtcdUnhealthyReason' pkg/common/types.go pkg/etcdbackup/orchestrator.go

Also applies to: 155-160

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/etcdbackup/orchestrator.go` around lines 115 - 126, The switch over
cond.Reason in the poll loop is using local placeholders
(common.BackupInProgressReason, common.BackupRejectedReason) instead of the
vendored HyperShift condition reason constants; update the cases to use the
vendored names (hyperv1.BackupAlreadyInProgress, hyperv1.BackupSucceeded,
hyperv1.BackupFailed, hyperv1.EtcdUnhealthy) so the poll loop and
WaitForCompletion() observe the same semantics, and remove/replace any
references to common.BackupInProgressReason/common.BackupRejectedReason (also
adjust the similar branch around lines 155-160) so all checks consistently
reference hyperv1 constants.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.md`:
- Around line 117-127: Update the paragraph to say the restore reads the etcd
snapshot URL from the HostedCluster annotation
hypershift.openshift.io/etcd-snapshot-url (not
status.lastSuccessfulEtcdBackupURL); explain that pkg/core/backup.go persists
the URL into that annotation because Velero strips status on restore, and update
the note about unstructured access to clarify we read the annotation until the
HyperShift API vendor adds a formal status field.

In `@pkg/core/backup.go`:
- Around line 146-151: The current logic calls p.createEtcdBackup as soon as any
HCP resolves (guarded only by p.etcdBackupMethod), which can start HCPEtcdBackup
before platform validation and without a guaranteed wait/cleanup path; change
this so HCPEtcdBackup is only created after platform validation and in a branch
that implements completion/cleanup (move or gate the call into the
HostedControlPlane/HostedCluster handling after validation succeeds), or add an
explicit validation check and registration of a wait/cleanup handler before
invoking p.createEtcdBackup; reference p.etcdBackupMethod, p.createEtcdBackup,
HCPEtcdBackup, HostedControlPlane and HostedCluster to locate and update the
code paths accordingly.
- Around line 284-289: When CreateEtcdBackup or VerifyInProgress fails after
copying credentials, ensure you delete the remapped BSL Secret in the host
operator namespace before returning; update the error paths after
p.etcdOrchestrator.CreateEtcdBackup(ctx, backup, p.hcp.Namespace, hc) and
p.etcdOrchestrator.VerifyInProgress(ctx) so they call the same cleanup that
waitForEtcdBackupCompletion() uses (or explicitly delete the remapped secret by
name in hoNamespace/p.hcp.Namespace) and then return the original error; make
the deletion idempotent (ignore NotFound) and log any deletion failure but still
return the original create/verify error.

In `@pkg/etcdbackup/orchestrator.go`:
- Around line 83-104: The code generates a random crName causing non-idempotent
creates; instead derive crName deterministically from the Velero backup (e.g.,
use backup.Name or backup.UID to build the CR name) and before calling
o.client.Create check for an existing HCPEtcdBackup by that deterministic name
via o.client.Get (or list by owner/label) so you reuse the existing CR on
re-entry; update crName, use o.client.Get/Create accordingly, and set
o.BackupName = crName when reusing or after successful create (referencing
variables crName, backup.Name, backup.UID, o.client.Get/Create, and
o.BackupName).

---

Duplicate comments:
In `@pkg/etcdbackup/orchestrator.go`:
- Around line 115-126: The switch over cond.Reason in the poll loop is using
local placeholders (common.BackupInProgressReason, common.BackupRejectedReason)
instead of the vendored HyperShift condition reason constants; update the cases
to use the vendored names (hyperv1.BackupAlreadyInProgress,
hyperv1.BackupSucceeded, hyperv1.BackupFailed, hyperv1.EtcdUnhealthy) so the
poll loop and WaitForCompletion() observe the same semantics, and remove/replace
any references to common.BackupInProgressReason/common.BackupRejectedReason
(also adjust the similar branch around lines 155-160) so all checks consistently
reference hyperv1 constants.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e6bca3fb-03c5-479b-9731-d1f5ead6e10a

📥 Commits

Reviewing files that changed from the base of the PR and between d96df42 and eb942f0.

📒 Files selected for processing (13)
  • docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.md
  • pkg/common/scheme.go
  • pkg/common/types.go
  • pkg/common/utils.go
  • pkg/common/utils_test.go
  • pkg/core/backup.go
  • pkg/core/types/types.go
  • pkg/core/validation/backup.go
  • pkg/core/validation/backup_test.go
  • pkg/core/validation/restore.go
  • pkg/etcdbackup/orchestrator.go
  • pkg/etcdbackup/orchestrator_test.go
  • pkg/platform/aws/aws.go
💤 Files with no reviewable changes (1)
  • pkg/platform/aws/aws.go
✅ Files skipped from review due to trivial changes (3)
  • pkg/core/validation/backup_test.go
  • pkg/common/utils_test.go
  • pkg/common/types.go
🚧 Files skipped from review as they are similar to previous changes (5)
  • pkg/core/types/types.go
  • pkg/common/utils.go
  • pkg/core/validation/restore.go
  • pkg/common/scheme.go
  • pkg/core/validation/backup.go

Comment on lines +117 to +127
1. When the `HostedCluster` item is processed during restore, the plugin reads `status.lastSuccessfulEtcdBackupURL` from the HC's unstructured content.

2. If the URL is present and the HC has managed etcd (`spec.etcd.managed != nil`), the plugin injects the URL into `spec.etcd.managed.storage.restoreSnapshotURL`.

3. The modified HC is written back to Velero's output, so when the HC is created in the target cluster, the HyperShift Operator uses the snapshot URL to restore etcd from the snapshot.

### No Bidirectional Tracking

The previous approach required tracking both the `HCPEtcdBackup` CR and the `HostedCluster` item arrival order. With `lastSuccessfulEtcdBackupURL` persisted in the HC status by the HCPEtcdBackup controller, the restore flow is stateless — everything needed is in the HC object itself.

> **Note**: The `lastSuccessfulEtcdBackupURL` field is read via unstructured map access until the HyperShift API vendor is updated to include it (tracked in CNTRLPLANE-3173).
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Document the restore handoff as the annotation, not status.

This section says restore reads status.lastSuccessfulEtcdBackupURL, but pkg/core/backup.go Lines 179-199 explicitly persist the URL in hypershift.openshift.io/etcd-snapshot-url because Velero strips status during restore. As written, this documents the wrong contract for the follow-on restore flow.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.md` around lines
117 - 127, Update the paragraph to say the restore reads the etcd snapshot URL
from the HostedCluster annotation hypershift.openshift.io/etcd-snapshot-url (not
status.lastSuccessfulEtcdBackupURL); explain that pkg/core/backup.go persists
the URL into that annotation because Velero strips status on restore, and update
the note about unstructured access to clarify we read the annotation until the
HyperShift API vendor adds a formal status field.

Comment on lines +146 to +151
// Etcd backup: create HCPEtcdBackup CR as early as possible (once).
// Only when etcdBackupMethod is "etcdSnapshot".
if p.etcdBackupMethod == common.EtcdBackupMethodEtcdSnapshot {
if err := p.createEtcdBackup(ctx, backup); err != nil {
return nil, nil, fmt.Errorf("error creating HCPEtcdBackup: %v", err)
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Don't start the snapshot workflow before validation and a guaranteed wait path.

This fires as soon as any item resolves an HCP, but platform validation only happens later in the HostedControlPlane branch and completion/cleanup only happen in the HostedControlPlane/HostedCluster branches. That means an invalid or partial backup can still launch HCPEtcdBackup and copy credentials with no later path that waits for it or cleans it up.

Also applies to: 162-185

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/core/backup.go` around lines 146 - 151, The current logic calls
p.createEtcdBackup as soon as any HCP resolves (guarded only by
p.etcdBackupMethod), which can start HCPEtcdBackup before platform validation
and without a guaranteed wait/cleanup path; change this so HCPEtcdBackup is only
created after platform validation and in a branch that implements
completion/cleanup (move or gate the call into the
HostedControlPlane/HostedCluster handling after validation succeeds), or add an
explicit validation check and registration of a wait/cleanup handler before
invoking p.createEtcdBackup; reference p.etcdBackupMethod, p.createEtcdBackup,
HCPEtcdBackup, HostedControlPlane and HostedCluster to locate and update the
code paths accordingly.

Comment on lines +284 to +289
if err := p.etcdOrchestrator.CreateEtcdBackup(ctx, backup, p.hcp.Namespace, hc); err != nil {
return err
}

if err := p.etcdOrchestrator.VerifyInProgress(ctx); err != nil {
return err
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Clean up the copied Secret on create/verify failures.

The only cleanup today is in waitForEtcdBackupCompletion(). If either call here fails after the credentials were copied, the remapped BSL Secret is left behind in hoNamespace.

Possible fix
 	if err := p.etcdOrchestrator.CreateEtcdBackup(ctx, backup, p.hcp.Namespace, hc); err != nil {
+		if cleanupErr := p.etcdOrchestrator.CleanupCredentialSecret(ctx); cleanupErr != nil {
+			p.log.Warnf("Failed to cleanup etcd backup credential Secret after create error: %v", cleanupErr)
+		}
 		return err
 	}
 
 	if err := p.etcdOrchestrator.VerifyInProgress(ctx); err != nil {
+		if cleanupErr := p.etcdOrchestrator.CleanupCredentialSecret(ctx); cleanupErr != nil {
+			p.log.Warnf("Failed to cleanup etcd backup credential Secret after verify error: %v", cleanupErr)
+		}
 		return err
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/core/backup.go` around lines 284 - 289, When CreateEtcdBackup or
VerifyInProgress fails after copying credentials, ensure you delete the remapped
BSL Secret in the host operator namespace before returning; update the error
paths after p.etcdOrchestrator.CreateEtcdBackup(ctx, backup, p.hcp.Namespace,
hc) and p.etcdOrchestrator.VerifyInProgress(ctx) so they call the same cleanup
that waitForEtcdBackupCompletion() uses (or explicitly delete the remapped
secret by name in hoNamespace/p.hcp.Namespace) and then return the original
error; make the deletion idempotent (ignore NotFound) and log any deletion
failure but still return the original create/verify error.

Comment on lines +83 to +104
crName := fmt.Sprintf("oadp-%s-%s", backup.Name, utilrand.String(4))
etcdBackup := &hyperv1.HCPEtcdBackup{
ObjectMeta: metav1.ObjectMeta{
Name: crName,
Namespace: hcpNamespace,
},
Spec: hyperv1.HCPEtcdBackupSpec{
Storage: *storage,
},
}

if err := o.client.Create(ctx, etcdBackup); err != nil {
if apierrors.IsAlreadyExists(err) {
o.log.Infof("HCPEtcdBackup %s/%s already exists, reusing", hcpNamespace, crName)
} else {
return fmt.Errorf("failed to create HCPEtcdBackup: %w", err)
}
} else {
o.log.Infof("Created HCPEtcdBackup %s/%s", hcpNamespace, crName)
}

o.BackupName = crName
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Random CR names make this create path non-idempotent.

crName changes on every call, so the AlreadyExists branch only helps on a name collision. If this method is re-entered with a fresh in-memory state, the same Velero backup can create multiple HCPEtcdBackup CRs instead of reusing the first one.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/etcdbackup/orchestrator.go` around lines 83 - 104, The code generates a
random crName causing non-idempotent creates; instead derive crName
deterministically from the Velero backup (e.g., use backup.Name or backup.UID to
build the CR name) and before calling o.client.Create check for an existing
HCPEtcdBackup by that deterministic name via o.client.Get (or list by
owner/label) so you reuse the existing CR on re-entry; update crName, use
o.client.Get/Create accordingly, and set o.BackupName = crName when reusing or
after successful create (referencing variables crName, backup.Name, backup.UID,
o.client.Get/Create, and o.BackupName).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants