CNTRLPLANE-2685: integrate HCPEtcdBackup lifecycle into OADP backup flow#238
CNTRLPLANE-2685: integrate HCPEtcdBackup lifecycle into OADP backup flow#238jparrill wants to merge 3 commits intoopenshift:mainfrom
Conversation
Update github.com/openshift/hypershift/api to v0.0.0-20260406110001-bcf6adaf131f. This brings in the HCPEtcdBackup CRD types, HCPEtcdBackupConfig in ManagedEtcdSpec.Backup, and related condition/reason constants needed for CNTRLPLANE-2685 integration. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Juan Manuel Parrilla Madrid <jparrill@redhat.com>
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jparrill The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughAdds HCPEtcdBackup-based etcd snapshot orchestration: new orchestrator creates/monitors HCPEtcdBackup CRs, maps Velero BSL to storage, copies/remaps credentials, injects snapshot URLs for restores, removes legacy config keys, updates types/scheme, and includes docs and tests. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
@jparrill: This pull request references CNTRLPLANE-2685 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (5)
pkg/core/validation/backup_test.go (1)
24-27: Cover the new config keys in this table
ValidatePluginConfignow special-casesetcdBackupMethodandhoNamespace, but this suite only exercisesmigration. Adding a happy-path entry for those keys would keep them from regressing back to the unknown-config path.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/core/validation/backup_test.go` around lines 24 - 27, Add table-driven test entries to the existing backup_test.go table that exercise the new special-cased keys so they don't regress: add a happy-path case with config key "etcdBackupMethod" set to a valid value and another with "hoNamespace" set to a valid value, and assert ValidatePluginConfig accepts them (same expectations as the existing "migration" case). Locate the test that calls ValidatePluginConfig in this file (the table-driven cases around the "migration" entry) and append two entries mirroring that structure to cover those keys.docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.md (2)
39-67: Consider adding language specifier to the ASCII diagram code block.Static analysis flagged this as missing a language specifier. Using
textorplaintextwould satisfy the linter while preserving the diagram formatting.📝 Suggested fix
-``` +```text OADP Plugin (BackupPlugin)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.md` around lines 39 - 67, The ASCII diagram code block is missing a language specifier; update the opening triple-backtick that begins the diagram (the block containing the OADP Plugin / HCPEtcdBackup ASCII art) to include a plain text specifier such as ```text or ```plaintext so the linter stops flagging it and the diagram formatting is preserved.
218-225: Add language specifier to storage path code blocks.The storage layout examples at lines 218-225 are flagged by markdownlint for missing language specifiers. Using
textwould resolve this.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.md` around lines 218 - 225, Update the two fenced code blocks showing the S3 storage layout / example so they include a language specifier (use "text") after the opening backticks; locate the blocks containing the lines that start with "s3://<bucket>/<bsl-prefix>/backups/<backup-name>/etcd-backup/<timestamp>.db" and the subsequent "Example:" block in HCPEtcdBackup-implementation.md and change the opening "```" to "```text" for both blocks to satisfy markdownlint.pkg/core/backup.go (1)
275-293: Consider caching completion state to avoid redundant polling.
waitForEtcdBackupCompletionmay be called from both HCP and HC cases. The second call will re-poll theHCPEtcdBackupCR even though it already succeeded. While this works correctly (the poll returns immediately), adding acompletionWaited boolflag would eliminate the redundant API call.♻️ Optional: Add completion state tracking
type BackupPlugin struct { // ... etcdOrchestrator *etcdbackup.Orchestrator + etcdBackupDone bool // ... } func (p *BackupPlugin) waitForEtcdBackupCompletion(ctx context.Context) error { - if p.etcdOrchestrator == nil || !p.etcdOrchestrator.IsCreated() { + if p.etcdOrchestrator == nil || !p.etcdOrchestrator.IsCreated() || p.etcdBackupDone { return nil } snapshotURL, err := p.etcdOrchestrator.WaitForCompletion(ctx) if err != nil { return fmt.Errorf("HCPEtcdBackup failed: %v", err) } p.log.Infof("HCPEtcdBackup completed, snapshotURL: %s", snapshotURL) + p.etcdBackupDone = true // ... }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/core/backup.go` around lines 275 - 293, Add a cached completion flag on the BackupPlugin (e.g., a bool field completionWaited) and update waitForEtcdBackupCompletion to return early if that flag is true; when calling etcdOrchestrator.WaitForCompletion and it succeeds, set completionWaited = true and still call CleanupCredentialSecret as before. If waitForEtcdBackupCompletion can be invoked concurrently, protect completionWaited with a mutex or use an atomic to avoid races; reference the BackupPlugin struct, waitForEtcdBackupCompletion, etcdOrchestrator.WaitForCompletion and etdcOrchestrator.CleanupCredentialSecret when locating where to add the flag and guard.pkg/common/types.go (1)
47-51: Remove the unusedEtcdBackupSucceededconstant.While
BackupInProgressReason(line 49) andBackupRejectedReason(line 50) are actively used in the orchestrator logic,EtcdBackupSucceededis defined but never referenced. Removing this unused constant will reduce clutter.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/common/types.go` around lines 47 - 51, Remove the unused EtcdBackupSucceeded constant from the local constants block so only the actively used BackupInProgressReason and BackupRejectedReason remain; delete the line defining EtcdBackupSucceeded and ensure no other code references that symbol (search for "EtcdBackupSucceeded") and remove or update any usages if found, leaving the TODO comment and the two used constants intact.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/common/utils.go`:
- Around line 174-175: The code currently swallows errors from the Kubernetes
client List call (c.List(ctx, hcList, crclient.InNamespace(ns))) by continuing
on any error, which hides API/RBAC/transient failures; change the behavior so
that if c.List returns a non-nil error you return/propagate that error (wrap
with context mentioning the namespace) instead of continuing, while keeping the
existing logic that continues only when the list is empty; apply the same change
to the other identical c.List occurrence at the referenced location (both places
should return the error rather than falling through to nil,nil).
---
Nitpick comments:
In `@docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.md`:
- Around line 39-67: The ASCII diagram code block is missing a language
specifier; update the opening triple-backtick that begins the diagram (the block
containing the OADP Plugin / HCPEtcdBackup ASCII art) to include a plain text
specifier such as ```text or ```plaintext so the linter stops flagging it and
the diagram formatting is preserved.
- Around line 218-225: Update the two fenced code blocks showing the S3 storage
layout / example so they include a language specifier (use "text") after the
opening backticks; locate the blocks containing the lines that start with
"s3://<bucket>/<bsl-prefix>/backups/<backup-name>/etcd-backup/<timestamp>.db"
and the subsequent "Example:" block in HCPEtcdBackup-implementation.md and
change the opening "```" to "```text" for both blocks to satisfy markdownlint.
In `@pkg/common/types.go`:
- Around line 47-51: Remove the unused EtcdBackupSucceeded constant from the
local constants block so only the actively used BackupInProgressReason and
BackupRejectedReason remain; delete the line defining EtcdBackupSucceeded and
ensure no other code references that symbol (search for "EtcdBackupSucceeded")
and remove or update any usages if found, leaving the TODO comment and the two
used constants intact.
In `@pkg/core/backup.go`:
- Around line 275-293: Add a cached completion flag on the BackupPlugin (e.g., a
bool field completionWaited) and update waitForEtcdBackupCompletion to return
early if that flag is true; when calling etcdOrchestrator.WaitForCompletion and
it succeeds, set completionWaited = true and still call CleanupCredentialSecret
as before. If waitForEtcdBackupCompletion can be invoked concurrently, protect
completionWaited with a mutex or use an atomic to avoid races; reference the
BackupPlugin struct, waitForEtcdBackupCompletion,
etcdOrchestrator.WaitForCompletion and etdcOrchestrator.CleanupCredentialSecret
when locating where to add the flag and guard.
In `@pkg/core/validation/backup_test.go`:
- Around line 24-27: Add table-driven test entries to the existing
backup_test.go table that exercise the new special-cased keys so they don't
regress: add a happy-path case with config key "etcdBackupMethod" set to a valid
value and another with "hoNamespace" set to a valid value, and assert
ValidatePluginConfig accepts them (same expectations as the existing "migration"
case). Locate the test that calls ValidatePluginConfig in this file (the
table-driven cases around the "migration" entry) and append two entries
mirroring that structure to cover those keys.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 58ba492b-4bde-4887-b6d3-61a9323151c4
⛔ Files ignored due to path filters (25)
go.sumis excluded by!**/*.sumvendor/github.com/openshift/api/config/v1/types_apiserver.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/api/config/v1/types_authentication.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/api/config/v1/types_infrastructure.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/api/config/v1/types_insights.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/api/config/v1/types_network.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/api/config/v1/types_tlssecurityprofile.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/api/config/v1/zz_generated.featuregated-crd-manifests.yamlis excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/api/config/v1/zz_generated.swagger_doc_generated.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/api/operator/v1/types_network.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/api/operator/v1/zz_generated.featuregated-crd-manifests.yamlis excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/azure.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/azureprivatelinkservice_types.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/controlplaneversion_types.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/etcdbackup_types.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/gcp.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/gcpprivateserviceconnect_types.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/hosted_controlplane.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/hostedcluster_conditions.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/hostedcluster_types.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/nodepool_types.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/operator.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/zz_generated.deepcopy.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests.yamlis excluded by!vendor/**,!**/vendor/**vendor/modules.txtis excluded by!vendor/**,!**/vendor/**
📒 Files selected for processing (16)
docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.mdgo.modpkg/common/scheme.gopkg/common/types.gopkg/common/utils.gopkg/common/utils_test.gopkg/core/backup.gopkg/core/restore.gopkg/core/restore_test.gopkg/core/types/types.gopkg/core/validation/backup.gopkg/core/validation/backup_test.gopkg/core/validation/restore.gopkg/etcdbackup/orchestrator.gopkg/etcdbackup/orchestrator_test.gopkg/platform/aws/aws.go
💤 Files with no reviewable changes (1)
- pkg/platform/aws/aws.go
|
/hold until Dep PRs got merged |
|
/label tide/merge-method-squash |
9a0a887 to
4052ecd
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
pkg/core/backup.go (1)
145-151: Consider guarding against nilp.hcpbefore callingcreateEtcdBackup.If the
p.hcplookup at lines 133-143 returnsnil(e.g., onIsNotFound), the code returns early. However, between the early return and line 147, there's no explicit check thatp.hcpis non-nil. Currently the flow is safe because the early return handles this, but the logic would be clearer and more defensive if thecreateEtcdBackupcall were inside thep.hcp != nilblock or had an explicit guard.This is a minor readability point given the current flow is correct.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/core/backup.go` around lines 145 - 151, Move or guard the etcd backup creation so it only runs when p.hcp is non-nil: ensure the call to p.createEtcdBackup(ctx, backup) is executed inside the same block that verifies p.hcp (or add an explicit if p.hcp != nil check) before checking p.etcdBackupMethod == common.EtcdBackupMethodEtcdSnapshot; this keeps the logic defensive and makes it clear createEtcdBackup is only invoked when p.hcp is present.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/etcdbackup/orchestrator.go`:
- Around line 109-131: In VerifyInProgress (used by pollCondition) replace the
incorrect local constant check common.BackupInProgressReason with the vendored
hyperv1.BackupAlreadyInProgressReason so the switch will match the controller's
"BackupAlreadyInProgress" reason; also audit the case referencing
common.BackupRejectedReason (which may not exist in the vendored HyperShift
constants) and either remove that case or replace it with the correct vendored
constant if the controller actually emits a rejection reason, ensuring all
switch cases use vendored hyperv1 reason constants.
---
Nitpick comments:
In `@pkg/core/backup.go`:
- Around line 145-151: Move or guard the etcd backup creation so it only runs
when p.hcp is non-nil: ensure the call to p.createEtcdBackup(ctx, backup) is
executed inside the same block that verifies p.hcp (or add an explicit if p.hcp
!= nil check) before checking p.etcdBackupMethod ==
common.EtcdBackupMethodEtcdSnapshot; this keeps the logic defensive and makes it
clear createEtcdBackup is only invoked when p.hcp is present.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: ce55502d-e414-402b-b509-59247817656b
📒 Files selected for processing (15)
docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.mdpkg/common/scheme.gopkg/common/types.gopkg/common/utils.gopkg/common/utils_test.gopkg/core/backup.gopkg/core/restore.gopkg/core/restore_test.gopkg/core/types/types.gopkg/core/validation/backup.gopkg/core/validation/backup_test.gopkg/core/validation/restore.gopkg/etcdbackup/orchestrator.gopkg/etcdbackup/orchestrator_test.gopkg/platform/aws/aws.go
💤 Files with no reviewable changes (1)
- pkg/platform/aws/aws.go
✅ Files skipped from review due to trivial changes (7)
- pkg/common/scheme.go
- pkg/core/validation/backup_test.go
- pkg/core/restore.go
- pkg/core/restore_test.go
- pkg/common/utils_test.go
- pkg/common/types.go
- pkg/etcdbackup/orchestrator_test.go
🚧 Files skipped from review as they are similar to previous changes (3)
- pkg/core/validation/backup.go
- pkg/core/types/types.go
- pkg/common/utils.go
| // VerifyInProgress polls the HCPEtcdBackup until the controller acknowledges it. | ||
| func (o *Orchestrator) VerifyInProgress(ctx context.Context) error { | ||
| return o.pollCondition(ctx, verifyTimeout, func(cond *metav1.Condition) (bool, error) { | ||
| if cond == nil { | ||
| return false, nil // no condition yet, keep polling | ||
| } | ||
| switch cond.Reason { | ||
| case common.BackupInProgressReason: | ||
| o.log.Info("HCPEtcdBackup is in progress") | ||
| return true, nil | ||
| case hyperv1.BackupSucceededReason: | ||
| o.log.Info("HCPEtcdBackup already succeeded") | ||
| return true, nil | ||
| case hyperv1.BackupFailedReason: | ||
| return false, fmt.Errorf("HCPEtcdBackup failed: %s", cond.Message) | ||
| case common.BackupRejectedReason: | ||
| return false, fmt.Errorf("HCPEtcdBackup rejected: %s", cond.Message) | ||
| case hyperv1.EtcdUnhealthyReason: | ||
| return false, fmt.Errorf("etcd unhealthy: %s", cond.Message) | ||
| } | ||
| return false, nil | ||
| }) | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify what condition reasons the HCPEtcdBackup controller actually sets
echo "=== Vendored HyperShift condition constants ==="
rg -n "Reason\s*=\s*" vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/etcdbackup_types.go
echo ""
echo "=== Local constants in common/types.go ==="
rg -n "BackupInProgressReason|BackupRejectedReason|EtcdBackupSucceeded" pkg/common/types.go
echo ""
echo "=== How VerifyInProgress uses these constants ==="
rg -nA2 "case.*BackupInProgressReason|case.*BackupRejectedReason" pkg/etcdbackup/orchestrator.goRepository: openshift/hypershift-oadp-plugin
Length of output: 786
🏁 Script executed:
# First, verify the vendored file exists
find vendor -name "*etcdbackup*" -type f | head -20Repository: openshift/hypershift-oadp-plugin
Length of output: 157
🏁 Script executed:
# Search for reason constants in the entire vendor/github.com/openshift/hypershift directory
rg "BackupAlreadyInProgress|BackupInProgress" vendor/github.com/openshift/hypershift/ -n | head -30Repository: openshift/hypershift-oadp-plugin
Length of output: 226
🏁 Script executed:
# Also check what file structure exists for hypershift
ls -la vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/ 2>/dev/null | grep -i backup || echo "Directory not found, searching for backup files:"
find vendor/github.com/openshift -name "*backup*" -type fRepository: openshift/hypershift-oadp-plugin
Length of output: 228
🏁 Script executed:
# Read the relevant section of vendored etcdbackup_types.go
sed -n '20,80p' vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/etcdbackup_types.go | cat -nRepository: openshift/hypershift-oadp-plugin
Length of output: 3723
🏁 Script executed:
# Search for all *Reason constants in the vendored file
rg "Reason\s*string\s*=" vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/etcdbackup_types.go -nRepository: openshift/hypershift-oadp-plugin
Length of output: 322
🏁 Script executed:
# Check orchestrator.go to see all the cases it handles
sed -n '109,131p' pkg/etcdbackup/orchestrator.go | cat -nRepository: openshift/hypershift-oadp-plugin
Length of output: 1131
Fix condition reason mismatch: use hyperv1.BackupAlreadyInProgressReason instead of local constant.
The local constant common.BackupInProgressReason = "BackupInProgress" doesn't match the vendored HyperShift controller's BackupAlreadyInProgressReason = "BackupAlreadyInProgress". When the controller sets the condition with Reason: "BackupAlreadyInProgress", this switch case won't match and polling will continue indefinitely.
Additionally, common.BackupRejectedReason = "BackupRejected" is not defined in the vendored HyperShift constants—verify whether the controller actually sets this reason or remove the dead code.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/etcdbackup/orchestrator.go` around lines 109 - 131, In VerifyInProgress
(used by pollCondition) replace the incorrect local constant check
common.BackupInProgressReason with the vendored
hyperv1.BackupAlreadyInProgressReason so the switch will match the controller's
"BackupAlreadyInProgress" reason; also audit the case referencing
common.BackupRejectedReason (which may not exist in the vendored HyperShift
constants) and either remove that case or replace it with the correct vendored
constant if the controller actually emits a rejection reason, ensuring all
switch cases use vendored hyperv1 reason constants.
|
@jparrill: This pull request references CNTRLPLANE-2685 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
4052ecd to
3cf9dfe
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.md (1)
40-40: Consider adding language identifiers to fenced code blocks.While not impacting functionality, adding language identifiers would resolve linter warnings and improve syntax highlighting:
- Line 40: ASCII diagram (consider
textor no language)- Lines 219, 224: Path structures (consider
textorplaintext)- Line 456: JSON output example (consider
json)Also applies to: 219-219, 224-224, 456-456
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.md` at line 40, Add explicit language identifiers to the fenced code blocks: mark the ASCII diagram block as ```text (or leave blank) to satisfy the linter, mark the two path-structure blocks as ```text or ```plaintext, and mark the JSON output example as ```json so syntax highlighting and linter warnings are resolved; update the corresponding fenced code block delimiters for the ASCII diagram, the path structure examples, and the JSON output example accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.md`:
- Line 40: Add explicit language identifiers to the fenced code blocks: mark the
ASCII diagram block as ```text (or leave blank) to satisfy the linter, mark the
two path-structure blocks as ```text or ```plaintext, and mark the JSON output
example as ```json so syntax highlighting and linter warnings are resolved;
update the corresponding fenced code block delimiters for the ASCII diagram, the
path structure examples, and the JSON output example accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: ab1e9bd0-2870-462b-83b6-33b72b9cd059
📒 Files selected for processing (1)
docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.md
3cf9dfe to
d96df42
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
pkg/etcdbackup/orchestrator.go (1)
173-190: Credential Secret cleanup should occur on failure paths too.Based on the integration in
pkg/core/backup.go,CleanupCredentialSecretis only called afterWaitForCompletionsucceeds. If the backup fails or times out, the copied credential Secret remains in the HO namespace. While the Secret contains an STS IAM Role ARN (not rotatable keys), accumulating orphaned Secrets is undesirable.Consider calling cleanup in a
deferor ensuring it's called on all exit paths.♻️ Suggested approach in backup.go
// In waitForEtcdBackupCompletion, defer cleanup regardless of outcome: func (p *BackupPlugin) waitForEtcdBackupCompletion(ctx context.Context) error { if p.etcdOrchestrator == nil || !p.etcdOrchestrator.IsCreated() { return nil } + // Always attempt cleanup, even on failure + defer func() { + if cleanupErr := p.etcdOrchestrator.CleanupCredentialSecret(ctx); cleanupErr != nil { + p.log.Warnf("Failed to cleanup etcd backup credential Secret: %v", cleanupErr) + } + }() if p.etcdSnapshotURL != "" { return nil } snapshotURL, err := p.etcdOrchestrator.WaitForCompletion(ctx) if err != nil { return fmt.Errorf("HCPEtcdBackup failed: %v", err) } p.etcdSnapshotURL = snapshotURL p.log.Infof("HCPEtcdBackup completed, snapshotURL: %s", snapshotURL) - if cleanupErr := p.etcdOrchestrator.CleanupCredentialSecret(ctx); cleanupErr != nil { - p.log.Warnf("Failed to cleanup etcd backup credential Secret: %v", cleanupErr) - } return nil }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/etcdbackup/orchestrator.go` around lines 173 - 190, CleanupCredentialSecret is only invoked after WaitForCompletion succeeds, leaving orphaned Secrets on failure/timeouts; update waitForEtcdBackupCompletion (method on BackupPlugin) to always attempt cleanup by deferring a call to etcdOrchestrator.CleanupCredentialSecret(ctx) right after verifying etcdOrchestrator.IsCreated(), so the Secret is removed on all exit paths (success, error, timeout); ensure the deferred call logs any cleanup error (use p.log.Warnf or similar) and continues to return the original error from WaitForCompletion.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@pkg/etcdbackup/orchestrator.go`:
- Around line 173-190: CleanupCredentialSecret is only invoked after
WaitForCompletion succeeds, leaving orphaned Secrets on failure/timeouts; update
waitForEtcdBackupCompletion (method on BackupPlugin) to always attempt cleanup
by deferring a call to etcdOrchestrator.CleanupCredentialSecret(ctx) right after
verifying etcdOrchestrator.IsCreated(), so the Secret is removed on all exit
paths (success, error, timeout); ensure the deferred call logs any cleanup error
(use p.log.Warnf or similar) and continues to return the original error from
WaitForCompletion.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: d116e2dc-774f-4faa-9b51-39b2dd034bae
📒 Files selected for processing (13)
docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.mdpkg/common/scheme.gopkg/common/types.gopkg/common/utils.gopkg/common/utils_test.gopkg/core/backup.gopkg/core/types/types.gopkg/core/validation/backup.gopkg/core/validation/backup_test.gopkg/core/validation/restore.gopkg/etcdbackup/orchestrator.gopkg/etcdbackup/orchestrator_test.gopkg/platform/aws/aws.go
💤 Files with no reviewable changes (1)
- pkg/platform/aws/aws.go
✅ Files skipped from review due to trivial changes (4)
- pkg/common/utils_test.go
- pkg/core/validation/backup_test.go
- pkg/common/types.go
- pkg/etcdbackup/orchestrator_test.go
🚧 Files skipped from review as they are similar to previous changes (6)
- pkg/common/scheme.go
- pkg/core/types/types.go
- pkg/common/utils.go
- pkg/core/validation/restore.go
- pkg/core/validation/backup.go
- pkg/core/backup.go
Add etcdSnapshot backup method that creates and monitors HCPEtcdBackup CRs during Velero backup. When etcdBackupMethod=etcdSnapshot is configured in the plugin ConfigMap, the plugin: - Creates an HCPEtcdBackup CR in the HCP namespace using BSL storage config - Copies BSL credentials to the HO namespace (remapping key for controller) - Polls the CR until backup completes or fails - Excludes etcd pods and PVCs from Velero backup (no CSI/FS backup needed) - Stores the etcd snapshot alongside the Velero backup data in the BSL The default method remains volumeSnapshot (unchanged behavior). Also cleans up dead config parameters (readoptNodes, managedServices, awsRegenPrivateLink) and registers apiextensionsv1 in the scheme for CRD existence checks. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Juan Manuel Parrilla Madrid <jparrill@redhat.com>
Document the full HCPEtcdBackup integration including architecture, backup/restore flows, configuration, credential handling, storage layout, dependency chain (PRs #8139, #8010, #8017, #8040, #8114, enhancement #1945), and troubleshooting guide. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Juan Manuel Parrilla Madrid <jparrill@redhat.com>
d96df42 to
eb942f0
Compare
|
@jparrill: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
There was a problem hiding this comment.
Actionable comments posted: 4
♻️ Duplicate comments (1)
pkg/etcdbackup/orchestrator.go (1)
115-126:⚠️ Potential issue | 🟠 MajorPoll the vendored HyperShift reasons, not local placeholders.
The current switches still look for
common.BackupInProgressReason/common.BackupRejectedReason, while the vendored API exposesBackupAlreadyInProgress,BackupFailed,BackupSucceeded, andEtcdUnhealthy.WaitForCompletion()also dropsEtcdUnhealthyReason, so those states can sit in the poll loop until timeout.Run this to compare the vendored constants with the current polling branches:
#!/bin/bash set -euo pipefail echo "=== vendored HCPEtcdBackup condition reasons ===" sed -n '1,20p' vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/etcdbackup_types.go | nl -ba echo echo "=== current polling branches ===" sed -n '109,165p' pkg/etcdbackup/orchestrator.go | nl -ba echo echo "=== local reason constants referenced here ===" rg -n 'BackupInProgressReason|BackupRejectedReason|BackupAlreadyInProgressReason|EtcdUnhealthyReason' pkg/common/types.go pkg/etcdbackup/orchestrator.goAlso applies to: 155-160
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/etcdbackup/orchestrator.go` around lines 115 - 126, The switch over cond.Reason in the poll loop is using local placeholders (common.BackupInProgressReason, common.BackupRejectedReason) instead of the vendored HyperShift condition reason constants; update the cases to use the vendored names (hyperv1.BackupAlreadyInProgress, hyperv1.BackupSucceeded, hyperv1.BackupFailed, hyperv1.EtcdUnhealthy) so the poll loop and WaitForCompletion() observe the same semantics, and remove/replace any references to common.BackupInProgressReason/common.BackupRejectedReason (also adjust the similar branch around lines 155-160) so all checks consistently reference hyperv1 constants.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.md`:
- Around line 117-127: Update the paragraph to say the restore reads the etcd
snapshot URL from the HostedCluster annotation
hypershift.openshift.io/etcd-snapshot-url (not
status.lastSuccessfulEtcdBackupURL); explain that pkg/core/backup.go persists
the URL into that annotation because Velero strips status on restore, and update
the note about unstructured access to clarify we read the annotation until the
HyperShift API vendor adds a formal status field.
In `@pkg/core/backup.go`:
- Around line 146-151: The current logic calls p.createEtcdBackup as soon as any
HCP resolves (guarded only by p.etcdBackupMethod), which can start HCPEtcdBackup
before platform validation and without a guaranteed wait/cleanup path; change
this so HCPEtcdBackup is only created after platform validation and in a branch
that implements completion/cleanup (move or gate the call into the
HostedControlPlane/HostedCluster handling after validation succeeds), or add an
explicit validation check and registration of a wait/cleanup handler before
invoking p.createEtcdBackup; reference p.etcdBackupMethod, p.createEtcdBackup,
HCPEtcdBackup, HostedControlPlane and HostedCluster to locate and update the
code paths accordingly.
- Around line 284-289: When CreateEtcdBackup or VerifyInProgress fails after
copying credentials, ensure you delete the remapped BSL Secret in the host
operator namespace before returning; update the error paths after
p.etcdOrchestrator.CreateEtcdBackup(ctx, backup, p.hcp.Namespace, hc) and
p.etcdOrchestrator.VerifyInProgress(ctx) so they call the same cleanup that
waitForEtcdBackupCompletion() uses (or explicitly delete the remapped secret by
name in hoNamespace/p.hcp.Namespace) and then return the original error; make
the deletion idempotent (ignore NotFound) and log any deletion failure but still
return the original create/verify error.
In `@pkg/etcdbackup/orchestrator.go`:
- Around line 83-104: The code generates a random crName causing non-idempotent
creates; instead derive crName deterministically from the Velero backup (e.g.,
use backup.Name or backup.UID to build the CR name) and before calling
o.client.Create check for an existing HCPEtcdBackup by that deterministic name
via o.client.Get (or list by owner/label) so you reuse the existing CR on
re-entry; update crName, use o.client.Get/Create accordingly, and set
o.BackupName = crName when reusing or after successful create (referencing
variables crName, backup.Name, backup.UID, o.client.Get/Create, and
o.BackupName).
---
Duplicate comments:
In `@pkg/etcdbackup/orchestrator.go`:
- Around line 115-126: The switch over cond.Reason in the poll loop is using
local placeholders (common.BackupInProgressReason, common.BackupRejectedReason)
instead of the vendored HyperShift condition reason constants; update the cases
to use the vendored names (hyperv1.BackupAlreadyInProgress,
hyperv1.BackupSucceeded, hyperv1.BackupFailed, hyperv1.EtcdUnhealthy) so the
poll loop and WaitForCompletion() observe the same semantics, and remove/replace
any references to common.BackupInProgressReason/common.BackupRejectedReason
(also adjust the similar branch around lines 155-160) so all checks consistently
reference hyperv1 constants.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: e6bca3fb-03c5-479b-9731-d1f5ead6e10a
📒 Files selected for processing (13)
docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.mdpkg/common/scheme.gopkg/common/types.gopkg/common/utils.gopkg/common/utils_test.gopkg/core/backup.gopkg/core/types/types.gopkg/core/validation/backup.gopkg/core/validation/backup_test.gopkg/core/validation/restore.gopkg/etcdbackup/orchestrator.gopkg/etcdbackup/orchestrator_test.gopkg/platform/aws/aws.go
💤 Files with no reviewable changes (1)
- pkg/platform/aws/aws.go
✅ Files skipped from review due to trivial changes (3)
- pkg/core/validation/backup_test.go
- pkg/common/utils_test.go
- pkg/common/types.go
🚧 Files skipped from review as they are similar to previous changes (5)
- pkg/core/types/types.go
- pkg/common/utils.go
- pkg/core/validation/restore.go
- pkg/common/scheme.go
- pkg/core/validation/backup.go
| 1. When the `HostedCluster` item is processed during restore, the plugin reads `status.lastSuccessfulEtcdBackupURL` from the HC's unstructured content. | ||
|
|
||
| 2. If the URL is present and the HC has managed etcd (`spec.etcd.managed != nil`), the plugin injects the URL into `spec.etcd.managed.storage.restoreSnapshotURL`. | ||
|
|
||
| 3. The modified HC is written back to Velero's output, so when the HC is created in the target cluster, the HyperShift Operator uses the snapshot URL to restore etcd from the snapshot. | ||
|
|
||
| ### No Bidirectional Tracking | ||
|
|
||
| The previous approach required tracking both the `HCPEtcdBackup` CR and the `HostedCluster` item arrival order. With `lastSuccessfulEtcdBackupURL` persisted in the HC status by the HCPEtcdBackup controller, the restore flow is stateless — everything needed is in the HC object itself. | ||
|
|
||
| > **Note**: The `lastSuccessfulEtcdBackupURL` field is read via unstructured map access until the HyperShift API vendor is updated to include it (tracked in CNTRLPLANE-3173). |
There was a problem hiding this comment.
Document the restore handoff as the annotation, not status.
This section says restore reads status.lastSuccessfulEtcdBackupURL, but pkg/core/backup.go Lines 179-199 explicitly persist the URL in hypershift.openshift.io/etcd-snapshot-url because Velero strips status during restore. As written, this documents the wrong contract for the follow-on restore flow.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.md` around lines
117 - 127, Update the paragraph to say the restore reads the etcd snapshot URL
from the HostedCluster annotation hypershift.openshift.io/etcd-snapshot-url (not
status.lastSuccessfulEtcdBackupURL); explain that pkg/core/backup.go persists
the URL into that annotation because Velero strips status on restore, and update
the note about unstructured access to clarify we read the annotation until the
HyperShift API vendor adds a formal status field.
| // Etcd backup: create HCPEtcdBackup CR as early as possible (once). | ||
| // Only when etcdBackupMethod is "etcdSnapshot". | ||
| if p.etcdBackupMethod == common.EtcdBackupMethodEtcdSnapshot { | ||
| if err := p.createEtcdBackup(ctx, backup); err != nil { | ||
| return nil, nil, fmt.Errorf("error creating HCPEtcdBackup: %v", err) | ||
| } |
There was a problem hiding this comment.
Don't start the snapshot workflow before validation and a guaranteed wait path.
This fires as soon as any item resolves an HCP, but platform validation only happens later in the HostedControlPlane branch and completion/cleanup only happen in the HostedControlPlane/HostedCluster branches. That means an invalid or partial backup can still launch HCPEtcdBackup and copy credentials with no later path that waits for it or cleans it up.
Also applies to: 162-185
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/core/backup.go` around lines 146 - 151, The current logic calls
p.createEtcdBackup as soon as any HCP resolves (guarded only by
p.etcdBackupMethod), which can start HCPEtcdBackup before platform validation
and without a guaranteed wait/cleanup path; change this so HCPEtcdBackup is only
created after platform validation and in a branch that implements
completion/cleanup (move or gate the call into the
HostedControlPlane/HostedCluster handling after validation succeeds), or add an
explicit validation check and registration of a wait/cleanup handler before
invoking p.createEtcdBackup; reference p.etcdBackupMethod, p.createEtcdBackup,
HCPEtcdBackup, HostedControlPlane and HostedCluster to locate and update the
code paths accordingly.
| if err := p.etcdOrchestrator.CreateEtcdBackup(ctx, backup, p.hcp.Namespace, hc); err != nil { | ||
| return err | ||
| } | ||
|
|
||
| if err := p.etcdOrchestrator.VerifyInProgress(ctx); err != nil { | ||
| return err |
There was a problem hiding this comment.
Clean up the copied Secret on create/verify failures.
The only cleanup today is in waitForEtcdBackupCompletion(). If either call here fails after the credentials were copied, the remapped BSL Secret is left behind in hoNamespace.
Possible fix
if err := p.etcdOrchestrator.CreateEtcdBackup(ctx, backup, p.hcp.Namespace, hc); err != nil {
+ if cleanupErr := p.etcdOrchestrator.CleanupCredentialSecret(ctx); cleanupErr != nil {
+ p.log.Warnf("Failed to cleanup etcd backup credential Secret after create error: %v", cleanupErr)
+ }
return err
}
if err := p.etcdOrchestrator.VerifyInProgress(ctx); err != nil {
+ if cleanupErr := p.etcdOrchestrator.CleanupCredentialSecret(ctx); cleanupErr != nil {
+ p.log.Warnf("Failed to cleanup etcd backup credential Secret after verify error: %v", cleanupErr)
+ }
return err
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/core/backup.go` around lines 284 - 289, When CreateEtcdBackup or
VerifyInProgress fails after copying credentials, ensure you delete the remapped
BSL Secret in the host operator namespace before returning; update the error
paths after p.etcdOrchestrator.CreateEtcdBackup(ctx, backup, p.hcp.Namespace,
hc) and p.etcdOrchestrator.VerifyInProgress(ctx) so they call the same cleanup
that waitForEtcdBackupCompletion() uses (or explicitly delete the remapped
secret by name in hoNamespace/p.hcp.Namespace) and then return the original
error; make the deletion idempotent (ignore NotFound) and log any deletion
failure but still return the original create/verify error.
| crName := fmt.Sprintf("oadp-%s-%s", backup.Name, utilrand.String(4)) | ||
| etcdBackup := &hyperv1.HCPEtcdBackup{ | ||
| ObjectMeta: metav1.ObjectMeta{ | ||
| Name: crName, | ||
| Namespace: hcpNamespace, | ||
| }, | ||
| Spec: hyperv1.HCPEtcdBackupSpec{ | ||
| Storage: *storage, | ||
| }, | ||
| } | ||
|
|
||
| if err := o.client.Create(ctx, etcdBackup); err != nil { | ||
| if apierrors.IsAlreadyExists(err) { | ||
| o.log.Infof("HCPEtcdBackup %s/%s already exists, reusing", hcpNamespace, crName) | ||
| } else { | ||
| return fmt.Errorf("failed to create HCPEtcdBackup: %w", err) | ||
| } | ||
| } else { | ||
| o.log.Infof("Created HCPEtcdBackup %s/%s", hcpNamespace, crName) | ||
| } | ||
|
|
||
| o.BackupName = crName |
There was a problem hiding this comment.
Random CR names make this create path non-idempotent.
crName changes on every call, so the AlreadyExists branch only helps on a name collision. If this method is re-entered with a fresh in-memory state, the same Velero backup can create multiple HCPEtcdBackup CRs instead of reusing the first one.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/etcdbackup/orchestrator.go` around lines 83 - 104, The code generates a
random crName causing non-idempotent creates; instead derive crName
deterministically from the Velero backup (e.g., use backup.Name or backup.UID to
build the CR name) and before calling o.client.Create check for an existing
HCPEtcdBackup by that deterministic name via o.client.Get (or list by
owner/label) so you reuse the existing CR on re-entry; update crName, use
o.client.Get/Create accordingly, and set o.BackupName = crName when reusing or
after successful create (referencing variables crName, backup.Name, backup.UID,
o.client.Get/Create, and o.BackupName).
Summary
Adds
etcdSnapshotas a new etcd backup method for HyperShift HostedClusters. This PR covers only the backup flow; the restore flow (with pre-signed URL generation) will follow in a separate PR.etcdBackupMethodconfiguration key (volumeSnapshotdefault,etcdSnapshotfor native etcd snapshots)HCPEtcdBackupCR that triggers a nativeetcdctl snapshot saveand uploads the snapshot to the Velero BSL object storeetcdSnapshotto prevent CSI VolumeSnapshots{prefix}/backups/{backup-name}/etcd-backup/)Dependencies
LastSuccessfulEtcdBackupURLfield in HostedClusterStatus (CNTRLPLANE-3173)Once both are merged, a vendor update is required to remove local constants (
BackupInProgressReason,BackupRejectedReason) in favor of API-defined ones.Related
Changes
fix(deps)HCPEtcdBackuptypes)feat(backup)docsTest plan
etcdBackupMethodandhoNamespacerecognized as valid config keysgo build ./...passesgo test ./...passesetcdBackupMethod=etcdSnapshot, run OADP backup, verify HCPEtcdBackup CR created and snapshot uploadedvolumeSnapshotmethod (default) behavior is unchanged🤖 Generated with Claude Code