HYPERFLEET-560 - doc: add adapter deletion flow design #110

Open
xueli181114 wants to merge 1 commit into openshift-hyperfleet:main from xueli181114:HYPERFLEET-560

Conversation


@xueli181114 xueli181114 commented Mar 25, 2026

Summary

  • Design document for the deletion workflow between API and Adapters
  • Two-phase deletion: API marks for deletion (soft delete), adapters clean up K8s resources, API deletes record after all adapters confirm
  • Minimal DSL changes: invoke_deletion (top-level CEL expression) and deleteOption (per resource with propagationPolicy and after)
  • Propagation policies align with K8s metav1.DeletionPropagation: Background, Foreground, Orphan
  • API delete signal requires three checks: Applied=False AND Health=True AND observed_time >= deletion_timestamp
  • 13 edge cases addressed (error reporting, stuck deletion, multi-generation resources, API mutation rejection, etc.)
  • Alternative approach documented: separate deletion adapter with comparison table

Test plan

  • Review design doc for completeness and correctness
  • Validate Mermaid diagrams render correctly on GitHub
  • Confirm DSL changes are minimal and backward compatible

Relates to: HYPERFLEET-560

Summary by CodeRabbit

  • Documentation
    • Added a draft design describing a two-phase, adapter-driven deletion workflow for Clusters and NodePools: API delete returns 202 Accepted and marks resources Terminating while adapters perform cleanup and signal completion.
    • Defines per-resource propagation policies and ordering, deletion-mode behavior, clear completion/gating signals, edge-case rules (not-found, conflicts, stale signals), and observability/metrics for progress and stuck detection.

@openshift-ci openshift-ci bot requested review from jsell-rh and pnguyen44 March 25, 2026 04:44

openshift-ci bot commented Mar 25, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign xueli181114 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


coderabbitai bot commented Mar 25, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds a draft design specifying a two‑phase deletion workflow for Clusters and NodePools: DELETE /clusters/{id} sets deletion_timestamp, transitions to Terminating, and returns 202 Accepted while records are retained. Sentinel behavior and CloudEvents remain unchanged; deletion intent is derived from API state.

The adapter DSL gains a top‑level when_deleting CEL expression and per‑resource deleteOptions (Kubernetes propagationPolicy: Background|Foreground|Orphan and after dependency ordering). Executors enter deletion mode, compute a topological deletion order, delete resources honoring propagationPolicy (skipping Orphan from blocking), run post‑processing, and report status.

Deletion completion uses the adapter status contract: deletion confirmed when Applied=False AND Health=True AND observed_time >= deletion_timestamp. NodePool and Cluster records are garbage‑collected only once these signals and hierarchical gating are satisfied; "not found" is treated as success and API rejects mutations with 409 while deletion_timestamp is set.
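The "compute a topological deletion order" step could look like the following sketch: Kahn's algorithm over the `deleteOptions.after` edges, failing fast on unknown references and cycles instead of hanging mid-deletion. The function name and error strings are illustrative, not part of the design; the resource names reuse the doc's `clusterJob`/`clusterConfigMap` examples.

```go
package main

import "fmt"

// deletionOrder topologically sorts resources so that each resource is
// deleted only after everything named in its "after" list. It validates the
// dependency graph up-front and fails fast on invalid input.
func deletionOrder(after map[string][]string) ([]string, error) {
	indeg := make(map[string]int, len(after))
	next := make(map[string][]string) // dependency -> resources deleted after it
	for name := range after {
		indeg[name] = 0
	}
	for name, deps := range after {
		for _, dep := range deps {
			if _, ok := after[dep]; !ok {
				return nil, fmt.Errorf("invalid dependency: %q references unknown resource %q", name, dep)
			}
			indeg[name]++
			next[dep] = append(next[dep], name)
		}
	}
	var queue, order []string
	for name, d := range indeg {
		if d == 0 {
			queue = append(queue, name)
		}
	}
	for len(queue) > 0 {
		n := queue[0]
		queue = queue[1:]
		order = append(order, n)
		for _, m := range next[n] {
			indeg[m]--
			if indeg[m] == 0 {
				queue = append(queue, m)
			}
		}
	}
	if len(order) != len(after) {
		// some resources never reached in-degree 0: a dependency cycle
		return nil, fmt.Errorf("cycle detected in deleteOptions.after")
	}
	return order, nil
}

func main() {
	order, err := deletionOrder(map[string][]string{
		"clusterJob":       nil,
		"clusterConfigMap": {"clusterJob"}, // delete the ConfigMap after the Job
	})
	fmt.Println(order, err) // [clusterJob clusterConfigMap] <nil>
}
```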

Sequence Diagram(s)

sequenceDiagram
    autonumber
    actor User
    participant API as API/Controller
    participant Sentinel
    participant Executor as Adapter Framework (Executor)
    participant Adapter as Adapter/Cloud
    participant DB as Database

    User->>API: DELETE /clusters/{id} (set deletion_timestamp, phase=Terminating) / returns 202
    API->>Sentinel: emit reconcile CloudEvent
    Sentinel->>Executor: poll & deliver reconcile
    Executor->>Adapter: evaluate top-level when_deleting CEL -> enter deletion mode
    Executor->>Adapter: compute deletion order from deleteOptions.after
    loop for each resource in topological order
        Executor->>Adapter: request delete (respect propagationPolicy)
        Adapter->>Executor: update status (Applied=False / Health=True / observed_time)
    end
    Executor->>Sentinel: publish status updates
    Sentinel->>API: reconcile observes delete signal (Applied=False && Health=True && observed_time >= deletion_timestamp)
    API->>DB: garbage-collect NodePool records when nodepool-level signals met
    API->>DB: garbage-collect Cluster record when all adapters & nodepools are deleted

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely describes the main change: adding a design document for adapter deletion flow, directly matching the PR's primary objective. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (1)
hyperfleet/components/adapter/framework/adapter-deletion-flow-design.md (1)

241-242: Deletion ordering needs explicit handling for invalid dependency graphs.

Topological sorting is mentioned, but behavior for cycles or unknown after references is undefined. Add explicit fail-fast semantics (status/error reason) so misconfigurations don’t cause silent stuck deletions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hyperfleet/components/adapter/framework/adapter-deletion-flow-design.md`
around lines 241 - 242, The deletion executor's topological sorting using
deleteOption.after must validate the dependency graph up-front and fail fast on
invalid graphs: add a validation step in the executor (the component that builds
the topological order) that detects unknown `after` references and cycles before
any deletions, and on detection immediately mark the affected resources'
deletion status with an explicit error/reason (e.g., "invalid dependency:
missing reference X" or "cycle detected in deleteOption.after"), log the
failure, and abort execution rather than proceeding or hanging; ensure the
validation surfaces the offending resource identifiers so operators can fix
misconfigurations.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@hyperfleet/components/adapter/framework/adapter-deletion-flow-design.md`:
- Around line 107-109: The deletion gate is inconsistent: change the simple "all
adapters report Applied=False" rule to the canonical, stricter gate used later
so implementations match; update the DELETE flow description (the step that
currently says "when all adapters report Applied=False") to state that the API
deletes the cluster only when every adapter status for that cluster has
Applied=False AND Health=True and the adapter's observed_time is >= the
cluster's deletion_timestamp, and ensure the terms deletion_timestamp,
status.phase (Terminating), Applied, Health, and observed_time are referenced
exactly as elsewhere in the doc so both sections read the same.
- Around line 507-509: Update the deletion criteria in the "API deletes
nodepools" and "API deletes cluster" bullets so they match the API delete-signal
safety gate: require adapters to report Applied=False AND Health=True, and
verify the adapter's post-delete timestamp is fresh (i.e., within the configured
grace window) before deleting records; change the two lines that currently
reference only `Applied=False` to include `Health=True` and the post-delete
timestamp freshness check so nodepool deletion (the nodepool-level adapter
condition) and cluster deletion (all adapters including cluster-level) follow
the same safe gating logic.
- Around line 222-223: Update the documentation to correct the `Orphan`
propagation description so it matches Kubernetes `metav1.DeletionPropagation`
semantics: replace the diagram node text `K[Skip - leave resource as-is]` and
the explanatory sentence at the other occurrence with wording that `Orphan`
deletes the owner resource while preserving dependents (dependents become
orphaned), not "skip/do nothing" or "leave resource as-is"; ensure the term
`Orphan` and `metav1.DeletionPropagation` are used to make the intended
semantics clear.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ae6754ea-c2cc-40a0-9185-b6ceca336446

📥 Commits

Reviewing files that changed from the base of the PR and between a693ad8 and b78790c.

📒 Files selected for processing (1)
  • hyperfleet/components/adapter/framework/adapter-deletion-flow-design.md

@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (2)
hyperfleet/components/adapter/framework/adapter-deletion-flow-design.md (2)

124-124: ⚠️ Potential issue | 🟠 Major

Unify the final-delete gate everywhere to the same canonical criteria

Line 124, Lines 364-373, and Lines 528-529 weaken or partially define the gate compared with the stricter sections. Please use one exact rule everywhere: deletion_timestamp set, Applied=False, Health=True, and observed_time >= deletion_timestamp (per adapter).

Suggested doc edit
-2. **Delete**: The API monitors adapter statuses. When **all** adapters report `Applied=False` for a cluster that has `deletion_timestamp` set, the API deletes the cluster record and all associated data (nodepools, adapter statuses).
+2. **Delete**: The API monitors adapter statuses. It deletes only when, for all required adapters, `Applied=False` AND `Health=True` AND `observed_time >= deletion_timestamp` for a cluster with `deletion_timestamp` set. Then it deletes the cluster record and all associated data (nodepools, adapter statuses).

-The API determines when to delete by checking three things:
+The API determines when to delete by checking four things:
 ...
-3. **All adapters report `Health=True`** (confirms the adapter status is reliable)
+3. **All adapters report `Health=True`** (confirms the adapter status is reliable)
+4. **All adapters have `observed_time >= deletion_timestamp`** (confirms statuses are post-delete and not stale)

-6. API deletes nodepools when their adapters report `Applied=False`
-7. API deletes cluster when all adapters (cluster + nodepool level) report `Applied=False`
+6. API deletes nodepools when required nodepool adapters report `Applied=False AND Health=True` and `observed_time >= nodepool.deletion_timestamp`
+7. API deletes cluster when all required adapters (cluster + nodepool level) report `Applied=False AND Health=True` and `observed_time >= cluster.deletion_timestamp`

Also applies to: 364-373, 528-529

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hyperfleet/components/adapter/framework/adapter-deletion-flow-design.md` at
line 124, The final-delete gate is inconsistent across the document; unify all
checks to a single canonical rule: require the cluster's deletion_timestamp to
be set and for each adapter require Applied=False, Health=True, and
observed_time >= deletion_timestamp; update every occurrence that currently
weakens or partially defines the gate so it uses this exact criteria (use the
terms deletion_timestamp, Applied=False, Health=True, observed_time >=
deletion_timestamp verbatim to ensure consistency).

102-103: ⚠️ Potential issue | 🟠 Major

Orphan propagation is documented with incorrect behavior

Lines 238/251 (and downstream examples at Line 102 and Line 294) describe Orphan as “skip/do nothing,” which conflicts with Kubernetes metav1.DeletionPropagation semantics. Orphan deletes the owner resource and preserves dependents (orphaned).

Suggested doc edit
-    D -->|Orphan| K[Skip - leave resource as-is]
+    D -->|Orphan| K[Delete owner; preserve dependents (orphan)]

-| `Orphan` | Do nothing, leave resource as-is | Resources managed by other systems or intentionally preserved |
+| `Orphan` | Delete owner resource and preserve dependents (dependents become orphaned). | Preserve child resources while removing the owner |

-        Note over Adapter: serviceAccount: Orphan, skip
+        Note over Adapter: serviceAccount: Orphan (delete owner, preserve dependents)

-    D3 -->|Orphan| D6[Skip]
+    D3 -->|Orphan| D6[Delete owner; preserve dependents]

Also applies to: 238-253, 294-295
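To make the corrected `Orphan` semantics concrete, here is a toy simulation of the eventual outcome of each propagation policy. This is not the real Kubernetes garbage collector (which works asynchronously via ownerReferences); the `Resource` type and `deleteWithPolicy` helper are hypothetical stand-ins that only show what ultimately remains.

```go
package main

import "fmt"

// Resource is a toy object with an optional owner reference.
type Resource struct {
	Name  string
	Owner string // "" means no owner
}

// deleteWithPolicy simulates the end state after deleting `owner` under a
// metav1.DeletionPropagation-like policy, returning what survives.
func deleteWithPolicy(objs []Resource, owner, policy string) []Resource {
	var out []Resource
	for _, r := range objs {
		switch {
		case r.Name == owner:
			// The owner is deleted under ALL three policies,
			// including Orphan (Orphan is not "skip").
		case r.Owner == owner && policy == "Orphan":
			// Orphan preserves dependents; they become orphaned.
			r.Owner = ""
			out = append(out, r)
		case r.Owner == owner:
			// Background/Foreground eventually remove dependents too
			// (they differ only in ordering, not in the end state).
		default:
			out = append(out, r)
		}
	}
	return out
}

func main() {
	objs := []Resource{{Name: "deployment"}, {Name: "replicaset", Owner: "deployment"}}
	left := deleteWithPolicy(objs, "deployment", "Orphan")
	fmt.Printf("%+v\n", left) // only the orphaned replicaset remains
}
```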

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hyperfleet/components/adapter/framework/adapter-deletion-flow-design.md`
around lines 102 - 103, The doc incorrectly describes the DeletionPropagation
value "Orphan" as "skip/do nothing"; update all occurrences of "Orphan" in
adapter-deletion-flow-design.md (including the examples mentioning
serviceAccount: Orphan and the sections referencing metav1.DeletionPropagation)
to state that "Orphan" deletes the owner resource while preserving dependents
(i.e., orphaning them), and adjust any example text or diagrams that imply
"skip" behavior so they reflect the correct semantics of deleting owner +
preserving dependents.

ℹ️ Review info
Run ID: fa20e0fd-bc6c-4c19-a424-fc99f8673441

📥 Commits

Reviewing files that changed from the base of the PR and between b78790c and c72f78b.

📒 Files selected for processing (1)
  • hyperfleet/components/adapter/framework/adapter-deletion-flow-design.md

@xueli181114 xueli181114 force-pushed the HYPERFLEET-560 branch 2 times, most recently from 87629c6 to 1bc2c55 Compare March 25, 2026 06:17
@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (2)
hyperfleet/components/adapter/framework/adapter-deletion-flow-design.md (2)

374-375: ⚠️ Potential issue | 🟠 Major

Deletion gate criteria are inconsistent across sections.

Line 374 weakens the rule to Applied=False + deletion_timestamp, and the timeline checks at Line 114/Line 117 and Line 590/Line 593 omit observed_time >= deletion_timestamp. The canonical gate elsewhere is stricter (Applied=False AND Health=True AND observed_time >= deletion_timestamp). Keep one rule everywhere to prevent premature deletion from stale statuses.

Also applies to: 114-117, 590-594

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hyperfleet/components/adapter/framework/adapter-deletion-flow-design.md`
around lines 374 - 375, The deletion-gate is inconsistent: replace the weaker
`Applied=False` + `deletion_timestamp` variant and the timeline checks that omit
the timestamp with the canonical rule everywhere; require `Applied=False AND
Health=True AND observed_time >= deletion_timestamp` in all places that define
or evaluate deletion eligibility (the paragraph that currently uses
`Applied=False` + `deletion_timestamp`, the timeline checks, and the canonical
gate text) so deletion is only allowed when the resource is marked not applied,
healthy, and the observed status is at-or-after the cluster deletion timestamp.

248-249: ⚠️ Potential issue | 🟠 Major

Orphan propagation is still defined incorrectly.

Line 248/Line 261 describe Orphan as skip/do-nothing, which conflicts with Kubernetes metav1.DeletionPropagation (owner is deleted, dependents are preserved/orphaned). This is a behavioral mismatch in the DSL contract.

Suggested doc fix
-    D -->|Orphan| K[Skip - leave resource as-is]
+    D -->|Orphan| K[Delete owner, preserve dependents (orphan)]

-| `Orphan` | Do nothing, leave resource as-is | Resources managed by other systems or intentionally preserved |
+| `Orphan` | Delete the owner resource while preserving dependents (they become orphaned). | Preserve dependent resources while removing the owner |

Also applies to: 261-263

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hyperfleet/components/adapter/framework/adapter-deletion-flow-design.md`
around lines 248 - 249, Update the DSL doc and diagram so the "Orphan"
propagation matches Kubernetes metav1.DeletionPropagation semantics: change any
description that says "skip/do-nothing" or "leave resource as-is" (e.g., the
diagram node labeled K[Skip - leave resource as-is] and the prose around the
'Orphan' keyword) to state that the owner is deleted but dependents are
preserved/orphaned (dependents' ownerReferences are removed/left intact per K8s
behavior); ensure all mentions of Orphan (including the block currently labeled
"Orphan" at the diagram and the text blocks explaining propagation) reflect
"owner deleted, dependents preserved/orphaned" rather than "no action."

ℹ️ Review info
Run ID: f5a70992-699a-4bbd-a1a9-21476bcd3aaf

📥 Commits

Reviewing files that changed from the base of the PR and between c72f78b and 87629c6.

📒 Files selected for processing (1)
  • hyperfleet/components/adapter/framework/adapter-deletion-flow-design.md

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
hyperfleet/components/adapter/framework/adapter-deletion-flow-design.md (2)

114-119: ⚠️ Potential issue | 🟠 Major

Deletion gate is still weaker in diagrams than the canonical API signal.

Line 114 and Line 117 (and again Line 596 and Line 599) only gate on Applied=False + Health=True, but the doc’s safety rule also requires observed_time >= deletion_timestamp. This inconsistency can drive premature deletion from stale statuses.

Please align these diagram/flow steps with the canonical gate used elsewhere.

Also applies to: 596-600

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hyperfleet/components/adapter/framework/adapter-deletion-flow-design.md`
around lines 114 - 119, The deletion gates in the sequence diagram currently
check only "Applied=False AND Health=True"; update those gate conditions (the
lines showing "Applied=False AND Health=True?") to include the canonical
temporal check so they read "Applied=False AND Health=True AND observed_time >=
deletion_timestamp?" and apply this change to both occurrences mentioned (the
cluster/nodepool gate at lines ~114-119 and the duplicate at ~596-600) so the
diagram matches the API's safety rule.

248-249: ⚠️ Potential issue | 🟠 Major

Orphan propagation is documented with incorrect Kubernetes semantics.

Line 248 and Line 261 describe Orphan as “skip/do nothing,” but Kubernetes metav1.DeletionPropagation=Orphan means delete the owner while preserving dependents (which become orphaned). This also conflicts with Line 263’s “matches metav1” statement.

Also applies to: 261-263

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hyperfleet/components/adapter/framework/adapter-deletion-flow-design.md`
around lines 248 - 249, The diagram and text incorrectly state that "Orphan"
means "skip/do nothing"; update all occurrences of the "Orphan" propagation
description (the diagram node labeled "Orphan" / text that says "Skip - leave
resource as-is" and the statement that "matches metav1") to accurately reflect
Kubernetes semantics: metav1.DeletionPropagation=Orphan causes deletion of the
owner while preserving dependents (dependents are left orphaned), not skipping
deletion; ensure consistency across the related descriptions that currently
mention lines 248, 261–263 so the diagram and the prose explicitly say "delete
owner, preserve dependents (dependents become orphaned) — aligns with
metav1.DeletionPropagation=Orphan."
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@hyperfleet/components/adapter/framework/adapter-deletion-flow-design.md`:
- Around line 514-516: The example Applied condition uses invoke_deletion ?
(has(resources.clusterJob) ? "True" : "False") :
(has(resources.clusterConfigMap.metadata.resourceVersion) ? "True" : "False"),
which can mark deletion complete prematurely; update the predicate so that in
deletion mode it checks that all managed resources are absent (e.g., verify
!has(resources.clusterJob) AND !has(resources.<otherManagedResource>) etc.,
excluding only explicit Orphan resources) rather than only the absence of
resources.clusterJob, and mirror the same comprehensive absence check used by
your deletion-complete definition for consistency.


ℹ️ Review info
Run ID: db0d6340-e993-41a7-8a35-304ae0699d80

📥 Commits

Reviewing files that changed from the base of the PR and between 87629c6 and 1bc2c55.

📒 Files selected for processing (1)
  • hyperfleet/components/adapter/framework/adapter-deletion-flow-design.md

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@hyperfleet/components/adapter/framework/adapter-deletion-flow-design.md`:
- Around line 263-264: The doc's "Orphan" semantics are inconsistent with
Kubernetes: decide whether the adapter's "Orphan" means skip deletion (rename
node and text to "Skip" or "Preserve") or actually perform deletion using
Kubernetes semantics (call Delete/propagationPolicy: Orphan so owner is deleted
and dependents keep cleared ownerReferences); then update the diagram node D
-->|Orphan| K[Skip - leave resource as-is], the surrounding text that references
metav1.DeletionPropagation and lines mentioning "do nothing", and any examples
to reflect the chosen behavior (if you choose to align with K8s also mention
propagationPolicy: Orphan and ownerReferences clearing; if you choose skip,
rename symbols from Orphan to Skip/Preserve and state it does not call
Kubernetes delete).

ℹ️ Review info
Run ID: 673c0856-8483-40e4-bd43-8591d8ef185b

📥 Commits

Reviewing files that changed from the base of the PR and between 1bc2c55 and 3c96851.

📒 Files selected for processing (1)
  • hyperfleet/components/adapter/framework/adapter-deletion-flow-design.md

@xueli181114 xueli181114 force-pushed the HYPERFLEET-560 branch 2 times, most recently from 04c7af2 to 85f615e Compare March 25, 2026 08:07

### Overview

The deletion flow extends the existing adapter framework with minimal DSL changes. When the HyperFleet API receives a DELETE request, it soft-deletes the cluster (sets `deletion_timestamp`), and the cluster enters the `Terminating` phase. The existing Sentinel polling and event-driven adapter workflow handles the rest.
Collaborator

Since we have created_at and deleted_at, should we use deleted_at instead of deletion_timestamp?

Contributor

+1, lets keep them aligned here 🙏

Contributor Author

deletion_timestamp is the k8s style, but if we prefer deleted_at I am OK with that. To me, though, deleted_at reads like the timestamp at which deletion finished.

Collaborator

@rh-amarin rh-amarin Mar 26, 2026

As I understand it, deleted_at is when the user deleted the API resource, so in fact it is the deletion-finished timestamp from the user's perspective.
Once a user has deleted an API resource, they should consider it "gone", "kaput", "dead", "over".

The reality will be reconciled later, and that will come in the status.conditions

Contributor Author

Agreed. Renamed deletion_timestamp to deleted_at throughout the doc to match the created_at convention.


## What & Why

**What**: Design the deletion workflow for Clusters and NodePools, defining how the adapter framework handles resource cleanup when a cluster enters the `Terminating` phase.
Collaborator

We don't have phases in the API
I would simply say "when a cluster is soft deleted", or "API resource" if we want to talk in a more generic way from now on.

Contributor

I agree with "API resource" here; I was going to call it out below. I think we need to speak in more generic terms and move away from cluster and node pools. We should focus on API resources, parent and children.

Contributor Author

This is legacy wording. Let me update it to avoid confusion.

Contributor Author

Agreed. Removed all status.phase / "cluster phase" / "Terminating phase" references. Now uses "soft-deleted" / "cluster state" / deleted_at throughout.

1. **Mark for Deletion**: On `DELETE /clusters/{id}`, the API sets `deletion_timestamp` on the cluster record **and all its nodepools**. All records remain in the database and are still queryable.
- `deletion_timestamp` is the canonical delete intent signal for orchestration and adapter behavior.
- Customer-facing lifecycle state (for example, `Terminating`) is derived from `deletion_timestamp` at read time.
- API should set aggregate `Ready=False` (reason `Deleting`) immediately when `deletion_timestamp` is set so Sentinel's existing not-ready query picks up the resource without Sentinel code changes.
Collaborator

This is interesting.
Until now, Ready changed because of:

  • spec change -> increases generation
  • required adapters reporting Available false

So, the Ready condition was a computed value out of the generation + required adapters conditions

Open questions:

  • Does deletion affect generation ?
  • Do we have to include a new condition for computing Ready?
    • e.g. deleted_at is set
  • What happens if required adapters report True after deletion?
    • Ready can not become True because of adapters not managing the deletion
  • Should there be "required adapters" for the deletion?

Should there be "required adapters" for the deletion?
An idea (hack?): to have a required adapter for deletion.

  • When a cluster has not deleted_at, it will return Available=True
    • Meaning, that is ok, Ready can become true and this adapter has "nothing to say there"
  • When a cluster has deleted_at
    • It also updates generation, so Ready=False
    • A message will be triggered
    • This adapter has work to do
    • It will report Available=False while it is doing
    • The resource, will keep Ready=False and triggering
    • Until deletion adapter finally says Available=True
  • The confusing part....
    • Other required adapters, should return Available=True when cluster is deleted....
    • In order to get Ready for the new generation to become True

Contributor

Some great points here @rh-amarin

Should generation increment? I think yes, generation is capturing a new user intent on a resource. It should increment

My take: required adapters will go away in a future version. I think we need more than required adapters to determine top-level state; we will need to improve this aggregation.

To that end I think we need a new condition for deletion; I think anything else will be a pure hack. Current conditions are directly tied to resource creation, and making changes here will get messy over time. For example, does Applied == False mean not applied or deleted? Let's take some time tomorrow to whiteboard a solution here.

Collaborator

To that end I think we need a new condition for deletion, I think anything else will be a pure hack

Well, if you think of a deletion just being a desired state... the whole point of the api-sentinel-adapter is to reconcile with that desired state.

For me, a deleted API resource with Ready==True means that the "real world" has been reconciled with the desired state.

But Ready is really a misleading name here.... Reconciled could be an option

Contributor

Yep. I agree that Ready leads to misunderstanding.

Contributor Author
@xueli181114 xueli181114 Mar 26, 2026

Does deletion affect generation ?

I doubt this. If generation changes, that means we need to apply the resource and then delete it; what would that mean? From the user's perspective, generation should only mean an update happened.

Do we have to include a new condition for computing Ready

Actually, once the deletion is triggered, the resources shouldn't be Available anymore. I don't think we need to introduce deleted_at to calculate the Ready status. But I am not sure whether Sentinel can send an event once deleted_at is set.

What happens if required adapters report True after deletion

I don't get it. If required adapters still report true, that is either a bug in the adapter or deletion was not invoked.

Should be there "required adapters" for the deletion?

Actually I thought about it and I don't think we should have this terminology for deletion. All adapters should report Applied=False to avoid orphaned resources. But there is one reason to introduce it: some out-of-date adapters that have been deprecated and no longer report statuses are left with stale Applied=True, Available=True, Health=True entries.

does Applied == False mean not applied or deleted

It means the resource does not exist, whether not applied or deleted. If deleted_at is set, that means deleted; otherwise it means not created.

Until deletion adapter finally says Available=True

I am against this: Available means resources are ready for use, and I don't like deletion completion being marked as Available.
I would like to follow k8s conventions.
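The Applied/Available split argued for here can be sketched as a tiny decision function. This is an illustrative Go sketch only; the type and field names are hypothetical, not the framework's actual status schema:

```go
package main

import "fmt"

// Conditions is a hypothetical shape for what an adapter reports;
// the real framework's status schema may differ.
type Conditions struct {
	Applied   bool // resources exist on the target cluster
	Available bool // resources are ready for use
}

// report encodes the stance above: in deletion mode the completion
// signal is Applied=false (resources are gone); Available is never
// used to mark deletion as finished.
func report(deleting, resourcesExist, resourcesReady bool) Conditions {
	if deleting {
		return Conditions{Applied: resourcesExist, Available: false}
	}
	return Conditions{Applied: resourcesExist, Available: resourcesExist && resourcesReady}
}

func main() {
	fmt.Println(report(true, false, false)) // deletion finished: {false false}
	fmt.Println(report(true, true, false))  // cleanup in progress: {true false}
	fmt.Println(report(false, true, true))  // creation healthy: {true true}
}
```

The point of the sketch is that deletion completion and readiness are signalled by different fields, so the aggregation logic never has to interpret Available=True as "deleted".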

Contributor

I doubt this. If generation changes that means we need to apply the resource, then delete the resource. What's the meaning here? From user perspective, generation should only means updating happen.

Is the spec being modified as part of the deletion?

Contributor

I don't get it. if required adapters still reports true that is either a bug in adapter either deletion is not invoked.

Some might; for example, race conditions with tasks that started running before the deletion was kicked off.

Contributor

All adapters should report applied false to avoid resources orphan

This becomes complex: does Applied=False mean not applied or removed? 🤔 While similar, they are semantically different.

Collaborator
@rh-amarin rh-amarin Mar 26, 2026

IMO, similar to how Ready is a bad word for the aggregated status (since it means "reconciled"), we are misusing Available in the adapter reports.

Available=True in an adapter means that "adapter has done its work correctly" and it may or not have relation with k8s being ready

An example of an adapter running a job...

  • The adapter says Available=True meaning the job has run correctly
  • But there is nothing really "available" there.
  • It just means from the point of view of the adapter, the state of the world has been reconciled with the API resource expectation

Does deletion affect generation ?
Yesterday @ciaranRoche said to me: generation changes with the "intention of the user"
So, under that insight, a deletion of an API resource is an intended change to the API resource, so it should update generation

I am not sure if sentinel can sending event once deleted_at set.
idea: the API shows soft_deleted API resources as just regular API resources

  • It is up to GCP/ROSA team what to do with these
    • They can expose or not to the final users
    • They can trigger a cleaning operation
  • A soft_deleted API resource with Ready==False would mean that the world still contains real resources
    • Sentinel will be still able to see these, as they are as normal API resources as any other one


The API implements a two-phase deletion pattern:

1. **Mark for Deletion**: On `DELETE /clusters/{id}`, the API sets `deletion_timestamp` on the cluster record **and all its nodepools**. All records remain in the database and are still queryable.
Collaborator

In our office hours meeting, Jaime mentioned that they would like us to keep the soft deletion and provide a way for cleaning this up, triggered by them.

This way, they have time to investigate if anything goes wrong with some deleted clusters, and the records will still be there.

Contributor

Can we define what is meant by soft delete? This doc's mark-for-deletion is different from a soft delete, where a resource remains in the DB but is not presented in the API.

Collaborator

If we offer a way to delete API resources maybe we don't need to hide soft_deleted API resources.

In fact, Sentinel needs to fetch API resources that are soft_deleted but Ready==False until they transition to True

(And then, Sentinel should not get soft_deleted API resources that are Ready==True older than 30m )

Contributor Author

From my perspective:
When deletion is triggered by the customer, the API shouldn't delete anything; it just needs to set deleted_at. Once the deletion condition is met, the API sets a deleted column to 1 (true) to keep the data in the DB, but list/get cluster should only target deleted=0 clusters.

Contributor

I think we need to decide on this whole soft_delete; we are pretty focused on it, but what are the alternatives?
I would question if the juice is worth the squeeze: we had a deleted_clusters table in CS and I can't remember it ever being used. We mentioned soft_delete in the office hours and Jaime was like, oh that would be nice 😆

We def need to discuss this more IMO, to ensure we do not have tunnel vision on it

- `deletion_timestamp` is the canonical delete intent signal for orchestration and adapter behavior.
- Customer-facing lifecycle state (for example, `Terminating`) is derived from `deletion_timestamp` at read time.
- API should set aggregate `Ready=False` (reason `Deleting`) immediately when `deletion_timestamp` is set so Sentinel's existing not-ready query picks up the resource without Sentinel code changes.
2. **Delete (Hierarchical)**: The API uses layered deletion:
Collaborator

Why not mark the dependent resources (nodepools) with deleted_at as well?
The asynchronous process can happen elsewhere, but from the API point of view, the desired state is already "deleted" for all resources

Contributor Author

It indeed works this way; is the current sentence causing confusion? Let me update it.

- Customer-facing lifecycle state (for example, `Terminating`) is derived from `deletion_timestamp` at read time.
- API should set aggregate `Ready=False` (reason `Deleting`) immediately when `deletion_timestamp` is set so Sentinel's existing not-ready query picks up the resource without Sentinel code changes.
2. **Delete (Hierarchical)**: The API uses layered deletion:
- **Nodepool level**: Each nodepool record is deleted when all its nodepool-level adapters report `Applied=False` + `Health=True` with `observed_time >= deletion_timestamp`
Collaborator

Until now, only Available was used to compute Ready
... and it was hard enough

I think this will make it very complex to reason about, and we should try to stick with using Available if possible.

Contributor Author

I don't think so. Available=False means the CR is not ready for use. Only Applied=False means the resource is gone.

Collaborator

IMO Available means something beyond applying a resource.
To me, Available=False means that the work to be done by the adapter is not yet considered done.
And, the adapter will keep receiving messages until it answers with Available=True

Is "the work to be done by the adapter" creating a CR? maybe yes, but not only that... An adapter could create a job that evaluates something and has to be re-executed until getting a successful result. Say for example a validation job.

When the job is created, Applied=False, the resource is there, but Available doesn't become True until success

### Sentinel Behavior

No changes are required to Sentinel **code logic or CloudEvent format**. The Sentinel:
- Polls the API and evaluates `message_decision` CEL expressions as usual
Collaborator

Right now the API returns the resources with:

  • Ready==False
  • Ready==True && last_updated_time older than 30m

But for clusters that are deleted, it doesn't make sense to retrieve them 30m after they are Ready, if that means that they are effectively deleted....

unless we want to have zombie clusters, of course

Contributor Author

That's why, on the API side, once deleted_at is set the status needs to be set to Ready=False. But if the API doesn't make that change, then Sentinel should do it.

- Does not distinguish between creation and deletion events
- The adapter is responsible for determining the appropriate action based on cluster state

**Operational requirement for nodepool deletion**: Each Sentinel deployment is configured with a single `resource_type` (e.g., `clusters`). For nodepool-level deletion to work, a **separate Sentinel deployment** must be configured with `resource_type: nodepools` to poll and publish nodepool events. Without this, nodepool adapters will not receive deletion events. This is an operational rollout/deployment change, not a Sentinel code change.
Collaborator

I don't think this note is necessary.

Sentinel instances are always required for clusters and nodepools; the deletion use case doesn't alter this IMO.

Contributor Author

Right, this is misleading.

Two additions to the adapter task configuration:

1. **`when_deleting`** (top-level) - A CEL expression evaluated after preconditions. When `true`, the resources phase switches from creation mode to deletion mode.
2. **`deleteOptions`** (per resource) - Defines the Kubernetes delete options (propagation policy and ordering) per resource, aligned with `metav1.DeleteOptions` conventions.
Collaborator

why are these deleteOptions camelCase?

Contributor Author

Oh, it's k8s style; let me change it to follow our config standard.

Contributor Author

Good catch. Renamed deleteOptions → delete_options (snake_case) throughout the doc to be consistent with the rest of the DSL naming convention.


No changes are required to Sentinel **code logic or CloudEvent format**. The Sentinel:
- Polls the API and evaluates `message_decision` CEL expressions as usual
- Publishes the **same CloudEvent format** regardless of cluster phase
Collaborator

IMO we should avoid talking about cluster phase.
We have cluster.status or cluster.status.conditions

Contributor Author

Right, let me remove phase-related words to avoid confusion.

Contributor Author

Fixed — removed all cluster phase and status.phase references. Using cluster.status.conditions and deleted_at instead.

- **Cycle detected** (e.g., A after B, B after A): Abort deletion, report `Health=False` with reason `InvalidDeletionOrder` and message describing the cycle
- **Unknown resource reference** (e.g., `after: [nonExistent]`): Abort deletion, report `Health=False` with reason `InvalidDeletionConfig` and message identifying the unknown reference

These are config errors caught at execution time. The adapter reports `Applied=True` (resources still exist) and `Health=False` (config error) so the API does not proceed with deletion.
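The cycle and unknown-reference checks described above amount to validating a dependency graph before any delete call is issued. A minimal Go sketch (shapes are illustrative; every resource appears as a map key, and the value lists the resources that must be deleted before it):

```go
package main

import "fmt"

// validateDeletionOrder checks per-resource `after` references for
// unknown names and cycles using a depth-first walk with a
// "currently visiting" marker to detect back edges.
func validateDeletionOrder(after map[string][]string) error {
	const (
		visiting = 1
		done     = 2
	)
	state := map[string]int{}
	var visit func(name string) error
	visit = func(name string) error {
		switch state[name] {
		case visiting:
			return fmt.Errorf("InvalidDeletionOrder: cycle involving %q", name)
		case done:
			return nil
		}
		state[name] = visiting
		for _, dep := range after[name] {
			if _, ok := after[dep]; !ok {
				return fmt.Errorf("InvalidDeletionConfig: %q references unknown resource %q", name, dep)
			}
			if err := visit(dep); err != nil {
				return err
			}
		}
		state[name] = done
		return nil
	}
	for name := range after {
		if err := visit(name); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	ok := map[string][]string{"clusterJob": nil, "clusterNamespace": {"clusterJob"}}
	fmt.Println(validateDeletionOrder(ok)) // <nil>: valid ordering
	cyclic := map[string][]string{"a": {"b"}, "b": {"a"}}
	fmt.Println(validateDeletionOrder(cyclic) != nil) // true: cycle detected
}
```

The two error strings mirror the reasons named above (`InvalidDeletionOrder`, `InvalidDeletionConfig`), so the adapter can surface them directly in its `Health=False` report.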
Collaborator

Can this be detected at boot time?

Contributor Author

It can't. For example, two k8s manifests are configured and manifestA can be applied successfully but manifestB can't because it fails k8s validation. And maybe the service offering team configured the Applied CEL expression to only check manifestA.


### Adapter Framework Changes

#### DSL Changes
Collaborator

Is this example only applicable for kubernetes transport or is it also for maestro?

My doubt is whether deleteOptions in the Maestro transport are already specified in the ManifestWork?

Contributor Author

Right, it's only for k8s transport.


Two additions to the adapter task configuration:

1. **`when_deleting`** (top-level) - A CEL expression evaluated after preconditions. When `true`, the resources phase switches from creation mode to deletion mode.
Collaborator

I was under the impression that deletion would be managed by a different adapter than creation.

Having when_deleting keyword means that we are using the same adapter for all the CRUD actions for resources.

Is this the intention?

Contributor Author

There is an alternative: having it in a different adapter. That introduces other complexity; we can discuss it later.

end
```

#### Adapter Status Decision Matrix
Collaborator

My fear is that this can greatly increase the complexity of understanding the system.

Until now, we only take into account Available

Collaborator

Maybe there is a way to reduce the complexity and use only Available for deletion.

Let me elaborate:

  • The user deletes the cluster in the API
  • generation advances, deleted_at is set
  • Ready==false

All adapters will start to receive messages because of Ready==false

If an adapter reads deleted_at set

  • It will answer always Available=False
  • Until the resource is deleted

This way, the computation of the aggregated condition is kept the same.
A deletion, is just another reconciliation by adapters
All required adapters must answer with Available=True, or they will keep receiving messages

Contributor Author

Available=True means resources are ready for use, not whether resources exist. Applied=True means resources exist, and for deletion we are checking whether resources have been cleaned up.

body: '{{ .clusterStatusPayload }}'
```

**Note on CEL expression complexity**: The post-processing expressions use `when_deleting` ternary branching to handle both creation and deletion states in a single config. While this keeps lifecycle logic co-located, it increases expression complexity. To manage this:
Collaborator

An idea to simplify this would be to define two different payloads

One for creation, one for deletion

And then conditionally use one or the other in the post_actions

One concern is that you may want to skip building one payload entirely when it is not the correct time, e.g. when creating, you don't want to build the deletion payload.

A way to avoid this could be to add a when condition to the payloads.

Contributor

I agree with this; adding a when will enable us to write straightforward expressions for each mode instead of nesting them and having them become very verbose and complex.

Contributor Author

That is also an approach suggested by AI. But different payloads seem like another variant of the different-adapter approach. Maybe we can discuss same adapter vs. different adapter first.

| **Framework changes** | Deletion mode switch in executor + deletion strategies | Still needs delete action in executor + resource ordering |
| **Scaling** | Single adapter handles both | Can scale deletion independently |
| **Operational overhead** | Low - one deployment | High - doubled deployments, monitoring, ConfigMaps |
| **Complexity** | Logic complexity in config (CEL expressions for status) | Infrastructure complexity (more pods, subscriptions) |
Collaborator

There is the logic to compute the aggregated Ready condition

With separated adapters, the deletion is an additional condition that the API computes.

With a single adapter, the complexity now is greater since it implies taking into account Available, Applied, Health

Contributor Author

They are calculated for different purposes.

Contributor
@ciaranRoche ciaranRoche left a comment

Just to clarify, are there any cases of race conditions if a 'create' or 'update' adapter is running and a 'delete' adapter reports conditions before them?

I am assuming that with the incremented generation the 'create'/'update' conditions would be dropped as stale, but I think it is worth calling this out in the doc to cover in-flight status reports.


Two additions to the adapter task configuration:

1. **`when_deleting`** (top-level) - A CEL expression evaluated after preconditions. When `true`, the resources phase switches from creation mode to deletion mode.
Collaborator

Does it make sense to have this keyword at the resource level?

Can different resources have different triggers for deletion?

Contributor Author

I don't think so. Since each resource has its own Sentinel shard and each adapter should serve only one resource kind on the API side, they must share the same lifecycle. Per-resource level makes it complicated to manage.

5. If `when_deleting` is false or not set → resources phase runs in creation mode
6. Post-processing (always runs)

#### deleteOptions
Collaborator

I wonder if for resource updates we will offer an option to re-create the resource, by deleting first and later creating it again.

In that case... will deleteOptions still apply? e.g. the order of deleting/creating

Contributor Author

I am considering removing support for recreate. Maestro doesn't support it; if the service offering team want a cron job they should use job TTL. Recreate may cause resource leak problems. I would hold off on that until we get any requirements. I didn't see any conflicts with delete_options here.

Contributor

Unofficially I was given requirements for recreate: they want full lifecycle control of adapter task resources.
For example, if they are creating a job, they want to be able to specify when to create another one, how many jobs to retain, when to clean up, how to name them, etc.

Contributor

I have yet to put them into a JIRA.

resources:
- name: clusterJob
manifest: ...
deleteOptions:
Collaborator

idea: name this also when_deleting

Then, it is obvious the relationship between the root level when_deleting and these options

Collaborator

This could be also handy in the post.build phase.

Imagine having:

post:
   payloads:
     - name: clusterStatusPayload
        build:
             ....
        when_deleting:
             ....

Yes, maybe there is an ugly asymmetry with build/when_deleting but... the resulting CEL expressions will be much clearer

alternatively:

post:
   payloads:
     - name: clusterStatusPayload
        build:
             ....
   when_deleting:
      - name: clusterStatusPayload    # this overrides previous with same name
        build:
         ....

Contributor Author

Different payloads look like a different-adapter behavior; we can discuss more later.

### What We Lose / What Gets Harder
- Resources phase executor becomes more complex (creation mode + deletion mode)
- Post-processing CEL expressions become more complex (must handle both creation and deletion states)
- `deleteOptions.after` ordering requires building a dependency graph in the executor
Collaborator

idea: what if we add conditions for each resource deletion?

Instead of having root level when_deleting we make it per resource.

resources:
  - name: clusterNamespace
    manifest:
      ...
    discovery:
      by_name: '{{ .clusterId | lower }}'
    when_deleting:
       expression: deleted_at != "" && !resources.clusterConfigMap && !resources.clusterJob
       propagationPolicy: Background
    

  - name: clusterConfigMap
    manifest:
     ...
    discovery:
      by_name: cluster-{{ .clusterId }}-config
      namespace: '{{ .clusterId | lower }}'
    when_deleting:
       expression: deleted_at != "" && !resources.clusterConfigMap
       propagationPolicy: Background

For this to work, there should be a discovery pass at the beginning of the resources phase to initially populate the resources.xx variables

Contributor Author

That makes sense.



- API should set aggregate `Ready=False` (reason `Deleting`) immediately when `deletion_timestamp` is set so Sentinel's existing not-ready query picks up the resource without Sentinel code changes.
2. **Delete (Hierarchical)**: The API uses layered deletion:
- **Nodepool level**: Each nodepool record is deleted when all its nodepool-level adapters report `Applied=False` + `Health=True` with `observed_time >= deletion_timestamp`
- **Cluster level**: The cluster record is deleted when all cluster-level adapters report `Applied=False` + `Health=True` with `observed_time >= deletion_timestamp` **AND** all nodepool records have been deleted
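The layered gate above can be sketched as a pair of predicates; a sketch only, assuming statuses carry `Applied`/`Health` condition values and uniformly formatted RFC 3339 `observed_time` strings (which then compare correctly as plain strings):

```python
def adapter_done(status: dict, deletion_timestamp: str) -> bool:
    # An adapter has finished cleanup when it reports the resource gone
    # (Applied=False), itself healthy (Health=True), and the observation
    # postdates the delete intent.
    return (status["Applied"] == "False"
            and status["Health"] == "True"
            and status["observed_time"] >= deletion_timestamp)

def nodepool_deletable(adapter_statuses: list, deletion_timestamp: str) -> bool:
    return all(adapter_done(s, deletion_timestamp) for s in adapter_statuses)

def cluster_deletable(cluster_statuses: list, remaining_nodepools: list,
                      deletion_timestamp: str) -> bool:
    # The cluster record goes last: every nodepool record must already be gone.
    return (not remaining_nodepools
            and all(adapter_done(s, deletion_timestamp) for s in cluster_statuses))
```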
Contributor:

We focus heavily on cluster/node pool relations here. I think we should be way more generic and define how we handle resources and their sub-resources, as sub-resources could have sub-resources of their own.

I think we need to discuss more around the deletion contract between parent and child resources and the possibility of future grandchildren. Right now this document is too focused on clusters/node pools, which means that as our API grows we would need similar docs for new resources.

Contributor Author:

This doc currently focuses on what we implemented in our API, but I can make it generic to avoid further changes.


The API waits for adapters with **existing status entries** for the cluster. An adapter that never reported status has no resources to clean up.

**Important**: "No status entries" does not mean "delete immediately." The API should wait for at least one Sentinel reconciliation cycle after `deletion_timestamp` is set before concluding that no adapters will report. This accounts for the case where an adapter ACKs a broker message but fails to POST its status — without a grace period, the API could delete the record before the adapter retries its status report. A configurable grace period (e.g., 30 seconds after `deletion_timestamp`) ensures adapters have time to report before the API proceeds.
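The grace period reduces to a simple predicate; a sketch under the assumptions in the paragraph above (30-second default, configurable):

```python
from datetime import datetime, timedelta, timezone

GRACE_PERIOD = timedelta(seconds=30)  # configurable

def may_treat_silence_as_done(deletion_timestamp, now=None):
    # "No status entries" only means "nothing to clean up" once the grace
    # period has elapsed, giving an adapter that ACKed the broker message
    # but failed to POST its status time to retry the report.
    now = now or datetime.now(timezone.utc)
    return now >= deletion_timestamp + GRACE_PERIOD
```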
Contributor:

How does this API-side wait happen? Is it a synchronous check on every poll, or some background routine?

Contributor Author:

This is something that won't even happen. Removing it.


```yaml
# Evaluated after preconditions, using captured variables
when_deleting: "deletionTimestamp != ''"
```
Contributor:

Shouldn't we change this to deletion_mode?

when_deleting reads like an event hook ("when deleting happens, do X"), but it's actually a boolean predicate that switches the executor's operating mode.

deletion_mode fits better because that's exactly what the doc describes; the design already uses the term "deletion mode" repeatedly. It also pairs naturally with the CEL expression: `deletion_mode: "deletionTimestamp != ''"`.

Contributor Author:

Good idea. I asked AI to help generate terminology that fits k8s ecosystem conventions, and it gave me when_deleting.
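If the rename lands, the DSL fragment would read as follows (illustrative only; the field name currently in the doc is `when_deleting`):

```yaml
# Same boolean predicate, renamed to reflect that it switches the
# executor's operating mode rather than hooking an event.
deletion_mode: "deletionTimestamp != ''"
```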


### Status Reporting During Deletion

The adapter uses the existing status contract with no new **required** condition types. The three standard conditions (Applied, Available, Health) remain the canonical contract for orchestration and naturally reflect deletion state:
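For illustration, a status report an adapter might post mid-deletion under this contract (field names assumed from the thread, not normative):

```yaml
conditions:
  - type: Applied
    status: "False"    # manifests removed; the resource is gone
  - type: Available
    status: "False"    # ambiguous on its own: also False during provisioning
  - type: Health
    status: "True"     # the adapter itself is healthy; cleanup succeeded
observed_time: "2026-03-25T04:50:00Z"  # must be >= deletion_timestamp to count
```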
Contributor:

I think it's worth clarifying why the deletion gate uses `Applied=False AND Health=True AND observed_time >= deletion_timestamp` while Available, which is explicitly set to False during deletion, is not part of the gate. The doc doesn't explain why.

Maybe add a note:

Suggested change (append after the existing sentence):

> The adapter uses the existing status contract with no new **required** condition types. The three standard conditions (Applied, Available, Health) remain the canonical contract for orchestration and naturally reflect deletion state:
> `Available` is excluded from the deletion gate because `Available=False` is ambiguous: it occurs both during initial provisioning (resources not yet ready) and during deletion (resources being terminated). It doesn't add discriminating signal beyond what `Applied=False` already provides. The `observed_time >= deletion_timestamp` check is what disambiguates "never created" from "successfully deleted."

Contributor Author:

That's a good point. But actually, Available=False only means the manifest is not ready for use; it doesn't mean it's gone. Applied=False means the resource is gone. For deletion, the intention is to check whether the resource is gone, not whether it is ready for use. I will explain it.

@xueli181114 (Contributor Author):

> The 'create', 'update' conditions would be dropped as stale, but I think it is worth calling out in this doc to cover in-flight status reports.

Create/update should be forbidden on the API side once the cluster has deleted_at or deletion_timestamp set. And in this SPIKE the picked approach has creation/update and deletion in the same adapter, which won't run into this problem. With the other approach, the precondition should skip that. But the precondition is owned by the service offering team; the adapter framework won't touch the detailed cluster body, which is why a single config makes these kinds of things easier.

@rh-amarin (Collaborator):

> The 'create', 'update' conditions would be dropped as stale, but I think it is worth calling out in this doc to cover in-flight status reports.
>
> Create/update should be forbidden on the API side once the cluster has deleted_at or deletion_timestamp set. And in this SPIKE the picked approach has creation/update and deletion in the same adapter, which won't run into this problem. With the other approach, the precondition should skip that. But the precondition is owned by the service offering team; the adapter framework won't touch the detailed cluster body, which is why a single config makes these kinds of things easier.

I think @ciaranRoche means an adapter response from an update being delayed until after the deletion:

- User changes the API resource @gen=2
  - Adapter starts processing @gen=2
- User deletes the API resource @gen=3
  - Adapter starts deleting @gen=3
- API receives the report for @gen=3
  - Accepted; Ready is computed, as the report is for the current generation
- API receives the report for @gen=2
  - Discarded, as it is a report from an older generation

This is how the aggregation logic works today.
And this is why making deletion increase the generation is appealing: it works in the current flow.
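The aggregation rule described above amounts to a one-line staleness check; a sketch with assumed field names:

```python
def accept_report(current_generation: int, report: dict) -> bool:
    # Reports for older generations are discarded, so a delete that bumps
    # the generation automatically invalidates in-flight create/update
    # reports from the previous generation.
    return report["observed_generation"] >= current_generation
```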

Consolidate the deletion-flow documentation into a single commit with Kubernetes-aligned terminology and clearer API/Sentinel/adapter lifecycle guidance, including the latest wording cleanup.

Made-with: Cursor