HYPERFLEET-560 - doc: add adapter deletion flow design #110

Open
xueli181114 wants to merge 1 commit into openshift-hyperfleet:main from xueli181114:HYPERFLEET-560

Conversation


@xueli181114 xueli181114 commented Mar 25, 2026

Summary

  • Design document for the deletion workflow between API and Adapters
  • Two-phase deletion: API marks for deletion (soft delete), adapters clean up K8s resources, API deletes record after all adapters confirm
  • Minimal DSL changes: invoke_deletion (top-level CEL expression) and deleteOption (per resource with propagationPolicy and after)
  • Propagation policies align with K8s metav1.DeletionPropagation: Background, Foreground, Orphan
  • API delete signal requires three checks: Applied=False AND Health=True AND observed_time >= deletion_timestamp
  • 13 edge cases addressed (error reporting, stuck deletion, multi-generation resources, API mutation rejection, etc.)
  • Alternative approach documented: separate deletion adapter with comparison table

Test plan

  • Review design doc for completeness and correctness
  • Validate Mermaid diagrams render correctly on GitHub
  • Confirm DSL changes are minimal and backward compatible

Relates to: HYPERFLEET-560

Summary by CodeRabbit

  • Documentation
    • Added a draft design describing a two-phase, adapter-driven deletion workflow for Clusters and NodePools: API delete returns 202 Accepted and marks resources Terminating while adapters perform cleanup and signal completion.
    • Defines per-resource propagation policies and ordering, deletion-mode behavior, clear completion/gating signals, edge-case rules (not-found, conflicts, stale signals), and observability/metrics for progress and stuck detection.

@openshift-ci openshift-ci bot requested review from jsell-rh and pnguyen44 March 25, 2026 04:44

openshift-ci bot commented Mar 25, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign xueli181114 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


coderabbitai bot commented Mar 25, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds a draft design specifying a two‑phase deletion workflow for Clusters and NodePools: DELETE /clusters/{id} sets deletion_timestamp, transitions to Terminating, and returns 202 Accepted while records are retained. Sentinel behavior and CloudEvents remain unchanged; deletion intent is derived from API state.

The adapter DSL gains a top‑level when_deleting CEL expression and per‑resource deleteOptions (Kubernetes propagationPolicy: Background|Foreground|Orphan and after dependency ordering). Executors enter deletion mode, compute a topological deletion order, delete resources honoring propagationPolicy (skipping Orphan from blocking), run post‑processing, and report status.

Deletion completion uses the adapter status contract: deletion confirmed when Applied=False AND Health=True AND observed_time >= deletion_timestamp. NodePool and Cluster records are garbage‑collected only once these signals and hierarchical gating are satisfied; "not found" is treated as success and API rejects mutations with 409 while deletion_timestamp is set.
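The "compute a topological deletion order" step could look like the following sketch: Kahn's algorithm over the `deleteOptions.after` edges, failing fast on unknown references and cycles instead of hanging mid-deletion. The function name and error strings are illustrative, not part of the design; the resource names reuse the doc's `clusterJob`/`clusterConfigMap` examples.

```go
package main

import "fmt"

// deletionOrder topologically sorts resources so that each resource is
// deleted only after everything named in its "after" list. It validates the
// dependency graph up-front and fails fast on invalid input.
func deletionOrder(after map[string][]string) ([]string, error) {
	indeg := make(map[string]int, len(after))
	next := make(map[string][]string) // dependency -> resources deleted after it
	for name := range after {
		indeg[name] = 0
	}
	for name, deps := range after {
		for _, dep := range deps {
			if _, ok := after[dep]; !ok {
				return nil, fmt.Errorf("invalid dependency: %q references unknown resource %q", name, dep)
			}
			indeg[name]++
			next[dep] = append(next[dep], name)
		}
	}
	var queue, order []string
	for name, d := range indeg {
		if d == 0 {
			queue = append(queue, name)
		}
	}
	for len(queue) > 0 {
		n := queue[0]
		queue = queue[1:]
		order = append(order, n)
		for _, m := range next[n] {
			indeg[m]--
			if indeg[m] == 0 {
				queue = append(queue, m)
			}
		}
	}
	if len(order) != len(after) {
		// some resources never reached in-degree 0: a dependency cycle
		return nil, fmt.Errorf("cycle detected in deleteOptions.after")
	}
	return order, nil
}

func main() {
	order, err := deletionOrder(map[string][]string{
		"clusterJob":       nil,
		"clusterConfigMap": {"clusterJob"}, // delete the ConfigMap after the Job
	})
	fmt.Println(order, err) // [clusterJob clusterConfigMap] <nil>
}
```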

Sequence Diagram(s)

sequenceDiagram
    autonumber
    actor User
    participant API as API/Controller
    participant Sentinel
    participant Executor as Adapter Framework (Executor)
    participant Adapter as Adapter/Cloud
    participant DB as Database

    User->>API: DELETE /clusters/{id} (set deletion_timestamp, phase=Terminating) / returns 202
    API->>Sentinel: emit reconcile CloudEvent
    Sentinel->>Executor: poll & deliver reconcile
    Executor->>Adapter: evaluate top-level when_deleting CEL -> enter deletion mode
    Executor->>Adapter: compute deletion order from deleteOptions.after
    loop for each resource in topological order
        Executor->>Adapter: request delete (respect propagationPolicy)
        Adapter->>Executor: update status (Applied=False / Health=True / observed_time)
    end
    Executor->>Sentinel: publish status updates
    Sentinel->>API: reconcile observes delete signal (Applied=False && Health=True && observed_time >= deletion_timestamp)
    API->>DB: garbage-collect NodePool records when nodepool-level signals met
    API->>DB: garbage-collect Cluster record when all adapters & nodepools are deleted

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely describes the main change: adding a design document for adapter deletion flow, directly matching the PR's primary objective. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (1)
hyperfleet/components/adapter/framework/adapter-deletion-flow-design.md (1)

241-242: Deletion ordering needs explicit handling for invalid dependency graphs.

Topological sorting is mentioned, but behavior for cycles or unknown after references is undefined. Add explicit fail-fast semantics (status/error reason) so misconfigurations don’t cause silent stuck deletions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hyperfleet/components/adapter/framework/adapter-deletion-flow-design.md`
around lines 241 - 242, The deletion executor's topological sorting using
deleteOption.after must validate the dependency graph up-front and fail fast on
invalid graphs: add a validation step in the executor (the component that builds
the topological order) that detects unknown `after` references and cycles before
any deletions, and on detection immediately mark the affected resources'
deletion status with an explicit error/reason (e.g., "invalid dependency:
missing reference X" or "cycle detected in deleteOption.after"), log the
failure, and abort execution rather than proceeding or hanging; ensure the
validation surfaces the offending resource identifiers so operators can fix
misconfigurations.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@hyperfleet/components/adapter/framework/adapter-deletion-flow-design.md`:
- Around line 107-109: The deletion gate is inconsistent: change the simple "all
adapters report Applied=False" rule to the canonical, stricter gate used later
so implementations match; update the DELETE flow description (the step that
currently says "when all adapters report Applied=False") to state that the API
deletes the cluster only when every adapter status for that cluster has
Applied=False AND Health=True and the adapter's observed_time is >= the
cluster's deletion_timestamp, and ensure the terms deletion_timestamp,
status.phase (Terminating), Applied, Health, and observed_time are referenced
exactly as elsewhere in the doc so both sections read the same.
- Around line 507-509: Update the deletion criteria in the "API deletes
nodepools" and "API deletes cluster" bullets so they match the API delete-signal
safety gate: require adapters to report Applied=False AND Health=True, and
verify the adapter's post-delete timestamp is fresh (i.e., within the configured
grace window) before deleting records; change the two lines that currently
reference only `Applied=False` to include `Health=True` and the post-delete
timestamp freshness check so nodepool deletion (the nodepool-level adapter
condition) and cluster deletion (all adapters including cluster-level) follow
the same safe gating logic.
- Around line 222-223: Update the documentation to correct the `Orphan`
propagation description so it matches Kubernetes `metav1.DeletionPropagation`
semantics: replace the diagram node text `K[Skip - leave resource as-is]` and
the explanatory sentence at the other occurrence with wording that `Orphan`
deletes the owner resource while preserving dependents (dependents become
orphaned), not "skip/do nothing" or "leave resource as-is"; ensure the term
`Orphan` and `metav1.DeletionPropagation` are used to make the intended
semantics clear.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ae6754ea-c2cc-40a0-9185-b6ceca336446

📥 Commits

Reviewing files that changed from the base of the PR and between a693ad8 and b78790c.

📒 Files selected for processing (1)
  • hyperfleet/components/adapter/framework/adapter-deletion-flow-design.md

@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (2)
hyperfleet/components/adapter/framework/adapter-deletion-flow-design.md (2)

124-124: ⚠️ Potential issue | 🟠 Major

Unify the final-delete gate everywhere to the same canonical criteria

Line 124, Lines 364-373, and Lines 528-529 weaken or partially define the gate compared with the stricter sections. Please use one exact rule everywhere: deletion_timestamp set, Applied=False, Health=True, and observed_time >= deletion_timestamp (per adapter).

Suggested doc edit
-2. **Delete**: The API monitors adapter statuses. When **all** adapters report `Applied=False` for a cluster that has `deletion_timestamp` set, the API deletes the cluster record and all associated data (nodepools, adapter statuses).
+2. **Delete**: The API monitors adapter statuses. It deletes only when, for all required adapters, `Applied=False` AND `Health=True` AND `observed_time >= deletion_timestamp` for a cluster with `deletion_timestamp` set. Then it deletes the cluster record and all associated data (nodepools, adapter statuses).

-The API determines when to delete by checking three things:
+The API determines when to delete by checking four things:
 ...
-3. **All adapters report `Health=True`** (confirms the adapter status is reliable)
+3. **All adapters report `Health=True`** (confirms the adapter status is reliable)
+4. **All adapters have `observed_time >= deletion_timestamp`** (confirms statuses are post-delete and not stale)

-6. API deletes nodepools when their adapters report `Applied=False`
-7. API deletes cluster when all adapters (cluster + nodepool level) report `Applied=False`
+6. API deletes nodepools when required nodepool adapters report `Applied=False AND Health=True` and `observed_time >= nodepool.deletion_timestamp`
+7. API deletes cluster when all required adapters (cluster + nodepool level) report `Applied=False AND Health=True` and `observed_time >= cluster.deletion_timestamp`

Also applies to: 364-373, 528-529

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hyperfleet/components/adapter/framework/adapter-deletion-flow-design.md` at
line 124, The final-delete gate is inconsistent across the document; unify all
checks to a single canonical rule: require the cluster's deletion_timestamp to
be set and for each adapter require Applied=False, Health=True, and
observed_time >= deletion_timestamp; update every occurrence that currently
weakens or partially defines the gate so it uses this exact criteria (use the
terms deletion_timestamp, Applied=False, Health=True, observed_time >=
deletion_timestamp verbatim to ensure consistency).

102-103: ⚠️ Potential issue | 🟠 Major

Orphan propagation is documented with incorrect behavior

Lines 238/251 (and downstream examples at Line 102 and Line 294) describe Orphan as “skip/do nothing,” which conflicts with Kubernetes metav1.DeletionPropagation semantics. Orphan deletes the owner resource and preserves dependents (orphaned).

Suggested doc edit
-    D -->|Orphan| K[Skip - leave resource as-is]
+    D -->|Orphan| K[Delete owner; preserve dependents (orphan)]

-| `Orphan` | Do nothing, leave resource as-is | Resources managed by other systems or intentionally preserved |
+| `Orphan` | Delete owner resource and preserve dependents (dependents become orphaned). | Preserve child resources while removing the owner |

-        Note over Adapter: serviceAccount: Orphan, skip
+        Note over Adapter: serviceAccount: Orphan (delete owner, preserve dependents)

-    D3 -->|Orphan| D6[Skip]
+    D3 -->|Orphan| D6[Delete owner; preserve dependents]

Also applies to: 238-253, 294-295
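To make the corrected `Orphan` semantics concrete, here is a toy simulation of the eventual outcome of each propagation policy. This is not the real Kubernetes garbage collector (which works asynchronously via ownerReferences); the `Resource` type and `deleteWithPolicy` helper are hypothetical stand-ins that only show what ultimately remains.

```go
package main

import "fmt"

// Resource is a toy object with an optional owner reference.
type Resource struct {
	Name  string
	Owner string // "" means no owner
}

// deleteWithPolicy simulates the end state after deleting `owner` under a
// metav1.DeletionPropagation-like policy, returning what survives.
func deleteWithPolicy(objs []Resource, owner, policy string) []Resource {
	var out []Resource
	for _, r := range objs {
		switch {
		case r.Name == owner:
			// The owner is deleted under ALL three policies,
			// including Orphan (Orphan is not "skip").
		case r.Owner == owner && policy == "Orphan":
			// Orphan preserves dependents; they become orphaned.
			r.Owner = ""
			out = append(out, r)
		case r.Owner == owner:
			// Background/Foreground eventually remove dependents too
			// (they differ only in ordering, not in the end state).
		default:
			out = append(out, r)
		}
	}
	return out
}

func main() {
	objs := []Resource{{Name: "deployment"}, {Name: "replicaset", Owner: "deployment"}}
	left := deleteWithPolicy(objs, "deployment", "Orphan")
	fmt.Printf("%+v\n", left) // only the orphaned replicaset remains
}
```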

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hyperfleet/components/adapter/framework/adapter-deletion-flow-design.md`
around lines 102 - 103, The doc incorrectly describes the DeletionPropagation
value "Orphan" as "skip/do nothing"; update all occurrences of "Orphan" in
adapter-deletion-flow-design.md (including the examples mentioning
serviceAccount: Orphan and the sections referencing metav1.DeletionPropagation)
to state that "Orphan" deletes the owner resource while preserving dependents
(i.e., orphaning them), and adjust any example text or diagrams that imply
"skip" behavior so they reflect the correct semantics of deleting owner +
preserving dependents.

ℹ️ Review info
Run ID: fa20e0fd-bc6c-4c19-a424-fc99f8673441

📥 Commits

Reviewing files that changed from the base of the PR and between b78790c and c72f78b.

📒 Files selected for processing (1)
  • hyperfleet/components/adapter/framework/adapter-deletion-flow-design.md

@xueli181114 xueli181114 force-pushed the HYPERFLEET-560 branch 2 times, most recently from 87629c6 to 1bc2c55 Compare March 25, 2026 06:17
@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (2)
hyperfleet/components/adapter/framework/adapter-deletion-flow-design.md (2)

374-375: ⚠️ Potential issue | 🟠 Major

Deletion gate criteria are inconsistent across sections.

Line 374 weakens the rule to Applied=False + deletion_timestamp, and the timeline checks at Line 114/Line 117 and Line 590/Line 593 omit observed_time >= deletion_timestamp. The canonical gate elsewhere is stricter (Applied=False AND Health=True AND observed_time >= deletion_timestamp). Keep one rule everywhere to prevent premature deletion from stale statuses.

Also applies to: 114-117, 590-594

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hyperfleet/components/adapter/framework/adapter-deletion-flow-design.md`
around lines 374 - 375, The deletion-gate is inconsistent: replace the weaker
`Applied=False` + `deletion_timestamp` variant and the timeline checks that omit
the timestamp with the canonical rule everywhere; require `Applied=False AND
Health=True AND observed_time >= deletion_timestamp` in all places that define
or evaluate deletion eligibility (the paragraph that currently uses
`Applied=False` + `deletion_timestamp`, the timeline checks, and the canonical
gate text) so deletion is only allowed when the resource is marked not applied,
healthy, and the observed status is at-or-after the cluster deletion timestamp.

248-249: ⚠️ Potential issue | 🟠 Major

Orphan propagation is still defined incorrectly.

Line 248/Line 261 describe Orphan as skip/do-nothing, which conflicts with Kubernetes metav1.DeletionPropagation (owner is deleted, dependents are preserved/orphaned). This is a behavioral mismatch in the DSL contract.

Suggested doc fix
-    D -->|Orphan| K[Skip - leave resource as-is]
+    D -->|Orphan| K[Delete owner, preserve dependents (orphan)]

-| `Orphan` | Do nothing, leave resource as-is | Resources managed by other systems or intentionally preserved |
+| `Orphan` | Delete the owner resource while preserving dependents (they become orphaned). | Preserve dependent resources while removing the owner |

Also applies to: 261-263

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hyperfleet/components/adapter/framework/adapter-deletion-flow-design.md`
around lines 248 - 249, Update the DSL doc and diagram so the "Orphan"
propagation matches Kubernetes metav1.DeletionPropagation semantics: change any
description that says "skip/do-nothing" or "leave resource as-is" (e.g., the
diagram node labeled K[Skip - leave resource as-is] and the prose around the
'Orphan' keyword) to state that the owner is deleted but dependents are
preserved/orphaned (dependents' ownerReferences are removed/left intact per K8s
behavior); ensure all mentions of Orphan (including the block currently labeled
"Orphan" at the diagram and the text blocks explaining propagation) reflect
"owner deleted, dependents preserved/orphaned" rather than "no action."

ℹ️ Review info
Run ID: f5a70992-699a-4bbd-a1a9-21476bcd3aaf

📥 Commits

Reviewing files that changed from the base of the PR and between c72f78b and 87629c6.

📒 Files selected for processing (1)
  • hyperfleet/components/adapter/framework/adapter-deletion-flow-design.md

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
hyperfleet/components/adapter/framework/adapter-deletion-flow-design.md (2)

114-119: ⚠️ Potential issue | 🟠 Major

Deletion gate is still weaker in diagrams than the canonical API signal.

Line 114 and Line 117 (and again Line 596 and Line 599) only gate on Applied=False + Health=True, but the doc’s safety rule also requires observed_time >= deletion_timestamp. This inconsistency can drive premature deletion from stale statuses.

Please align these diagram/flow steps with the canonical gate used elsewhere.

Also applies to: 596-600

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hyperfleet/components/adapter/framework/adapter-deletion-flow-design.md`
around lines 114 - 119, The deletion gates in the sequence diagram currently
check only "Applied=False AND Health=True"; update those gate conditions (the
lines showing "Applied=False AND Health=True?") to include the canonical
temporal check so they read "Applied=False AND Health=True AND observed_time >=
deletion_timestamp?" and apply this change to both occurrences mentioned (the
cluster/nodepool gate at lines ~114-119 and the duplicate at ~596-600) so the
diagram matches the API's safety rule.

248-249: ⚠️ Potential issue | 🟠 Major

Orphan propagation is documented with incorrect Kubernetes semantics.

Line 248 and Line 261 describe Orphan as “skip/do nothing,” but Kubernetes metav1.DeletionPropagation=Orphan means delete the owner while preserving dependents (which become orphaned). This also conflicts with Line 263’s “matches metav1” statement.

Also applies to: 261-263

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hyperfleet/components/adapter/framework/adapter-deletion-flow-design.md`
around lines 248 - 249, The diagram and text incorrectly state that "Orphan"
means "skip/do nothing"; update all occurrences of the "Orphan" propagation
description (the diagram node labeled "Orphan" / text that says "Skip - leave
resource as-is" and the statement that "matches metav1") to accurately reflect
Kubernetes semantics: metav1.DeletionPropagation=Orphan causes deletion of the
owner while preserving dependents (dependents are left orphaned), not skipping
deletion; ensure consistency across the related descriptions that currently
mention lines 248, 261–263 so the diagram and the prose explicitly say "delete
owner, preserve dependents (dependents become orphaned) — aligns with
metav1.DeletionPropagation=Orphan."
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@hyperfleet/components/adapter/framework/adapter-deletion-flow-design.md`:
- Around line 514-516: The example Applied condition uses invoke_deletion ?
(has(resources.clusterJob) ? "True" : "False") :
(has(resources.clusterConfigMap.metadata.resourceVersion) ? "True" : "False"),
which can mark deletion complete prematurely; update the predicate so that in
deletion mode it checks that all managed resources are absent (e.g., verify
!has(resources.clusterJob) AND !has(resources.<otherManagedResource>) etc.,
excluding only explicit Orphan resources) rather than only the absence of
resources.clusterJob, and mirror the same comprehensive absence check used by
your deletion-complete definition for consistency.


ℹ️ Review info
Run ID: db0d6340-e993-41a7-8a35-304ae0699d80

📥 Commits

Reviewing files that changed from the base of the PR and between 87629c6 and 1bc2c55.

📒 Files selected for processing (1)
  • hyperfleet/components/adapter/framework/adapter-deletion-flow-design.md

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@hyperfleet/components/adapter/framework/adapter-deletion-flow-design.md`:
- Around line 263-264: The doc's "Orphan" semantics are inconsistent with
Kubernetes: decide whether the adapter's "Orphan" means skip deletion (rename
node and text to "Skip" or "Preserve") or actually perform deletion using
Kubernetes semantics (call Delete/propagationPolicy: Orphan so owner is deleted
and dependents keep cleared ownerReferences); then update the diagram node D
-->|Orphan| K[Skip - leave resource as-is], the surrounding text that references
metav1.DeletionPropagation and lines mentioning "do nothing", and any examples
to reflect the chosen behavior (if you choose to align with K8s also mention
propagationPolicy: Orphan and ownerReferences clearing; if you choose skip,
rename symbols from Orphan to Skip/Preserve and state it does not call
Kubernetes delete).

ℹ️ Review info
Run ID: 673c0856-8483-40e4-bd43-8591d8ef185b

📥 Commits

Reviewing files that changed from the base of the PR and between 1bc2c55 and 3c96851.

📒 Files selected for processing (1)
  • hyperfleet/components/adapter/framework/adapter-deletion-flow-design.md

@xueli181114 xueli181114 force-pushed the HYPERFLEET-560 branch 2 times, most recently from 04c7af2 to 85f615e Compare March 25, 2026 08:07

### Overview

The deletion flow extends the existing adapter framework with minimal DSL changes. When the HyperFleet API receives a DELETE request, it soft-deletes the cluster (sets `deletion_timestamp`), and the cluster enters the `Terminating` phase. The existing Sentinel polling and event-driven adapter workflow handles the rest.
Collaborator

Since we have created_at and deleted_at, should we use deleted_at instead of deletion_timestamp?

Contributor

+1, lets keep them aligned here 🙏

Contributor Author

deletion_timestamp is the k8s style, but if we prefer deleted_at I am OK with that. To me, though, deleted_at reads like the timestamp at which deletion finished.

Collaborator

@rh-amarin rh-amarin Mar 26, 2026

As I understand it, deleted_at is when the user deleted the API resource, so in fact it is the deletion-finished timestamp from the user's perspective.
Once a user has deleted an API resource, they should consider it "gone", "kaput", "dead", "over".

The reality will be reconciled later, and that will come in the status.conditions

Contributor Author

Agreed. Renamed deletion_timestamp to deleted_at throughout the doc to match the created_at convention.


## What & Why

**What**: Design the deletion workflow for Clusters and NodePools, defining how the adapter framework handles resource cleanup when a cluster enters the `Terminating` phase.
Collaborator

We don't have phases in the API
I would simply say "when a cluster is soft deleted", or "API resource" if we want to talk in a more generic way from now on.

Contributor

I agree with "API resource" here; I was going to call it out below. I think we need to speak in more generic terms and move away from cluster and node pools. We should focus on API resources, parent and children.

Contributor Author

This is legacy wording. Let me update it to avoid confusion.

Contributor Author

Agreed. Removed all status.phase / "cluster phase" / "Terminating phase" references. Now uses "soft-deleted" / "cluster state" / deleted_at throughout.

1. **Mark for Deletion**: On `DELETE /clusters/{id}`, the API sets `deletion_timestamp` on the cluster record **and all its nodepools**. All records remain in the database and are still queryable.
- `deletion_timestamp` is the canonical delete intent signal for orchestration and adapter behavior.
- Customer-facing lifecycle state (for example, `Terminating`) is derived from `deletion_timestamp` at read time.
- API should set aggregate `Ready=False` (reason `Deleting`) immediately when `deletion_timestamp` is set so Sentinel's existing not-ready query picks up the resource without Sentinel code changes.
Collaborator

This is interesting.
Until now, Ready changed because of:

  • spec change -> increases generation
  • required adapters reporting Available false

So, the Ready condition was a computed value out of the generation + required adapters conditions

Open questions:

  • Does deletion affect generation ?
  • Do we have to include a new condition for computing Ready?
    • e.g. deleted_at is set
  • What happens if required adapters report True after deletion?
    • Ready can not become True because of adapters not managing the deletion
  • Should there be "required adapters" for the deletion?

Should there be "required adapters" for the deletion?
An idea (hack?): to have a required adapter for deletion.

  • When a cluster has not deleted_at, it will return Available=True
    • Meaning, that is ok, Ready can become true and this adapter has "nothing to say there"
  • When a cluster has deleted_at
    • It also updates generation, so Ready=False
    • A message will be triggered
    • This adapter has work to do
    • It will report Available=False while it is doing
    • The resource, will keep Ready=False and triggering
    • Until deletion adapter finally says Available=True
  • The confusing part....
    • Other required adapters, should return Available=True when cluster is deleted....
    • In order to get Ready for the new generation to become True

Contributor

Some great points here @rh-amarin

Should generation increment? I think yes, generation is capturing a new user intent on a resource. It should increment

My take: required adapters will go away in a future version. I think we need more than required adapters to determine top-level state; we will need to improve this aggregation.

To that end I think we need a new condition for deletion; I think anything else will be a pure hack. Current conditions are directly tied to resource creation, and making changes here will get messy over time. For example, does Applied == False mean not applied or deleted? Let's take some time tomorrow to whiteboard a solution here.

Collaborator

To that end I think we need a new condition for deletion, I think anything else will be a pure hack

Well, if you think of a deletion just being a desired state... the whole point of the api-sentinel-adapter is to reconcile with that desired state.

For me, a deleted API resource with Ready==True means that the "real world" has been reconciled with the desired state.

But Ready is really a misleading name here.... Reconciled could be an option

Contributor

Yep. I agree that Ready leads to misunderstanding.

Contributor Author
@xueli181114 xueli181114 Mar 26, 2026

Does deletion affect generation ?

I doubt this. If generation changes, that means we need to apply the resource and then delete it; what would that mean? From the user's perspective, generation should only mean an update happened.

Do we have to include a new condition for computing Ready

Actually, once the deletion is triggered, the resources shouldn't be Available anymore. I don't think we need to introduce deleted_at to calculate the Ready status. But I am not sure whether Sentinel can send an event once deleted_at is set.

What happens if required adapters report True after deletion

I don't get it. If required adapters still report true, that is either a bug in the adapter or deletion was not invoked.

Should be there "required adapters" for the deletion?

Actually I thought about it and I don't think we should have this terminology for deletion. All adapters should report Applied=False to avoid orphaned resources. But there is one reason to introduce it: some out-of-date adapters that have been deprecated and no longer report statuses are left with stale Applied=True, Available=True, Health=True entries.

does Applied == False mean not applied or deleted

It means the resource does not exist, whether not applied or deleted. If deleted_at is set, that means deleted; otherwise it means not created.

Until deletion adapter finally says Available=True

I am against this: Available means resources are ready for use, and I don't like deletion completion being marked as Available.
I would like to follow k8s conventions.
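The Applied/Available split argued for here can be sketched as a tiny decision function. This is an illustrative Go sketch only; the type and field names are hypothetical, not the framework's actual status schema:

```go
package main

import "fmt"

// Conditions is a hypothetical shape for what an adapter reports;
// the real framework's status schema may differ.
type Conditions struct {
	Applied   bool // resources exist on the target cluster
	Available bool // resources are ready for use
}

// report encodes the stance above: in deletion mode the completion
// signal is Applied=false (resources are gone); Available is never
// used to mark deletion as finished.
func report(deleting, resourcesExist, resourcesReady bool) Conditions {
	if deleting {
		return Conditions{Applied: resourcesExist, Available: false}
	}
	return Conditions{Applied: resourcesExist, Available: resourcesExist && resourcesReady}
}

func main() {
	fmt.Println(report(true, false, false)) // deletion finished: {false false}
	fmt.Println(report(true, true, false))  // cleanup in progress: {true false}
	fmt.Println(report(false, true, true))  // creation healthy: {true true}
}
```

The point of the sketch is that deletion completion and readiness are signalled by different fields, so the aggregation logic never has to interpret Available=True as "deleted".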

Contributor

I doubt this. If generation changes that means we need to apply the resource, then delete the resource. What's the meaning here? From user perspective, generation should only means updating happen.

Is the spec being modified as part of the deletion?

Contributor

I don't get it. if required adapters still reports true that is either a bug in adapter either deletion is not invoked.

Some might; for example, race conditions with tasks that started running before the deletion was kicked off.

Contributor

All adapters should report applied false to avoid resources orphan

This becomes complex: does Applied=False mean not applied or removed? 🤔 While similar, they are semantically different.

Collaborator
@rh-amarin rh-amarin Mar 26, 2026

IMO, similar to how Ready is a bad word for the aggregated status (since it means "reconciled"), we are misusing Available in the adapter reports.

Available=True in an adapter means that "adapter has done its work correctly" and it may or not have relation with k8s being ready

An example of an adapter running a job...

  • The adapter says Available=True meaning the job has run correctly
  • But there is nothing really "available" there.
  • It just means from the point of view of the adapter, the state of the world has been reconciled with the API resource expectation

Does deletion affect generation ?
Yesterday @ciaranRoche said to me: generation changes with the "intention of the user"
So, under that insight, a deletion of an API resource is an intended change to the API resource, so it should update generation

I am not sure if sentinel can sending event once deleted_at set.
idea: the API shows soft_deleted API resources as just regular API resources

  • It is up to GCP/ROSA team what to do with these
    • They can expose or not to the final users
    • They can trigger a cleaning operation
  • A soft_deleted API resource with Ready==False would mean that the world still contains real resources
    • Sentinel will be still able to see these, as they are as normal API resources as any other one


The API implements a two-phase deletion pattern:

1. **Mark for Deletion**: On `DELETE /clusters/{id}`, the API sets `deletion_timestamp` on the cluster record **and all its nodepools**. All records remain in the database and are still queryable.
Collaborator

In our office hours meeting, Jaime mentioned that they would like us to keep the soft deletion and provide a way for cleaning this up, triggered by them.

This way, they have time to investigate if anything goes wrong with some deleted clusters, and the records will still be there.

Contributor

Can we define what is meant by soft delete? This doc's mark-for-deletion is different from a soft delete, where a resource remains in the DB but is not presented in the API.

Collaborator

If we offer a way to delete API resources maybe we don't need to hide soft_deleted API resources.

In fact, Sentinel needs to fetch API resources that are soft_deleted but Ready==False until they transition to True

(And then, Sentinel should not get soft_deleted API resources that are Ready==True older than 30m )

Contributor Author

From my perspective:
When deletion is triggered by the customer, the API shouldn't delete anything; it just needs to set deleted_at. Once the deletion condition is met, the API sets a deleted column to 1 (true) to keep the data in the DB, but list/get cluster should only target deleted=0 clusters.

Contributor

I think we need to decide on this whole soft_delete; we are pretty focused on it, but what are the alternatives?
I would question if the juice is worth the squeeze: we had a deleted_clusters table in CS and I can't remember it ever being used. We mentioned soft_delete in the office hours and Jaime was like, oh that would be nice 😆

We def need to discuss this more IMO, to ensure we do not have tunnel vision on it

- `deletion_timestamp` is the canonical delete intent signal for orchestration and adapter behavior.
- Customer-facing lifecycle state (for example, `Terminating`) is derived from `deletion_timestamp` at read time.
- API should set aggregate `Ready=False` (reason `Deleting`) immediately when `deletion_timestamp` is set so Sentinel's existing not-ready query picks up the resource without Sentinel code changes.
2. **Delete (Hierarchical)**: The API uses layered deletion:
Collaborator

Why not mark the dependent resources (nodepools) with deleted_at as well?
The asynchronous process can happen elsewhere, but from the API point of view, the desired state is already "deleted" for all resources

Contributor Author

It indeed works this way; is the current sentence causing confusion? Let me update it.

- Customer-facing lifecycle state (for example, `Terminating`) is derived from `deletion_timestamp` at read time.
- API should set aggregate `Ready=False` (reason `Deleting`) immediately when `deletion_timestamp` is set so Sentinel's existing not-ready query picks up the resource without Sentinel code changes.
2. **Delete (Hierarchical)**: The API uses layered deletion:
- **Nodepool level**: Each nodepool record is deleted when all its nodepool-level adapters report `Applied=False` + `Health=True` with `observed_time >= deletion_timestamp`
Collaborator

Until now, only Available was used to compute Ready
... and it was hard enough

I think this will make it very complex to reason about, and we should try to stick with using Available if possible.

Contributor Author

I don't think so. Available=False means the CR is not ready for use. Only Applied=False means the resource is gone.

Collaborator

IMO Available means something beyond applying a resource.
To me, Available=False means that the work to be done by the adapter is not yet considered done.
And, the adapter will keep receiving messages until it answers with Available=True

Is "the work to be done by the adapter" creating a CR? maybe yes, but not only that... An adapter could create a job that evaluates something and has to be re-executed until getting a successful result. Say for example a validation job.

When the job is created, Applied=False, the resource is there, but Available doesn't become True until success

### Sentinel Behavior

No changes are required to Sentinel **code logic or CloudEvent format**. The Sentinel:
- Polls the API and evaluates `message_decision` CEL expressions as usual
Collaborator

Right now the API returns the resources with:

  • Ready==False
  • Ready==True && last_updated_time older than 30m

But for clusters that are deleted, it doesn't make sense to retrieve them 30m after they are Ready, if that means that they are effectively deleted....

unless we want to have zombie clusters, of course

Contributor Author

That's why, on the API side, once deleted_at is set the status needs to be set to Ready=False. But if the API doesn't make that change, then Sentinel should do it.

- Does not distinguish between creation and deletion events
- The adapter is responsible for determining the appropriate action based on cluster state

**Operational requirement for nodepool deletion**: Each Sentinel deployment is configured with a single `resource_type` (e.g., `clusters`). For nodepool-level deletion to work, a **separate Sentinel deployment** must be configured with `resource_type: nodepools` to poll and publish nodepool events. Without this, nodepool adapters will not receive deletion events. This is an operational rollout/deployment change, not a Sentinel code change.
Collaborator

I don't think this note is necessary.

Sentinel instances are always required for clusters and nodepools; the deletion use case doesn't alter this IMO.

Contributor Author

Right, this is misleading.

Two additions to the adapter task configuration:

1. **`when_deleting`** (top-level) - A CEL expression evaluated after preconditions. When `true`, the resources phase switches from creation mode to deletion mode.
2. **`deleteOptions`** (per resource) - Defines the Kubernetes delete options (propagation policy and ordering) per resource, aligned with `metav1.DeleteOptions` conventions.
Collaborator

why are these deleteOptions camelCase?

Contributor Author

Oh, it's k8s style; let me change it to follow our config standard.

Contributor Author

Good catch. Renamed deleteOptions → delete_options (snake_case) throughout the doc to be consistent with the rest of the DSL naming convention.


No changes are required to Sentinel **code logic or CloudEvent format**. The Sentinel:
- Polls the API and evaluates `message_decision` CEL expressions as usual
- Publishes the **same CloudEvent format** regardless of cluster phase
Collaborator

IMO we should avoid talking about cluster phase.
We have cluster.status or cluster.status.conditions

Contributor Author

Right, let me remove phase-related words to avoid confusion.

Contributor Author

Fixed — removed all cluster phase and status.phase references. Using cluster.status.conditions and deleted_at instead.

- **Cycle detected** (e.g., A after B, B after A): Abort deletion, report `Health=False` with reason `InvalidDeletionOrder` and message describing the cycle
- **Unknown resource reference** (e.g., `after: [nonExistent]`): Abort deletion, report `Health=False` with reason `InvalidDeletionConfig` and message identifying the unknown reference

These are config errors caught at execution time. The adapter reports `Applied=True` (resources still exist) and `Health=False` (config error) so the API does not proceed with deletion.
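The cycle and unknown-reference checks described above amount to validating a dependency graph before any delete call is issued. A minimal Go sketch (shapes are illustrative; every resource appears as a map key, and the value lists the resources that must be deleted before it):

```go
package main

import "fmt"

// validateDeletionOrder checks per-resource `after` references for
// unknown names and cycles using a depth-first walk with a
// "currently visiting" marker to detect back edges.
func validateDeletionOrder(after map[string][]string) error {
	const (
		visiting = 1
		done     = 2
	)
	state := map[string]int{}
	var visit func(name string) error
	visit = func(name string) error {
		switch state[name] {
		case visiting:
			return fmt.Errorf("InvalidDeletionOrder: cycle involving %q", name)
		case done:
			return nil
		}
		state[name] = visiting
		for _, dep := range after[name] {
			if _, ok := after[dep]; !ok {
				return fmt.Errorf("InvalidDeletionConfig: %q references unknown resource %q", name, dep)
			}
			if err := visit(dep); err != nil {
				return err
			}
		}
		state[name] = done
		return nil
	}
	for name := range after {
		if err := visit(name); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	ok := map[string][]string{"clusterJob": nil, "clusterNamespace": {"clusterJob"}}
	fmt.Println(validateDeletionOrder(ok)) // <nil>: valid ordering
	cyclic := map[string][]string{"a": {"b"}, "b": {"a"}}
	fmt.Println(validateDeletionOrder(cyclic) != nil) // true: cycle detected
}
```

The two error strings mirror the reasons named above (`InvalidDeletionOrder`, `InvalidDeletionConfig`), so the adapter can surface them directly in its `Health=False` report.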
Collaborator

Can this be detected at boot time?

Contributor Author

It can't. For example, two k8s manifests are configured and manifestA can be applied successfully but manifestB can't because it fails k8s validation. And maybe the service offering team configured the Applied CEL expression to only check manifestA.


### Adapter Framework Changes

#### DSL Changes
Collaborator

Is this example only applicable for kubernetes transport or is it also for maestro?

My doubt is whether deleteOptions in the Maestro transport are already specified in the ManifestWork?

Contributor Author

Right, it's only for k8s transport.


Two additions to the adapter task configuration:

1. **`when_deleting`** (top-level) - A CEL expression evaluated after preconditions. When `true`, the resources phase switches from creation mode to deletion mode.
Collaborator

I was under the impression that deletion would be managed by a different adapter than creation.

Having when_deleting keyword means that we are using the same adapter for all the CRUD actions for resources.

Is this the intention?

Contributor Author

There is an alternative: having it in a different adapter. That introduces other complexity; we can discuss it later.

end
```

#### Adapter Status Decision Matrix
Collaborator

My fear is that this can greatly increase the complexity of understanding the system.

Until now, we only take into account Available

Collaborator

Maybe there is a way to reduce the complexity and use only Available for deletion.

Let me elaborate:

  • The user deletes the cluster in the API
  • generation advances, deleted_at is set
  • Ready==false

All adapters will start to receive messages because of Ready==false

If an adapter reads deleted_at set

  • It will answer always Available=False
  • Until the resource is deleted

This way, the computation of the aggregated condition is kept the same.
A deletion, is just another reconciliation by adapters
All required adapters must answer with Available=True, or they will keep receiving messages

Contributor Author

Available=True means resources are ready for use, not whether resources exist. Applied=True means resources exist, and for deletion we are checking whether resources have been cleaned up.

body: '{{ .clusterStatusPayload }}'
```

**Note on CEL expression complexity**: The post-processing expressions use `when_deleting` ternary branching to handle both creation and deletion states in a single config. While this keeps lifecycle logic co-located, it increases expression complexity. To manage this:
Collaborator

An idea to simplify this would be to define two different payloads

One for creation, one for deletion

And then conditionally use one or the other in the post_actions

One concern is that you may want to skip building one payload entirely when it is not the correct time, e.g. when creating, you don't want to build the deletion payload.

A way to avoid this could be to add a when condition to the payloads.

Contributor

I agree with this; adding a when will enable us to write straightforward expressions for each mode instead of nesting them and having them become very verbose and complex.

Contributor Author

That is also an approach suggested by AI. But different payloads seem like another variant of the different-adapter approach. Maybe we can discuss same adapter vs. different adapter first.

| **Framework changes** | Deletion mode switch in executor + deletion strategies | Still needs delete action in executor + resource ordering |
| **Scaling** | Single adapter handles both | Can scale deletion independently |
| **Operational overhead** | Low - one deployment | High - doubled deployments, monitoring, ConfigMaps |
| **Complexity** | Logic complexity in config (CEL expressions for status) | Infrastructure complexity (more pods, subscriptions) |
Collaborator

There is the logic to compute the aggregated Ready condition

With separated adapters, the deletion is an additional condition that the API computes.

With a single adapter, the complexity now is greater since it implies taking into account Available, Applied, Health

Contributor Author

They are calculated for different purposes.

Contributor
@ciaranRoche ciaranRoche left a comment

Just to clarify, are there any cases of race conditions if a 'create' or 'update' adapter is running and a 'delete' adapter reports conditions before them?

I am assuming that with the incremented generation the 'create'/'update' conditions would be dropped as stale, but I think it is worth calling this out in the doc to cover in-flight status reports.


Two additions to the adapter task configuration:

1. **`when_deleting`** (top-level) - A CEL expression evaluated after preconditions. When `true`, the resources phase switches from creation mode to deletion mode.
Collaborator

Does it make sense to have this keyword at the resource level?

Can different resources have different triggers for deletion?

Contributor Author

I don't think so. Since each resource has its own Sentinel shard and each adapter should serve only one resource kind on the API side, they must share the same lifecycle. Per-resource level makes it complicated to manage.

5. If `when_deleting` is false or not set → resources phase runs in creation mode
6. Post-processing (always runs)

#### deleteOptions
Collaborator

I wonder if for resource updates we will offer an option to re-create the resource, by deleting first and later creating it again.

In that case... will deleteOptions still apply? e.g. the order of deleting/creating

Contributor Author

I am considering removing support for recreate. Maestro doesn't support it; if the service offering team want a cron job they should use job TTL. Recreate may cause resource leak problems. I would hold off on that until we get any requirements. I didn't see any conflicts with delete_options here.

Contributor

Unofficially I was given requirements for recreate: they want full lifecycle control of adapter task resources.
For example, if they are creating a job, they want to be able to specify when to create another one, how many jobs to retain, when to clean up, how to name them, etc.

Contributor

I have yet to put them into a JIRA.

resources:
- name: clusterJob
manifest: ...
deleteOptions:
Collaborator

idea: name this also when_deleting

Then, it is obvious the relationship between the root level when_deleting and these options

Collaborator

This could be also handy in the post.build phase.

Imagine having:

post:
   payloads:
     - name: clusterStatusPayload
        build:
             ....
        when_deleting:
             ....

Yes, maybe there is an ugly asymmetry with build/when_deleting but... the resulting CEL expressions will be much clearer

alternatively:

post:
   payloads:
     - name: clusterStatusPayload
        build:
             ....
   when_deleting:
      - name: clusterStatusPayload    # this overrides previous with same name
        build:
         ....

Contributor Author

Different payloads look like a different-adapter behavior; we can discuss more later.

### What We Lose / What Gets Harder
- Resources phase executor becomes more complex (creation mode + deletion mode)
- Post-processing CEL expressions become more complex (must handle both creation and deletion states)
- `deleteOptions.after` ordering requires building a dependency graph in the executor
Collaborator

idea: what if we add conditions for each resource deletion?

Instead of having root level when_deleting we make it per resource.

resources:
  - name: clusterNamespace
    manifest:
      ...
    discovery:
      by_name: '{{ .clusterId | lower }}'
    when_deleting:
       expression: deleted_at != "" && !resources.clusterConfigMap && !resources.clusterJob
       propagationPolicy: Background
    

  - name: clusterConfigMap
    manifest:
     ...
    discovery:
      by_name: cluster-{{ .clusterId }}-config
      namespace: '{{ .clusterId | lower }}'
    when_deleting:
       expression: deleted_at != "" && !resources.clusterConfigMap
       propagationPolicy: Background

For this to work, there should be a discovery pass at the beginning of the resources phase to initially populate the resources.xx variables

Contributor Author

That makes sense.



- API should set aggregate `Ready=False` (reason `Deleting`) immediately when `deletion_timestamp` is set so Sentinel's existing not-ready query picks up the resource without Sentinel code changes.
2. **Delete (Hierarchical)**: The API uses layered deletion:
- **Nodepool level**: Each nodepool record is deleted when all its nodepool-level adapters report `Applied=False` + `Health=True` with `observed_time >= deletion_timestamp`
- **Cluster level**: The cluster record is deleted when all cluster-level adapters report `Applied=False` + `Health=True` with `observed_time >= deletion_timestamp` **AND** all nodepool records have been deleted
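The layered gate above can be sketched as a pair of predicates; a sketch only, assuming statuses carry `Applied`/`Health` condition values and uniformly formatted RFC 3339 `observed_time` strings (which then compare correctly as plain strings):

```python
def adapter_done(status: dict, deletion_timestamp: str) -> bool:
    # An adapter has finished cleanup when it reports the resource gone
    # (Applied=False), itself healthy (Health=True), and the observation
    # postdates the delete intent.
    return (status["Applied"] == "False"
            and status["Health"] == "True"
            and status["observed_time"] >= deletion_timestamp)

def nodepool_deletable(adapter_statuses: list, deletion_timestamp: str) -> bool:
    return all(adapter_done(s, deletion_timestamp) for s in adapter_statuses)

def cluster_deletable(cluster_statuses: list, remaining_nodepools: list,
                      deletion_timestamp: str) -> bool:
    # The cluster record goes last: every nodepool record must already be gone.
    return (not remaining_nodepools
            and all(adapter_done(s, deletion_timestamp) for s in cluster_statuses))
```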
Contributor:

We focus heavily on cluster/node pool relations here. I think we should be way more generic and define how we handle resources and their sub-resources, as sub-resources could have sub-resources of their own.

I think we need to discuss more around the deletion contract between parent and child resources and the possibility of future grandchildren. Right now this document is too focused on clusters/node pools, which means that as our API grows we would need similar docs for new resources.

Contributor Author:

This doc currently focuses on what we implemented in our API, but I can make it generic to avoid further changes.


The API waits for adapters with **existing status entries** for the cluster. An adapter that never reported status has no resources to clean up.

**Important**: "No status entries" does not mean "delete immediately." The API should wait for at least one Sentinel reconciliation cycle after `deletion_timestamp` is set before concluding that no adapters will report. This accounts for the case where an adapter ACKs a broker message but fails to POST its status — without a grace period, the API could delete the record before the adapter retries its status report. A configurable grace period (e.g., 30 seconds after `deletion_timestamp`) ensures adapters have time to report before the API proceeds.
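The grace period reduces to a simple predicate; a sketch under the assumptions in the paragraph above (30-second default, configurable):

```python
from datetime import datetime, timedelta, timezone

GRACE_PERIOD = timedelta(seconds=30)  # configurable

def may_treat_silence_as_done(deletion_timestamp, now=None):
    # "No status entries" only means "nothing to clean up" once the grace
    # period has elapsed, giving an adapter that ACKed the broker message
    # but failed to POST its status time to retry the report.
    now = now or datetime.now(timezone.utc)
    return now >= deletion_timestamp + GRACE_PERIOD
```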
Contributor:

How does this API-side wait happen? Is it a synchronous check on every poll, or some background routine?

Contributor Author:

This is something that won't even happen. Removing it.


```yaml
# Evaluated after preconditions, using captured variables
when_deleting: "deletionTimestamp != ''"
```
Contributor:

Shouldn't we change this to deletion_mode?

when_deleting reads like an event hook ("when deleting happens, do X"), but it's actually a boolean predicate that switches the executor's operating mode.

deletion_mode fits better because that's exactly what the doc describes; the design already uses the term "deletion mode" repeatedly. It also pairs naturally with the CEL expression: `deletion_mode: "deletionTimestamp != ''"`.

Contributor Author:

Good idea. I asked AI to help generate terminology that fits k8s ecosystem conventions, and it gave me when_deleting.
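If the rename lands, the DSL fragment would read as follows (illustrative only; the field name currently in the doc is `when_deleting`):

```yaml
# Same boolean predicate, renamed to reflect that it switches the
# executor's operating mode rather than hooking an event.
deletion_mode: "deletionTimestamp != ''"
```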


### Status Reporting During Deletion

The adapter uses the existing status contract with no new **required** condition types. The three standard conditions (Applied, Available, Health) remain the canonical contract for orchestration and naturally reflect deletion state:
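For illustration, a status report an adapter might post mid-deletion under this contract (field names assumed from the thread, not normative):

```yaml
conditions:
  - type: Applied
    status: "False"    # manifests removed; the resource is gone
  - type: Available
    status: "False"    # ambiguous on its own: also False during provisioning
  - type: Health
    status: "True"     # the adapter itself is healthy; cleanup succeeded
observed_time: "2026-03-25T04:50:00Z"  # must be >= deletion_timestamp to count
```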
Contributor:

I think it's worth clarifying why the deletion gate uses `Applied=False AND Health=True AND observed_time >= deletion_timestamp` while Available, which is explicitly set to False during deletion, is not part of the gate. The doc doesn't explain why.

Maybe add a note:

Suggested change (append after the existing sentence):

> The adapter uses the existing status contract with no new **required** condition types. The three standard conditions (Applied, Available, Health) remain the canonical contract for orchestration and naturally reflect deletion state:
> `Available` is excluded from the deletion gate because `Available=False` is ambiguous: it occurs both during initial provisioning (resources not yet ready) and during deletion (resources being terminated). It doesn't add discriminating signal beyond what `Applied=False` already provides. The `observed_time >= deletion_timestamp` check is what disambiguates "never created" from "successfully deleted."

Contributor Author:

That's a good point. But actually, Available=False only means the manifest is not ready for use; it doesn't mean it's gone. Applied=False means the resource is gone. For deletion, the intention is to check whether the resource is gone, not whether it is ready for use. I will explain it.

@xueli181114 (Contributor Author):

> The 'create', 'update' conditions would be dropped as stale, but I think it is worth calling out in this doc to cover in-flight status reports.

Create/update should be forbidden on the API side once the cluster has deleted_at or deletion_timestamp set. And in this SPIKE the picked approach has creation/update and deletion in the same adapter, which won't run into this problem. With the other approach, the precondition should skip that. But the precondition is owned by the service offering team; the adapter framework won't touch the detailed cluster body, which is why a single config makes these kinds of things easier.

@rh-amarin (Collaborator):

> The 'create', 'update' conditions would be dropped as stale, but I think it is worth calling out in this doc to cover in-flight status reports.
>
> Create/update should be forbidden on the API side once the cluster has deleted_at or deletion_timestamp set. And in this SPIKE the picked approach has creation/update and deletion in the same adapter, which won't run into this problem. With the other approach, the precondition should skip that. But the precondition is owned by the service offering team; the adapter framework won't touch the detailed cluster body, which is why a single config makes these kinds of things easier.

I think @ciaranRoche means an adapter response from an update being delayed until after the deletion:

- User changes the API resource @gen=2
  - Adapter starts processing @gen=2
- User deletes the API resource @gen=3
  - Adapter starts deleting @gen=3
- API receives the report for @gen=3
  - Accepted; Ready is computed, as the report is for the current generation
- API receives the report for @gen=2
  - Discarded, as it is a report from an older generation

This is how the aggregation logic works today.
And this is why making deletion increase the generation is appealing: it works in the current flow.
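The aggregation rule described above amounts to a one-line staleness check; a sketch with assumed field names:

```python
def accept_report(current_generation: int, report: dict) -> bool:
    # Reports for older generations are discarded, so a delete that bumps
    # the generation automatically invalidates in-flight create/update
    # reports from the previous generation.
    return report["observed_generation"] >= current_generation
```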

Consolidate the deletion-flow documentation into a single commit with Kubernetes-aligned terminology and clearer API/Sentinel/adapter lifecycle guidance, including the latest wording cleanup.

Made-with: Cursor