Skip to content

fix: Multi-Node Jobs ended_at time not being set#184

Open
yuechao-qin wants to merge 1 commit intomasterfrom
ycq/fix-multinode-ended-at
Open

fix: Multi-Node Jobs ended_at time not being set#184
yuechao-qin wants to merge 1 commit intomasterfrom
ycq/fix-multinode-ended-at

Conversation

@yuechao-qin
Copy link
Collaborator

@yuechao-qin yuechao-qin commented Mar 21, 2026

Background

Closes https://github.com/Shopify/oasis-frontend/issues/542

Kubernetes Job conditions use "Complete" for successful jobs and "Failed" for failed jobs. The LaunchedKubernetesJob.ended_at property was checking for "Succeeded" instead of "Complete", which meant ended_at was always None for successful K8s Jobs. This caused the API to return no end time for completed executions.

Separately, when the orchestrator hit an internal error processing a running execution (SYSTEM_ERROR), it set the status but never recorded ended_at, making it impossible to tell when the failure occurred.

Fix

  • ended_at now correctly resolves for successful K8s Jobs by matching the "Complete" condition type
  • SYSTEM_ERROR executions now record ended_at so the API shows when the error occurred
  • All terminal state fields (status, ended_at, exit_code, started_at) are set through a single helper to prevent future inconsistencies

Changes

cloud_pipelines_backend/launchers/kubernetes_launchers.py

  • Added JobConditionType and ConditionStatus enums with K8s API doc references, replacing magic strings throughout the file
  • Fixed ended_at to check ("Complete", "Failed") instead of ("Succeeded", "Failed")

cloud_pipelines_backend/orchestrator_sql.py

  • Added _set_terminal_state() helper that sets all terminal fields on a ContainerExecution in one place
  • Refactored SUCCEEDED, FAILED, CANCELLED, and SYSTEM_ERROR branches to use the helper
  • SYSTEM_ERROR branch now sets ended_at (was previously missing)

Test

uv run pytest tests/test_kubernetes_launchers.py tests/test_orchestrator_terminal_state.py

Copy link
Collaborator Author

This stack of pull requests is managed by Graphite. Learn more about stacking.

@yuechao-qin yuechao-qin marked this pull request as ready for review March 21, 2026 03:47
@yuechao-qin yuechao-qin requested a review from Ark-kun as a code owner March 21, 2026 03:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant