
Fix orphaned worker pods: log deletion errors, eliminate TOCTOU race #385

Merged
EDsCODE merged 1 commit into main from eric/fix-orphaned-pod-deletion
Apr 1, 2026

@EDsCODE EDsCODE commented Apr 1, 2026

Summary

Worker pods were left running after the janitor marked them retired in the DB. Two bugs:

1. Silent pod deletion errors
retireWorkerPod discarded the K8s API delete error with _. If deletion failed, there was no log and no retry, so the pod kept running indefinitely. The error is now logged.

2. TOCTOU race in retireLocalWorker
The callback checked p.workers[id], then called retireWorkerWithReason, which checked again, and returned true regardless of that second check. If the worker was removed between the two checks, the callback reported success but the pod was never deleted, and the fallback retireRuntimeWorker (which also deletes pods) was never called.

Fix: retireWorkerWithReason returns bool. retireLocalWorker returns the result directly with no separate existence check.
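
The shape of the fix can be sketched as follows. This is an illustration under assumptions, not the merged code: the pool and worker types, the int worker ID, the mutex, the function-valued fields, and the retire wrapper are all hypothetical. The point is that existence is checked exactly once, inside retireWorkerWithReason, and the caller trusts the returned bool.

```go
package main

import "sync"

// worker and pool are hypothetical types for this sketch.
type worker struct{ podName string }

type pool struct {
	mu      sync.Mutex
	workers map[int]*worker

	// Hypothetical hooks standing in for the real pod deletion and the
	// fallback path described in the PR.
	deletePod           func(podName, reason string)
	retireRuntimeWorker func(id int, reason string)
}

// retireWorkerWithReason looks up and removes the worker in one critical
// section, then deletes its pod. It reports whether this call retired it.
func (p *pool) retireWorkerWithReason(id int, reason string) bool {
	p.mu.Lock()
	w, ok := p.workers[id]
	if ok {
		delete(p.workers, id)
	}
	p.mu.Unlock()
	if !ok {
		return false
	}
	p.deletePod(w.podName, reason)
	return true
}

// retireLocalWorker returns the result directly, with no separate
// existence check that could go stale before the retire call.
func (p *pool) retireLocalWorker(id int, reason string) bool {
	return p.retireWorkerWithReason(id, reason)
}

// retire is a hypothetical caller: it falls back to retireRuntimeWorker
// only when the local path reports that it did not handle the worker.
func (p *pool) retire(id int, reason string) {
	if p.retireLocalWorker(id, reason) {
		return
	}
	p.retireRuntimeWorker(id, reason)
}
```

With this shape, a worker that disappears before the retire call makes retireLocalWorker return false, so the fallback still runs and the pod is not silently left behind.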

Observed in prod: 3 worker pods ran for 76+ minutes after hot-idle TTL expiry. The DB showed state=retired, retire_reason=hot_idle_ttl_expired, but the pods were still Running.

Test plan

  • go build -tags kubernetes . passes
  • All tests pass
  • Deploy, trigger hot-idle expiry, verify pods are deleted and errors logged if deletion fails

🤖 Generated with Claude Code

Two bugs caused worker pods to remain running after the janitor
marked them retired in the DB:

1. retireWorkerPod silently discarded pod deletion errors with _.
   If the K8s API delete failed (timeout, network blip), the pod
   stayed running with no indication in logs. Now logs the error.

2. retireLocalWorker had a TOCTOU race: it checked p.workers[id]
   then called retireWorkerWithReason which checked again. If the
   worker was removed between the two checks, retireLocalWorker
   returned true (thinking it handled it) but the pod was never
   deleted, and the fallback retireRuntimeWorker was never called.

   Fix: retireWorkerWithReason now returns bool. retireLocalWorker
   returns the result directly — no separate existence check.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@EDsCODE EDsCODE merged commit 9178dbe into main Apr 1, 2026
21 checks passed
@EDsCODE EDsCODE deleted the eric/fix-orphaned-pod-deletion branch April 1, 2026 20:11