Fix orphaned worker pods: log deletion errors, eliminate TOCTOU race by EDsCODE · Pull Request #385 · PostHog/duckgres

EDsCODE · 2026-04-01T19:59:52Z

Summary

Worker pods were left running after the janitor marked them retired in the DB. Two bugs:

1. Silent pod deletion errors
retireWorkerPod discarded the K8s API delete error with _. If deletion failed, nothing was logged, nothing was retried, and the pod stayed running indefinitely. The error is now logged.

2. TOCTOU race in retireLocalWorker
The callback checked p.workers[id], returned true, then called retireWorkerWithReason which checked again. If the worker was removed between the two checks, the callback reported success but the pod was never deleted, and the fallback retireRuntimeWorker (which also deletes pods) was never called.

Fix: retireWorkerWithReason returns bool. retireLocalWorker returns the result directly with no separate existence check.
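The shape of the fix can be sketched as follows. This is an illustration under assumptions, not the real implementation: the pool and worker types, the mutex, the integer worker IDs, and the retire helper with its fallback callback are all hypothetical. The function names retireWorkerWithReason and retireLocalWorker follow the description above, but their signatures here are guesses.

```go
package main

import "sync"

// worker and pool are hypothetical, simplified types.
type worker struct{ retireReason string }

type pool struct {
	mu      sync.Mutex
	workers map[int]*worker
}

// retireWorkerWithReason performs the single existence check. It removes the
// worker under the lock and reports whether this call actually retired it.
func (p *pool) retireWorkerWithReason(id int, reason string) bool {
	p.mu.Lock()
	w, ok := p.workers[id]
	if ok {
		delete(p.workers, id)
	}
	p.mu.Unlock()
	if !ok {
		return false // already gone: the caller must not report success
	}
	w.retireReason = reason
	// Pod deletion would happen here in the real code.
	return true
}

// retireLocalWorker returns the result directly. There is no separate
// p.workers[id] lookup that could disagree with the check above.
func (p *pool) retireLocalWorker(id int, reason string) bool {
	return p.retireWorkerWithReason(id, reason)
}

// retire shows why the bool matters: the fallback (retireRuntimeWorker in the
// description above) runs whenever the local path did nothing.
func retire(p *pool, id int, reason string, fallback func(id int)) {
	if !p.retireLocalWorker(id, reason) {
		fallback(id)
	}
}
```

Because the lookup and the removal happen in one critical section, a worker that disappears before the call yields false, and the caller falls through to the fallback instead of assuming the pod was handled.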

Observed in prod: 3 worker pods ran for 76+ minutes after hot-idle TTL expiry. DB showed state=retired, retire_reason=hot_idle_ttl_expired but pods were still Running.

Test plan

go build -tags kubernetes . passes
All tests pass
Deploy, trigger hot-idle expiry, verify pods are deleted and errors logged if deletion fails

🤖 Generated with Claude Code

Two bugs caused worker pods to remain running after the janitor marked them retired in the DB:

1. retireWorkerPod silently discarded pod deletion errors with _. If the K8s API delete failed (timeout, network blip), the pod stayed running with no indication in logs. Now logs the error.

2. retireLocalWorker had a TOCTOU race: it checked p.workers[id] then called retireWorkerWithReason which checked again. If the worker was removed between the two checks, retireLocalWorker returned true (thinking it handled it) but the pod was never deleted, and the fallback retireRuntimeWorker was never called.

Fix: retireWorkerWithReason now returns bool. retireLocalWorker returns the result directly, with no separate existence check.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

EDsCODE merged commit 9178dbe into main Apr 1, 2026
21 checks passed

EDsCODE deleted the eric/fix-orphaned-pod-deletion branch April 1, 2026 20:11
