The Problem
When control-plane connectivity to a worker node is lost, Kubernetes assumes the node has failed and, once its eviction timeouts expire, begins evicting its workloads and rescheduling them onto other nodes.
For location-bound real-time workloads (e.g. EtherCAT or PROFINET connections wired to specific hardware), this is catastrophic. The workload is still running fine on its hardware, but K8s reshuffles it anyway, breaking the control loop.
K8s can't distinguish between:
- Transient network loss (worker is healthy, just disconnected)
- Actual node failure (worker is dead)
So it treats both the same: reshuffle everything.
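For context, Kubernetes already exposes a partial knob here: the node lifecycle controller taints an unreachable node with `node.kubernetes.io/unreachable:NoExecute`, and pods are evicted after the default toleration window (300 seconds). A pod can extend or remove that window with an explicit toleration. A minimal sketch of a pod-spec fragment, though note this only delays eviction and still cannot distinguish a disconnected node from a dead one:

```yaml
# Pod spec fragment: tolerate control-plane unreachability indefinitely.
# Omitting tolerationSeconds means the pod is never evicted for these taints.
# This is only a mitigation: a truly dead node keeps its pod "pinned" too,
# because the control plane has no health signal either way.
tolerations:
  - key: "node.kubernetes.io/unreachable"
    operator: "Exists"
    effect: "NoExecute"
  - key: "node.kubernetes.io/not-ready"
    operator: "Exists"
    effect: "NoExecute"
```

This is why tolerations alone don't solve the problem: they trade "evict too eagerly" for "never evict", with no way to evict only when the node itself is actually unhealthy.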
Motivation
Industrial edge deployments operate in environments with unreliable network connectivity (4G/5G dropouts, WiFi interference, cellular gaps). Control-plane disconnects are temporary and expected.
But current K8s behavior treats every control-plane disconnect as permanent node failure and immediately reshuffles workloads. This breaks location-bound real-time control loops that are physically wired to specific hardware.
A network hiccup (or a longer outage) shouldn't destroy production. Margo needs semantics that distinguish transient control-plane loss from actual node failure, so location-bound workloads can survive network interruptions without being evicted, even after the connection is restored.
What We Need
Orchestration semantics that:
- Don't assume node failure just because the control plane lost heartbeats
- Keep location-bound workloads pinned during control-plane disconnects
- Only evict if the worker node itself is actually unhealthy
This requires the ability to mark workloads as "location-bound" so the orchestrator knows: transient network loss ≠ node failure.
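One possible shape for such a marker, sketched purely for illustration (Margo has not defined this; the annotation names below are assumptions, not part of any spec), is a workload-level annotation the WFM and orchestrator could honor:

```yaml
# Hypothetical sketch; "margo.org/location-bound" is an assumed name.
# An orchestrator honoring it would suppress eviction on control-plane
# loss and evict only on a confirmed node-local health failure.
metadata:
  annotations:
    margo.org/location-bound: "true"          # never reschedule on heartbeat loss alone
    margo.org/evict-on-local-failure: "true"  # eviction allowed only if a node-local health probe fails
```

The key design point is that eviction is gated on node-local health rather than control-plane reachability, which is exactly the distinction the bullets above call for.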
How This Relates to Margo
Device capabilities (#96, #136) could enable this by allowing WFM to understand which workloads are location-bound and shouldn't be evicted on control-plane loss.
Posted as an individual contributor.