## Feature: Prevent Workload Reshuffling During Control-Plane Disconnects #173

@HAHermsen

The Problem

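When control-plane connectivity to a worker node is lost for longer than the configured grace period, K8s assumes the node has failed and begins evicting and rescheduling workloads to other nodes. With default settings, pods are evicted about five minutes after the node is marked unreachable.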

For location-bound real-time workloads (such as EtherCAT or PROFINET connections), this is catastrophic. The workload is still running fine on its hardware, but K8s reshuffles it anyway, breaking the control loop.

K8s can't distinguish between:

  • Transient network loss (worker is healthy, just disconnected)
  • Actual node failure (worker is dead)

So it treats both the same: reshuffle everything.

Motivation

Industrial edge deployments operate in environments with unreliable network connectivity (4G/5G dropouts, Wi-Fi interference, coverage gaps). Control-plane disconnects are temporary and expected.

But current K8s behavior treats any control-plane disconnect that outlasts its eviction timeout as a node failure and reshuffles workloads. This breaks location-bound real-time control loops that are physically wired to specific hardware.

A network hiccup (or a longer outage) shouldn't destroy production. Margo needs semantics to distinguish transient control-plane loss from actual node failure, so that location-bound workloads survive network interruptions without being evicted, either during the outage or after the connection is restored.

What We Need

Orchestration semantics that:

  • Don't assume node failure just because the control plane stopped receiving the node's heartbeat
  • Keep location-bound workloads pinned during control-plane disconnects
  • Only evict if the worker node itself is actually unhealthy
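The three rules above can be read as a small decision function. The following is only a sketch of the requested semantics, not an existing Margo or Kubernetes API: `Action`, `decide`, and all parameter names are hypothetical, and it assumes some health signal for the node may be available independently of the control-plane heartbeat (`None` when it is not).

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    KEEP = "keep"
    EVICT = "evict"


def decide(location_bound: bool,
           heartbeat_ok: bool,
           node_healthy: Optional[bool]) -> Action:
    """Hypothetical eviction policy for a single workload.

    location_bound: workload is tied to specific hardware (assumed flag).
    heartbeat_ok:   control plane currently receives the node's heartbeat.
    node_healthy:   True/False if node health is actually known,
                    None if it cannot be determined (e.g. during a disconnect).
    """
    # Rule 3: a node known to be unhealthy is the only hard reason to evict.
    if node_healthy is False:
        return Action.EVICT
    # Heartbeat present and no sign of failure: nothing to do.
    if heartbeat_ok:
        return Action.KEEP
    # Rules 1 and 2: a lost heartbeat alone is not proof of node failure,
    # so location-bound workloads stay pinned.
    if location_bound:
        return Action.KEEP
    # Other workloads keep today's behavior and are rescheduled.
    return Action.EVICT
```

For example, `decide(location_bound=True, heartbeat_ok=False, node_healthy=None)` keeps the workload in place, while the same call with `location_bound=False` reschedules it.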

This requires the ability to mark workloads as "location-bound" so the orchestrator knows: transient network loss ≠ node failure.
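For comparison, stock Kubernetes already offers a coarse form of pinning: a pod that tolerates the `node.kubernetes.io/unreachable` and `node.kubernetes.io/not-ready` taints with effect `NoExecute` and no `tolerationSeconds` is not evicted for those taints. The sketch below builds such tolerations as plain dictionaries; `mark_location_bound` is a hypothetical helper, not part of any Kubernetes client or of Margo. Note the limitation: these tolerations also keep the pod bound when the node really is dead, which is exactly the distinction this issue asks for.

```python
# Taints Kubernetes places on nodes it cannot reach or that report NotReady.
UNREACHABLE = "node.kubernetes.io/unreachable"
NOT_READY = "node.kubernetes.io/not-ready"
PIN_KEYS = (UNREACHABLE, NOT_READY)


def pin_tolerations() -> list:
    """Tolerations without tolerationSeconds: tolerate the taint indefinitely."""
    return [
        {"key": key, "operator": "Exists", "effect": "NoExecute"}
        for key in PIN_KEYS
    ]


def mark_location_bound(pod_spec: dict) -> dict:
    """Return a copy of pod_spec whose tolerations pin it during disconnects.

    Hypothetical helper: any existing tolerations for the two taints
    (for example the default 300-second ones) are replaced.
    """
    spec = dict(pod_spec)
    kept = [t for t in spec.get("tolerations", [])
            if t.get("key") not in PIN_KEYS]
    spec["tolerations"] = kept + pin_tolerations()
    return spec
```

A "location-bound" marker in Margo could map to something like this on Kubernetes-based devices, combined with a separate node-health signal to decide when eviction is actually justified.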

How This Relates to Margo

Device capabilities (#96, #136) could enable this by allowing the Workload Fleet Manager (WFM) to understand which workloads are location-bound and shouldn't be evicted on control-plane loss.


Posted as an individual contributor.
