
feat: add disk-utilization-aware segment loading #19288

Draft
jtuglu1 wants to merge 1 commit into apache:master from jtuglu1:disk-util-aware-segment-movement

Conversation

Contributor

@jtuglu1 jtuglu1 commented Apr 9, 2026

Description

Problem

We have seen the following issue in our production clusters:

Tier X exhibits a persistent bimodal disk distribution: ~35 servers near 70% full (Group A) and ~35 servers at 99%+ (Group B). Two root causes prevent the coordinator's segment balancing from correcting this:

  1. Round-robin initial placement distributes new segments across all servers, continuously loading Group B servers that are already near-full.

  2. CostBalancerStrategy is purely temporal — disk utilization plays no role in selecting a move destination server. This causes two problems:

    • Moves from Group B are never scheduled: the balancer adds the source server back to the candidate pool as a "stay in place" option, and the temporal cost function frequently scores it as optimal, marking the segment as "Optimally placed."
    • Moves to Group B are attempted and fail at runtime: the coordinator's snapshot of available disk space lags behind the historical's live state, so the balancer schedules a load to a server it believes has capacity, but the historical rejects it because the disk is full.

Together these create a feedback loop: new segments land on Group B, moves to drain Group B are never scheduled, moves to Group B fail, and the imbalance compounds over time (disks in Group B remain pinned at ~100%).

Solution

The core issues we are trying to solve are:

  1. Minimize server disk-utilization variance within a tier, subject to segment temporal locality.
  2. Prevent the cluster from entering a state of severe disk imbalance.
  3. Provide an "off-ramp" in case the cluster does reach a persistent state of imbalance.
  4. Create a tunable way to deterministically force data redistribution, while still allowing oversubscription in worst-case scenarios (e.g. delayed auto-scaling).

Considerations

I considered two options:

  1. Add a static/dynamic threshold to candidate server selection (e.g. don't assign to servers with >{threshold}% utilization).
  2. Change CostBalancerStrategy to penalize high disk utilization in the cost function.

The core tradeoff is whether to allow Druid to oversubscribe a disk or not. Option #2 would permit oversubscription based on a heuristic — if the segment fits AND temporal locality gain outweighs the disk penalty — meaning it is less deterministic and has pathological cases where temporal value still outweighs the utilization penalty.

Option #1 is a harder limit that is more deterministic. I opted for a preference-with-fallback approach: prefer servers below a configurable utilization threshold, then fall back to the current behavior if all servers exceed the threshold. This keeps temporal-locality optimization within the set of "valid" servers, and avoids blocking segment loads entirely during oversubscription events (e.g. slow auto-scaling) where no server is below the threshold.

Additionally, when selecting a move destination, the source server is only re-added to the candidate pool if it is itself below the threshold. This prevents the balancer from declaring a segment on a 99%-full server as "Optimally placed" and suppressing the drain move.
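The preference-with-fallback selection described above can be sketched as follows. This is an illustrative sketch, not the PR's actual code: the ServerInfo class, the filterCandidates method, and the 0.90 threshold are all hypothetical names/values standing in for Druid's internal server-holder and balancer types.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a candidate historical server.
class ServerInfo {
    final String name;
    final double diskUtilization; // fraction of disk in use, 0.0 - 1.0
    ServerInfo(String name, double diskUtilization) {
        this.name = name;
        this.diskUtilization = diskUtilization;
    }
}

class FillThresholdFilter {
    // Prefer servers below the fill threshold; if none qualify (e.g. during
    // an oversubscription event where every server is near-full), fall back
    // to the full candidate set so segment loads are never blocked entirely.
    static List<ServerInfo> filterCandidates(List<ServerInfo> servers, double threshold) {
        List<ServerInfo> belowThreshold = new ArrayList<>();
        for (ServerInfo s : servers) {
            if (s.diskUtilization < threshold) {
                belowThreshold.add(s);
            }
        }
        return belowThreshold.isEmpty() ? servers : belowThreshold;
    }
}
```

With this shape, the temporal cost function still picks the final destination, but only from the filtered pool; the source server of a move would be re-added to that pool only if it, too, passes the threshold check.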

Configuration

Static default
druid.coordinator.segmentLoading.defaultServerFillThreshold
Default: 1.0 (disabled — preserves current behavior).

Dynamic per-tier overrides

{ "tierServerFillThreshold": { "temp": 0.90 } }

The per-tier override takes precedence over the static default. If no override exists for a tier, the static default applies.
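For illustration, the static default could be set in the Coordinator's runtime.properties, with the tierServerFillThreshold dynamic config above tightening it for the "temp" tier. The 0.95 value here is hypothetical; 1.0 keeps the feature disabled.

```properties
# Hypothetical static default: prefer servers below 95% disk utilization
# in every tier that has no dynamic per-tier override.
druid.coordinator.segmentLoading.defaultServerFillThreshold=0.95
```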

Release note

Added a disk-utilization-aware segment loading threshold to help balance segment load evenly across servers within a tier.


This PR has:

  • been self-reviewed.
  • using the concurrency checklist (Remove this item if the PR doesn't have any relation to concurrency.)
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.
