feat: add disk-utilization-aware segment loading #19288
Draft
jtuglu1 wants to merge 1 commit into apache:master from
Description
Problem
We have seen the following issue in our production clusters:
Tier X exhibits a persistent bimodal disk distribution: ~35 servers near 70% full (Group A) and ~35 servers at 99%+ full (Group B). Two root causes prevent the coordinator's segment balancing from correcting this:

1. Round-robin initial placement distributes new segments across all servers, continuously loading Group B servers that are already near-full.
2. `CostBalancerStrategy` is purely temporal: disk utilization plays no role in selecting a move destination server.

Together these create a feedback loop: new segments land on Group B, moves to drain Group B are never scheduled, moves to Group B fail, and the imbalance compounds over time (disks in Group B remain pinned at ~100%).
Solution
The core issues we are trying to solve are the two root causes above: utilization-blind initial placement and utilization-blind move-destination selection.
Considerations
I considered two options:

1. Enforce a hard disk-utilization threshold when selecting destination servers.
2. Modify `CostBalancerStrategy` to penalize high disk utilization in the cost function.

The core tradeoff is whether to allow Druid to oversubscribe a disk. Option #2 would permit oversubscription based on a heuristic (the segment fits AND the temporal-locality gain outweighs the disk penalty), meaning it is less deterministic and has pathological cases where temporal value still outweighs the utilization penalty.
Option #1 is a harder limit that is more deterministic. I opted for a preference-with-fallback approach: prefer servers below a configurable utilization threshold, then fall back to current behavior if all servers exceed the threshold. This keeps temporal-locality optimization within the set of "valid" servers, and avoids blocking segment loads entirely during oversubscription events (e.g. slow auto-scaling) where no server is below the threshold.
Additionally, when selecting a move destination, the source server is only re-added to the candidate pool if it is itself below the threshold. This prevents the balancer from declaring a segment on a 99%-full server as "Optimally placed" and suppressing the drain move.
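As a rough illustration of the preference-with-fallback idea (the class and record names below are hypothetical stand-ins, not the actual classes touched by this PR):

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative sketch only: CandidateServer and FillThresholdFilter are
// hypothetical names, not the PR's real classes.
public class FillThresholdFilter
{
  // Minimal stand-in for a server's disk state.
  public record CandidateServer(String name, long usedBytes, long capacityBytes)
  {
    double utilizationAfterLoading(long segmentBytes)
    {
      return (double) (usedBytes + segmentBytes) / capacityBytes;
    }
  }

  /**
   * Prefer servers whose utilization after loading the segment stays below the
   * threshold; if no server qualifies (e.g. during an oversubscription event),
   * fall back to the full candidate list so loads are never blocked outright.
   * For move destinations, the source server would only appear in
   * {@code candidates} if it is itself below the threshold.
   */
  public static List<CandidateServer> eligibleServers(
      List<CandidateServer> candidates,
      long segmentBytes,
      double fillThreshold
  )
  {
    List<CandidateServer> belowThreshold = candidates.stream()
        .filter(s -> s.utilizationAfterLoading(segmentBytes) < fillThreshold)
        .collect(Collectors.toList());
    return belowThreshold.isEmpty() ? candidates : belowThreshold;
  }

  public static void main(String[] args)
  {
    long gb = 1L << 30;
    List<CandidateServer> tier = List.of(
        new CandidateServer("groupA-1", 70 * gb, 100 * gb),  // ~70% full
        new CandidateServer("groupB-1", 99 * gb, 100 * gb)   // ~99% full
    );
    // With a 0.90 threshold, only the ~70%-full server is preferred.
    System.out.println(eligibleServers(tier, gb, 0.90));
    // When every server would exceed the threshold, fall back to all of them.
    System.out.println(eligibleServers(tier, 40 * gb, 0.90).size());
  }
}
```

A threshold of `1.0` makes the filter a no-op (every server passes), which is how the default preserves current behavior.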
Configuration

Static default

`druid.coordinator.segmentLoading.defaultServerFillThreshold`
Default: `1.0` (disabled; preserves current behavior).

Dynamic per-tier overrides
```json
{
  "tierServerFillThreshold": {
    "temp": 0.90
  }
}
```

The per-tier override takes precedence over the static default. If no override exists for a tier, the static default applies.

Release note
Add a disk-utilization-aware segment loading threshold to help balance segment load evenly across servers.
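For illustration, the precedence rule from the Configuration section (per-tier override wins, static default otherwise) reduces to a lookup like the following hypothetical helper:

```java
import java.util.Map;

// Hypothetical helper, not the PR's actual code: resolves the effective fill
// threshold for a tier from the dynamic overrides and the static default.
public class FillThresholdConfig
{
  public static double effectiveThreshold(
      Map<String, Double> tierServerFillThreshold,  // dynamic per-tier overrides
      double defaultServerFillThreshold,            // static default (1.0 = disabled)
      String tier
  )
  {
    return tierServerFillThreshold.getOrDefault(tier, defaultServerFillThreshold);
  }

  public static void main(String[] args)
  {
    Map<String, Double> overrides = Map.of("temp", 0.90);
    System.out.println(effectiveThreshold(overrides, 1.0, "temp"));  // override applies
    System.out.println(effectiveThreshold(overrides, 1.0, "hot"));   // default applies
  }
}
```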
This PR has: