docs: Recommend Overlord-based auto-compaction and mark useIncrementalCache production ready#19252

Open
cecemei wants to merge 11 commits into apache:master from cecemei:auto
@cecemei
Contributor

@cecemei cecemei commented Apr 1, 2026

Description

  • Rename CoordinatorRunStats to DruidRunStats
  • Update the automatic compaction documentation and default configuration settings.

Release note

Automatic Compaction

  • Overlord-based compaction supervisors are now the recommended and default approach for automatic compaction. This method provides better reactivity, MSQ task engine support, and easier management through the supervisor framework. Coordinator-based auto-compaction remains available as an alternative.

Incremental cache

  • Incremental segment metadata cache (useIncrementalCache) is no longer experimental and defaults to ifSynced.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@cecemei cecemei changed the title from "Auto" to "docs: Recommend Overlord-based auto-compaction and mark useIncrementalCache production ready" Apr 1, 2026
@cecemei cecemei marked this pull request as ready for review April 1, 2026 23:52
@capistrant
Contributor

The default for useSupervisors should be true in the cluster compaction config if we are recommending it going forward. That way, all new deployments will get the recommended config.

@cecemei
Contributor Author

cecemei commented Apr 7, 2026

useSupervisors

updated default to true, PTAL!

@cecemei cecemei added this to the 37.0.0 milestone Apr 7, 2026
|`druid.manager.rules.pollDuration`|The duration between polls the Coordinator does for updates to the set of active rules. Generally defines the amount of lag time it can take for the Coordinator to notice rules.|`PT1M`|
|`druid.manager.rules.defaultRule`|The default rule for the cluster|`_default`|
|`druid.manager.rules.alertThreshold`|The duration after a failed poll upon which an alert should be emitted.|`PT10M`|
|Property|Description| Default |
Member

what's up with all the unrelated formatting changes?

 */
 @ThreadSafe
-public class CoordinatorRunStats
+public class DruidRunStats
Member

I can't help but wonder if there is a better name for this, maybe DutyRunStats or something that means 'thing to collect stuff to emit later from regularly occurring internal chores'? I guess 'duty' isn't quite right because that is basically only used to refer to periodic coordinator tasks, not supervisor stuff. The javadoc still only mentions coordinator run/duties, which should be fixed.

Also, it is kind of weird having the name DruidRunStats but the thing it has is still called CoordinatorStat, it seems like that naming should be changed to reflect the change here.

Stepping back, what exactly is the motivation for renaming? I guess it is that compaction uses this class and now runs as a supervisor, so it isn't really specific to the coordinator? While this is still used quite heavily by all of the things the coordinator does, it seems reasonable to give it a more generic name of some sort. I was just wondering if this one is a bit too generic, but maybe it is fine too as long as the javadoc clarifies its purpose?

Some additional thoughts: I believe some of us would like to eventually merge the coordinator and overlord into a single service. Since they both now basically need a heavy segment timeline and so have similar footprint requirements, there aren't a lot of compelling reasons to keep them separate anymore. In my mind 'coordinator' would be the remaining service, with all of the overlord's functionality merged into it (though this hasn't really been discussed so maybe other people have other opinions), so if that were true then this would basically become something only used by the coordinator again heh. There is a lot of work to do for something like this, so it is not really a short term goal afaik and needs more official discussion at some point, just adding it here for additional stuff to think about.

* Can use either the native compaction engine or the [MSQ task engine](#use-msq-for-auto-compaction)
* More reactive and submits tasks as soon as a compaction slot is available
* Tracked compaction task status to avoid re-compacting an interval repeatedly
* Uses new Indexing State Fingerprinting mechanisms to store less data per segment in metadata storage
Member

i know this isn't new, but by default we still store compaction state afaict (ClusterCompactionConfig.storeCompactionStatePerSegment defaults to true), so this should be reworded to be like 'can be configured to store only fingerprints' or whatever

- **MSQ compaction engine**: Set `engine` to `msq` in the compaction dynamic config or in the supervisor spec.
- **Incremental segment metadata caching**: Set `druid.manager.segments.useIncrementalCache` to `always` or `ifSynced` in your Overlord and Coordinator runtime properties. See [Segment metadata caching](../configuration/index.md#metadata-retrieval).
- **At least two compaction task slots**: The MSQ task engine requires at least two tasks (one controller, one worker).

Member

i think we need to leave part of this, no? the default engine is still native, so people still need to set engine to msq, and you need 2 compaction slots since it's using the msq engine
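
For illustration only, a minimal sketch of a cluster-level compaction config that reflects the points raised in this thread. The `useSupervisors` and `engine` fields are the ones discussed above. The `maxCompactionTaskSlots` field name is an assumption used here to express the two-slot minimum for MSQ (one controller, one worker), and should be checked against the compaction config reference before use:

```json
{
  "useSupervisors": true,
  "engine": "msq",
  "maxCompactionTaskSlots": 2
}
```

If `engine` is omitted, the thread indicates it still defaults to native, so the MSQ requirements above apply only when it is set explicitly.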

3 participants