[fix][broker] Fix ExtensibleLoadManagerImpl stuck Assigning bundle state after broker restart #25379
Open
lhotari wants to merge 1 commit into apache:master
…er broker restart

When a broker restarted and its ServiceUnitStateChannel initialized via handleExisting(), in-flight Assigning states targeting the restarted broker were silently ignored. This caused ownership resolution to stall for up to 60 seconds (one full monitorOwnerships cycle) before the ownership monitor rescued the stuck bundle.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Thank you for this fix! Can you add/update a test to cover this case?
merlimat approved these changes on Mar 21, 2026
Motivation
When using `ExtensibleLoadManagerImpl`, a restarted broker can end up with a namespace bundle stuck in `Assigning` state for up to 60 seconds after it comes back online, blocking all topic lookups for that bundle during that window.

Root cause
`ServiceUnitStateChannelImpl.handleExisting()` is called during channel startup to replay the current table view state. It only handles `Owned` entries.

If a broker restarts while a bundle is in `Assigning` state with that broker as the destination, `handleExisting` ignores it. The `handleEvent` path (for new topic messages) also won't redeliver the event, because the message was published before this broker subscribed to the topic stream. The result: the `Assigning` state is never acted on by the target broker.

Why 60 seconds?
The fallback is `monitorOwnerships()`, which runs every `loadBalancerServiceUnitStateMonitorIntervalInSeconds` (default: 60s). It detects the stuck `Assigning` state (stale longer than `loadBalancerInFlightServiceUnitStateWaitingTimeInMillis`, default 30s) and calls `overrideOwnership()` to forcibly resolve it. In the worst case this takes a full 60-second monitor cycle plus processing overhead.

Observed failure
This was identified by investigating a flaky failure of `ClusterMigrationTest.testClusterMigrationWithReplicationBacklog` in CI. After `broker3.restart()`, the bundle `pulsar/migrationNs/0x00000000_0xffffffff` entered `Assigning` state at `22:47:10` and wasn't resolved to `Owned` until `22:48:19`, a 69-second delay. The test's 60-second `Awaitility` timeout expired just before ownership resolved.

Modifications
In `handleExisting()`, add handling for `Assigning` states where this broker is the target, delegating to the existing `handleAssignEvent()`, which publishes the `Owned` state.

Why this is safe
- `pubAsync` is `tableview.put()` with no `channelState` check. The existing `handleExisting` for `Owned` already performs side effects at this stage.
- `stateChangeListeners` fires asynchronously: `notifyOnCompletion` resolves only after `pubAsync` completes, by which time `channelState = Started`.
- `handleExisting` processes the initial snapshot; `handleEvent` processes messages published after the consumer subscribed. The same `Assigning` message is never delivered to both.
- If `monitorOwnerships` already rescued the bundle before this broker's tableview initializes, `handleExisting` sees `Owned` (not `Assigning`) and the existing path handles it normally.
- `isTargetBroker`: Only the assigned destination broker responds; all others ignore it, identical to the `handleEvent` path.

Documentation
doc-not-needed
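
Illustrative sketch

For reviewers who want to see the startup-replay idea from the Modifications section in isolation, here is a minimal, self-contained toy model. It is not the Pulsar code: `StartupReplaySketch`, its `Entry` type, and the in-memory map standing in for the table view are all hypothetical, and the method names only mirror the ones discussed in this PR. It assumes the replay simply walks a snapshot and lets the target broker promote its own `Assigning` entries to `Owned`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: these types are hypothetical stand-ins, not Pulsar classes.
class StartupReplaySketch {
    enum State { ASSIGNING, OWNED }

    // Simplified table view value: a state plus the destination broker.
    static final class Entry {
        final State state;
        final String dstBroker;

        Entry(State state, String dstBroker) {
            this.state = state;
            this.dstBroker = dstBroker;
        }
    }

    private final String brokerId;
    // In-memory map standing in for the table view snapshot.
    private final Map<String, Entry> tableView = new LinkedHashMap<>();

    StartupReplaySketch(String brokerId) {
        this.brokerId = brokerId;
    }

    void put(String bundle, State state, String dstBroker) {
        tableView.put(bundle, new Entry(state, dstBroker));
    }

    State stateOf(String bundle) {
        return tableView.get(bundle).state;
    }

    private boolean isTargetBroker(Entry entry) {
        return brokerId.equals(entry.dstBroker);
    }

    // Stand-in for handleAssignEvent(): only the target broker publishes Owned.
    private void handleAssignEvent(String bundle, Entry entry) {
        if (isTargetBroker(entry)) {
            tableView.put(bundle, new Entry(State.OWNED, entry.dstBroker));
        }
    }

    // Stand-in for handleExisting(): replay the snapshot seen at startup.
    // Returns the bundles whose Assigning state this broker acted on.
    List<String> handleExisting() {
        List<String> resolved = new ArrayList<>();
        // Iterate over a copy of the keys because the handler writes to the map.
        for (String bundle : new ArrayList<>(tableView.keySet())) {
            Entry entry = tableView.get(bundle);
            if (entry.state == State.ASSIGNING && isTargetBroker(entry)) {
                handleAssignEvent(bundle, entry);
                resolved.add(bundle);
            }
            // Owned entries would follow the pre-existing path, omitted here.
        }
        return resolved;
    }
}
```

In this model, a restarted "broker-3" that replays a snapshot containing an `ASSIGNING` entry destined for itself resolves that entry immediately, while `ASSIGNING` entries destined for other brokers and existing `OWNED` entries are left untouched. The real change relies on the channel's own publish path and listeners, which this sketch does not attempt to reproduce.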