[fix][broker] Fix ExtensibleLoadManagerImpl stuck Assigning bundle state after broker restart #25379

Open
lhotari wants to merge 1 commit into apache:master from
lhotari:fix/extensible-load-manager-assigning-state-on-restart

@lhotari (Member) commented Mar 21, 2026

Motivation

When using ExtensibleLoadManagerImpl, a restarted broker can end up with a namespace bundle stuck in Assigning state for up to 60 seconds after it comes back online, blocking all topic lookups for that bundle during that window.

Root cause

ServiceUnitStateChannelImpl.handleExisting() is called during channel startup to replay the current table view state. It only handles Owned entries:

private void handleExisting(String serviceUnit, ServiceUnitStateData data) {
    ServiceUnitState state = state(data);
    if (state.equals(Owned) && isTargetBroker(data.dstBroker())) {
        pulsar.getNamespaceService()
                .onNamespaceBundleOwned(LoadManagerShared.getNamespaceBundle(pulsar, serviceUnit));
    }
    // Assigning state: silently ignored ← bug
}

If a broker restarts while a bundle is in Assigning state with that broker as the destination, handleExisting ignores it. The handleEvent path (for new topic messages) won't re-deliver the event either, because the message was published before this broker subscribed to the topic stream. The result: the target broker never acts on the Assigning state.
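
To make the split between the two delivery paths concrete, here is a minimal sketch using a toy table view. ToyTableView, its put/start methods, and the bundle keys are all hypothetical stand-ins for illustration; this is not Pulsar's TableView API. It only shows that an entry published before start() reaches the existing-entries handler and never the live event handler.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

final class SnapshotReplaySketch {

    // Hypothetical toy model of a compacted key/value view (not Pulsar's TableView).
    static final class ToyTableView {
        private final Map<String, String> latest = new LinkedHashMap<>();
        private final List<BiConsumer<String, String>> liveListeners = new ArrayList<>();

        // Publishing stores the latest value and notifies only listeners registered so far.
        void put(String key, String value) {
            latest.put(key, value);
            for (BiConsumer<String, String> listener : liveListeners) {
                listener.accept(key, value);
            }
        }

        // On start, replay the current snapshot through one handler,
        // then route every later update through the other.
        void start(BiConsumer<String, String> existingHandler,
                   BiConsumer<String, String> eventHandler) {
            Map<String, String> snapshot = new LinkedHashMap<>(latest);
            snapshot.forEach(existingHandler);
            liveListeners.add(eventHandler);
        }
    }

    static List<String> run() {
        List<String> log = new ArrayList<>();
        ToyTableView view = new ToyTableView();
        view.put("bundle-A", "Assigning"); // published while the broker is still down
        view.start(
                (k, v) -> log.add("existing:" + k + "=" + v),
                (k, v) -> log.add("event:" + k + "=" + v));
        view.put("bundle-B", "Assigning"); // published after the broker has started
        return log;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

In this model, bundle-A is seen only by the existing-entries handler. If that handler ignores Assigning, nothing else on the restarted broker will pick the entry up.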

Why 60 seconds?

The fallback is monitorOwnerships(), which runs every loadBalancerServiceUnitStateMonitorIntervalInSeconds (default: 60s). It detects the stuck Assigning state (stale for longer than loadBalancerInFlightServiceUnitStateWaitingTimeInMillis, default 30s) and calls overrideOwnership() to forcibly resolve it. In the worst case this takes a full 60-second monitor cycle plus processing overhead.
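
The interaction between the two settings can be sketched with a simplified timing model. This is an illustration under assumptions, not the monitor's actual logic: it assumes the monitor ticks at a fixed interval and resolves the bundle at the first tick where the Assigning entry is at least as old as the staleness threshold. The method name resolutionDelay and the first-tick offsets are hypothetical.

```java
import java.time.Duration;

final class MonitorDelaySketch {

    // Simplified model: ticks happen at firstTick, firstTick + interval, ...
    // measured from the moment the Assigning state was published. The bundle is
    // resolved at the first tick where its age has reached staleAfter.
    static Duration resolutionDelay(Duration firstTick, Duration interval, Duration staleAfter) {
        Duration tick = firstTick;
        while (tick.compareTo(staleAfter) < 0) {
            tick = tick.plus(interval);
        }
        return tick;
    }

    public static void main(String[] args) {
        Duration interval = Duration.ofSeconds(60);   // default monitor interval
        Duration staleAfter = Duration.ofSeconds(30); // default in-flight waiting time

        // Hypothetical offsets of the first monitor tick after Assigning is published.
        System.out.println(resolutionDelay(Duration.ofSeconds(9), interval, staleAfter).getSeconds());
        System.out.println(resolutionDelay(Duration.ofSeconds(30), interval, staleAfter).getSeconds());
    }
}
```

Under this model, a tick that lands before the entry is considered stale does nothing, so resolution waits for the following tick.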

Observed failure

This was identified by investigating a flaky failure of ClusterMigrationTest.testClusterMigrationWithReplicationBacklog in CI. After broker3.restart(), the bundle pulsar/migrationNs/0x00000000_0xffffffff entered Assigning state at 22:47:10 and wasn't resolved to Owned until 22:48:19 — a 69-second delay. The test's 60-second Awaitility timeout expired just before ownership resolved:

22:47:10  Overriding inactiveBroker:localhost:36873 ... to overrideData:ServiceUnitStateData[state=Free...]
22:47:10  [Assigning state published for pulsar/migrationNs/0x00000000_0xffffffff]
22:47:10  cancel the lookup request for pulsar/migrationNs/0x00000000_0xffffffff when receiving Assigning
...69 seconds of ServiceUnitNotReadyException / replicator backoff...
22:48:19  [Owned state finally published]
22:48:19  Re-Sending 1 messages to server  ← replicator reconnects, too late

Modifications

In handleExisting(), add handling for Assigning entries where this broker is the target, delegating to the existing handleAssignEvent(), which publishes the Owned state:

} else if (state.equals(Assigning) && isTargetBroker(data.dstBroker())) {
    // If this broker is the assignment target and the Assigning event was published before this
    // broker's channel finished starting (e.g., after a restart), handle it now so ownership
    // is resolved immediately rather than waiting for the ownership monitor (up to 60s).
    handleAssignEvent(serviceUnit, data);
}
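
The dispatch after this change can be sketched in isolation as follows. This is a toy model, not the code in ServiceUnitStateChannelImpl: the Data class, the broker ids, and the recorded action strings are hypothetical, and the real handleAssignEvent() publishes to the table view rather than appending to a list.

```java
import java.util.ArrayList;
import java.util.List;

final class HandleExistingSketch {

    enum State { Free, Assigning, Owned }

    // Hypothetical stand-in for ServiceUnitStateData with only the fields used here.
    static final class Data {
        final State state;
        final String dstBroker;

        Data(State state, String dstBroker) {
            this.state = state;
            this.dstBroker = dstBroker;
        }
    }

    private final String brokerId;
    final List<String> actions = new ArrayList<>(); // records what the broker would do

    HandleExistingSketch(String brokerId) {
        this.brokerId = brokerId;
    }

    private boolean isTargetBroker(String dstBroker) {
        return brokerId.equals(dstBroker);
    }

    // Replays one snapshot entry. Only the destination broker reacts.
    void handleExisting(String serviceUnit, Data data) {
        if (data.state == State.Owned && isTargetBroker(data.dstBroker)) {
            actions.add("onNamespaceBundleOwned:" + serviceUnit);
        } else if (data.state == State.Assigning && isTargetBroker(data.dstBroker)) {
            // Stand-in for handleAssignEvent(), which publishes the Owned state.
            actions.add("publishOwned:" + serviceUnit);
        }
    }

    public static void main(String[] args) {
        HandleExistingSketch broker = new HandleExistingSketch("broker-3");
        broker.handleExisting("ns/bundle-1", new Data(State.Assigning, "broker-3"));
        broker.handleExisting("ns/bundle-2", new Data(State.Assigning, "broker-1"));
        broker.handleExisting("ns/bundle-3", new Data(State.Owned, "broker-3"));
        broker.actions.forEach(System.out::println);
    }
}
```

In the sketch, an Assigning entry addressed to another broker produces no action, and an entry that has already reached Owned takes the pre-existing branch.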

Why this is safe

  • No channel state guard needed: pubAsync is tableview.put() with no channelState check. The existing handleExisting for Owned already performs side effects at this stage.
  • stateChangeListeners fires asynchronously: notifyOnCompletion resolves only after pubAsync completes — by which time channelState = Started.
  • No double-delivery: handleExisting processes the initial snapshot; handleEvent processes messages published after the consumer subscribed. The same Assigning message is never delivered to both.
  • Idempotent: If monitorOwnerships already rescued the bundle before this broker's tableview initializes, handleExisting sees Owned (not Assigning) and the existing path handles it normally.
  • Scoped by isTargetBroker: Only the assigned destination broker responds; all others ignore it — identical to the handleEvent path.

Documentation

  • doc-not-needed

…er broker restart

When a broker restarts and its ServiceUnitStateChannel initializes via
handleExisting(), in-flight Assigning states targeting the restarted broker
were silently ignored. This caused ownership resolution to stall for up to
60 seconds (one full monitorOwnerships cycle) before the ownership monitor
rescued the stuck bundle.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

github-actions bot added the doc-not-needed label (Your PR changes do not impact docs) Mar 21, 2026

@heesung-sohn (Contributor)

Thank you for this fix! Can you add/update a test to cover this case?

Labels: doc-not-needed, ready-to-test