Add Poison Message Handling to the Dispatchers #1331
Pull request overview
This PR introduces “poison message” handling across the orchestration, entity, and activity dispatchers by tracking per-event dispatch attempts and failing/dropping work once a configured maximum dispatch count is exceeded.
Changes:
- Adds `DispatchCount` to `HistoryEvent` (and propagates it into entity request messages) and adds `MaxDispatchCount` to `IOrchestrationService`.
- Adds poison detection logic in `TaskOrchestrationDispatcher`, `TaskEntityDispatcher`, and `TaskActivityDispatcher` to fail/drop over-dispatched messages.
- Adds structured logging support for poison message detection.
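For readers skimming the summary, the dispatch-count check can be pictured roughly as follows. This is a minimal sketch, not the code in this PR: `SketchEvent` and `IsOverDispatched` are hypothetical names, only `DispatchCount` and the idea of a maximum dispatch count come from the change list, and treating a null limit as "handling disabled" and using a strict `>` comparison are assumptions.

```csharp
using System;

// Hypothetical stand-in for a history event; only DispatchCount comes from the PR.
public sealed class SketchEvent
{
    public int DispatchCount { get; set; }
}

public static class PoisonCheckSketch
{
    // Returns true when the event has been dispatched more times than the limit allows.
    // Assumption: a null limit means poison message handling is disabled.
    public static bool IsOverDispatched(SketchEvent e, int? maxDispatchCount)
    {
        return maxDispatchCount.HasValue && e.DispatchCount > maxDispatchCount.Value;
    }

    public static void Main()
    {
        var e = new SketchEvent { DispatchCount = 6 };
        Console.WriteLine(IsOverDispatched(e, 5));    // True: 6 dispatches exceeds a limit of 5
        Console.WriteLine(IsOverDispatched(e, null)); // False: no limit configured
    }
}
```

What a dispatcher does once the check returns true (fail the orchestration, drop the message, log) is specific to each dispatcher and is not modeled here.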
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| src/DurableTask.Core/TaskOrchestrationDispatcher.cs | Detects over-dispatched orchestration events and fails the orchestration with non-retriable FailureDetails. |
| src/DurableTask.Core/TaskEntityDispatcher.cs | Propagates dispatch counts into entity requests, filters/handles poison operations, and emits poison logs/failures. |
| src/DurableTask.Core/TaskActivityDispatcher.cs | Detects poison activity events and either discards or fails activities based on dispatch count. |
| src/DurableTask.Core/Logging/LogHelper.cs | Adds PoisonMessageDetected structured logging helpers. |
| src/DurableTask.Core/Logging/LogEvents.cs | Adds a new structured log event type for poison message detection. |
| src/DurableTask.Core/Logging/EventIds.cs | Introduces a new event id for poison detection logs. |
| src/DurableTask.Core/IOrchestrationService.cs | Adds MaxDispatchCount configuration knob for providers. |
| src/DurableTask.Core/History/HistoryEvent.cs | Adds DispatchCount to all history events for serialization/transport. |
| src/DurableTask.Core/Entities/OrchestrationEntityContext.cs | Adds AbandonAcquire() to reset critical section lock acquisition state. |
| src/DurableTask.Core/Entities/EventFormat/RequestMessage.cs | Adds DispatchCount field to entity request messages. |
Copilot reviewed 14 out of 14 changed files in this pull request and generated 9 comments.
Copilot reviewed 11 out of 11 changed files in this pull request and generated 10 comments.
…ase of poison message handling, except for entity unlock requests
| $"Activity has received an event with no parent orchestration instance ID."); | ||
| taskMessage.Event.IsPoisoned = true; | ||
| // All orchestration services that implement poison message handling must have logic to handle a null response message in this case. | ||
| await this.orchestrationService.CompleteTaskActivityWorkItemAsync(workItem, responseMessage: null); |
I opted for this approach because it seemed the best way to achieve the P0 goal (stop throwing exceptions when poison message handling is enabled). It will require some special logic in the backends if they choose to handle this case, but this sort of "fast-complete" logic already exists in the TaskOrchestrationDispatcher and TaskEntityDispatcher when ReconcileMessagesWithState returns false.
cgillum left a comment
Sharing some initial feedback - I unfortunately haven't reviewed everything yet.
| /// <summary> | ||
| /// Gets or sets the number of times this event has been dispatched. | ||
| /// </summary> | ||
| [DataMember] |
I suggest we have these new properties be omitted by default if not populated, especially IsPoisoned. This will reduce the chance that there's some compatibility issue with older versions of the SDK that try to deserialize the history.
Suggested change: replace `[DataMember]` with `[DataMember(EmitDefaultValue = false)]`.
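To illustrate the effect of the suggestion, here is a small sketch using the in-box `DataContractSerializer`. `SketchHistoryEvent` is a hypothetical type, not the real `HistoryEvent`, and the serializer a given backend actually uses may differ, so treat this only as a demonstration of what `EmitDefaultValue = false` does to unpopulated members.

```csharp
using System;
using System.IO;
using System.Runtime.Serialization;
using System.Text;

// Hypothetical stand-in for HistoryEvent, reduced to the members under discussion.
[DataContract]
public sealed class SketchHistoryEvent
{
    [DataMember]
    public int EventId { get; set; }

    // Omitted from the payload while it still holds default(int), i.e. 0.
    [DataMember(EmitDefaultValue = false)]
    public int DispatchCount { get; set; }

    // Omitted from the payload while it still holds default(bool), i.e. false.
    [DataMember(EmitDefaultValue = false)]
    public bool IsPoisoned { get; set; }
}

public static class EmitDefaultSketch
{
    // Serializes the event to an XML string with DataContractSerializer.
    public static string Serialize(SketchHistoryEvent e)
    {
        var serializer = new DataContractSerializer(typeof(SketchHistoryEvent));
        using (var stream = new MemoryStream())
        {
            serializer.WriteObject(stream, e);
            return Encoding.UTF8.GetString(stream.ToArray());
        }
    }

    public static void Main()
    {
        string untouched = Serialize(new SketchHistoryEvent { EventId = 1 });
        Console.WriteLine(untouched.Contains("IsPoisoned"));      // False: member not emitted

        string redelivered = Serialize(new SketchHistoryEvent { EventId = 1, DispatchCount = 3 });
        Console.WriteLine(redelivered.Contains("DispatchCount")); // True: non-default value is emitted
    }
}
```

With this attribute setting, history written for events that were never redelivered carries no trace of the new members, which is the compatibility property the comment is after.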
| /// or if some other error occurs during dispatch. | ||
| /// </summary> | ||
| [DataMember] | ||
| public bool IsPoisoned { get; set; } |
I'm wondering if we really want this IsPoisoned property, or if DispatchCount is good enough. My preference is generally fewer properties, and I'm wondering if it might afford us some additional flexibility if we don't necessarily have to explicitly mark something as poisoned.
| this.InstanceId = orchestrationInstance?.InstanceId ?? string.Empty; | ||
| this.ExecutionId = orchestrationInstance?.ExecutionId ?? string.Empty; | ||
| this.EventType = eventType; | ||
| this.TaskEventId = taskEventId; |
Let's also include DispatchCount in this log message.
| using Microsoft.Extensions.Logging; | ||
| using System; | ||
| using System.Collections.Generic; | ||
| using System.Text; |
Can we revert the sort order change to this file? Normally we keep System using statements on top. Visual Studio has a setting for this, if it's VS that's changing it automatically.
| this.logHelper.TaskActivityDispatcherError( | ||
| workItem, | ||
| $"The activity worker received a message that does not have any OrchestrationInstance information."); | ||
| if (this.maxDispatchCount != null) |
Maybe it would be safer to do:
Suggested change: replace `if (this.maxDispatchCount != null)` with `if (this.maxDispatchCount > 0)`.
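The practical difference between the two guards shows up when the limit is configured as zero. The sketch below assumes `maxDispatchCount` is an `int?` (inferred from the `!= null` comparison in the diff, not confirmed here); `NullableGuardSketch` and its method names are hypothetical.

```csharp
using System;

public static class NullableGuardSketch
{
    // Guard as written in the diff: any non-null value enables the poison path.
    public static bool EnabledByNullCheck(int? max) => max != null;

    // Suggested guard: the lifted comparison is false for null and for non-positive values.
    public static bool EnabledByPositiveCheck(int? max) => max > 0;

    public static void Main()
    {
        foreach (int? max in new int?[] { null, 0, 5 })
        {
            string label = max?.ToString() ?? "null";
            Console.WriteLine($"{label}: {EnabledByNullCheck(max)} / {EnabledByPositiveCheck(max)}");
        }
        // null: False / False
        // 0: True / False
        // 5: True / True
    }
}
```

Under the `> 0` form, a limit of 0 behaves like "disabled" rather than making every message look over-dispatched.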
| orchestrationInstance, | ||
| taskMessage.Event, | ||
| $"Activity has received an event with no parent orchestration instance ID."); | ||
| taskMessage.Event.IsPoisoned = true; |
This is reminding me that there are various places where we assume that history events are immutable. I think the one exception to this is the IsPlayed property, but we've actually stopped using that in most modern code paths. I'm worried that this code won't work as expected in various backends.
This PR adds poison message handling to the activity, entity, and orchestration dispatchers. The general policy followed is that we want to make sure when poison message handling is enabled:
Depending on the type of "irrecoverable" error, the backends might have to add special edge-case handling for the poison message. The SDK's responsibility is simply to mark the message as poisoned and prevent its processing.
Note that we have intentionally chosen not to include poison message handling for unlock requests. This is because failing to unlock an entity could leave an entire task hub in a bad state, so we retain the current behavior.