Add Poison Message Handling to the Dispatchers #1331

Open
sophiatev wants to merge 9 commits into main from stevosyan/add-poison-message-handling
Contributor

@sophiatev commented Mar 24, 2026

This PR adds poison message handling to the activity, entity, and orchestration dispatchers. The general policy is that, when poison message handling is enabled:

  1. the dispatchers do not throw exceptions, because a thrown exception causes the work item to be aborted repeatedly when the error is irrecoverable
  2. we stop dispatching a message once its dispatch count exceeds the user-configured maximum
  3. whenever possible, we surface this information to the customer via a failed orchestration or entity operation

Depending on the type of "irrecoverable" error, the backends might have to add special edge-case handling for the poison message. The SDK's responsibility is simply to mark the message as poisoned and prevent its processing.

Note that we have intentionally chosen not to include poison message handling for unlock requests. This is because failing to unlock an entity could leave an entire task hub in a bad state, so we retain the current behavior.
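
As a rough illustration of the policy described above, here is a minimal sketch of the dispatch-count check. The `Message`, `DispatchDecision`, and `PoisonPolicy` names are hypothetical and are not types from this PR; the sketch also assumes that a null maximum means poison message handling is disabled.

```csharp
using System;

// Hypothetical stand-in for a dispatched message; not the SDK's HistoryEvent.
public sealed class Message
{
    public string Name { get; set; } = string.Empty;
    public int DispatchCount { get; set; }
    public bool IsUnlockRequest { get; set; }
}

public enum DispatchDecision { Process, FailAsPoison }

public static class PoisonPolicy
{
    // Assumption: a null maximum means poison message handling is disabled.
    public static DispatchDecision Decide(Message message, int? maxDispatchCount)
    {
        if (!maxDispatchCount.HasValue)
        {
            return DispatchDecision.Process;
        }

        // Unlock requests are intentionally exempt, so they are always processed.
        if (message.IsUnlockRequest)
        {
            return DispatchDecision.Process;
        }

        // "Exceeds" is read as strictly greater than the configured maximum.
        return message.DispatchCount > maxDispatchCount.Value
            ? DispatchDecision.FailAsPoison
            : DispatchDecision.Process;
    }
}

public static class Program
{
    public static void Main()
    {
        var stuck = new Message { Name = "TaskScheduled", DispatchCount = 6 };
        var unlock = new Message { Name = "Unlock", DispatchCount = 6, IsUnlockRequest = true };

        Console.WriteLine(PoisonPolicy.Decide(stuck, 5));
        Console.WriteLine(PoisonPolicy.Decide(unlock, 5));
        Console.WriteLine(PoisonPolicy.Decide(stuck, null));
    }
}
```

In this sketch the decision is returned rather than thrown, which mirrors the first policy point: the caller is expected to fail or drop the message instead of letting an exception abort the work item.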

Copilot AI review requested due to automatic review settings March 24, 2026 19:21
Contributor

Copilot AI left a comment

Pull request overview

This PR introduces “poison message” handling across the orchestration, entity, and activity dispatchers by tracking per-event dispatch attempts and failing/dropping work once a configured maximum dispatch count is exceeded.

Changes:

  • Adds DispatchCount to HistoryEvent (and propagates it into entity request messages) and adds MaxDispatchCount to IOrchestrationService.
  • Adds poison detection logic in TaskOrchestrationDispatcher, TaskEntityDispatcher, and TaskActivityDispatcher to fail/drop over-dispatched messages.
  • Adds structured logging support for poison message detection.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 9 comments.

Show a summary per file
| File | Description |
| --- | --- |
| src/DurableTask.Core/TaskOrchestrationDispatcher.cs | Detects over-dispatched orchestration events and fails the orchestration with non-retriable FailureDetails. |
| src/DurableTask.Core/TaskEntityDispatcher.cs | Propagates dispatch counts into entity requests, filters/handles poison operations, and emits poison logs/failures. |
| src/DurableTask.Core/TaskActivityDispatcher.cs | Detects poison activity events and either discards or fails activities based on dispatch count. |
| src/DurableTask.Core/Logging/LogHelper.cs | Adds PoisonMessageDetected structured logging helpers. |
| src/DurableTask.Core/Logging/LogEvents.cs | Adds a new structured log event type for poison message detection. |
| src/DurableTask.Core/Logging/EventIds.cs | Introduces a new event id for poison detection logs. |
| src/DurableTask.Core/IOrchestrationService.cs | Adds MaxDispatchCount configuration knob for providers. |
| src/DurableTask.Core/History/HistoryEvent.cs | Adds DispatchCount to all history events for serialization/transport. |
| src/DurableTask.Core/Entities/OrchestrationEntityContext.cs | Adds AbandonAcquire() to reset critical section lock acquisition state. |
| src/DurableTask.Core/Entities/EventFormat/RequestMessage.cs | Adds DispatchCount field to entity request messages. |

Copilot AI review requested due to automatic review settings March 24, 2026 20:16
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 9 comments.


Copilot AI review requested due to automatic review settings March 25, 2026 19:37
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 10 comments.


…ase of poison message handling, except for entity unlock requests
$"Activity has received an event with no parent orchestration instance ID.");
taskMessage.Event.IsPoisoned = true;
// All orchestration services that implement poison message handling must have logic to handle a null response message in this case.
await this.orchestrationService.CompleteTaskActivityWorkItemAsync(workItem, responseMessage: null);
Contributor Author

I opted for this approach because it seemed the best way to achieve the P0 goal (stop throwing exceptions when poison message handling is enabled). It will require some special logic in the backends if they choose to handle this case, but this sort of "fast-complete" logic already exists in the TaskOrchestrationDispatcher and TaskEntityDispatcher for when ReconcileMessagesWithState returns false.

Copilot AI review requested due to automatic review settings April 1, 2026 19:49
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 10 comments.


Member

@cgillum left a comment

Sharing some initial feedback - I unfortunately haven't reviewed everything yet.

/// <summary>
/// Gets or sets the number of times this event has been dispatched.
/// </summary>
[DataMember]
Member

I suggest we have these new properties be omitted by default if not populated, especially IsPoisoned. This will reduce the chance that there's some compatibility issue with older versions of the SDK that try to deserialize the history.

Suggested change
- [DataMember]
+ [DataMember(EmitDefaultValue = false)]
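
To make the effect of this suggestion concrete, here is a small self-contained sketch. It uses `DataContractSerializer` from the .NET base class library purely for illustration; the serializer the SDK and its backends actually use may differ, and `SampleEvent` is a hypothetical type, not the SDK's `HistoryEvent`.

```csharp
using System;
using System.IO;
using System.Runtime.Serialization;
using System.Text;

// Hypothetical event type used only to demonstrate EmitDefaultValue.
[DataContract]
public class SampleEvent
{
    [DataMember]
    public int EventId { get; set; }

    // Omitted from the serialized payload while it still holds default(int), i.e. 0.
    [DataMember(EmitDefaultValue = false)]
    public int DispatchCount { get; set; }
}

public static class Program
{
    public static string Serialize(SampleEvent e)
    {
        var serializer = new DataContractSerializer(typeof(SampleEvent));
        using (var stream = new MemoryStream())
        {
            serializer.WriteObject(stream, e);
            return Encoding.UTF8.GetString(stream.ToArray());
        }
    }

    public static void Main()
    {
        // DispatchCount is left at 0 here, so no DispatchCount element is written.
        Console.WriteLine(Serialize(new SampleEvent { EventId = 1 }));

        // A non-default value is written as usual.
        Console.WriteLine(Serialize(new SampleEvent { EventId = 1, DispatchCount = 3 }));
    }
}
```

Payloads for events that were never re-dispatched therefore look the same as they did before the property existed, which is the compatibility benefit described in the comment.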

/// or if some other error occurs during dispatch.
/// </summary>
[DataMember]
public bool IsPoisoned { get; set; }
Member

I'm wondering if we really want this IsPoisoned property, or if DispatchCount is good enough. My preference is generally fewer properties, and I'm wondering whether it might afford us some additional flexibility if we don't have to explicitly mark something as poisoned.

this.InstanceId = orchestrationInstance?.InstanceId ?? string.Empty;
this.ExecutionId = orchestrationInstance?.ExecutionId ?? string.Empty;
this.EventType = eventType;
this.TaskEventId = taskEventId;
Member

Let's also include DispatchCount in this log message.

using Microsoft.Extensions.Logging;
using System;
using System.Collections.Generic;
using System.Text;
Member

Can we revert the sort order change to this file? Normally we keep System using statements on top. Visual Studio has a setting for this, if it's VS that's changing it automatically.

this.logHelper.TaskActivityDispatcherError(
workItem,
$"The activity worker received a message that does not have any OrchestrationInstance information.");
if (this.maxDispatchCount != null)
Member

Maybe it would be safer to do:

Suggested change
- if (this.maxDispatchCount != null)
+ if (this.maxDispatchCount > 0)
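
A quick sketch of why the second form is stricter, assuming `maxDispatchCount` is a nullable integer (an assumption based on the null check in the original line). With C# lifted comparison operators, `null > 0` evaluates to false, so the `> 0` check also treats zero and negative values as "disabled". The helper names below are hypothetical.

```csharp
using System;

public static class Program
{
    // Enabled whenever any value is configured, including 0 or negative values.
    public static bool EnabledByNullCheck(int? max) => max != null;

    // Enabled only for positive values; a lifted comparison against null is false.
    public static bool EnabledByPositiveCheck(int? max) => max > 0;

    public static void Main()
    {
        foreach (int? max in new int?[] { null, 0, -1, 5 })
        {
            string label = max.HasValue ? max.Value.ToString() : "null";
            Console.WriteLine(
                label + ": != null -> " + EnabledByNullCheck(max) +
                ", > 0 -> " + EnabledByPositiveCheck(max));
        }
    }
}
```

The two checks only disagree for 0 and negative values, which is where a misconfigured maximum would otherwise mark every message as poisoned on its first dispatch.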

orchestrationInstance,
taskMessage.Event,
$"Activity has received an event with no parent orchestration instance ID.");
taskMessage.Event.IsPoisoned = true;
Member

This is reminding me that there are various places where we assume that history events are immutable. I think the one exception to this is the IsPlayed property, but we've actually stopped using that in most modern code paths. I'm worried that this code won't work as expected in various backends.

3 participants