
Enhance purge with parallel batch deletes and partial purge timeout #1321

Open
YunchuWang wants to merge 28 commits into main from wangbill/enpurge

@YunchuWang YunchuWang commented Mar 18, 2026

Summary

Enhance the Azure Storage purge implementation with parallel batch deletes, CancellationToken-based partial purge timeout, improved error handling, and comprehensive tests.

Motivation

Purging large numbers of orchestration instances (100K+) with the current implementation causes:

  1. Timeouts: Sequential batch deletes are too slow, causing gRPC deadline timeouts in isolated worker
  2. Storage errors: DeleteBatchAsync fails with 404 when entities are already deleted (race condition)
  3. Silent data loss: gRPC cancellation kills the response but not the in-flight storage operations — caller has no visibility into progress
  4. No progress tracking: No way to know how many instances remain

Changes

Core (DurableTask.Core)

  • PurgeInstanceFilter.Timeout (TimeSpan?): Optional timeout for partial purge
  • PurgeResult.IsComplete (bool?): Already existed, now properly populated

Azure Storage (DurableTask.AzureStorage)

  • PurgeHistoryResult.IsComplete: New property + constructor overload, forwarded via ToCorePurgeHistoryResult()
  • AzureStorageOrchestrationService.PurgeInstanceHistoryAsync(..., TimeSpan timeout): New overload
  • AzureTableTrackingStore.DeleteHistoryAsync: CancellationToken-based timeout using linked CancellationTokenSource
  • Table.DeleteBatchParallelAsync: New parallel batch delete with concurrent transactions and 404 fallback
  • MessageManager.DeleteLargeMessageBlobs: Fixed 404 handling with try/catch instead of ExistsAsync + delete
  • Concurrency control: SemaphoreSlim(100) for instance-level parallelism
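As a rough illustration of the parallel batch delete with 404 fallback listed above, here is a minimal sketch. It is not the PR's Table.DeleteBatchParallelAsync: the submitBatch and deleteSingle delegates, the StorageException type, and the returned request count are hypothetical stand-ins so the example runs without a storage account. The only storage fact assumed is that an Azure Table transaction accepts at most 100 operations.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ParallelBatchDeleteSketch
{
    // Hypothetical stand-in for a storage exception that carries an HTTP status code.
    public sealed class StorageException : Exception
    {
        public StorageException(int status) : base("HTTP " + status) => Status = status;
        public int Status { get; }
    }

    // Returns the number of storage requests issued (batches plus fallback deletes).
    public static async Task<int> DeleteAsync<T>(
        IReadOnlyList<T> entities,
        Func<IReadOnlyList<T>, CancellationToken, Task> submitBatch, // one transactional delete
        Func<T, CancellationToken, Task> deleteSingle,               // one per-entity delete
        int batchSize = 100,
        int maxConcurrency = 10,
        CancellationToken cancellationToken = default)
    {
        using var throttle = new SemaphoreSlim(maxConcurrency);
        int requests = 0;

        async Task DeleteChunkAsync(IReadOnlyList<T> chunk)
        {
            await throttle.WaitAsync(cancellationToken);
            try
            {
                Interlocked.Increment(ref requests); // the batch attempt counts even if it fails
                try
                {
                    await submitBatch(chunk, cancellationToken);
                }
                catch (StorageException e) when (e.Status == 404)
                {
                    // An entity in the batch is already gone, which fails the whole transaction.
                    // Fall back to per-entity deletes and treat 404 as "already deleted".
                    foreach (T entity in chunk)
                    {
                        Interlocked.Increment(ref requests);
                        try { await deleteSingle(entity, cancellationToken); }
                        catch (StorageException inner) when (inner.Status == 404) { }
                    }
                }
            }
            finally
            {
                throttle.Release();
            }
        }

        // Split the input into chunks of at most batchSize entities.
        var chunks = new List<IReadOnlyList<T>>();
        for (int i = 0; i < entities.Count; i += batchSize)
        {
            chunks.Add(entities.Skip(i).Take(batchSize).ToList());
        }

        await Task.WhenAll(chunks.Select(DeleteChunkAsync));
        return requests;
    }
}
```

The local semaphore keeps the number of concurrent transactions bounded independently of any global throttle, which is the design choice the later SemaphoreSlim(10) commit describes.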

Behavior

When Timeout is set:

  • Creates a CancellationTokenSource(timeout) linked with the caller's CancellationToken
  • Passes the effective token to table queries, throttle waits, and ThrowIfCancellationRequested
  • On timeout: catches OperationCanceledException, waits for in-flight deletions, returns IsComplete = false
  • Already-dispatched instance deletions use effectiveToken and can be cancelled in flight when the timeout fires

When Timeout is not set:

  • Existing behavior unchanged (IsComplete = null for backward compatibility)
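The timeout behavior described above can be sketched roughly as follows. This is a simplified, sequential illustration rather than the PR's DeleteHistoryAsync: the Result record, the deleteOne delegate, and the PurgeAsync name are assumptions made for the example, and the real implementation dispatches deletions in parallel.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class PartialPurgeSketch
{
    // Hypothetical result shape: a deleted count plus the tri-state IsComplete flag.
    public sealed record Result(int Deleted, bool? IsComplete);

    // deleteOne is a stand-in for the per-instance storage delete.
    public static async Task<Result> PurgeAsync(
        IEnumerable<string> instanceIds,
        Func<string, CancellationToken, Task> deleteOne,
        TimeSpan? timeout,
        CancellationToken callerToken = default)
    {
        // Only create the timeout source (and the linked source) when a timeout was requested.
        using var timeoutCts = timeout.HasValue ? new CancellationTokenSource(timeout.Value) : null;
        using var linkedCts = timeoutCts != null
            ? CancellationTokenSource.CreateLinkedTokenSource(callerToken, timeoutCts.Token)
            : null;
        CancellationToken effectiveToken = linkedCts?.Token ?? callerToken;

        int deleted = 0;
        bool timedOut = false;
        try
        {
            foreach (string id in instanceIds)
            {
                effectiveToken.ThrowIfCancellationRequested();
                await deleteOne(id, effectiveToken);
                deleted++;
            }
        }
        catch (OperationCanceledException) when (
            timeoutCts != null && timeoutCts.IsCancellationRequested && !callerToken.IsCancellationRequested)
        {
            // Only the timeout fired: report partial progress instead of throwing.
            // Cancellation requested by the caller still propagates as an exception.
            timedOut = true;
        }

        // Without a timeout, IsComplete stays null for backward compatibility.
        return new Result(deleted, timeout.HasValue ? !timedOut : (bool?)null);
    }
}
```

The exception filter is what separates "the purge budget ran out" from "the caller asked to stop", so only the former is turned into IsComplete = false.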

Benchmark Results

100K Instances (EP1, separate ASPs/storage)

Metric | Baseline (stock) | Optimized | Delta
Total Deleted | 28,702 | 99,949 | 3.5x
Purge Rate | 48.7 inst/s | 336.5 inst/s | 6.9x
Errors | 16 | 0 | Error-free

500K Instances (EP1, isolated worker SDK path with 25s timeout)

Metric | Baseline (no timeout) | Optimized (25s timeout) | Delta
Reported Deleted | 17,402 (3.5%) | 499,560 (99.9%) | 28.7x
Purge Rate | 12.3 inst/s | 318.1 inst/s | 25.9x
Errors | 41 (95%) | 0 | Error-free

Breaking Changes

None. All changes are additive:

  • New optional Timeout property on PurgeInstanceFilter
  • New constructor overload on PurgeHistoryResult
  • New PurgeInstanceHistoryAsync overload (original method unchanged)
  • Internal interface/base class changes are non-public

Tests Added

  • PartialPurge_TimesOutThenCompletesOnRetry
  • PartialPurge_GenerousTimeout_CompletesAll
  • PartialPurge_WithoutTimeout_ReturnsNullIsComplete
  • PurgeMultipleInstancesHistoryByTimePeriod_ScalabilityValidation
  • PurgeSingleInstanceWithIdempotency
  • PurgeSingleInstance_WithLargeBlobs_CleansUpBlobs
  • PurgeInstance_WithManyHistoryRows_DeletesAll
  • 9 unit tests for DeleteBatchParallelAsync

Related PRs

YunchuWang and others added 7 commits March 13, 2026 15:33
- Add TimeSpan? Timeout to PurgeInstanceFilter for partial purge support
- Add bool? IsComplete to PurgeHistoryResult to indicate completion status
- Add new PurgeInstanceHistoryAsync overload with TimeSpan timeout parameter
- Use CancellationToken-based timeout (linked CTS) in DeleteHistoryAsync
- Already-dispatched deletions complete before returning partial results
- Backward compatible: no timeout = original behavior (IsComplete = null)
- Forward IsComplete through ToCorePurgeHistoryResult to PurgeResult
- Add scenario tests for partial purge timeout, generous timeout, and compat
- Always cap timeout to 30s max, even if not specified or exceeds 30s
- Pass effectiveToken into DeleteAllDataForOrchestrationInstance so in-flight deletes are also cancelled on timeout
- Catch OperationCanceledException from Task.WhenAll for timed-out in-flight deletes
- External cancellationToken cancellation still propagates normally
Copilot AI review requested due to automatic review settings March 18, 2026 19:24
Copilot AI left a comment

Pull request overview

Improves purge scalability and robustness for DurableTask’s Azure Storage backend by adding parallelized table batch deletes, optional timeout-based partial purging, better 404/idempotency handling, and expanded test coverage.

Changes:

  • Add optional purge timeout (PurgeInstanceFilter.Timeout) and propagate completion status via IsComplete into core PurgeResult.
  • Implement parallel table batch deletion with 404 fallback to per-entity deletes.
  • Add scenario + unit tests for partial purge behavior, blob cleanup, and parallel batch delete behavior.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 7 comments.

File | Description
test/DurableTask.AzureStorage.Tests/TestOrchestrationClient.cs | Adds helper API to invoke the new timed purge overload in tests.
test/DurableTask.AzureStorage.Tests/AzureStorageScenarioTests.cs | Adds new purge/partial-purge scenario tests and validation for large-message blob cleanup.
src/DurableTask.Core/PurgeInstanceFilter.cs | Introduces optional Timeout for partial purge.
src/DurableTask.AzureStorage/Tracking/TrackingStoreBase.cs | Extends purge-by-time signature to include an optional timeout.
src/DurableTask.AzureStorage/Tracking/ITrackingStore.cs | Extends tracking store purge API contract to include optional timeout.
src/DurableTask.AzureStorage/Tracking/AzureTableTrackingStore.cs | Implements timeout-aware, parallel purge-by-time behavior and uses parallel batch delete.
src/DurableTask.AzureStorage/Storage/Table.cs | Adds DeleteBatchParallelAsync with transactional chunking and 404 fallback.
src/DurableTask.AzureStorage/PurgeHistoryResult.cs | Adds IsComplete and forwards it to core PurgeResult.
src/DurableTask.AzureStorage/MessageManager.cs | Improves 404 handling for large-message blob deletion by relying on list/delete with exception handling.
src/DurableTask.AzureStorage/AzureStorageOrchestrationService.cs | Adds timed purge overload and wires PurgeInstanceFilter.Timeout into the call path.
Test/DurableTask.AzureStorage.Tests/Storage/TableDeleteBatchParallelTests.cs | Adds unit tests validating parallel batch delete chunking, fallback, and cancellation behavior.


- Hard-code 30s CancellationToken-based timeout in DeleteHistoryAsync
- Remove configurable Timeout from PurgeInstanceFilter (not needed)
- Remove timeout overload from AzureStorageOrchestrationService
- IsComplete = true when all purged within 30s, false when timed out
- Callers loop until IsComplete = true for large-scale purge
- Add TimeSpan? Timeout property to PurgeInstanceFilter (opt-in, default null)
- When null: unbounded purge, IsComplete=null (backward compat, no behavior change)
- When set: CancellationToken-based timeout, IsComplete=true/false
- Thread Timeout through IOrchestrationServicePurgeClient path
- Zero breaking changes: existing callers unaffected
Copilot AI left a comment

Pull request overview

This PR enhances the Azure Storage purge pipeline to improve throughput and reliability for large purges by introducing parallelized batch deletes, a timeout-driven partial purge mechanism, and forwarding completion status back to the core purge result shape.

Changes:

  • Added PurgeInstanceFilter.Timeout and plumbed timeout support into Azure Storage tracking-store purging.
  • Implemented Table.DeleteBatchParallelAsync with 404/idempotency fallback and updated purge to use it.
  • Added/updated purge-related tests and extended purge result types to carry IsComplete.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.

File | Description
test/DurableTask.AzureStorage.Tests/AzureStorageScenarioTests.cs | Adds new purge scenario tests for scalability/idempotency/large-blob cleanup and a test intended to validate completion semantics.
src/DurableTask.Core/PurgeInstanceFilter.cs | Adds Timeout option to the core purge filter contract.
src/DurableTask.AzureStorage/Tracking/TrackingStoreBase.cs | Extends time-range purge signature to accept optional timeout.
src/DurableTask.AzureStorage/Tracking/ITrackingStore.cs | Extends tracking store purge API with an optional timeout parameter.
src/DurableTask.AzureStorage/Tracking/AzureTableTrackingStore.cs | Implements timeout-aware, parallel instance purging and returns IsComplete based on timeout.
src/DurableTask.AzureStorage/Storage/Table.cs | Adds DeleteBatchParallelAsync with parallel transactions and 404 fallback to individual deletes.
src/DurableTask.AzureStorage/PurgeHistoryResult.cs | Adds IsComplete to AzureStorage purge result and forwards it to DurableTask.Core.PurgeResult.
src/DurableTask.AzureStorage/MessageManager.cs | Improves 404 handling for large message blob cleanup by relying on try/catch rather than container existence checks.
src/DurableTask.AzureStorage/AzureStorageOrchestrationService.cs | Wires PurgeInstanceFilter.Timeout into the tracking-store purge path used by IOrchestrationServicePurgeClient.
Test/DurableTask.AzureStorage.Tests/Storage/TableDeleteBatchParallelTests.cs | Adds unit tests for DeleteBatchParallelAsync (but currently placed outside the referenced test project directory).


- Update PurgeInstanceFilter.Timeout docs: in-flight deletions are cancelled (intentional)
- Add using var for SemaphoreSlim disposal
- Fix DateTime.Now/UtcNow mixing in purge tests (use UtcNow consistently)
- Rename PurgeReturnsIsComplete test to match actual assertions
- Move TableDeleteBatchParallelTests.cs from Test/ to test/ (correct project path)
- Fix typos: grater->greater, status->statuses in XML docs
- Use LINQ Select for foreach loop per code quality suggestion
Copilot AI review requested due to automatic review settings March 20, 2026 00:39
Copilot AI left a comment

Pull request overview

This PR improves the Azure Storage purge pipeline to better handle large-scale instance purges by adding parallelized table batch deletes, introducing an optional timeout for partial purges, and improving idempotency around already-deleted storage artifacts. It also expands scenario/unit test coverage to validate the new purge behaviors and scalability characteristics.

Changes:

  • Add PurgeInstanceFilter.Timeout and propagate IsComplete via PurgeHistoryResult → PurgeResult.
  • Implement parallel table batch deletion with 404 fallback to per-entity deletes.
  • Update purge and blob cleanup implementations for better cancellation/timeout behavior and add comprehensive tests.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.

File | Description
test/DurableTask.AzureStorage.Tests/Storage/TableDeleteBatchParallelTests.cs | Adds unit tests validating new parallel batch delete behavior (including 404 fallback and cancellation).
test/DurableTask.AzureStorage.Tests/AzureStorageScenarioTests.cs | Adds end-to-end purge scenario tests and uses UTC timestamps for purge windows.
src/DurableTask.Core/PurgeInstanceFilter.cs | Introduces optional Timeout for partial purge semantics.
src/DurableTask.AzureStorage/Tracking/TrackingStoreBase.cs | Extends tracking store purge API shape to accept optional timeout.
src/DurableTask.AzureStorage/Tracking/ITrackingStore.cs | Updates tracking store interface to include optional timeout parameter.
src/DurableTask.AzureStorage/Tracking/AzureTableTrackingStore.cs | Implements timeout-linked cancellation + throttled parallel instance purges and uses parallel history row deletes.
src/DurableTask.AzureStorage/Storage/Table.cs | Adds DeleteBatchParallelAsync with concurrent chunk submission and 404 fallback behavior.
src/DurableTask.AzureStorage/PurgeHistoryResult.cs | Adds IsComplete and forwards completion to core PurgeResult.
src/DurableTask.AzureStorage/MessageManager.cs | Improves large-message blob deletion to handle missing containers via exception-based 404 handling.
src/DurableTask.AzureStorage/AzureStorageOrchestrationService.cs | Threads the new timeout value through purge calls and fixes doc typos.


YunchuWang and others added 2 commits March 19, 2026 18:57
- PurgeHistoryResultTests: constructor IsComplete (true/false/null), ToCorePurgeHistoryResult propagation, backward compat
- PurgeInstanceFilterTests: Timeout default null, set/reset, PurgeResult IsComplete tri-state, old constructor compat
- Remove unused using System.Collections.Concurrent (#1)
- Pass original cancellationToken (not effectiveToken) to in-flight deletes (#3)
- Update ITrackingStore doc to include Canceled status (#4)
- Use wall-clock Stopwatch for DeleteBatchParallelAsync Elapsed (#5)
Copilot AI review requested due to automatic review settings March 20, 2026 04:04
@YunchuWang
Member Author

Regarding the pendingTasks memory concern: With the new opt-in timeout feature (default 30s when used), the maximum number of pending tasks is naturally bounded by how many instances can be dispatched within the timeout window (~100 concurrent × 30s ≈ a few thousand tasks at most). For the no-timeout path (backward compat), the existing behavior is preserved. The SemaphoreSlim(100) already limits actual concurrency. Switching to Parallel.ForEachAsync would be a larger refactor that changes the async enumeration pattern, so it is better suited for a follow-up.

Copilot AI left a comment

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.



- Revert to effectiveToken so in-flight deletes are cancelled on timeout
- Update PurgeInstanceFilter.Timeout XML doc to match behavior
- Docs and comments now consistently say in-flight deletes are cancelled
Copilot AI review requested due to automatic review settings March 20, 2026 21:24
Copilot AI left a comment

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.



@YunchuWang YunchuWang requested a review from cgillum March 23, 2026 21:51
@YunchuWang
Member Author

YunchuWang commented Mar 25, 2026

Hi @cgillum Could you please review this PR?


  /// <inheritdoc />
- public virtual Task<PurgeHistoryResult> PurgeInstanceHistoryAsync(DateTime createdTimeFrom, DateTime? createdTimeTo, IEnumerable<OrchestrationStatus> runtimeStatus, CancellationToken cancellationToken = default)
+ public virtual Task<PurgeHistoryResult> PurgeInstanceHistoryAsync(DateTime createdTimeFrom, DateTime? createdTimeTo, IEnumerable<OrchestrationStatus> runtimeStatus, TimeSpan? timeout = null, CancellationToken cancellationToken = default)
Member

Why do we need a TimeSpan timeout parameter when we already have CancellationToken? We should be using the existing CancellationToken for timeout handling.

Member Author

Good question. I traced the full call chain, and there are two reasons we need the explicit timeout parameter rather than relying solely on CancellationToken:

1. CancellationToken cannot flow through the current call chain

There are 3 breaks where CancellationToken is dropped:

Layer | Method | CancellationToken?
SDK (GrpcDurableTaskClient) | PurgeInstancesCoreAsync | ✅ Passes to gRPC stub
Extension (LocalGrpcListener.PurgeInstances) | context.CancellationToken available | Not passed (compare: adjacent QueryInstances does pass it)
Core (IOrchestrationServicePurgeClient) | PurgeInstanceStateAsync(PurgeInstanceFilter) | Interface has no CancellationToken parameter
Azure Storage (AzureStorageOrchestrationService) | PurgeInstanceHistoryAsync | ❌ Calls tracking store with default
Tracking Store (ITrackingStore) | PurgeInstanceHistoryAsync | ✅ Has CancellationToken = default but never receives a real one

The root cause is IOrchestrationServicePurgeClient — it's a public interface with no CancellationToken. Fixing that is a breaking change across all backend implementations.

2. Even if CancellationToken flowed through, gRPC deadline kills the channel before we can return partial results

The isolated worker SDK calls purge over gRPC with a ~30s deadline. When the deadline fires:

  • context.CancellationToken is cancelled
  • The gRPC channel is closed immediately
  • The server cannot send back a response — the client gets RpcException(DeadlineExceeded)
  • The client receives zero information about how many instances were deleted

The timeout parameter acts as a soft timeout (e.g. 25s) that fires before the gRPC deadline (30s), giving the server a 5-second window to:

  1. Stop accepting new instance deletions
  2. Wait for in-flight deletes to finish
  3. Build and return PurgeResult { DeletedInstanceCount = 17402, IsComplete = false }
  4. The client loops and calls again

This is the same pattern DTS uses internally (PurgeTimeout = 25s capped to avoid grain call timeouts).
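On the caller side, the "client loops and calls again" step might look something like the sketch below. The PurgeResult record here is a local stand-in that only mirrors the two fields mentioned above, and purgeOnce, softTimeout, and maxRounds are hypothetical names; a real client would issue the actual purge call inside purgeOnce.

```csharp
using System;
using System.Threading.Tasks;

public static class PurgeLoopSketch
{
    // Local stand-in with the two fields discussed in this thread.
    public sealed record PurgeResult(int DeletedInstanceCount, bool? IsComplete);

    // Calls purgeOnce repeatedly until the backend stops reporting IsComplete = false.
    public static async Task<int> PurgeAllAsync(
        Func<TimeSpan, Task<PurgeResult>> purgeOnce,
        TimeSpan softTimeout,
        int maxRounds = 1000)
    {
        int total = 0;
        for (int round = 0; round < maxRounds; round++)
        {
            PurgeResult result = await purgeOnce(softTimeout);
            total += result.DeletedInstanceCount;

            // true means everything matching the filter was purged.
            // null means the backend does not report completion, so there is nothing to loop on.
            if (result.IsComplete != false)
            {
                return total;
            }
        }

        throw new TimeoutException("Purge did not complete within " + maxRounds + " rounds.");
    }
}
```

Capping the number of rounds is an assumption added here so that a backend that keeps returning IsComplete = false cannot spin the caller forever.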

Summary

CancellationToken = hard cancel (can't return partial result after gRPC deadline)
Timeout = soft timeout (returns partial result before deadline hits)

They solve different problems and aren't mutually exclusive. Happy to also fix the CancellationToken gaps (LocalGrpcListener + interface) as a separate follow-up if you'd prefer.

Member

Looks like Copilot posted duplicate responses here. Also, I think it's over analyzing my question. Decoupling the cancellation token from the gRPC request cancellation makes sense, but that's not what I'm asking. You can still use the existing cancellation token parameter to implement a purge timeout that's separate from the request timeout.

Member Author

@YunchuWang YunchuWang Mar 30, 2026

Makes sense — addressed. Removed the TimeSpan? timeout parameter from all internal method signatures (ITrackingStore, TrackingStoreBase, AzureTableTrackingStore). The Timeout-to-CancellationToken conversion now happens once at the AzureStorageOrchestrationService layer in PurgeInstanceStateAsync(PurgeInstanceFilter), and the tracking store just observes the CancellationToken. PurgeInstanceFilter.Timeout property is kept as the API surface for callers (additive, not breaking). Also cleaned up the duplicate responses — sorry about that. @cgillum

Address PR review: replace reliance on global ThrottlingHttpPipelinePolicy
with a local SemaphoreSlim(10) in DeleteBatchParallelAsync to prevent purge
operations from starving other storage operations.
Copilot AI left a comment

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.



…rallelTests.cs

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 26, 2026 01:10
Copilot AI left a comment

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

src/DurableTask.AzureStorage/MessageManager.cs:1

  • The implementation only counts the list operation when no blobs are found. When blobs exist, storageOperationCount currently counts deletes only, despite the comment indicating the list request should be counted as well. Consider counting the list operation unconditionally (e.g., start at 1, or add 1 after the listing completes) so metrics remain consistent with the comment and prior behavior.
//  ----------------------------------------------------------------------------------


…omments

- Replace hardcoded MaxPurgeInstanceConcurrency=100 with
  MaxStorageOperationConcurrency/3, ensuring purge never takes more than
  1/3 of the global HTTP budget and auto-scales with deployment size.
- Fix stale test comment that referenced 'no internal semaphore' after
  SemaphoreSlim(10) was added to DeleteBatchParallelAsync.
- Change flaky 1ms timeout in scenario test to 100ms to avoid OS timer
  resolution issues (~15ms on Windows).
@YunchuWang
Member Author

@cgillum Hi Chris, could you kindly take a look at this PR when you get a chance? Would really appreciate your review. Thanks!

Address cgillum's review: remove TimeSpan? timeout from internal method
signatures (ITrackingStore, TrackingStoreBase, AzureTableTrackingStore).
Convert PurgeInstanceFilter.Timeout to CancellationToken at the Service
layer in PurgeInstanceStateAsync, then pass it through the existing
CancellationToken parameter. The tracking store only observes a single
cancellation mechanism.

PurgeInstanceFilter.Timeout property is preserved (additive, not breaking).
Copilot AI review requested due to automatic review settings March 30, 2026 20:06
Copilot AI left a comment

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 5 comments.



YunchuWang and others added 2 commits March 30, 2026 13:33
…icePurgeClient

Don't expose a public method just for testing. The test helper now calls
IOrchestrationServicePurgeClient.PurgeInstanceStateAsync(PurgeInstanceFilter)
which is the real production call path and handles Timeout->CT conversion
internally.
… expression'

Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 30, 2026 20:41
Copilot AI left a comment

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.



- Fix P0: When no timeout is specified, route through the original code
  path (no CancellationToken) so IsComplete remains null for backward
  compatibility. Only create CancellationTokenSource(timeout) when
  PurgeInstanceFilter.Timeout is set.
- Simplify 'timedOut ? false : true' to '!timedOut'.
- Fix null warning: use null-forgiving operator for test loop variable.
- Count failed batch attempt in RequestCount (+1) for accurate metrics.
- Increase timeout test instance count from 10 to 50 for reliability.
4 participants