ENH+BF: Zarr upload retry resilience, diagnostics, and explicit timeout #1827
yarikoptic wants to merge 2 commits into master from
Conversation
- Batch-level retry backoff: replace linear `sleep(1*N)` with exponential backoff (5s, 20s, 40s, 80s, 120s) plus random jitter, giving S3 time to recover from throttling
- Reduce parallelism on batch retry: halve `ThreadPoolExecutor` workers on each retry (5→3→2→1), reducing concurrent connections that may trigger S3 prefix-level throttling
- Better failure diagnostics in `_handle_failed_items_and_raise`: log a summary line with failed/total counts, exception types grouped by count, and a "systematic" flag when all failures share the same exception type. This makes it immediately clear whether failures are random (flaky network) or deterministic (server-side issue)

Motivated by investigation of #1821, where a previously-aborted OME-Zarr upload consistently fails with `ConnectionAbortedError` on every retry attempt for all 188 level-0 chunks (100% failure rate, 2259 connection attempts), while level-1 chunks and other fresh zarr uploads succeed fine from the same machine.

Co-Authored-By: Claude Code 2.1.81 / Claude Opus 4.6 <noreply@anthropic.com>
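The retry changes in this commit could be sketched roughly as below. All names here (`BACKOFF_SCHEDULE`, `backoff_delay`, `workers_for_retry`, `summarize_failures`) are illustrative stand-ins, not the actual dandi-cli identifiers:

```python
import math
import random
from collections import Counter

# Illustrative sketch of the behavior described in the commit message;
# these names are hypothetical, not the real dandi-cli API.
BACKOFF_SCHEDULE = [5, 20, 40, 80, 120]  # seconds; last value is the cap


def backoff_delay(retry_count: int) -> float:
    """Exponential backoff (5s, 20s, 40s, 80s, 120s cap) plus random jitter."""
    base = BACKOFF_SCHEDULE[min(retry_count - 1, len(BACKOFF_SCHEDULE) - 1)]
    # Jitter (here: up to 25% of the base delay, an assumed amount)
    # spreads out retries so parallel clients do not re-hit S3 in lockstep.
    return base + random.uniform(0, base * 0.25)


def workers_for_retry(initial: int, retry_count: int) -> int:
    """Halve ThreadPoolExecutor workers on each retry: 5 -> 3 -> 2 -> 1."""
    workers = initial
    for _ in range(retry_count):
        workers = max(1, math.ceil(workers / 2))
    return workers


def summarize_failures(failures: list[Exception], total: int) -> str:
    """One-line summary: counts per exception type, plus a 'systematic'
    flag when every failure shares the same exception type."""
    by_type = Counter(type(exc).__name__ for exc in failures)
    systematic = len(by_type) == 1
    types = ", ".join(f"{name} x{n}" for name, n in by_type.most_common())
    summary = f"{len(failures)}/{total} items failed ({types})"
    if systematic:
        summary += "; systematic: all failures share one exception type"
    return summary
```

A systematic summary line (e.g. `188/188 items failed (ConnectionAbortedError x188); systematic: ...`) is what distinguishes the deterministic failure seen in #1821 from ordinary flaky-network noise.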
Previously, zarr chunk uploads used no explicit timeout, relying on OS TCP defaults. For large zarr chunks (e.g., 2.6 GB level-0 OME-Zarr chunks), the upload can take many minutes at typical home upload speeds (~7 min at 50 Mbps, ~70 min at 5 Mbps). Without a timeout, the client either hangs indefinitely on network failures, or the OS TCP stack's default keepalive/idle timeout (often ~120s on Windows) kills the connection before a large upload can complete.

Set an explicit `requests` timeout of `(60, 7200)`: 60 seconds for TCP connect, 2 hours for response read. The read timeout covers the period after the full request body is sent while waiting for S3's response, which is where `ConnectionAbortedError` was observed in #1821.

Note: this does NOT limit the upload body transfer time itself. The `requests` library's read timeout only applies to waiting for response data, not to sending request data, so even very slow multi-hour uploads will not be interrupted by this timeout.

Ref: #1821

Co-Authored-By: Claude Code 2.1.81 / Claude Opus 4.6 <noreply@anthropic.com>
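The timeout semantics above can be sketched with a minimal wrapper. `put_chunk` and the constant names are hypothetical, not the actual dandi-cli code; what is real is the `requests` `(connect, read)` timeout tuple:

```python
import io
import requests

CONNECT_TIMEOUT = 60    # seconds allowed to establish the TCP connection
READ_TIMEOUT = 7200     # seconds to wait for S3's response after the body is sent


def put_chunk(url: str, data: io.BufferedIOBase) -> requests.Response:
    """Hypothetical zarr-chunk PUT with an explicit (connect, read) timeout.

    The read timeout bounds only the wait between sent bytes of the
    response, not the time spent transmitting the request body, so a
    slow multi-hour upload is not interrupted by it.
    """
    resp = requests.put(url, data=data, timeout=(CONNECT_TIMEOUT, READ_TIMEOUT))
    resp.raise_for_status()
    return resp
```

With no `timeout=` at all, `requests` can wait on `getresponse()` forever; the tuple form makes both phases bounded and explicit.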
Is it chunks or shards, or does the code not distinguish between the types of objects? We asked people to use zarr v3 with sharding, so likely these are shards. It may be just a nomenclature issue in the code which can be fixed.
Codecov Report ❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #1827 +/- ##
==========================================
- Coverage 75.12% 75.09% -0.04%
==========================================
Files 84 84
Lines 11931 11939 +8
==========================================
+ Hits 8963 8965 +2
- Misses 2968 2974 +6
Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
The zarr upload/download code has no clue (AFAIK) about zarr structure/metadata indeed, and I think here it also does not really matter, since we do not tell those apart anywhere in the design: not here in the CLI, nor on the dandi-archive backend AFAIK. For the overall issue, which is more grave, see/comment on
Summary
Follows up on #1824 (the `data.seek(0)` fix). While investigating #1821 further with a new log where the digest fix eliminated BadDigest errors, we found that all 188 level-0 chunks (each ~2.6 GB) still fail systematically with `ConnectionAbortedError` at `getresponse()`: the upload body is sent but TCP dies before S3's response arrives. Level-1 chunks (smaller) succeed fine. The user confirmed other large zarr uploads work from the same machine; only this previously-aborted zarr fails, pointing to a server-side issue for which a separate investigation is tracked in #1821.

Changes
Batch-level retry improvements (57c1aab):
- Replace `sleep(1 * retry_count)` with exponential backoff (5s, 20s, 40s, 80s, 120s cap) plus random jitter
- Halve `ThreadPoolExecutor` workers on each batch retry (`max(1, ceil(N/2))`), reducing concurrent connections that may trigger S3 prefix-level throttling
- Better failure diagnostics in `_handle_failed_items_and_raise`: failed/total counts, exception types via `Counter`, "systematic" flag when all failures share the same type

Explicit upload timeout (7fd3014):
- Set `timeout=(60, 7200)` on zarr chunk PUT: 60s TCP connect, 2h response read
- Covers the `getresponse()` wait after the body is fully sent (where `ConnectionAbortedError` was observed)
- Does not limit upload body transfer time (the `requests` read timeout only applies to waiting for response data)

Test plan
🤖 Generated with Claude Code