ENH+BF: Zarr upload retry resilience, diagnostics, and explicit timeout#1827

Draft
yarikoptic wants to merge 2 commits into master from bf-zarr-upload3

Conversation

@yarikoptic
Member

@yarikoptic yarikoptic commented Mar 30, 2026

Summary

  • Improve zarr upload retry resilience: exponential backoff at batch level, reduce parallelism on retry (5→3→2→1 workers), better failure diagnostics
  • Set explicit timeout (60s connect, 2h read) for zarr chunk PUT requests to prevent indefinite hangs

Follows up on #1824 (the data.seek(0) fix). While investigating #1821 further with a new log where the digest fix eliminated BadDigest errors, we found that all 188 level-0 chunks (each ~2.6 GB) still fail systematically with ConnectionAbortedError at getresponse() — the upload body is sent but TCP dies before S3's response arrives. Level-1 chunks (smaller) succeed fine. The user confirmed other large zarr uploads work from the same machine; only this previously-aborted zarr fails, pointing to a server-side issue for which a separate investigation is tracked in #1821.

Changes

Batch-level retry improvements (57c1aab):

  • Replace linear sleep(1 * retry_count) with exponential backoff (5s, 20s, 40s, 80s, 120s cap) plus random jitter
  • Halve ThreadPoolExecutor workers on each batch retry (max(1, ceil(N/2))), reducing concurrent connections that may trigger S3 prefix-level throttling
  • Add failure summary in _handle_failed_items_and_raise: failed/total counts, exception types via Counter, "systematic" flag when all failures share the same type
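A minimal sketch of the three batch-level changes above. Function names and the exact backoff table are illustrative assumptions, not the actual dandi-cli implementation:

```python
import math
import random
from collections import Counter

# Illustrative sketch only: names and the backoff table are assumptions,
# not the actual dandi-cli code.
BACKOFF = [5, 20, 40, 80]  # seconds; later retries are capped at MAX_DELAY
MAX_DELAY = 120

def retry_delay(retry_count: int) -> float:
    """Exponential backoff (5s, 20s, 40s, 80s, 120s cap) plus random jitter."""
    base = BACKOFF[retry_count - 1] if retry_count <= len(BACKOFF) else MAX_DELAY
    return base + random.uniform(0, base * 0.1)

def workers_for_retry(initial: int, retry_count: int) -> int:
    """Halve ThreadPoolExecutor workers on each batch retry: 5 -> 3 -> 2 -> 1."""
    workers = initial
    for _ in range(retry_count):
        workers = max(1, math.ceil(workers / 2))
    return workers

def failure_summary(failures: list, total: int) -> str:
    """Summarize failed/total counts and exception types, flagging
    'systematic' when all failures share the same exception type."""
    types = Counter(type(exc).__name__ for exc in failures)
    flag = " [systematic]" if len(types) == 1 else ""
    detail = ", ".join(f"{name} x{n}" for name, n in types.most_common())
    return f"{len(failures)}/{total} uploads failed{flag}: {detail}"
```

A summary line like this makes it immediately visible whether failures are random (several exception types) or deterministic (one type across every item).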

Explicit upload timeout (7fd3014):

  • Set timeout=(60, 7200) on zarr chunk PUT — 60s TCP connect, 2h response read
  • The read timeout covers the getresponse() wait after body is fully sent (where ConnectionAbortedError was observed)
  • Does NOT limit upload body transfer time (requests read timeout only applies to waiting for response data)
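As a sketch of how such a timeout tuple is passed to requests (the function name and signed-URL parameter here are placeholders, not the actual dandi-cli upload code):

```python
import requests

CONNECT_TIMEOUT = 60   # seconds to establish the TCP connection
READ_TIMEOUT = 7200    # seconds to wait for S3's response after the body is sent

def put_chunk(signed_url: str, data: bytes) -> requests.Response:
    # requests applies the read timeout only while waiting for response
    # bytes, so a slow multi-hour body upload is not interrupted by it.
    return requests.put(signed_url, data=data,
                        timeout=(CONNECT_TIMEOUT, READ_TIMEOUT))
```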

Test plan

🤖 Generated with Claude Code

yarikoptic and others added 2 commits March 30, 2026 11:10
- Batch-level retry backoff: replace linear sleep(1*N) with exponential
  backoff (5s, 20s, 40s, 80s, 120s) plus random jitter, giving S3 time
  to recover from throttling

- Reduce parallelism on batch retry: halve ThreadPoolExecutor workers on
  each retry (5→3→2→1), reducing concurrent connections that may trigger
  S3 prefix-level throttling

- Better failure diagnostics in _handle_failed_items_and_raise: log a
  summary line with failed/total counts, exception types grouped by count,
  and a "systematic" flag when all failures share the same exception type.
  This makes it immediately clear whether failures are random (flaky
  network) or deterministic (server-side issue)

Motivated by investigation of #1821 where a previously-aborted OME-Zarr
upload consistently fails with ConnectionAbortedError on every retry
attempt for all 188 level-0 chunks (100% failure rate, 2259 connection
attempts), while level-1 chunks and other fresh zarr uploads succeed fine
from the same machine.

Co-Authored-By: Claude Code 2.1.81 / Claude Opus 4.6 <noreply@anthropic.com>
Previously, zarr chunk uploads used no explicit timeout, relying on OS
TCP defaults.  For large zarr chunks (e.g., 2.6 GB level-0 OME-Zarr
chunks), the upload can take many minutes at typical home upload speeds
(~7 min at 50 Mbps, ~70 min at 5 Mbps).  Without a timeout, the client
either hangs indefinitely on network failures, or the OS TCP stack's
default keepalive/idle timeout (often ~120s on Windows) kills the
connection before a large upload can complete.

Set an explicit requests timeout of (60, 7200) — 60 seconds for TCP
connect, 2 hours for response read.  The read timeout covers the period
after the full request body is sent while waiting for S3's response,
which is where ConnectionAbortedError was observed in #1821.

Note: this does NOT limit the upload body transfer time itself.  The
requests library's read timeout only applies to waiting for response
data, not to sending request data.  So even very slow multi-hour uploads
will not be interrupted by this timeout.

Ref: #1821

Co-Authored-By: Claude Code 2.1.81 / Claude Opus 4.6 <noreply@anthropic.com>
@satra
Member

satra commented Mar 30, 2026

is it chunks or shards, or does the code not distinguish between the types of objects? we asked people to use zarr v3 with sharding, so likely these are shards. may be just a nomenclature issue in the code which can be fixed.

@codecov

codecov bot commented Mar 30, 2026

Codecov Report

❌ Patch coverage is 36.36364% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.09%. Comparing base (fb5e2f0) to head (4a2da70).

Files with missing lines | Patch % | Lines
dandi/files/zarr.py | 36.36% | 7 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1827      +/-   ##
==========================================
- Coverage   75.12%   75.09%   -0.04%     
==========================================
  Files          84       84              
  Lines       11931    11939       +8     
==========================================
+ Hits         8963     8965       +2     
- Misses       2968     2974       +6     
Flag | Coverage Δ
unittests | 75.09% <36.36%> (-0.04%) ⬇️

Flags with carried forward coverage won't be shown.


@yarikoptic yarikoptic added the patch (Increment the patch version when merged) label Mar 30, 2026
@yarikoptic
Member Author

The zarr upload/download code has (AFAIK) no knowledge of the zarr structure/metadata, and I think it does not really matter here either, since we do not tell those apart anywhere in the design: not here in the CLI, nor in the dandi-archive backend AFAIK. For the overall issue, which is more grave, see/comment on


Labels

cmd-upload, patch (Increment the patch version when merged), zarr
