ENH+BF: Zarr upload retry resilience, diagnostics, and explicit timeout #1827
yarikoptic wants to merge 2 commits into master from
Conversation
- Batch-level retry backoff: replace linear `sleep(1*N)` with exponential backoff (5s, 20s, 40s, 80s, 120s) plus random jitter, giving S3 time to recover from throttling
- Reduce parallelism on batch retry: halve `ThreadPoolExecutor` workers on each retry (5→3→2→1), reducing concurrent connections that may trigger S3 prefix-level throttling
- Better failure diagnostics in `_handle_failed_items_and_raise`: log a summary line with failed/total counts, exception types grouped by count, and a "systematic" flag when all failures share the same exception type. This makes it immediately clear whether failures are random (flaky network) or deterministic (server-side issue)

Motivated by investigation of #1821, where a previously-aborted OME-Zarr upload consistently fails with `ConnectionAbortedError` on every retry attempt for all 188 level-0 chunks (100% failure rate, 2259 connection attempts), while level-1 chunks and other fresh zarr uploads succeed fine from the same machine.

Co-Authored-By: Claude Code 2.1.81 / Claude Opus 4.6 <noreply@anthropic.com>
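The retry changes in this commit could be sketched roughly as below. All names here (`BACKOFF_SCHEDULE`, `backoff_delay`, `workers_for_retry`, `summarize_failures`) are illustrative stand-ins, not the actual dandi-cli identifiers:

```python
import math
import random
from collections import Counter

# Illustrative sketch of the behavior described in the commit message;
# these names are hypothetical, not the real dandi-cli API.
BACKOFF_SCHEDULE = [5, 20, 40, 80, 120]  # seconds; last value is the cap


def backoff_delay(retry_count: int) -> float:
    """Exponential backoff (5s, 20s, 40s, 80s, 120s cap) plus random jitter."""
    base = BACKOFF_SCHEDULE[min(retry_count - 1, len(BACKOFF_SCHEDULE) - 1)]
    # Jitter (here: up to 25% of the base delay, an assumed amount)
    # spreads out retries so parallel clients do not re-hit S3 in lockstep.
    return base + random.uniform(0, base * 0.25)


def workers_for_retry(initial: int, retry_count: int) -> int:
    """Halve ThreadPoolExecutor workers on each retry: 5 -> 3 -> 2 -> 1."""
    workers = initial
    for _ in range(retry_count):
        workers = max(1, math.ceil(workers / 2))
    return workers


def summarize_failures(failures: list[Exception], total: int) -> str:
    """One-line summary: counts per exception type, plus a 'systematic'
    flag when every failure shares the same exception type."""
    by_type = Counter(type(exc).__name__ for exc in failures)
    systematic = len(by_type) == 1
    types = ", ".join(f"{name} x{n}" for name, n in by_type.most_common())
    summary = f"{len(failures)}/{total} items failed ({types})"
    if systematic:
        summary += "; systematic: all failures share one exception type"
    return summary
```

A systematic summary line (e.g. `188/188 items failed (ConnectionAbortedError x188); systematic: ...`) is what distinguishes the deterministic failure seen in #1821 from ordinary flaky-network noise.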
Previously, zarr chunk uploads used no explicit timeout, relying on OS TCP defaults. For large zarr chunks (e.g., 2.6 GB level-0 OME-Zarr chunks), the upload can take many minutes at typical home upload speeds (~7 min at 50 Mbps, ~70 min at 5 Mbps). Without a timeout, the client either hangs indefinitely on network failures, or the OS TCP stack's default keepalive/idle timeout (often ~120s on Windows) kills the connection before a large upload can complete.

Set an explicit `requests` timeout of `(60, 7200)`: 60 seconds for TCP connect, 2 hours for response read. The read timeout covers the period after the full request body is sent while waiting for S3's response, which is where `ConnectionAbortedError` was observed in #1821.

Note: this does NOT limit the upload body transfer time itself. The `requests` library's read timeout only applies to waiting for response data, not to sending request data, so even very slow multi-hour uploads will not be interrupted by this timeout.

Ref: #1821

Co-Authored-By: Claude Code 2.1.81 / Claude Opus 4.6 <noreply@anthropic.com>
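The timeout semantics above can be sketched with a minimal wrapper. `put_chunk` and the constant names are hypothetical, not the actual dandi-cli code; what is real is the `requests` `(connect, read)` timeout tuple:

```python
import io
import requests

CONNECT_TIMEOUT = 60    # seconds allowed to establish the TCP connection
READ_TIMEOUT = 7200     # seconds to wait for S3's response after the body is sent


def put_chunk(url: str, data: io.BufferedIOBase) -> requests.Response:
    """Hypothetical zarr-chunk PUT with an explicit (connect, read) timeout.

    The read timeout bounds only the wait between sent bytes of the
    response, not the time spent transmitting the request body, so a
    slow multi-hour upload is not interrupted by it.
    """
    resp = requests.put(url, data=data, timeout=(CONNECT_TIMEOUT, READ_TIMEOUT))
    resp.raise_for_status()
    return resp
```

With no `timeout=` at all, `requests` can wait on `getresponse()` forever; the tuple form makes both phases bounded and explicit.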
Is it chunks or shards, or does the code not distinguish between the types of objects? We asked people to use zarr v3 with sharding, so likely these are shards. It may be just a nomenclature issue in the code which can be fixed.
Codecov Report ❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #1827 +/- ##
==========================================
- Coverage 75.12% 75.09% -0.04%
==========================================
Files 84 84
Lines 11931 11939 +8
==========================================
+ Hits 8963 8965 +2
- Misses 2968 2974 +6
Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
The zarr upload/download code has no clue (AFAIK) about zarr structure/metadata indeed, and I think here it also does not really matter, since we do not tell those apart anywhere in the design: not here in the CLI, nor on the dandi-archive backend AFAIK. For the overall issue, which is more grave, see/comment on
Summary
Follows up on #1824 (the `data.seek(0)` fix). While investigating #1821 further with a new log where the digest fix eliminated BadDigest errors, we found that all 188 level-0 chunks (each ~2.6 GB) still fail systematically with `ConnectionAbortedError` at `getresponse()`: the upload body is sent but TCP dies before S3's response arrives. Level-1 chunks (smaller) succeed fine. The user confirmed other large zarr uploads work from the same machine; only this previously-aborted zarr fails, pointing to a server-side issue for which a separate investigation is tracked in #1821.

Changes
Batch-level retry improvements (57c1aab):
- Replace `sleep(1 * retry_count)` with exponential backoff (5s, 20s, 40s, 80s, 120s cap) plus random jitter
- Halve `ThreadPoolExecutor` workers on each batch retry (`max(1, ceil(N/2))`), reducing concurrent connections that may trigger S3 prefix-level throttling
- Better failure diagnostics in `_handle_failed_items_and_raise`: failed/total counts, exception types via `Counter`, "systematic" flag when all failures share the same type

Explicit upload timeout (7fd3014):
- Set `timeout=(60, 7200)` on zarr chunk PUT: 60s TCP connect, 2h response read
- Covers the `getresponse()` wait after the body is fully sent (where `ConnectionAbortedError` was observed)
- Does not limit upload body transfer time (the `requests` read timeout only applies to waiting for response data)

Test plan
🤖 Generated with Claude Code