Add concurrent chunk fetching for external array links#123
bendichter wants to merge 3 commits into main from
Conversation
Route remote external array links through zarr + LindiH5ZarrStore instead of h5py + LindiRemfile. This enables concurrent HTTP range requests via LindiH5ZarrStore.getitems(), which zarr calls when reading multiple chunks. The getitems() method separates serial metadata lookup (fast, uses h5py's B-tree cache) from parallel data fetches (N concurrent HTTP requests via ThreadPoolExecutor instead of N serial ones). Local external array links still use h5py directly since there's no concurrency benefit for local I/O. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
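The two-phase structure described above can be sketched as follows. This is a minimal illustration, not lindi's actual code: `locate_chunk` and `fetch_range` are hypothetical stand-ins for the store's internal metadata lookup and HTTP range fetch.

```python
from concurrent.futures import ThreadPoolExecutor

def getitems_sketch(store, keys, max_workers=8):
    # Phase 1 (serial): resolve each chunk key to an (offset, size) byte
    # range. Fast, because it reads h5py's cached B-tree metadata.
    ranges = {key: store.locate_chunk(key) for key in keys}

    # Phase 2 (parallel): issue the data fetches concurrently, so N
    # requests are in flight instead of N serial round-trips.
    def fetch(item):
        key, (offset, size) = item
        return key, store.fetch_range(offset, size)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(fetch, ranges.items()))
```

Keeping phase 1 serial is cheap because the metadata is already cached; only phase 2 touches the network, which is where the concurrency pays off.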
Merge adjacent/nearby chunk byte ranges into single HTTP requests before concurrent fetching. Two configurable parameters control the behavior: - coalesce_merge_gap (default 256KB): max gap between ranges to merge - coalesce_max_size (default 20MB): max size of a coalesced request Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
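A standalone sketch of this kind of range coalescing, using the default values named above. The real logic lives inside LindiH5ZarrStore; this illustrative version merges sorted `(offset, size)` pairs under the same two constraints.

```python
def coalesce_ranges(ranges, merge_gap=256 * 1024, max_size=20 * 1024 * 1024):
    """Merge nearby (offset, size) byte ranges into larger contiguous requests."""
    merged = []
    # Sort by offset so only neighboring ranges need comparing.
    for offset, size in sorted(ranges):
        if merged:
            prev_offset, prev_size = merged[-1]
            gap = offset - (prev_offset + prev_size)
            combined = (offset + size) - prev_offset
            # Merge when the gap is small enough and the combined request
            # stays under the size cap; otherwise start a new request.
            if gap <= merge_gap and combined <= max_size:
                merged[-1] = (prev_offset, max(prev_size, combined))
                continue
        merged.append((offset, size))
    return merged
```

With the defaults, two chunks 5 bytes apart become one request, while chunks separated by more than 256 KB, or whose merged span would exceed 20 MB, stay separate.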
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

@@            Coverage Diff             @@
##             main     #123      +/- ##
==========================================
- Coverage   81.63%   81.42%   -0.21%
==========================================
  Files          30       30
  Lines        2793     2918     +125
==========================================
+ Hits         2280     2376      +96
- Misses        513      542      +29
==========================================
Standalone script in devel/ that compares serial (h5py+LindiRemfile) vs concurrent (zarr+LindiH5ZarrStore) reads from DANDI, asserts data equivalence, and produces a bar chart showing timings and speedup. Usage: python devel/benchmark_concurrent_fetch.py [--dandiset 000473] [-o results.png] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
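A stripped-down version of the comparison that script performs might look like this. Here `read_serial` and `read_concurrent` are stand-ins for the two code paths (h5py + LindiRemfile vs zarr + LindiH5ZarrStore), and the DANDI access and bar-chart plotting are omitted.

```python
import time

def compare_read_paths(read_serial, read_concurrent, selection):
    # Time the serial path.
    t0 = time.perf_counter()
    serial_data = read_serial(selection)
    t_serial = time.perf_counter() - t0

    # Time the concurrent path on the same selection.
    t0 = time.perf_counter()
    concurrent_data = read_concurrent(selection)
    t_concurrent = time.perf_counter() - t0

    # Assert data equivalence before reporting any speedup, as the
    # benchmark script does.
    assert list(serial_data) == list(concurrent_data), "data mismatch"
    speedup = t_serial / t_concurrent if t_concurrent > 0 else float("inf")
    return t_serial, t_concurrent, speedup
```

The equivalence assertion matters: a speedup number is only meaningful if both paths return identical data.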
Summary
- Add `getitems()` to `LindiH5ZarrStore`, enabling concurrent HTTP range requests when zarr reads multiple chunks
- Route remote external array links through `zarr + LindiH5ZarrStore` instead of `h5py + LindiRemfile`, so chunk fetches go through `getitems()` and run in parallel via `ThreadPoolExecutor`

Motivation
When reading large remote NWB datasets via LINDI (e.g., from DANDI), reads are slow because h5py fetches chunks serially through LindiRemfile, one HTTP request per chunk. h5py has no batch-read API, so parallelism can't be added at the LindiRemfile level. Zarr does have such an API: `getitems()`, which receives all needed chunk keys in a single call.

How it works
Before: `h5py.File(LindiRemfile(url))[dataset][selection]` → serial HTTP requests

After: `zarr.open_array(LindiH5ZarrStore(url))[selection]` → `store.getitems(chunk_keys)` → serial metadata lookup (fast, h5py B-tree cache) → coalesce nearby byte ranges → concurrent data fetches (`ThreadPoolExecutor`, up to 8 workers)

Byte range coalescing
Before issuing HTTP requests, nearby byte ranges are merged into larger contiguous fetches. This reduces the number of round-trips when chunks are stored close together in the HDF5 file. Two parameters control the behavior:
- `_coalesce_merge_gap` (default 256 KB): maximum gap between ranges to merge into one request
- `_coalesce_max_size` (default 20 MB): maximum size of a single coalesced request

Test plan
- `test_getitems_local_chunks` — multi-chunk getitems with a local file
- `test_getitems_inline_data` — inline (small) dataset path
- `test_getitems_single_chunk_shortcut` — single chunk skips the thread pool
- `test_external_array_link_via_zarr_store` — local external array links still work
- `test_zarr_store_for_external_array` — zarr store serves all chunks correctly with slicing
- `test_getitems_empty_keys` — empty key list returns an empty dict
- `test_coalesce_byte_ranges` — unit tests for range merging logic (gap, max_size, sorting)
- `test_coalesce_integration` — coalesced fetching returns correct data through zarr

🤖 Generated with Claude Code
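To make the empty-key and single-chunk-shortcut behaviors from the test plan concrete, here is a toy in-memory store (hypothetical; not the PR's implementation) that mirrors them and lets a test observe whether the thread pool was used.

```python
from concurrent.futures import ThreadPoolExecutor

class ToyStore:
    """Toy store mirroring two getitems() behaviors from the test plan:
    an empty key list returns {}, and a single key bypasses the pool."""

    def __init__(self, chunks):
        self.chunks = chunks
        self.used_pool = False  # lets tests observe the shortcut

    def getitems(self, keys):
        if not keys:
            return {}  # cf. test_getitems_empty_keys
        if len(keys) == 1:
            # cf. test_getitems_single_chunk_shortcut: no thread pool
            return {keys[0]: self.chunks[keys[0]]}
        self.used_pool = True
        with ThreadPoolExecutor(max_workers=8) as pool:
            return dict(pool.map(lambda k: (k, self.chunks[k]), keys))
```

Skipping the pool for a single chunk avoids paying thread start-up cost when there is nothing to parallelize.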