Add concurrent chunk fetching for external array links#123

Open
bendichter wants to merge 3 commits into main from concurrent-chunk-fetching

Conversation


@bendichter bendichter commented Mar 14, 2026

Summary

  • Add getitems() to LindiH5ZarrStore, enabling concurrent HTTP range requests when zarr reads multiple chunks
  • Route remote external array links through zarr + LindiH5ZarrStore instead of h5py + LindiRemfile, so chunk fetches go through getitems() and run in parallel via a ThreadPoolExecutor
  • Coalesce adjacent/nearby byte ranges into single HTTP requests to reduce round-trips
  • Local external array links still use h5py directly (no concurrency benefit for local I/O)

Motivation

When reading large remote NWB datasets via LINDI (e.g., from DANDI), reads are slow because h5py fetches chunks serially through LindiRemfile — one HTTP request per chunk. h5py has no batch-read API, so parallelism can't be added at the LindiRemfile level. Zarr does have this via getitems(), which receives all needed chunk keys in a single call.

How it works

Before: h5py.File(LindiRemfile(url))[dataset][selection] → serial HTTP requests, one per chunk

After: zarr.open_array(LindiH5ZarrStore(url))[selection] → store.getitems(chunk_keys) → serial metadata lookup (fast, h5py B-tree cache) → coalesce nearby byte ranges → concurrent data fetches (ThreadPoolExecutor, up to 8 workers)
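The two-phase flow described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the helper names `_get_chunk_byte_range` and `_fetch_bytes` are assumptions standing in for whatever the store uses internally to resolve and fetch chunk byte ranges.

```python
# Hypothetical sketch of the getitems() flow: serial metadata
# resolution, then parallel HTTP range fetches.
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 8  # matches the "up to 8 workers" described in the PR


def getitems(store, keys):
    # Phase 1 (serial): resolve each chunk key to an (offset, size)
    # byte range via cached HDF5 B-tree metadata -- fast, no HTTP.
    byte_ranges = {key: store._get_chunk_byte_range(key) for key in keys}

    if len(byte_ranges) == 1:
        # Single-chunk shortcut: skip the thread pool entirely.
        (key, (offset, size)), = byte_ranges.items()
        return {key: store._fetch_bytes(offset, size)}

    # Phase 2 (parallel): issue the HTTP range requests concurrently.
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        futures = {
            key: pool.submit(store._fetch_bytes, offset, size)
            for key, (offset, size) in byte_ranges.items()
        }
        return {key: fut.result() for key, fut in futures.items()}
```

Keeping phase 1 serial is deliberate: metadata lookups hit h5py's in-memory B-tree cache, so only the actual data fetches benefit from concurrency.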

Byte range coalescing

Before issuing HTTP requests, nearby byte ranges are merged into larger contiguous fetches. This reduces the number of round-trips when chunks are stored close together in the HDF5 file. Two parameters control the behavior:

  • _coalesce_merge_gap (default 256KB): maximum gap between ranges to merge into one request
  • _coalesce_max_size (default 20MB): maximum size of a single coalesced request
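The merging pass can be sketched as a single scan over offset-sorted ranges; `coalesce_ranges` and its return shape are illustrative assumptions, not the PR's actual helper, though the gap and size thresholds mirror the defaults above.

```python
# Hypothetical sketch: merge nearby (offset, size) byte ranges into
# larger contiguous fetches, bounded by a merge gap and a max size.
MERGE_GAP = 256 * 1024        # default for _coalesce_merge_gap (256KB)
MAX_SIZE = 20 * 1024 * 1024   # default for _coalesce_max_size (20MB)


def coalesce_ranges(ranges, merge_gap=MERGE_GAP, max_size=MAX_SIZE):
    """Return a list of (offset, size, members) coalesced requests,
    where members lists the original ranges each request covers."""
    if not ranges:
        return []
    # Sort by file offset so adjacency can be detected in one pass.
    ordered = sorted(ranges, key=lambda r: r[0])
    merged = []
    cur_off, cur_size = ordered[0]
    members = [ordered[0]]
    for off, size in ordered[1:]:
        gap = off - (cur_off + cur_size)
        new_end = max(cur_off + cur_size, off + size)
        if gap <= merge_gap and (new_end - cur_off) <= max_size:
            # Close enough and still under the size cap: extend.
            cur_size = new_end - cur_off
            members.append((off, size))
        else:
            # Too far away or too large: start a new request.
            merged.append((cur_off, cur_size, members))
            cur_off, cur_size = off, size
            members = [(off, size)]
    merged.append((cur_off, cur_size, members))
    return merged
```

For example, ranges at offsets 0 and 150 (100 bytes each) would merge into one 250-byte request, while a range tens of megabytes away would stay separate. Each original chunk is then sliced back out of the coalesced response using its offset within the merged request.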

Test plan

  • test_getitems_local_chunks — multi-chunk getitems with local file
  • test_getitems_inline_data — inline (small) dataset path
  • test_getitems_single_chunk_shortcut — single chunk skips thread pool
  • test_external_array_link_via_zarr_store — local external array links still work
  • test_zarr_store_for_external_array — zarr store serves all chunks correctly with slicing
  • test_getitems_empty_keys — empty key list returns empty dict
  • test_coalesce_byte_ranges — unit tests for range merging logic (gap, max_size, sorting)
  • test_coalesce_integration — coalesced fetching returns correct data through zarr
  • Full test suite passes (49/49, excluding pre-existing remote timeouts)

🤖 Generated with Claude Code

bendichter and others added 2 commits March 14, 2026 09:24
Route remote external array links through zarr + LindiH5ZarrStore instead
of h5py + LindiRemfile. This enables concurrent HTTP range requests via
LindiH5ZarrStore.getitems(), which zarr calls when reading multiple chunks.

The getitems() method separates serial metadata lookup (fast, uses h5py's
B-tree cache) from parallel data fetches (N concurrent HTTP requests via
ThreadPoolExecutor instead of N serial ones).

Local external array links still use h5py directly since there's no
concurrency benefit for local I/O.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Merge adjacent/nearby chunk byte ranges into single HTTP requests before
concurrent fetching. Two configurable parameters control the behavior:
- coalesce_merge_gap (default 256KB): max gap between ranges to merge
- coalesce_max_size (default 20MB): max size of a coalesced request

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@bendichter bendichter requested review from magland and rly March 14, 2026 09:30
codecov-commenter commented Mar 14, 2026

Codecov Report

❌ Patch coverage is 83.72093% with 21 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.42%. Comparing base (f48e993) to head (9dddcc9).

Files with missing lines Patch % Lines
lindi/LindiH5ZarrStore/LindiH5ZarrStore.py 81.25% 21 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #123      +/-   ##
==========================================
- Coverage   81.63%   81.42%   -0.21%     
==========================================
  Files          30       30              
  Lines        2793     2918     +125     
==========================================
+ Hits         2280     2376      +96     
- Misses        513      542      +29     

☔ View full report in Codecov by Sentry.

Standalone script in devel/ that compares serial (h5py+LindiRemfile)
vs concurrent (zarr+LindiH5ZarrStore) reads from DANDI, asserts data
equivalence, and produces a bar chart showing timings and speedup.

Usage: python devel/benchmark_concurrent_fetch.py [--dandiset 000473] [-o results.png]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>