
S3Source.open_shard uses get_object (slow) instead of download_file (fast) #82

@maxine-at-forecast

Description


Problem

S3Source.shards uses client.get_object() which returns a StreamingBody — a single sequential HTTP/1.1 stream. On R2/S3, this achieves only 3-6 MB/s per shard.

In contrast, boto3's download_file() uses the Transfer Manager, which issues concurrent multipart range requests and achieves 50-100+ MB/s — a 10-20x speedup.

For a typical workload (360 shards × 250MB each = 90GB), this is the difference between ~5 hours and ~20 minutes.
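The arithmetic behind those estimates, assuming ~5 MB/s for the streaming path and ~75 MB/s for the Transfer Manager path:

```python
# Back-of-envelope transfer time for 360 shards x 250 MB each.
total_mb = 360 * 250                 # 90,000 MB = 90 GB
slow_hours = total_mb / 5 / 3600     # ~5 MB/s via get_object streaming
fast_minutes = total_mb / 75 / 60    # ~75 MB/s via download_file
print(f"{slow_hours:.1f} h vs {fast_minutes:.0f} min")  # → 5.0 h vs 20 min
```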

Current code (slow path)

```python
# S3Source.shards
response = client.get_object(Bucket=self.bucket, Key=key)
stream = response["Body"]  # StreamingBody, ~5 MB/s
yield uri, stream
```

Suggested fix

Override open_shard to download to a temp file via download_file, then open locally:

```python
import os
import tempfile

def open_shard(self, shard_id):
    key = self._uri_to_key(shard_id)
    tmp = tempfile.NamedTemporaryFile(suffix=".tar", delete=False)
    tmp.close()
    # download_file uses the Transfer Manager (concurrent ranged GETs)
    self._get_client().download_file(self.bucket, key, tmp.name)
    f = open(tmp.name, "rb")
    os.unlink(tmp.name)  # POSIX: data stays readable until f is closed
    return f
```
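If the defaults aren't fast enough, the Transfer Manager's parallelism can also be tuned via boto3's TransferConfig. This is a configuration sketch, not measured settings — the part size and concurrency values are illustrative, and `client`, `bucket`, `key`, and `tmp` stand in for the names above:

```python
from boto3.s3.transfer import TransferConfig

# Illustrative tuning: 8 MB parts fetched by 16 concurrent threads.
config = TransferConfig(
    multipart_chunksize=8 * 1024 * 1024,
    max_concurrency=16,
)
client.download_file(bucket, key, tmp.name, Config=config)
```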

Optional enhancement: cache_dir

Accept a cache_dir parameter (like webdataset's WebDataset(urls, cache_dir=...)) so repeated iterations over the same shards don't re-download. This would benefit two-pass workflows (e.g., PCA fit then transform) where the same data is iterated twice.
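A minimal sketch of what the cache_dir behavior could look like. The function name and the injected `download` callable are hypothetical — the callable stands in for `client.download_file` so the caching logic can be exercised without S3:

```python
import hashlib
import os

def open_cached_shard(bucket, key, cache_dir, download):
    """Download a shard once; later iterations reuse the local copy.

    `download(bucket, key, path)` is a stand-in for
    client.download_file, injected to keep this testable.
    """
    os.makedirs(cache_dir, exist_ok=True)
    # Hash the key so nested S3 prefixes map to flat, safe filenames.
    name = hashlib.sha256(f"{bucket}/{key}".encode()).hexdigest() + ".tar"
    path = os.path.join(cache_dir, name)
    if not os.path.exists(path):
        part = path + ".part"
        download(bucket, key, part)
        os.replace(part, path)  # atomic: readers never see partial files
    return open(path, "rb")
```

The `.part` + `os.replace` dance matters for the two-pass workflows mentioned above: a second iterator never opens a half-downloaded shard.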

Workaround

We implemented a _FastS3Source subclass in qnd-decode that overrides shards and open_shard with download_file + auto-deleting temp files. Happy to contribute this upstream as a PR if useful.
