
S3Source.open_shard uses get_object (slow) instead of download_file (fast) #82

@maxine-at-forecast

Description


Problem

S3Source.shards uses client.get_object() which returns a StreamingBody — a single sequential HTTP/1.1 stream. On R2/S3, this achieves only 3-6 MB/s per shard.

In contrast, boto3's download_file() uses the Transfer Manager, which issues concurrent multipart range requests and achieves 50-100+ MB/s — a 10-20x speedup.

For a typical workload (360 shards × 250MB each = 90GB), this is the difference between ~5 hours and ~20 minutes.
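The arithmetic behind those estimates, assuming ~5 MB/s for the streaming path and ~75 MB/s for the Transfer Manager path:

```python
# Back-of-envelope transfer time for 360 shards x 250 MB each.
total_mb = 360 * 250                 # 90,000 MB = 90 GB
slow_hours = total_mb / 5 / 3600     # ~5 MB/s via get_object streaming
fast_minutes = total_mb / 75 / 60    # ~75 MB/s via download_file
print(f"{slow_hours:.1f} h vs {fast_minutes:.0f} min")  # → 5.0 h vs 20 min
```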

Current code (slow path)

```python
# S3Source.shards
response = client.get_object(Bucket=self.bucket, Key=key)
stream = response["Body"]  # StreamingBody, ~5 MB/s
yield uri, stream
```

Suggested fix

Override open_shard to download to a temp file via download_file, then open locally:

```python
import os
import tempfile

def open_shard(self, shard_id):
    key = self._uri_to_key(shard_id)
    tmp = tempfile.NamedTemporaryFile(suffix=".tar", delete=False)
    tmp.close()
    # download_file uses the Transfer Manager (concurrent ranged GETs)
    self._get_client().download_file(self.bucket, key, tmp.name)
    f = open(tmp.name, "rb")
    os.unlink(tmp.name)  # POSIX: data stays readable until f is closed
    return f
```
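If the defaults aren't fast enough, the Transfer Manager's parallelism can also be tuned via boto3's TransferConfig. This is a configuration sketch, not measured settings — the part size and concurrency values are illustrative, and `client`, `bucket`, `key`, and `tmp` stand in for the names above:

```python
from boto3.s3.transfer import TransferConfig

# Illustrative tuning: 8 MB parts fetched by 16 concurrent threads.
config = TransferConfig(
    multipart_chunksize=8 * 1024 * 1024,
    max_concurrency=16,
)
client.download_file(bucket, key, tmp.name, Config=config)
```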

Optional enhancement: cache_dir

Accept a cache_dir parameter (like webdataset's WebDataset(urls, cache_dir=...)) so repeated iterations over the same shards don't re-download. This would benefit two-pass workflows (e.g., PCA fit then transform) where the same data is iterated twice.
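A minimal sketch of what the cache_dir behavior could look like. The function name and the injected `download` callable are hypothetical — the callable stands in for `client.download_file` so the caching logic can be exercised without S3:

```python
import hashlib
import os

def open_cached_shard(bucket, key, cache_dir, download):
    """Download a shard once; later iterations reuse the local copy.

    `download(bucket, key, path)` is a stand-in for
    client.download_file, injected to keep this testable.
    """
    os.makedirs(cache_dir, exist_ok=True)
    # Hash the key so nested S3 prefixes map to flat, safe filenames.
    name = hashlib.sha256(f"{bucket}/{key}".encode()).hexdigest() + ".tar"
    path = os.path.join(cache_dir, name)
    if not os.path.exists(path):
        part = path + ".part"
        download(bucket, key, part)
        os.replace(part, path)  # atomic: readers never see partial files
    return open(path, "rb")
```

The `.part` + `os.replace` dance matters for the two-pass workflows mentioned above: a second iterator never opens a half-downloaded shard.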

Workaround

We implemented a _FastS3Source subclass in qnd-decode that overrides shards and open_shard with download_file + auto-deleting temp files. Happy to contribute this upstream as a PR if useful.
