## Problem

`S3Source.shards` uses `client.get_object()`, which returns a `StreamingBody` — a single sequential HTTP/1.1 stream. On R2/S3, this achieves only 3-6 MB/s per shard.

In contrast, `boto3`'s `download_file()` uses the Transfer Manager with concurrent multipart range requests and achieves 50-100+ MB/s — a 10-20x speedup.

For a typical workload (360 shards × 250 MB each = 90 GB), this is the difference between ~5 hours and ~20 minutes.
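To make the timing numbers concrete, here is the arithmetic using the midpoints of the quoted throughput ranges (5 MB/s and 75 MB/s are illustrative midpoints, not measurements):

```python
shards = 360
shard_mb = 250
total_mb = shards * shard_mb            # 90,000 MB = 90 GB

slow_mbps = 5    # sequential StreamingBody, midpoint of 3-6 MB/s
fast_mbps = 75   # Transfer Manager, midpoint of 50-100 MB/s

hours_slow = total_mb / slow_mbps / 3600
minutes_fast = total_mb / fast_mbps / 60
print(hours_slow, minutes_fast)  # → 5.0 20.0
```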
## Current code (slow path)

```python
# S3Source.shards
response = client.get_object(Bucket=self.bucket, Key=key)
stream = response["Body"]  # StreamingBody, ~5 MB/s sequential reads
yield uri, stream
```
## Suggested fix

Override `open_shard` to download to a temp file via `download_file`, then open locally:
```python
import os
import tempfile

def open_shard(self, shard_id):
    key = self._uri_to_key(shard_id)
    tmp = tempfile.NamedTemporaryFile(suffix=".tar", delete=False)
    tmp.close()
    self._get_client().download_file(self.bucket, key, tmp.name)
    f = open(tmp.name, "rb")
    os.unlink(tmp.name)  # POSIX: data stays readable until f is closed
    return f
```
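The Transfer Manager's parallelism can also be tuned via boto3's `TransferConfig`; the values below are illustrative, not recommendations:

```python
from boto3.s3.transfer import TransferConfig

# Illustrative tuning: 16 MB parts, up to 16 concurrent range requests
config = TransferConfig(
    multipart_chunksize=16 * 1024 * 1024,
    max_concurrency=16,
)
# Then pass it through: client.download_file(bucket, key, dest, Config=config)
```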
## Optional enhancement: `cache_dir`

Accept a `cache_dir` parameter (like webdataset's `WebDataset(urls, cache_dir=...)`) so repeated iterations over the same shards don't re-download. This would benefit two-pass workflows (e.g., PCA fit then transform) where the same data is iterated twice.
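The caching logic could be sketched independently of the S3 specifics; `cached_download` is a hypothetical helper name, not existing API:

```python
import os

def cached_download(cache_dir, key, download_fn):
    """Download into cache_dir via download_fn(dest_path); skip if cached."""
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, key.replace("/", "_"))
    if not os.path.exists(path):
        # e.g. download_fn = lambda p: client.download_file(bucket, key, p)
        download_fn(path)
    return path
```

On a second pass (e.g., the PCA transform step) the path already exists, so the download is skipped entirely.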
## Workaround

We implemented a `_FastS3Source` subclass in qnd-decode that overrides `shards` and `open_shard` with `download_file` + auto-deleting temp files. Happy to contribute this upstream as a PR if useful.
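The auto-deleting temp files rely on POSIX unlink semantics (this does not work on Windows, where an open file cannot be unlinked); a minimal demonstration:

```python
import os
import tempfile

tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(b"shard bytes")
tmp.close()

f = open(tmp.name, "rb")
os.unlink(tmp.name)   # name is removed from the filesystem immediately...
data = f.read()       # ...but the open handle still sees the contents
f.close()             # the inode is freed here, once the last handle closes
print(data)  # → b'shard bytes'
```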