[DataLoader] Add ArrivalOrder API and batch_size support#537

Merged
robreeves merged 1 commit into linkedin:main from robreeves:pyice
Apr 10, 2026


@robreeves robreeves commented Apr 8, 2026

Summary

Re-introduce the ArrivalOrder scan order and batch_size parameter that were removed in #504. The original removal was necessary because the fork dependency (sumedhsakdeo/iceberg-python) made it ineligible for use internally at LI. Now that li-pyiceberg==0.11.3 includes the ArrivalOrder API from upstream (apache/iceberg-python#3046), we can restore the functionality using an approved registry dependency.

In a future PR, we will add support for streaming multiple splits concurrently using ArrivalOrder with concurrent_streams > 1.

Changes

  • Client-facing API Changes
  • Internal API Changes
  • Bug Fixes
  • New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

pyproject.toml — Bump li-pyiceberg from 0.11.2 to 0.11.3 which includes the ArrivalOrder API.

data_loader.py — Re-add batch_size parameter to OpenHouseDataLoader.__init__, forwarded to each DataLoaderSplit.

data_loader_split.py — Re-add ArrivalOrder import and batch_size parameter. to_record_batches() now uses order=ArrivalOrder(concurrent_streams=1, batch_size=self._batch_size).
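
The forwarding described in the two entries above can be sketched as follows. This is a minimal, self-contained illustration and not the code in this PR: `ArrivalOrder` here is a local stand-in dataclass that only mirrors the two keyword arguments named above (the real class comes from li-pyiceberg), and `SketchLoader`, `SketchSplit`, and `scan_kwargs` are hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ArrivalOrder:
    """Local stand-in for the scan order class; NOT the real pyiceberg class."""

    concurrent_streams: int = 1
    batch_size: Optional[int] = None


class SketchSplit:
    """Hypothetical stand-in for DataLoaderSplit."""

    def __init__(self, file_path: str, batch_size: Optional[int] = None) -> None:
        self._file_path = file_path
        self._batch_size = batch_size

    def scan_kwargs(self) -> Dict[str, ArrivalOrder]:
        # Builds the order=... argument that the description says is passed
        # to to_record_batches(); concurrent_streams is fixed at 1 here.
        return {
            "order": ArrivalOrder(concurrent_streams=1, batch_size=self._batch_size)
        }


class SketchLoader:
    """Hypothetical stand-in for OpenHouseDataLoader."""

    def __init__(self, file_paths: List[str], batch_size: Optional[int] = None) -> None:
        self._file_paths = file_paths
        self._batch_size = batch_size

    def splits(self) -> List[SketchSplit]:
        # Every split receives the same batch_size the loader was built with.
        return [SketchSplit(p, batch_size=self._batch_size) for p in self._file_paths]


loader = SketchLoader(["a.parquet", "b.parquet"], batch_size=1024)
orders = [split.scan_kwargs()["order"] for split in loader.splits()]
```

Keeping `batch_size` on the loader and copying it into each split means a caller sets it once, and each split can build its own scan order without referring back to the loader.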

uv.lock — Regenerated to resolve li-pyiceberg==0.11.3 from the internal registry.

Tests — Re-add test_arrival_order.py (verifies ScanOrder class hierarchy and ArrowScan.to_record_batches with order parameter). Re-add batch_size tests in test_data_loader.py, test_data_loader_split.py, and integration_tests.py.

Testing Done

  • Manually tested on local docker setup. Please include the commands run and their output.
  • Added new tests for the changes made.
  • Updated existing tests to reflect the changes made.
  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
  • Some other form of testing like staging or soak time in production. Please explain.

make verify passes — 202 tests pass, lint, format, and mypy all green.

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.

batch_size is re-added to OpenHouseDataLoader and DataLoaderSplit. This is additive (new optional parameter with default None), so existing callers are unaffected.
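
As a small illustration of why this is additive, the sketch below models the signature change only. `make_loader` is a hypothetical function, not part of the data loader API, and the dictionary it returns is just a placeholder for a constructed loader.

```python
from typing import Any, Dict, Optional


def make_loader(table: str, batch_size: Optional[int] = None) -> Dict[str, Any]:
    # Hypothetical model of a constructor that gained an optional keyword.
    # Callers that never pass batch_size get the default of None.
    return {"table": table, "batch_size": batch_size}


# A call site written before the parameter existed still works unchanged.
old_style = make_loader("db.events")

# A new call site opts in explicitly.
new_style = make_loader("db.events", batch_size=512)
```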

…iceberg

Re-introduce the ArrivalOrder scan order and batch_size parameter that
were removed in linkedin#504. The original removal was necessary because the
fork dependency (sumedhsakdeo/iceberg-python) could not pass ELR.

Now that li-pyiceberg 0.11.3 includes the ArrivalOrder API from
upstream (apache/iceberg-python#3046), we can restore the functionality
using an approved registry dependency.
@robreeves robreeves changed the title from "[DataLoader] Re-add ArrivalOrder API and batch_size support" to "[DataLoader] Add ArrivalOrder API and batch_size support" Apr 8, 2026
@robreeves robreeves merged commit f93a5f4 into linkedin:main Apr 10, 2026
2 checks passed
