feat: add async 'for' loop support to LogScanner (#424) #438
qzyu999 wants to merge 14 commits into apache:main from
Conversation
@qzyu999 Ty for the PR, but I checked this branch out and integration tests for Python still hang even when I run them locally. PTAL
Hi @fresh-borzoni, applied the fix. Ran
…within a local scope in `to_arrow`
Pull request overview
This PR adds Python `async for` support for the PyO3 `LogScanner` binding to address Issue #424, aiming to let PyFluss users iterate scanner results via the native async-iterator protocol instead of manual polling loops.
Changes:
- Added a new Python integration test covering `async for record in scanner` on a record-based `LogScanner`.
- Refactored the Rust `LogScanner` binding to store scanner state behind `Arc<tokio::sync::Mutex<_>>` with a pending-records buffer for per-record yielding.
- Implemented `__aiter__`/`__anext__` in the Rust binding (via `future_into_py`) to produce awaitable next-items for async iteration.
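For readers less familiar with the protocol being wired up here, the `async for` machinery these changes target can be sketched in pure Python (the `FakeScanner` class below is illustrative only, not part of the PR):

```python
import asyncio

class FakeScanner:
    """Minimal stand-in for an object implementing the async-iterator protocol."""
    def __init__(self, records):
        self._records = list(records)

    def __aiter__(self):
        # Returning self makes the object its own async iterator.
        return self

    async def __anext__(self):
        if not self._records:
            raise StopAsyncIteration  # signals a normal end of `async for`
        return self._records.pop(0)

async def collect(scanner):
    out = []
    async for record in scanner:
        out.append(record)
    return out

print(asyncio.run(collect(FakeScanner([1, 2, 3]))))  # → [1, 2, 3]
```

This is the surface that the Rust `__aiter__`/`__anext__` bindings expose to Python callers.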
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| bindings/python/test/test_log_table.py | Adds an async-iterator integration test for LogScanner. |
| bindings/python/src/table.rs | Introduces async iterator support and refactors scanner state management for Python bindings. |
```rust
fn __aiter__<'py>(slf: PyRef<'py, Self>) -> PyResult<Bound<'py, PyAny>> {
    let py = slf.py();
    let code = pyo3::ffi::c_str!(
        r#"
async def _adapter(obj):
    while True:
        try:
            yield await obj.__anext__()
        except StopAsyncIteration:
            break
"#
    );
    let globals = pyo3::types::PyDict::new(py);
    py.run(code, Some(&globals), None)?;
    let adapter = globals.get_item("_adapter")?.unwrap();
    // Return _adapter(self)
    adapter.call1((slf.into_bound_py_any(py)?,))
}
```
`__aiter__` recompiles and executes Python source via `py.run()` on every iteration start. Consider caching the adapter function (e.g., in a `PyOnceLock`) or returning `self` directly as the async iterator if possible; this avoids repeated code compilation and reduces overhead per `async for` loop.
bindings/python/src/table.rs
Outdated
```rust
let mut state = state_arc.lock().await;

// 1. If we already have buffered records, pop and return immediately
if let Some(record) = state.pending_records.pop_front() {
    return Ok(record.into_any());
}

// 2. Buffer is empty, we must poll the network for the next batch.
// The underlying kind must be a Record-based scanner.
let scanner = match state.kind.as_record() {
    Ok(s) => s,
    Err(_) => {
        return Err(pyo3::exceptions::PyStopAsyncIteration::new_err(
            "Stream Ended",
        ));
    }
};

// Poll with a reasonable internal timeout before unblocking the event loop
let timeout = core::time::Duration::from_millis(5000);

let mut current_records = scanner
    .poll(timeout)
    .await
    .map_err(|e| FlussError::from_core_error(&e))?;

// If it's a real timeout with zero records, loop or throw StopAsyncIteration?
// Since it's a streaming log, we can yield None or block. Blocking requires a loop in the future.
while current_records.is_empty() {
    current_records = scanner
        .poll(timeout)
        .await
        .map_err(|e| FlussError::from_core_error(&e))?;
}
```
`__anext__` holds `state_arc.lock()` across `scanner.poll(timeout).await` (and the retry loop). This blocks all other methods needing the same mutex (e.g., `subscribe`/`unsubscribe`/`poll`/`to_arrow`) for the full network wait time and can lead to poor responsiveness or deadlock-like behavior under concurrent use. Consider narrowing the critical section (e.g., split locks for `kind` vs `pending_records`, or temporarily take/move the scanner out of the state while polling).
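The responsiveness hazard described here can be reproduced with a toy asyncio model (not the binding itself): a lock held across a slow await serializes every other lock user behind the network wait, while a narrowed critical section keeps them responsive. The 0.2 s sleep stands in for `scanner.poll(timeout)`:

```python
import asyncio
import time

async def broad_poll(lock):
    # Anti-pattern: the lock is held across the slow, network-like await.
    async with lock:
        await asyncio.sleep(0.2)  # stand-in for scanner.poll(timeout)

async def narrow_poll(lock):
    # Better: do the slow poll without the lock, then take it only briefly
    # to push results into the shared buffer.
    await asyncio.sleep(0.2)  # poll happens outside the critical section
    async with lock:
        pass  # buffer update only

async def other_method(lock):
    # Simulates subscribe()/to_arrow() needing the same mutex briefly.
    start = time.monotonic()
    async with lock:
        pass
    return time.monotonic() - start

async def measure(poller):
    lock = asyncio.Lock()
    task = asyncio.create_task(poller(lock))
    await asyncio.sleep(0.01)  # let the poller start first
    waited = await other_method(lock)
    await task
    return waited

blocked = asyncio.run(measure(broad_poll))      # stuck ~0.19s behind the poll
responsive = asyncio.run(measure(narrow_poll))  # lock is free almost instantly
assert blocked > responsive
```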
bindings/python/src/table.rs
Outdated
```rust
let scanner_ref =
    unsafe { &*(&self.state as *const std::sync::Arc<tokio::sync::Mutex<ScannerState>>) };
let lock = TOKIO_RUNTIME.block_on(async { scanner_ref.lock().await });
```
Avoid the unsafe pointer cast when accessing `self.state`. You can lock the mutex directly via `self.state.lock()` (or clone the `Arc` first) without `unsafe`; the current cast is unnecessary and introduces unsoundness risk if the field type ever changes.
Suggested change:
```diff
-let scanner_ref =
-    unsafe { &*(&self.state as *const std::sync::Arc<tokio::sync::Mutex<ScannerState>>) };
-let lock = TOKIO_RUNTIME.block_on(async { scanner_ref.lock().await });
+let lock = TOKIO_RUNTIME.block_on(async { self.state.lock().await });
```
bindings/python/src/table.rs
Outdated
```rust
let scanner_ref =
    unsafe { &*(&self.state as *const std::sync::Arc<tokio::sync::Mutex<ScannerState>>) };
let lock = TOKIO_RUNTIME.block_on(async { scanner_ref.lock().await });
```
`query_latest_offsets()` uses the same unsafe cast pattern to lock `self.state`. Please replace this with a safe lock on `self.state` (or a cloned `Arc`) to avoid unnecessary `unsafe` in the Python bindings.
Suggested change:
```diff
-let scanner_ref =
-    unsafe { &*(&self.state as *const std::sync::Arc<tokio::sync::Mutex<ScannerState>>) };
-let lock = TOKIO_RUNTIME.block_on(async { scanner_ref.lock().await });
+let lock = TOKIO_RUNTIME.block_on(async { self.state.lock().await });
```
bindings/python/src/table.rs
Outdated
```rust
return Err(pyo3::exceptions::PyStopAsyncIteration::new_err(
    "Stream Ended",
```
`__anext__` treats the batch-based scanner variant as end-of-stream (`StopAsyncIteration`). That will silently terminate `async for` on scanners created via `create_record_batch_log_scanner()`, and it also masks the helpful error message from `as_record()`. Either implement async iteration for the batch variant (yielding `RecordBatch`/Arrow), or raise a `TypeError` explaining that async iteration is only supported for record scanners.
Suggested change:
```diff
-return Err(pyo3::exceptions::PyStopAsyncIteration::new_err(
-    "Stream Ended",
+return Err(PyTypeError::new_err(
+    "Async iteration is only supported for record scanners; \
+     use create_record_log_scanner() instead.",
```
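The behavioral difference behind this suggestion is that `async for` swallows `StopAsyncIteration` as a normal loop exit, while a `TypeError` propagates to the caller. A minimal illustration with hypothetical iterator classes:

```python
import asyncio

class SilentEnd:
    """Mimics the current batch-scanner behavior: looks like end-of-stream."""
    def __aiter__(self):
        return self
    async def __anext__(self):
        raise StopAsyncIteration("Stream Ended")

class LoudError:
    """Mimics the suggested fix: the misuse is reported to the caller."""
    def __aiter__(self):
        return self
    async def __anext__(self):
        raise TypeError("Async iteration is only supported for record scanners")

async def consume(it):
    seen = []
    async for x in it:
        seen.append(x)
    return seen

print(asyncio.run(consume(SilentEnd())))  # → [] (loop exits silently, no error)
try:
    asyncio.run(consume(LoudError()))
except TypeError as e:
    print("raised:", e)  # the caller sees the misuse immediately
```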
```python
scanner = await table.new_scan().create_log_scanner()
num_buckets = (await admin.get_table_info(table_path)).num_buckets
scanner.subscribe_buckets({i: fluss.EARLIEST_OFFSET for i in range(num_buckets)})
```
This test only covers `async for` on a record-based scanner (`create_log_scanner()`). Since `LogScanner` can also wrap the batch variant (`create_record_batch_log_scanner()`), consider adding a companion test for async iteration on the batch scanner (or explicitly asserting that async iteration is unsupported there) so the intended behavior is locked in by tests.
bindings/python/src/table.rs
Outdated
```rust
let scanner_ref =
    unsafe { &*(&self.state as *const std::sync::Arc<tokio::sync::Mutex<ScannerState>>) };
let lock = TOKIO_RUNTIME.block_on(async { scanner_ref.lock().await });
let scanner = lock.kind.as_batch()?;
```
Same as in `poll()`: please remove the `unsafe` cast used to get `scanner_ref`. Lock `self.state` directly; keeping this `unsafe` here makes the method harder to reason about and can hide real lifetime/aliasing issues.
Suggested change:
```diff
-let scanner_ref =
-    unsafe { &*(&self.state as *const std::sync::Arc<tokio::sync::Mutex<ScannerState>>) };
-let lock = TOKIO_RUNTIME.block_on(async { scanner_ref.lock().await });
-let scanner = lock.kind.as_batch()?;
+let lock = TOKIO_RUNTIME.block_on(async { self.state.lock().await });
+let scanner = lock.kind.as_batch()?;
```
bindings/python/src/table.rs
Outdated
```rust
let scanner_ref =
    unsafe { &*(&self.state as *const std::sync::Arc<tokio::sync::Mutex<ScannerState>>) };
let lock = TOKIO_RUNTIME.block_on(async { scanner_ref.lock().await });
```
Remove the `unsafe` pointer cast when locking `self.state` in `poll_arrow()`. This can be expressed safely with `self.state.lock().await` (via `TOKIO_RUNTIME.block_on`) and avoids introducing UB hazards.
Suggested change:
```diff
-let scanner_ref =
-    unsafe { &*(&self.state as *const std::sync::Arc<tokio::sync::Mutex<ScannerState>>) };
-let lock = TOKIO_RUNTIME.block_on(async { scanner_ref.lock().await });
+let lock = TOKIO_RUNTIME.block_on(async { self.state.lock().await });
```
bindings/python/src/table.rs
Outdated
```rust
let scanner_ref = unsafe {
    &*(&self.state as *const std::sync::Arc<tokio::sync::Mutex<ScannerState>>)
};
```
`to_arrow()` also uses an `unsafe` cast to access `self.state`. This should be rewritten to safely clone/borrow `self.state` and lock it without `unsafe` to keep the bindings memory-safe.
Suggested change:
```diff
-let scanner_ref = unsafe {
-    &*(&self.state as *const std::sync::Arc<tokio::sync::Mutex<ScannerState>>)
-};
+let scanner_ref = &self.state;
```
bindings/python/src/table.rs
Outdated
```rust
let scanner_ref =
    unsafe { &*(&self.state as *const std::sync::Arc<tokio::sync::Mutex<ScannerState>>) };
```
`poll_until_offsets()` also relies on the `unsafe` cast to access `self.state`. This should be refactored to lock `self.state` safely; keeping `unsafe` here is especially risky because this method can run for a long time and is on a hot path for `to_arrow()`.
Suggested change:
```diff
-let scanner_ref =
-    unsafe { &*(&self.state as *const std::sync::Arc<tokio::sync::Mutex<ScannerState>>) };
+let scanner_ref = &self.state;
```
@qzyu999 Ty, took a look at the approach, have some ideas. PTAL

The scanner is already thread-safe internally (`&self` on all methods), so the `Mutex` isn't needed; it just adds locking to every call and forces 5 unsafe pointer casts to work around borrow issues it created. The `__anext__` loop is also problematic: it runs inside `tokio::spawn`, so breaking out of `async for` leaves it polling forever in the background.

Simpler idea: store the scanner in an `Arc` and keep the existing methods as-is. Add `_async_poll(timeout_ms)` that does one bounded poll and returns a list. `__aiter__` returns a small Python async generator that calls `_async_poll` and yields records. Breaking out of the loop stops the generator naturally, so no leaks, no `unsafe`, no mutex.
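The design sketched in this comment maps to roughly the following Python shape (`DummyScanner` is a stand-in for the Rust binding; only the `_async_poll` / `__aiter__` surface comes from the discussion):

```python
import asyncio

class DummyScanner:
    """Stand-in for the PyO3 LogScanner with the suggested surface."""
    def __init__(self, batches):
        self._batches = list(batches)

    async def _async_poll(self, timeout_ms=1000):
        # One bounded poll: returns a (possibly empty) list of records.
        await asyncio.sleep(0)  # yield to the event loop, like a real poll
        return self._batches.pop(0) if self._batches else []

    def __aiter__(self):
        async def _gen(scanner):
            while True:
                records = await scanner._async_poll()
                if not records:
                    break  # toy termination; a live stream would poll again
                for record in records:
                    yield record
        # Breaking out of `async for` simply drops this generator:
        # no background task is left polling.
        return _gen(self)

async def collect(scanner):
    return [r async for r in scanner]

print(asyncio.run(collect(DummyScanner([[1, 2], [3]]))))  # → [1, 2, 3]
```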
Hi @fresh-borzoni, thanks for the recommendations. I've taken a look and come up with the following changes, PTAL when available:
…o that they match when talking about Record vs Batch
Hi @fresh-borzoni, I just realized that the original issue #424 is also looking for
fresh-borzoni
left a comment
@qzyu999 TY, looked through, left comments, PTAL
bindings/python/src/table.rs
Outdated
```rust
    PyDeltaAccess, PyDict, PyList, PySequence, PySlice, PyTime, PyTimeAccess, PyTuple, PyType,
    PyTzInfo,
};
use pyo3::{Bound, IntoPyObjectExt, Py, PyAny, PyRef, PyRefMut, PyResult, Python};
```
Do we really need these? It seems to be a leftover and we only need 1-2 objects, maybe only `IntoPyObjectExt`.
Hi @fresh-borzoni, I see that you're correct, as the others are already imported via `use crate::*;` from `lib.rs`. I removed the extra imports in d619b13 and kept only `IntoPyObjectExt`.
```rust
    Ok(df)
}

fn __aiter__<'py>(slf: PyRef<'py, Self>) -> PyResult<Bound<'py, PyAny>> {
```
We need to add this method to the `.pyi` stubs.
Hi @fresh-borzoni, just added `__aiter__` to `__init__.pyi` here: 134e56b, along with `_async_poll` and `_async_poll_batches`.
I think it's better to leave `_async_poll` and `_async_poll_batches` out of the `.pyi` because these methods should ideally stay private implementation details.
Exposing `__aiter__` makes sense to signal to IDEs that we support `async for`, but we don't want to encourage users to call the other underscore methods directly.
Hi @fresh-borzoni, thanks for the clarification, removed those two entries in the `.pyi` file in db23dd6.
```python
    await admin.drop_table(table_path, ignore_if_not_exists=False)


async def test_async_iterator(connection, admin):
```
Can we reduce the number of tests?
- Keep tests that cover the `async for` happy path, break safety, and multiple poll cycles.
- Drop tests that call `_async_poll` / `_async_poll_batches` directly; they're internal methods, tested implicitly through `async for`.
- Drop tests that re-verify existing features (projection, pandas, metadata) through the async path; those features are already covered by sync tests.
Hi @fresh-borzoni, I've made the changes here: efbcb8c. Please let me know whether I correctly removed all the tests requested.
…ter__` and async polling methods.
…_batches directly - they're internal methods, tested implicitly through async for and tests that re-verify existing features (already covered by sync tests).
fresh-borzoni
left a comment
@qzyu999 Ty, nice work, only nit comments remain, PTAL
bindings/python/src/table.rs
Outdated
```rust
        "No buckets subscribed. Call subscribe(), subscribe_buckets(), subscribe_partition(), or subscribe_partition_buckets() first.",
    ));
}
subs.clone()
```
nit: the scoping block + `subs.clone()` were needed with the `Mutex`; they're not needed with `Arc` since all borrows are shared now.
Hi @fresh-borzoni, made the changes in 6ad8cab, tested locally and passing.
bindings/python/src/table.rs
Outdated
```rust
let gen_fn = BATCH_ASYNC_GEN_FN.get_or_init(py, || {
    let code = pyo3::ffi::c_str!(
        r#"
async def _async_batch_scan(scanner, timeout_ms=1000):
```
The two `__aiter__` branches are identical except for the poll method name. You can collapse them into a single `PyOnceLock` + generator that takes a callable, and dispatch by passing `_async_poll` or `_async_poll_batches`.
Hi @fresh-borzoni, made the changes in 3981fff, tested locally and passing.
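For reference, the collapse suggested above looks roughly like this in pure Python (the scanner class and its poll methods are stand-ins, not the actual binding):

```python
import asyncio

async def _async_scan(poll, timeout_ms=1000):
    # One generator serves both scanner kinds; the caller passes in
    # whichever bounded-poll coroutine function applies.
    while True:
        items = await poll(timeout_ms)
        if not items:
            break  # toy termination; a live stream would poll again
        for item in items:
            yield item

class Scanner:
    def __init__(self, records, batches):
        self._records, self._batches = list(records), list(batches)

    async def _async_poll(self, timeout_ms):
        return [self._records.pop(0)] if self._records else []

    async def _async_poll_batches(self, timeout_ms):
        return [self._batches.pop(0)] if self._batches else []

    def iter_records(self):
        return _async_scan(self._async_poll)           # record branch

    def iter_batches(self):
        return _async_scan(self._async_poll_batches)   # batch branch

async def drain(agen):
    return [x async for x in agen]

s = Scanner(["r1", "r2"], ["b1"])
print(asyncio.run(drain(s.iter_records())))  # → ['r1', 'r2']
print(asyncio.run(drain(s.iter_batches())))  # → ['b1']
```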
…re private implementation details
…d with Mutex no longer being used
…ngle PyOnceLock and generator that takes a callable
Thank you so much @fresh-borzoni! I really appreciate you taking the time to help review the PR.
FYI: we're cutting a release branch soon, hence we are taking extra time to review/merge. Just wanted contributors to stay informed about the longer turnaround.
Purpose

Linked issue: close #424

This pull request completes Issue #424 by enabling the native Python `async for` built-in over the PyO3-wrapped `LogScanner` stream instance.

Brief change log

Previously, PyFluss developers had to manually orchestrate `while True` polling loops over the network boundary using `scanner.poll(timeout)`. This PR refactors the Python `LogScanner` iterator logic by implementing async traversal natively via Rust `__anext__` polling bindings and a Python generator `__aiter__` adapter:

- Refactored `ScannerKind` internals into a safely buffered `Arc<tokio::sync::Mutex<ScannerState>>`. This guarantees thread-safety and fulfills Rust's lifetime constraints, enabling unboxed state transitions inside the `python_async_runtimes` tokio closure.
- The `.await` future yield sequence proceeds smoothly without blocking the event loop or hardware threads directly.
- To satisfy `inspect.isasyncgen()` compliance checks in strictly versioned Python 3.12+ engines (such as modern IPython/Jupyter servers), `__aiter__` dynamically generates a properly wrapped coroutine generator via `py.run()`. This masks the Python ecosystem's iterator type limitations out-of-the-box.

Tests

`test_log_table.py::test_async_iterator`: an integration test on a testcontainers setup confirming that `async for record in scanner` works end-to-end without pipeline interruptions, yielding thousands of appended records sequentially and matching the existing data frameworks.

API and Format

Yes, this extends the API to allow `async for` loops. Existing user logic relying on explicit `.poll_arrow()` calls or other existing functions is untouched.

Documentation

Yes, the updated integration tests act as live documentation demonstrating the capability.