feat: add async 'for' loop support to LogScanner (#424) #438
qzyu999 wants to merge 14 commits into apache:main from
Conversation
@qzyu999 Ty for the PR, but I checked this branch out and integration tests for Python still hang even when I run them locally. PTAL
Hi @fresh-borzoni, applied the fix. Ran
…within a local scope in `to_arrow`
Pull request overview
This PR adds Python `async for` support for the PyO3 `LogScanner` binding to address Issue #424, aiming to let PyFluss users iterate scanner results via the native async-iterator protocol instead of manual polling loops.
Changes:
- Added a new Python integration test covering `async for record in scanner` on a record-based `LogScanner`.
- Refactored the Rust `LogScanner` binding to store scanner state behind `Arc<tokio::sync::Mutex<_>>` with a pending-records buffer for per-record yielding.
- Implemented `__aiter__`/`__anext__` in the Rust binding (via `future_into_py`) to produce awaitable next-items for async iteration.
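For readers less familiar with the protocol being wired up here, the `async for` machinery these changes target can be sketched in pure Python (the `FakeScanner` class below is illustrative only, not part of the PR):

```python
import asyncio

class FakeScanner:
    """Minimal stand-in for an object implementing the async-iterator protocol."""
    def __init__(self, records):
        self._records = list(records)

    def __aiter__(self):
        # Returning self makes the object its own async iterator.
        return self

    async def __anext__(self):
        if not self._records:
            raise StopAsyncIteration  # signals a normal end of `async for`
        return self._records.pop(0)

async def collect(scanner):
    out = []
    async for record in scanner:
        out.append(record)
    return out

print(asyncio.run(collect(FakeScanner([1, 2, 3]))))  # → [1, 2, 3]
```

This is the surface that the Rust `__aiter__`/`__anext__` bindings expose to Python callers.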
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| bindings/python/test/test_log_table.py | Adds an async-iterator integration test for LogScanner. |
| bindings/python/src/table.rs | Introduces async iterator support and refactors scanner state management for Python bindings. |
```rust
fn __aiter__<'py>(slf: PyRef<'py, Self>) -> PyResult<Bound<'py, PyAny>> {
    let py = slf.py();
    let code = pyo3::ffi::c_str!(
        r#"
async def _adapter(obj):
    while True:
        try:
            yield await obj.__anext__()
        except StopAsyncIteration:
            break
"#
    );
    let globals = pyo3::types::PyDict::new(py);
    py.run(code, Some(&globals), None)?;
    let adapter = globals.get_item("_adapter")?.unwrap();
    // Return _adapter(self)
    adapter.call1((slf.into_bound_py_any(py)?,))
}
```
`__aiter__` recompiles and executes Python source via `py.run()` on every iteration start. Consider caching the adapter function (e.g., in a `PyOnceLock`) or returning `self` directly as the async iterator if possible; this avoids repeated code compilation and reduces overhead per `async for` loop.
bindings/python/src/table.rs
Outdated
```rust
let mut state = state_arc.lock().await;

// 1. If we already have buffered records, pop and return immediately
if let Some(record) = state.pending_records.pop_front() {
    return Ok(record.into_any());
}

// 2. Buffer is empty, we must poll the network for the next batch.
// The underlying kind must be a Record-based scanner.
let scanner = match state.kind.as_record() {
    Ok(s) => s,
    Err(_) => {
        return Err(pyo3::exceptions::PyStopAsyncIteration::new_err(
            "Stream Ended",
        ));
    }
};

// Poll with a reasonable internal timeout before unblocking the event loop
let timeout = core::time::Duration::from_millis(5000);

let mut current_records = scanner
    .poll(timeout)
    .await
    .map_err(|e| FlussError::from_core_error(&e))?;

// If it's a real timeout with zero records, loop or throw StopAsyncIteration?
// Since it's a streaming log, we can yield None or block. Blocking requires a loop in the future.
while current_records.is_empty() {
    current_records = scanner
        .poll(timeout)
        .await
        .map_err(|e| FlussError::from_core_error(&e))?;
}
```
`__anext__` holds `state_arc.lock()` across `scanner.poll(timeout).await` (and the retry loop). This blocks all other methods needing the same mutex (e.g., `subscribe`/`unsubscribe`/`poll`/`to_arrow`) for the full network wait time and can lead to poor responsiveness or deadlock-like behavior under concurrent use. Consider narrowing the critical section (e.g., split locks for `kind` vs `pending_records`, or temporarily take/move the scanner out of the state while polling).
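The responsiveness hazard described here can be reproduced with a toy asyncio model (not the binding itself): a lock held across a slow await serializes every other lock user behind the network wait, while a narrowed critical section keeps them responsive. The 0.2 s sleep stands in for `scanner.poll(timeout)`:

```python
import asyncio
import time

async def broad_poll(lock):
    # Anti-pattern: the lock is held across the slow, network-like await.
    async with lock:
        await asyncio.sleep(0.2)  # stand-in for scanner.poll(timeout)

async def narrow_poll(lock):
    # Better: do the slow poll without the lock, then take it only briefly
    # to push results into the shared buffer.
    await asyncio.sleep(0.2)  # poll happens outside the critical section
    async with lock:
        pass  # buffer update only

async def other_method(lock):
    # Simulates subscribe()/to_arrow() needing the same mutex briefly.
    start = time.monotonic()
    async with lock:
        pass
    return time.monotonic() - start

async def measure(poller):
    lock = asyncio.Lock()
    task = asyncio.create_task(poller(lock))
    await asyncio.sleep(0.01)  # let the poller start first
    waited = await other_method(lock)
    await task
    return waited

blocked = asyncio.run(measure(broad_poll))      # stuck ~0.19s behind the poll
responsive = asyncio.run(measure(narrow_poll))  # lock is free almost instantly
assert blocked > responsive
```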
bindings/python/src/table.rs
Outdated
```rust
let scanner_ref =
    unsafe { &*(&self.state as *const std::sync::Arc<tokio::sync::Mutex<ScannerState>>) };
let lock = TOKIO_RUNTIME.block_on(async { scanner_ref.lock().await });
```
Avoid the unsafe pointer cast when accessing `self.state`. You can lock the mutex directly via `self.state.lock()` (or clone the `Arc` first) without `unsafe`; the current cast is unnecessary and introduces unsoundness risk if the field type ever changes.
Suggested change:
```diff
-let scanner_ref =
-    unsafe { &*(&self.state as *const std::sync::Arc<tokio::sync::Mutex<ScannerState>>) };
-let lock = TOKIO_RUNTIME.block_on(async { scanner_ref.lock().await });
+let lock = TOKIO_RUNTIME.block_on(async { self.state.lock().await });
```
bindings/python/src/table.rs
Outdated
```rust
let scanner_ref =
    unsafe { &*(&self.state as *const std::sync::Arc<tokio::sync::Mutex<ScannerState>>) };
let lock = TOKIO_RUNTIME.block_on(async { scanner_ref.lock().await });
```
`query_latest_offsets()` uses the same unsafe cast pattern to lock `self.state`. Please replace this with a safe lock on `self.state` (or a cloned `Arc`) to avoid unnecessary `unsafe` in the Python bindings.
Suggested change:
```diff
-let scanner_ref =
-    unsafe { &*(&self.state as *const std::sync::Arc<tokio::sync::Mutex<ScannerState>>) };
-let lock = TOKIO_RUNTIME.block_on(async { scanner_ref.lock().await });
+let lock = TOKIO_RUNTIME.block_on(async { self.state.lock().await });
```
bindings/python/src/table.rs
Outdated
```rust
return Err(pyo3::exceptions::PyStopAsyncIteration::new_err(
    "Stream Ended",
```
`__anext__` treats the batch-based scanner variant as end-of-stream (`StopAsyncIteration`). That will silently terminate `async for` on scanners created via `create_record_batch_log_scanner()`, and it also masks the helpful error message from `as_record()`. Either implement async iteration for the batch variant (yielding `RecordBatch`/Arrow), or raise a `TypeError` explaining that async iteration is only supported for record scanners.
Suggested change:
```diff
-return Err(pyo3::exceptions::PyStopAsyncIteration::new_err(
-    "Stream Ended",
+return Err(PyTypeError::new_err(
+    "Async iteration is only supported for record scanners; \
+     use create_record_log_scanner() instead.",
```
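The behavioral difference behind this suggestion is that `async for` swallows `StopAsyncIteration` as a normal loop exit, while a `TypeError` propagates to the caller. A minimal illustration with hypothetical iterator classes:

```python
import asyncio

class SilentEnd:
    """Mimics the current batch-scanner behavior: looks like end-of-stream."""
    def __aiter__(self):
        return self
    async def __anext__(self):
        raise StopAsyncIteration("Stream Ended")

class LoudError:
    """Mimics the suggested fix: the misuse is reported to the caller."""
    def __aiter__(self):
        return self
    async def __anext__(self):
        raise TypeError("Async iteration is only supported for record scanners")

async def consume(it):
    seen = []
    async for x in it:
        seen.append(x)
    return seen

print(asyncio.run(consume(SilentEnd())))  # → [] (loop exits silently, no error)
try:
    asyncio.run(consume(LoudError()))
except TypeError as e:
    print("raised:", e)  # the caller sees the misuse immediately
```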
```python
scanner = await table.new_scan().create_log_scanner()
num_buckets = (await admin.get_table_info(table_path)).num_buckets
scanner.subscribe_buckets({i: fluss.EARLIEST_OFFSET for i in range(num_buckets)})
```
This test only covers `async for` on a record-based scanner (`create_log_scanner()`). Since `LogScanner` can also wrap the batch variant (`create_record_batch_log_scanner()`), consider adding a companion test for async iteration on the batch scanner (or explicitly asserting that async iteration is unsupported there) so the intended behavior is locked in by tests.
bindings/python/src/table.rs
Outdated
```rust
let scanner_ref =
    unsafe { &*(&self.state as *const std::sync::Arc<tokio::sync::Mutex<ScannerState>>) };
let lock = TOKIO_RUNTIME.block_on(async { scanner_ref.lock().await });
let scanner = lock.kind.as_batch()?;
```
Same as in `poll()`: please remove the `unsafe` cast used to get `scanner_ref`. Lock `self.state` directly; keeping this `unsafe` here makes the method harder to reason about and can hide real lifetime/aliasing issues.
Suggested change:
```diff
-let scanner_ref =
-    unsafe { &*(&self.state as *const std::sync::Arc<tokio::sync::Mutex<ScannerState>>) };
-let lock = TOKIO_RUNTIME.block_on(async { scanner_ref.lock().await });
-let scanner = lock.kind.as_batch()?;
+let lock = TOKIO_RUNTIME.block_on(async { self.state.lock().await });
+let scanner = lock.kind.as_batch()?;
```
bindings/python/src/table.rs
Outdated
```rust
let scanner_ref =
    unsafe { &*(&self.state as *const std::sync::Arc<tokio::sync::Mutex<ScannerState>>) };
let lock = TOKIO_RUNTIME.block_on(async { scanner_ref.lock().await });
```
Remove the `unsafe` pointer cast when locking `self.state` in `poll_arrow()`. This can be expressed safely with `self.state.lock().await` (via `TOKIO_RUNTIME.block_on`) and avoids introducing UB hazards.
Suggested change:
```diff
-let scanner_ref =
-    unsafe { &*(&self.state as *const std::sync::Arc<tokio::sync::Mutex<ScannerState>>) };
-let lock = TOKIO_RUNTIME.block_on(async { scanner_ref.lock().await });
+let lock = TOKIO_RUNTIME.block_on(async { self.state.lock().await });
```
bindings/python/src/table.rs
Outdated
```rust
let scanner_ref = unsafe {
    &*(&self.state as *const std::sync::Arc<tokio::sync::Mutex<ScannerState>>)
};
```
`to_arrow()` also uses an `unsafe` cast to access `self.state`. This should be rewritten to safely clone/borrow `self.state` and lock it without `unsafe` to keep the bindings memory-safe.
Suggested change:
```diff
-let scanner_ref = unsafe {
-    &*(&self.state as *const std::sync::Arc<tokio::sync::Mutex<ScannerState>>)
-};
+let scanner_ref = &self.state;
```
bindings/python/src/table.rs
Outdated
```rust
let scanner_ref =
    unsafe { &*(&self.state as *const std::sync::Arc<tokio::sync::Mutex<ScannerState>>) };
```
`poll_until_offsets()` also relies on the `unsafe` cast to access `self.state`. This should be refactored to lock `self.state` safely; keeping `unsafe` here is especially risky because this method can run for a long time and is on a hot path for `to_arrow()`.
Suggested change:
```diff
-let scanner_ref =
-    unsafe { &*(&self.state as *const std::sync::Arc<tokio::sync::Mutex<ScannerState>>) };
+let scanner_ref = &self.state;
```
@qzyu999 Ty, took a look at the approach, have some ideas. PTAL

The scanner is already thread-safe internally (`&self` on all methods), so the `Mutex` isn't needed; it just adds locking to every call and forces 5 unsafe pointer casts to work around borrow issues it created. The `__anext__` loop is also problematic: it runs inside `tokio::spawn`, so breaking out of `async for` leaves it polling forever in the background.

Simpler idea: store the scanner in an `Arc` and keep the existing methods as-is. Add `_async_poll(timeout_ms)` that does one bounded poll and returns a list. `__aiter__` returns a small Python async generator that calls `_async_poll` and yields records. Breaking out of the loop stops the generator naturally, so no leaks, no `unsafe`, no mutex.
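The design sketched in this comment maps to roughly the following Python shape (`DummyScanner` is a stand-in for the Rust binding; only the `_async_poll` / `__aiter__` surface comes from the discussion):

```python
import asyncio

class DummyScanner:
    """Stand-in for the PyO3 LogScanner with the suggested surface."""
    def __init__(self, batches):
        self._batches = list(batches)

    async def _async_poll(self, timeout_ms=1000):
        # One bounded poll: returns a (possibly empty) list of records.
        await asyncio.sleep(0)  # yield to the event loop, like a real poll
        return self._batches.pop(0) if self._batches else []

    def __aiter__(self):
        async def _gen(scanner):
            while True:
                records = await scanner._async_poll()
                if not records:
                    break  # toy termination; a live stream would poll again
                for record in records:
                    yield record
        # Breaking out of `async for` simply drops this generator:
        # no background task is left polling.
        return _gen(self)

async def collect(scanner):
    return [r async for r in scanner]

print(asyncio.run(collect(DummyScanner([[1, 2], [3]]))))  # → [1, 2, 3]
```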
Hi @fresh-borzoni, thanks for the recommendations. I've taken a look and come up with the following changes, PTAL when available:
…o that they match when talking about Record vs Batch
Hi @fresh-borzoni, I just realized that the original issue #424 is also looking for
fresh-borzoni
left a comment
@qzyu999 TY, looked through, left comments, PTAL
bindings/python/src/table.rs
Outdated
```rust
    PyDeltaAccess, PyDict, PyList, PySequence, PySlice, PyTime, PyTimeAccess, PyTuple, PyType,
    PyTzInfo,
};
use pyo3::{Bound, IntoPyObjectExt, Py, PyAny, PyRef, PyRefMut, PyResult, Python};
```
Do we really need these? It seems to be a leftover and we only need 1-2 objects, maybe only `IntoPyObjectExt`.
Hi @fresh-borzoni, I see that you're correct, as the others are already imported via `use crate::*;` from `lib.rs`. I removed the extra imports in d619b13 and kept only `IntoPyObjectExt`.
```rust
    Ok(df)
}

fn __aiter__<'py>(slf: PyRef<'py, Self>) -> PyResult<Bound<'py, PyAny>> {
```
We need to add this method to the `.pyi` stubs.
Hi @fresh-borzoni, just added `__aiter__` to `__init__.pyi` here: 134e56b, along with `_async_poll` and `_async_poll_batches`.
I think it's better to leave `_async_poll` and `_async_poll_batches` out of the `.pyi` because these methods should ideally stay private implementation details.
Exposing `__aiter__` makes sense to signal to IDEs that we support `async for`, but we don't want to encourage users to call the other underscore methods directly.
Hi @fresh-borzoni, thanks for the clarification, removed those two entries in the `.pyi` file in db23dd6.
```python
    await admin.drop_table(table_path, ignore_if_not_exists=False)


async def test_async_iterator(connection, admin):
```
Can we reduce the number of tests?
- Keep tests that cover the `async for` happy path, break safety, and multiple poll cycles.
- Drop tests that call `_async_poll` / `_async_poll_batches` directly; they're internal methods, tested implicitly through `async for`.
- Drop tests that re-verify existing features (projection, pandas, metadata) through the async path; those features are already covered by sync tests.
Hi @fresh-borzoni, I've made the changes here: efbcb8c. Please let me know whether I correctly removed all the tests requested.
…ter__` and async polling methods.
…_batches directly - they're internal methods, tested implicitly through async for and tests that re-verify existing features (already covered by sync tests).
fresh-borzoni
left a comment
@qzyu999 Ty, nice work, only nit comments remain, PTAL
bindings/python/src/table.rs
Outdated
```rust
        "No buckets subscribed. Call subscribe(), subscribe_buckets(), subscribe_partition(), or subscribe_partition_buckets() first.",
    ));
}
subs.clone()
```
nit: the scoping block + `subs.clone()` were needed with the `Mutex`; they're not needed with `Arc` since all borrows are shared now.
Hi @fresh-borzoni, made the changes in 6ad8cab, tested locally and passing.
bindings/python/src/table.rs
Outdated
```rust
let gen_fn = BATCH_ASYNC_GEN_FN.get_or_init(py, || {
    let code = pyo3::ffi::c_str!(
        r#"
async def _async_batch_scan(scanner, timeout_ms=1000):
```
The two `__aiter__` branches are identical except for the poll method name. You can collapse them into a single `PyOnceLock` + generator that takes a callable, and dispatch by passing `_async_poll` or `_async_poll_batches`.
Hi @fresh-borzoni, made the changes in 3981fff, tested locally and passing.
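For reference, the collapse suggested above looks roughly like this in pure Python (the scanner class and its poll methods are stand-ins, not the actual binding):

```python
import asyncio

async def _async_scan(poll, timeout_ms=1000):
    # One generator serves both scanner kinds; the caller passes in
    # whichever bounded-poll coroutine function applies.
    while True:
        items = await poll(timeout_ms)
        if not items:
            break  # toy termination; a live stream would poll again
        for item in items:
            yield item

class Scanner:
    def __init__(self, records, batches):
        self._records, self._batches = list(records), list(batches)

    async def _async_poll(self, timeout_ms):
        return [self._records.pop(0)] if self._records else []

    async def _async_poll_batches(self, timeout_ms):
        return [self._batches.pop(0)] if self._batches else []

    def iter_records(self):
        return _async_scan(self._async_poll)           # record branch

    def iter_batches(self):
        return _async_scan(self._async_poll_batches)   # batch branch

async def drain(agen):
    return [x async for x in agen]

s = Scanner(["r1", "r2"], ["b1"])
print(asyncio.run(drain(s.iter_records())))  # → ['r1', 'r2']
print(asyncio.run(drain(s.iter_batches())))  # → ['b1']
```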
…re private implementation details
…d with Mutex no longer being used
…ngle PyOnceLock and generator that takes a callable
Thank you so much @fresh-borzoni! I really appreciate you taking the time to help review the PR.
FYI: we're cutting a release branch soon, hence we are taking extra time to review/merge. Just wanted contributors to stay informed about the longer turnaround.
Purpose

Linked issue: close #424

This pull request completes Issue #424 by enabling the native Python `async for` built-in over the PyO3-wrapped `LogScanner` stream instance.

Brief change log

Previously, PyFluss developers had to manually orchestrate `while True` polling loops over the network boundary using `scanner.poll(timeout)`. This PR refactors the Python `LogScanner` iterator logic by implementing async traversal natively via Rust `__anext__` polling bindings and a Python generator `__aiter__` adapter:

- Refactored `ScannerKind` internals into a safely buffered `Arc<tokio::sync::Mutex<ScannerState>>`. This guarantees thread-safety and fulfills Rust's lifetime constraints, enabling unboxed state transitions inside the `python_async_runtimes` tokio closure.
- The `.await` future yield sequence proceeds smoothly without blocking the event loop or hardware threads directly.
- To satisfy `inspect.isasyncgen()` compliance checks in strictly versioned Python 3.12+ engines (such as modern IPython/Jupyter servers), `__aiter__` dynamically generates a properly wrapped coroutine generator via `py.run()`. This masks the Python ecosystem's iterator type limitations out-of-the-box.

Tests

`test_log_table.py::test_async_iterator`: an integration test on a testcontainers setup confirming that `async for record in scanner` works end-to-end without pipeline interruptions, yielding thousands of appended records sequentially and matching the existing data frameworks.

API and Format

Yes, this extends the API to allow `async for` loops. Existing user logic relying on explicit `.poll_arrow()` calls or other existing functions is untouched.

Documentation

Yes, the updated integration tests act as live documentation demonstrating the capability.