fix: added check to get_admin() before creating new admin #369
toxicteddy00077 wants to merge 4 commits into `apache:main`
Conversation
@luoyuxia could you review this, and help me refine it further?
Pull request overview
This PR addresses issue #319 by adding lazy caching for the FlussAdmin instance inside FlussConnection so repeated get_admin() calls don’t create a brand-new admin client each time.
Changes:
- Added an `admin_client` cache (`RwLock<Option<Arc<FlussAdmin>>>`) to `FlussConnection`.
- Updated `get_admin()` to return a cached admin instance when available (otherwise initialize and store one).
- Made `FlussAdmin` clonable to preserve the existing `Result<FlussAdmin>` return type.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `crates/fluss/src/client/connection.rs` | Adds cached storage and a double-checked init path for `get_admin()` to avoid per-call construction. |
| `crates/fluss/src/client/admin.rs` | Adds `Clone` to `FlussAdmin` so the cached admin can be returned without changing the public signature. |
charlesdong1991 left a comment:
Nice PR! Left a couple of comments!
```diff
 // under the License.

-use crate::client::WriterClient;
+use crate::client::{WriterClient, admin};
```
`FlussAdmin` is already imported below; by adding `admin`, it can be reached in two ways. I think we should have a consistent way of importing modules.
Force-pushed 5293c9d to 7a122a6.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
crates/fluss/src/client/admin.rs (outdated)
```diff
 use std::sync::Arc;
 use tokio::task::JoinHandle;

+#[derive(Clone)]
```
Adding #[derive(Clone)] to the public FlussAdmin type expands the public API surface and implicitly documents that cloning is a supported operation. Since a clone will share the same underlying ServerConnection (it’s an Arc), this can be surprising if callers expect an independent/fresh connection. If cloning isn’t required for this PR, consider removing it; if it is required, consider documenting the clone semantics explicitly (or returning Arc<FlussAdmin> instead of relying on Clone).
```diff
+#[derive(Clone)]
```
I don't think we need `Clone` anymore, since we use `Arc` now.
Force-pushed 7a122a6 to f3c415e.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
```rust
let admin = self
    .admin_client
    .get_or_try_init(|| FlussAdmin::new(self.network_connects.clone(), self.metadata.clone()))
    .await?;
Ok(admin.clone())
```
Caching FlussAdmin in a tokio::sync::OnceCell makes the admin instance effectively permanent for the lifetime of the connection. If the coordinator changes (metadata refresh) or the underlying ServerConnection becomes poisoned, get_admin() will keep returning the same cached instance and callers may be unable to recover without constructing a new FlussConnection. Consider using a cache that supports invalidation/refresh (e.g., async lock around an Option<FlussAdmin>), or make FlussAdmin acquire/refresh its admin_gateway on demand when the current connection is poisoned or when requests return InvalidCoordinatorException.
Force-pushed 5c377cb to dac8f93.
@charlesdong1991 Please have a look at the latest commit and let me know the general direction I should take. Is using …
I think for the purpose of this ticket we can use `OnceCell`, and the direction is right. But I think we should avoid …
Thanks for working on this @toxicteddy00077! The direction is right; we want to cache the admin like Java does. But we need to fix one thing first: `FlussAdmin` currently stores a concrete `ServerConnection` at construction time, which means a cached admin would be stuck with a dead connection if the coordinator restarts. Java avoids this because its `FlussAdmin` resolves the coordinator per call via `GatewayClientProxy`. Here's how to do it:

```rust
// admin.rs:
async fn admin_gateway(&self) -> Result<ServerConnection> {
    let coordinator = self
        .metadata
        .get_cluster()
        .get_coordinator_server()
        .ok_or_else(|| Error::UnexpectedError {
            message: "Coordinator server not found in cluster metadata".to_string(),
            source: None,
        })?;
    self.rpc_client.get_connection(coordinator).await
}

// connection.rs: …
```

But if we can still break the signature, I would just go with `Arc` and a sync method, and no `Clone` derive with `Arc`.
Force-pushed dac8f93 to d75ef69.
Thank you very much @fresh-borzoni, I have understood and made the changes; it seems to test fine. Like you mentioned, I have just used …
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
```diff
 impl FlussAdmin {
-    pub async fn new(connections: Arc<RpcClient>, metadata: Arc<Metadata>) -> Result<Self> {
-        let admin_con = connections
-            .get_connection(metadata.get_cluster().get_coordinator_server().ok_or_else(
-                || Error::UnexpectedError {
-                    message: "Coordinator server not found in cluster metadata".to_string(),
-                    source: None,
-                },
-            )?)
-            .await?;
-
-        Ok(FlussAdmin {
-            admin_gateway: admin_con,
+    pub fn new(connections: Arc<RpcClient>, metadata: Arc<Metadata>) -> Self {
+        FlussAdmin {
             metadata,
             rpc_client: connections,
-        })
+        }
     }
 }
```
FlussAdmin::new changed from an async fn returning Result<Self> to a sync constructor returning Self. Since FlussAdmin is a public type, this is an API-breaking change for downstream crates and also changes when/where initialization errors surface (constructor can no longer fail).
Consider keeping the existing public constructor signature (or adding a new new_unchecked/from_parts constructor for internal use) so callers depending on fallible/async initialization don’t break.
```diff
 pub async fn get_admin(&self) -> Result<FlussAdmin> {
-    FlussAdmin::new(self.network_connects.clone(), self.metadata.clone()).await
+    // 1. Fast path: return cached instance if already initialized.
+    if let Some(admin) = self.admin_client.read().as_ref() {
+        return Ok(admin.clone());
+    }
+
+    // 2. Slow path: acquire write lock.
+    let mut admin_guard = self.admin_client.write();
+
+    // 3. Double-check: another thread may have initialized while we waited.
+    if let Some(admin) = admin_guard.as_ref() {
+        return Ok(admin.clone());
+    }
+
+    // 4. Initialize and cache.
+    let admin = FlussAdmin::new(self.network_connects.clone(), self.metadata.clone());
+    *admin_guard = Some(admin.clone());
+    Ok(admin)
```
get_admin() now always succeeds and caches a FlussAdmin without verifying that the coordinator exists / is reachable. This is a behavior change: previously get_admin() could fail early (e.g., missing coordinator in metadata), and the integration readiness check in tests/integration/utils.rs relies on get_admin().await.is_ok() as part of cluster readiness.
If get_admin() is intended to remain a readiness/validation point, consider performing a lightweight async validation (e.g., resolve coordinator + RpcClient::get_connection) before caching/returning. Since that introduces an await, prefer an async single-init primitive (tokio::sync::OnceCell / async mutex) rather than holding a parking_lot lock across an await.
PTAL, if some integration tests depend on `get_admin().await.is_ok()`, we need to fix that.
I think we call `admin.get_server_nodes()` in Python at least, so just check that other clients do this as well.
fresh-borzoni left a comment:
@toxicteddy00077 TY, LGTM overall, left comment, PTAL
Let's be consistent with `Arc`s. It will change the public API signature, but `Arc` auto-derefs to `FlussAdmin`, so it will only break on explicit type annotations, which is unlikely.
```diff
 network_connects: Arc<RpcClient>,
 args: Config,
 writer_client: RwLock<Option<Arc<WriterClient>>>,
+admin_client: RwLock<Option<FlussAdmin>>,
```
It's better to use `RwLock<Option<Arc<FlussAdmin>>>`.
I've done as you mentioned. It does break the bindings; should I correct those as well?
Force-pushed 7c1fa55 to 37a2c4d.
@toxicteddy00077 Looked through, left a comment. Also, the integration tests readiness check is still not fixed. PTAL
crates/fluss/src/client/admin.rs (outdated)
```diff
 use std::sync::Arc;
 use tokio::task::JoinHandle;

+#[derive(Clone)]
```
I don't think we need `Clone` anymore, since we use `Arc` now.
Force-pushed 37a2c4d to a312e02.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Force-pushed a312e02 to 44f8f47.
fresh-borzoni left a comment:
@toxicteddy00077 TY, left some comments PTAL
I don't think we need the `list_databases` check.
```diff
 }

-async fn create_partition(table_path: &TablePath, admin: &mut FlussAdmin, region: &str, zone: i64) {
+async fn create_partition(table_path: &TablePath, admin: &Arc<FlussAdmin>, region: &str, zone: i64) {
```
Why not `&FlussAdmin`?
Good catch, refactored
bindings/python/test/conftest.py (outdated)
```diff
 start = time.time()
 while time.time() - start < timeout:
     try:
+        async def _probe():
```
It's wasteful to redefine it on every loop iteration.
bindings/python/test/conftest.py (outdated)
```diff
 return subprocess.run(cmd, capture_output=True).returncode


+def _wait_for_coordinator_ready(host, port, timeout=60):
```
TBH, this is not necessary for Python; we have `_connect_with_retry`.
```diff
 let connection = cluster
     .get_fluss_connection_with_sasl(username, password)
     .await;
 if connection.get_admin().await.is_ok()
```
I think we should just remove this check and rely on `get_one_available_server()` later in the code; no need for a `list_databases` check anywhere.
Force-pushed 502b549 to 63d6da7.
fresh-borzoni left a comment:
@toxicteddy00077 Ty, LGTM
cc @luoyuxia
The current implementation assumes we don't want to break the existing API using `get_admin`, so it stays async/`Result` as it was before; the cleaner version would have been to just use `Arc`, of course.
Purpose
Linked issue: close #319
Description
I have followed a similar pattern to `get_or_create_writer_client`, with all necessary async checks. However, since the return type is still `Result<FlussAdmin>`, I return a clone for now. The correct way would be to use `Result<Arc<FlussAdmin>>`. Please let me know if I have overlooked some part or misunderstood the issue.