
feat(dapi): add background health check for DAPI nodes#3353

Open
lklimek wants to merge 19 commits into v3.1-dev from feat/sdk-health-check

Conversation

Contributor

@lklimek lklimek commented Mar 16, 2026

Issue being fixed or feature implemented

Many DAPI nodes on the network are broken, causing users to hit timeouts and retry errors before the SDK discovers they're dead. Currently, node health is purely reactive — nodes are only banned when actual user requests fail.

This PR adds a background health check system (enabled by default, with opt-out) with two behaviors:

  1. Startup probe — probes all nodes when the SDK is built, bans dead ones immediately so the first user request already has a clean list
  2. Ban-expiry re-probe — when a node's ban timer expires, re-probes it before making it available again; either unbans (healthy) or re-bans with escalated backoff (still dead)

No periodic interval probes. The existing reactive ban logic handles nodes that break during normal operations.

What was done?

rs-dapi-client changes:

  • address_list.rs — Added get_all_addresses(), get_next_ban_expiry(), get_expired_ban_addresses() methods and ban_count() getter on AddressStatus
  • dapi_client.rs — Added connection_pool() and settings() accessors on DapiClient
  • health_check.rs (new) — HealthCheckConfig struct and run_health_check() async function implementing:
    • Phase 1: Startup probe of all addresses using GetStatusRequest (lightest gRPC call)
    • Phase 2: Sleep-to-expiry watcher loop that re-probes nodes when bans expire
    • Concurrent probing via futures::stream::for_each_concurrent
    • Graceful cancellation via CancellationToken + tokio::select!
  • lib.rs — Module registration gated with #[cfg(not(target_arch = "wasm32"))]
  • Cargo.toml — Added tokio-util dependency for CancellationToken

dash-sdk changes:

  • sdk.rs — Added health_check and health_check_config fields to SdkBuilder, builder methods with_health_check() and with_health_check_config(), spawning logic in build(), _health_check_handle on Sdk, and Drop impl that aborts the task

Design decisions:

  • On by default, opt-out via SdkBuilder::with_health_check(false)
  • WASM excluded — entire module gated with #[cfg(not(target_arch = "wasm32"))]
  • Sleep-to-expiry model minimizes CPU/network overhead (no polling)
  • Reuses existing ban logic — AddressList::ban() handles exponential backoff automatically
  • Probe settings use ban_failed_address: false to avoid interfering with reactive banning
  • Known limitation: Small race window between ban expiry and re-probe completion where get_live_address() may return a previously-banned node (documented in code)

How Has This Been Tested?

Unit tests added:

address_list.rs (5 new tests):

  • test_get_all_addresses_returns_both_banned_and_unbanned
  • test_get_next_ban_expiry_returns_earliest
  • test_get_next_ban_expiry_none_when_no_bans
  • test_get_expired_ban_addresses
  • test_address_status_ban_count

health_check.rs (8 tests):

  • test_health_check_config_default — Config defaults match spec
  • test_cancel_token_stops_loop — Cancellation breaks the loop
  • test_probe_unreachable_node_returns_false — Probe with non-routable address
  • test_startup_probe_bans_unreachable — Startup probe bans dead nodes
  • test_startup_probe_empty_address_list — Empty list handled gracefully
  • test_startup_probe_bans_all_unreachable_nodes — All-dead scenario
  • test_probe_settings_do_not_ban_on_failure — Probe settings correctness
  • test_cancel_during_ban_watch_phase — Cancellation during Phase 2

Verification commands:

cargo test -p rs-dapi-client    # 103 passed
cargo test -p dash-sdk --lib    # 84 passed
cargo clippy -p rs-dapi-client -p dash-sdk -- -D warnings  # clean
cargo fmt --all --check         # clean

Breaking Changes

None. Health check is additive and backward-compatible. Existing SdkBuilder usage works unchanged (health check is enabled by default but doesn't alter existing behavior).

Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have added or updated relevant unit/integration/functional/e2e tests
  • I have added "!" to the title and described breaking changes in the corresponding section if my code contains any
  • I have made corresponding changes to the documentation if needed

🤖 Co-authored by Claudius the Magnificent AI Agent

Summary by CodeRabbit

  • New Features

    • Background health check for monitoring DAPI nodes (non‑WASM), configurable via SDK builder; start/stop and status controls exposed.
    • Address management APIs: list all addresses, list expired bans, get next ban expiry, and improved ban/unban behavior.
    • Read‑only accessors to view client connection pool and request settings.
  • Tests

    • Expanded tests covering health checks, ban/expiry behavior, address utilities, startup probing, and cancellation scenarios.

@coderabbitai
Contributor

coderabbitai bot commented Mar 16, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds a non‑WASM background health check subsystem (HealthCheckConfig + run_health_check), AddressList ban/expiry query APIs, DapiClient getters, a non‑WASM tokio-util Cargo dependency, and SDK wiring to start/stop health checks with cancellation support.

Changes

  • Dependency — packages/rs-dapi-client/Cargo.toml: Added tokio-util = "0.7.12" under [target.'cfg(not(target_arch = "wasm32"))'.dependencies].
  • Address list & status — packages/rs-dapi-client/src/address_list.rs: Added AddressStatus::ban_count(), AddressList::get_all_addresses(), get_next_ban_expiry(), get_expired_ban_addresses(); unban() now clears banned_until. Expanded tests for parsing, bans/expiry, and iteration.
  • Dapi client accessors — packages/rs-dapi-client/src/dapi_client.rs: Added read-only getters connection_pool(&self) -> &ConnectionPool and settings(&self) -> &RequestSettings.
  • Health check module — packages/rs-dapi-client/src/health_check.rs: New non‑WASM module providing HealthCheckConfig, run_health_check() async entrypoint, startup probing, batched concurrent probes, ban/unban logic, ban‑expiry watch loop, cancellation support, and tests.
  • Crate export — packages/rs-dapi-client/src/lib.rs: Added #[cfg(not(target_arch = "wasm32"))] mod health_check; and re-exported HealthCheckConfig at crate root under the same cfg.
  • SDK integration — packages/rs-sdk/src/sdk.rs: Added health‑check wiring: SdkBuilder gains health_check_config: Option<rs_dapi_client::HealthCheckConfig> (non‑WASM) and with_health_check_config(...); build spawns the health check when configured. Sdk now contains health_check_cancel: Arc<std::sync::Mutex<CancellationToken>>, Clone propagates it, and public APIs is_health_check_running() and stop_health_check() were added; start_health_check() added for non‑WASM.

Sequence Diagram

sequenceDiagram
    participant SDK as SDK Builder / Sdk
    participant HealthCheck as Health Check Task
    participant AddrList as AddressList
    participant Pool as ConnectionPool
    participant Transport as Transport/Node

    SDK->>HealthCheck: spawn run_health_check(config, cancel_token)
    activate HealthCheck

    HealthCheck->>AddrList: get_all_addresses()
    AddrList-->>HealthCheck: addresses

    par Startup probes
        HealthCheck->>Pool: acquire connections (batched)
        Pool->>Transport: probe_node (GetStatus)
        Transport-->>HealthCheck: success / failure
        HealthCheck->>AddrList: ban/unban addresses
    end

    loop Watch loop
        HealthCheck->>AddrList: get_next_ban_expiry()
        AddrList-->>HealthCheck: next_expiry or None
        HealthCheck->>HealthCheck: sleep until expiry or idle_period

        HealthCheck->>AddrList: get_expired_ban_addresses()
        AddrList-->>HealthCheck: expired_addresses

        par Re-probe expired
            HealthCheck->>Pool: probe_node for expired
            Transport-->>HealthCheck: success / failure
            HealthCheck->>AddrList: unban healthy / keep ban
        end
    end

    SDK-->>HealthCheck: CancellationToken (abort)
    HealthCheck-->>SDK: task stopped
    deactivate HealthCheck

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hopped through crates and stitched a watchful seam,
I probed each node in a concurrent dream,
Bans tick away while timers softly chime,
Tokens say stop and I nap for a time,
I unban the healthy and nibble on code-time.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Description Check — Passed: Check skipped - CodeRabbit's high-level summary is enabled.
  • Title check — Passed: The title 'feat(dapi): add background health check for DAPI nodes' accurately and concisely describes the main change in the changeset: introducing a background health check system for DAPI nodes.
  • Docstring Coverage — Passed: Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.


@github-actions github-actions bot added this to the v3.1.0 milestone Mar 16, 2026
@lklimek lklimek requested a review from Copilot March 16, 2026 21:48
Contributor

Copilot AI left a comment

Pull request overview

Adds an opt-in (default-on for non-WASM) background health check to proactively probe DAPI nodes at startup and re-probe nodes when bans expire, aiming to reduce user-facing timeouts by keeping the live node set cleaner.

Changes:

  • Introduces rs-dapi-client background health check module with startup probing + ban-expiry re-probing and associated config/tests.
  • Extends AddressList with helpers to enumerate all addresses and ban-expiry/expired-ban queries; adds ban_count() accessor.
  • Updates dash-sdk builder to configure/spawn the health check task, stores its JoinHandle, and aborts it on Sdk drop (non-WASM).

Reviewed changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated 4 comments.

Summary per file:

  • packages/rs-sdk/src/sdk.rs — Spawns/stores background health-check task from SdkBuilder, aborts task in Drop.
  • packages/rs-dapi-client/src/lib.rs — Exposes health_check module + HealthCheckConfig for non-WASM.
  • packages/rs-dapi-client/src/health_check.rs — New health check implementation + tests (startup probe + ban-expiry watcher).
  • packages/rs-dapi-client/src/dapi_client.rs — Adds accessors for connection pool and settings to support health check wiring.
  • packages/rs-dapi-client/src/address_list.rs — Adds address/ban helper methods + tests; adds ban_count() getter.
  • packages/rs-dapi-client/Cargo.toml — Adds tokio-util (non-WASM) for CancellationToken.
  • Cargo.lock — Locks tokio-util dependency.


@lklimek lklimek changed the title feat(rs-dapi-client): add background health check for DAPI nodes feat(dapi): add background health check for DAPI nodes Mar 17, 2026
@lklimek lklimek force-pushed the feat/sdk-health-check branch 2 times, most recently from 81d46ba to 0ee1996 Compare March 17, 2026 08:40
lklimek and others added 2 commits March 17, 2026 10:01
Introduces an opt-in health check system with two behaviors:
- Startup probe: probes all nodes when the SDK is built, bans dead ones
- Ban-expiry re-probe: re-probes nodes when their ban timer expires

Enabled by default, opt-out via SdkBuilder::with_health_check(false).
Gated behind #[cfg(not(target_arch = "wasm32"))].

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Refactor SdkBuilder to use Option<HealthCheckConfig> instead of
  separate bool + config fields (simpler API, None = disabled)
- Log banned/healthy count after startup probe completes
- Change no_ban_idle_period default to 59s (ban expiry - 1s)
- Pass CancellationToken through probe_and_update_batch for
  graceful shutdown during probing
- Clamp max_concurrent_probes to min 1 to prevent hang
- Guard tokio::spawn with Handle::try_current() to avoid panic
  when build() is called outside a Tokio runtime
- Clarify is_banned() doc comment about temporal vs state semantics

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@lklimek lklimek force-pushed the feat/sdk-health-check branch from 0ee1996 to fc732a9 Compare March 17, 2026 09:04
@lklimek lklimek marked this pull request as ready for review March 17, 2026 09:32
lklimek and others added 2 commits March 17, 2026 10:43
… debug log

Change probe_node to return Result<(), String> instead of bool so the
actual failure reason (transport error, probe timeout, etc.) propagates
to the caller and appears in the debug-level log message when a node
is banned.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… String

Use the proper error type and convert to string only at the display site
via tracing's Display formatting.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
packages/rs-dapi-client/src/dapi_client.rs (1)

159-166: Hide transport internals behind a client-level health-check API.

These accessors make ConnectionPool and raw request settings part of the public crate surface just to bootstrap the checker from packages/rs-sdk/src/sdk.rs. A DapiClient-level entry point would keep those internals swappable without another public compatibility promise.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/rs-dapi-client/src/dapi_client.rs` around lines 159 - 166, Remove
the public exposure of transport internals by deleting or making private the
connection_pool(&self) -> &ConnectionPool and settings(&self) ->
&RequestSettings accessors on DapiClient, and add a single public
DapiClient-level health API (e.g., pub fn transport_health(&self) ->
Result<HealthStatus, TransportError> or pub fn check_health(&self) ->
HealthStatus) that performs the necessary checks using the pool and settings
internally; then update the bootstrap checker in packages/rs-sdk/src/sdk.rs to
call this new DapiClient method instead of accessing ConnectionPool or
RequestSettings directly so transport internals remain swappable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/rs-dapi-client/src/health_check.rs`:
- Around line 95-126: The loop currently calls
address_list.get_next_ban_expiry() and treats None the same as “no bans”, which
delays re-probing even when address_list.get_expired_ban_addresses() is
non-empty; change the logic in the Phase 2 loop so that after computing
sleep_duration (or when get_next_ban_expiry() returns None) you first call
address_list.get_expired_ban_addresses() and if it returns non-empty,
immediately call probe_and_update_batch(&expired, &address_list, &pool,
&probe_settings, &config, &cancel_token).await (or set sleep_duration =
Duration::ZERO) before falling back to config.no_ban_idle_period, ensuring
expired bans are probed immediately and get_live_address() won’t hand out
recently-banned nodes prematurely.

In `@packages/rs-sdk/src/sdk.rs`:
- Around line 134-136: Change the raw optional JoinHandle ownership into an
Arc-backed guard so clones share the same health-check task and only abort it
when the last clone is dropped: introduce a small HealthCheckGuard type that
owns the tokio::task::JoinHandle<()> (and an AbortHandle/abort logic) and
implements Drop to abort the task when the guard is dropped; replace the Sdk
field _health_check_handle: Option<tokio::task::JoinHandle<()>> with
_health_check_guard: Option<Arc<HealthCheckGuard>> (or Option<Arc<dyn
DropGuard>>), update Sdk::clone to clone the Arc instead of taking/stripping the
handle, and remove the explicit abort in Sdk::drop so the task is only canceled
by the HealthCheckGuard::Drop when the last Arc reference is released.

---

Nitpick comments:
In `@packages/rs-dapi-client/src/dapi_client.rs`:
- Around line 159-166: Remove the public exposure of transport internals by
deleting or making private the connection_pool(&self) -> &ConnectionPool and
settings(&self) -> &RequestSettings accessors on DapiClient, and add a single
public DapiClient-level health API (e.g., pub fn transport_health(&self) ->
Result<HealthStatus, TransportError> or pub fn check_health(&self) ->
HealthStatus) that performs the necessary checks using the pool and settings
internally; then update the bootstrap checker in packages/rs-sdk/src/sdk.rs to
call this new DapiClient method instead of accessing ConnectionPool or
RequestSettings directly so transport internals remain swappable.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f31813b2-a3b9-4677-b418-a6bdf4e1e58f

📥 Commits

Reviewing files that changed from the base of the PR and between 95bdf2c and fc732a9.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (6)
  • packages/rs-dapi-client/Cargo.toml
  • packages/rs-dapi-client/src/address_list.rs
  • packages/rs-dapi-client/src/dapi_client.rs
  • packages/rs-dapi-client/src/health_check.rs
  • packages/rs-dapi-client/src/lib.rs
  • packages/rs-sdk/src/sdk.rs

@codecov

codecov bot commented Mar 17, 2026

Codecov Report

❌ Patch coverage is 86.82842% with 76 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.98%. Comparing base (94cefb3) to head (b076885).

Files with missing lines (patch % / lines missing):
  • packages/rs-sdk/src/sdk.rs — 52.63% (54 missing) ⚠️
  • packages/rs-dapi-client/src/health_check.rs — 94.59% (18 missing) ⚠️
  • packages/rs-dapi-client/src/address_list.rs — 98.36% (2 missing) ⚠️
  • packages/rs-dapi-client/src/dapi_client.rs — 75.00% (2 missing) ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           v3.1-dev    #3353      +/-   ##
============================================
+ Coverage     79.93%   79.98%   +0.05%     
============================================
  Files          2861     2862       +1     
  Lines        280236   281400    +1164     
============================================
+ Hits         223993   225069    +1076     
- Misses        56243    56331      +88     
Components coverage (Δ):
  • dpp — 71.82% <ø> (ø)
  • drive — 82.30% <ø> (ø)
  • drive-abci — 86.67% <ø> (ø)
  • sdk — 39.72% <52.63%> (+3.17%) ⬆️
  • dapi-client — 85.21% <95.24%> (+6.14%) ⬆️
  • platform-version — ∅ <ø> (∅)
  • platform-value — 83.36% <ø> (ø)
  • platform-wallet — 76.09% <ø> (ø)
  • drive-proof-verifier — 55.26% <ø> (ø)

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
packages/rs-dapi-client/src/health_check.rs (1)

95-126: ⚠️ Potential issue | 🟠 Major

Probe already-expired bans before falling back to the idle sleep.

get_next_ban_expiry() ignores bans that are already in the past, so if a re-probe batch runs longer than the gap to the next expiry, this loop comes back here, gets None, and sleeps no_ban_idle_period before touching those nodes. That stretches the documented “small race window” to a full idle period and lets get_live_address() hand those addresses out unchecked.

Suggested fix
     // Phase 2: Watch for ban expirations
     loop {
+        let expired = address_list.get_expired_ban_addresses();
+        if !expired.is_empty() {
+            tracing::debug!(
+                count = expired.len(),
+                "re-probing addresses with expired bans"
+            );
+            probe_and_update_batch(
+                &expired,
+                &address_list,
+                &pool,
+                &probe_settings,
+                &config,
+                &cancel_token,
+            )
+            .await;
+            continue;
+        }
+
         let sleep_duration = match address_list.get_next_ban_expiry() {
             Some(expiry) => {
                 let until = expiry - chrono::Utc::now();
                 until.to_std().unwrap_or(Duration::ZERO)
             }
             None => config.no_ban_idle_period,
         };
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/rs-dapi-client/src/health_check.rs` around lines 95 - 126, The loop
currently computes sleep_duration from get_next_ban_expiry() and may sleep the
full no_ban_idle_period even when bans have already expired; before picking the
sleep branch, call address_list.get_expired_ban_addresses() and if it returns a
non-empty Vec, immediately call probe_and_update_batch(...) (same args as the
existing branch) and continue the loop instead of sleeping. This ensures
already-expired bans are re-probed immediately; keep the existing tokio::select!
cancel handling and only fall back to sleeping when get_expired_ban_addresses()
is empty and get_next_ban_expiry() returns None.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/rs-dapi-client/src/health_check.rs`:
- Around line 145-161: The async probe can overwrite newer bans; fix by
snapshotting the address' ban state before calling probe_node and only apply
address_list.unban or address_list.ban if the current ban-generation/timestamp
equals that snapshot. Concretely: read a lightweight version/token (e.g.,
ban_generation or banned_until timestamp) from address_list into a local
variable (snapshot) immediately before calling probe_node(address, &pool,
&settings). After probe_node returns, re-check the current token from
address_list and only call address_list.unban(address) on Ok(()) if the token
still matches the snapshot, and only call address_list.ban(address) on Err(_) if
the token still matches the snapshot; this prevents stale probe results from
clearing or double-applying newer foreground bans.

---

Duplicate comments:
In `@packages/rs-dapi-client/src/health_check.rs`:
- Around line 95-126: The loop currently computes sleep_duration from
get_next_ban_expiry() and may sleep the full no_ban_idle_period even when bans
have already expired; before picking the sleep branch, call
address_list.get_expired_ban_addresses() and if it returns a non-empty Vec,
immediately call probe_and_update_batch(...) (same args as the existing branch)
and continue the loop instead of sleeping. This ensures already-expired bans are
re-probed immediately; keep the existing tokio::select! cancel handling and only
fall back to sleeping when get_expired_ban_addresses() is empty and
get_next_ban_expiry() returns None.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5251aea2-a142-4ae8-a7c6-05c597d07e8d

📥 Commits

Reviewing files that changed from the base of the PR and between fc732a9 and 4b313b9.

📒 Files selected for processing (1)
  • packages/rs-dapi-client/src/health_check.rs

/// require modifying `get_live_address()` to never auto-unban expired bans, which would break
/// the existing reactive ban behavior.
pub async fn run_health_check(
address_list: AddressList,
Collaborator

Consume pool and list?

Contributor Author

Address list is clone-friendly:

pub struct AddressList {
    addresses: Arc<RwLock<HashMap<Address, AddressStatus>>>,
    base_ban_period: Duration,
}

So we can clone and consume.

Contributor Author

Same about connection pool:

pub struct ConnectionPool {
    inner: Arc<Mutex<LruCache<String, PoolItem>>>,
}

impl Drop for Sdk {
fn drop(&mut self) {
// Abort the background health check task when the Sdk that owns the handle is dropped.
// Clones always have `None`, so only the original Sdk triggers this.
Collaborator

Why do clones have None?

Contributor Author

impl Clone for Sdk sets this to None, but I agree it's confusing and can lead to bugs. Claudius will refactor this to use Arc<> and a data structure that aborts automatically once the last reference is dropped.


// Extract clones needed for health check before dapi is moved
#[cfg(not(target_arch = "wasm32"))]
let hc_address_list = dapi.address_list().clone();
Collaborator

So it's an independent clone?

Contributor Author

As already mentioned, clone() is the recommended way of managing objects in the Sdk: they are lightweight and clone-friendly, using Arc<> internally. At least they should ;-).

lklimek and others added 2 commits March 17, 2026 12:09
…onToken

- Replace _health_check_handle with health_check_cancel CancellationToken
  (child of cancel_token), shared across all Sdk clones so dropping the
  original no longer kills the health check for clones
- Add stop_health_check() and is_health_check_running() to Sdk (not
  cfg-gated, works on all targets)
- Remove Drop impl for Sdk — the spawned task exits via token cancellation
- Check expired bans before idle sleep in Phase 2 (coderabbitai thread 11)
- Document accepted stale probe race condition (coderabbitai thread 13)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use ArcSwapOption<CancellationToken> for lock-free atomic swap of the
health check token. Some = running, None = stopped.

- stop_health_check(): swap(None) + cancel the old token
- start_health_check(config): stop existing, create fresh child token,
  spawn new task
- is_health_check_running(): load().is_some()

All Sdk clones share the same ArcSwapOption via Arc, so any clone can
start/stop the health check for all of them.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 819756a5-ac14-42ca-9802-d39480f1755d

📥 Commits

Reviewing files that changed from the base of the PR and between 4b313b9 and 4f15288.

📒 Files selected for processing (2)
  • packages/rs-dapi-client/src/health_check.rs
  • packages/rs-sdk/src/sdk.rs

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/rs-sdk/src/sdk.rs`:
- Around line 455-485: Race between concurrent calls to start_health_check can
leave an orphaned spawned task because stop_health_check and the subsequent
spawn/store are not atomic; fix by serializing the stop+spawn sequence. Add a
lightweight Mutex (e.g., health_check_mutex: tokio::sync::Mutex<()>) to the Sdk
struct and acquire it at the start of start_health_check (and likewise in
stop_health_check if it mutates the same state) so the sequence that calls
stop_health_check(), creates new_cancel, spawns the tokio task, and stores
health_check_cancel is executed under the lock; update start_health_check (and
stop_health_check) to lock/unlock around the critical section referencing
health_check_cancel, cancel_token, and the spawn logic to eliminate the race.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 08b28c98-6195-4c6d-be4d-294de92923d9

📥 Commits

Reviewing files that changed from the base of the PR and between 4f15288 and 0bb481e.

📒 Files selected for processing (1)
  • packages/rs-sdk/src/sdk.rs

lklimek and others added 2 commits March 17, 2026 12:37
…ancel

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Call sdk.start_health_check() instead of duplicating the spawn logic.
Removes pre-extracted hc_address_list/hc_pool/hc_ca_cert variables.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Contributor

@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (1)
packages/rs-sdk/src/sdk.rs (1)

467-495: ⚠️ Potential issue | 🟠 Major

Serialize start_health_check to prevent orphaned background tasks.

Line 468 performs stop and restart in separate steps without one shared lock. Concurrent calls can each spawn a task, but only the last token is retained, leaving earlier tasks unmanaged until SDK-wide cancellation.

🔧 Proposed fix
```diff
 #[cfg(not(target_arch = "wasm32"))]
 pub fn start_health_check(&self, config: rs_dapi_client::HealthCheckConfig) {
-    self.stop_health_check();
-
-    let new_cancel = self.cancel_token.child_token();
+    let mut cancel_guard = self
+        .health_check_cancel
+        .lock()
+        .expect("health_check_cancel mutex poisoned");
+    cancel_guard.cancel();

     if let SdkInstance::Dapi { dapi, .. } = &self.inner {
+        let new_cancel = self.cancel_token.child_token();
         let hc_address_list = dapi.address_list().clone();
         let hc_pool = dapi.connection_pool().clone();
         let hc_ca_cert = dapi.ca_certificate.clone();
         let hc_token = new_cancel.clone();

         if let Ok(runtime) = tokio::runtime::Handle::try_current() {
-            *self
-                .health_check_cancel
-                .lock()
-                .expect("health_check_cancel mutex poisoned") = new_cancel;
+            *cancel_guard = new_cancel;
+            drop(cancel_guard);
             drop(
                 runtime.spawn(rs_dapi_client::health_check::run_health_check(
                     hc_address_list,
                     hc_pool,
                     config,
                     hc_token,
                     hc_ca_cert,
                 )),
             );
         } else {
             tracing::warn!("no Tokio runtime found, cannot start health check");
         }
     }
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/rs-sdk/src/sdk.rs` around lines 467 - 495, Concurrent calls to
start_health_check can race with stop_health_check and each spawn a background
task whose cancel token gets overwritten; serialize start/stop by holding the
same mutex around the stop + new token assignment + spawn so only one caller can
perform the sequence. In practice, acquire the health_check_cancel mutex at the
start of start_health_check, call stop_health_check (or perform its logic) while
still holding that lock, create and store new_cancel into health_check_cancel,
then spawn the health check task before releasing the lock; reference the
start_health_check method, stop_health_check helper, the health_check_cancel
Mutex and cancel_token.child_token to locate where to apply this change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5f4ab381-fecb-4cdd-a15c-c6a57f550795

📥 Commits

Reviewing files that changed from the base of the PR and between 0bb481e and 062d06d.

📒 Files selected for processing (1)
  • packages/rs-sdk/src/sdk.rs

#3371)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
@lklimek lklimek requested a review from shumkov March 17, 2026 12:23
Collaborator

@thepastaclaw thepastaclaw left a comment

Code Review

Well-structured background health check with good WASM/native portability via futures::select_biased!. Two significant lifecycle issues: a FusedFuture cancellation bug that can create zombie tasks when cancellation fires during active probing, and the health check silently fails to start in FFI (mobile) SDK paths because build() is called outside a Tokio runtime context. The WASM integration is functional but the PR description incorrectly claims WASM exclusion, and JavaScript has no API to stop the background task once started.

Reviewed commit: b70f39d

🔴 1 blocking | 🟡 5 suggestion(s) | 💬 1 nitpick(s)

1 additional finding(s) omitted (not in diff).

🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `packages/rs-dapi-client/src/health_check.rs`:
- [BLOCKING] lines 116-158: FusedFuture cancellation consumed inside `probe_and_update_batch` makes outer loop unkillable
  The `stop` future is fused once at line 102 and passed by `&mut` into `probe_and_update_batch`. If cancellation fires during probing, the `select_biased!` inside that function (line 190-195) resolves the `stop` branch — permanently terminating the `FusedFuture`. Back in `health_check_loop`, the outer `select_biased!` (line 147-153) skips the terminated `stop` branch and only waits on `sleep_fut`. The loop then continues indefinitely.

Consequences:
1. `stop_health_check()` during active probing creates a zombie task that runs until SDK shutdown
2. `is_health_check_running()` returns `false` (cancelled token) while the task is still running
3. A subsequent `start_health_check()` spawns a second concurrent health check

Fix: check `stop.is_terminated()` after `probe_and_update_batch` returns and break if true.
- [SUGGESTION] lines 187-190: Health check re-probe systematically escalates ban duration with no cap
  Each failed health check re-probe calls `address_list.ban(address)` (line 189), which increments `ban_count` and applies `base_period * e^(ban_count - 1)`. Unlike foreground bans which only trigger on user requests, the health check drives this escalation deterministically on every sleep-to-expiry cycle. After 7 cycles: ~66 min ban. After 10: ~6 hours. After 12: ~112 days.

A node temporarily down for maintenance (30 min) may get escalated to hours-long bans before it can prove itself healthy. Consider either capping the health-check ban tier, not escalating `ban_count` when re-banning an already-banned node, or adding a max ban duration.
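The escalation formula quoted above (`base_period * e^(ban_count - 1)`) and one of the suggested mitigations (a max ban duration) can be sketched in std-only Rust. The function names and the 60-second base period are illustrative, not the actual rs-dapi-client API:

```rust
use std::time::Duration;

/// Illustrative version of the escalating ban schedule described in the
/// review: duration = base_period * e^(ban_count - 1).
fn ban_duration(base_period: Duration, ban_count: u32) -> Duration {
    let factor = f64::from(ban_count.saturating_sub(1)).exp();
    base_period.mul_f64(factor)
}

/// One suggested mitigation: clamp the escalated duration to a maximum,
/// so deterministic re-probing cannot drive bans to absurd lengths.
fn capped_ban_duration(base: Duration, ban_count: u32, max: Duration) -> Duration {
    ban_duration(base, ban_count).min(max)
}

fn main() {
    let base = Duration::from_secs(60);
    // First ban uses the base period; each re-ban grows by a factor of e.
    assert_eq!(ban_duration(base, 1), base);
    assert!(ban_duration(base, 5) > ban_duration(base, 4));
    // The capped variant never exceeds the maximum, however many cycles pass.
    let max = Duration::from_secs(30 * 60);
    assert!(capped_ban_duration(base, 12, max) <= max);
}
```

The cap keeps the existing backoff semantics for the first few failures while bounding the worst case for nodes the health check re-probes on every cycle.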

In `packages/rs-sdk/src/sdk.rs`:
- [SUGGESTION] lines 481-500: Health check silently fails to start in FFI (mobile) SDK paths
  `start_health_check()` stores the new `CancellationToken` in the mutex (line 481-484) *before* checking `tokio::runtime::Handle::try_current()` (line 487). When `build()` is called outside an entered Tokio runtime — which is exactly what happens in the FFI layer at `packages/rs-sdk-ffi/src/sdk.rs:160` where `init_or_get_runtime()` creates a runtime but never calls `runtime.enter()` before `builder.build()` — `try_current()` fails, the task is never spawned, but `is_health_check_running()` returns `true` because the stored token isn't cancelled.

This means all mobile SDK users (iOS/Android via FFI) silently get no health checking despite it being enabled by default, with an incorrect status report.
- [SUGGESTION] lines 466-514: Non-atomic `start_health_check` can orphan tasks under concurrent calls
  The method acquires the mutex twice: once inside `stop_health_check()` and once to store the new token (line 481-484). Between these acquisitions, a concurrent `start_health_check()` from another `Sdk` clone can interleave, causing one spawned task's cancellation token to be overwritten. That task becomes unreachable via `stop_health_check()` — only SDK-level shutdown can stop it.

Fix: hold the mutex for the entire start operation (stop → create token → store → spawn) instead of separate lock/unlock cycles.
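The fix described here — holding one mutex across the whole stop, store, spawn sequence — can be modeled with a std-only sketch. `FakeToken` and `Service` are illustrative stand-ins for `CancellationToken` and the `Sdk`, not the real types; the point is that serialized starters leave exactly one token live:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Illustrative stand-in for a cancellation token.
#[derive(Clone, Default)]
struct FakeToken(Arc<Mutex<bool>>);

impl FakeToken {
    fn cancel(&self) { *self.0.lock().unwrap() = true; }
    fn is_cancelled(&self) -> bool { *self.0.lock().unwrap() }
}

struct Service {
    // One mutex guards the whole stop -> store -> (spawn) sequence.
    current: Mutex<FakeToken>,
}

impl Service {
    fn start(&self) -> FakeToken {
        let mut guard = self.current.lock().unwrap();
        guard.cancel(); // stop the previously stored task's token
        let fresh = FakeToken::default();
        *guard = fresh.clone(); // store before releasing the lock
        // (the real code would spawn the background task here, still
        // under the same guard, so no other starter can interleave)
        fresh
    }
}

fn main() {
    let svc = Arc::new(Service { current: Mutex::new(FakeToken::default()) });
    let handles: Vec<_> = (0..8)
        .map(|_| {
            let svc = Arc::clone(&svc);
            thread::spawn(move || svc.start())
        })
        .collect();
    let tokens: Vec<FakeToken> = handles.into_iter().map(|h| h.join().unwrap()).collect();
    // Every token except the last-stored one has been cancelled, so no
    // spawned task can be orphaned with an unreachable token.
    let live = tokens.iter().filter(|t| !t.is_cancelled()).count();
    assert_eq!(live, 1);
}
```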
- [SUGGESTION] lines 488-497: Dropped `JoinHandle` silently swallows panics from health check task
  The `JoinHandle` from `runtime.spawn(...)` is explicitly dropped (line 489-497). If the health check task panics (e.g., from a poisoned RwLock), the panic is silently lost and `is_health_check_running()` remains stale. For a long-running background task, consider either spawning a monitoring wrapper that logs `JoinError::is_panic()`, or storing an `AbortHandle` for cleanup.

In `packages/wasm-sdk/src/sdk.rs`:
- [SUGGESTION] lines 329-347: WASM SDK runs full health check loop but exposes no lifecycle controls to JavaScript
  Despite the PR description claiming 'WASM excluded — entire module gated with `#[cfg(not(target_arch = "wasm32"))]`', the health_check module has no cfg gate (line 11 of lib.rs). The full probing loop runs on WASM via `wasm_bindgen_futures::spawn_local()` (sdk.rs:503-511) using `gloo_timers` for sleep.

This is actually fine — the code is correctly portable. But two issues:
1. The PR description should be updated to reflect that health checks ARE fully functional on WASM
2. `WasmSdk` doesn't expose `shutdown()` or `stop_health_check()` via `#[wasm_bindgen]`, so JavaScript can start a perpetual background task with no way to cancel it

FIX 1: Break out of health_check_loop when cancellation is consumed
inside probe_and_update_batch. The FusedFuture becomes terminated,
causing the outer select_biased! to skip the stop branch forever.

FIX 2: Only store the new CancellationToken into the mutex after
confirming the tokio runtime exists and the task will be spawned.
Previously, is_health_check_running() returned true in FFI paths
where no task was actually running.

FIX 3: Don't re-ban already-banned nodes during health check
re-probes. Each ban() call increments ban_count with exponential
backoff; deterministic re-probing drove ban duration to absurd
levels (112 days after 12 cycles).

FIX 4: Hold the mutex for the entire start_health_check operation
to prevent concurrent calls from orphaning tasks. The previous
implementation released and re-acquired the mutex, allowing
interleaving.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@lklimek lklimek added the claudius-review Trigger automated review label Mar 18, 2026
Collaborator

@thepastaclaw thepastaclaw left a comment

Code Review

Two real issues confirmed. The ban-escalation guard at health_check.rs:202 creates a Phase 2 regression where failed re-probes never re-ban expired nodes, leaving dead nodes live for ~59s per cycle. The missing Drop impl on Sdk means health-check tasks leak if dropped without calling shutdown(). All 6 prior findings are resolved.

Reviewed commit: aec5665

🔴 1 blocking | 🟡 2 suggestion(s)

🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `packages/rs-dapi-client/src/health_check.rs`:
- [BLOCKING] lines 196-207: Failed Phase 2 re-probes never re-ban expired nodes, leaving dead nodes live for ~59s per cycle
  The `!is_banned(address)` guard at line 202 was added to prevent ban-escalation from deterministic re-probing. But `is_banned()` checks `ban_count > 0` (address_list.rs:94-96), and ALL Phase 2 candidates have `ban_count > 0` by definition (they come from `get_expired_ban_addresses()` which requires `ban_count > 0` at line 305). So `!is_banned()` is ALWAYS false for Phase 2 failures — `ban()` is never called.

Consequence: after a failed re-probe, `banned_until` stays at the old expired timestamp. `get_live_address()` (address_list.rs:231-238) checks `banned_until < now` — since the ban is expired, the dead node passes the filter and is returned to foreground traffic. `get_next_ban_expiry()` returns `None` (no future bans), so the loop sleeps for `no_ban_idle_period` (59s). The dead node is live for the entire sleep window.

Suggested fix: replace the guard with logic that refreshes the ban without escalating. For example, `unban()` + `ban()` resets `ban_count` to 0 then sets it to 1 with a base-period ban duration. Alternatively, add an `AddressStatus::refresh_ban()` method that resets `banned_until` to `now + base_period` without touching `ban_count`.
- [SUGGESTION] lines 245-496: No test verifies that Phase 2 re-probe keeps persistently dead nodes banned
  The existing tests cover Phase 1 startup banning and cancellation, but no test exercises the Phase 2 re-probe-after-expiry path. A test with a short ban period that waits for expiry and then verifies the dead node is still banned (not returned by `get_live_address()`) would have caught finding #1.

In `packages/rs-sdk/src/sdk.rs`:
- [SUGGESTION] lines 469-525: Health-check task leaks when Sdk is dropped without calling shutdown()
  There is no `impl Drop for Sdk`. `CancellationToken::drop()` does NOT cancel the token (tokio-util semantics). If the Sdk is dropped without an explicit `shutdown()` call, the spawned health-check task continues as a zombie.

At the FFI boundary this is concrete: `dash_sdk_destroy` (rs-sdk-ffi/src/sdk.rs:521-524) does `Box::from_raw(handle)` which drops `SDKWrapper` → drops `Sdk`, but never calls `shutdown()`. Since `runtime` is `Arc<Runtime>`, the runtime survives and the orphaned task keeps running.

Mitigating factor: `health_check_cancel` is a child of `cancel_token`, so calling `shutdown()` does cancel it. The issue is only when `shutdown()` is skipped.

Options: (1) Add `impl Drop for Sdk { fn drop(&mut self) { self.cancel_token.cancel(); } }` — but since Sdk is Clone, this cancels on any clone's drop. (2) Call `sdk.shutdown()` in `dash_sdk_destroy` before dropping. (3) Document that `shutdown()` must be called.

Wrap the spawned run_health_check JoinHandle in a monitor task that logs
panics/failures and cancels the health-check token, so is_health_check_running()
reflects the actual task state instead of going stale on panic.

Also expand the no_ban_idle_period doc comment to warn that the value must
be less than the effective ban duration for the re-probe guarantee to hold.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Collaborator

@thepastaclaw thepastaclaw left a comment

Code Review

Small incremental addressing prior review feedback. Adds a panic-monitoring wrapper around the health check task and improves documentation for the no_ban_idle_period config field. Both changes are correct and well-implemented.

Reviewed commit: 5c7825d

… Drop impl

FIX 1: Phase 2 re-probes now correctly re-ban failed nodes. The old
`!is_banned()` guard was always false for Phase 2 candidates (they all
have ban_count > 0 from get_expired_ban_addresses), so ban() was never
called and dead nodes re-entered live rotation. Replace with unban()+ban()
to reset ban_count to 1 with a fresh base-period duration.

FIX 2: Add test_phase2_reprobe_failure_rebans_node that exercises the
Phase 2 re-probe-after-expiry path with a short ban period (200ms).

FIX 3: Add Arc<()>-based Drop impl for Sdk that cancels background tasks
only when the last clone is dropped, preventing both zombie health-check
tasks and premature cancellation when intermediate clones are dropped.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Collaborator

@thepastaclaw thepastaclaw left a comment

Code Review

Incremental push fixes the Phase 2 re-ban regression (dead nodes escaping ban after failed re-probe) and adds automatic background-task cleanup on last Sdk clone drop. Both fixes target real bugs. Two concurrency gaps exist: (1) the unban()+ban() two-step creates a brief window where an unhealthy node appears live, and (2) the Arc::strong_count check in Drop has a TOCTOU race that can cause missed cancellation on concurrent last-clone drops. Both are low-probability in practice but have clean fixes.

Reviewed commit: dd8a8eb

🟡 2 suggestion(s)

🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `packages/rs-dapi-client/src/health_check.rs`:
- [SUGGESTION] lines 202-208: `unban()`+`ban()` two-step creates window where dead node appears live
  `AddressList::unban()` and `AddressList::ban()` each acquire and release the `RwLock` independently. Between the two calls, a concurrent `get_live_address()`/`get_live_addresses()` call can observe `banned_until = None` and return the known-unhealthy node as a live candidate — potentially routing one request to a dead node before the ban is re-applied.

The window is very brief (sub-microsecond in the common single-threaded case), and the existing design already documents an accepted race at ban expiry (lines 62-66). However, this gap is strictly unnecessary and can be closed with an atomic `reset_ban()` method that holds a single write lock across both operations.

All 8 review agents independently flagged this issue.
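The suggested atomic `reset_ban()` can be sketched with std types: a single write lock covers both clearing the ban history and applying a fresh base-period ban, so no reader can observe the intermediate unbanned state. Field and method names here are illustrative, not the real `AddressList` API:

```rust
use std::collections::HashMap;
use std::sync::RwLock;
use std::time::{Duration, Instant};

struct Status {
    ban_count: u32,
    banned_until: Option<Instant>,
}

struct AddressList {
    inner: RwLock<HashMap<String, Status>>,
}

impl AddressList {
    /// Atomically reset ban history and re-apply a base-period ban.
    /// Holding one write lock across both steps closes the window in
    /// which a separate unban()+ban() pair would expose the node as live.
    fn reset_ban(&self, addr: &str, base_period: Duration) {
        let mut map = self.inner.write().unwrap(); // single write lock
        if let Some(status) = map.get_mut(addr) {
            status.ban_count = 1; // reset history, no escalation
            status.banned_until = Some(Instant::now() + base_period);
        } // nonexistent address: no-op
    }

    fn is_live(&self, addr: &str) -> bool {
        let map = self.inner.read().unwrap();
        match map.get(addr).and_then(|s| s.banned_until) {
            Some(until) => until <= Instant::now(), // expired ban = live
            None => true,
        }
    }
}

fn main() {
    let list = AddressList { inner: RwLock::new(HashMap::new()) };
    list.inner.write().unwrap().insert(
        "node-a".to_string(),
        Status { ban_count: 5, banned_until: Some(Instant::now()) },
    );
    list.reset_ban("node-a", Duration::from_secs(60));
    assert!(!list.is_live("node-a")); // still banned, fresh base period
    assert_eq!(list.inner.read().unwrap()["node-a"].ban_count, 1);
    list.reset_ban("node-x", Duration::from_secs(60)); // unknown address: no-op
}
```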

In `packages/rs-sdk/src/sdk.rs`:
- [SUGGESTION] lines 167-176: `Arc::strong_count` TOCTOU — consider `Arc::into_inner` for race-free last-owner detection
  If two threads drop the final two `Sdk` clones concurrently, both destructors observe `strong_count == 2` (the Arc isn't decremented until after `drop()` returns), so neither calls `cancel()`. Background tasks then leak until the tokio runtime shuts down.

The comment acknowledges this is "not perfectly atomic," but since MSRV is 1.92 and `Arc::into_inner` has been stable since Rust 1.70, a race-free alternative is available:

```rust
impl Drop for Sdk {
    fn drop(&mut self) {
        let shared = std::mem::replace(&mut self._shared_ref, Arc::new(()));
        if Arc::into_inner(shared).is_some() {
            self.cancel_token.cancel();
        }
    }
}
```

Arc::into_inner returns Some only when the calling thread holds the last strong reference, with no TOCTOU gap. The replacement Arc::new(()) is cheap (zero-sized payload, negligible cost).
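The `Arc::into_inner` guarantee this fix relies on can be verified with a small std-only program (unrelated to the SDK types): even under concurrent drops, exactly one caller observes itself as the last owner.

```rust
use std::sync::Arc;
use std::thread;

fn main() {
    let shared = Arc::new(());
    let clones: Vec<Arc<()>> = (0..8).map(|_| Arc::clone(&shared)).collect();
    drop(shared);

    // Each thread consumes one clone; Arc::into_inner returns Some only
    // for the thread that releases the final strong reference.
    let winners: usize = clones
        .into_iter()
        .map(|arc| thread::spawn(move || Arc::into_inner(arc).is_some()))
        .collect::<Vec<_>>()
        .into_iter()
        .map(|h| h.join().unwrap() as usize)
        .sum();

    // Exactly one "last owner", so exactly one cancel() would fire —
    // unlike the strong_count check, where concurrent drops can all miss.
    assert_eq!(winners, 1);
}
```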

Add AddressList::reset_ban() that holds a single write lock to atomically
reset ban history and apply a fresh base-period ban, closing the TOCTOU
window between separate unban()+ban() calls. Use it in health_check.rs.

Replace Arc::strong_count TOCTOU check in Sdk::drop with Arc::into_inner,
which is race-free and stable since Rust 1.70.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Collaborator

@thepastaclaw thepastaclaw left a comment

Code Review

This push cleanly addresses both prior review suggestions:

  1. TOCTOU race in unban()+ban() — replaced with a new atomic reset_ban() method that acquires a single write lock, resets ban_count to 1, and applies a fresh base-period ban. Two unit tests cover the happy path and the no-op case for nonexistent addresses.

  2. Arc::strong_count TOCTOU in Sdk::drop — replaced with Arc::into_inner, which is race-free by specification (returns Some only for the true last owner). The std::mem::replace swaps in Arc::new(()), whose zero-sized payload makes the cost negligible.

Both fixes are well-targeted, minimal, and correct. No new issues introduced.

Reviewed commit: 4154f16

@lklimek lklimek removed the claudius-review Trigger automated review label Mar 18, 2026
@lklimek
Contributor Author

lklimek commented Mar 18, 2026

Closing after discussion with @shumkov; we will fall back to TrustedHttpContextProvider.

@lklimek lklimek closed this Mar 18, 2026
@lklimek lklimek reopened this Mar 30, 2026
…heck

# Conflicts:
#	packages/rs-dapi-client/src/address_list.rs
@thepastaclaw
Collaborator

thepastaclaw commented Mar 30, 2026

✅ Review complete (commit b076885)

@github-actions
Contributor

✅ DashSDKFFI.xcframework built for this PR.

SwiftPM (host the zip at a stable URL, then use):

```swift
.binaryTarget(
  name: "DashSDKFFI",
  url: "https://your.cdn.example/DashSDKFFI.xcframework.zip",
  checksum: "5143f7afbefe2bc88e1a504ed2fa2e961ab1f673908a82ae3870067c482b5a79"
)
```

Xcode manual integration:

  • Download 'DashSDKFFI.xcframework' artifact from the run link above.
  • Drag it into your app target (Frameworks, Libraries & Embedded Content) and set Embed & Sign.
  • If using the Swift wrapper package, point its binaryTarget to the xcframework location or add the package and place the xcframework at the expected path.

Collaborator

@thepastaclaw thepastaclaw left a comment

Code Review

Well-designed health check system with clean async architecture and proper cancellation handling. The sleep-to-expiry model is efficient and the documented race window is an honest, acceptable tradeoff. Two issues worth addressing: the WASM build enables health checks by default but exposes no explicit shutdown API (relying on JS GC for cleanup), and the idle sleep can miss bans created mid-sleep, weakening the re-probe-before-reuse guarantee.

Reviewed commit: b076885

🟡 3 suggestion(s) | 💬 1 nitpick(s)

🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `packages/rs-dapi-client/src/health_check.rs`:
- [SUGGESTION] lines 42-46: Idle sleep can miss bans created mid-sleep, weakening re-probe-before-reuse guarantee
  When no bans exist, the watcher sleeps for `no_ban_idle_period` (default 59s). If a foreground request bans a node at t=58s for the default 60s (expiring at t=118s), the watcher wakes at t=59s, sees the next expiry at t=118s, and sleeps for another 59s. The node's ban expires at t=118s and it re-enters `get_live_address()` rotation before the watcher re-probes it at ~t=118s.

The doc comment on `no_ban_idle_period` says it ensures the watcher wakes *before* bans expire, but this only holds for bans that already exist when the sleep begins. Bans created during the sleep are missed until the next wake cycle.

This isn't critical — the documented race window in `run_health_check()` already acknowledges that nodes can briefly re-enter rotation before re-probe, and the reactive ban logic handles nodes that fail on actual requests. But the doc comment on `no_ban_idle_period` overpromises. Consider either: (a) adjusting the doc comment to clarify the guarantee only applies to pre-existing bans, or (b) using a notify/channel from the ban path to interrupt the idle sleep.

In `packages/rs-sdk/src/sdk.rs`:
- [SUGGESTION] lines 543-553: WASM health check has no JS-visible shutdown path — relies on GC for cleanup
  On WASM, `SdkBuilder` enables the health check by default and `start_health_check()` spawns it via `spawn_local()`. The spawned task is only stopped when the `CancellationToken` is cancelled — which happens in `Sdk::shutdown()` or on `Drop`. But `WasmSdk` doesn't export `shutdown()` or `stop_health_check()` to JS, so the only cleanup path is the wasm-bindgen-generated `free()` method, which JS callers aren't required to invoke (it depends on GC finalization).

A forgotten `WasmSdk` instance can leave a permanent background probe loop alive for the lifetime of the page/worker. Additionally, if `run_health_check()` panics on WASM, the cancel token is never cancelled (no monitor task like native has), leaving `is_health_check_running()` permanently stale.

Consider either: (a) exporting a `shutdown()` or `stop_health_check()` method on `WasmSdkBuilder`/`WasmSdk`, or (b) defaulting health checks to off on WASM targets.
- [SUGGESTION] lines 511-538: Health check silently disabled when SDK is built outside a Tokio runtime
  `SdkBuilder::default()` enables health checks, but `start_health_check()` on native only spawns the task if `tokio::runtime::Handle::try_current()` succeeds. If the SDK is constructed in synchronous setup code (before entering the Tokio runtime), the health check is silently skipped with just a `tracing::warn!`. The builder advertises health checks as on-by-default, but whether they actually start depends on call-site context.

The code already handles this by not storing the cancel token when no runtime is found (so `is_health_check_running()` returns false), which is correct. But the silent skip could surprise users. Consider either returning an error from `build()` when health checks are enabled but no runtime is available, or documenting this behavior prominently.

Comment on lines +42 to +46
```rust
    /// period shorter than this value, nodes may re-enter rotation before the health check
    /// can re-probe them. Adjust accordingly.
    pub no_ban_idle_period: Duration,
}
```

Collaborator

🟡 Suggestion: Idle sleep can miss bans created mid-sleep, weakening re-probe-before-reuse guarantee

When no bans exist, the watcher sleeps for no_ban_idle_period (default 59s). If a foreground request bans a node at t=58s for the default 60s (expiring at t=118s), the watcher wakes at t=59s, sees the next expiry at t=118s, and sleeps for another 59s. The node's ban expires at t=118s and it re-enters get_live_address() rotation before the watcher re-probes it at ~t=118s.

The doc comment on no_ban_idle_period says it ensures the watcher wakes before bans expire, but this only holds for bans that already exist when the sleep begins. Bans created during the sleep are missed until the next wake cycle.

This isn't critical — the documented race window in run_health_check() already acknowledges that nodes can briefly re-enter rotation before re-probe, and the reactive ban logic handles nodes that fail on actual requests. But the doc comment on no_ban_idle_period overpromises. Consider either: (a) adjusting the doc comment to clarify the guarantee only applies to pre-existing bans, or (b) using a notify/channel from the ban path to interrupt the idle sleep.

source: ['codex']
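Option (b) — letting the ban path interrupt the idle sleep — can be sketched with a std `Condvar`. All names are illustrative; the real code uses async timers, so a `tokio::sync::Notify` would be the closer analog, but the shape of the fix is the same: the watcher waits with a timeout and wakes early when a ban is recorded mid-sleep.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::{Duration, Instant};

fn main() {
    // Shared flag: "a ban happened since the watcher last looked".
    let pair = Arc::new((Mutex::new(false), Condvar::new()));
    let notifier = Arc::clone(&pair);

    // Foreground path: bans a node shortly after the idle wait begins.
    thread::spawn(move || {
        thread::sleep(Duration::from_millis(50));
        let (lock, cvar) = &*notifier;
        *lock.lock().unwrap() = true;
        cvar.notify_one(); // wake the watcher immediately
    });

    // Watcher: idle wait up to 59s, but interruptible by the ban path.
    let start = Instant::now();
    let (lock, cvar) = &*pair;
    let guard = lock.lock().unwrap();
    let (guard, timeout) = cvar
        .wait_timeout_while(guard, Duration::from_secs(59), |banned| !*banned)
        .unwrap();

    assert!(*guard); // woke because of the ban, not the timer
    assert!(!timeout.timed_out());
    assert!(start.elapsed() < Duration::from_secs(5));
}
```

With this shape, a ban created at t=58s wakes the watcher right away, so the re-probe can be scheduled before the ban's own expiry instead of a full idle period later.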


Comment on lines +543 to +553
```rust
{
    *cancel_guard = new_cancel;
    drop(cancel_guard);
    wasm_bindgen_futures::spawn_local(rs_dapi_client::health_check::run_health_check(
        hc_address_list,
        hc_pool,
        config,
        hc_token,
        hc_ca_cert,
    ));
}
```
Collaborator

🟡 Suggestion: WASM health check has no JS-visible shutdown path — relies on GC for cleanup

On WASM, SdkBuilder enables the health check by default and start_health_check() spawns it via spawn_local(). The spawned task is only stopped when the CancellationToken is cancelled — which happens in Sdk::shutdown() or on Drop. But WasmSdk doesn't export shutdown() or stop_health_check() to JS, so the only cleanup path is the wasm-bindgen-generated free() method, which JS callers aren't required to invoke (it depends on GC finalization).

A forgotten WasmSdk instance can leave a permanent background probe loop alive for the lifetime of the page/worker. Additionally, if run_health_check() panics on WASM, the cancel token is never cancelled (no monitor task like native has), leaving is_health_check_running() permanently stale.

Consider either: (a) exporting a shutdown() or stop_health_check() method on WasmSdkBuilder/WasmSdk, or (b) defaulting health checks to off on WASM targets.

source: ['codex']


Comment on lines +511 to +538
```rust
if let Ok(runtime) = tokio::runtime::Handle::try_current() {
    *cancel_guard = new_cancel;
    drop(cancel_guard);
    let hc_cancel_for_monitor = hc_token.clone();
    let handle = runtime.spawn(rs_dapi_client::health_check::run_health_check(
        hc_address_list,
        hc_pool,
        config,
        hc_token,
        hc_ca_cert,
    ));
    // Monitor task: logs panics and cancels the token so
    // is_health_check_running() reflects the actual state.
    drop(runtime.spawn(async move {
        if let Err(e) = handle.await {
            if e.is_panic() {
                tracing::error!("health check task panicked: {:?}", e);
            } else {
                tracing::warn!("health check task failed: {:?}", e);
            }
            hc_cancel_for_monitor.cancel();
        }
    }));
} else {
    // No runtime: don't store the token so is_health_check_running()
    // correctly returns false.
    drop(cancel_guard);
    tracing::warn!("no Tokio runtime found, cannot start health check");
```
Collaborator

🟡 Suggestion: Health check silently disabled when SDK is built outside a Tokio runtime

SdkBuilder::default() enables health checks, but start_health_check() on native only spawns the task if tokio::runtime::Handle::try_current() succeeds. If the SDK is constructed in synchronous setup code (before entering the Tokio runtime), the health check is silently skipped with just a tracing::warn!. The builder advertises health checks as on-by-default, but whether they actually start depends on call-site context.

The code already handles this by not storing the cancel token when no runtime is found (so is_health_check_running() returns false), which is correct. But the silent skip could surprise users. Consider either returning an error from build() when health checks are enabled but no runtime is available, or documenting this behavior prominently.

source: ['codex']


Comment on lines +488 to +498
pub fn start_health_check(&self, config: rs_dapi_client::HealthCheckConfig) {
let mut cancel_guard = self
.health_check_cancel
.lock()
.expect("health_check_cancel mutex poisoned");

// Stop existing health check (atomic: no mutex release between stop and start)
cancel_guard.cancel();

let new_cancel = self.cancel_token.child_token();


💬 Nitpick: Restarting health check doesn't await previous task completion

start_health_check() cancels the previous token and immediately spawns a new task without waiting for the old one to finish. In theory, the old task could still be inside probe_and_update_batch() when the new task starts its startup probe, causing concurrent ban/unban mutations on the shared AddressList.

In practice the impact is minimal: the old task will observe cancellation on its next select_biased! check and exit promptly, and ban/unban operations are individually atomic. But for correctness, consider storing the JoinHandle and awaiting it before spawning the replacement, or adding a generation counter to discard stale probe results.

source: ['codex']
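The JoinHandle option can be sketched with std threads; in the SDK it would be a tokio `JoinHandle` awaited after cancelling the old token, and all names here are illustrative:

```rust
use std::thread::{self, JoinHandle};

// Std-thread analogue of the suggested fix: keep the previous task's
// JoinHandle and join it before spawning the replacement, so two probe
// loops can never mutate the shared AddressList concurrently.
struct HealthCheck {
    handle: Option<JoinHandle<()>>,
}

impl HealthCheck {
    fn restart(&mut self) {
        // Wait for the old task to fully exit before starting a new one.
        if let Some(old) = self.handle.take() {
            let _ = old.join();
        }
        self.handle = Some(thread::spawn(|| {
            // ... startup probe + ban-expiry loop would run here ...
        }));
    }
}
```

Note the trade-off: joining makes `restart()` block for as long as the old task takes to observe cancellation, which is why the generation-counter alternative may be preferable if probes can be slow.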


@thepastaclaw thepastaclaw left a comment


Code Review

This PR’s health-check loop is close, but three core behaviors in the checked-out patch do not match the feature contract it introduces. The startup probe is still asynchronous relative to build(), the default-on feature silently never starts in common native construction flows, and failed re-probes flatten ban history instead of preserving the existing backoff semantics.

Reviewed commit: b076885

🔴 3 blocking

Comment on lines +1087 to +1088
if let Some(hc_config) = health_check_config {
sdk.start_health_check(hc_config);

🔴 Blocking: The advertised startup probe is still fire-and-forget, so the first request can race dead nodes

SdkBuilder::build() enables the feature by calling sdk.start_health_check(hc_config), but that method only spawns run_health_check() and returns immediately. Phase 1 probing then happens asynchronously inside run_health_check() (packages/rs-dapi-client/src/health_check.rs, lines 110-119), so build() can return before any address has been checked or banned. A caller that sends a request right after constructing the SDK can still select an unprobed dead node from the existing address list, which contradicts the PR’s promised startup behavior that the first user request already sees a cleaned list.

source: ['codex']
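One way to close this race, sketched here with std threads and a channel: the health-check task signals startup-probe completion and `build()` blocks on that signal before returning. In the async SDK this would be a `tokio::sync::oneshot` awaited inside `build()`; all names are illustrative:

```rust
use std::sync::mpsc;
use std::thread;

// Sketch: build() waits for phase 1 to finish before returning, so the
// first user request already sees a cleaned address list.
fn run_health_check(startup_done: mpsc::Sender<()>) {
    // Phase 1: probe all addresses and ban the dead ones ...
    let _ = startup_done.send(()); // signal: startup probe finished
    // Phase 2: ban-expiry watcher loop would continue here ...
}

fn build() -> Result<(), mpsc::RecvError> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || run_health_check(tx));
    // Block until phase 1 completes. A real build() would bound this
    // with a timeout so a hung probe cannot stall construction.
    rx.recv()
}
```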


Comment on lines +511 to +538
if let Ok(runtime) = tokio::runtime::Handle::try_current() {
*cancel_guard = new_cancel;
drop(cancel_guard);
let hc_cancel_for_monitor = hc_token.clone();
let handle = runtime.spawn(rs_dapi_client::health_check::run_health_check(
hc_address_list,
hc_pool,
config,
hc_token,
hc_ca_cert,
));
// Monitor task: logs panics and cancels the token so
// is_health_check_running() reflects the actual state.
drop(runtime.spawn(async move {
if let Err(e) = handle.await {
if e.is_panic() {
tracing::error!("health check task panicked: {:?}", e);
} else {
tracing::warn!("health check task failed: {:?}", e);
}
hc_cancel_for_monitor.cancel();
}
}));
} else {
// No runtime: don't store the token so is_health_check_running()
// correctly returns false.
drop(cancel_guard);
tracing::warn!("no Tokio runtime found, cannot start health check");

🔴 Blocking: Default-on health checks silently never start when the SDK is built outside an active Tokio runtime

The feature is configured as enabled by default, but start_health_check() only stores the token and spawns the task if tokio::runtime::Handle::try_current() succeeds. When SdkBuilder::build() is called from synchronous setup code, this branch only logs no Tokio runtime found, cannot start health check and returns, with no retry path once the caller later enters async code. That makes the same default configuration behave differently depending on where construction happens, and in a common native usage pattern the promised background health check is just absent.

source: ['codex']


Comment on lines +201 to +206
Err(error) => {
// reset_ban atomically resets ban history and applies a fresh
// base-period ban, preventing exponential backoff escalation
// while closing the unban+ban race window.
address_list.reset_ban(address);
tracing::debug!(%address, error = %error, "health check: node is unhealthy, re-banned");

🔴 Blocking: Failed re-probes flatten ban history instead of preserving the existing backoff semantics

On re-probe failure the loop calls address_list.reset_ban(address), and reset_ban() (packages/rs-dapi-client/src/address_list.rs, lines 172-176) forcibly sets ban_count = 1 with a fresh base-period ban. That means a node that stays dead across multiple expiry cycles is retried every base ban period forever instead of following the existing exponential backoff implemented by AddressStatus::ban(). The patch comment explicitly says this prevents escalation, but the PR description says ban-expiry re-probes should reuse the existing ban behavior, so the shipped semantics no longer match the feature contract and create constant background churn against permanently dead nodes.

source: ['codex']
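The escalating re-ban the review asks for can be sketched as follows. Field and constant names (`ban_count`, `BASE_BAN_PERIOD`) mirror the ones discussed above but the arithmetic is a hypothetical reconstruction, not the SDK's actual `AddressStatus::ban()`:

```rust
use std::time::Duration;

// Sketch: on each re-probe failure, increment ban_count and apply
// base * 2^(ban_count - 1), instead of resetting ban_count to 1.
const BASE_BAN_PERIOD: Duration = Duration::from_secs(60);

struct AddressStatus {
    ban_count: u32,
}

impl AddressStatus {
    // Returns the ban period for the next failure, doubling each time
    // (the shift is capped so it cannot overflow).
    fn ban(&mut self) -> Duration {
        self.ban_count += 1;
        BASE_BAN_PERIOD * (1u32 << (self.ban_count - 1).min(16))
    }
}
```

With this shape, a permanently dead node is re-probed at 60s, 120s, 240s, ... rather than every base period forever, which removes the constant background churn the finding describes.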
