
feat(dapi): add background health check for DAPI nodes#3353

Open
lklimek wants to merge 19 commits into v3.1-dev from feat/sdk-health-check

Conversation

Contributor

@lklimek lklimek commented Mar 16, 2026

Issue being fixed or feature implemented

Many DAPI nodes on the network are broken, causing users to hit timeouts and retry errors before the SDK discovers they're dead. Currently, node health is purely reactive — nodes are only banned when actual user requests fail.

This PR adds a background health check system (enabled by default, with opt-out) with two behaviors:

  1. Startup probe — probes all nodes when the SDK is built, bans dead ones immediately so the first user request already has a clean list
  2. Ban-expiry re-probe — when a node's ban timer expires, re-probes it before making it available again; either unbans (healthy) or re-bans with escalated backoff (still dead)

No periodic interval probes. The existing reactive ban logic handles nodes that break during normal operations.

What was done?

rs-dapi-client changes:

  • address_list.rs — Added get_all_addresses(), get_next_ban_expiry(), get_expired_ban_addresses() methods and ban_count() getter on AddressStatus
  • dapi_client.rs — Added connection_pool() and settings() accessors on DapiClient
  • health_check.rs (new) — HealthCheckConfig struct and run_health_check() async function implementing:
    • Phase 1: Startup probe of all addresses using GetStatusRequest (lightest gRPC call)
    • Phase 2: Sleep-to-expiry watcher loop that re-probes nodes when bans expire
    • Concurrent probing via futures::stream::for_each_concurrent
    • Graceful cancellation via CancellationToken + tokio::select!
  • lib.rs — Module registration gated with #[cfg(not(target_arch = "wasm32"))]
  • Cargo.toml — Added tokio-util dependency for CancellationToken

dash-sdk changes:

  • sdk.rs — Added health_check and health_check_config fields to SdkBuilder, builder methods with_health_check() and with_health_check_config(), spawning logic in build(), _health_check_handle on Sdk, and Drop impl that aborts the task

Design decisions:

  • On by default, opt-out via SdkBuilder::with_health_check(false)
  • WASM excluded — entire module gated with #[cfg(not(target_arch = "wasm32"))]
  • Sleep-to-expiry model minimizes CPU/network overhead (no polling)
  • Reuses existing ban logic — AddressList::ban() handles exponential backoff automatically
  • Probe settings use ban_failed_address: false to avoid interfering with reactive banning
  • Known limitation: Small race window between ban expiry and re-probe completion where get_live_address() may return a previously-banned node (documented in code)

How Has This Been Tested?

Unit tests added:

address_list.rs (5 new tests):

  • test_get_all_addresses_returns_both_banned_and_unbanned
  • test_get_next_ban_expiry_returns_earliest
  • test_get_next_ban_expiry_none_when_no_bans
  • test_get_expired_ban_addresses
  • test_address_status_ban_count

health_check.rs (8 tests):

  • test_health_check_config_default — Config defaults match spec
  • test_cancel_token_stops_loop — Cancellation breaks the loop
  • test_probe_unreachable_node_returns_false — Probe with non-routable address
  • test_startup_probe_bans_unreachable — Startup probe bans dead nodes
  • test_startup_probe_empty_address_list — Empty list handled gracefully
  • test_startup_probe_bans_all_unreachable_nodes — All-dead scenario
  • test_probe_settings_do_not_ban_on_failure — Probe settings correctness
  • test_cancel_during_ban_watch_phase — Cancellation during Phase 2

Verification commands:

cargo test -p rs-dapi-client    # 103 passed
cargo test -p dash-sdk --lib    # 84 passed
cargo clippy -p rs-dapi-client -p dash-sdk -- -D warnings  # clean
cargo fmt --all --check         # clean

Breaking Changes

None. Health check is additive and backward-compatible. Existing SdkBuilder usage works unchanged (health check is enabled by default but doesn't alter existing behavior).

Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have added or updated relevant unit/integration/functional/e2e tests
  • I have added "!" to the title and described breaking changes in the corresponding section if my code contains any
  • I have made corresponding changes to the documentation if needed

🤖 Co-authored by Claudius the Magnificent AI Agent

Summary by CodeRabbit

  • New Features

    • Background health check for monitoring DAPI nodes (non‑WASM), configurable via SDK builder; start/stop and status controls exposed.
    • Address management APIs: list all addresses, list expired bans, get next ban expiry, and improved ban/unban behavior.
    • Read‑only accessors to view client connection pool and request settings.
  • Tests

    • Expanded tests covering health checks, ban/expiry behavior, address utilities, startup probing, and cancellation scenarios.

@coderabbitai
Contributor

coderabbitai bot commented Mar 16, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds a non‑WASM background health check subsystem (HealthCheckConfig + run_health_check), AddressList ban/expiry query APIs, DapiClient getters, a non‑WASM tokio-util Cargo dependency, and SDK wiring to start/stop health checks with cancellation support.

Changes

  • Dependency — packages/rs-dapi-client/Cargo.toml: Added tokio-util = "0.7.12" under [target.'cfg(not(target_arch = "wasm32"))'.dependencies].
  • Address list & status — packages/rs-dapi-client/src/address_list.rs: Added AddressStatus::ban_count(), AddressList::get_all_addresses(), get_next_ban_expiry(), get_expired_ban_addresses(); unban() now clears banned_until. Expanded tests for parsing, bans/expiry, and iteration.
  • Dapi client accessors — packages/rs-dapi-client/src/dapi_client.rs: Added read-only getters connection_pool(&self) -> &ConnectionPool and settings(&self) -> &RequestSettings.
  • Health check module — packages/rs-dapi-client/src/health_check.rs: New non‑WASM module providing HealthCheckConfig, run_health_check() async entrypoint, startup probing, batched concurrent probes, ban/unban logic, ban‑expiry watch loop, cancellation support, and tests.
  • Crate export — packages/rs-dapi-client/src/lib.rs: Added #[cfg(not(target_arch = "wasm32"))] mod health_check; and re-exported HealthCheckConfig at crate root under the same cfg.
  • SDK integration — packages/rs-sdk/src/sdk.rs: Added health‑check wiring: SdkBuilder gains health_check_config: Option<rs_dapi_client::HealthCheckConfig> (non‑WASM) and with_health_check_config(...); build spawns the health check when configured. Sdk now contains health_check_cancel: Arc<std::sync::Mutex<CancellationToken>>, Clone propagates it, and public APIs is_health_check_running() and stop_health_check() were added; start_health_check() added for non‑WASM.

Sequence Diagram

sequenceDiagram
    participant SDK as SDK Builder / Sdk
    participant HealthCheck as Health Check Task
    participant AddrList as AddressList
    participant Pool as ConnectionPool
    participant Transport as Transport/Node

    SDK->>HealthCheck: spawn run_health_check(config, cancel_token)
    activate HealthCheck

    HealthCheck->>AddrList: get_all_addresses()
    AddrList-->>HealthCheck: addresses

    par Startup probes
        HealthCheck->>Pool: acquire connections (batched)
        Pool->>Transport: probe_node (GetStatus)
        Transport-->>HealthCheck: success / failure
        HealthCheck->>AddrList: ban/unban addresses
    end

    loop Watch loop
        HealthCheck->>AddrList: get_next_ban_expiry()
        AddrList-->>HealthCheck: next_expiry or None
        HealthCheck->>HealthCheck: sleep until expiry or idle_period

        HealthCheck->>AddrList: get_expired_ban_addresses()
        AddrList-->>HealthCheck: expired_addresses

        par Re-probe expired
            HealthCheck->>Pool: probe_node for expired
            Transport-->>HealthCheck: success / failure
            HealthCheck->>AddrList: unban healthy / keep ban
        end
    end

    SDK-->>HealthCheck: CancellationToken (abort)
    HealthCheck-->>SDK: task stopped
    deactivate HealthCheck

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hopped through crates and stitched a watchful seam,
I probed each node in a concurrent dream,
Bans tick away while timers softly chime,
Tokens say stop and I nap for a time,
I unban the healthy and nibble on code-time.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Description Check — Passed: Check skipped - CodeRabbit's high-level summary is enabled.
  • Title check — Passed: The title 'feat(dapi): add background health check for DAPI nodes' accurately and concisely describes the main change in the changeset: introducing a background health check system for DAPI nodes.
  • Docstring Coverage — Passed: Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.


@github-actions github-actions bot added this to the v3.1.0 milestone Mar 16, 2026
@lklimek lklimek requested a review from Copilot March 16, 2026 21:48
Contributor

Copilot AI left a comment

Pull request overview

Adds an opt-in (default-on for non-WASM) background health check to proactively probe DAPI nodes at startup and re-probe nodes when bans expire, aiming to reduce user-facing timeouts by keeping the live node set cleaner.

Changes:

  • Introduces rs-dapi-client background health check module with startup probing + ban-expiry re-probing and associated config/tests.
  • Extends AddressList with helpers to enumerate all addresses and ban-expiry/expired-ban queries; adds ban_count() accessor.
  • Updates dash-sdk builder to configure/spawn the health check task, stores its JoinHandle, and aborts it on Sdk drop (non-WASM).

Reviewed changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated 4 comments.

Summary per file:

  • packages/rs-sdk/src/sdk.rs — Spawns/stores background health-check task from SdkBuilder, aborts task in Drop.
  • packages/rs-dapi-client/src/lib.rs — Exposes health_check module + HealthCheckConfig for non-WASM.
  • packages/rs-dapi-client/src/health_check.rs — New health check implementation + tests (startup probe + ban-expiry watcher).
  • packages/rs-dapi-client/src/dapi_client.rs — Adds accessors for connection pool and settings to support health check wiring.
  • packages/rs-dapi-client/src/address_list.rs — Adds address/ban helper methods + tests; adds ban_count() getter.
  • packages/rs-dapi-client/Cargo.toml — Adds tokio-util (non-WASM) for CancellationToken.
  • Cargo.lock — Locks tokio-util dependency.


@lklimek lklimek changed the title feat(rs-dapi-client): add background health check for DAPI nodes feat(dapi): add background health check for DAPI nodes Mar 17, 2026
@lklimek lklimek force-pushed the feat/sdk-health-check branch 2 times, most recently from 81d46ba to 0ee1996 Compare March 17, 2026 08:40
lklimek and others added 2 commits March 17, 2026 10:01
Introduces an opt-in health check system with two behaviors:
- Startup probe: probes all nodes when the SDK is built, bans dead ones
- Ban-expiry re-probe: re-probes nodes when their ban timer expires

Enabled by default, opt-out via SdkBuilder::with_health_check(false).
Gated behind #[cfg(not(target_arch = "wasm32"))].

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Refactor SdkBuilder to use Option<HealthCheckConfig> instead of
  separate bool + config fields (simpler API, None = disabled)
- Log banned/healthy count after startup probe completes
- Change no_ban_idle_period default to 59s (ban expiry - 1s)
- Pass CancellationToken through probe_and_update_batch for
  graceful shutdown during probing
- Clamp max_concurrent_probes to min 1 to prevent hang
- Guard tokio::spawn with Handle::try_current() to avoid panic
  when build() is called outside a Tokio runtime
- Clarify is_banned() doc comment about temporal vs state semantics

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@lklimek lklimek force-pushed the feat/sdk-health-check branch from 0ee1996 to fc732a9 Compare March 17, 2026 09:04
@lklimek lklimek marked this pull request as ready for review March 17, 2026 09:32
lklimek and others added 2 commits March 17, 2026 10:43
… debug log

Change probe_node to return Result<(), String> instead of bool so the
actual failure reason (transport error, probe timeout, etc.) propagates
to the caller and appears in the debug-level log message when a node
is banned.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… String

Use the proper error type and convert to string only at the display site
via tracing's Display formatting.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
packages/rs-dapi-client/src/dapi_client.rs (1)

159-166: Hide transport internals behind a client-level health-check API.

These accessors make ConnectionPool and raw request settings part of the public crate surface just to bootstrap the checker from packages/rs-sdk/src/sdk.rs. A DapiClient-level entry point would keep those internals swappable without another public compatibility promise.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/rs-dapi-client/src/dapi_client.rs` around lines 159 - 166, Remove
the public exposure of transport internals by deleting or making private the
connection_pool(&self) -> &ConnectionPool and settings(&self) ->
&RequestSettings accessors on DapiClient, and add a single public
DapiClient-level health API (e.g., pub fn transport_health(&self) ->
Result<HealthStatus, TransportError> or pub fn check_health(&self) ->
HealthStatus) that performs the necessary checks using the pool and settings
internally; then update the bootstrap checker in packages/rs-sdk/src/sdk.rs to
call this new DapiClient method instead of accessing ConnectionPool or
RequestSettings directly so transport internals remain swappable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/rs-dapi-client/src/health_check.rs`:
- Around line 95-126: The loop currently calls
address_list.get_next_ban_expiry() and treats None the same as “no bans”, which
delays re-probing even when address_list.get_expired_ban_addresses() is
non-empty; change the logic in the Phase 2 loop so that after computing
sleep_duration (or when get_next_ban_expiry() returns None) you first call
address_list.get_expired_ban_addresses() and if it returns non-empty,
immediately call probe_and_update_batch(&expired, &address_list, &pool,
&probe_settings, &config, &cancel_token).await (or set sleep_duration =
Duration::ZERO) before falling back to config.no_ban_idle_period, ensuring
expired bans are probed immediately and get_live_address() won’t hand out
recently-banned nodes prematurely.

In `@packages/rs-sdk/src/sdk.rs`:
- Around line 134-136: Change the raw optional JoinHandle ownership into an
Arc-backed guard so clones share the same health-check task and only abort it
when the last clone is dropped: introduce a small HealthCheckGuard type that
owns the tokio::task::JoinHandle<()> (and an AbortHandle/abort logic) and
implements Drop to abort the task when the guard is dropped; replace the Sdk
field _health_check_handle: Option<tokio::task::JoinHandle<()>> with
_health_check_guard: Option<Arc<HealthCheckGuard>> (or Option<Arc<dyn
DropGuard>>), update Sdk::clone to clone the Arc instead of taking/stripping the
handle, and remove the explicit abort in Sdk::drop so the task is only canceled
by the HealthCheckGuard::Drop when the last Arc reference is released.

---

Nitpick comments:
In `@packages/rs-dapi-client/src/dapi_client.rs`:
- Around line 159-166: Remove the public exposure of transport internals by
deleting or making private the connection_pool(&self) -> &ConnectionPool and
settings(&self) -> &RequestSettings accessors on DapiClient, and add a single
public DapiClient-level health API (e.g., pub fn transport_health(&self) ->
Result<HealthStatus, TransportError> or pub fn check_health(&self) ->
HealthStatus) that performs the necessary checks using the pool and settings
internally; then update the bootstrap checker in packages/rs-sdk/src/sdk.rs to
call this new DapiClient method instead of accessing ConnectionPool or
RequestSettings directly so transport internals remain swappable.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f31813b2-a3b9-4677-b418-a6bdf4e1e58f

📥 Commits

Reviewing files that changed from the base of the PR and between 95bdf2c and fc732a9.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (6)
  • packages/rs-dapi-client/Cargo.toml
  • packages/rs-dapi-client/src/address_list.rs
  • packages/rs-dapi-client/src/dapi_client.rs
  • packages/rs-dapi-client/src/health_check.rs
  • packages/rs-dapi-client/src/lib.rs
  • packages/rs-sdk/src/sdk.rs

@codecov

codecov bot commented Mar 17, 2026

Codecov Report

❌ Patch coverage is 86.82842% with 76 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.98%. Comparing base (94cefb3) to head (b076885).

Files with missing lines (patch % / lines missing):
  • packages/rs-sdk/src/sdk.rs — 52.63% (54 missing) ⚠️
  • packages/rs-dapi-client/src/health_check.rs — 94.59% (18 missing) ⚠️
  • packages/rs-dapi-client/src/address_list.rs — 98.36% (2 missing) ⚠️
  • packages/rs-dapi-client/src/dapi_client.rs — 75.00% (2 missing) ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           v3.1-dev    #3353      +/-   ##
============================================
+ Coverage     79.93%   79.98%   +0.05%     
============================================
  Files          2861     2862       +1     
  Lines        280236   281400    +1164     
============================================
+ Hits         223993   225069    +1076     
- Misses        56243    56331      +88     
Components coverage (Δ):
  • dpp — 71.82% <ø> (ø)
  • drive — 82.30% <ø> (ø)
  • drive-abci — 86.67% <ø> (ø)
  • sdk — 39.72% <52.63%> (+3.17%) ⬆️
  • dapi-client — 85.21% <95.24%> (+6.14%) ⬆️
  • platform-version — ∅ <ø> (∅)
  • platform-value — 83.36% <ø> (ø)
  • platform-wallet — 76.09% <ø> (ø)
  • drive-proof-verifier — 55.26% <ø> (ø)

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
packages/rs-dapi-client/src/health_check.rs (1)

95-126: ⚠️ Potential issue | 🟠 Major

Probe already-expired bans before falling back to the idle sleep.

get_next_ban_expiry() ignores bans that are already in the past, so if a re-probe batch runs longer than the gap to the next expiry, this loop comes back here, gets None, and sleeps no_ban_idle_period before touching those nodes. That stretches the documented “small race window” to a full idle period and lets get_live_address() hand those addresses out unchecked.

Suggested fix
     // Phase 2: Watch for ban expirations
     loop {
+        let expired = address_list.get_expired_ban_addresses();
+        if !expired.is_empty() {
+            tracing::debug!(
+                count = expired.len(),
+                "re-probing addresses with expired bans"
+            );
+            probe_and_update_batch(
+                &expired,
+                &address_list,
+                &pool,
+                &probe_settings,
+                &config,
+                &cancel_token,
+            )
+            .await;
+            continue;
+        }
+
         let sleep_duration = match address_list.get_next_ban_expiry() {
             Some(expiry) => {
                 let until = expiry - chrono::Utc::now();
                 until.to_std().unwrap_or(Duration::ZERO)
             }
             None => config.no_ban_idle_period,
         };
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/rs-dapi-client/src/health_check.rs` around lines 95 - 126, The loop
currently computes sleep_duration from get_next_ban_expiry() and may sleep the
full no_ban_idle_period even when bans have already expired; before picking the
sleep branch, call address_list.get_expired_ban_addresses() and if it returns a
non-empty Vec, immediately call probe_and_update_batch(...) (same args as the
existing branch) and continue the loop instead of sleeping. This ensures
already-expired bans are re-probed immediately; keep the existing tokio::select!
cancel handling and only fall back to sleeping when get_expired_ban_addresses()
is empty and get_next_ban_expiry() returns None.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/rs-dapi-client/src/health_check.rs`:
- Around line 145-161: The async probe can overwrite newer bans; fix by
snapshotting the address' ban state before calling probe_node and only apply
address_list.unban or address_list.ban if the current ban-generation/timestamp
equals that snapshot. Concretely: read a lightweight version/token (e.g.,
ban_generation or banned_until timestamp) from address_list into a local
variable (snapshot) immediately before calling probe_node(address, &pool,
&settings). After probe_node returns, re-check the current token from
address_list and only call address_list.unban(address) on Ok(()) if the token
still matches the snapshot, and only call address_list.ban(address) on Err(_) if
the token still matches the snapshot; this prevents stale probe results from
clearing or double-applying newer foreground bans.

---

Duplicate comments:
In `@packages/rs-dapi-client/src/health_check.rs`:
- Around line 95-126: The loop currently computes sleep_duration from
get_next_ban_expiry() and may sleep the full no_ban_idle_period even when bans
have already expired; before picking the sleep branch, call
address_list.get_expired_ban_addresses() and if it returns a non-empty Vec,
immediately call probe_and_update_batch(...) (same args as the existing branch)
and continue the loop instead of sleeping. This ensures already-expired bans are
re-probed immediately; keep the existing tokio::select! cancel handling and only
fall back to sleeping when get_expired_ban_addresses() is empty and
get_next_ban_expiry() returns None.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5251aea2-a142-4ae8-a7c6-05c597d07e8d

📥 Commits

Reviewing files that changed from the base of the PR and between fc732a9 and 4b313b9.

📒 Files selected for processing (1)
  • packages/rs-dapi-client/src/health_check.rs

/// require modifying `get_live_address()` to never auto-unban expired bans, which would break
/// the existing reactive ban behavior.
pub async fn run_health_check(
address_list: AddressList,
Collaborator

Consume pool and list?

Contributor Author

Address list is clone-friendly:

pub struct AddressList {
    addresses: Arc<RwLock<HashMap<Address, AddressStatus>>>,
    base_ban_period: Duration,
}

So we can clone and consume.

Contributor Author

Same about connection pool:

pub struct ConnectionPool {
    inner: Arc<Mutex<LruCache<String, PoolItem>>>,
}

impl Drop for Sdk {
fn drop(&mut self) {
// Abort the background health check task when the Sdk that owns the handle is dropped.
// Clones always have `None`, so only the original Sdk triggers this.
Collaborator

Why do clones have None?

Contributor Author

impl Clone for Sdk sets this to None, but I agree it's confusing and can lead to bugs. Claudius will refactor this to use Arc<> and a data structure that aborts automatically once the last reference is dropped.


// Extract clones needed for health check before dapi is moved
#[cfg(not(target_arch = "wasm32"))]
let hc_address_list = dapi.address_list().clone();
Collaborator

So it's an independent clone?

Contributor Author

As already mentioned, clone() is the recommended way of managing objects in the Sdk: they are lightweight and clone-friendly, using Arc<> internally. At least they should ;-).

lklimek and others added 2 commits March 17, 2026 12:09
…onToken

- Replace _health_check_handle with health_check_cancel CancellationToken
  (child of cancel_token), shared across all Sdk clones so dropping the
  original no longer kills the health check for clones
- Add stop_health_check() and is_health_check_running() to Sdk (not
  cfg-gated, works on all targets)
- Remove Drop impl for Sdk — the spawned task exits via token cancellation
- Check expired bans before idle sleep in Phase 2 (coderabbitai thread 11)
- Document accepted stale probe race condition (coderabbitai thread 13)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use ArcSwapOption<CancellationToken> for lock-free atomic swap of the
health check token. Some = running, None = stopped.

- stop_health_check(): swap(None) + cancel the old token
- start_health_check(config): stop existing, create fresh child token,
  spawn new task
- is_health_check_running(): load().is_some()

All Sdk clones share the same ArcSwapOption via Arc, so any clone can
start/stop the health check for all of them.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 819756a5-ac14-42ca-9802-d39480f1755d

📥 Commits

Reviewing files that changed from the base of the PR and between 4b313b9 and 4f15288.

📒 Files selected for processing (2)
  • packages/rs-dapi-client/src/health_check.rs
  • packages/rs-sdk/src/sdk.rs

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/rs-sdk/src/sdk.rs`:
- Around line 455-485: Race between concurrent calls to start_health_check can
leave an orphaned spawned task because stop_health_check and the subsequent
spawn/store are not atomic; fix by serializing the stop+spawn sequence. Add a
lightweight Mutex (e.g., health_check_mutex: tokio::sync::Mutex<()>) to the Sdk
struct and acquire it at the start of start_health_check (and likewise in
stop_health_check if it mutates the same state) so the sequence that calls
stop_health_check(), creates new_cancel, spawns the tokio task, and stores
health_check_cancel is executed under the lock; update start_health_check (and
stop_health_check) to lock/unlock around the critical section referencing
health_check_cancel, cancel_token, and the spawn logic to eliminate the race.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 08b28c98-6195-4c6d-be4d-294de92923d9

📥 Commits

Reviewing files that changed from the base of the PR and between 4f15288 and 0bb481e.

📒 Files selected for processing (1)
  • packages/rs-sdk/src/sdk.rs

lklimek and others added 2 commits March 17, 2026 12:37
…ancel

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Call sdk.start_health_check() instead of duplicating the spawn logic.
Removes pre-extracted hc_address_list/hc_pool/hc_ca_cert variables.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Contributor

@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (1)
packages/rs-sdk/src/sdk.rs (1)

467-495: ⚠️ Potential issue | 🟠 Major

Serialize start_health_check to prevent orphaned background tasks.

Line 468 performs stop and restart in separate steps without one shared lock. Concurrent calls can each spawn a task, but only the last token is retained, leaving earlier tasks unmanaged until SDK-wide cancellation.

🔧 Proposed fix
```diff
 #[cfg(not(target_arch = "wasm32"))]
 pub fn start_health_check(&self, config: rs_dapi_client::HealthCheckConfig) {
-    self.stop_health_check();
-
-    let new_cancel = self.cancel_token.child_token();
+    let mut cancel_guard = self
+        .health_check_cancel
+        .lock()
+        .expect("health_check_cancel mutex poisoned");
+    cancel_guard.cancel();

     if let SdkInstance::Dapi { dapi, .. } = &self.inner {
+        let new_cancel = self.cancel_token.child_token();
         let hc_address_list = dapi.address_list().clone();
         let hc_pool = dapi.connection_pool().clone();
         let hc_ca_cert = dapi.ca_certificate.clone();
         let hc_token = new_cancel.clone();

         if let Ok(runtime) = tokio::runtime::Handle::try_current() {
-            *self
-                .health_check_cancel
-                .lock()
-                .expect("health_check_cancel mutex poisoned") = new_cancel;
+            *cancel_guard = new_cancel;
+            drop(cancel_guard);
             drop(
                 runtime.spawn(rs_dapi_client::health_check::run_health_check(
                     hc_address_list,
                     hc_pool,
                     config,
                     hc_token,
                     hc_ca_cert,
                 )),
             );
         } else {
             tracing::warn!("no Tokio runtime found, cannot start health check");
         }
     }
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/rs-sdk/src/sdk.rs` around lines 467 - 495, Concurrent calls to
start_health_check can race with stop_health_check and each spawn a background
task whose cancel token gets overwritten; serialize start/stop by holding the
same mutex around the stop + new token assignment + spawn so only one caller can
perform the sequence. In practice, acquire the health_check_cancel mutex at the
start of start_health_check, call stop_health_check (or perform its logic) while
still holding that lock, create and store new_cancel into health_check_cancel,
then spawn the health check task before releasing the lock; reference the
start_health_check method, stop_health_check helper, the health_check_cancel
Mutex and cancel_token.child_token to locate where to apply this change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5f4ab381-fecb-4cdd-a15c-c6a57f550795

📥 Commits

Reviewing files that changed from the base of the PR and between 0bb481e and 062d06d.

📒 Files selected for processing (1)
  • packages/rs-sdk/src/sdk.rs

#3371)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
@lklimek lklimek requested a review from shumkov March 17, 2026 12:23
Collaborator

@thepastaclaw thepastaclaw left a comment

Code Review

Well-structured background health check with good WASM/native portability via futures::select_biased!. Two significant lifecycle issues: a FusedFuture cancellation bug that can create zombie tasks when cancellation fires during active probing, and the health check silently fails to start in FFI (mobile) SDK paths because build() is called outside a Tokio runtime context. The WASM integration is functional but the PR description incorrectly claims WASM exclusion, and JavaScript has no API to stop the background task once started.

Reviewed commit: b70f39d

🔴 1 blocking | 🟡 5 suggestion(s) | 💬 1 nitpick(s)

1 additional finding(s) omitted (not in diff).

🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `packages/rs-dapi-client/src/health_check.rs`:
- [BLOCKING] lines 116-158: FusedFuture cancellation consumed inside `probe_and_update_batch` makes outer loop unkillable
  The `stop` future is fused once at line 102 and passed by `&mut` into `probe_and_update_batch`. If cancellation fires during probing, the `select_biased!` inside that function (line 190-195) resolves the `stop` branch — permanently terminating the `FusedFuture`. Back in `health_check_loop`, the outer `select_biased!` (line 147-153) skips the terminated `stop` branch and only waits on `sleep_fut`. The loop then continues indefinitely.

Consequences:
1. `stop_health_check()` during active probing creates a zombie task that runs until SDK shutdown
2. `is_health_check_running()` returns `false` (cancelled token) while the task is still running
3. A subsequent `start_health_check()` spawns a second concurrent health check

Fix: check `stop.is_terminated()` after `probe_and_update_batch` returns and break if true.
- [SUGGESTION] lines 187-190: Health check re-probe systematically escalates ban duration with no cap
  Each failed health check re-probe calls `address_list.ban(address)` (line 189), which increments `ban_count` and applies `base_period * e^(ban_count - 1)`. Unlike foreground bans which only trigger on user requests, the health check drives this escalation deterministically on every sleep-to-expiry cycle. After 7 cycles: ~66 min ban. After 10: ~6 hours. After 12: ~112 days.

A node temporarily down for maintenance (30 min) may get escalated to hours-long bans before it can prove itself healthy. Consider either capping the health-check ban tier, not escalating `ban_count` when re-banning an already-banned node, or adding a max ban duration.
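The escalation formula quoted above (`base_period * e^(ban_count - 1)`) and one of the suggested mitigations (a max ban duration) can be sketched in std-only Rust. The function names and the 60-second base period are illustrative, not the actual rs-dapi-client API:

```rust
use std::time::Duration;

/// Illustrative version of the escalating ban schedule described in the
/// review: duration = base_period * e^(ban_count - 1).
fn ban_duration(base_period: Duration, ban_count: u32) -> Duration {
    let factor = f64::from(ban_count.saturating_sub(1)).exp();
    base_period.mul_f64(factor)
}

/// One suggested mitigation: clamp the escalated duration to a maximum,
/// so deterministic re-probing cannot drive bans to absurd lengths.
fn capped_ban_duration(base: Duration, ban_count: u32, max: Duration) -> Duration {
    ban_duration(base, ban_count).min(max)
}

fn main() {
    let base = Duration::from_secs(60);
    // First ban uses the base period; each re-ban grows by a factor of e.
    assert_eq!(ban_duration(base, 1), base);
    assert!(ban_duration(base, 5) > ban_duration(base, 4));
    // The capped variant never exceeds the maximum, however many cycles pass.
    let max = Duration::from_secs(30 * 60);
    assert!(capped_ban_duration(base, 12, max) <= max);
}
```

The cap keeps the existing backoff semantics for the first few failures while bounding the worst case for nodes the health check re-probes on every cycle.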

In `packages/rs-sdk/src/sdk.rs`:
- [SUGGESTION] lines 481-500: Health check silently fails to start in FFI (mobile) SDK paths
  `start_health_check()` stores the new `CancellationToken` in the mutex (line 481-484) *before* checking `tokio::runtime::Handle::try_current()` (line 487). When `build()` is called outside an entered Tokio runtime — which is exactly what happens in the FFI layer at `packages/rs-sdk-ffi/src/sdk.rs:160` where `init_or_get_runtime()` creates a runtime but never calls `runtime.enter()` before `builder.build()` — `try_current()` fails, the task is never spawned, but `is_health_check_running()` returns `true` because the stored token isn't cancelled.

This means all mobile SDK users (iOS/Android via FFI) silently get no health checking despite it being enabled by default, with an incorrect status report.
- [SUGGESTION] lines 466-514: Non-atomic `start_health_check` can orphan tasks under concurrent calls
  The method acquires the mutex twice: once inside `stop_health_check()` and once to store the new token (line 481-484). Between these acquisitions, a concurrent `start_health_check()` from another `Sdk` clone can interleave, causing one spawned task's cancellation token to be overwritten. That task becomes unreachable via `stop_health_check()` — only SDK-level shutdown can stop it.

Fix: hold the mutex for the entire start operation (stop → create token → store → spawn) instead of separate lock/unlock cycles.
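The fix described here — holding one mutex across the whole stop, store, spawn sequence — can be modeled with a std-only sketch. `FakeToken` and `Service` are illustrative stand-ins for `CancellationToken` and the `Sdk`, not the real types; the point is that serialized starters leave exactly one token live:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Illustrative stand-in for a cancellation token.
#[derive(Clone, Default)]
struct FakeToken(Arc<Mutex<bool>>);

impl FakeToken {
    fn cancel(&self) { *self.0.lock().unwrap() = true; }
    fn is_cancelled(&self) -> bool { *self.0.lock().unwrap() }
}

struct Service {
    // One mutex guards the whole stop -> store -> (spawn) sequence.
    current: Mutex<FakeToken>,
}

impl Service {
    fn start(&self) -> FakeToken {
        let mut guard = self.current.lock().unwrap();
        guard.cancel(); // stop the previously stored task's token
        let fresh = FakeToken::default();
        *guard = fresh.clone(); // store before releasing the lock
        // (the real code would spawn the background task here, still
        // under the same guard, so no other starter can interleave)
        fresh
    }
}

fn main() {
    let svc = Arc::new(Service { current: Mutex::new(FakeToken::default()) });
    let handles: Vec<_> = (0..8)
        .map(|_| {
            let svc = Arc::clone(&svc);
            thread::spawn(move || svc.start())
        })
        .collect();
    let tokens: Vec<FakeToken> = handles.into_iter().map(|h| h.join().unwrap()).collect();
    // Every token except the last-stored one has been cancelled, so no
    // spawned task can be orphaned with an unreachable token.
    let live = tokens.iter().filter(|t| !t.is_cancelled()).count();
    assert_eq!(live, 1);
}
```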
- [SUGGESTION] lines 488-497: Dropped `JoinHandle` silently swallows panics from health check task
  The `JoinHandle` from `runtime.spawn(...)` is explicitly dropped (line 489-497). If the health check task panics (e.g., from a poisoned RwLock), the panic is silently lost and `is_health_check_running()` remains stale. For a long-running background task, consider either spawning a monitoring wrapper that logs `JoinError::is_panic()`, or storing an `AbortHandle` for cleanup.

In `packages/wasm-sdk/src/sdk.rs`:
- [SUGGESTION] lines 329-347: WASM SDK runs full health check loop but exposes no lifecycle controls to JavaScript
  Despite the PR description claiming 'WASM excluded — entire module gated with `#[cfg(not(target_arch = "wasm32"))]`', the health_check module has no cfg gate (line 11 of lib.rs). The full probing loop runs on WASM via `wasm_bindgen_futures::spawn_local()` (sdk.rs:503-511) using `gloo_timers` for sleep.

This is actually fine — the code is correctly portable. But two issues:
1. The PR description should be updated to reflect that health checks ARE fully functional on WASM
2. `WasmSdk` doesn't expose `shutdown()` or `stop_health_check()` via `#[wasm_bindgen]`, so JavaScript can start a perpetual background task with no way to cancel it

FIX 1: Break out of health_check_loop when cancellation is consumed
inside probe_and_update_batch. The FusedFuture becomes terminated,
causing the outer select_biased! to skip the stop branch forever.

FIX 2: Only store the new CancellationToken into the mutex after
confirming the tokio runtime exists and the task will be spawned.
Previously, is_health_check_running() returned true in FFI paths
where no task was actually running.

FIX 3: Don't re-ban already-banned nodes during health check
re-probes. Each ban() call increments ban_count with exponential
backoff; deterministic re-probing drove ban duration to absurd
levels (112 days after 12 cycles).

FIX 4: Hold the mutex for the entire start_health_check operation
to prevent concurrent calls from orphaning tasks. The previous
implementation released and re-acquired the mutex, allowing
interleaving.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@lklimek lklimek added the claudius-review Trigger automated review label Mar 18, 2026
Collaborator

@thepastaclaw thepastaclaw left a comment

Code Review

Two real issues confirmed. The ban-escalation guard at health_check.rs:202 creates a Phase 2 regression where failed re-probes never re-ban expired nodes, leaving dead nodes live for ~59s per cycle. The missing Drop impl on Sdk means health-check tasks leak if dropped without calling shutdown(). All 6 prior findings are resolved.

Reviewed commit: aec5665

🔴 1 blocking | 🟡 2 suggestion(s)

🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `packages/rs-dapi-client/src/health_check.rs`:
- [BLOCKING] lines 196-207: Failed Phase 2 re-probes never re-ban expired nodes, leaving dead nodes live for ~59s per cycle
  The `!is_banned(address)` guard at line 202 was added to prevent ban-escalation from deterministic re-probing. But `is_banned()` checks `ban_count > 0` (address_list.rs:94-96), and ALL Phase 2 candidates have `ban_count > 0` by definition (they come from `get_expired_ban_addresses()` which requires `ban_count > 0` at line 305). So `!is_banned()` is ALWAYS false for Phase 2 failures — `ban()` is never called.

Consequence: after a failed re-probe, `banned_until` stays at the old expired timestamp. `get_live_address()` (address_list.rs:231-238) checks `banned_until < now` — since the ban is expired, the dead node passes the filter and is returned to foreground traffic. `get_next_ban_expiry()` returns `None` (no future bans), so the loop sleeps for `no_ban_idle_period` (59s). The dead node is live for the entire sleep window.

Suggested fix: replace the guard with logic that refreshes the ban without escalating. For example, `unban()` + `ban()` resets `ban_count` to 0 then sets it to 1 with a base-period ban duration. Alternatively, add an `AddressStatus::refresh_ban()` method that resets `banned_until` to `now + base_period` without touching `ban_count`.
- [SUGGESTION] lines 245-496: No test verifies that Phase 2 re-probe keeps persistently dead nodes banned
  The existing tests cover Phase 1 startup banning and cancellation, but no test exercises the Phase 2 re-probe-after-expiry path. A test with a short ban period that waits for expiry and then verifies the dead node is still banned (not returned by `get_live_address()`) would have caught finding #1.

In `packages/rs-sdk/src/sdk.rs`:
- [SUGGESTION] lines 469-525: Health-check task leaks when Sdk is dropped without calling shutdown()
  There is no `impl Drop for Sdk`. `CancellationToken::drop()` does NOT cancel the token (tokio-util semantics). If the Sdk is dropped without an explicit `shutdown()` call, the spawned health-check task continues as a zombie.

At the FFI boundary this is concrete: `dash_sdk_destroy` (rs-sdk-ffi/src/sdk.rs:521-524) does `Box::from_raw(handle)` which drops `SDKWrapper` → drops `Sdk`, but never calls `shutdown()`. Since `runtime` is `Arc<Runtime>`, the runtime survives and the orphaned task keeps running.

Mitigating factor: `health_check_cancel` is a child of `cancel_token`, so calling `shutdown()` does cancel it. The issue is only when `shutdown()` is skipped.

Options: (1) Add `impl Drop for Sdk { fn drop(&mut self) { self.cancel_token.cancel(); } }` — but since Sdk is Clone, this cancels on any clone's drop. (2) Call `sdk.shutdown()` in `dash_sdk_destroy` before dropping. (3) Document that `shutdown()` must be called.

Wrap the spawned run_health_check JoinHandle in a monitor task that logs
panics/failures and cancels the health-check token, so is_health_check_running()
reflects the actual task state instead of going stale on panic.

Also expand the no_ban_idle_period doc comment to warn that the value must
be less than the effective ban duration for the re-probe guarantee to hold.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Collaborator

@thepastaclaw thepastaclaw left a comment

Code Review

Small incremental addressing prior review feedback. Adds a panic-monitoring wrapper around the health check task and improves documentation for the no_ban_idle_period config field. Both changes are correct and well-implemented.

Reviewed commit: 5c7825d

… Drop impl

FIX 1: Phase 2 re-probes now correctly re-ban failed nodes. The old
`!is_banned()` guard was always false for Phase 2 candidates (they all
have ban_count > 0 from get_expired_ban_addresses), so ban() was never
called and dead nodes re-entered live rotation. Replace with unban()+ban()
to reset ban_count to 1 with a fresh base-period duration.

FIX 2: Add test_phase2_reprobe_failure_rebans_node that exercises the
Phase 2 re-probe-after-expiry path with a short ban period (200ms).

FIX 3: Add Arc<()>-based Drop impl for Sdk that cancels background tasks
only when the last clone is dropped, preventing both zombie health-check
tasks and premature cancellation when intermediate clones are dropped.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Collaborator

@thepastaclaw thepastaclaw left a comment

Code Review

Incremental push fixes the Phase 2 re-ban regression (dead nodes escaping ban after failed re-probe) and adds automatic background-task cleanup on last Sdk clone drop. Both fixes target real bugs. Two concurrency gaps exist: (1) the unban()+ban() two-step creates a brief window where an unhealthy node appears live, and (2) the Arc::strong_count check in Drop has a TOCTOU race that can cause missed cancellation on concurrent last-clone drops. Both are low-probability in practice but have clean fixes.

Reviewed commit: dd8a8eb

🟡 2 suggestion(s)

🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `packages/rs-dapi-client/src/health_check.rs`:
- [SUGGESTION] lines 202-208: `unban()`+`ban()` two-step creates window where dead node appears live
  `AddressList::unban()` and `AddressList::ban()` each acquire and release the `RwLock` independently. Between the two calls, a concurrent `get_live_address()`/`get_live_addresses()` call can observe `banned_until = None` and return the known-unhealthy node as a live candidate — potentially routing one request to a dead node before the ban is re-applied.

The window is very brief (sub-microsecond in the common single-threaded case), and the existing design already documents an accepted race at ban expiry (lines 62-66). However, this gap is strictly unnecessary and can be closed with an atomic `reset_ban()` method that holds a single write lock across both operations.

All 8 review agents independently flagged this issue.
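The suggested atomic `reset_ban()` can be sketched with std types: a single write lock covers both clearing the ban history and applying a fresh base-period ban, so no reader can observe the intermediate unbanned state. Field and method names here are illustrative, not the real `AddressList` API:

```rust
use std::collections::HashMap;
use std::sync::RwLock;
use std::time::{Duration, Instant};

struct Status {
    ban_count: u32,
    banned_until: Option<Instant>,
}

struct AddressList {
    inner: RwLock<HashMap<String, Status>>,
}

impl AddressList {
    /// Atomically reset ban history and re-apply a base-period ban.
    /// Holding one write lock across both steps closes the window in
    /// which a separate unban()+ban() pair would expose the node as live.
    fn reset_ban(&self, addr: &str, base_period: Duration) {
        let mut map = self.inner.write().unwrap(); // single write lock
        if let Some(status) = map.get_mut(addr) {
            status.ban_count = 1; // reset history, no escalation
            status.banned_until = Some(Instant::now() + base_period);
        } // nonexistent address: no-op
    }

    fn is_live(&self, addr: &str) -> bool {
        let map = self.inner.read().unwrap();
        match map.get(addr).and_then(|s| s.banned_until) {
            Some(until) => until <= Instant::now(), // expired ban = live
            None => true,
        }
    }
}

fn main() {
    let list = AddressList { inner: RwLock::new(HashMap::new()) };
    list.inner.write().unwrap().insert(
        "node-a".to_string(),
        Status { ban_count: 5, banned_until: Some(Instant::now()) },
    );
    list.reset_ban("node-a", Duration::from_secs(60));
    assert!(!list.is_live("node-a")); // still banned, fresh base period
    assert_eq!(list.inner.read().unwrap()["node-a"].ban_count, 1);
    list.reset_ban("node-x", Duration::from_secs(60)); // unknown address: no-op
}
```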

In `packages/rs-sdk/src/sdk.rs`:
- [SUGGESTION] lines 167-176: `Arc::strong_count` TOCTOU — consider `Arc::into_inner` for race-free last-owner detection
  If two threads drop the final two `Sdk` clones concurrently, both destructors observe `strong_count == 2` (the Arc isn't decremented until after `drop()` returns), so neither calls `cancel()`. Background tasks then leak until the tokio runtime shuts down.

The comment acknowledges this is "not perfectly atomic," but since MSRV is 1.92 and `Arc::into_inner` has been stable since Rust 1.70, a race-free alternative is available:

```rust
impl Drop for Sdk {
    fn drop(&mut self) {
        let shared = std::mem::replace(&mut self._shared_ref, Arc::new(()));
        if Arc::into_inner(shared).is_some() {
            self.cancel_token.cancel();
        }
    }
}
```

Arc::into_inner returns Some only when the calling thread holds the last strong reference, with no TOCTOU gap. The replacement Arc::new(()) is cheap (zero-sized payload, negligible cost).
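The `Arc::into_inner` guarantee this fix relies on can be verified with a small std-only program (unrelated to the SDK types): even under concurrent drops, exactly one caller observes itself as the last owner.

```rust
use std::sync::Arc;
use std::thread;

fn main() {
    let shared = Arc::new(());
    let clones: Vec<Arc<()>> = (0..8).map(|_| Arc::clone(&shared)).collect();
    drop(shared);

    // Each thread consumes one clone; Arc::into_inner returns Some only
    // for the thread that releases the final strong reference.
    let winners: usize = clones
        .into_iter()
        .map(|arc| thread::spawn(move || Arc::into_inner(arc).is_some()))
        .collect::<Vec<_>>()
        .into_iter()
        .map(|h| h.join().unwrap() as usize)
        .sum();

    // Exactly one "last owner", so exactly one cancel() would fire —
    // unlike the strong_count check, where concurrent drops can all miss.
    assert_eq!(winners, 1);
}
```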

Add AddressList::reset_ban() that holds a single write lock to atomically
reset ban history and apply a fresh base-period ban, closing the TOCTOU
window between separate unban()+ban() calls. Use it in health_check.rs.

Replace Arc::strong_count TOCTOU check in Sdk::drop with Arc::into_inner,
which is race-free and stable since Rust 1.70.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Collaborator

@thepastaclaw thepastaclaw left a comment

Code Review

This push cleanly addresses both prior review suggestions:

  1. TOCTOU race in unban()+ban() — replaced with a new atomic reset_ban() method that acquires a single write lock, resets ban_count to 1, and applies a fresh base-period ban. Two unit tests cover the happy path and the no-op case for nonexistent addresses.

  2. Arc::strong_count TOCTOU in Sdk::drop — replaced with Arc::into_inner, which is race-free by specification (returns Some only for the true last owner). The std::mem::replace swaps in Arc::new(()), whose zero-sized payload makes the cost negligible.

Both fixes are well-targeted, minimal, and correct. No new issues introduced.

Reviewed commit: 4154f16

@lklimek lklimek removed the claudius-review Trigger automated review label Mar 18, 2026
@lklimek
Contributor Author

lklimek commented Mar 18, 2026

Closing after discussion with @shumkov; we will fall back to TrustedHttpContextProvider.

@lklimek lklimek closed this Mar 18, 2026
@lklimek lklimek reopened this Mar 30, 2026
…heck

# Conflicts:
#	packages/rs-dapi-client/src/address_list.rs
@thepastaclaw
Collaborator

thepastaclaw commented Mar 30, 2026

✅ Review complete (commit b076885)

@github-actions
Contributor

✅ DashSDKFFI.xcframework built for this PR.

SwiftPM (host the zip at a stable URL, then use):

```swift
.binaryTarget(
  name: "DashSDKFFI",
  url: "https://your.cdn.example/DashSDKFFI.xcframework.zip",
  checksum: "5143f7afbefe2bc88e1a504ed2fa2e961ab1f673908a82ae3870067c482b5a79"
)
```

Xcode manual integration:

  • Download 'DashSDKFFI.xcframework' artifact from the run link above.
  • Drag it into your app target (Frameworks, Libraries & Embedded Content) and set Embed & Sign.
  • If using the Swift wrapper package, point its binaryTarget to the xcframework location or add the package and place the xcframework at the expected path.

Collaborator

@thepastaclaw thepastaclaw left a comment

Code Review

Well-designed health check system with clean async architecture and proper cancellation handling. The sleep-to-expiry model is efficient and the documented race window is an honest, acceptable tradeoff. Two issues worth addressing: the WASM build enables health checks by default but exposes no explicit shutdown API (relying on JS GC for cleanup), and the idle sleep can miss bans created mid-sleep, weakening the re-probe-before-reuse guarantee.

Reviewed commit: b076885

🟡 3 suggestion(s) | 💬 1 nitpick(s)

🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `packages/rs-dapi-client/src/health_check.rs`:
- [SUGGESTION] lines 42-46: Idle sleep can miss bans created mid-sleep, weakening re-probe-before-reuse guarantee
  When no bans exist, the watcher sleeps for `no_ban_idle_period` (default 59s). If a foreground request bans a node at t=58s for the default 60s (expiring at t=118s), the watcher wakes at t=59s, sees the next expiry at t=118s, and sleeps for another 59s. The node's ban expires at t=118s and it re-enters `get_live_address()` rotation before the watcher re-probes it at ~t=118s.

The doc comment on `no_ban_idle_period` says it ensures the watcher wakes *before* bans expire, but this only holds for bans that already exist when the sleep begins. Bans created during the sleep are missed until the next wake cycle.

This isn't critical — the documented race window in `run_health_check()` already acknowledges that nodes can briefly re-enter rotation before re-probe, and the reactive ban logic handles nodes that fail on actual requests. But the doc comment on `no_ban_idle_period` overpromises. Consider either: (a) adjusting the doc comment to clarify the guarantee only applies to pre-existing bans, or (b) using a notify/channel from the ban path to interrupt the idle sleep.

In `packages/rs-sdk/src/sdk.rs`:
- [SUGGESTION] lines 543-553: WASM health check has no JS-visible shutdown path — relies on GC for cleanup
  On WASM, `SdkBuilder` enables the health check by default and `start_health_check()` spawns it via `spawn_local()`. The spawned task is only stopped when the `CancellationToken` is cancelled — which happens in `Sdk::shutdown()` or on `Drop`. But `WasmSdk` doesn't export `shutdown()` or `stop_health_check()` to JS, so the only cleanup path is the wasm-bindgen-generated `free()` method, which JS callers aren't required to invoke (it depends on GC finalization).

A forgotten `WasmSdk` instance can leave a permanent background probe loop alive for the lifetime of the page/worker. Additionally, if `run_health_check()` panics on WASM, the cancel token is never cancelled (no monitor task like native has), leaving `is_health_check_running()` permanently stale.

Consider either: (a) exporting a `shutdown()` or `stop_health_check()` method on `WasmSdkBuilder`/`WasmSdk`, or (b) defaulting health checks to off on WASM targets.
- [SUGGESTION] lines 511-538: Health check silently disabled when SDK is built outside a Tokio runtime
  `SdkBuilder::default()` enables health checks, but `start_health_check()` on native only spawns the task if `tokio::runtime::Handle::try_current()` succeeds. If the SDK is constructed in synchronous setup code (before entering the Tokio runtime), the health check is silently skipped with just a `tracing::warn!`. The builder advertises health checks as on-by-default, but whether they actually start depends on call-site context.

The code already handles this by not storing the cancel token when no runtime is found (so `is_health_check_running()` returns false), which is correct. But the silent skip could surprise users. Consider either returning an error from `build()` when health checks are enabled but no runtime is available, or documenting this behavior prominently.

Comment on lines +42 to +46
```rust
    /// period shorter than this value, nodes may re-enter rotation before the health check
    /// can re-probe them. Adjust accordingly.
    pub no_ban_idle_period: Duration,
}
```

Collaborator

🟡 Suggestion: Idle sleep can miss bans created mid-sleep, weakening re-probe-before-reuse guarantee

When no bans exist, the watcher sleeps for no_ban_idle_period (default 59s). If a foreground request bans a node at t=58s for the default 60s (expiring at t=118s), the watcher wakes at t=59s, sees the next expiry at t=118s, and sleeps for another 59s. The node's ban expires at t=118s and it re-enters get_live_address() rotation before the watcher re-probes it at ~t=118s.

The doc comment on no_ban_idle_period says it ensures the watcher wakes before bans expire, but this only holds for bans that already exist when the sleep begins. Bans created during the sleep are missed until the next wake cycle.

This isn't critical — the documented race window in run_health_check() already acknowledges that nodes can briefly re-enter rotation before re-probe, and the reactive ban logic handles nodes that fail on actual requests. But the doc comment on no_ban_idle_period overpromises. Consider either: (a) adjusting the doc comment to clarify the guarantee only applies to pre-existing bans, or (b) using a notify/channel from the ban path to interrupt the idle sleep.

source: ['codex']
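Option (b) — letting the ban path interrupt the idle sleep — can be sketched with a std `Condvar`. All names are illustrative; the real code uses async timers, so a `tokio::sync::Notify` would be the closer analog, but the shape of the fix is the same: the watcher waits with a timeout and wakes early when a ban is recorded mid-sleep.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::{Duration, Instant};

fn main() {
    // Shared flag: "a ban happened since the watcher last looked".
    let pair = Arc::new((Mutex::new(false), Condvar::new()));
    let notifier = Arc::clone(&pair);

    // Foreground path: bans a node shortly after the idle wait begins.
    thread::spawn(move || {
        thread::sleep(Duration::from_millis(50));
        let (lock, cvar) = &*notifier;
        *lock.lock().unwrap() = true;
        cvar.notify_one(); // wake the watcher immediately
    });

    // Watcher: idle wait up to 59s, but interruptible by the ban path.
    let start = Instant::now();
    let (lock, cvar) = &*pair;
    let guard = lock.lock().unwrap();
    let (guard, timeout) = cvar
        .wait_timeout_while(guard, Duration::from_secs(59), |banned| !*banned)
        .unwrap();

    assert!(*guard); // woke because of the ban, not the timer
    assert!(!timeout.timed_out());
    assert!(start.elapsed() < Duration::from_secs(5));
}
```

With this shape, a ban created at t=58s wakes the watcher right away, so the re-probe can be scheduled before the ban's own expiry instead of a full idle period later.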


Comment on lines +543 to +553
```rust
{
    *cancel_guard = new_cancel;
    drop(cancel_guard);
    wasm_bindgen_futures::spawn_local(rs_dapi_client::health_check::run_health_check(
        hc_address_list,
        hc_pool,
        config,
        hc_token,
        hc_ca_cert,
    ));
}
```
Collaborator

🟡 Suggestion: WASM health check has no JS-visible shutdown path — relies on GC for cleanup

On WASM, SdkBuilder enables the health check by default and start_health_check() spawns it via spawn_local(). The spawned task is only stopped when the CancellationToken is cancelled — which happens in Sdk::shutdown() or on Drop. But WasmSdk doesn't export shutdown() or stop_health_check() to JS, so the only cleanup path is the wasm-bindgen-generated free() method, which JS callers aren't required to invoke (it depends on GC finalization).

A forgotten WasmSdk instance can leave a permanent background probe loop alive for the lifetime of the page/worker. Additionally, if run_health_check() panics on WASM, the cancel token is never cancelled (no monitor task like native has), leaving is_health_check_running() permanently stale.

Consider either: (a) exporting a shutdown() or stop_health_check() method on WasmSdkBuilder/WasmSdk, or (b) defaulting health checks to off on WASM targets.

source: ['codex']


Comment on lines +511 to +538
```rust
if let Ok(runtime) = tokio::runtime::Handle::try_current() {
    *cancel_guard = new_cancel;
    drop(cancel_guard);
    let hc_cancel_for_monitor = hc_token.clone();
    let handle = runtime.spawn(rs_dapi_client::health_check::run_health_check(
        hc_address_list,
        hc_pool,
        config,
        hc_token,
        hc_ca_cert,
    ));
    // Monitor task: logs panics and cancels the token so
    // is_health_check_running() reflects the actual state.
    drop(runtime.spawn(async move {
        if let Err(e) = handle.await {
            if e.is_panic() {
                tracing::error!("health check task panicked: {:?}", e);
            } else {
                tracing::warn!("health check task failed: {:?}", e);
            }
            hc_cancel_for_monitor.cancel();
        }
    }));
} else {
    // No runtime: don't store the token so is_health_check_running()
    // correctly returns false.
    drop(cancel_guard);
    tracing::warn!("no Tokio runtime found, cannot start health check");
```
Collaborator

🟡 Suggestion: Health check silently disabled when SDK is built outside a Tokio runtime

SdkBuilder::default() enables health checks, but start_health_check() on native only spawns the task if tokio::runtime::Handle::try_current() succeeds. If the SDK is constructed in synchronous setup code (before entering the Tokio runtime), the health check is silently skipped with just a tracing::warn!. The builder advertises health checks as on-by-default, but whether they actually start depends on call-site context.

The code already handles this by not storing the cancel token when no runtime is found (so is_health_check_running() returns false), which is correct. But the silent skip could surprise users. Consider either returning an error from build() when health checks are enabled but no runtime is available, or documenting this behavior prominently.

source: ['codex']


Comment on lines +488 to +498
pub fn start_health_check(&self, config: rs_dapi_client::HealthCheckConfig) {
let mut cancel_guard = self
.health_check_cancel
.lock()
.expect("health_check_cancel mutex poisoned");

// Stop existing health check (atomic: no mutex release between stop and start)
cancel_guard.cancel();

let new_cancel = self.cancel_token.child_token();


💬 Nitpick: Restarting health check doesn't await previous task completion

start_health_check() cancels the previous token and immediately spawns a new task without waiting for the old one to finish. In theory, the old task could still be inside probe_and_update_batch() when the new task starts its startup probe, causing concurrent ban/unban mutations on the shared AddressList.

In practice the impact is minimal: the old task will observe cancellation on its next select_biased! check and exit promptly, and ban/unban operations are individually atomic. But for correctness, consider storing the JoinHandle and awaiting it before spawning the replacement, or adding a generation counter to discard stale probe results.

source: ['codex']
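The JoinHandle option can be sketched with std threads; in the SDK it would be a tokio `JoinHandle` awaited after cancelling the old token, and all names here are illustrative:

```rust
use std::thread::{self, JoinHandle};

// Std-thread analogue of the suggested fix: keep the previous task's
// JoinHandle and join it before spawning the replacement, so two probe
// loops can never mutate the shared AddressList concurrently.
struct HealthCheck {
    handle: Option<JoinHandle<()>>,
}

impl HealthCheck {
    fn restart(&mut self) {
        // Wait for the old task to fully exit before starting a new one.
        if let Some(old) = self.handle.take() {
            let _ = old.join();
        }
        self.handle = Some(thread::spawn(|| {
            // ... startup probe + ban-expiry loop would run here ...
        }));
    }
}
```

Note the trade-off: joining makes `restart()` block for as long as the old task takes to observe cancellation, which is why the generation-counter alternative may be preferable if probes can be slow.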


@thepastaclaw thepastaclaw left a comment


Code Review

This PR’s health-check loop is close, but three core behaviors in the checked-out patch do not match the feature contract it introduces. The startup probe is still asynchronous relative to build(), the default-on feature silently never starts in common native construction flows, and failed re-probes flatten ban history instead of preserving the existing backoff semantics.

Reviewed commit: b076885

🔴 3 blocking

Comment on lines +1087 to +1088
if let Some(hc_config) = health_check_config {
sdk.start_health_check(hc_config);

🔴 Blocking: The advertised startup probe is still fire-and-forget, so the first request can race dead nodes

SdkBuilder::build() enables the feature by calling sdk.start_health_check(hc_config), but that method only spawns run_health_check() and returns immediately. Phase 1 probing then happens asynchronously inside run_health_check() (packages/rs-dapi-client/src/health_check.rs, lines 110-119), so build() can return before any address has been checked or banned. A caller that sends a request right after constructing the SDK can still select an unprobed dead node from the existing address list, which contradicts the PR’s promised startup behavior that the first user request already sees a cleaned list.

source: ['codex']
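One way to close this race, sketched here with std threads and a channel: the health-check task signals startup-probe completion and `build()` blocks on that signal before returning. In the async SDK this would be a `tokio::sync::oneshot` awaited inside `build()`; all names are illustrative:

```rust
use std::sync::mpsc;
use std::thread;

// Sketch: build() waits for phase 1 to finish before returning, so the
// first user request already sees a cleaned address list.
fn run_health_check(startup_done: mpsc::Sender<()>) {
    // Phase 1: probe all addresses and ban the dead ones ...
    let _ = startup_done.send(()); // signal: startup probe finished
    // Phase 2: ban-expiry watcher loop would continue here ...
}

fn build() -> Result<(), mpsc::RecvError> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || run_health_check(tx));
    // Block until phase 1 completes. A real build() would bound this
    // with a timeout so a hung probe cannot stall construction.
    rx.recv()
}
```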


Comment on lines +511 to +538
if let Ok(runtime) = tokio::runtime::Handle::try_current() {
*cancel_guard = new_cancel;
drop(cancel_guard);
let hc_cancel_for_monitor = hc_token.clone();
let handle = runtime.spawn(rs_dapi_client::health_check::run_health_check(
hc_address_list,
hc_pool,
config,
hc_token,
hc_ca_cert,
));
// Monitor task: logs panics and cancels the token so
// is_health_check_running() reflects the actual state.
drop(runtime.spawn(async move {
if let Err(e) = handle.await {
if e.is_panic() {
tracing::error!("health check task panicked: {:?}", e);
} else {
tracing::warn!("health check task failed: {:?}", e);
}
hc_cancel_for_monitor.cancel();
}
}));
} else {
// No runtime: don't store the token so is_health_check_running()
// correctly returns false.
drop(cancel_guard);
tracing::warn!("no Tokio runtime found, cannot start health check");

🔴 Blocking: Default-on health checks silently never start when the SDK is built outside an active Tokio runtime

The feature is configured as enabled by default, but start_health_check() only stores the token and spawns the task if tokio::runtime::Handle::try_current() succeeds. When SdkBuilder::build() is called from synchronous setup code, this branch only logs no Tokio runtime found, cannot start health check and returns, with no retry path once the caller later enters async code. That makes the same default configuration behave differently depending on where construction happens, and in a common native usage pattern the promised background health check is just absent.

source: ['codex']


Comment on lines +201 to +206
Err(error) => {
// reset_ban atomically resets ban history and applies a fresh
// base-period ban, preventing exponential backoff escalation
// while closing the unban+ban race window.
address_list.reset_ban(address);
tracing::debug!(%address, error = %error, "health check: node is unhealthy, re-banned");

🔴 Blocking: Failed re-probes flatten ban history instead of preserving the existing backoff semantics

On re-probe failure the loop calls address_list.reset_ban(address), and reset_ban() (packages/rs-dapi-client/src/address_list.rs, lines 172-176) forcibly sets ban_count = 1 with a fresh base-period ban. That means a node that stays dead across multiple expiry cycles is retried every base ban period forever instead of following the existing exponential backoff implemented by AddressStatus::ban(). The patch comment explicitly says this prevents escalation, but the PR description says ban-expiry re-probes should reuse the existing ban behavior, so the shipped semantics no longer match the feature contract and create constant background churn against permanently dead nodes.

source: ['codex']
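The escalating re-ban the review asks for can be sketched as follows. Field and constant names (`ban_count`, `BASE_BAN_PERIOD`) mirror the ones discussed above but the arithmetic is a hypothetical reconstruction, not the SDK's actual `AddressStatus::ban()`:

```rust
use std::time::Duration;

// Sketch: on each re-probe failure, increment ban_count and apply
// base * 2^(ban_count - 1), instead of resetting ban_count to 1.
const BASE_BAN_PERIOD: Duration = Duration::from_secs(60);

struct AddressStatus {
    ban_count: u32,
}

impl AddressStatus {
    // Returns the ban period for the next failure, doubling each time
    // (the shift is capped so it cannot overflow).
    fn ban(&mut self) -> Duration {
        self.ban_count += 1;
        BASE_BAN_PERIOD * (1u32 << (self.ban_count - 1).min(16))
    }
}
```

With this shape, a permanently dead node is re-probed at 60s, 120s, 240s, ... rather than every base period forever, which removes the constant background churn the finding describes.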
