feat(dapi): add background health check for DAPI nodes #3353
Conversation
📝 Walkthrough
Adds a non-WASM background health check subsystem (`HealthCheckConfig` + `run_health_check`), `AddressList` ban/expiry query APIs, and `DapiClient` getters.
Sequence Diagram

```mermaid
sequenceDiagram
    participant SDK as SDK Builder / Sdk
    participant HealthCheck as Health Check Task
    participant AddrList as AddressList
    participant Pool as ConnectionPool
    participant Transport as Transport/Node
    SDK->>HealthCheck: spawn run_health_check(config, cancel_token)
    activate HealthCheck
    HealthCheck->>AddrList: get_all_addresses()
    AddrList-->>HealthCheck: addresses
    par Startup probes
        HealthCheck->>Pool: acquire connections (batched)
        Pool->>Transport: probe_node (GetStatus)
        Transport-->>HealthCheck: success / failure
        HealthCheck->>AddrList: ban/unban addresses
    end
    loop Watch loop
        HealthCheck->>AddrList: get_next_ban_expiry()
        AddrList-->>HealthCheck: next_expiry or None
        HealthCheck->>HealthCheck: sleep until expiry or idle_period
        HealthCheck->>AddrList: get_expired_ban_addresses()
        AddrList-->>HealthCheck: expired_addresses
        par Re-probe expired
            HealthCheck->>Pool: probe_node for expired
            Transport-->>HealthCheck: success / failure
            HealthCheck->>AddrList: unban healthy / keep ban
        end
    end
    SDK-->>HealthCheck: CancellationToken (abort)
    HealthCheck-->>SDK: task stopped
    deactivate HealthCheck
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pull request overview
Adds an opt-in (default-on for non-WASM) background health check to proactively probe DAPI nodes at startup and re-probe nodes when bans expire, aiming to reduce user-facing timeouts by keeping the live node set cleaner.
Changes:
- Introduces `rs-dapi-client` background health check module with startup probing + ban-expiry re-probing and associated config/tests.
- Extends `AddressList` with helpers to enumerate all addresses and ban-expiry/expired-ban queries; adds `ban_count()` accessor.
- Updates `dash-sdk` builder to configure/spawn the health check task, stores its `JoinHandle`, and aborts it on `Sdk` drop (non-WASM).
Reviewed changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| packages/rs-sdk/src/sdk.rs | Spawns/stores background health-check task from SdkBuilder, aborts task in Drop. |
| packages/rs-dapi-client/src/lib.rs | Exposes health_check module + HealthCheckConfig for non-WASM. |
| packages/rs-dapi-client/src/health_check.rs | New health check implementation + tests (startup probe + ban-expiry watcher). |
| packages/rs-dapi-client/src/dapi_client.rs | Adds accessors for connection pool and settings to support health check wiring. |
| packages/rs-dapi-client/src/address_list.rs | Adds address/ban helper methods + tests; adds ban_count() getter. |
| packages/rs-dapi-client/Cargo.toml | Adds tokio-util (non-WASM) for CancellationToken. |
| Cargo.lock | Locks tokio-util dependency. |
Force-pushed from 81d46ba to 0ee1996
Introduces an opt-in health check system with two behaviors:
- Startup probe: probes all nodes when the SDK is built, bans dead ones
- Ban-expiry re-probe: re-probes nodes when their ban timer expires

Enabled by default, opt-out via SdkBuilder::with_health_check(false). Gated behind #[cfg(not(target_arch = "wasm32"))].

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Refactor SdkBuilder to use Option<HealthCheckConfig> instead of separate bool + config fields (simpler API, None = disabled)
- Log banned/healthy count after startup probe completes
- Change no_ban_idle_period default to 59s (ban expiry - 1s)
- Pass CancellationToken through probe_and_update_batch for graceful shutdown during probing
- Clamp max_concurrent_probes to min 1 to prevent hang
- Guard tokio::spawn with Handle::try_current() to avoid panic when build() is called outside a Tokio runtime
- Clarify is_banned() doc comment about temporal vs state semantics

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
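For orientation, the configuration surface implied by these commits might look roughly like the following std-only sketch. Field names come from the commit message; the defaults shown and the `effective_concurrency()` helper are assumptions for illustration, not the actual implementation.

```rust
use std::time::Duration;

/// Hypothetical shape of the health-check configuration.
#[derive(Debug, Clone)]
pub struct HealthCheckConfig {
    /// Sleep interval when no future ban expiry exists (ban expiry - 1s).
    pub no_ban_idle_period: Duration,
    /// Upper bound on concurrent startup probes.
    pub max_concurrent_probes: usize,
}

impl Default for HealthCheckConfig {
    fn default() -> Self {
        Self {
            // 59s default per the commit message (ban expiry - 1s).
            no_ban_idle_period: Duration::from_secs(59),
            // Illustrative default; the real value is not stated here.
            max_concurrent_probes: 4,
        }
    }
}

impl HealthCheckConfig {
    /// Clamp to at least one probe so a zero value cannot stall the batch loop.
    pub fn effective_concurrency(&self) -> usize {
        self.max_concurrent_probes.max(1)
    }
}
```

Modelling "None = disabled" as `Option<HealthCheckConfig>` on the builder keeps the on/off flag and its settings from drifting apart.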
Force-pushed from 0ee1996 to fc732a9
… debug log

Change probe_node to return Result<(), String> instead of bool so the actual failure reason (transport error, probe timeout, etc.) propagates to the caller and appears in the debug-level log message when a node is banned.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… String

Use the proper error type and convert to string only at the display site via tracing's Display formatting.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Actionable comments posted: 2
🧹 Nitpick comments (1)
packages/rs-dapi-client/src/dapi_client.rs (1)
159-166: Hide transport internals behind a client-level health-check API. These accessors make `ConnectionPool` and raw request settings part of the public crate surface just to bootstrap the checker from packages/rs-sdk/src/sdk.rs. A `DapiClient`-level entry point would keep those internals swappable without another public compatibility promise.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/rs-dapi-client/src/dapi_client.rs` around lines 159 - 166, Remove the public exposure of transport internals by deleting or making private the connection_pool(&self) -> &ConnectionPool and settings(&self) -> &RequestSettings accessors on DapiClient, and add a single public DapiClient-level health API (e.g., pub fn transport_health(&self) -> Result<HealthStatus, TransportError> or pub fn check_health(&self) -> HealthStatus) that performs the necessary checks using the pool and settings internally; then update the bootstrap checker in packages/rs-sdk/src/sdk.rs to call this new DapiClient method instead of accessing ConnectionPool or RequestSettings directly so transport internals remain swappable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/rs-dapi-client/src/health_check.rs`:
- Around line 95-126: The loop currently calls
address_list.get_next_ban_expiry() and treats None the same as “no bans”, which
delays re-probing even when address_list.get_expired_ban_addresses() is
non-empty; change the logic in the Phase 2 loop so that after computing
sleep_duration (or when get_next_ban_expiry() returns None) you first call
address_list.get_expired_ban_addresses() and if it returns non-empty,
immediately call probe_and_update_batch(&expired, &address_list, &pool,
&probe_settings, &config, &cancel_token).await (or set sleep_duration =
Duration::ZERO) before falling back to config.no_ban_idle_period, ensuring
expired bans are probed immediately and get_live_address() won’t hand out
recently-banned nodes prematurely.
In `@packages/rs-sdk/src/sdk.rs`:
- Around line 134-136: Change the raw optional JoinHandle ownership into an
Arc-backed guard so clones share the same health-check task and only abort it
when the last clone is dropped: introduce a small HealthCheckGuard type that
owns the tokio::task::JoinHandle<()> (and an AbortHandle/abort logic) and
implements Drop to abort the task when the guard is dropped; replace the Sdk
field _health_check_handle: Option<tokio::task::JoinHandle<()>> with
_health_check_guard: Option<Arc<HealthCheckGuard>> (or Option<Arc<dyn
DropGuard>>), update Sdk::clone to clone the Arc instead of taking/stripping the
handle, and remove the explicit abort in Sdk::drop so the task is only canceled
by the HealthCheckGuard::Drop when the last Arc reference is released.
---
Nitpick comments:
In `@packages/rs-dapi-client/src/dapi_client.rs`:
- Around line 159-166: Remove the public exposure of transport internals by
deleting or making private the connection_pool(&self) -> &ConnectionPool and
settings(&self) -> &RequestSettings accessors on DapiClient, and add a single
public DapiClient-level health API (e.g., pub fn transport_health(&self) ->
Result<HealthStatus, TransportError> or pub fn check_health(&self) ->
HealthStatus) that performs the necessary checks using the pool and settings
internally; then update the bootstrap checker in packages/rs-sdk/src/sdk.rs to
call this new DapiClient method instead of accessing ConnectionPool or
RequestSettings directly so transport internals remain swappable.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f31813b2-a3b9-4677-b418-a6bdf4e1e58f
⛔ Files ignored due to path filters (1)
`Cargo.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (6)
- packages/rs-dapi-client/Cargo.toml
- packages/rs-dapi-client/src/address_list.rs
- packages/rs-dapi-client/src/dapi_client.rs
- packages/rs-dapi-client/src/health_check.rs
- packages/rs-dapi-client/src/lib.rs
- packages/rs-sdk/src/sdk.rs
Codecov Report: ❌ Patch coverage is … Additional details and impacted files:

```
@@            Coverage Diff             @@
##           v3.1-dev    #3353    +/-   ##
============================================
+ Coverage     79.93%   79.98%   +0.05%
============================================
  Files          2861     2862       +1
  Lines        280236   281400    +1164
============================================
+ Hits         223993   225069    +1076
- Misses        56243    56331      +88
```
Actionable comments posted: 1
♻️ Duplicate comments (1)
packages/rs-dapi-client/src/health_check.rs (1)
95-126: ⚠️ Potential issue | 🟠 Major: Probe already-expired bans before falling back to the idle sleep.

`get_next_ban_expiry()` ignores bans that are already in the past, so if a re-probe batch runs longer than the gap to the next expiry, this loop comes back here, gets `None`, and sleeps `no_ban_idle_period` before touching those nodes. That stretches the documented "small race window" to a full idle period and lets `get_live_address()` hand those addresses out unchecked.

Suggested fix
```diff
 // Phase 2: Watch for ban expirations
 loop {
+    let expired = address_list.get_expired_ban_addresses();
+    if !expired.is_empty() {
+        tracing::debug!(
+            count = expired.len(),
+            "re-probing addresses with expired bans"
+        );
+        probe_and_update_batch(
+            &expired,
+            &address_list,
+            &pool,
+            &probe_settings,
+            &config,
+            &cancel_token,
+        )
+        .await;
+        continue;
+    }
+
     let sleep_duration = match address_list.get_next_ban_expiry() {
         Some(expiry) => {
             let until = expiry - chrono::Utc::now();
             until.to_std().unwrap_or(Duration::ZERO)
         }
         None => config.no_ban_idle_period,
     };
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/rs-dapi-client/src/health_check.rs` around lines 95 - 126, The loop currently computes sleep_duration from get_next_ban_expiry() and may sleep the full no_ban_idle_period even when bans have already expired; before picking the sleep branch, call address_list.get_expired_ban_addresses() and if it returns a non-empty Vec, immediately call probe_and_update_batch(...) (same args as the existing branch) and continue the loop instead of sleeping. This ensures already-expired bans are re-probed immediately; keep the existing tokio::select! cancel handling and only fall back to sleeping when get_expired_ban_addresses() is empty and get_next_ban_expiry() returns None.
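The fix discussed above boils down to a small decision rule: expired bans always win over any sleep computation. A std-only model makes the ordering explicit (the `next_action`/`NextAction` names are illustrative, not from the PR; `expired_count` stands in for `get_expired_ban_addresses().len()`):

```rust
use std::time::{Duration, Instant};

/// What the Phase 2 watch loop should do next.
#[derive(Debug, PartialEq)]
enum NextAction {
    /// Re-probe immediately; never fall through to a sleep.
    ProbeExpiredNow,
    /// Sleep until the next future ban expiry.
    SleepUntil(Duration),
    /// No bans at all: sleep for no_ban_idle_period.
    IdleSleep,
}

/// Decide the next step from the current ban state.
fn next_action(now: Instant, expired_count: usize, next_expiry: Option<Instant>) -> NextAction {
    if expired_count > 0 {
        // Checked first, so a long-running probe batch can never push
        // already-expired bans behind a full idle sleep.
        return NextAction::ProbeExpiredNow;
    }
    match next_expiry {
        Some(expiry) => NextAction::SleepUntil(expiry.saturating_duration_since(now)),
        None => NextAction::IdleSleep,
    }
}
```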
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/rs-dapi-client/src/health_check.rs`:
- Around line 145-161: The async probe can overwrite newer bans; fix by
snapshotting the address' ban state before calling probe_node and only apply
address_list.unban or address_list.ban if the current ban-generation/timestamp
equals that snapshot. Concretely: read a lightweight version/token (e.g.,
ban_generation or banned_until timestamp) from address_list into a local
variable (snapshot) immediately before calling probe_node(address, &pool,
&settings). After probe_node returns, re-check the current token from
address_list and only call address_list.unban(address) on Ok(()) if the token
still matches the snapshot, and only call address_list.ban(address) on Err(_) if
the token still matches the snapshot; this prevents stale probe results from
clearing or double-applying newer foreground bans.
---
Duplicate comments:
In `@packages/rs-dapi-client/src/health_check.rs`:
- Around line 95-126: The loop currently computes sleep_duration from
get_next_ban_expiry() and may sleep the full no_ban_idle_period even when bans
have already expired; before picking the sleep branch, call
address_list.get_expired_ban_addresses() and if it returns a non-empty Vec,
immediately call probe_and_update_batch(...) (same args as the existing branch)
and continue the loop instead of sleeping. This ensures already-expired bans are
re-probed immediately; keep the existing tokio::select! cancel handling and only
fall back to sleeping when get_expired_ban_addresses() is empty and
get_next_ban_expiry() returns None.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 5251aea2-a142-4ae8-a7c6-05c597d07e8d
📒 Files selected for processing (1)
packages/rs-dapi-client/src/health_check.rs
```rust
/// require modifying `get_live_address()` to never auto-unban expired bans, which would break
/// the existing reactive ban behavior.
pub async fn run_health_check(
    address_list: AddressList,
```
Address list is clone-friendly:

```rust
pub struct AddressList {
    addresses: Arc<RwLock<HashMap<Address, AddressStatus>>>,
    base_ban_period: Duration,
}
```

So we can clone and consume.
Same about connection pool:

```rust
pub struct ConnectionPool {
    inner: Arc<Mutex<LruCache<String, PoolItem>>>,
}
```
packages/rs-sdk/src/sdk.rs (outdated)
```rust
impl Drop for Sdk {
    fn drop(&mut self) {
        // Abort the background health check task when the Sdk that owns the handle is dropped.
        // Clones always have `None`, so only the original Sdk triggers this.
```
impl Clone for Sdk sets this to None, but I agree it's confusing and can lead to bugs. Claudius will refactor this to use Arc<> and some data structure that will abort automatically once the last reference is dropped.
packages/rs-sdk/src/sdk.rs (outdated)
```rust
// Extract clones needed for health check before dapi is moved
#[cfg(not(target_arch = "wasm32"))]
let hc_address_list = dapi.address_list().clone();
```
So it's an independent clone?
As already mentioned, clone() is the recommended way of managing objects in Sdk; they are lightweight and clone-friendly, using Arc<> internally. At least they should ;-).
…onToken

- Replace _health_check_handle with health_check_cancel CancellationToken (child of cancel_token), shared across all Sdk clones so dropping the original no longer kills the health check for clones
- Add stop_health_check() and is_health_check_running() to Sdk (not cfg-gated, works on all targets)
- Remove Drop impl for Sdk — the spawned task exits via token cancellation
- Check expired bans before idle sleep in Phase 2 (coderabbitai thread 11)
- Document accepted stale probe race condition (coderabbitai thread 13)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use ArcSwapOption<CancellationToken> for lock-free atomic swap of the health check token. Some = running, None = stopped.

- stop_health_check(): swap(None) + cancel the old token
- start_health_check(config): stop existing, create fresh child token, spawn new task
- is_health_check_running(): load().is_some()

All Sdk clones share the same ArcSwapOption via Arc, so any clone can start/stop the health check for all of them.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
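The Some = running / None = stopped protocol can be sketched with a std `Mutex<Option<_>>` standing in for `ArcSwapOption<CancellationToken>`; all type names here are illustrative, and a `CancelFlag` replaces the real token to keep the sketch self-contained:

```rust
use std::sync::{
    atomic::{AtomicBool, Ordering},
    Arc, Mutex,
};

#[derive(Clone, Default)]
struct CancelFlag(Arc<AtomicBool>);

impl CancelFlag {
    fn cancel(&self) {
        self.0.store(true, Ordering::SeqCst);
    }
    fn is_cancelled(&self) -> bool {
        self.0.load(Ordering::SeqCst)
    }
}

/// Some = health check running, None = stopped. Shared via Arc, so any
/// Sdk clone can start/stop the health check for all of them.
#[derive(Clone, Default)]
struct HealthCheckSlot(Arc<Mutex<Option<CancelFlag>>>);

impl HealthCheckSlot {
    fn stop(&self) {
        if let Some(old) = self.0.lock().unwrap().take() {
            old.cancel();
        }
    }
    /// Swap in a fresh token, cancelling whatever ran before; returns the
    /// token the newly spawned task should watch.
    fn start(&self) -> CancelFlag {
        let fresh = CancelFlag::default();
        if let Some(old) = self.0.lock().unwrap().replace(fresh.clone()) {
            old.cancel();
        }
        fresh
    }
    fn is_running(&self) -> bool {
        self.0.lock().unwrap().is_some()
    }
}
```

Because `start()` performs the stop-and-replace under one lock acquisition, two concurrent starts cannot interleave and orphan a token, which is exactly the serialization concern the later review findings raise.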
Actionable comments posted: 1
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 819756a5-ac14-42ca-9802-d39480f1755d
📒 Files selected for processing (2)
- packages/rs-dapi-client/src/health_check.rs
- packages/rs-sdk/src/sdk.rs
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/rs-sdk/src/sdk.rs`:
- Around line 455-485: Race between concurrent calls to start_health_check can
leave an orphaned spawned task because stop_health_check and the subsequent
spawn/store are not atomic; fix by serializing the stop+spawn sequence. Add a
lightweight Mutex (e.g., health_check_mutex: tokio::sync::Mutex<()>) to the Sdk
struct and acquire it at the start of start_health_check (and likewise in
stop_health_check if it mutates the same state) so the sequence that calls
stop_health_check(), creates new_cancel, spawns the tokio task, and stores
health_check_cancel is executed under the lock; update start_health_check (and
stop_health_check) to lock/unlock around the critical section referencing
health_check_cancel, cancel_token, and the spawn logic to eliminate the race.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 08b28c98-6195-4c6d-be4d-294de92923d9
📒 Files selected for processing (1)
packages/rs-sdk/src/sdk.rs
…ancel

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Call sdk.start_health_check() instead of duplicating the spawn logic. Removes pre-extracted hc_address_list/hc_pool/hc_ca_cert variables.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
♻️ Duplicate comments (1)
packages/rs-sdk/src/sdk.rs (1)
467-495: ⚠️ Potential issue | 🟠 Major: Serialize `start_health_check` to prevent orphaned background tasks.

Line 468 performs stop and restart in separate steps without one shared lock. Concurrent calls can each spawn a task, but only the last token is retained, leaving earlier tasks unmanaged until SDK-wide cancellation.
🔧 Proposed fix
```diff
 #[cfg(not(target_arch = "wasm32"))]
 pub fn start_health_check(&self, config: rs_dapi_client::HealthCheckConfig) {
-    self.stop_health_check();
-
-    let new_cancel = self.cancel_token.child_token();
+    let mut cancel_guard = self
+        .health_check_cancel
+        .lock()
+        .expect("health_check_cancel mutex poisoned");
+    cancel_guard.cancel();

     if let SdkInstance::Dapi { dapi, .. } = &self.inner {
+        let new_cancel = self.cancel_token.child_token();
         let hc_address_list = dapi.address_list().clone();
         let hc_pool = dapi.connection_pool().clone();
         let hc_ca_cert = dapi.ca_certificate.clone();
         let hc_token = new_cancel.clone();

         if let Ok(runtime) = tokio::runtime::Handle::try_current() {
-            *self
-                .health_check_cancel
-                .lock()
-                .expect("health_check_cancel mutex poisoned") = new_cancel;
+            *cancel_guard = new_cancel;
+            drop(cancel_guard);
             drop(
                 runtime.spawn(rs_dapi_client::health_check::run_health_check(
                     hc_address_list,
                     hc_pool,
                     config,
                     hc_token,
                     hc_ca_cert,
                 )),
             );
         } else {
             tracing::warn!("no Tokio runtime found, cannot start health check");
         }
     }
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/rs-sdk/src/sdk.rs` around lines 467 - 495, Concurrent calls to start_health_check can race with stop_health_check and each spawn a background task whose cancel token gets overwritten; serialize start/stop by holding the same mutex around the stop + new token assignment + spawn so only one caller can perform the sequence. In practice, acquire the health_check_cancel mutex at the start of start_health_check, call stop_health_check (or perform its logic) while still holding that lock, create and store new_cancel into health_check_cancel, then spawn the health check task before releasing the lock; reference the start_health_check method, stop_health_check helper, the health_check_cancel Mutex and cancel_token.child_token to locate where to apply this change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@packages/rs-sdk/src/sdk.rs`:
- Around line 467-495: Concurrent calls to start_health_check can race with
stop_health_check and each spawn a background task whose cancel token gets
overwritten; serialize start/stop by holding the same mutex around the stop +
new token assignment + spawn so only one caller can perform the sequence. In
practice, acquire the health_check_cancel mutex at the start of
start_health_check, call stop_health_check (or perform its logic) while still
holding that lock, create and store new_cancel into health_check_cancel, then
spawn the health check task before releasing the lock; reference the
start_health_check method, stop_health_check helper, the health_check_cancel
Mutex and cancel_token.child_token to locate where to apply this change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 5f4ab381-fecb-4cdd-a15c-c6a57f550795
📒 Files selected for processing (1)
packages/rs-sdk/src/sdk.rs
#3371) Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
thepastaclaw left a comment
Code Review
Well-structured background health check with good WASM/native portability via futures::select_biased!. Two significant lifecycle issues: a FusedFuture cancellation bug that can create zombie tasks when cancellation fires during active probing, and the health check silently fails to start in FFI (mobile) SDK paths because build() is called outside a Tokio runtime context. The WASM integration is functional but the PR description incorrectly claims WASM exclusion, and JavaScript has no API to stop the background task once started.
Reviewed commit: b70f39d
🔴 1 blocking | 🟡 5 suggestion(s) | 💬 1 nitpick(s)
1 additional finding(s) omitted (not in diff).
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.
In `packages/rs-dapi-client/src/health_check.rs`:
- [BLOCKING] lines 116-158: FusedFuture cancellation consumed inside `probe_and_update_batch` makes outer loop unkillable
The `stop` future is fused once at line 102 and passed by `&mut` into `probe_and_update_batch`. If cancellation fires during probing, the `select_biased!` inside that function (line 190-195) resolves the `stop` branch — permanently terminating the `FusedFuture`. Back in `health_check_loop`, the outer `select_biased!` (line 147-153) skips the terminated `stop` branch and only waits on `sleep_fut`. The loop then continues indefinitely.
Consequences:
1. `stop_health_check()` during active probing creates a zombie task that runs until SDK shutdown
2. `is_health_check_running()` returns `false` (cancelled token) while the task is still running
3. A subsequent `start_health_check()` spawns a second concurrent health check
Fix: check `stop.is_terminated()` after `probe_and_update_batch` returns and break if true.
- [SUGGESTION] lines 187-190: Health check re-probe systematically escalates ban duration with no cap
Each failed health check re-probe calls `address_list.ban(address)` (line 189), which increments `ban_count` and applies `base_period * e^(ban_count - 1)`. Unlike foreground bans which only trigger on user requests, the health check drives this escalation deterministically on every sleep-to-expiry cycle. After 7 cycles: ~66 min ban. After 10: ~6 hours. After 12: ~112 days.
A node temporarily down for maintenance (30 min) may get escalated to hours-long bans before it can prove itself healthy. Consider either capping the health-check ban tier, not escalating `ban_count` when re-banning an already-banned node, or adding a max ban duration.
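The escalation formula quoted in this finding, together with the suggested maximum-duration cap, can be written as a small pure function. The `cap` parameter models the reviewer's proposed mitigation, not existing code, and the base period used below is illustrative:

```rust
use std::time::Duration;

/// Exponential ban backoff as described in the review:
/// duration = base * e^(ban_count - 1), capped at `cap` so deterministic
/// health-check re-probes cannot drive bans to absurd lengths.
fn ban_duration(base: Duration, ban_count: u32, cap: Duration) -> Duration {
    let factor = f64::from(ban_count.saturating_sub(1)).exp();
    Duration::from_secs_f64(base.as_secs_f64() * factor).min(cap)
}
```

With a cap in place, a node down for 30 minutes of maintenance tops out at the cap instead of escalating toward hours or days of banning.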
In `packages/rs-sdk/src/sdk.rs`:
- [SUGGESTION] lines 481-500: Health check silently fails to start in FFI (mobile) SDK paths
`start_health_check()` stores the new `CancellationToken` in the mutex (line 481-484) *before* checking `tokio::runtime::Handle::try_current()` (line 487). When `build()` is called outside an entered Tokio runtime — which is exactly what happens in the FFI layer at `packages/rs-sdk-ffi/src/sdk.rs:160` where `init_or_get_runtime()` creates a runtime but never calls `runtime.enter()` before `builder.build()` — `try_current()` fails, the task is never spawned, but `is_health_check_running()` returns `true` because the stored token isn't cancelled.
This means all mobile SDK users (iOS/Android via FFI) silently get no health checking despite it being enabled by default, with an incorrect status report.
- [SUGGESTION] lines 466-514: Non-atomic `start_health_check` can orphan tasks under concurrent calls
The method acquires the mutex twice: once inside `stop_health_check()` and once to store the new token (line 481-484). Between these acquisitions, a concurrent `start_health_check()` from another `Sdk` clone can interleave, causing one spawned task's cancellation token to be overwritten. That task becomes unreachable via `stop_health_check()` — only SDK-level shutdown can stop it.
Fix: hold the mutex for the entire start operation (stop → create token → store → spawn) instead of separate lock/unlock cycles.
- [SUGGESTION] lines 488-497: Dropped `JoinHandle` silently swallows panics from health check task
The `JoinHandle` from `runtime.spawn(...)` is explicitly dropped (line 489-497). If the health check task panics (e.g., from a poisoned RwLock), the panic is silently lost and `is_health_check_running()` remains stale. For a long-running background task, consider either spawning a monitoring wrapper that logs `JoinError::is_panic()`, or storing an `AbortHandle` for cleanup.
In `packages/wasm-sdk/src/sdk.rs`:
- [SUGGESTION] lines 329-347: WASM SDK runs full health check loop but exposes no lifecycle controls to JavaScript
Despite the PR description claiming 'WASM excluded — entire module gated with `#[cfg(not(target_arch = "wasm32"))]`', the health_check module has no cfg gate (line 11 of lib.rs). The full probing loop runs on WASM via `wasm_bindgen_futures::spawn_local()` (sdk.rs:503-511) using `gloo_timers` for sleep.
This is actually fine — the code is correctly portable. But two issues:
1. The PR description should be updated to reflect that health checks ARE fully functional on WASM
2. `WasmSdk` doesn't expose `shutdown()` or `stop_health_check()` via `#[wasm_bindgen]`, so JavaScript can start a perpetual background task with no way to cancel it
FIX 1: Break out of health_check_loop when cancellation is consumed inside probe_and_update_batch. The FusedFuture becomes terminated, causing the outer select_biased! to skip the stop branch forever.

FIX 2: Only store the new CancellationToken into the mutex after confirming the tokio runtime exists and the task will be spawned. Previously, is_health_check_running() returned true in FFI paths where no task was actually running.

FIX 3: Don't re-ban already-banned nodes during health check re-probes. Each ban() call increments ban_count with exponential backoff; deterministic re-probing drove ban duration to absurd levels (112 days after 12 cycles).

FIX 4: Hold the mutex for the entire start_health_check operation to prevent concurrent calls from orphaning tasks. The previous implementation released and re-acquired the mutex, allowing interleaving.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
thepastaclaw left a comment
Code Review
Two real issues confirmed. The ban-escalation guard at health_check.rs:202 creates a Phase 2 regression where failed re-probes never re-ban expired nodes, leaving dead nodes live for ~59s per cycle. The missing Drop impl on Sdk means health-check tasks leak if dropped without calling shutdown(). All 6 prior findings are resolved.
Reviewed commit: aec5665
🔴 1 blocking | 🟡 2 suggestion(s)
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.
In `packages/rs-dapi-client/src/health_check.rs`:
- [BLOCKING] lines 196-207: Failed Phase 2 re-probes never re-ban expired nodes, leaving dead nodes live for ~59s per cycle
The `!is_banned(address)` guard at line 202 was added to prevent ban-escalation from deterministic re-probing. But `is_banned()` checks `ban_count > 0` (address_list.rs:94-96), and ALL Phase 2 candidates have `ban_count > 0` by definition (they come from `get_expired_ban_addresses()` which requires `ban_count > 0` at line 305). So `!is_banned()` is ALWAYS false for Phase 2 failures — `ban()` is never called.
Consequence: after a failed re-probe, `banned_until` stays at the old expired timestamp. `get_live_address()` (address_list.rs:231-238) checks `banned_until < now` — since the ban is expired, the dead node passes the filter and is returned to foreground traffic. `get_next_ban_expiry()` returns `None` (no future bans), so the loop sleeps for `no_ban_idle_period` (59s). The dead node is live for the entire sleep window.
Suggested fix: replace the guard with logic that refreshes the ban without escalating. For example, `unban()` + `ban()` resets `ban_count` to 0 then sets it to 1 with a base-period ban duration. Alternatively, add an `AddressStatus::refresh_ban()` method that resets `banned_until` to `now + base_period` without touching `ban_count`.
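The suggested `refresh_ban()` alternative can be modelled minimally, assuming an `AddressStatus` with the `ban_count`/`banned_until` fields the review references. The escalating `ban()` here uses a simplified linear backoff as a stand-in for the real exponential one; the point is only that `refresh_ban()` re-arms the window without touching `ban_count`:

```rust
use std::time::{Duration, Instant};

/// Minimal model of the ban state discussed in this finding.
struct AddressStatus {
    ban_count: u32,
    banned_until: Option<Instant>,
}

impl AddressStatus {
    /// Escalating ban (simplified stand-in for the real backoff).
    fn ban(&mut self, base: Duration, now: Instant) {
        self.ban_count += 1;
        self.banned_until = Some(now + base * self.ban_count);
    }

    /// Re-arm the ban after a failed re-probe: fresh base-period window,
    /// ban_count left untouched, so health checks never escalate.
    fn refresh_ban(&mut self, base: Duration, now: Instant) {
        self.banned_until = Some(now + base);
    }

    /// Mirrors get_live_address(): a node is live once its ban expired.
    fn is_live(&self, now: Instant) -> bool {
        self.banned_until.map_or(true, |until| until <= now)
    }
}
```

After a failed Phase 2 re-probe, calling `refresh_ban()` keeps the dead node out of `get_live_address()` for another base period without the escalation that the earlier guard tried (and failed) to prevent.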
- [SUGGESTION] lines 245-496: No test verifies that Phase 2 re-probe keeps persistently dead nodes banned
The existing tests cover Phase 1 startup banning and cancellation, but no test exercises the Phase 2 re-probe-after-expiry path. A test with a short ban period that waits for expiry and then verifies the dead node is still banned (not returned by `get_live_address()`) would have caught finding #1.
In `packages/rs-sdk/src/sdk.rs`:
- [SUGGESTION] lines 469-525: Health-check task leaks when Sdk is dropped without calling shutdown()
There is no `impl Drop for Sdk`. `CancellationToken::drop()` does NOT cancel the token (tokio-util semantics). If the Sdk is dropped without an explicit `shutdown()` call, the spawned health-check task continues as a zombie.
At the FFI boundary this is concrete: `dash_sdk_destroy` (rs-sdk-ffi/src/sdk.rs:521-524) does `Box::from_raw(handle)` which drops `SDKWrapper` → drops `Sdk`, but never calls `shutdown()`. Since `runtime` is `Arc<Runtime>`, the runtime survives and the orphaned task keeps running.
Mitigating factor: `health_check_cancel` is a child of `cancel_token`, so calling `shutdown()` does cancel it. The issue is only when `shutdown()` is skipped.
Options: (1) Add `impl Drop for Sdk { fn drop(&mut self) { self.cancel_token.cancel(); } }` — but since Sdk is Clone, this cancels on any clone's drop. (2) Call `sdk.shutdown()` in `dash_sdk_destroy` before dropping. (3) Document that `shutdown()` must be called.
Wrap the spawned run_health_check JoinHandle in a monitor task that logs panics/failures and cancels the health-check token, so is_health_check_running() reflects the actual task state instead of going stale on panic. Also expand the no_ban_idle_period doc comment to warn that the value must be less than the effective ban duration for the re-probe guarantee to hold.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
thepastaclaw
left a comment
Code Review
Small incremental addressing prior review feedback. Adds a panic-monitoring wrapper around the health check task and improves documentation for the no_ban_idle_period config field. Both changes are correct and well-implemented.
Reviewed commit: 5c7825d
… Drop impl

FIX 1: Phase 2 re-probes now correctly re-ban failed nodes. The old `!is_banned()` guard was always false for Phase 2 candidates (they all have `ban_count > 0` from `get_expired_ban_addresses`), so `ban()` was never called and dead nodes re-entered live rotation. Replace with `unban()` + `ban()` to reset `ban_count` to 1 with a fresh base-period duration.

FIX 2: Add `test_phase2_reprobe_failure_rebans_node` that exercises the Phase 2 re-probe-after-expiry path with a short ban period (200ms).

FIX 3: Add `Arc<()>`-based `Drop` impl for `Sdk` that cancels background tasks only when the last clone is dropped, preventing both zombie health-check tasks and premature cancellation when intermediate clones are dropped.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
thepastaclaw
left a comment
Code Review
Incremental push fixes the Phase 2 re-ban regression (dead nodes escaping ban after failed re-probe) and adds automatic background-task cleanup on last Sdk clone drop. Both fixes target real bugs. Two concurrency gaps exist: (1) the unban()+ban() two-step creates a brief window where an unhealthy node appears live, and (2) the Arc::strong_count check in Drop has a TOCTOU race that can cause missed cancellation on concurrent last-clone drops. Both are low-probability in practice but have clean fixes.
Reviewed commit: dd8a8eb
🟡 2 suggestion(s)
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.
In `packages/rs-dapi-client/src/health_check.rs`:
- [SUGGESTION] lines 202-208: `unban()`+`ban()` two-step creates window where dead node appears live
`AddressList::unban()` and `AddressList::ban()` each acquire and release the `RwLock` independently. Between the two calls, a concurrent `get_live_address()`/`get_live_addresses()` call can observe `banned_until = None` and return the known-unhealthy node as a live candidate — potentially routing one request to a dead node before the ban is re-applied.
The window is very brief (sub-microsecond in the common single-threaded case), and the existing design already documents an accepted race at ban expiry (lines 62-66). However, this gap is strictly unnecessary and can be closed with an atomic `reset_ban()` method that holds a single write lock across both operations.
All 8 review agents independently flagged this issue.
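A single-write-lock `reset_ban()` could look roughly like this (a sketch under assumed `AddressList` internals; the map layout and field names are illustrative, not the crate's real types):

```rust
use std::collections::HashMap;
use std::sync::RwLock;
use std::time::{Duration, Instant};

// Illustrative stand-in for AddressList internals; names are assumptions.
struct Status {
    banned_until: Option<Instant>,
    ban_count: u32,
}

struct AddressList {
    inner: RwLock<HashMap<String, Status>>,
    base_ban_period: Duration,
}

impl AddressList {
    /// Atomically reset ban history and apply a fresh base-period ban.
    /// A single write lock covers both steps, so no reader can observe
    /// the address as unbanned in between.
    fn reset_ban(&self, address: &str) {
        let mut map = self.inner.write().expect("lock poisoned");
        if let Some(status) = map.get_mut(address) {
            status.ban_count = 1;
            status.banned_until = Some(Instant::now() + self.base_ban_period);
        }
        // Nonexistent addresses are a no-op.
    }
}

fn main() {
    let list = AddressList {
        inner: RwLock::new(HashMap::from([(
            "1.2.3.4:443".to_string(),
            Status { banned_until: None, ban_count: 5 },
        )])),
        base_ban_period: Duration::from_secs(60),
    };
    list.reset_ban("1.2.3.4:443");
    let map = list.inner.read().unwrap();
    assert_eq!(map["1.2.3.4:443"].ban_count, 1);
    assert!(map["1.2.3.4:443"].banned_until.is_some());
}
```

Because readers (`get_live_address()` lookups) must acquire the same `RwLock`, there is no interleaving point where the node appears live.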
In `packages/rs-sdk/src/sdk.rs`:
- [SUGGESTION] lines 167-176: `Arc::strong_count` TOCTOU — consider `Arc::into_inner` for race-free last-owner detection
If two threads drop the final two `Sdk` clones concurrently, both destructors observe `strong_count == 2` (the Arc isn't decremented until after `drop()` returns), so neither calls `cancel()`. Background tasks then leak until the tokio runtime shuts down.
The comment acknowledges this is "not perfectly atomic," but since MSRV is 1.92 and `Arc::into_inner` has been stable since Rust 1.70, a race-free alternative is available:
```rust
impl Drop for Sdk {
    fn drop(&mut self) {
        let shared = std::mem::replace(&mut self._shared_ref, Arc::new(()));
        if Arc::into_inner(shared).is_some() {
            self.cancel_token.cancel();
        }
    }
}
```

`Arc::into_inner` returns `Some` only when the calling thread holds the last strong reference, with no TOCTOU gap. The replacement `Arc::new(())` is trivial (ZST, no heap allocation).
</details>
Add `AddressList::reset_ban()` that holds a single write lock to atomically reset ban history and apply a fresh base-period ban, closing the TOCTOU window between separate `unban()` + `ban()` calls. Use it in health_check.rs. Replace the `Arc::strong_count` TOCTOU check in `Sdk::drop` with `Arc::into_inner`, which is race-free and stable since Rust 1.70.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
thepastaclaw
left a comment
Code Review
This push cleanly addresses both prior review suggestions:
- TOCTOU race in `unban()`+`ban()` — replaced with a new atomic `reset_ban()` method that acquires a single write lock, resets `ban_count` to 1, and applies a fresh base-period ban. Two unit tests cover the happy path and the no-op case for nonexistent addresses.
- `Arc::strong_count` TOCTOU in `Sdk::drop` — replaced with `Arc::into_inner`, which is race-free by specification (returns `Some` only for the true last owner). The `std::mem::replace` with `Arc::new(())` uses a zero-sized type, so there is no allocation concern.
Both fixes are well-targeted, minimal, and correct. No new issues introduced.
Reviewed commit: 4154f16
Closing after discussion with @shumkov, we will fall back to TrustedHttpContextProvider.

…heck

# Conflicts:
#	packages/rs-dapi-client/src/address_list.rs

✅ Review complete (commit b076885)
✅ DashSDKFFI.xcframework built for this PR.
SwiftPM (host the zip at a stable URL, then use):

```swift
.binaryTarget(
    name: "DashSDKFFI",
    url: "https://your.cdn.example/DashSDKFFI.xcframework.zip",
    checksum: "5143f7afbefe2bc88e1a504ed2fa2e961ab1f673908a82ae3870067c482b5a79"
)
```

Xcode manual integration:
thepastaclaw
left a comment
Code Review
Well-designed health check system with clean async architecture and proper cancellation handling. The sleep-to-expiry model is efficient and the documented race window is an honest, acceptable tradeoff. Two issues worth addressing: the WASM build enables health checks by default but exposes no explicit shutdown API (relying on JS GC for cleanup), and the idle sleep can miss bans created mid-sleep, weakening the re-probe-before-reuse guarantee.
Reviewed commit: b076885
🟡 3 suggestion(s) | 💬 1 nitpick(s)
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.
In `packages/rs-dapi-client/src/health_check.rs`:
- [SUGGESTION] lines 42-46: Idle sleep can miss bans created mid-sleep, weakening re-probe-before-reuse guarantee
When no bans exist, the watcher sleeps for `no_ban_idle_period` (default 59s). If a foreground request bans a node at t=58s for the default 60s (expiring at t=118s), the watcher wakes at t=59s, sees the next expiry at t=118s, and sleeps for another 59s. The node's ban expires at t=118s and it re-enters `get_live_address()` rotation before the watcher re-probes it at ~t=118s.
The doc comment on `no_ban_idle_period` says it ensures the watcher wakes *before* bans expire, but this only holds for bans that already exist when the sleep begins. Bans created during the sleep are missed until the next wake cycle.
This isn't critical — the documented race window in `run_health_check()` already acknowledges that nodes can briefly re-enter rotation before re-probe, and the reactive ban logic handles nodes that fail on actual requests. But the doc comment on `no_ban_idle_period` overpromises. Consider either: (a) adjusting the doc comment to clarify the guarantee only applies to pre-existing bans, or (b) using a notify/channel from the ban path to interrupt the idle sleep.
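The notify/channel option in (b) boils down to an interruptible idle sleep. A std-only sketch of the pattern follows (the real async code would use something like tokio's `Notify` rather than a `Condvar`; all names here are illustrative):

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Sleep up to `idle`, but wake immediately if the ban path signals first.
/// Returns true if woken by a new ban (not by the timeout).
fn idle_sleep_interruptible(pair: &(Mutex<bool>, Condvar), idle: Duration) -> bool {
    let (lock, cvar) = pair;
    let guard = lock.lock().unwrap();
    let (_guard, res) = cvar
        .wait_timeout_while(guard, idle, |banned| !*banned)
        .unwrap();
    !res.timed_out()
}

fn main() {
    let pair = Arc::new((Mutex::new(false), Condvar::new()));
    let signaller = Arc::clone(&pair);

    // Ban path: a foreground request bans a node mid-sleep and signals.
    thread::spawn(move || {
        thread::sleep(Duration::from_millis(50));
        *signaller.0.lock().unwrap() = true;
        signaller.1.notify_one();
    });

    // Watcher: long idle sleep, interrupted by the ban notification,
    // so the next wake can schedule a re-probe before the ban expires.
    let woken_early = idle_sleep_interruptible(&pair, Duration::from_secs(59));
    assert!(woken_early);
}
```

With this shape, a ban created mid-sleep wakes the watcher immediately, so the sleep-to-expiry calculation is redone with the new ban included.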
In `packages/rs-sdk/src/sdk.rs`:
- [SUGGESTION] lines 543-553: WASM health check has no JS-visible shutdown path — relies on GC for cleanup
On WASM, `SdkBuilder` enables the health check by default and `start_health_check()` spawns it via `spawn_local()`. The spawned task is only stopped when the `CancellationToken` is cancelled — which happens in `Sdk::shutdown()` or on `Drop`. But `WasmSdk` doesn't export `shutdown()` or `stop_health_check()` to JS, so the only cleanup path is the wasm-bindgen-generated `free()` method, which JS callers aren't required to invoke (it depends on GC finalization).
A forgotten `WasmSdk` instance can leave a permanent background probe loop alive for the lifetime of the page/worker. Additionally, if `run_health_check()` panics on WASM, the cancel token is never cancelled (no monitor task like native has), leaving `is_health_check_running()` permanently stale.
Consider either: (a) exporting a `shutdown()` or `stop_health_check()` method on `WasmSdkBuilder`/`WasmSdk`, or (b) defaulting health checks to off on WASM targets.
- [SUGGESTION] lines 511-538: Health check silently disabled when SDK is built outside a Tokio runtime
`SdkBuilder::default()` enables health checks, but `start_health_check()` on native only spawns the task if `tokio::runtime::Handle::try_current()` succeeds. If the SDK is constructed in synchronous setup code (before entering the Tokio runtime), the health check is silently skipped with just a `tracing::warn!`. The builder advertises health checks as on-by-default, but whether they actually start depends on call-site context.
The code already handles this by not storing the cancel token when no runtime is found (so `is_health_check_running()` returns false), which is correct. But the silent skip could surprise users. Consider either returning an error from `build()` when health checks are enabled but no runtime is available, or documenting this behavior prominently.
```rust
/// period shorter than this value, nodes may re-enter rotation before the health check
/// can re-probe them. Adjust accordingly.
pub no_ban_idle_period: Duration,
}
```
🟡 Suggestion: Idle sleep can miss bans created mid-sleep, weakening re-probe-before-reuse guarantee
When no bans exist, the watcher sleeps for no_ban_idle_period (default 59s). If a foreground request bans a node at t=58s for the default 60s (expiring at t=118s), the watcher wakes at t=59s, sees the next expiry at t=118s, and sleeps for another 59s. The node's ban expires at t=118s and it re-enters get_live_address() rotation before the watcher re-probes it at ~t=118s.
The doc comment on no_ban_idle_period says it ensures the watcher wakes before bans expire, but this only holds for bans that already exist when the sleep begins. Bans created during the sleep are missed until the next wake cycle.
This isn't critical — the documented race window in run_health_check() already acknowledges that nodes can briefly re-enter rotation before re-probe, and the reactive ban logic handles nodes that fail on actual requests. But the doc comment on no_ban_idle_period overpromises. Consider either: (a) adjusting the doc comment to clarify the guarantee only applies to pre-existing bans, or (b) using a notify/channel from the ban path to interrupt the idle sleep.
source: ['codex']
```rust
{
    *cancel_guard = new_cancel;
    drop(cancel_guard);
    wasm_bindgen_futures::spawn_local(rs_dapi_client::health_check::run_health_check(
        hc_address_list,
        hc_pool,
        config,
        hc_token,
        hc_ca_cert,
    ));
}
```
🟡 Suggestion: WASM health check has no JS-visible shutdown path — relies on GC for cleanup
On WASM, SdkBuilder enables the health check by default and start_health_check() spawns it via spawn_local(). The spawned task is only stopped when the CancellationToken is cancelled — which happens in Sdk::shutdown() or on Drop. But WasmSdk doesn't export shutdown() or stop_health_check() to JS, so the only cleanup path is the wasm-bindgen-generated free() method, which JS callers aren't required to invoke (it depends on GC finalization).
A forgotten WasmSdk instance can leave a permanent background probe loop alive for the lifetime of the page/worker. Additionally, if run_health_check() panics on WASM, the cancel token is never cancelled (no monitor task like native has), leaving is_health_check_running() permanently stale.
Consider either: (a) exporting a shutdown() or stop_health_check() method on WasmSdkBuilder/WasmSdk, or (b) defaulting health checks to off on WASM targets.
source: ['codex']
```rust
if let Ok(runtime) = tokio::runtime::Handle::try_current() {
    *cancel_guard = new_cancel;
    drop(cancel_guard);
    let hc_cancel_for_monitor = hc_token.clone();
    let handle = runtime.spawn(rs_dapi_client::health_check::run_health_check(
        hc_address_list,
        hc_pool,
        config,
        hc_token,
        hc_ca_cert,
    ));
    // Monitor task: logs panics and cancels the token so
    // is_health_check_running() reflects the actual state.
    drop(runtime.spawn(async move {
        if let Err(e) = handle.await {
            if e.is_panic() {
                tracing::error!("health check task panicked: {:?}", e);
            } else {
                tracing::warn!("health check task failed: {:?}", e);
            }
            hc_cancel_for_monitor.cancel();
        }
    }));
} else {
    // No runtime: don't store the token so is_health_check_running()
    // correctly returns false.
    drop(cancel_guard);
    tracing::warn!("no Tokio runtime found, cannot start health check");
```
🟡 Suggestion: Health check silently disabled when SDK is built outside a Tokio runtime
SdkBuilder::default() enables health checks, but start_health_check() on native only spawns the task if tokio::runtime::Handle::try_current() succeeds. If the SDK is constructed in synchronous setup code (before entering the Tokio runtime), the health check is silently skipped with just a tracing::warn!. The builder advertises health checks as on-by-default, but whether they actually start depends on call-site context.
The code already handles this by not storing the cancel token when no runtime is found (so is_health_check_running() returns false), which is correct. But the silent skip could surprise users. Consider either returning an error from build() when health checks are enabled but no runtime is available, or documenting this behavior prominently.
source: ['codex']
```rust
pub fn start_health_check(&self, config: rs_dapi_client::HealthCheckConfig) {
    let mut cancel_guard = self
        .health_check_cancel
        .lock()
        .expect("health_check_cancel mutex poisoned");

    // Stop existing health check (atomic: no mutex release between stop and start)
    cancel_guard.cancel();

    let new_cancel = self.cancel_token.child_token();
```
💬 Nitpick: Restarting health check doesn't await previous task completion
start_health_check() cancels the previous token and immediately spawns a new task without waiting for the old one to finish. In theory, the old task could still be inside probe_and_update_batch() when the new task starts its startup probe, causing concurrent ban/unban mutations on the shared AddressList.
In practice the impact is minimal: the old task will observe cancellation on its next select_biased! check and exit promptly, and ban/unban operations are individually atomic. But for correctness, consider storing the JoinHandle and awaiting it before spawning the replacement, or adding a generation counter to discard stale probe results.
source: ['codex']
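The generation-counter alternative mentioned in the nitpick above can be sketched as follows (illustrative names, not the PR's actual code):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

// Each start_health_check() bumps the generation; probe results carrying
// an older generation are discarded instead of mutating shared state.
struct HealthCheck {
    generation: Arc<AtomicU64>,
}

impl HealthCheck {
    /// Start (or restart) a health-check task; returns its generation.
    fn restart(&self) -> u64 {
        self.generation.fetch_add(1, Ordering::SeqCst) + 1
    }

    /// Whether a probe result from `generation` should still be applied.
    fn is_current(&self, generation: u64) -> bool {
        self.generation.load(Ordering::SeqCst) == generation
    }
}

fn main() {
    let hc = HealthCheck { generation: Arc::new(AtomicU64::new(0)) };
    let old = hc.restart(); // task A starts with generation 1
    assert!(hc.is_current(old));
    let new = hc.restart(); // restart: task B starts with generation 2
    assert!(!hc.is_current(old)); // task A's late probe results are dropped
    assert!(hc.is_current(new));
}
```

A stale task checks `is_current()` before each ban/unban mutation, so concurrent old and new tasks cannot interleave conflicting updates.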
thepastaclaw
left a comment
Code Review
This PR’s health-check loop is close, but three core behaviors in the checked-out patch do not match the feature contract it introduces. The startup probe is still asynchronous relative to build(), the default-on feature silently never starts in common native construction flows, and failed re-probes flatten ban history instead of preserving the existing backoff semantics.
Reviewed commit: b076885
🔴 3 blocking
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.
In `packages/rs-sdk/src/sdk.rs`:
- [BLOCKING] lines 1087-1088: The advertised startup probe is still fire-and-forget, so the first request can race dead nodes
`SdkBuilder::build()` enables the feature by calling `sdk.start_health_check(hc_config)`, but that method only spawns `run_health_check()` and returns immediately. Phase 1 probing then happens asynchronously inside `run_health_check()` (`packages/rs-dapi-client/src/health_check.rs`, lines 110-119), so `build()` can return before any address has been checked or banned. A caller that sends a request right after constructing the SDK can still select an unprobed dead node from the existing address list, which contradicts the PR’s promised startup behavior that the first user request already sees a cleaned list.
- [BLOCKING] lines 511-538: Default-on health checks silently never start when the SDK is built outside an active Tokio runtime
The feature is configured as enabled by default, but `start_health_check()` only stores the token and spawns the task if `tokio::runtime::Handle::try_current()` succeeds. When `SdkBuilder::build()` is called from synchronous setup code, this branch only logs `no Tokio runtime found, cannot start health check` and returns, with no retry path once the caller later enters async code. That makes the same default configuration behave differently depending on where construction happens, and in a common native usage pattern the promised background health check is just absent.
In `packages/rs-dapi-client/src/health_check.rs`:
- [BLOCKING] lines 201-206: Failed re-probes flatten ban history instead of preserving the existing backoff semantics
On re-probe failure the loop calls `address_list.reset_ban(address)`, and `reset_ban()` (`packages/rs-dapi-client/src/address_list.rs`, lines 172-176) forcibly sets `ban_count = 1` with a fresh base-period ban. That means a node that stays dead across multiple expiry cycles is retried every base ban period forever instead of following the existing exponential backoff implemented by `AddressStatus::ban()`. The patch comment explicitly says this prevents escalation, but the PR description says ban-expiry re-probes should reuse the existing ban behavior, so the shipped semantics no longer match the feature contract and create constant background churn against permanently dead nodes.
```rust
if let Some(hc_config) = health_check_config {
    sdk.start_health_check(hc_config);
```
🔴 Blocking: The advertised startup probe is still fire-and-forget, so the first request can race dead nodes
SdkBuilder::build() enables the feature by calling sdk.start_health_check(hc_config), but that method only spawns run_health_check() and returns immediately. Phase 1 probing then happens asynchronously inside run_health_check() (packages/rs-dapi-client/src/health_check.rs, lines 110-119), so build() can return before any address has been checked or banned. A caller that sends a request right after constructing the SDK can still select an unprobed dead node from the existing address list, which contradicts the PR’s promised startup behavior that the first user request already sees a cleaned list.
source: ['codex']
```rust
if let Ok(runtime) = tokio::runtime::Handle::try_current() {
    *cancel_guard = new_cancel;
    drop(cancel_guard);
    let hc_cancel_for_monitor = hc_token.clone();
    let handle = runtime.spawn(rs_dapi_client::health_check::run_health_check(
        hc_address_list,
        hc_pool,
        config,
        hc_token,
        hc_ca_cert,
    ));
    // Monitor task: logs panics and cancels the token so
    // is_health_check_running() reflects the actual state.
    drop(runtime.spawn(async move {
        if let Err(e) = handle.await {
            if e.is_panic() {
                tracing::error!("health check task panicked: {:?}", e);
            } else {
                tracing::warn!("health check task failed: {:?}", e);
            }
            hc_cancel_for_monitor.cancel();
        }
    }));
} else {
    // No runtime: don't store the token so is_health_check_running()
    // correctly returns false.
    drop(cancel_guard);
    tracing::warn!("no Tokio runtime found, cannot start health check");
```
🔴 Blocking: Default-on health checks silently never start when the SDK is built outside an active Tokio runtime
The feature is configured as enabled by default, but start_health_check() only stores the token and spawns the task if tokio::runtime::Handle::try_current() succeeds. When SdkBuilder::build() is called from synchronous setup code, this branch only logs no Tokio runtime found, cannot start health check and returns, with no retry path once the caller later enters async code. That makes the same default configuration behave differently depending on where construction happens, and in a common native usage pattern the promised background health check is just absent.
source: ['codex']
```rust
Err(error) => {
    // reset_ban atomically resets ban history and applies a fresh
    // base-period ban, preventing exponential backoff escalation
    // while closing the unban+ban race window.
    address_list.reset_ban(address);
    tracing::debug!(%address, error = %error, "health check: node is unhealthy, re-banned");
```
🔴 Blocking: Failed re-probes flatten ban history instead of preserving the existing backoff semantics
On re-probe failure the loop calls address_list.reset_ban(address), and reset_ban() (packages/rs-dapi-client/src/address_list.rs, lines 172-176) forcibly sets ban_count = 1 with a fresh base-period ban. That means a node that stays dead across multiple expiry cycles is retried every base ban period forever instead of following the existing exponential backoff implemented by AddressStatus::ban(). The patch comment explicitly says this prevents escalation, but the PR description says ban-expiry re-probes should reuse the existing ban behavior, so the shipped semantics no longer match the feature contract and create constant background churn against permanently dead nodes.
source: ['codex']
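For contrast, the exponential backoff the finding wants preserved typically has this shape (a hypothetical formula for illustration; the real `AddressStatus::ban()` computation may differ):

```rust
use std::time::Duration;

// Hypothetical backoff: base period doubled per consecutive failure,
// capped so permanently dead nodes are still retried occasionally.
fn ban_duration(base: Duration, ban_count: u32, cap: Duration) -> Duration {
    let exp = ban_count.saturating_sub(1).min(16); // avoid shift overflow
    base.checked_mul(1u32 << exp)
        .map(|d| d.min(cap))
        .unwrap_or(cap)
}

fn main() {
    let base = Duration::from_secs(60);
    let cap = Duration::from_secs(3600);
    assert_eq!(ban_duration(base, 1, cap), Duration::from_secs(60));
    assert_eq!(ban_duration(base, 2, cap), Duration::from_secs(120));
    assert_eq!(ban_duration(base, 3, cap), Duration::from_secs(240));
    assert_eq!(ban_duration(base, 10, cap), cap); // 60s * 512 exceeds the cap
}
```

Under such a scheme a node dead across many cycles is retried ever less often, whereas `reset_ban()` pins every retry to the base period.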
Issue being fixed or feature implemented
Many DAPI nodes on the network are broken, causing users to hit timeouts and retry errors before the SDK discovers they're dead. Currently, node health is purely reactive — nodes are only banned when actual user requests fail.
This PR adds an opt-in background health check system with two behaviors:

- Startup probe (Phase 1): when the SDK is built, all known addresses are probed concurrently and dead nodes are banned.
- Ban-expiry re-probe (Phase 2): when a ban expires, the node is re-probed before it is treated as healthy again.

No periodic interval probes. The existing reactive ban logic handles nodes that break during normal operations.
What was done?
`rs-dapi-client` changes:

- `address_list.rs` — Added `get_all_addresses()`, `get_next_ban_expiry()`, `get_expired_ban_addresses()` methods and a `ban_count()` getter on `AddressStatus`
- `dapi_client.rs` — Added `connection_pool()` and `settings()` accessors on `DapiClient`
- `health_check.rs` (new) — `HealthCheckConfig` struct and `run_health_check()` async function implementing:
  - startup probe via `GetStatusRequest` (lightest gRPC call)
  - concurrent probing with `futures::stream::for_each_concurrent`
  - cancellation via `CancellationToken` + `tokio::select!`
- `lib.rs` — Module registration gated with `#[cfg(not(target_arch = "wasm32"))]`
- `Cargo.toml` — Added `tokio-util` dependency for `CancellationToken`

`dash-sdk` changes:

- `sdk.rs` — Added `health_check` and `health_check_config` fields to `SdkBuilder`, builder methods `with_health_check()` and `with_health_check_config()`, spawning logic in `build()`, `_health_check_handle` on `Sdk`, and a `Drop` impl that aborts the task

Design decisions:

- `SdkBuilder::with_health_check(false)` to opt out
- `#[cfg(not(target_arch = "wasm32"))]` gating
- `AddressList::ban()` handles exponential backoff automatically
- `ban_failed_address: false` to avoid interfering with reactive banning
- `get_live_address()` may return a previously-banned node (documented in code)

How Has This Been Tested?
Unit tests added:
`address_list.rs` (5 new tests):

- `test_get_all_addresses_returns_both_banned_and_unbanned`
- `test_get_next_ban_expiry_returns_earliest`
- `test_get_next_ban_expiry_none_when_no_bans`
- `test_get_expired_ban_addresses`
- `test_address_status_ban_count`

`health_check.rs` (8 tests):

- `test_health_check_config_default` — Config defaults match spec
- `test_cancel_token_stops_loop` — Cancellation breaks the loop
- `test_probe_unreachable_node_returns_false` — Probe with non-routable address
- `test_startup_probe_bans_unreachable` — Startup probe bans dead nodes
- `test_startup_probe_empty_address_list` — Empty list handled gracefully
- `test_startup_probe_bans_all_unreachable_nodes` — All-dead scenario
- `test_probe_settings_do_not_ban_on_failure` — Probe settings correctness
- `test_cancel_during_ban_watch_phase` — Cancellation during Phase 2

Verification commands:
Breaking Changes
None. Health check is additive and backward-compatible. Existing `SdkBuilder` usage works unchanged (health check is enabled by default but doesn't alter existing behavior).

Checklist:
🤖 Co-authored by Claudius the Magnificent AI Agent
Summary by CodeRabbit
New Features
Tests