
refactor: simplify architecture of quorum observer and quorum participant#7247

Open
knst wants to merge 16 commits into dashpay:develop from knst:refactor-llmq-observer-parent

Conversation

@knst
Collaborator

@knst knst commented Mar 24, 2026

Issue being fixed or feature implemented

It's a follow-up for #7066 and related PRs

This PR reduces the mental overhead of reading quorum-related code by significantly cutting down the levels of abstraction and the chains of class inheritance.

This PR is a direct dependency of #7234 (kernel / chainstate project)

Performance is not expected to improve by any noticeable margin.

Compilation time is expected to improve marginally due to fewer files (-2) to compile and fewer transitive includes through src/llmq/quorumsman.h.

What was done?

These changes were spotted during #7234 and extracted into their own PR.

  • Remove QuorumObserverParent; decouple QuorumParticipant from QuorumObserver inheritance — they become independent implementations of a common QuorumRole interface.
  • Merge QuorumObserver and ObserverContext into a single class, then remove the transitional QuorumRoleBase by folding it into QuorumRole.
  • Rename src/llmq/observer/context.{h,cpp} to src/llmq/observer.{h,cpp} — flatten the now-single-file subdirectory.
  • Inline QuorumParticipant into ActiveContext — it only applies to active masternode mode; src/active/quorums.{h,cpp} deleted.
  • Remove four friend declarations from CDKGSession (ActiveDKGSession, ActiveDKGSessionHandler, CDKGSessionHandler, CDKGSessionManager); affected members changed from private to protected.
  • Extract network code from CQuorumManager and ActiveContext into new NetQuorum (src/llmq/net_quorum.{h,cpp}) — handles QGETDATA/QDATA processing, quorum peer connections, data recovery, and periodic cleanup.

Class hierarchy — before vs. after

*diagrams generated by Claude

Before:
QuorumObserverParent (interface, implemented by CQuorumManager)

QuorumObserver (base class with shared state)
├── ObserverContext (observer mode, also CValidationInterface)
└── QuorumParticipant (active mode, in src/active/quorums.h)

After:
QuorumRole (pure virtual interface in quorumsman.h)
├── ObserverContext (observer mode, in src/llmq/observer.h)
└── ActiveContext (active mode, QuorumParticipant inlined)
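The resulting shape can be sketched as a plain C++ hierarchy. This is an illustrative sketch only: the role and class names come from the PR description, but the hook names, signatures, and bodies are assumptions, not the actual Dash Core declarations.

```cpp
#include <string>

namespace llmq {

// Pure virtual interface (declared in quorumsman.h after the refactor).
class QuorumRole
{
public:
    virtual ~QuorumRole() = default;
    virtual std::string Name() const = 0; // illustrative hook
    virtual void UpdatedBlockTip() = 0;   // illustrative hook
};

// Observer mode (src/llmq/observer.h): watches quorums, no masternode duties.
class ObserverContext : public QuorumRole
{
public:
    std::string Name() const override { return "observer"; }
    void UpdatedBlockTip() override { /* observer-only bookkeeping */ }
};

// Active mode (src/active/context.h): former QuorumParticipant folded in.
class ActiveContext : public QuorumRole
{
public:
    std::string Name() const override { return "active"; }
    void UpdatedBlockTip() override { /* DKG participation, data recovery */ }
};

} // namespace llmq
```

Callers hold a QuorumRole* and do not care which concrete role is behind it, which is what lets CQuorumManager drop the old QuorumObserverParent interface.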

How Has This Been Tested?

Run unit & functional tests

Breaking Changes

N/A

Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have added or updated relevant unit/integration/functional/e2e tests
  • I have made corresponding changes to the documentation
  • I have assigned this pull request to a milestone

@knst knst added this to the 24 milestone Mar 24, 2026
@github-actions

github-actions bot commented Mar 24, 2026

⚠️ Potential Merge Conflicts Detected

This PR has potential conflicts with the following open PRs:

Please coordinate with the authors of these PRs to avoid merge conflicts.

@coderabbitai

coderabbitai bot commented Mar 24, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

The patch replaces the previous QuorumObserver/QuorumParticipant model with a new llmq::QuorumRole abstraction and a NetQuorum network handler. CQuorumManager no longer depends on observer inheritance and exposes request/validation APIs (RegisterDataRequest, ValidateDataResponse, GetCachedMutableQuorum, WriteContributions). ActiveContext and ObserverContext now implement QuorumRole hooks; many observer/quorum observer sources and headers were removed or consolidated. CDKGSession gained protected visibility and hash-based lookup accessors. Validation/initialization callbacks were added to dispatch an InitializeCurrentBlockTip callback. Build sources and exported headers were updated accordingly.

Sequence Diagram(s)

sequenceDiagram
    participant Chain as Blockchain
    participant NetQuorum as llmq::NetQuorum
    participant QMan as CQuorumManager
    participant Worker as WorkerPool
    participant Peers as Network/Connman
    participant DB as QuorumDataDB

    Chain->>NetQuorum: InitializeCurrentBlockTip(tip, ibd) / UpdatedBlockTip(...)
    NetQuorum->>QMan: CheckQuorumConnections(llmqParams, pindexNew)
    QMan-->>NetQuorum: quorum list / connection decisions
    NetQuorum->>NetQuorum: TriggerQuorumDataRecoveryThreads(quorum)
    NetQuorum->>Worker: Start DataRecoveryThread(request missing data)
    Worker->>Peers: Send QGETDATA requests
    Peers-->>NetQuorum: Receive QDATA responses
    NetQuorum->>QMan: ValidateDataResponse(requestKey, response)
    alt response OK
        NetQuorum->>DB: WriteContributions(quorum)
        NetQuorum->>NetQuorum: Update quorum secret share
    else mismatch / duplicate
        NetQuorum->>Peers: Treat as invalid / ignore
    end

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 16.49%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
✅ Passed checks (2 passed)
  • Title check: ✅ Passed. The title clearly describes the main change: simplifying the architecture of quorum observer and quorum participant through abstraction and inheritance reduction.
  • Description check: ✅ Passed. The PR description clearly outlines the refactoring goal of simplifying quorum observer and participant architecture by removing inheritance layers and creating a unified QuorumRole interface.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (1)
src/llmq/dkgsession.h (1)

281-339: Prefer narrower hooks over widening most of the session state to protected.

This trades a small set of explicit friends for subclass access to invCs, the message maps, and most of the mutable DKG session state. I'd keep as much of this private as possible and add a few targeted hooks/accessors for the remaining subclass needs, so the locking and ordering invariants stay localized.
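One possible shape for such a targeted hook, sketched with simplified stand-in types (std::string instead of uint256 and CDKGContribution, a plain std::mutex instead of the annotated invCs). This is an assumption-laden illustration, not the actual CDKGSession code:

```cpp
#include <map>
#include <mutex>
#include <optional>
#include <string>

class CDKGSession
{
public:
    // Narrow, thread-safe accessor: subclasses get the one lookup they
    // need without seeing the map, the lock, or the ordering invariants.
    std::optional<std::string> TryGetContribution(const std::string& hash) const
    {
        std::lock_guard<std::mutex> lock(invCs);
        const auto it = contributions.find(hash);
        if (it == contributions.end()) return std::nullopt;
        return it->second;
    }

    void AddContribution(const std::string& hash, const std::string& contrib)
    {
        std::lock_guard<std::mutex> lock(invCs);
        contributions.emplace(hash, contrib);
    }

private:
    mutable std::mutex invCs;                         // locking stays localized
    std::map<std::string, std::string> contributions; // hash -> contribution
};
```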

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmq/dkgsession.h` around lines 281 - 339, Make the large protected block
private and provide narrow accessor/hooks for subclasses instead of exposing
most session state; keep invCs, cs_pending, and the message maps (contributions,
complaints, justifications, prematureCommitments, validCommitments) private and
add specific thread-safe methods (e.g., bool TryGetContribution(uint256 hash,
CDKGContribution& out), void ForEachContributionLocked(fn),
AddComplaint/RemoveComplaint, ValidateAndMarkCommitment,
GetPendingContributionVerifications(), etc.) that acquire the proper locks and
expose only the minimal operations subclasses need; similarly keep members,
membersMap, relayMembers, vvecContribution, receivedVvecs,
vecEncryptedContributions, m_sk_contributions, myProTxHash/myId/myIdx private
and add targeted getters/setters or iteration hooks (e.g., GetMemberByIndex,
FindMemberId, AppendEncryptedContribution) to preserve invariants and localize
locking around invCs and cs_pending while removing the broad protected
visibility.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/active/context.cpp`:
- Line 40: The file uses standard algorithms (std::sort and std::ranges::any_of)
but only includes <ranges>, which risks build breakage; add a direct include for
<algorithm> at the top of the file (alongside the existing `#include` <ranges>) so
that usages in functions referencing std::sort and std::ranges::any_of are
guaranteed to compile across STL implementations.

In `@src/llmq/observer.cpp`:
- Around line 35-38: The destructor ObserverContext::~ObserverContext is not
formatted per project style; run clang-format (or apply the repository's
formatting tool) on the hunk containing ObserverContext::~ObserverContext and
the call to m_qman.DisconnectManagers() so the braces, indentation and spacing
match the project's clang-format rules and the diff-format job will pass.

In `@src/llmq/quorumsman.cpp`:
- Around line 772-775: The QuorumRole destructor currently calls Stop()
(QuorumRole::~QuorumRole()) which can run after derived members are destroyed
and allows ActiveContext's workerPool jobs to dereference torn-down derived
state; remove the Stop() call from the base destructor and instead
enforce/verify that shutdown happened earlier by asserting or logging if Stop()
wasn't already invoked. Locate where derived classes or owners create/own
ActiveContext/workerPool and ensure they call Stop() (or otherwise quiesce
workerPool) before derived member teardown; add an assertion in
QuorumRole::~QuorumRole() that Stop() has already been called (or that the
workerPool is quiesced) to fail-fast if shutdown ordering is violated. Ensure
any code paths that currently rely on the base destructor to stop workers are
updated to call Stop() explicitly before destruction.
- Around line 913-918: The loop currently increments nTries before calling
m_qman.IsDataRequestPending and then continues without hitting the 1s sleep,
which burns attempts on pending requests; fix by not incrementing nTries until
you actually make a request: compute pCurrentMemberHash using (start_offset +
nTries) without ++, call m_qman.IsDataRequestPending(*pCurrentMemberHash,...),
and only increment nTries (or advance start_offset) when you proceed to send a
request; alternatively ensure the continue path still triggers the per-iteration
sleep — adjust the logic around pCurrentMemberHash, nTries, and the continue so
IsDataRequestPending does not consume an attempt and the loop still sleeps on
pending checks.

In `@src/llmq/quorumsman.h`:
- Line 75: Add a forward declaration for llmq::QuorumRole before the
CQuorumManager class so the member llmq::QuorumRole* m_handler (and any other
uses of llmq::QuorumRole in CQuorumManager) are valid; specifically, insert
"namespace llmq { class QuorumRole; }" (or equivalent forward declaration) above
the CQuorumManager definition to resolve the use-before-declaration of
llmq::QuorumRole referenced by m_handler and any other members/methods in
CQuorumManager.
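The suggested fix is the standard forward-declaration pattern. A minimal reproduction (the member name m_handler is taken from the comment; everything else is illustrative):

```cpp
// Forward declaration: enough for the header to hold a pointer,
// without pulling in the full QuorumRole definition.
namespace llmq {
class QuorumRole;
} // namespace llmq

class CQuorumManager
{
public:
    void SetHandler(llmq::QuorumRole* handler) { m_handler = handler; }
    llmq::QuorumRole* GetHandler() const { return m_handler; }

private:
    llmq::QuorumRole* m_handler{nullptr}; // pointer use only: no full type needed
};

// The full definition is only required in translation units that
// actually call through the pointer.
namespace llmq {
class QuorumRole
{
public:
    virtual ~QuorumRole() = default;
};
} // namespace llmq
```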

---

Nitpick comments:
In `@src/llmq/dkgsession.h`:
- Around line 281-339: Make the large protected block private and provide narrow
accessor/hooks for subclasses instead of exposing most session state; keep
invCs, cs_pending, and the message maps (contributions, complaints,
justifications, prematureCommitments, validCommitments) private and add specific
thread-safe methods (e.g., bool TryGetContribution(uint256 hash,
CDKGContribution& out), void ForEachContributionLocked(fn),
AddComplaint/RemoveComplaint, ValidateAndMarkCommitment,
GetPendingContributionVerifications(), etc.) that acquire the proper locks and
expose only the minimal operations subclasses need; similarly keep members,
membersMap, relayMembers, vvecContribution, receivedVvecs,
vecEncryptedContributions, m_sk_contributions, myProTxHash/myId/myIdx private
and add targeted getters/setters or iteration hooks (e.g., GetMemberByIndex,
FindMemberId, AppendEncryptedContribution) to preserve invariants and localize
locking around invCs and cs_pending while removing the broad protected
visibility.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: f51bb0d0-4ebb-4427-8fb1-dc54321440d3

📥 Commits

Reviewing files that changed from the base of the PR and between 65366a4 and 40ab4f9.

📒 Files selected for processing (20)
  • src/Makefile.am
  • src/active/context.cpp
  • src/active/context.h
  • src/active/dkgsessionhandler.cpp
  • src/active/quorums.cpp
  • src/active/quorums.h
  • src/init.cpp
  • src/llmq/dkgsession.cpp
  • src/llmq/dkgsession.h
  • src/llmq/observer.cpp
  • src/llmq/observer.h
  • src/llmq/observer/quorums.cpp
  • src/llmq/observer/quorums.h
  • src/llmq/quorums.h
  • src/llmq/quorumsman.cpp
  • src/llmq/quorumsman.h
  • src/net_processing.cpp
  • src/node/context.cpp
  • src/rpc/quorums.cpp
  • test/util/data/non-backported.txt
💤 Files with no reviewable changes (5)
  • test/util/data/non-backported.txt
  • src/active/quorums.cpp
  • src/llmq/observer/quorums.cpp
  • src/llmq/observer/quorums.h
  • src/active/quorums.h


@thepastaclaw thepastaclaw left a comment


Code Review

Clean, well-structured refactoring that successfully collapses the QuorumObserver/QuorumParticipant/QuorumObserverParent hierarchy into a single QuorumRole base class with two concrete subclasses (ObserverContext and ActiveContext). The code movement is mechanically correct with no logic changes, CDKGSession friend removal is properly replaced with getters and protected access, and the destruction/initialization ordering is preserved.

Reviewed commit: 40ab4f9

🟡 2 suggestion(s)

🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `src/llmq/quorumsman.h`:
- [SUGGESTION] lines 244-249: CheckQuorumConnections and TriggerQuorumDataRecoveryThreads should be protected, not public
  These two virtual methods are internal implementation hooks of the Template Method pattern — they're called by `UpdatedBlockTip` and overridden by `ActiveContext`, but never called externally through a `QuorumRole*`. In the old `QuorumObserver` class they were `protected virtual`. Making them public unnecessarily exposes internal state management.

In `ActiveContext::context.h` these overrides are correctly in the `protected` section (line 83–87), which reinforces that this is the intended visibility.
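The visibility split being suggested is the classic Template Method shape: a public driver calls protected virtual hooks that subclasses override but external callers never invoke. A generic sketch (method names mirror the review comment; the string return values are contrived purely for illustration):

```cpp
#include <string>

class QuorumRole
{
public:
    virtual ~QuorumRole() = default;

    // Public template method: the only entry point external code uses.
    std::string UpdatedBlockTip()
    {
        return CheckQuorumConnections() + "+" + TriggerQuorumDataRecoveryThreads();
    }

protected:
    // Internal hooks: overridable, but not callable through a QuorumRole*.
    virtual std::string CheckQuorumConnections() { return "noop"; }
    virtual std::string TriggerQuorumDataRecoveryThreads() { return "noop"; }
};

class ActiveContext : public QuorumRole
{
protected:
    std::string CheckQuorumConnections() override { return "check-connections"; }
    std::string TriggerQuorumDataRecoveryThreads() override { return "recover-data"; }
};
```

With this layout, `ctx.CheckQuorumConnections()` on an external caller fails to compile, while `UpdatedBlockTip()` still dispatches to the overrides.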

In `src/active/context.cpp`:
- [SUGGESTION] lines 283-286: Worker pool lifetime vs ActiveContext member lifetime during abnormal destruction
  The worker pool now lives in the `QuorumRole` base class, while tasks queued via `StartDataRecoveryThread` capture `this` (an `ActiveContext*`) and dereference `nodeman` (line 285). C++ destruction order is: derived destructor → member destruction → base destructor. So if `ActiveContext` is ever destroyed without a prior explicit `Stop()` call, there's a window where `nodeman` is already destroyed but the worker pool hasn't been stopped yet — `QuorumRole::~QuorumRole()` calls `Stop()` too late.

In practice the normal shutdown path (`init.cpp:293`) always calls `Stop()` before destruction, and `QuorumRole::~QuorumRole()` has a safety-net `Stop()` call. But in the old code, the worker pool was inside the owned `qman_handler` member, so it was destroyed in member-destruction order alongside the other members — there was no split between member lifetimes and pool lifetime. This refactor introduced a subtle ordering change that could bite during exception-unwind or future refactors.

Consider adding an `assert` or explicit `Stop()` in `ActiveContext::~ActiveContext()` to make the contract self-enforcing.
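The hazard is a plain consequence of C++ destruction order (derived destructor, then derived members, then base destructor), so a base-class safety-net Stop() fires too late. A self-contained sketch of the contract the reviewer suggests, with the worker pool and nodeman reduced to stand-ins:

```cpp
#include <cassert>

class QuorumRole
{
public:
    virtual ~QuorumRole()
    {
        // Too late to stop workers here: derived members (anything a
        // queued job would dereference) are already destroyed. Only verify.
        assert(m_stopped && "derived class must call Stop() before destruction");
    }

    void Stop() { m_stopped = true; /* would join/quiesce workerPool here */ }

private:
    bool m_stopped{false};
};

class ActiveContext : public QuorumRole
{
public:
    ~ActiveContext() override
    {
        Stop(); // quiesce workers while m_nodeman_like is still alive
    }

private:
    int m_nodeman_like{0}; // stands in for nodeman and friends
};
```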


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 54d9a6b83a

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@github-actions

This pull request has conflicts, please rebase.

@knst knst force-pushed the refactor-llmq-observer-parent branch from 54d9a6b to 323c98a on March 24, 2026 at 14:43

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
src/llmq/quorumsman.cpp (2)

772-775: ⚠️ Potential issue | 🟠 Major

Base destructor calls Stop() which may cause UAF issues.

The base class destructor (QuorumRole::~QuorumRole()) calls Stop(), but base destructors run after derived member destruction. Since ActiveContext queues workerPool jobs that may reference derived state, this creates a shutdown-time use-after-free window.

Consider asserting that Stop() was already called, rather than calling it from the destructor:

 QuorumRole::~QuorumRole()
 {
-    Stop();
+    // Derived classes must call Stop() before destruction
+    assert(workerPool.size() == 0 || !workerPool.is_running());
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmq/quorumsman.cpp` around lines 772 - 775, The base destructor
QuorumRole::~QuorumRole() must not call Stop() because derived state may already
be destroyed and ActiveContext/workerPool tasks can reference that derived
state; remove the Stop() call from QuorumRole::~QuorumRole() and instead assert
that Stop() was already invoked (e.g. check a boolean like m_running or a
finished/shutdown flag) to catch incorrect shutdown ordering, or move the Stop()
invocation into derived class destructors where derived resources are still
valid; reference QuorumRole::~QuorumRole, Stop(), and ActiveContext/workerPool
when making the change.

912-922: ⚠️ Potential issue | 🟠 Major

Recovery loop burns attempts on pending requests without sleeping.

When IsDataRequestPending() returns true, the loop increments nTries and continues without hitting the 1-second sleep at the bottom. This can rapidly exhaust all members and report "All tried but failed" without actually waiting for pending requests to complete.

Consider not incrementing nTries until a request is actually sent:

-            if ((GetTime<std::chrono::seconds>().count() - nTimeLastSuccess) > nRequestTimeout) {
-                if (nTries >= vecMemberHashes.size()) {
-                    printLog("All tried but failed");
-                    break;
-                }
-                pCurrentMemberHash = &vecMemberHashes[(start_offset + nTries++) % vecMemberHashes.size()];
-                if (m_qman.IsDataRequestPending(*pCurrentMemberHash, /*we_requested=*/true, pQuorum->qc->quorumHash,
-                                                pQuorum->qc->llmqType)) {
-                    printLog("Already asked");
-                    continue;
-                }
+            if ((GetTime<std::chrono::seconds>().count() - nTimeLastSuccess) > nRequestTimeout) {
+                // Find next member without a pending request
+                size_t attempts = 0;
+                while (attempts < vecMemberHashes.size()) {
+                    pCurrentMemberHash = &vecMemberHashes[(start_offset + nTries) % vecMemberHashes.size()];
+                    if (!m_qman.IsDataRequestPending(*pCurrentMemberHash, /*we_requested=*/true, pQuorum->qc->quorumHash,
+                                                    pQuorum->qc->llmqType)) {
+                        break;
+                    }
+                    nTries++;
+                    attempts++;
+                }
+                if (attempts >= vecMemberHashes.size()) {
+                    printLog("All pending or tried");
+                    // Don't break - wait for pending to complete
+                }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmq/quorumsman.cpp` around lines 912 - 922, The loop currently
increments nTries when selecting pCurrentMemberHash before calling
IsDataRequestPending, which causes pending-request branches to still consume
attempts and skip the 1s sleep; change the logic in the recovery loop around
IsDataRequestPending so that nTries is only incremented when an actual request
is sent (i.e., move the nTries++ to after the code path that sends a request or
use a separate index variable for probing and only increment nTries on a real
send), keep the IsDataRequestPending check as-is (call on *pCurrentMemberHash),
and ensure that when the branch finds a pending request it does not continue
past the sleep (so the loop will hit the existing 1-second sleep) and preserve
the existing printLog(...) messages (e.g., "Already asked" / "All tried but
failed").
🧹 Nitpick comments (1)
src/active/context.cpp (1)

182-182: Redundant assertion on gsl::not_null parameter.

The block_index parameter is declared as gsl::not_null<const CBlockIndex*>, which already guarantees the pointer is non-null at the call site. This assertion is unnecessary.

Proposed removal
     if (request.GetDataMask() & llmq::CQuorumDataRequest::ENCRYPTED_CONTRIBUTIONS) {
-        assert(block_index);
-
         int memberIdx = quorum.GetMemberIndex(request.GetProTxHash());
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/active/context.cpp` at line 182, Remove the redundant runtime assertion
for the non-null pointer: delete the call to assert(block_index); in the
function that takes gsl::not_null<const CBlockIndex*> block_index (i.e., where
the parameter is declared as gsl::not_null<const CBlockIndex*> block_index). The
gsl::not_null type already enforces non-null at call sites, so simply remove the
assert and leave the parameter usage as-is.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/active/context.cpp`:
- Line 173: The modulo expression "nIndex % quorum.qc->validMembers.size()" can
divide by zero if quorum.qc->validMembers is empty; add a defensive check before
the modulo (inspect quorum.qc and quorum.qc->validMembers.empty()) and handle
the empty case (e.g., return 0 or throw/log a clear error) so you never perform
the modulo on size() == 0; update the function containing that return to use
this guard and keep the rest of the logic unchanged.

---

Duplicate comments:
In `@src/llmq/quorumsman.cpp`:
- Around line 772-775: The base destructor QuorumRole::~QuorumRole() must not
call Stop() because derived state may already be destroyed and
ActiveContext/workerPool tasks can reference that derived state; remove the
Stop() call from QuorumRole::~QuorumRole() and instead assert that Stop() was
already invoked (e.g. check a boolean like m_running or a finished/shutdown
flag) to catch incorrect shutdown ordering, or move the Stop() invocation into
derived class destructors where derived resources are still valid; reference
QuorumRole::~QuorumRole, Stop(), and ActiveContext/workerPool when making the
change.
- Around line 912-922: The loop currently increments nTries when selecting
pCurrentMemberHash before calling IsDataRequestPending, which causes
pending-request branches to still consume attempts and skip the 1s sleep; change
the logic in the recovery loop around IsDataRequestPending so that nTries is
only incremented when an actual request is sent (i.e., move the nTries++ to
after the code path that sends a request or use a separate index variable for
probing and only increment nTries on a real send), keep the IsDataRequestPending
check as-is (call on *pCurrentMemberHash), and ensure that when the branch finds
a pending request it does not continue past the sleep (so the loop will hit the
existing 1-second sleep) and preserve the existing printLog(...) messages (e.g.,
"Already asked" / "All tried but failed").

---

Nitpick comments:
In `@src/active/context.cpp`:
- Line 182: Remove the redundant runtime assertion for the non-null pointer:
delete the call to assert(block_index); in the function that takes
gsl::not_null<const CBlockIndex*> block_index (i.e., where the parameter is
declared as gsl::not_null<const CBlockIndex*> block_index). The gsl::not_null
type already enforces non-null at call sites, so simply remove the assert and
leave the parameter usage as-is.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: a10c1fef-7887-4c73-b10b-d5db9bfee1be

📥 Commits

Reviewing files that changed from the base of the PR and between 54d9a6b and 323c98a.

📒 Files selected for processing (21)
  • src/Makefile.am
  • src/active/context.cpp
  • src/active/context.h
  • src/active/dkgsession.cpp
  • src/active/dkgsessionhandler.cpp
  • src/active/quorums.cpp
  • src/active/quorums.h
  • src/init.cpp
  • src/llmq/dkgsession.cpp
  • src/llmq/dkgsession.h
  • src/llmq/observer.cpp
  • src/llmq/observer.h
  • src/llmq/observer/quorums.cpp
  • src/llmq/observer/quorums.h
  • src/llmq/quorums.h
  • src/llmq/quorumsman.cpp
  • src/llmq/quorumsman.h
  • src/net_processing.cpp
  • src/node/context.cpp
  • src/rpc/quorums.cpp
  • test/util/data/non-backported.txt
💤 Files with no reviewable changes (5)
  • test/util/data/non-backported.txt
  • src/llmq/observer/quorums.cpp
  • src/active/quorums.h
  • src/active/quorums.cpp
  • src/llmq/observer/quorums.h
✅ Files skipped from review due to trivial changes (2)
  • src/node/context.cpp
  • src/net_processing.cpp
🚧 Files skipped from review as they are similar to previous changes (4)
  • src/rpc/quorums.cpp
  • src/active/dkgsession.cpp
  • src/llmq/quorums.h
  • src/init.cpp

return nIndex % quorum.qc->validMembers.size();

@coderabbitai coderabbitai bot Mar 24, 2026


⚠️ Potential issue | 🟡 Minor

Potential division by zero if validMembers is empty.

If quorum.qc->validMembers.size() is 0, this modulo operation causes undefined behavior. While a properly initialized quorum should always have valid members, adding a defensive check would prevent crashes from edge cases or future refactoring.

Proposed defensive check
-    return nIndex % quorum.qc->validMembers.size();
+    const size_t validMembersCount = quorum.qc->validMembers.size();
+    assert(validMembersCount > 0);
+    return nIndex % validMembersCount;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/active/context.cpp` at line 173, The modulo expression "nIndex %
quorum.qc->validMembers.size()" can divide by zero if quorum.qc->validMembers is
empty; add a defensive check before the modulo (inspect quorum.qc and
quorum.qc->validMembers.empty()) and handle the empty case (e.g., return 0 or
throw/log a clear error) so you never perform the modulo on size() == 0; update
the function containing that return to use this guard and keep the rest of the
logic unchanged.


I don't think this is a reachable current-SHA bug. The helper moved to net_quorum.cpp, and its callers return early when quorum.qc->validMembers is empty before computing the recovery offset.


🧩 Analysis chain

Let me verify the caller guards in net_quorum.cpp.


🏁 Script executed:

#!/bin/bash
# Check callers of GetQuorumRecoveryStartOffset and validMembers guards in net_quorum.cpp
rg -n "GetQuorumRecoveryStartOffset|validMembers" --type cpp -C3

Length of output: 51449


@thepastaclaw, confirmed. Both callers of GetQuorumRecoveryStartOffset in src/llmq/net_quorum.cpp explicitly guard against the empty case before computing the offset:

  • StartVvecSyncThread (line 587): if (pQuorum->qc->validMembers.empty()) return;
  • StartSkShareRecoveryThread (line 620): if (pQuorum->qc->validMembers.empty()) return;

The modulo-by-zero path is indeed unreachable. Thanks for the clarification — withdrawing the concern.


🧠 Learnings used
Learnt from: UdjinM6
Repo: dashpay/dash PR: 7265
File: src/wallet/wallet.cpp:3381-3382
Timestamp: 2026-04-03T14:17:17.656Z
Learning: In Dash Core, `strprintf()` uses `tinyformat` (type-safe printf). Do not flag format-specifier mismatches for integer types when using `strprintf()`: `%d` is correct for any integer type, including `CAmount`/`int64_t` arguments. Only treat `%d`/integer combinations as a formatting issue if they occur in a non-`strprintf()` (e.g., libc `printf`) context or there is evidence the format string is not being handled by `tinyformat`.

@knst knst force-pushed the refactor-llmq-observer-parent branch from 323c98a to 98399b4 on March 24, 2026 at 18:50

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/llmq/quorumsman.h (1)

135-159: ⚠️ Potential issue | 🟡 Minor

Re-run clang-format on the new declarations.

CI is already red here; these updated declarations and virtual signatures do not match the repository formatting yet.

Also applies to: 235-256

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmq/quorumsman.h` around lines 135 - 159, Re-run clang-format on the
updated declarations in quorumsman.h and fix formatting to match the repository
style for the following declarations: RequestQuorumData, GetQuorum, both
ScanQuorums overloads, IsMasternode, IsWatching, IsDataRequestPending,
GetDataRequestStatus, CleanupExpiredDataRequests, and CleanupOldQuorumData;
ensure spacing, line breaks, parameter wrapping and the placement/spacing of the
EXCLUSIVE_LOCKS_REQUIRED annotations match existing code style (and reformat the
other affected virtual/signature blocks in the file the same way).
♻️ Duplicate comments (4)
src/llmq/quorumsman.cpp (2)

772-775: ⚠️ Potential issue | 🟠 Major

Don't make the base destructor the first place that stops the quorum pool.

QuorumRole::~QuorumRole() runs after derived destructors, so it is too late to be the primary shutdown point for queued jobs. Stop() needs to happen before ObserverContext/ActiveContext tear down their own state; the base destructor should only verify that shutdown already happened.

Please verify that derived teardown already stops the shared pool before member destruction and that the shared pool still runs derived work. Expected: an explicit early Stop() in derived destructors or owners; otherwise this remains a shutdown-time lifetime hazard.
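The ordering hazard described above can be shown with a minimal, self-contained sketch. All names here are illustrative stand-ins, not Dash Core's actual classes: C++ runs the derived destructor body first, then destroys the derived members, and only then runs the base destructor body — so a Stop() placed only in the base destructor fires after members like the node manager are already gone.

```cpp
#include <string>
#include <vector>

// Records the order in which destructor bodies run for a Base/Derived pair
// with a member in the derived class (hypothetical names for illustration).
std::vector<std::string> DestructionOrder()
{
    std::vector<std::string> events;

    struct Base {
        std::vector<std::string>* log;
        explicit Base(std::vector<std::string>* l) : log(l) {}
        // A Stop() performed only here runs last -- after every derived
        // member has already been torn down.
        virtual ~Base() { log->push_back("base dtor"); }
    };
    struct Member {
        std::vector<std::string>* log;
        explicit Member(std::vector<std::string>* l) : log(l) {}
        ~Member() { log->push_back("member dtor"); }
    };
    struct Derived : Base {
        Member m;
        explicit Derived(std::vector<std::string>* l) : Base(l), m(l) {}
        // An explicit early Stop() belongs here, before `m` is destroyed.
        ~Derived() { log->push_back("derived dtor"); }
    };

    { Derived d(&events); }
    return events;
}
```

This is why the review asks derived contexts (or their owners) to stop the shared pool explicitly, leaving the base destructor as a defensive check only.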

#!/bin/bash
set -euo pipefail

echo "Base shutdown path:"
rg -n -C2 'QuorumRole::~QuorumRole|void QuorumRole::Stop|void QuorumRole::Start' src/llmq/quorumsman.cpp

echo
echo "Derived destructors / explicit Stop() calls:"
rg -n -C2 '(ActiveContext|ObserverContext)::(~|Stop\()' src/active src/llmq

echo
echo "Shared quorum-pool jobs:"
rg -n -C3 'workerPool\.push|TriggerQuorumDataRecoveryThreads|StartDataRecoveryThread|StartVvecSyncThread' src/active src/llmq
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmq/quorumsman.cpp` around lines 772 - 775, The base destructor
QuorumRole::~QuorumRole currently calls Stop() too late (after derived
destructors run); move the primary shutdown of the shared worker pool out of the
base destructor and ensure derived classes (e.g., ObserverContext and
ActiveContext) explicitly call Stop() early in their own destructors or owners
perform an early Stop() before tearing down derived members; update or add
explicit Stop() calls in ActiveContext::~ActiveContext and
ObserverContext::~ObserverContext (or their owning shutdown paths) so the
workerPool and methods like
TriggerQuorumDataRecoveryThreads/StartDataRecoveryThread/StartVvecSyncThread are
stopped before derived member destruction, and keep QuorumRole::~QuorumRole as a
defensive check that verifies shutdown already occurred rather than performing
it.

917-923: ⚠️ Potential issue | 🟠 Major

Pending requests shouldn't burn recovery attempts.

nTries is incremented before IsDataRequestPending(), and the continue skips the normal 1-second backoff. A few outstanding requests can make this loop reach "All tried but failed" without actually giving any of them time to complete.

Proposed fix
-                pCurrentMemberHash = &vecMemberHashes[(start_offset + nTries++) % vecMemberHashes.size()];
+                pCurrentMemberHash = &vecMemberHashes[(start_offset + nTries) % vecMemberHashes.size()];
                 if (m_qman.IsDataRequestPending(*pCurrentMemberHash, /*we_requested=*/true, pQuorum->qc->quorumHash,
                                                 pQuorum->qc->llmqType)) {
                     printLog("Already asked");
+                    quorumThreadInterrupt.sleep_for(std::chrono::seconds(1));
                     continue;
                 }
+                ++nTries;
                 quorumThreadInterrupt.sleep_for(std::chrono::milliseconds(start_offset * 100));
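The accounting in the proposed fix can be sketched with a hypothetical helper (not the real DataRecoveryThread; the interrupt-based sleeps are elided): members with an outstanding request are skipped via the backoff path without consuming one of the limited tries, and a try is only counted when a request is actually issued.

```cpp
#include <cstddef>
#include <vector>

struct ScanResult {
    std::size_t attempts = 0;          // tries actually consumed
    std::vector<std::size_t> asked;    // member indices we asked
};

// One pass over the member set, starting at start_offset.
// `pending[idx]` stands in for m_qman.IsDataRequestPending(...).
ScanResult ScanMembersOnce(const std::vector<bool>& pending, std::size_t start_offset)
{
    ScanResult res;
    for (std::size_t i = 0; i < pending.size(); ++i) {
        const std::size_t idx = (start_offset + i) % pending.size();
        if (pending[idx]) {
            // "Already asked": back off (sleep elided here) and move on,
            // but do NOT count this as a failed attempt.
            continue;
        }
        ++res.attempts; // an attempt is only an actual outgoing request
        res.asked.push_back(idx);
    }
    return res;
}
```

With this shape, a handful of in-flight requests no longer drives the loop toward "All tried but failed" before any of them had time to complete.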
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmq/quorumsman.cpp` around lines 917 - 923, The loop currently
increments nTries before calling m_qman.IsDataRequestPending and then uses
continue, which consumes an attempt without waiting; to fix, compute the
candidate member index using a temporary variable (e.g., idx = (start_offset +
nTries) % vecMemberHashes.size()) and call IsDataRequestPending with that
candidate without incrementing nTries, then only increment nTries when you
actually perform an attempt (after passing the pending check) or otherwise
perform the 1-second backoff before continuing; adjust
quorumThreadInterrupt.sleep_for usage so the pending-case flows through the same
backoff path instead of skipping it.
src/llmq/quorumsman.h (1)

75-75: ⚠️ Potential issue | 🔴 Critical

Forward-declare QuorumRole before CQuorumManager.

Line 75 and Line 111 use llmq::QuorumRole before the class is declared at Line 197, so this header is no longer self-contained and can fail to parse when included on its own. Add a forward declaration next to the other llmq forward declarations before CQuorumManager.

Minimal fix
 class CDKGSessionManager;
 class CQuorumBlockProcessor;
 class CQuorumSnapshotManager;
+class QuorumRole;

Also applies to: 111-111

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmq/quorumsman.h` at line 75, Add a forward declaration for
llmq::QuorumRole before the declaration of CQuorumManager so the header is
self-contained: locate the block with the other llmq forward declarations (where
CQuorumManager is declared) and add "namespace llmq { class QuorumRole; }" (or
equivalent) so members like llmq::QuorumRole* m_handler and any use at lines
referencing QuorumRole compile without needing the full class definition.
src/active/context.cpp (1)

291-294: ⚠️ Potential issue | 🟠 Major

Stop the recovery pool before these jobs can outlive ActiveContext.

Lines 291-294 capture this and dereference nodeman, but workerPool now lives in QuorumRole. If destruction ever happens without an earlier explicit Stop(), this lambda can run after ActiveContext members start tearing down. Make shutdown self-enforcing here by stopping the pool before member destruction, or assert that shutdown already did.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/active/context.cpp` around lines 291 - 294, The lambda pushed into
workerPool captures this and dereferences nodeman, which can outlive
ActiveContext if the pool is owned by QuorumRole; ensure shutdown is enforced by
either stopping the pool before any ActiveContext members are destroyed or
asserting shutdown has already occurred. Modify the code around workerPool.push
(the lambda that calls DataRecoveryThread with pQuorum/pIndex and
nodeman->GetProTxHash()) to first call or verify Stop() (or an equivalent
self-enforcing shutdown on workerPool) or add an assertion that ActiveContext
has already been stopped, so the lambda cannot run after ActiveContext
destruction; update any related lifetimes or comments accordingly to reference
workerPool, Stop(), ActiveContext, QuorumRole, nodeman, and DataRecoveryThread.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/active/context.cpp`:
- Around line 153-154: Reformat the function signature and wrapped lines for
ActiveContext::GetQuorumRecoveryStartOffset and the nearby blocks mentioned (the
function headers and wrapped calls/conditionals in the chunks around lines
191-193, 203-205, and 241-330) with clang-format so they match the project's
style; run clang-format (or the repo's formatting command) on
src/active/context.cpp to reflow the long signatures, wrapped calls, and
conditionals and commit the updated formatting ensuring the signatures like
ActiveContext::GetQuorumRecoveryStartOffset are a single consistent style as
enforced by CI.

In `@src/llmq/observer.cpp`:
- Around line 57-62: ObserverContext::InitializeCurrentBlockTip is calling
llmq::QuorumRole::InitializeQuorumConnections(tip) after calling
UpdatedBlockTip(tip, nullptr, ibd), which duplicates the quorum init logic and
bypasses the IBD guard inside QuorumRole::UpdatedBlockTip; remove the explicit
call to llmq::QuorumRole::InitializeQuorumConnections(tip) from
InitializeCurrentBlockTip so quorum connections are only initialized via
UpdatedBlockTip/QuorumRole::UpdatedBlockTip (which already checks ibd) and let
the existing UpdatedBlockTip flow handle connection setup.
- Around line 44-47: The code currently sets ENCRYPTED_CONTRIBUTIONS_MISSING and
returns MisbehavingError when request_limit_exceeded is true, which penalizes
peers for a condition observers can't serve; change the branch in observer.cpp
so that after
request.SetError(CQuorumDataRequest::Errors::ENCRYPTED_CONTRIBUTIONS_MISSING)
you always return MessageProcessingResult{} (no MisbehavingError) regardless of
request_limit_exceeded, matching CQuorumManager::sendQDATA() behavior and
avoiding misbehavior scoring for ENCRYPTED_CONTRIBUTIONS_MISSING.

---

Outside diff comments:
In `@src/llmq/quorumsman.h`:
- Around line 135-159: Re-run clang-format on the updated declarations in
quorumsman.h and fix formatting to match the repository style for the following
declarations: RequestQuorumData, GetQuorum, both ScanQuorums overloads,
IsMasternode, IsWatching, IsDataRequestPending, GetDataRequestStatus,
CleanupExpiredDataRequests, and CleanupOldQuorumData; ensure spacing, line
breaks, parameter wrapping and the placement/spacing of the
EXCLUSIVE_LOCKS_REQUIRED annotations match existing code style (and reformat the
other affected virtual/signature blocks in the file the same way).

---

Duplicate comments:
In `@src/active/context.cpp`:
- Around line 291-294: The lambda pushed into workerPool captures this and
dereferences nodeman, which can outlive ActiveContext if the pool is owned by
QuorumRole; ensure shutdown is enforced by either stopping the pool before any
ActiveContext members are destroyed or asserting shutdown has already occurred.
Modify the code around workerPool.push (the lambda that calls DataRecoveryThread
with pQuorum/pIndex and nodeman->GetProTxHash()) to first call or verify Stop()
(or an equivalent self-enforcing shutdown on workerPool) or add an assertion
that ActiveContext has already been stopped, so the lambda cannot run after
ActiveContext destruction; update any related lifetimes or comments accordingly
to reference workerPool, Stop(), ActiveContext, QuorumRole, nodeman, and
DataRecoveryThread.

In `@src/llmq/quorumsman.cpp`:
- Around line 772-775: The base destructor QuorumRole::~QuorumRole currently
calls Stop() too late (after derived destructors run); move the primary shutdown
of the shared worker pool out of the base destructor and ensure derived classes
(e.g., ObserverContext and ActiveContext) explicitly call Stop() early in their
own destructors or owners perform an early Stop() before tearing down derived
members; update or add explicit Stop() calls in ActiveContext::~ActiveContext
and ObserverContext::~ObserverContext (or their owning shutdown paths) so the
workerPool and methods like
TriggerQuorumDataRecoveryThreads/StartDataRecoveryThread/StartVvecSyncThread are
stopped before derived member destruction, and keep QuorumRole::~QuorumRole as a
defensive check that verifies shutdown already occurred rather than performing
it.
- Around line 917-923: The loop currently increments nTries before calling
m_qman.IsDataRequestPending and then uses continue, which consumes an attempt
without waiting; to fix, compute the candidate member index using a temporary
variable (e.g., idx = (start_offset + nTries) % vecMemberHashes.size()) and call
IsDataRequestPending with that candidate without incrementing nTries, then only
increment nTries when you actually perform an attempt (after passing the pending
check) or otherwise perform the 1-second backoff before continuing; adjust
quorumThreadInterrupt.sleep_for usage so the pending-case flows through the same
backoff path instead of skipping it.

In `@src/llmq/quorumsman.h`:
- Line 75: Add a forward declaration for llmq::QuorumRole before the declaration
of CQuorumManager so the header is self-contained: locate the block with the
other llmq forward declarations (where CQuorumManager is declared) and add
"namespace llmq { class QuorumRole; }" (or equivalent) so members like
llmq::QuorumRole* m_handler and any use at lines referencing QuorumRole compile
without needing the full class definition.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: abc2aee3-2464-4810-95b6-71fe1dde4c97

📥 Commits

Reviewing files that changed from the base of the PR and between 323c98a and 98399b4.

📒 Files selected for processing (21)
  • src/Makefile.am
  • src/active/context.cpp
  • src/active/context.h
  • src/active/dkgsession.cpp
  • src/active/dkgsessionhandler.cpp
  • src/active/quorums.cpp
  • src/active/quorums.h
  • src/init.cpp
  • src/llmq/dkgsession.cpp
  • src/llmq/dkgsession.h
  • src/llmq/observer.cpp
  • src/llmq/observer.h
  • src/llmq/observer/quorums.cpp
  • src/llmq/observer/quorums.h
  • src/llmq/quorums.h
  • src/llmq/quorumsman.cpp
  • src/llmq/quorumsman.h
  • src/net_processing.cpp
  • src/node/context.cpp
  • src/rpc/quorums.cpp
  • test/util/data/non-backported.txt
💤 Files with no reviewable changes (5)
  • test/util/data/non-backported.txt
  • src/llmq/observer/quorums.h
  • src/llmq/observer/quorums.cpp
  • src/active/quorums.h
  • src/active/quorums.cpp
✅ Files skipped from review due to trivial changes (3)
  • src/rpc/quorums.cpp
  • src/node/context.cpp
  • src/net_processing.cpp
🚧 Files skipped from review as they are similar to previous changes (3)
  • src/active/dkgsession.cpp
  • src/init.cpp
  • src/llmq/quorums.h

Comment on lines +153 to +154
size_t ActiveContext::GetQuorumRecoveryStartOffset(const llmq::CQuorum& quorum,
gsl::not_null<const CBlockIndex*> pIndex) const

@coderabbitai coderabbitai bot Mar 24, 2026


⚠️ Potential issue | 🟡 Minor

Re-run clang-format on this block.

The diff-format job is already failing on these signatures, wrapped calls, and conditionals, so this chunk will keep CI red until it is reformatted.

Also applies to: 191-193, 203-205, 241-330

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/active/context.cpp` around lines 153 - 154, Reformat the function
signature and wrapped lines for ActiveContext::GetQuorumRecoveryStartOffset and
the nearby blocks mentioned (the function headers and wrapped calls/conditionals
in the chunks around lines 191-193, 203-205, and 241-330) with clang-format so
they match the project's style; run clang-format (or the repo's formatting
command) on src/active/context.cpp to reflow the long signatures, wrapped calls,
and conditionals and commit the updated formatting ensuring the signatures like
ActiveContext::GetQuorumRecoveryStartOffset are a single consistent style as
enforced by CI.


This formatting issue is stale on the current SHA. The PR's ClangFormat check passes, so there is no remaining actionable diff-format problem here.


@thepastaclaw, thanks for the update — acknowledged. If the ClangFormat CI check is passing on the current SHA, there's nothing actionable here. I'll disregard this comment.



Comment on lines +44 to +47
if (request.GetDataMask() & CQuorumDataRequest::ENCRYPTED_CONTRIBUTIONS) {
request.SetError(CQuorumDataRequest::Errors::ENCRYPTED_CONTRIBUTIONS_MISSING);
return request_limit_exceeded ? MisbehavingError{25, "request limit exceeded"} : MessageProcessingResult{};
}

@coderabbitai coderabbitai bot Mar 24, 2026


⚠️ Potential issue | 🟠 Major

Don't punish retries for data observers can never serve.

CQuorumManager::sendQDATA() explicitly skips misbehavior scoring for ENCRYPTED_CONTRIBUTIONS_MISSING, but this branch returns MisbehavingError for the same case when the request repeats. That turns a benign "data not available here" response into a peer penalty.

Proposed fix
     if (request.GetDataMask() & CQuorumDataRequest::ENCRYPTED_CONTRIBUTIONS) {
         request.SetError(CQuorumDataRequest::Errors::ENCRYPTED_CONTRIBUTIONS_MISSING);
-        return request_limit_exceeded ? MisbehavingError{25, "request limit exceeded"} : MessageProcessingResult{};
+        return {};
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmq/observer.cpp` around lines 44 - 47, The code currently sets
ENCRYPTED_CONTRIBUTIONS_MISSING and returns MisbehavingError when
request_limit_exceeded is true, which penalizes peers for a condition observers
can't serve; change the branch in observer.cpp so that after
request.SetError(CQuorumDataRequest::Errors::ENCRYPTED_CONTRIBUTIONS_MISSING)
you always return MessageProcessingResult{} (no MisbehavingError) regardless of
request_limit_exceeded, matching CQuorumManager::sendQDATA() behavior and
avoiding misbehavior scoring for ENCRYPTED_CONTRIBUTIONS_MISSING.

Collaborator Author


net_quorum.cpp already skips misbehavior for ENCRYPTED_CONTRIBUTIONS_MISSING, doesn't it?


🧩 Analysis chain

Let me check the relevant code in net_quorum.cpp to verify this.


🏁 Script executed:

#!/bin/bash
rg -n "ENCRYPTED_CONTRIBUTIONS" --type=cpp -C 5

Length of output: 9330


@knst — you're right. Looking at the actual net_quorum.cpp code, lines 94–97 explicitly suppress misbehavior for ENCRYPTED_CONTRIBUTIONS_MISSING:

case (CQuorumDataRequest::Errors::QUORUM_VERIFICATION_VECTOR_MISSING):
case (CQuorumDataRequest::Errors::ENCRYPTED_CONTRIBUTIONS_MISSING):
    // Do not punish limit exceed if we don't have the requested data
    break;

The caller in net_quorum.cpp inspects the error code set on the request object via this switch, and since ENCRYPTED_CONTRIBUTIONS_MISSING falls into the break-only branch (leaving misbehave as false), no peer penalty is issued — regardless of what the ProcessContribQGETDATA return value signals. My concern was incorrect. Apologies for the noise.
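The decision the switch above encodes can be condensed into a small sketch (enum values are illustrative; only the two "missing data" cases mirror the real error names): missing-data replies never translate into a peer penalty, even when the request limit was exceeded.

```cpp
enum class QDataError {
    NONE,
    QUORUM_VERIFICATION_VECTOR_MISSING,
    ENCRYPTED_CONTRIBUTIONS_MISSING,
    REQUEST_REJECTED,
};

// Hypothetical condensed form of the misbehaviour decision in the
// QGETDATA handler: the caller inspects the error code set on the
// request object, not the return value of the processing method.
bool ShouldMisbehave(QDataError err, bool request_limit_exceeded)
{
    switch (err) {
    case QDataError::QUORUM_VERIFICATION_VECTOR_MISSING:
    case QDataError::ENCRYPTED_CONTRIBUTIONS_MISSING:
        // Do not punish limit excess if we simply don't have the data.
        return false;
    default:
        return request_limit_exceeded;
    }
}
```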


✏️ Learnings added
Learnt from: knst
Repo: dashpay/dash PR: 7247
File: src/llmq/observer.cpp:0-0
Timestamp: 2026-04-01T13:08:07.207Z
Learning: In `src/llmq/net_quorum.cpp`, the QGETDATA handler has an explicit switch statement that suppresses misbehavior scoring for `CQuorumDataRequest::Errors::ENCRYPTED_CONTRIBUTIONS_MISSING` and `QUORUM_VERIFICATION_VECTOR_MISSING`. Even if `ObserverContext::ProcessContribQGETDATA` in `src/llmq/observer.cpp` returns a `MisbehavingError`, the caller in `net_quorum.cpp` determines misbehavior by inspecting the error code set on the request object, not the return value of the processing method. The switch explicitly does NOT set `misbehave = request_limit_exceeded` for these two error cases.



🧠 Learnings used
Learnt from: knst
Repo: dashpay/dash PR: 7247
File: src/llmq/quorumsman.cpp:0-0
Timestamp: 2026-04-01T12:54:47.684Z
Learning: In `src/llmq/quorumsman.cpp`, the `QuorumRole::DataRecoveryThread` loop intentionally increments `nTries` before calling `m_qman.IsDataRequestPending()` and uses `continue` without a sleep when the request is already pending. This is by design: the loop cycles through `vecMemberHashes` to find a member without an outstanding request, not a bug. This is moved code from the previous `QuorumParticipant`/`QuorumObserver` implementation (PR `#7247`).


I couldn't confirm a current-SHA issue here. In the refactored path, sendQDATA() still suppresses peer punishment for ENCRYPTED_CONTRIBUTIONS_MISSING, so the observer-side 'data not available here' case is not turned into a misbehavior penalty.


Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!


No further action needed here. This follow-up already acknowledges that the observer-side ENCRYPTED_CONTRIBUTIONS_MISSING path does not actually trigger peer punishment in the current code.

@UdjinM6 UdjinM6 requested a review from kwvg March 24, 2026 21:40

@thepastaclaw thepastaclaw left a comment


Code Review

The incremental push rebases the original refactoring on top of three merged bugfix/test PRs (#7240, #7241, #7252) and adds two new clean refactoring commits: de-duplicating UpdatedBlockTip/InitializeQuorumConnections code and simplifying the observer_ctx guard in init.cpp. The two prior review findings (protected visibility for Template Method hooks, worker pool lifetime safety) remain unaddressed.

Reviewed commit: 98399b4

🟡 2 suggestion(s)

1 additional finding

🟡 suggestion: Worker pool tasks can outlive ActiveContext members on abnormal destruction

src/active/context.cpp (lines 76-80)

Worker pool tasks launched by StartDataRecoveryThread capture this (ActiveContext*) and dereference nodeman->GetProTxHash(). The worker pool Stop() lives in ~QuorumRole(), but C++ destruction order destroys ActiveContext members (including nodeman) before ~QuorumRole() runs. The normal shutdown path calls ActiveContext::Stop() first, but if destruction occurs without an explicit Stop() (e.g., exception during initialization), in-flight tasks could use a destroyed nodeman. Adding QuorumRole::Stop() in ~ActiveContext() would make the contract self-enforcing. (Same finding as prior review — not yet addressed.)

💡 Suggested change
ActiveContext::~ActiveContext()
{
    // Ensure worker pool is stopped before our members are destroyed,
    // since tasks may capture `this` and dereference nodeman.
    llmq::QuorumRole::Stop();
    m_qman.DisconnectManagers();
    m_isman.DisconnectSigner();
}
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `src/llmq/quorumsman.h`:
- [SUGGESTION] lines 245-250: CheckQuorumConnections and TriggerQuorumDataRecoveryThreads are still public — should be protected
  These two virtual methods are Template Method hooks — they are only ever called from within the QuorumRole hierarchy (by `InitializeQuorumConnections` and `UpdatedBlockTip`). Exposing them as `public` allows external callers to invoke per-LLMQ-type methods directly, bypassing the intended `InitializeQuorumConnections` entry point which iterates all types. In `ActiveContext::context.h` the overrides are correctly declared in the `protected` section, confirming the intended visibility. (Same finding as prior review — not yet addressed.)

In `src/active/context.cpp`:
- [SUGGESTION] lines 76-80: Worker pool tasks can outlive ActiveContext members on abnormal destruction
  Worker pool tasks launched by `StartDataRecoveryThread` capture `this` (ActiveContext*) and dereference `nodeman->GetProTxHash()`. The worker pool `Stop()` lives in `~QuorumRole()`, but C++ destruction order destroys ActiveContext members (including `nodeman`) *before* `~QuorumRole()` runs. The normal shutdown path calls `ActiveContext::Stop()` first, but if destruction occurs without an explicit `Stop()` (e.g., exception during initialization), in-flight tasks could use a destroyed `nodeman`. Adding `QuorumRole::Stop()` in `~ActiveContext()` would make the contract self-enforcing. (Same finding as prior review — not yet addressed.)
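The protected-hooks suggestion above is a classic Template Method layout; a minimal sketch (names mirror the review, bodies are stand-ins for the real logic) shows the intended shape: per-LLMQ-type hooks are `protected`, so external callers can only go through the public entry point that iterates all types.

```cpp
#include <string>
#include <vector>

class QuorumRoleSketch
{
public:
    virtual ~QuorumRoleSketch() = default;

    // Public entry point: drives the hook once per LLMQ type.
    std::vector<std::string> InitializeQuorumConnections(const std::vector<std::string>& llmq_types)
    {
        std::vector<std::string> log;
        for (const auto& t : llmq_types) {
            log.push_back(CheckQuorumConnections(t));
        }
        return log;
    }

protected:
    // Template Method hook: overridable, but only reachable from within
    // the hierarchy, so per-type calls cannot bypass the entry point.
    virtual std::string CheckQuorumConnections(const std::string& llmq_type)
    {
        return "checked " + llmq_type;
    }
};
```

Moving CheckQuorumConnections and TriggerQuorumDataRecoveryThreads into a `protected` section would make the header match the visibility already used for the overrides in ActiveContext.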

@knst
Collaborator Author

knst commented Mar 26, 2026

@UdjinM6 UdjinM6 requested a review from kwvg 2 days ago

I have a couple of changes to add to this PR; please hold off on the review.

@knst knst removed the request for review from kwvg March 26, 2026 13:01
@knst knst marked this pull request as draft March 26, 2026 13:01
@knst knst force-pushed the refactor-llmq-observer-parent branch 2 times, most recently from 76e8e3a to 8311e2e on March 30, 2026 at 19:05
@knst knst marked this pull request as ready for review March 31, 2026 13:35

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8311e2e592


src/init.cpp Outdated
}
// Note: active_ctx initialization is deferred until after nodeman->Init()
// so that GetProTxHash() is available for quorum connection setup.
GetMainSignals().InitializeCurrentBlockTip(tip, ibd);


P1 Badge Defer first quorum-tip initialization until proTxHash is set

Calling InitializeCurrentBlockTip here runs NetQuorum::InitializeCurrentBlockTip before nodeman->Init() (which is now done later at lines 2494-2502). In active masternode mode this triggers quorum scanning with a null GetProTxHash(), so CQuorumManager::BuildQuorumContributions cannot set skShare and caches recent quorums without it; subsequent lookups reuse the cached entries and do not rebuild. With -llmq-data-recovery=0 (or if recovery peers are unavailable), the node can stay unable to produce signature shares after restart.


Collaborator Author

@knst knst Apr 1, 2026


it looks like an existing issue [if there is any]; this PR is pure refactoring


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/llmq/quorumsman.h (1)

132-205: ⚠️ Potential issue | 🟡 Minor

Run clang-format on the new API block.

The diff-format job is still failing on these declarations, enum wrapping, and the QuorumRole block.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmq/quorumsman.h` around lines 132 - 205, Run clang-format on
src/llmq/quorumsman.h to fix the formatting in the newly added API block:
reformat the declarations for RegisterDataRequest, ValidateDataResponse,
IsDataRequestPending, GetDataRequestStatus, CleanupExpiredDataRequests,
CleanupOldQuorumData, GetCachedMutableQuorum, WriteContributions,
QueueQuorumForWarming, BuildQuorumFromCommitment, GetQuorum,
CacheWarmingThreadMain, MigrateOldQuorumDB, the SIGN_HEIGHT_OFFSET line,
SelectQuorumForSigning, VerifyRecoveredSig, and the QuorumRole class so they
follow the project's style (line wrapping, enum class/DataResponseValidation
formatting, spacing and brace placement); simply run the repository's
clang-format configuration (or your editor's clang-format) and commit the
resulting whitespace-only changes.
♻️ Duplicate comments (3)
src/llmq/net_quorum.cpp (2)

321-329: ⚠️ Potential issue | 🟠 Major

Avoid the unconditional second quorum-connection pass.

UpdatedBlockTip(tip, nullptr, ibd) already runs the connection scan when the node is ready. The extra loop repeats that work on synced restarts, and it still opens quorum connections during IBD.

Suggested fix
 void NetQuorum::InitializeCurrentBlockTip(const CBlockIndex* tip, bool ibd)
 {
     if (!m_role) return;
     UpdatedBlockTip(tip, nullptr, ibd);
-    if (tip) {
+    if (tip && !ibd && !m_mn_sync.IsBlockchainSynced()) {
         for (const auto& params : Params().GetConsensus().llmqs) {
             CheckQuorumConnections(params, tip);
         }
     }
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmq/net_quorum.cpp` around lines 321 - 329,
NetQuorum::InitializeCurrentBlockTip currently calls UpdatedBlockTip(tip,
nullptr, ibd) and then unconditionally repeats a quorum-connection scan with a
loop over Params().GetConsensus().llmqs calling CheckQuorumConnections(params,
tip); remove that duplicate pass: delete the for-loop (or at minimum guard it so
it does not run during IBD and does not run when UpdatedBlockTip already
performed the scan) so quorum connections are not repeated or opened during
initial block download; keep UpdatedBlockTip, tip and ibd handling intact.

145-160: ⚠️ Potential issue | 🟠 Major

Don't score repeated ENCRYPTED_CONTRIBUTIONS_MISSING replies.

sendQDATA() explicitly suppresses scoring for ENCRYPTED_CONTRIBUTIONS_MISSING, but misbehave_contrib turns it back on in the final limit-exceeded check. Observers hit this path on every contribution request because m_nodeman is null.

Suggested fix
-        bool misbehave_contrib = ProcessContribQGETDATA(ssResponseData, *pQuorum, request, pQuorumBaseBlockIndex);
+        ProcessContribQGETDATA(ssResponseData, *pQuorum, request, pQuorumBaseBlockIndex);
@@
-        if (request_limit_exceeded && (misbehave_contrib || misbehave_qdata)) {
+        if (request_limit_exceeded && misbehave_qdata) {
             m_peer_manager->PeerMisbehaving(pfrom.GetId(), 25, "request limit exceeded");
         }

Also applies to: 233-248

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmq/net_quorum.cpp` around lines 145 - 160, The combined misbehaviour
check incorrectly re-enables scoring for ENCRYPTED_CONTRIBUTIONS_MISSING because
misbehave_contrib (from ProcessContribQGETDATA) can be true even when sendQDATA
suppressed scoring; update the limit-exceeded branch to ignore misbehave_contrib
when the effective error is
CQuorumDataRequest::Errors::ENCRYPTED_CONTRIBUTIONS_MISSING (or when
request.GetError()/ret_err equals that value) so that (misbehave_contrib ||
misbehave_qdata) only triggers PeerMisbehaving if the error is not
ENCRYPTED_CONTRIBUTIONS_MISSING; adjust the same logic in the other occurrence
(around the block that includes ProcessContribQGETDATA, sendQDATA,
request_limit_exceeded and m_peer_manager->PeerMisbehaving) to ensure
ENCRYPTED_CONTRIBUTIONS_MISSING never causes PeerMisbehaving.
src/llmq/quorumsman.h (1)

53-63: ⚠️ Potential issue | 🔴 Critical

Forward-declare QuorumRole before CQuorumManager.

This header stores llmq::QuorumRole* on Line 62 and accepts it in ConnectManagers() on Line 98, but the class is only declared on Line 189. That makes quorumsman.h depend on include order instead of being self-contained.

Minimal fix
 namespace llmq {
+class QuorumRole;
+
 enum class VerifyRecSigStatus : uint8_t {
#!/bin/bash
set -euo pipefail

rg -n 'class QuorumRole|m_handler|ConnectManagers' src/llmq/quorumsman.h
sed -n '38,110p' src/llmq/quorumsman.h
sed -n '185,205p' src/llmq/quorumsman.h

Expected result: the declaration of class QuorumRole appears only after the m_handler/ConnectManagers uses, confirming the need for a forward declaration in this header.

Also applies to: 98-105

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmq/quorumsman.h` around lines 53 - 63, the header currently uses
the llmq::QuorumRole pointer (m_handler) and accepts a QuorumRole in
ConnectManagers() before the QuorumRole class is declared, creating an
include-order dependency; fix this by adding a forward declaration for namespace
llmq { class QuorumRole; } near the top of quorumsman.h (before the
CQuorumManager declaration) so CQuorumManager, its member m_handler and its
ConnectManagers() declaration compile without requiring the full QuorumRole
definition.
🧹 Nitpick comments (1)
src/llmq/net_quorum.h (1)

45-46: Mention the validation-interface registration in this ownership note.

PeerManagerImpl::AddExtraHandler() also exposes NetQuorum through RegisterValidationInterface(), so this object has an extra non-owning reference path until StopHandlers() unregisters it. Calling that out here would make the lifetime contract around m_role and m_nodeman much clearer.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmq/net_quorum.h` around lines 45 - 46, Update the ownership comment for
NetQuorum to note that although it is owned exclusively by PeerManagerImpl via
AddExtraHandler, it is also exposed via RegisterValidationInterface by
PeerManagerImpl::AddExtraHandler, creating an additional non-owning reference
path until StopHandlers() calls unregister; mention this to clarify the lifetime
guarantees around m_role and m_nodeman and that they remain valid until the
validation interface is unregistered.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/active/dkgsessionhandler.cpp`:
- Around line 108-113: The LogPrintf calls in ActiveDKGSessionHandler (using
__func__, pQuorumBaseBlockIndex->nHeight, params.name, quorumIndex) are
misformatted; re-run clang-format or manually reflow the argument list so the
lines match the project's formatting rules (wrap/align arguments consistently),
ensuring both the "quorum initialization failed" and "quorum initialization OK"
LogPrintf invocations are formatted identically and pass the diff-format check.

In `@src/init.cpp`:
- Around line 2427-2430: Drain the validation callback queue before broadcasting
the current tip: after obtaining tip and ibd (the lines using
WITH_LOCK(::cs_main) and chainman.ActiveChainstate().IsInitialBlockDownload()),
call SyncWithValidationInterfaceQueue() to ensure any
BlockConnected/UpdatedBlockTip work left by ThreadImport() has been processed,
then call GetMainSignals().InitializeCurrentBlockTip(tip, ibd); this guarantees
subscribers see the final tip after older callbacks have run.

In `@src/llmq/dkgsession.h`:
- Around line 396-400: The new accessor declarations (GetContribution,
GetComplaint, GetJustification, GetPrematureCommitment) are not formatted to
project style; run clang-format or reflow these declarations so line-wrapping
and spacing match the repo's clang-format rules (preserving the signature and
the EXCLUSIVE_LOCKS_REQUIRED(!invCs) attribute and return types) to resolve the
diff-format failure.

In `@src/llmq/net_quorum.cpp`:
- Around line 31-706: The file src/llmq/net_quorum.cpp has formatting issues;
run the project's clang-format configuration on this file (e.g. clang-format -i
--style=file src/llmq/net_quorum.cpp) to normalize spacing/indentation and line
breaks across all functions (NetQuorum::NetQuorum, NetQuorum::Start,
NetQuorum::ProcessMessage, ProcessContribQGETDATA, ProcessContribQDATA,
DataRecoveryThread, StartVvecSyncThread, StartSkShareRecoveryThread,
StartCleanupOldQuorumDataThread, etc.), commit the formatted file, and re-run
the diff-format job to ensure the CI/lint errors are resolved.
- Around line 72-75: The QGETDATA guard currently rejects when m_role is null
(observer/watch mode) causing qwatch connections to be mis-scored; change the
conditional logic in the NetMsgType::QGETDATA handling so that qwatch peers are
allowed to proceed regardless of m_role (i.e. only enforce the
m_role/IsMasternode/GetVerifiedProRegTxHash checks for non-qwatch connections).
Update the condition around m_peer_manager->PeerMisbehaving call (referencing
NetMsgType::QGETDATA, m_role, IsMasternode(), pfrom.GetVerifiedProRegTxHash(),
and pfrom.qwatch) so watch-only observers are not penalized; leave the existing
m_nodeman nullptr handling further down unchanged.

In `@src/llmq/net_quorum.h`:
- Around line 48-116: The NetQuorum class declaration has inconsistent
wrapping/indentation (clang-format errors); run clang-format over this file and
reformat the NetQuorum declaration and its member function/type lines (including
the constructor NetQuorum(...), overridden NetHandler methods
ProcessMessage/Start/Stop/Interrupt, CValidationInterface methods
InitializeCurrentBlockTip/UpdatedBlockTip, and private helpers like
RequestQuorumData, GetQuorumsToDelete, CheckQuorumConnections,
TriggerQuorumDataRecoveryThreads, DataRecoveryThread, StartVvecSyncThread,
TryStartVvecSyncThread, StartSkShareRecoveryThread,
GetQuorumRecoveryStartOffset, StartCleanupOldQuorumDataThread,
ProcessContribQGETDATA, ProcessContribQDATA, and all member variable
declarations) so wrapping, alignment, and indentation match the project's
clang-format style; then re-run the CI formatting check to ensure the
diff-format job passes.

In `@src/llmq/observer.h`:
- Around line 35-52: Run clang-format on the ObserverContext header block to
normalize spacing and line breaks for the constructor and inline overrides;
specifically reformat the ObserverContext(...) constructor declaration and the
inline override methods IsMasternode, IsWatching, and SetQuorumSecretKeyShare so
they follow the project's style (consistent indentation, space before/after
braces and parameters, and each inline method signature/braced body on its own
properly formatted line) to satisfy the diff-format check.

In `@src/llmq/quorumsman.cpp`:
- Around line 365-395: The new helper functions RegisterDataRequest and
ValidateDataResponse have inconsistent wrapping and indentation that fails the
diff-format check; run clang-format (or apply the project's style rules) on this
block and reformat these functions so they match the project's style (fix
indentation, line breaks around the LOCK(cs_data_requests) lines, the auto
[old_pair, inserted] = ... destructuring line, parameter wrapping for
CQuorumManager::ValidateDataResponse, and closing braces) to ensure the
diff-format job passes.

---

Outside diff comments:
In `@src/llmq/quorumsman.h`:
- Around line 132-205: Run clang-format on src/llmq/quorumsman.h to fix the
formatting in the newly added API block: reformat the declarations for
RegisterDataRequest, ValidateDataResponse, IsDataRequestPending,
GetDataRequestStatus, CleanupExpiredDataRequests, CleanupOldQuorumData,
GetCachedMutableQuorum, WriteContributions, QueueQuorumForWarming,
BuildQuorumFromCommitment, GetQuorum, CacheWarmingThreadMain,
MigrateOldQuorumDB, the SIGN_HEIGHT_OFFSET line, SelectQuorumForSigning,
VerifyRecoveredSig, and the QuorumRole class so they follow the project's style
(line wrapping, enum class/DataResponseValidation formatting, spacing and brace
placement); simply run the repository's clang-format configuration (or your
editor's clang-format) and commit the resulting whitespace-only changes.

---

Duplicate comments:
In `@src/llmq/net_quorum.cpp`:
- Around line 321-329: NetQuorum::InitializeCurrentBlockTip currently calls
UpdatedBlockTip(tip, nullptr, ibd) and then unconditionally repeats a
quorum-connection scan with a loop over Params().GetConsensus().llmqs calling
CheckQuorumConnections(params, tip); remove that duplicate pass: delete the
for-loop (or at minimum guard it so it does not run during IBD and does not run
when UpdatedBlockTip already performed the scan) so quorum connections are not
repeated or opened during initial block download; keep UpdatedBlockTip, tip and
ibd handling intact.
- Around line 145-160: The combined misbehaviour check incorrectly re-enables
scoring for ENCRYPTED_CONTRIBUTIONS_MISSING because misbehave_contrib (from
ProcessContribQGETDATA) can be true even when sendQDATA suppressed scoring;
update the limit-exceeded branch to ignore misbehave_contrib when the effective
error is CQuorumDataRequest::Errors::ENCRYPTED_CONTRIBUTIONS_MISSING (or when
request.GetError()/ret_err equals that value) so that (misbehave_contrib ||
misbehave_qdata) only triggers PeerMisbehaving if the error is not
ENCRYPTED_CONTRIBUTIONS_MISSING; adjust the same logic in the other occurrence
(around the block that includes ProcessContribQGETDATA, sendQDATA,
request_limit_exceeded and m_peer_manager->PeerMisbehaving) to ensure
ENCRYPTED_CONTRIBUTIONS_MISSING never causes PeerMisbehaving.

In `@src/llmq/quorumsman.h`:
- Around line 53-63: this header currently uses the llmq::QuorumRole
pointer (m_handler) and accepts a QuorumRole in ConnectManagers() before the
QuorumRole class is declared, creating an include-order dependency; fix this by
adding a forward declaration for namespace llmq { class QuorumRole; } near the
top of quorumsman.h (before the CQuorumManager declaration) so CQuorumManager,
its member m_handler and its ConnectManagers() declaration compile without
requiring the full QuorumRole definition.

---

Nitpick comments:
In `@src/llmq/net_quorum.h`:
- Around line 45-46: Update the ownership comment for NetQuorum to note that
although it is owned exclusively by PeerManagerImpl via AddExtraHandler, it is
also exposed via RegisterValidationInterface by
PeerManagerImpl::AddExtraHandler, creating an additional non-owning reference
path until StopHandlers() calls unregister; mention this to clarify the lifetime
guarantees around m_role and m_nodeman and that they remain valid until the
validation interface is unregistered.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 3efa6f52-0a22-4cae-9543-3862cd739f6a

📥 Commits

Reviewing files that changed from the base of the PR and between 98399b4 and 8311e2e.

📒 Files selected for processing (29)
  • src/Makefile.am
  • src/active/context.cpp
  • src/active/context.h
  • src/active/dkgsession.cpp
  • src/active/dkgsessionhandler.cpp
  • src/active/quorums.cpp
  • src/active/quorums.h
  • src/chainlock/handler.cpp
  • src/dsnotificationinterface.cpp
  • src/dsnotificationinterface.h
  • src/init.cpp
  • src/llmq/dkgsession.cpp
  • src/llmq/dkgsession.h
  • src/llmq/net_quorum.cpp
  • src/llmq/net_quorum.h
  • src/llmq/observer.cpp
  • src/llmq/observer.h
  • src/llmq/observer/quorums.cpp
  • src/llmq/observer/quorums.h
  • src/llmq/quorums.h
  • src/llmq/quorumsman.cpp
  • src/llmq/quorumsman.h
  • src/net_processing.cpp
  • src/node/context.cpp
  • src/rpc/quorums.cpp
  • src/validationinterface.cpp
  • src/validationinterface.h
  • test/lint/lint-circular-dependencies.py
  • test/util/data/non-backported.txt
💤 Files with no reviewable changes (5)
  • test/util/data/non-backported.txt
  • src/llmq/observer/quorums.h
  • src/llmq/observer/quorums.cpp
  • src/active/quorums.cpp
  • src/active/quorums.h
✅ Files skipped from review due to trivial changes (3)
  • src/chainlock/handler.cpp
  • src/node/context.cpp
  • test/lint/lint-circular-dependencies.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/rpc/quorums.cpp
  • src/llmq/quorums.h

src/init.cpp Outdated
Comment on lines +2427 to +2430
{
const CBlockIndex* tip = WITH_LOCK(::cs_main, return chainman.ActiveTip());
const bool ibd = chainman.ActiveChainstate().IsInitialBlockDownload();
if (node.observer_ctx && !node.active_ctx) {
node.observer_ctx->InitializeCurrentBlockTip(tip, ibd);
}
// Note: active_ctx initialization is deferred until after nodeman->Init()
// so that GetProTxHash() is available for quorum connection setup.
GetMainSignals().InitializeCurrentBlockTip(tip, ibd);

@coderabbitai coderabbitai bot Mar 31, 2026


⚠️ Potential issue | 🟠 Major

Drain the validation queue before this startup replay.

ThreadImport() can still leave earlier BlockConnected/UpdatedBlockTip work queued here. Broadcasting InitializeCurrentBlockTip() before that queue drains lets subscribers observe the final tip first, with the stale callbacks replayed afterwards.

Suggested fix
         {
+            SyncWithValidationInterfaceQueue();
             const CBlockIndex* tip = WITH_LOCK(::cs_main, return chainman.ActiveTip());
             const bool ibd = chainman.ActiveChainstate().IsInitialBlockDownload();
             GetMainSignals().InitializeCurrentBlockTip(tip, ibd);
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/init.cpp` around lines 2427 - 2430, Drain the validation callback queue
before broadcasting the current tip: after obtaining tip and ibd (the lines
using WITH_LOCK(::cs_main) and
chainman.ActiveChainstate().IsInitialBlockDownload()), call
SyncWithValidationInterfaceQueue() to ensure any BlockConnected/UpdatedBlockTip
work left by ThreadImport() has been processed, then call
GetMainSignals().InitializeCurrentBlockTip(tip, ibd); this guarantees
subscribers see the final tip after older callbacks have run.

Collaborator Author


it's okay if tip is changed after; but it must be updated after nodeman is init.


@knst, understood — the design is intentional. The concern about queued BlockConnected/UpdatedBlockTip callbacks from ThreadImport() racing with the InitializeCurrentBlockTip() broadcast is not an issue since subscribers are designed to handle tip updates that arrive later. The only hard ordering constraint is that the call must occur after nodeman->Init() (for the active_ctx path) so that GetProTxHash() returns a valid value when quorum connections are set up. Withdrawing the suggestion.


✏️ Learnings added
Learnt from: knst
Repo: dashpay/dash PR: 7247
File: src/init.cpp:2432-2435
Timestamp: 2026-04-01T13:12:12.362Z
Learning: In `src/init.cpp` (PR `#7247`), the two `GetMainSignals().InitializeCurrentBlockTip(tip, ibd)` calls inside the `loadblk` thread are intentionally placed as follows:
1. A first call fires before `node.active_ctx->nodeman->Init()` — covers the general/observer case.
2. A second call fires immediately after `node.active_ctx->nodeman->Init()` inside `if (node.active_ctx)` — this is required so `GetProTxHash()` (set by `nodeman->Init()`) is valid when quorum connection setup runs.
It is acceptable that the tip may change after these calls; no `SyncWithValidationInterfaceQueue()` fence is required before them. The critical invariant is ordering relative to `nodeman->Init()`, not relative to the validation-interface queue.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: knst
Repo: dashpay/dash PR: 7247
File: src/llmq/quorumsman.cpp:0-0
Timestamp: 2026-04-01T12:54:47.684Z
Learning: In `src/llmq/quorumsman.cpp`, the `QuorumRole::DataRecoveryThread` loop intentionally increments `nTries` before calling `m_qman.IsDataRequestPending()` and uses `continue` without a sleep when the request is already pending. This is by design: the loop cycles through `vecMemberHashes` to find a member without an outstanding request, not a bug. This is moved code from the previous `QuorumParticipant`/`QuorumObserver` implementation (PR `#7247`).

Learnt from: knst
Repo: dashpay/dash PR: 7247
File: src/llmq/observer.cpp:0-0
Timestamp: 2026-04-01T13:08:07.207Z
Learning: In `src/llmq/net_quorum.cpp`, the QGETDATA handler has an explicit switch statement that suppresses misbehavior scoring for `CQuorumDataRequest::Errors::ENCRYPTED_CONTRIBUTIONS_MISSING` and `QUORUM_VERIFICATION_VECTOR_MISSING`. Even if `ObserverContext::ProcessContribQGETDATA` in `src/llmq/observer.cpp` returns a `MisbehavingError`, the caller in `net_quorum.cpp` determines misbehavior by inspecting the error code set on the request object, not the return value of the processing method. The switch explicitly does NOT set `misbehave = request_limit_exceeded` for these two error cases.


I don't see a current-SHA correctness issue here. The startup replay ordering is intentionally handled around nodeman->Init(), and I did not find evidence that an added SyncWithValidationInterfaceQueue() is required for correctness in this refactor.


Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!


No further action needed here. This follow-up already withdraws the queue-drain suggestion and reflects the intended startup ordering on the current SHA.

Comment on lines +72 to +75
if (msg_type == NetMsgType::QGETDATA) {
if (!m_role || !m_role->IsMasternode() || (pfrom.GetVerifiedProRegTxHash().IsNull() && !pfrom.qwatch)) {
m_peer_manager->PeerMisbehaving(pfrom.GetId(), 10, "not a verified masternode or a qwatch connection");
return;

@coderabbitai coderabbitai bot Mar 31, 2026


⚠️ Potential issue | 🟠 Major

Allow watch-only nodes to answer QGETDATA.

In watch mode init.cpp wires an ObserverContext into NetQuorum, and that role is not a masternode. This guard makes every watcher score inbound quorum-data requests instead of serving verification vectors; the m_nodeman == nullptr case is already handled lower down.

Suggested fix
-        if (!m_role || !m_role->IsMasternode() || (pfrom.GetVerifiedProRegTxHash().IsNull() && !pfrom.qwatch)) {
+        const bool can_serve_quorum_data = m_role && (m_role->IsMasternode() || m_role->IsWatching());
+        if (!can_serve_quorum_data || (pfrom.GetVerifiedProRegTxHash().IsNull() && !pfrom.qwatch)) {
             m_peer_manager->PeerMisbehaving(pfrom.GetId(), 10, "not a verified masternode or a qwatch connection");
             return;
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmq/net_quorum.cpp` around lines 72 - 75, The QGETDATA guard currently
rejects when m_role is null (observer/watch mode) causing qwatch connections to
be mis-scored; change the conditional logic in the NetMsgType::QGETDATA handling
so that qwatch peers are allowed to proceed regardless of m_role (i.e. only
enforce the m_role/IsMasternode/GetVerifiedProRegTxHash checks for non-qwatch
connections). Update the condition around m_peer_manager->PeerMisbehaving call
(referencing NetMsgType::QGETDATA, m_role, IsMasternode(),
pfrom.GetVerifiedProRegTxHash(), and pfrom.qwatch) so watch-only observers are
not penalized; leave the existing m_nodeman nullptr handling further down
unchanged.


I couldn't validate this as a regression in the current SHA. The old CQuorumManager::ProcessMessage() path also gated QGETDATA on IsMasternode(), so observer/watch nodes were not previously serving quorum data here; this refactor preserves that behavior rather than newly breaking it.


🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check git log to find the removed CQuorumManager::ProcessMessage QGETDATA guard
git log --oneline -10 HEAD
git diff HEAD~1 HEAD -- src/llmq/quorumsman.cpp | grep -A 20 "QGETDATA" | head -60

Length of output: 942


🏁 Script executed:

#!/bin/bash
# Find the commit that moved ProcessMessage from CQuorumManager to NetQuorum
git show 1be8863c7 -- src/llmq/quorumsman.cpp | grep -A 30 "QGETDATA" | head -80

Length of output: 4278


@thepastaclaw You're correct — I've now confirmed from the git history that the old CQuorumManager::ProcessMessage() used the same !IsMasternode() guard for QGETDATA:

if (!IsMasternode() || (pfrom.GetVerifiedProRegTxHash().IsNull() && !pfrom.qwatch)) {
    return MisbehavingError{10, "not a verified masternode or a qwatch connection"};
}

The new net_quorum.cpp code faithfully preserves this behavior. Notably, the old code did allow IsWatching() for the QDATA path, but QGETDATA (serving quorum data) was always masternode-only. My original comment was wrong — this is not a regression. Apologies for the noise!


🧠 Learnings used
Learnt from: UdjinM6
Repo: dashpay/dash PR: 7265
File: src/wallet/wallet.cpp:3381-3382
Timestamp: 2026-04-03T14:17:17.656Z
Learning: In Dash Core, `strprintf()` uses `tinyformat` (type-safe printf). Do not flag format-specifier mismatches for integer types when using `strprintf()`: `%d` is correct for any integer type, including `CAmount`/`int64_t` arguments. Only treat `%d`/integer combinations as a formatting issue if they occur in a non-`strprintf()` (e.g., libc `printf`) context or there is evidence the format string is not being handled by `tinyformat`.

@github-actions

This pull request has conflicts, please rebase.

@knst knst force-pushed the refactor-llmq-observer-parent branch from 8311e2e to e2df648 Compare March 31, 2026 14:54
@thepastaclaw

thepastaclaw commented Mar 31, 2026

✅ Review complete (commit 21fe75f)


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e2df648675

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +64 to +67
void NetQuorum::Stop()
{
workerPool.clear_queue();
workerPool.stop(true);

P1 Badge Interrupt quorum workers before stopping NetQuorum

PrepareShutdown() invokes PeerManager::StopHandlers() directly, and this Stop() implementation no longer signals quorumThreadInterrupt before joining the pool. Recovery jobs in DataRecoveryThread wait/sleep on that interrupt, so workerPool.stop(true) can block for a long time (or indefinitely while waiting for sync conditions), which turns shutdown into a hang. The pre-refactor quorum observer stop path interrupted first, so this is a regression in teardown behavior.

Useful? React with 👍 / 👎.
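The interrupt-before-join pattern at issue here can be sketched in isolation. All names below (ThreadInterrupt, RecoveryLoop) are illustrative stand-ins, not the actual CThreadInterrupt or worker-pool API; the point is that a join hangs until something signals the token the sleeping worker waits on:

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for an interruptible-sleep token.
class ThreadInterrupt {
public:
    void operator()() { // signal: wake all sleepers
        {
            std::lock_guard<std::mutex> lk(m);
            flag = true;
        }
        cv.notify_all();
    }
    explicit operator bool() const { return flag.load(); }
    // Returns false as soon as the token is signalled, true after a full sleep.
    bool sleep_for(std::chrono::milliseconds dur) {
        std::unique_lock<std::mutex> lk(m);
        return !cv.wait_for(lk, dur, [this] { return flag.load(); });
    }
private:
    std::condition_variable cv;
    std::mutex m;
    std::atomic<bool> flag{false};
};

// Worker mimicking a recovery task: sleeps on the token between passes.
void RecoveryLoop(ThreadInterrupt& interrupt, std::atomic<int>& passes) {
    while (!interrupt) {
        ++passes;
        // Would block for an hour if Stop() joined without signalling first.
        interrupt.sleep_for(std::chrono::hours(1));
    }
}
```

Under this sketch, a Stop() that only joins would hang for up to an hour; signalling the token first makes the join return promptly.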

Collaborator Author


called here:

void PeerManagerImpl::InterruptHandlers()
{   
    for (auto& handler : m_handlers) {
        handler->Interrupt();
    }
}

const CBlockIndex* tip = WITH_LOCK(::cs_main, return chainman.ActiveTip());
const bool ibd = chainman.ActiveChainstate().IsInitialBlockDownload();
node.active_ctx->InitializeCurrentBlockTip(tip, ibd);
GetMainSignals().InitializeCurrentBlockTip(tip, ibd);

P2 Badge Reinitialize active/observer tip state with UpdatedBlockTip

This now broadcasts only InitializeCurrentBlockTip, but ActiveContext/ObserverContext no longer implement that hook after the refactor, so their startup UpdatedBlockTip initialization path is skipped. As a result, quorum/DKG-related state in those contexts is not seeded on restart and remains stale until a later real tip update arrives, which can leave post-boot behavior inconsistent for a synced node.

Useful? React with 👍 / 👎.


@thepastaclaw thepastaclaw left a comment


Code Review

Largely mechanical refactor that successfully flattens the quorum class hierarchy. Two issues surfaced: NetQuorum::Stop() no longer signals the thread interrupt before joining worker threads, which can stall shutdown when recovery/cleanup tasks are sleeping; and the startup path now broadcasts InitializeCurrentBlockTip twice in masternode mode, replaying validation callbacks for all subscribers rather than just the LLMQ component.

Reviewed commit: e2df648

🔴 1 blocking | 🟡 1 suggestion(s)

🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `src/llmq/net_quorum.cpp`:
- [BLOCKING] lines 64-68: NetQuorum::Stop() doesn't signal thread interrupt before joining workers
  `Stop()` calls `workerPool.clear_queue()` then `workerPool.stop(true)` (join), but never signals `quorumThreadInterrupt`. Worker tasks like `DataRecoveryThread()` and `StartCleanupOldQuorumDataThread()` sleep on `quorumThreadInterrupt` (lines 500-501, 547, 579). If any of those tasks are active when `Stop()` is called, the `stop(true)` join blocks indefinitely because the sleeping thread never wakes up.

The `Interrupt()` method at line 62 does signal `quorumThreadInterrupt()`, but the shutdown path through `PrepareShutdown() → PeerManagerImpl::StopHandlers()` calls `Stop()` without necessarily calling `Interrupt()` first. The pre-refactor code had `QuorumObserver::Stop()` trigger the interrupt before joining.

Fix: call `quorumThreadInterrupt()` at the start of `Stop()`.

In `src/init.cpp`:
- [SUGGESTION] line 2512: InitializeCurrentBlockTip broadcast twice in masternode mode — replays callbacks for all subscribers
  `GetMainSignals().InitializeCurrentBlockTip(tip, ibd)` is now called twice in masternode mode: once after `ThreadImport()` (line ~2435) and again after `nodeman->Init()` (line 2512). The second call re-invokes every registered subscriber's `InitializeCurrentBlockTip()`, not just the LLMQ component that needs the `proTxHash`. `CDSNotificationInterface::InitializeCurrentBlockTip()` replays both `SynchronousUpdatedBlockTip()` and `UpdatedBlockTip()` on the same tip, so masternode startup now processes those callbacks twice.

The pre-refactor code targeted only the quorum-specific component after `nodeman->Init()`. Consider either re-initializing only the LLMQ/NetQuorum subscriber, or adding an idempotence guard so non-LLMQ listeners don't reprocess the same tip.

Comment on lines +64 to +68
void NetQuorum::Stop()
{
workerPool.clear_queue();
workerPool.stop(true);
}

🔴 Blocking: NetQuorum::Stop() doesn't signal thread interrupt before joining workers

Stop() calls workerPool.clear_queue() then workerPool.stop(true) (join), but never signals quorumThreadInterrupt. Worker tasks like DataRecoveryThread() and StartCleanupOldQuorumDataThread() sleep on quorumThreadInterrupt (lines 500-501, 547, 579). If any of those tasks are active when Stop() is called, the stop(true) join blocks indefinitely because the sleeping thread never wakes up.

The Interrupt() method at line 62 does signal quorumThreadInterrupt(), but the shutdown path through PrepareShutdown() → PeerManagerImpl::StopHandlers() calls Stop() without necessarily calling Interrupt() first. The pre-refactor code had QuorumObserver::Stop() trigger the interrupt before joining.

Fix: call quorumThreadInterrupt() at the start of Stop().

💡 Suggested change
 void NetQuorum::Stop()
 {
+    quorumThreadInterrupt();
     workerPool.clear_queue();
     workerPool.stop(true);
 }

source: ['codex']

🤖 Fix this with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `src/llmq/net_quorum.cpp`:
- [BLOCKING] lines 64-68: NetQuorum::Stop() doesn't signal thread interrupt before joining workers
  `Stop()` calls `workerPool.clear_queue()` then `workerPool.stop(true)` (join), but never signals `quorumThreadInterrupt`. Worker tasks like `DataRecoveryThread()` and `StartCleanupOldQuorumDataThread()` sleep on `quorumThreadInterrupt` (lines 500-501, 547, 579). If any of those tasks are active when `Stop()` is called, the `stop(true)` join blocks indefinitely because the sleeping thread never wakes up.

The `Interrupt()` method at line 62 does signal `quorumThreadInterrupt()`, but the shutdown path through `PrepareShutdown() → PeerManagerImpl::StopHandlers()` calls `Stop()` without necessarily calling `Interrupt()` first. The pre-refactor code had `QuorumObserver::Stop()` trigger the interrupt before joining.

Fix: call `quorumThreadInterrupt()` at the start of `Stop()`.

Collaborator Author


called here:

void PeerManagerImpl::InterruptHandlers()
{   
    for (auto& handler : m_handlers) {
        handler->Interrupt();
    }
}

const CBlockIndex* tip = WITH_LOCK(::cs_main, return chainman.ActiveTip());
const bool ibd = chainman.ActiveChainstate().IsInitialBlockDownload();
node.active_ctx->InitializeCurrentBlockTip(tip, ibd);
GetMainSignals().InitializeCurrentBlockTip(tip, ibd);

🟡 Suggestion: InitializeCurrentBlockTip broadcast twice in masternode mode — replays callbacks for all subscribers

GetMainSignals().InitializeCurrentBlockTip(tip, ibd) is now called twice in masternode mode: once after ThreadImport() (line ~2435) and again after nodeman->Init() (line 2512). The second call re-invokes every registered subscriber's InitializeCurrentBlockTip(), not just the LLMQ component that needs the proTxHash. CDSNotificationInterface::InitializeCurrentBlockTip() replays both SynchronousUpdatedBlockTip() and UpdatedBlockTip() on the same tip, so masternode startup now processes those callbacks twice.

The pre-refactor code targeted only the quorum-specific component after nodeman->Init(). Consider either re-initializing only the LLMQ/NetQuorum subscriber, or adding an idempotence guard so non-LLMQ listeners don't reprocess the same tip.

source: ['codex']


@knst knst requested a review from thepastaclaw April 1, 2026 13:15

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9f6581ca0e

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +2220 to +2223

```cpp
llmq::QuorumRole* quorum_role = node.active_ctx ? static_cast<llmq::QuorumRole*>(node.active_ctx.get())
                                                : static_cast<llmq::QuorumRole*>(node.observer_ctx.get());
auto net_quorum = std::make_unique<llmq::NetQuorum>(
    node.peerman.get(), *node.llmq_ctx->bls_worker, *node.connman, *node.dmnman,
```

P2: Skip NetQuorum setup when no quorum role is active

When the node is neither an active masternode nor started with -watchquorums, quorum_role is null here but NetQuorum is still added as a handler. PeerManagerImpl::StartHandlers() then always calls NetQuorum::Start(), which unconditionally creates a q-mngr thread pool (workerPool.resize(m_worker_count), up to 4 threads by default). Before this refactor, quorum worker threads were only started in active/watch modes, so this introduces avoidable thread/memory overhead for normal nodes.


Collaborator Author


resolved, see 8bc5744


@thepastaclaw thepastaclaw left a comment


Code Review

Thanks for addressing the prior review comments. The forward declaration and ~QuorumRole() = default changes are correct, and knst's clarifications about the shutdown ordering (Interrupt before Stop), the intentional recovery-loop behavior, and the ENCRYPTED_CONTRIBUTIONS_MISSING switch handling all check out.

However, the latest commit (9f6581ca) introduces a build break and CI is red across all platforms.

Reviewed commit: 9f6581c

🔴 1 blocking

🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `src/llmq/net_quorum.cpp`:
- [BLOCKING] lines 587, 620: `pQuorum->validMembers` does not exist — should be `pQuorum->qc->validMembers`
  The latest commit added `if (pQuorum->validMembers.empty()) return;` in `StartVvecSyncThread` (line 587) and `StartSkShareRecoveryThread` (line 620). `CQuorum` does not have a `validMembers` member; `validMembers` lives in `CFinalCommitment` accessed via `pQuorum->qc->validMembers`. CI is failing on all platforms with: `error: no member named 'validMembers' in 'llmq::CQuorum'`.

Fix: `pQuorum->validMembers` → `pQuorum->qc->validMembers` in both locations.

@knst knst force-pushed the refactor-llmq-observer-parent branch from 9f6581c to 8bc5744 Compare April 1, 2026 15:54
@knst knst requested a review from thepastaclaw April 1, 2026 17:40

@thepastaclaw thepastaclaw left a comment


Code Review

Verified 10 agent findings against source code on PR #7247 (SHA 55d8025). The convergent InitializeCurrentBlockTip finding is partially valid but lower severity than claimed — NetQuorum::InitializeCurrentBlockTip (net_quorum.cpp:322) properly handles quorum connection initialization at startup; the gap is only that ActiveContext/ObserverContext sub-components miss a tip-based initialization call on idle chains, which self-heals on the next block. The TransactionRemovedFromMempool finding is invalid — no behavioral change. The NetQuorum::Stop deadlock finding is invalid — no circular lock dependency exists. Several minor/cosmetic items confirmed.

Reviewed commit: 55d8025

🟡 3 suggestion(s) | 💬 1 nitpick(s)

3 additional findings

🟡 suggestion: ActiveContext/ObserverContext sub-components miss startup initialization on idle chains

src/active/context.cpp (line 1)

ActiveContext and ObserverContext don't override InitializeCurrentBlockTip (validationinterface.h:121 provides no-op default). NetQuorum::InitializeCurrentBlockTip (net_quorum.cpp:322-331) correctly handles quorum connections and data recovery at startup. However, ActiveContext::UpdatedBlockTip (context.cpp:98-107) calls nodeman->UpdatedBlockTip, ehf_sighandler->UpdatedBlockTip, gov_signer->UpdatedBlockTip, and qdkgsman->UpdatedBlockTip — none of these receive a startup kick on idle chains. These components will catch up when the next block arrives, so this is a minor gap rather than a functional break. The original fix commit (5e79dbe) added InitializeCurrentBlockTip to ActiveContext specifically for quorum connections — that concern is now handled by NetQuorum. Consider whether qdkgsman/ehf_sighandler need an explicit startup call or if next-block catchup is sufficient.

🟡 suggestion: InitializeCurrentBlockTip called twice — intentional but worth a comment clarification

src/init.cpp (line 1)

GetMainSignals().InitializeCurrentBlockTip() is called at init.cpp:2435 (after ThreadImport) and again at init.cpp:2512 (after nodeman->Init). The second call is guarded by if (node.active_ctx) and documented in comments (init.cpp:2506-2509). This is intentional: the first call runs before the masternode's proTxHash is set by nodeman->Init, so quorum connections skip masternodes with null proTxHash. The second call reinitializes with the correct proTxHash.

🟡 suggestion: Consider asserting m_quorum_base_block_index in CQuorum constructor

src/llmq/quorums.cpp (line 1)

CQuorum constructor (quorums.cpp:103-114) asserts qc != nullptr (line 113, added in HEAD commit) but doesn't assert m_quorum_base_block_index. The member is initialized from the constructor parameter _pQuorumBaseBlockIndex (line 108) which could be nullptr. All current callers appear to provide valid indices, and quorums.h:175 has a default of nullptr for the member. An assertion would be a safety belt consistent with the existing qc assertion.


```diff
@@ -0,0 +1,712 @@
+// Copyright (c) 2025-2026 The Dash Core developers
```


💬 Nitpick: Missing space between && and std::ranges

At net_quorum.cpp:456: is_masternode &&std::ranges::any_of(...) — missing space after &&. Should be is_masternode && std::ranges::any_of(...).

source: ['claude']

kwvg previously approved these changes Apr 2, 2026
Collaborator

@kwvg kwvg left a comment


utACK 55d8025


UdjinM6 commented Apr 4, 2026

pls consider 343d393 and maybe 90cf7d3

Collaborator Author

knst commented Apr 6, 2026

pls consider 343d393

```cpp
// On masternodes, defer the full broadcast until after
// nodeman->Init() so that GetProTxHash() is available
// for quorum connection setup and skShare derivation.
// Only kick CDSNotificationInterface here (cached block
// height for DS/MN payments/budgets).
```

If I understand correctly, with a fixed m_operator_pk the proTxHash of the current masternode could change not only during startup of Dash Core but also on tip updates (if the masternode list is updated). Does that mean we actually need to invalidate all caches of all quorums when the proTxHash changes? @UdjinM6
I can't think of a real-world scenario at the moment, because it seems that changing the proTxHash affects only new quorums, not existing ones.

I applied patch from your fix.

@knst knst requested a review from kwvg April 6, 2026 07:57
UdjinM6 and others added 2 commits April 8, 2026 14:59
On masternodes the first InitializeCurrentBlockTip (before
nodeman->Init) now reaches NetQuorum, which triggers ScanQuorums
and BuildQuorumContributions while GetProTxHash() is still null.
This causes SetQuorumSecretKeyShare to fail, caching the quorum
without skShare.  The second call after nodeman->Init hits the
cache and never rebuilds, leaving the node reliant on network
recovery for its own quorum keys.

Fix by only calling g_ds_notification_interface directly in the
first pass on masternodes, deferring the full broadcast to after
nodeman->Init.  Non-masternode nodes broadcast normally.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ety assert

In TriggerQuorumDataRecoveryThreads, avoid calling GetProTxHash() in
observer mode where it returns a null hash.  Also adds missing const
qualifier and fixes operator spacing.

Assert m_quorum_base_block_index in CQuorum constructor, consistent
with the existing qc != nullptr assertion.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@knst knst force-pushed the refactor-llmq-observer-parent branch from e9b93e3 to 21fe75f Compare April 8, 2026 07:59

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 21fe75f0bc


```cpp
const CBlockIndex* tip = WITH_LOCK(::cs_main, return chainman.ActiveTip());
const bool ibd = chainman.ActiveChainstate().IsInitialBlockDownload();
node.active_ctx->InitializeCurrentBlockTip(tip, ibd);
GetMainSignals().InitializeCurrentBlockTip(tip, ibd);
```

P2: Initialize quorum contexts via UpdatedBlockTip on startup

Switching startup seeding to GetMainSignals().InitializeCurrentBlockTip(...) here skips ActiveContext/ObserverContext initialization logic, because neither context overrides InitializeCurrentBlockTip after this refactor, while their UpdatedBlockTip handlers still perform qdkgsman state updates. On a restart where no new tip arrives immediately, DKG/quorum state is left stale until a later block update, which can delay or break expected quorum behavior right after boot.



@thepastaclaw thepastaclaw left a comment


Code Review

I validated the current SHA and found one substantive regression in the startup replay path: the refactor replaced the quorum contexts' explicit InitializeCurrentBlockTip() hooks with broad GetMainSignals().InitializeCurrentBlockTip(...) broadcasts, but ActiveContext/ObserverContext no longer override that hook, so their DKG/session state is not seeded on restart until another block arrives. I did not confirm any of the live CodeRabbit inline findings against the current code; the remaining bot comments are either already fixed on this SHA, refer to moved code, or were already retracted in-thread.

I validated existing CodeRabbit comments separately in-thread instead of restating them here.

Reviewed commit: 21fe75f

🔴 1 blocking

🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `src/init.cpp`:
- [BLOCKING] lines 2443-2445: Startup replay no longer seeds the quorum contexts with the current tip
  Before this refactor, both `ActiveContext` and `ObserverContext` had explicit `InitializeCurrentBlockTip(tip, ibd)` methods that replayed `UpdatedBlockTip(tip, nullptr, ibd)` on startup and then initialized quorum connections. In the current SHA, init now broadcasts `GetMainSignals().InitializeCurrentBlockTip(tip, ibd)` instead (`src/init.cpp:2443-2445`, repeated after `nodeman->Init()` at `2515-2523`), but the quorum contexts no longer override `InitializeCurrentBlockTip()` and only implement `UpdatedBlockTip()`. As a result, the startup replay never reaches `qdkgsman->UpdatedBlockTip(...)`, so DKG/session state stays at defaults until another block arrives. That leaves restarted masternodes and observer/watch nodes out of sync with any in-progress DKG round immediately after startup.

Comment on lines +2443 to +2445

```cpp
// Non-masternode nodes (including observer-only): broadcast
// to all subscribers now; no proTxHash dependency.
GetMainSignals().InitializeCurrentBlockTip(tip, ibd);
```

🔴 Blocking: Startup replay no longer seeds the quorum contexts with the current tip


💡 Suggested change: reintroduce `InitializeCurrentBlockTip()` on the quorum contexts (or call an equivalent explicit startup hook) so startup still replays `UpdatedBlockTip()` into `qdkgsman` before relying on later tip events.

source: ['codex-general', 'manual-general']




4 participants