@@ -26,6 +26,8 @@
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
+import java.util.concurrent.TimeUnit;
+import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicReference;
import java.util.stream.Collectors;

@@ -239,8 +241,20 @@ public String getLeaderGrpcAddress() throws ExecutionException, InterruptedExcep
waitingForLeader(10000);
}

-        return raftRpcClient.getGrpcAddress(raftNode.getLeaderId().getEndpoint().toString()).get()
-                .getGrpcAddress();
+        try {
+            RaftRpcProcessor.GetMemberResponse response = raftRpcClient
+                    .getGrpcAddress(raftNode.getLeaderId().getEndpoint().toString())
Member:

⚠️ waitingForLeader(10000) does not guarantee that leaderId is non-null; it just waits up to the timeout and returns whatever it has. That means both the RPC call and this fallback path can still dereference raftNode.getLeaderId() and reintroduce an NPE during leader-election gaps.

Could we cache the leader once after waiting and turn the "leader still unknown" case into a controlled exception instead?

Suggested change:

-                    .getGrpcAddress(raftNode.getLeaderId().getEndpoint().toString())
+            PeerId leader = raftNode.getLeaderId();
+            if (leader == null) {
+                throw new ExecutionException(new IllegalStateException("Leader is not ready"));
+            }

Then reuse leader for both the RPC request and any follow-up handling.
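The pattern the reviewer is asking for can be sketched stand-alone. Note that `LeaderView` and `LeaderAddressResolver` below are hypothetical stand-ins for `raftNode` and the surrounding method, invented for illustration; only the "capture the leader once, fail fast when unknown" shape is taken from the comment:

```java
import java.util.Optional;
import java.util.concurrent.ExecutionException;

// Hypothetical stand-in for the raft node and getLeaderGrpcAddress(), used
// only to illustrate "capture the leader once, fail fast when unknown".
final class LeaderAddressResolver {
    interface LeaderView {
        Optional<String> leaderEndpoint(); // stand-in for raftNode.getLeaderId()
    }

    private final LeaderView view;

    LeaderAddressResolver(LeaderView view) {
        this.view = view;
    }

    String resolve() throws ExecutionException {
        // Read the leader exactly once; both the RPC path and any fallback
        // reuse this local value instead of re-dereferencing the node state.
        String leader = view.leaderEndpoint()
                .orElseThrow(() -> new ExecutionException(
                        new IllegalStateException("Leader is not ready")));
        return leader; // the real code would issue the RPC against `leader`
    }
}
```

Because the null case becomes a checked `ExecutionException`, callers handle "no leader yet" the same way they already handle RPC failures, instead of catching an NPE.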

+                    .get(config.getRpcTimeout(), TimeUnit.MILLISECONDS);
Comment on lines +245 to +247

Copilot AI (Mar 9, 2026):

waitingForLeader(10000) is called when raftNode.getLeaderId() is null, but the result isn't used/rechecked. If the leader still isn't elected after the wait, the subsequent raftNode.getLeaderId().getEndpoint() (and the fallback path) will still throw an NPE. Capture the leader into a local variable after waiting and handle the null case explicitly (e.g., throw a clear exception or return a sentinel that callers can handle).
+            if (response != null && response.getGrpcAddress() != null) {
+                return response.getGrpcAddress();
+            }
+        } catch (TimeoutException | ExecutionException e) {
+            log.warn("Failed to get leader gRPC address via RPC, falling back to endpoint derivation", e);
Copilot AI (Mar 9, 2026):

This log.warn(..., e) will likely be hit on every follower redirect in environments where the RPC path is consistently broken/blocked (the scenario described in #2959), which can flood logs and add overhead due to repeated stack traces. Consider reducing verbosity (e.g., warn without stack trace, debug with stack trace, or rate-limit) and include key context like leader endpoint/timeout in the message.

Suggested change:

-            log.warn("Failed to get leader gRPC address via RPC, falling back to endpoint derivation", e);
+            String leaderEndpoint = raftNode.getLeaderId() != null
+                    ? String.valueOf(raftNode.getLeaderId().getEndpoint())
+                    : "unknown";
+            long timeoutMs = config.getRpcTimeout();
+            log.warn(
+                    "Failed to get leader gRPC address via RPC, falling back to endpoint derivation: " +
+                    "leaderEndpoint={}, timeoutMs={}, errorType={}, errorMessage={}",
+                    leaderEndpoint, timeoutMs, e.getClass().getSimpleName(), e.getMessage());
+            log.debug("Stack trace for failed getLeaderGrpcAddress RPC", e);
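One way to realize the "rate-limit" option mentioned above is a small throttle that lets a hot failure path emit the expensive warn at most once per window. This `WarnThrottle` class is a hypothetical sketch, not code from this PR:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper: permits an action (e.g. a log.warn with stack trace)
// at most once per time window; callers skip the noisy log otherwise.
final class WarnThrottle {
    private final long windowMillis;
    private final AtomicLong nextAllowed = new AtomicLong(Long.MIN_VALUE);

    WarnThrottle(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    /** Returns true at most once per window, even under concurrent callers. */
    boolean shouldLog(long nowMillis) {
        long next = nextAllowed.get();
        // CAS so only one thread wins the logging slot for this window.
        return nowMillis >= next
                && nextAllowed.compareAndSet(next, nowMillis + windowMillis);
    }
}
```

A caller would then guard the loud path, e.g. `if (throttle.shouldLog(System.currentTimeMillis())) log.warn(..., e); else log.debug(..., e);`, keeping the stack trace available at debug level without flooding warn-level logs.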
+        }
+
+        // Fallback: derive from raft endpoint IP + local gRPC port (best effort)
+        String leaderIp = raftNode.getLeaderId().getEndpoint().getIp();
+        return leaderIp + ":" + config.getGrpcPort();
Member:

‼️ This fallback is still incorrect for clusters where PD nodes use different grpc.port values. In this repo's own multi-node test configs, application-server1.yml, application-server2.yml, and application-server3.yml advertise 8686, 8687, and 8688 respectively, so a follower on 8687 will redirect to leader-ip:8687 even when the elected leader is actually listening on 8686 or 8688. That turns the original NPE into a silent misroute.

If we can't recover the leader's advertised gRPC endpoint here, I think it's safer to fail fast than to synthesize an address from the local port, for example:

Suggested change:

-        return leaderIp + ":" + config.getGrpcPort();
+        } catch (TimeoutException | ExecutionException e) {
+            throw new ExecutionException(
+                    String.format("Failed to resolve leader gRPC address for %s", raftNode.getLeaderId()),
+                    e);
+        }

A more complete fix would need a source of truth for the leader's actual grpcAddress, not the local node's port.
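One shape such a source of truth could take is a registry in which every node announces the gRPC address it actually advertises, so followers look the leader up instead of synthesizing an address from their own port. The `MemberAddressRegistry` below is a hypothetical sketch (the class and its announce/lookup API are invented for illustration, and replicating the map across PD nodes is out of scope):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: map each raft peer to the gRPC address it advertises,
// so a follower never guesses the leader's port from its own config.
final class MemberAddressRegistry {
    private final Map<String, String> grpcByRaftPeer = new ConcurrentHashMap<>();

    /** Each node announces its own advertised address (e.g. on join). */
    void announce(String raftPeerId, String grpcAddress) {
        grpcByRaftPeer.put(raftPeerId, grpcAddress);
    }

    /** Resolve the leader's advertised address, failing fast if unknown. */
    String lookup(String leaderPeerId) {
        String addr = grpcByRaftPeer.get(leaderPeerId);
        if (addr == null) {
            throw new IllegalStateException(
                    "No advertised gRPC address for " + leaderPeerId);
        }
        return addr;
    }
}
```

With this in place, the heterogeneous-port scenario from the comment (raft peers on one port range, gRPC on 8686/8687/8688) resolves correctly, and the "leader unknown" case surfaces as an explicit exception rather than a misrouted redirect.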

}

/**