-
Notifications
You must be signed in to change notification settings - Fork 580
fix(pd): add timeout and null-safety to getLeaderGrpcAddress() #2961
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -26,6 +26,8 @@ | |||||||||||||||||||||
| import java.util.concurrent.CompletableFuture; | ||||||||||||||||||||||
| import java.util.concurrent.CountDownLatch; | ||||||||||||||||||||||
| import java.util.concurrent.ExecutionException; | ||||||||||||||||||||||
| import java.util.concurrent.TimeUnit; | ||||||||||||||||||||||
| import java.util.concurrent.TimeoutException; | ||||||||||||||||||||||
| import java.util.concurrent.atomic.AtomicReference; | ||||||||||||||||||||||
| import java.util.stream.Collectors; | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
|
|
@@ -239,8 +241,20 @@ public String getLeaderGrpcAddress() throws ExecutionException, InterruptedExcep | |||||||||||||||||||||
| waitingForLeader(10000); | ||||||||||||||||||||||
| } | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| return raftRpcClient.getGrpcAddress(raftNode.getLeaderId().getEndpoint().toString()).get() | ||||||||||||||||||||||
| .getGrpcAddress(); | ||||||||||||||||||||||
| try { | ||||||||||||||||||||||
| RaftRpcProcessor.GetMemberResponse response = raftRpcClient | ||||||||||||||||||||||
| .getGrpcAddress(raftNode.getLeaderId().getEndpoint().toString()) | ||||||||||||||||||||||
| .get(config.getRpcTimeout(), TimeUnit.MILLISECONDS); | ||||||||||||||||||||||
|
Comment on lines
+245
to
+247
|
||||||||||||||||||||||
| if (response != null && response.getGrpcAddress() != null) { | ||||||||||||||||||||||
| return response.getGrpcAddress(); | ||||||||||||||||||||||
| } | ||||||||||||||||||||||
| } catch (TimeoutException | ExecutionException e) { | ||||||||||||||||||||||
| log.warn("Failed to get leader gRPC address via RPC, falling back to endpoint derivation", e); | ||||||||||||||||||||||
|
||||||||||||||||||||||
| log.warn("Failed to get leader gRPC address via RPC, falling back to endpoint derivation", e); | |
| String leaderEndpoint = raftNode.getLeaderId() != null | |
| ? String.valueOf(raftNode.getLeaderId().getEndpoint()) | |
| : "unknown"; | |
| long timeoutMs = config.getRpcTimeout(); | |
| log.warn( | |
| "Failed to get leader gRPC address via RPC, falling back to endpoint derivation: " + | |
| "leaderEndpoint={}, timeoutMs={}, errorType={}, errorMessage={}", | |
| leaderEndpoint, timeoutMs, e.getClass().getSimpleName(), e.getMessage()); | |
| log.debug("Stack trace for failed getLeaderGrpcAddress RPC", e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
grpc.port values. In this repo's own multi-node test configs, application-server1.yml, application-server2.yml, and application-server3.yml advertise 8686, 8687, and 8688 respectively, so a follower on 8687 will redirect to leader-ip:8687 even when the elected leader is actually listening on 8686 or 8688. That turns the original NPE into a silent misroute.
If we can't recover the leader's advertised gRPC endpoint here, I think it's safer to fail fast than to synthesize an address from the local port, for example:
| return leaderIp + ":" + config.getGrpcPort(); | |
| } catch (TimeoutException | ExecutionException e) { | |
| throw new ExecutionException( | |
| String.format("Failed to resolve leader gRPC address for %s", raftNode.getLeaderId()), | |
| e); | |
| } |
A more complete fix would need a source of truth for the leader's actual grpcAddress, not the local node's port.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
waitingForLeader(10000)does not guarantee thatleaderIdis non-null; it just waits up to the timeout and returns whatever it has. That means both the RPC call and this fallback path can still dereferenceraftNode.getLeaderId()and reintroduce an NPE during leader-election gaps.Could we cache the leader once after waiting and turn the "leader still unknown" case into a controlled exception instead?
Then reuse
leaderfor both the RPC request and any follow-up handling.