HDDS-14725. Display retry messages in cli when scm's are unavailable #9834

Open
Gargi-jais11 wants to merge 4 commits into apache:master from Gargi-jais11:HDDS-14725

Gargi-jais11 commented Feb 26, 2026

What changes were proposed in this pull request?

When all SCM instances are down or unreachable, CLI commands that query SCM (e.g. ozone admin datanode list, decommission, diskbalancer, usageinfo, maintenance, etc.) appear to hang for up to ~10–15 minutes before failing.

Current behaviour with all SCMs down in an HA setup (applies to every command that queries SCM):

bash-5.1$ ozone admin datanode list
<----------- Appears stuck for 15 mins with no CLI error message ----------->

Proposed fix:
Print the retry messages to stderr from SCMFailoverProxyProviderBase so that they show up in the CLI output.
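
As a rough illustration of the proposed fix, the sketch below builds one retry line in the same shape as the messages in this PR's local test output and writes it to stderr. `RetryMessageSketch`, `format`, and `report` are hypothetical names chosen for this example; the actual change is in SCMFailoverProxyProviderBase and may be structured differently.

```java
import java.io.PrintStream;
import java.net.ConnectException;

/** Hypothetical helper for illustration; not the code in SCMFailoverProxyProviderBase. */
public class RetryMessageSketch {

  // Builds one retry line: exception type, cause, protocol, target node,
  // wait time before the next attempt, and failovers so far.
  static String format(Exception e, String protocol, String nodeId,
      long waitMillis, int failoverCount) {
    return String.format(
        "%s: %s, while invoking %s over %s. "
            + "Retrying in %dms after %d failover attempt(s).",
        e.getClass().getSimpleName(), e.getMessage(), protocol, nodeId,
        waitMillis, failoverCount);
  }

  // Writes the line to the given stream. A CLI would pass System.err so that
  // stdout stays reserved for the command's regular output.
  static void report(PrintStream err, Exception e, String protocol,
      String nodeId, long waitMillis, int failoverCount) {
    err.println(format(e, protocol, nodeId, waitMillis, failoverCount));
  }

  public static void main(String[] args) {
    report(System.err,
        new ConnectException("Call From om1/172.18.0.10 to scm1"),
        "StorageContainerLocationProtocolPB", "scm1", 2000L, 0);
  }
}
```

Writing to stderr rather than stdout means scripts that parse the command's output are not affected by the extra lines.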

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-14725

How was this patch tested?

Added an integration test in TestFailoverWithScmHA for commands that query SCM.
Tested locally:

bash-5.1$ ozone admin datanode list
ConnectException: Call From om1/172.18.0.10 to scm1, while invoking StorageContainerLocationProtocolPB over scm1. Retrying in 2000ms after 0 failover attempt(s).
UnknownHostException: Invalid host name, while invoking StorageContainerLocationProtocolPB over scm2. Retrying in 2000ms after 1 failover attempt(s).
UnknownHostException: Invalid host name, while invoking StorageContainerLocationProtocolPB over scm3. Retrying in 2000ms after 2 failover attempt(s).
NoRouteToHostException: No Route to Host from  om1/172.18.0.10 to scm1, while invoking StorageContainerLocationProtocolPB over scm1. Retrying in 2000ms after 3 failover attempt(s).
UnknownHostException: Invalid host name, while invoking StorageContainerLocationProtocolPB over scm2. Retrying in 2000ms after 4 failover attempt(s).
UnknownHostException: Invalid host name, while invoking StorageContainerLocationProtocolPB over scm3. Retrying in 2000ms after 5 failover attempt(s).
NoRouteToHostException: No Route to Host from  om1/172.18.0.10 to scm1, while invoking StorageContainerLocationProtocolPB over scm1. Retrying in 2000ms after 6 failover attempt(s).
UnknownHostException: Invalid host name, while invoking StorageContainerLocationProtocolPB over scm2. Retrying in 2000ms after 7 failover attempt(s).
UnknownHostException: Invalid host name, while invoking StorageContainerLocationProtocolPB over scm3. Retrying in 2000ms after 8 failover attempt(s).
NoRouteToHostException: No Route to Host from  om1/172.18.0.10 to scm1, while invoking StorageContainerLocationProtocolPB over scm1. Retrying in 2000ms after 9 failover attempt(s).
UnknownHostException: Invalid host name, while invoking StorageContainerLocationProtocolPB over scm2. Retrying in 2000ms after 10 failover attempt(s).
UnknownHostException: Invalid host name, while invoking StorageContainerLocationProtocolPB over scm3. Retrying in 2000ms after 11 failover attempt(s).
NoRouteToHostException: No Route to Host from  om1/172.18.0.10 to scm1, while invoking StorageContainerLocationProtocolPB over scm1. Retrying in 2000ms after 12 failover attempt(s).
UnknownHostException: Invalid host name, while invoking StorageContainerLocationProtocolPB over scm2. Retrying in 2000ms after 13 failover attempt(s).
UnknownHostException: Invalid host name, while invoking StorageContainerLocationProtocolPB over scm3. Retrying in 2000ms after 14 failover attempt(s).
No Route to Host from  om1/172.18.0.10 to scm1:9860 failed on socket timeout exception: java.net.NoRouteToHostException: No route to host; For more details see:  http://wiki.apache.org/hadoop/NoRouteToHost
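
Because the fix targets stderr, a test can check for the retry lines by capturing that stream while the command runs. The following is a minimal sketch of that idea in plain Java. `StderrCaptureSketch` and `captureStderr` are hypothetical names, and the `Runnable` in `main` is a stand-in for invoking a real CLI command; the test added to TestFailoverWithScmHA may work differently.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch; not the test added in this PR. */
public class StderrCaptureSketch {

  // Redirects System.err while the action runs and returns what was written.
  static String captureStderr(Runnable action) {
    PrintStream original = System.err;
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (PrintStream capture =
        new PrintStream(buffer, true, StandardCharsets.UTF_8.name())) {
      System.setErr(capture);
      action.run();
    } catch (UnsupportedEncodingException e) {
      throw new IllegalStateException(e);
    } finally {
      // Always restore the real stream, even if the action throws.
      System.setErr(original);
    }
    return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    // Stand-in for running a CLI command against unreachable SCMs.
    String err = captureStderr(() -> System.err.println(
        "UnknownHostException: Invalid host name, while invoking "
            + "StorageContainerLocationProtocolPB over scm2. "
            + "Retrying in 2000ms after 1 failover attempt(s)."));

    if (!err.contains("Retrying in 2000ms after 1 failover attempt(s).")) {
      throw new AssertionError("retry message missing from stderr: " + err);
    }
  }
}
```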

Gargi-jais11 marked this pull request as ready for review February 26, 2026 11:47
Gargi-jais11 commented

@adoroszlai @ChenSammi Please review the patch

adoroszlai left a comment

Thanks @Gargi-jais11 for the patch.

Gargi-jais11 requested a review from adoroszlai March 2, 2026 05:17