[RPC Metric Part 1] Support two basic metrics in RPC client : Latency and error rate #89
…ency, RPC error rate
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a3174e7ea3
vlfig
left a comment
Couple of comments, grab me if you need.
metrics/rpc_client.go
Outdated
| // RPCClientMetricsConfig holds labels for RPC client metrics. | ||
| // Empty strings are allowed; they will still be emitted as labels for filtering. | ||
| type RPCClientMetricsConfig struct { | ||
| Env string // e.g. "staging", "production" |
I don't think this label should ever be populated by the application itself.
metrics/rpc_client.go
Outdated
| Env string // e.g. "staging", "production" | ||
| Network string // chain/network name | ||
| ChainID string // chain ID | ||
| RPCProvider string // RPC provider or node name (optional) |
Where does this come from? I'd imagine this being called from logResult in rpc_client.go.
metrics/rpc_client.go
Outdated
| @@ -0,0 +1,125 @@ | |||
| // RPC client observability using Beholder. | |||
Instead of a new module I'd imagine this becoming an expansion of metrics/client.go, which already has a (promauto) latency metric, to 1) include beholder as a "target" like in metrics/multinode.go; and 2) add the request error rate metric.
docs/rpc_observability.md
Outdated
| @@ -0,0 +1,54 @@ | |||
| # RPC Observability (Beholder) | |||
Not sure you intended to commit this. I do like the idea of having a /docs folder in this style, but I think that's better pursued with a broader effort. Leaving such a slim slice here would end up being more confusing, I think.
docs/rpc_observability.md
Outdated
|
|
||
| Create `RPCClientMetrics` with `metrics.NewRPCClientMetrics(metrics.RPCClientMetricsConfig{...})` and pass it as the last argument to `multinode.NewRPCClientBase(...)`. The follow-up interface refactor will make it easier for multinode/chain integrations to supply `env`, `network`, `chain_id`, and `rpc_provider`. | ||
|
|
||
| ## Follow-up: multinode integration (PR 2) |
Better not mix the plan of what to do with description of what is.
vlfig
left a comment
Couple of comments. Do tag Dmytro once you feel we're past these. I think we'll need approval from someone other than me.
metrics/client.go
Outdated
| RPCCallLatency = promauto.NewHistogramVec(prometheus.HistogramOpts{ | ||
| Name: "rpc_call_latency", | ||
| Help: "The duration of an RPC call in milliseconds", | ||
| Help: "The duration of an RPC call in seconds", |
metrics/client.go
Outdated
| latency: latency, | ||
| errorsTotal: errorsTotal, |
maybe call them latencyHist and errorsCounter for clarity?
| Name: rpcCallLatencyBeholder, | ||
| Help: "The duration of an RPC call in milliseconds", | ||
| Buckets: []float64{ | ||
| float64(50 * time.Millisecond), |
If we change bucket size here from nanoseconds to milliseconds, we'll need to change the values that we report, but this is a breaking change. Other teams and NOPs may already depend on the values being in nanoseconds.
metrics/client.go
Outdated
| }, | ||
| }, []string{"chainFamily", "chainID", "rpcUrl", "isSendOnly", "success", "rpcCallName"}) | ||
|
|
||
| RPCCallErrorsTotal = promauto.NewCounterVec(prometheus.CounterOpts{ |
Why do we need a dedicated metric for errors? Can't we derive the value using RPCCallLatency and the success label?
Made the change, so the error rate can now be derived from the latency histogram like this:
sum by (chainFamily, chainID, rpcUrl, isSendOnly, rpcCallName) ( rate(rpc_call_latency_count{success="false"}[5m]) )
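As a rough illustration of the arithmetic behind that query, the sketch below approximates a per-second rate as the counter increase divided by the window length, then turns failed and successful rates into an error ratio. The counter values are made up, the helper names `perSecondRate` and `errorRatio` are hypothetical, and the real `rate()` function also handles counter resets and extrapolation, which this ignores.

```go
package main

import "fmt"

// perSecondRate approximates a counter's rate over a window as
// (end - start) / seconds. Counter resets are treated as zero rate.
func perSecondRate(start, end, windowSeconds float64) float64 {
	if windowSeconds <= 0 || end < start {
		return 0
	}
	return (end - start) / windowSeconds
}

// errorRatio is the share of calls that failed, given the per-second
// rates of the success="false" and success="true" series.
func errorRatio(failedRate, okRate float64) float64 {
	total := failedRate + okRate
	if total == 0 {
		return 0
	}
	return failedRate / total
}

func main() {
	// Hypothetical rpc_call_latency_count values at the start and end
	// of a 5-minute (300 s) window for a single endpoint.
	failed := perSecondRate(10, 40, 300)   // 30 failures in the window
	ok := perSecondRate(1000, 1270, 300)   // 270 successes in the window

	fmt.Printf("%.2f %.2f %.2f\n", failed, ok, errorRatio(failed, ok)) // prints "0.10 0.90 0.10"
}
```

Because both series come from the same histogram, the ratio stays consistent with the latency data without a separate error counter.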
822f005 to
d838d80
Compare
Description
Allow the RPC client to capture two metrics, rpc_call_latency and rpc_call_errors_total, so that they can later be exported to Beholder and engineers can examine RPC reliability per endpoint and make informed decisions. It will be part of this dashboard: https://grafana.ops.prod.cldev.sh/goto/cfgwx9lcfer5sa?orgId=1