
[RPC Metric Part 1] Support two basic metrics in RPC client : Latency and error rate #89

Open
guandali wants to merge 14 commits into main from lli/rpc-beholder-metric

@guandali guandali commented Mar 13, 2026

Description

Allow the RPC client to capture two metrics, rpc_call_latency and rpc_call_errors_total, so that they can later be exported to Beholder. Engineers can then examine RPC reliability per endpoint and make informed decisions.
These metrics will be part of this dashboard: https://grafana.ops.prod.cldev.sh/goto/cfgwx9lcfer5sa?orgId=1

Requires Dependencies

Resolves Dependencies

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a3174e7ea3

@guandali guandali requested a review from a team as a code owner March 13, 2026 08:19
@guandali guandali changed the title publish two basic metrics from the RPC client into Beholder: Latency and error rate [RPC Metric Part 1] publish two basic metrics from the RPC client into Beholder: Latency and error rate Mar 17, 2026
@guandali guandali changed the title [RPC Metric Part 1] publish two basic metrics from the RPC client into Beholder: Latency and error rate [RPC Metric Part 1] Support two basic metrics in RPC client : Latency and error rate Mar 17, 2026
@vlfig vlfig left a comment

Couple of comments, grab me if you need.

// RPCClientMetricsConfig holds labels for RPC client metrics.
// Empty strings are allowed; they will still be emitted as labels for filtering.
type RPCClientMetricsConfig struct {
Env string // e.g. "staging", "production"

I don't think this label should ever be populated by the application itself.

Env string // e.g. "staging", "production"
Network string // chain/network name
ChainID string // chain ID
RPCProvider string // RPC provider or node name (optional)

Where does this come from? I'd imagine this being called from logResult in rpc_client.go.

@@ -0,0 +1,125 @@
// RPC client observability using Beholder.

Instead of a new module I'd imagine this becoming an expansion of metrics/client.go, which already has a (promauto) latency metric, to 1) include beholder as a "target" like in metrics/multinode.go; and 2) add the request error rate metric.

@@ -0,0 +1,54 @@
# RPC Observability (Beholder)

Not sure you intended to commit this. I do like the idea of having a /docs folder in this style, but I think that's better pursued as part of a broader effort. Leaving such a slim slice here would end up being more confusing, I think.


Create `RPCClientMetrics` with `metrics.NewRPCClientMetrics(metrics.RPCClientMetricsConfig{...})` and pass it as the last argument to `multinode.NewRPCClientBase(...)`. The follow-up interface refactor will make it easier for multinode/chain integrations to supply `env`, `network`, `chain_id`, and `rpc_provider`.

## Follow-up: multinode integration (PR 2)

Better not mix the plan of what to do with description of what is.

@guandali guandali requested a review from vlfig March 20, 2026 14:45
@vlfig vlfig left a comment

Couple of comments. Do tag Dmytro once you feel we're past these. I think we'll need approval from someone other than me.

RPCCallLatency = promauto.NewHistogramVec(prometheus.HistogramOpts{
Name: "rpc_call_latency",
- Help: "The duration of an RPC call in milliseconds",
+ Help: "The duration of an RPC call in seconds",

You sure about this?

Comment on lines +85 to +86
latency: latency,
errorsTotal: errorsTotal,

maybe call them latencyHist and errorsCounter for clarity?

@guandali guandali requested review from dhaidashenko and vlfig and removed request for dhaidashenko March 23, 2026 19:18
Name: rpcCallLatencyBeholder,
Help: "The duration of an RPC call in milliseconds",
Buckets: []float64{
float64(50 * time.Millisecond),
Contributor

If we change bucket size here from nanoseconds to milliseconds, we'll need to change the values that we report, but this is a breaking change. Other teams and NOPs may already depend on the values being in nanoseconds.

},
}, []string{"chainFamily", "chainID", "rpcUrl", "isSendOnly", "success", "rpcCallName"})

RPCCallErrorsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
Contributor

Why do we need a dedicated metric for errors? Can't we derive the value using RPCCallLatency and the success label?

Member Author

Made the change so we can do this:
sum by (chainFamily, chainID, rpcUrl, isSendOnly, rpcCallName) (rate(rpc_call_latency_count{success="false"}[5m]))

@guandali guandali force-pushed the lli/rpc-beholder-metric branch from 822f005 to d838d80 Compare March 24, 2026 15:57

3 participants