otel: add tcp metrics by AgraVator · Pull Request #12652 · grpc/grpc-java

AgraVator · 2026-02-11T07:25:38Z

Implements A80

sauravzg · 2026-02-17T10:19:36Z

core/src/main/java/io/grpc/internal/ClientTransportFactory.java

Can you review go/java-practices/null from the overall PR's perspective in terms of nullable vs optional ? I am not sure if we have a local java convention to prefer or avoid null

We can't yet use java.util.Optional in many parts of the code because we don't require API desugaring and it was added in Android API level 24. We can use Guava's Optional, but we don't want it in any cross-module API surface; we only use Guava internally. Given Guava's Optional is semi-discouraged, but we can't use the new thing either, we've generally avoided the problem and not used Optional; there's very few usages of Optional in the code base.

Note that @Nullable is classically just documentation in gRPC, as we don't run a nullness checker and everything defaults to nullable. Although with tools re-interpreting the annotations with different JSpecify semantics (e.g., #12338), there's complications in its usage in cross-module API surfaces.

sauravzg · 2026-02-17T10:19:44Z

netty/src/main/java/io/grpc/netty/NettyServer.java

  private final int maxMessageSize;
  private final int maxHeaderListSize;
  private final int softLimitHeaderListSize;
+  private MetricRecorder metricRecorder;


How is this being set? I don't see a constructor or a setter that accesses this value?

sauravzg · 2026-02-17T10:19:50Z

netty/src/main/java/io/grpc/netty/TcpMetrics.java

+import java.util.Collections;
+import java.util.List;
+
+final class TcpMetrics {


optional: Javadocs for classes might be helpful

sauravzg · 2026-02-17T10:20:10Z

netty/src/main/java/io/grpc/netty/TcpMetrics.java

+  }
+
+  static final class Metrics {
+    final LongCounterMetricInstrument connectionsCreated;


nullability annotations for fields where needed. Also consider the previous discussion about nullability style guide

sauravzg · 2026-02-17T10:20:13Z

netty/src/main/java/io/grpc/netty/TcpMetrics.java

+   * Safe metric registration or retrieval for environments where TcpMetrics might
+   * be loaded multiple times (e.g., shaded and unshaded).
+   */
+  private static LongCounterMetricInstrument safelyRegisterLongCounter(


Do we need the safelyRegi... private abstraction?

Given that all registration happens during construction and this class is finel , can't we simply ensure we dedupe our metrics during construction, catch and rethrow the error from the constructor? i.e. ensure that the input is valid and catch/throw from a single point which should never be reached.
The fact that "we are passing a valid set of metrics for registration" can probably be trivially unit tested when we construct a TcpMetrics instance.

This should reduce code bloat without any sacrifice to safety, but at the cost of "duplicate input will throw an error instead of being handled intuitively", which is a fair expectation IMO.

sauravzg · 2026-02-17T10:20:18Z

netty/src/main/java/io/grpc/netty/TcpMetrics.java

+      java.util.List<String> labelValues = getLabelValues(channel);
+      try {
+        if (channel.getClass().getName().equals(epollSocketChannelClassName)) {
+          Class<?> tcpInfoClass = Class.forName(epollTcpInfoClassName);


This is quite a lot of reflection in hot exporting path. So, a couple of questions.

Why do we choose reflectin over typecasting and calling actual methods? Reducing dependency surface?

Is there a way to optimise this like caching value on per channel level? Something like a creating a runnable per channel. Not sure how it'd be possible to share this data for the final collection on channel inactive.

sauravzg · 2026-02-17T10:20:21Z

netty/src/main/java/io/grpc/netty/TcpMetrics.java

+          Method rttMethod = tcpInfoClass.getMethod("rtt");
+
+          long totalRetrans = (Long) totalRetransMethod.invoke(info);
+          int retransmits = (Integer) retransmitsMethod.invoke(info);


Are these cumulative metrics or are these total retransmits since the connection start? If latter, the current logic is incorrect.

AAh I see, then I think we should not be using a counter for this... Will clarify in the gRFC

ejona86

"otel" is a very surprising prefix to use for the PR/commit, as opentelemetry isn't actually changed at all.

netty/src/main/java/io/grpc/netty/NettyServerBuilder.java

ejona86 · 2026-02-26T22:59:58Z

core/src/main/java/io/grpc/internal/ClientTransportFactory.java

+    @Nullable
+    public MetricRecorder getMetricRecorder() {
+      return metricRecorder;
+    }


LoadBalancer.Helper.getMetricRecorder() is non-null and ManagedChannelImpl creates a MetricRecorderImpl unconditionally. Make this non-null as well? I see that NameResolver allows it to be null; seems we should just change that. LoadBalancer used to be that way and it was changed after a bug in the Helper implementation caused the channel to panic.

If we need the optimization, we can have the MetricRecorder return whether a particular instrument is enabled. But I see no need to add an additional condition to usages for the metric recorder being missing entirely.

core/src/main/java/io/grpc/internal/InternalServer.java

netty/src/main/java/io/grpc/netty/TcpMetrics.java

ejona86 · 2026-02-26T23:50:47Z

netty/src/main/java/io/grpc/netty/TcpMetrics.java

+      this(metricRecorder, target, DEFAULT_METRICS);
+    }
+
+    Tracker(MetricRecorder metricRecorder, String target, Metrics metrics) {


@VisibleForTesting?

ejona86 · 2026-02-26T23:50:50Z

netty/src/main/java/io/grpc/netty/TcpMetrics.java

+          "io.netty.channel.epoll.EpollTcpInfo");
+    }
+
+    Tracker(MetricRecorder metricRecorder, String target, Metrics metrics,


@VisibleForTesting?

ejona86 · 2026-02-26T23:53:16Z

netty/src/main/java/io/grpc/netty/TcpMetrics.java

+    private io.netty.util.concurrent.ScheduledFuture<?> reportTimer;
+
+    void channelActive(Channel channel) {
+      if (metricRecorder != null && target != null) {


When would the target be null, and if it is null why would we stop metrics? Seems we should worst-case use "" or something of the style <unknown target>. Otherwise you're silently losing metrics, which is the worst failure model for metrics as then you have lies.

ejona86 · 2026-02-26T23:56:11Z

netty/src/main/java/io/grpc/netty/TcpMetrics.java

+              Collections.singletonList(target), labelValues);
+        }
+      } catch (Throwable t) {
+        // Epoll not available or error getting tcp_info, just ignore.


This comment only applies to a portion of the try block. Limit the scope of the try block to just where it is needed.

ejona86 · 2026-02-26T23:57:59Z

netty/src/main/java/io/grpc/netty/TcpMetrics.java

+            return;
+          }
+          synchronized (accessorCache) {
+            channelReflectionAccessor = accessorCache.get(epollTcpInfoClassName);


Please no specialized cache just for testing. I'd suggest a single field and if epollTcpInfoClassName is provided for testing then simply create a new one (and don't cache it). And we can do all that in the constructor instead of checking every invocation. We do have some "interesting" tools that might help testing this (e.g., StaticTestingClassLoader), but I'll have to look over the tests to see how much help it provides.

…ddress PR comments)

The backoff timer is only used when serializeRetries=true, and that exists to match the old/current pick_first's behavior as closely as possible. InternalSubchannel.updateAddresses() would take no action when in TRANSIENT_FAILURE; it would update the addresses and just wait for the backoff timer to expire. Note that this only impacts serializeRetries=true; in the other cases we do want to start trying to the new addresses immediately, because the backoff timers are in the subchannels. Note that this change was also important because requestConnection() can be directly triggered by the user with channel.getState(true), and that shouldn't defeat the backoff timer.

Since we're only supporting API levels 23+, all the supported Android versions handle multidex natively, and without any bugs to workaround. Also bump some minSdkVersion that didn't get updated in fa7b52b so that multiDex is actually enabled by default. See also b/476359563

… and 'core: Wait for backoff timer on address update in pick_first'

ejona86

The PR description's "Implements A80" doesn't link to the gRFC; it just links to this PR.

ejona86 · 2026-03-09T15:21:12Z

okhttp/src/test/java/io/grpc/okhttp/OkHttpTransportTest.java

-        .setTransportTracerFactory(fakeClockTransportTracer)
-        .buildTransportServers(streamTracerFactories);
+        .setTransportTracerFactory(fakeClockTransportTracer);
+    return InternalOkHttpServerBuilder


You don't need to use InternalOkHttpServerBuilder; buildTransportServers() is package-private... the code was already calling it directly.

ejona86 · 2026-03-10T13:56:11Z

netty/src/main/java/io/grpc/netty/NettyServerBuilder.java


+  @CanIgnoreReturnValue
+  @Override
+  public NettyServerBuilder addMetricSink(MetricSink metricSink) {


Looks like this method shouldn't have been created in this class, as it will be inherited from ForwardingServerBuilder.

Note that once this method is created, it can't be deleted, because previously-compiled code will be linking against this method that returns NettyServerBuilder, and if you delete it then the base class's method only returns ServerBuilder. (This is an ABI issue, not an API issue; recompiling doesn't see the issue.)

ejona86 · 2026-03-10T13:57:56Z

api/src/main/java/io/grpc/ServerBuilder.java

+   * @param metricSink the metric sink to add.
+   * @return this
+   */
+  public T addMetricSink(MetricSink metricSink) {


@ExperimentalApi. You can use "TODO" or similar as the string when developing, so just always add it immediately when creating a new API.

ejona86 · 2026-03-12T13:07:36Z

api/src/main/java/io/grpc/InternalTcpMetrics.java

+  private static List<Double> getMinRttBuckets() {
+    List<Double> buckets = new ArrayList<>(100);
+    for (int i = 1; i <= 100; i++) {
+      buckets.add(1e-6 * Math.pow(2.0, i * 0.24));
+    }
+    return Collections.unmodifiableList(buckets);
+  }


This is pretty specific, and a lot of buckets. Where did this come from?

I saw something similar in core...

If implementations are expected to do this, then it needs to be in the gRFC.

ejona86 · 2026-03-12T13:13:46Z

api/src/main/java/io/grpc/ServerBuilder.java

+   * @return this
+   */
+  public T addMetricSink(MetricSink metricSink) {
+    throw new UnsupportedOperationException();


If a server has no metrics, then they wouldn't implement this method. Seems this should just be a noop instead of throwing. (If it stays as throwing, we'd need to figure out how we wanted to handle it in GrpcOpenTelementry.)

ejona86 · 2026-03-12T13:52:13Z

netty/src/test/java/io/grpc/netty/NettyClientTransportTest.java

            Attributes.EMPTY,
-            channelz);
+            channelz,
+            mock(MetricRecorder.class));


Why is this one a mock, but the others in this same file aren't? This is the first mock used in the class, and there isn't any verification happening.

ejona86 · 2026-03-12T13:58:23Z

netty/src/main/java/io/grpc/netty/TcpMetrics.java

+    return DEFAULT_METRICS;
+  }
+
+  static final class Metrics {


Is there any point to this class?

ejona86 · 2026-03-12T13:59:13Z

netty/src/main/java/io/grpc/netty/TcpMetrics.java

+    boolean epollAvailable = false;
+    try {
+      Class<?> epollClass = Class.forName("io.netty.channel.epoll.Epoll");
+      Method isAvailableMethod = epollClass.getDeclaredMethod("isAvailable");
+      epollAvailable = (Boolean) isAvailableMethod.invoke(null);
+    } catch (ClassNotFoundException e) {
+      log.log(Level.FINE, "Epoll is not available", e);
+    } catch (Exception e) {
+      log.log(Level.FINE, "Failed to determine Epoll availability", e);
+    } catch (Error e) {
+      log.log(Level.FINE, "Failed to load native Epoll library", e);
+    }
+    log.log(Level.INFO, "Epoll available during static init of TcpMetrics:"
+        + "{0}", epollAvailable);


This looks completely unused.

ejona86 · 2026-03-12T14:02:45Z

netty/src/main/java/io/grpc/netty/TcpMetrics.java

+        tcpInfo = epollTcpInfoClass.getDeclaredConstructor().newInstance();
+        tcpInfoMethod = epollSocketChannelClass.getMethod("tcpInfo", epollTcpInfoClass);
+        totalRetransMethod = epollTcpInfoClass.getMethod("totalRetrans");
+        retransMethod = epollTcpInfoClass.getMethod("retrans");
+        rttMethod = epollTcpInfoClass.getMethod("rtt");


We should avoid doing reflection every connection. These reflection results should be reused.

ejona86 · 2026-03-12T14:03:46Z

netty/src/main/java/io/grpc/netty/TcpMetrics.java

+        totalRetransMethod = epollTcpInfoClass.getMethod("totalRetrans");
+        retransMethod = epollTcpInfoClass.getMethod("retrans");
+        rttMethod = epollTcpInfoClass.getMethod("rtt");
+      } catch (Exception | Error t) {


Please use precise error handling. What specific exceptions are expected here?

AgraVator closed this Feb 11, 2026

AgraVator reopened this Feb 11, 2026

AgraVator marked this pull request as ready for review February 16, 2026 16:14

sauravzg reviewed Feb 17, 2026

View reviewed changes

AgraVator requested a review from ejona86 February 26, 2026 09:54

ejona86 reviewed Feb 27, 2026

View reviewed changes

AgraVator force-pushed the tcp-metrics branch from 2990c28 to 05fc150 Compare March 6, 2026 18:58

feat: Implement TCP metrics collection and pipeline MetricRecorder (A…

4082b2d

…ddress PR comments)

AgraVator force-pushed the tcp-metrics branch from 05fc150 to 716eb36 Compare March 6, 2026 19:03

suggested changes

a8b66f4

AgraVator force-pushed the tcp-metrics branch from 716eb36 to a8b66f4 Compare March 6, 2026 19:11

AgraVator and others added 6 commits March 9, 2026 14:21

fix: format and BinderServerBuilder

1ee21f3

fix: logs and plumb metric sink

20c614f

fix: otel test

9dfa829

Revert 'android-interop-testing: Remove usage of MultiDexApplication'…

2fda6fa

… and 'core: Wait for backoff timer on address update in pick_first'

AgraVator requested a review from ejona86 March 11, 2026 09:52

AgraVator added 2 commits March 11, 2026 15:32

fix: make metric recorder non null and register metrics unconditionally

b12a433

fix: revert non-nullable metric recorder in orca

d8180d6

ejona86 reviewed Mar 12, 2026

View reviewed changes

Conversation

AgraVator commented Feb 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ejona86 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ejona86 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

AgraVator commented Feb 11, 2026 •

edited

Loading