Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
163 changes: 163 additions & 0 deletions A114-wrr-metric-names-for-computing-utilization.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,163 @@
# A114: WRR Support for Custom Backend Metrics

- Author(s): sauravzg
- Approver: markdroth
- Status: In review
- Implemented in:
- Last updated: 2026-01-30
- Discussion at:

## Abstract

This proposal updates the client-side `weighted_round_robin` (WRR) load balancing policy to support customizable utilization metrics. It adds a new configuration field `metric_names_for_computing_utilization` to the WRR LB policy config. This allows users to specify which backend metrics should be used to compute endpoint weights, enabling the use of custom metrics (via ORCA named metrics) instead of relying solely on the default `application_utilization` or `cpu_utilization`.

## Background

The existing `weighted_round_robin` policy (defined in [gRFC A58][A58]) calculates endpoint weights based on standard metrics provided by the backend via ORCA (Open Request Cost Aggregation) load reports. Specifically, it uses `application_utilization` if available, and falls back to `cpu_utilization`.

However, services may want to drive load balancing decisions based on other resources, such as memory utilization, queue depth, or custom application-defined metrics. The [Custom Backend Metrics][A51] specification (ORCA) supports reporting arbitrary named metrics, and xDS has updated its WRR implementation to allow selecting these metrics for utilization calculation.

To support advanced load balancing scenarios, gRPC's WRR policy needs to support this flexibility.

### Related Proposals

- [A58: Client-Side Weighted Round Robin LB Policy][A58]
- [A51: Custom Backend Metrics][A51]

## Proposal

### Service Config Update

We will add a new field `metric_names_for_computing_utilization` to the [`WeightedRoundRobinLbConfig`][WeightedRoundRobinLbConfigProto] message in the Service Config.

```protobuf
message WeightedRoundRobinLbConfig {
// ... existing fields ...

// A list of metrics to be considered for overriding the default utilization
// computation behavior.
//
// For map fields in the ORCA proto, the string will be of the form
// "<map_field_name>.<map_key>". For example, the string "named_metrics.foo"
// will mean to look for the key "foo" in the ORCA "named_metrics" field.
//
// Utilization is computed by taking the max of the values of the
// metrics specified here. In the absence of this field or absence of
// valid values for the specified metrics, the policy will fall back to the
// existing default behavior.
repeated string metric_names_for_computing_utilization = 7;
}
```

### Weight Calculation Logic

The weight calculation logic in the WRR policy will be updated to determine the `utilization` value as follows from the [`OrcaLoadReport`][OrcaLoadReportProto]

1. **Check Custom Metrics**: If `metric_names_for_computing_utilization` is configured:
- Iterate through the specified metric names.
- **Resolve Metric Value**:
- If the name is of the format `field.key` (e.g., `named_metrics.foo`), look up the map field `field` and retrieve the value for `key`.
- If the name is a simple field name (e.g., `cpu_utilization`, `mem_utilization`), look up the `field`.
- **Only the following fields are supported:**
- `application_utilization`
- `cpu_utilization`
- `mem_utilization`
- `named_metrics.*`
- `utilization.*`
- **Compute Max**: Track the maximum value among all successfully resolved, positive ( > 0), finite metrics.
- If a max value is found, use it as the `utilization`.
2. **Fallback**: If checking custom metrics did not determine a valid utilization value (or if `metric_names_for_computing_utilization` is not configured), fall back to the existing WRR utilization behavior defined in [gRFC A58][A58].

#### Pseudocode

```
function GetUtilization(report, configured_metrics):
# 1. Check Custom Metrics
max_util = null

for metric_name in configured_metrics:
value = null

if metric_name contains ".":
# Map lookup (e.g. "named_metrics.foo" -> map="named_metrics", key="foo")
map_name, key = split_on_first_dot(metric_name)
if report has map field map_name:
value = report[map_name][key]
else:
# Root field lookup (e.g. "mem_utilization") via Reflection
if report has field metric_name:
value = report[metric_name]

# Only consider valid, non-nan, positive values
if value is not null and !is_nan(value) and value > 0:
if max_util is null or value > max_util:
max_util = value

if max_util is not null:
return max_util

# 2. Fallback to existing WRR behavior (A58)
if report.application_utilization > 0:
return report.application_utilization
return report.cpu_utilization
```

#### Implementation Notes

Since `OrcaLoadReport` is often exposed as a language-specific proxy object rather than a raw Protobuf message (e.g., in Java and C++), implementations are **not** expected to use Protobuf reflection to look up arbitrary fields. Instead, implementations should manually handle the supported metric list specified above.

As a consequence, support for any new standard fields added to `OrcaLoadReport` in the future will require explicit code changes in the implementation. This behavior is consistent with the [current Envoy implementation](https://github.com/envoyproxy/envoy/blob/35749578db375f5fe8ac5dd293cb7c4efb689611/source/common/orca/orca_load_metrics.cc#L47-L73).

#### Validity and Edge Cases

- **Nan Values**: As shown above, `NaN` values in reports are explicitly ignored to prevent undefined behavior in weight calculations. This is relevant because behavior of `max` on `NaN` in c++ is inconsistent based on order.
- **Zero Values**: Any value equal to `0.0` is treated as missing and ignored. This is consistent with [gRFC A58][A58] and the Envoy implementation, due to the lack of support for optional fields in the load report (where `0` cannot be distinguished from an unset field).
- **Negative Values**: Any value `< 0.0` is also treated as missing and **ignored silently**. This is consistent with the current [Envoy implementation](https://github.com/envoyproxy/envoy/blob/113f2c2d1015913256a2a8c6f9f97d0622623f45/source/extensions/load_balancing_policies/client_side_weighted_round_robin/client_side_weighted_round_robin_lb.cc#L214-L218).
- **Bound Checks**: The final selected `utilization` is subject to the standard validation logic from [gRFC A58][A58] (e.g., ensuring the value is positive) before being used to compute weight to avoid undefined behavior in weight calculations.
- **Normalization**: The WRR policy does not normalize reported metrics; the application is responsible for this.

The rest of the weight calculation formula (using QPS, EPS, and penalty) from [gRFC A58][A58] remains unchanged.

### xDS Integration

We will support the `metric_names_for_computing_utilization` field in the xDS [`ClientSideWeightedRoundRobin`][ClientSideWeightedRoundRobinProto] policy.

When converting the xDS configuration to the gRPC Service Config `WeightedRoundRobinLbConfig`, the `metric_names_for_computing_utilization` field should be copied over directly.

### Temporary environment variable protection

The features described in this proposal will be guarded by the environment variable `GRPC_EXPERIMENTAL_WRR_CUSTOM_METRICS`, which defaults to `false`.

## Rationale

- **Consistency with Envoy**: This design mirrors the corresponding feature in Envoy, ensuring consistent behavior for xDS-controlled clients.
- **Flexibility**: Allows users to define load balancing weights based on the actual bottleneck resource of their application (e.g., memory-bound services).
- **Backward Compatibility**: The default behavior (using `application_utilization` or `cpu_utilization`) remains unchanged if the new field is not configured.

## Implementation

This will be implemented in all languages C++, Java, and Go.

### C++

- **xDS Integration**: Update [`ClientSideWeightedRoundRobinLbPolicyConfigFactory`](https://github.com/grpc/grpc/blob/f7f13023412c1a589af7558eb0b9f8f664a76431/src/core/xds/grpc/xds_lb_policy_registry.cc#L68) to copy over the new field from the xDS configuration.
- **Config**: Update [`WeightedRoundRobinLbConfig`](https://github.com/grpc/grpc/blob/f7f13023412c1a589af7558eb0b9f8f664a76431/src/core/load_balancing/weighted_round_robin/weighted_round_robin.cc#L133) to include `metric_names_for_computing_utilization`.
- **Weight Calculation Logic**: Update callers of [`EndpointWeight::MaybeUpdateWeight`](https://github.com/grpc/grpc/blob/f7f13023412c1a589af7558eb0b9f8f664a76431/src/core/load_balancing/weighted_round_robin/weighted_round_robin.cc#L212) to implement the new utilisation selection logic.

### Java

- **xDS Integration**: Update [`convertWeightedRoundRobinConfig`](https://github.com/grpc/grpc-java/blob/a9f73f4c0aa5617aa2b6ae6ac805693915899b6a/xds/src/main/java/io/grpc/xds/LoadBalancerConfigFactory.java#L286) to copy over the field from the xDS configuration.
- **Config**: Update [`WeightedRoundRobinLoadBalancerConfig`](https://github.com/grpc/grpc-java/blob/a9f73f4c0aa5617aa2b6ae6ac805693915899b6a/xds/src/main/java/io/grpc/xds/WeightedRoundRobinLoadBalancer.java#L716) to include `metric_names_for_computing_utilization`.
- **Weight Calculation Logic**: Update [`OrcaPerRequestUtil`](https://github.com/grpc/grpc-java/blob/a9f73f4c0aa5617aa2b6ae6ac805693915899b6a/xds/src/main/java/io/grpc/xds/WeightedRoundRobinLoadBalancer.java#L364) to implement the new utilisation selection logic.

### Go

- **xDS Integration**: Update the configuration struct and conversion logic in [`converter.go`](https://github.com/grpc/grpc-go/blob/c05cfb3693bf18086810e671a6f9c05f296e0183/internal/xds/xdsclient/xdslbregistry/converter/converter.go#L241) to copy over the new field from the xDS configuration.
- **Config**: Update the configuration struct in [`config.go`](https://github.com/grpc/grpc-go/blob/7985bb44d26ecbeb8950996d028e38a0de08070b/balancer/weightedroundrobin/config.go#L26) to include `metric_names_for_computing_utilization`.
- **Weight Calculation Logic**: Update the weight update function in [`balancer.go`](https://github.com/grpc/grpc-go/blob/7985bb44d26ecbeb8950996d028e38a0de08070b/balancer/weightedroundrobin/balancer.go#L546) to implement the new utilisation selection logic.

[A58]: A58-client-side-weighted-round-robin-lb-policy.md
[A51]: A51-custom-backend-metrics.md
[ClientSideWeightedRoundRobinProto]: https://github.com/envoyproxy/envoy/blob/7242d5ad170523d7936849e596d261e3502c3886/api/envoy/extensions/load_balancing_policies/client_side_weighted_round_robin/v3/client_side_weighted_round_robin.proto#L86
[WeightedRoundRobinLbConfigProto]: https://github.com/grpc/grpc-proto/blob/ec99424f3b7dae9db194f848b4cea52ecfae07af/grpc/service_config/service_config.proto#L205
[OrcaLoadReportProto]: https://github.com/cncf/xds/blob/0feb69152e9f7e8a45c8a3cfe8c7dd93bca3512f/xds/data/orca/v3/orca_load_report.proto#L15