Skip to content

Conversation

@aadityaraj7769
Copy link

@aadityaraj7769 aadityaraj7769 commented Nov 23, 2025

Summary

This PR adds OpenTelemetry (OTel) metrics instrumentation to the XDS Client.

Changes

  • New Interface: Added XdsClientOtelMetricsProvider interface for collecting XDS client metrics via OpenTelemetry

    • Connection lifecycle events (connection lost, closed, reconnection)
    • Request/response tracking
    • Error tracking (resource not found, invalid resource)
    • Server latency measurements
    • Connection state monitoring
    • Initial wait time tracking for resource fetching
  • No-op Implementation: Added NoOpXdsClientOtelMetricsProvider as default implementation when metrics are disabled

  • Integration:

    • Integrated metrics provider into XdsClientImpl constructor with dependency injection pattern
    • Added metrics recording throughout XDS client lifecycle (connection events, resource updates, latency tracking)
    • Metrics provider passed to XdsClientJmx for JMX integration
    • Added builder support in D2ClientBuilder via setXdsClientOtelMetricsProvider()
  • Metrics Tracked:

    • Counters: connection events, requests/responses, errors (not found/invalid resources), IRV sent count
    • Histogram: server latency
    • Gauges: connection state, active initial wait time

Backward Compatibility

  • Fully backward compatible - defaults to no-op provider when not configured

About XdsSensor

The XdsClient Sensor tracks metrics for xDS-based service discovery and client-side operations. It monitors connection events, request/response counts, and latency distributions for xDS service calls. This sensor enables client-side observability for distributed service communications using xDS.

New Otel Metrics

Counters

  • Xds.Client.ErrorCount
  • Xds.Client.ConnectionStateChangedCount
  • Xds.Client.RequestCount
  • Xds.Client.ResponseReceivedCount
  • Xds.Client.InitialResourceVersionSentCount

Histogram

  • Xds.Server.Latency

Gauges

  • Xds.Client.DisconnectedFromXdsServer
  • Xds.Client.ActiveInitialWaitTimeMs

@Kevin-Ngo
Copy link

Kevin-Ngo commented Dec 3, 2025

Can we please add some context into the PR description around

  1. What the Xds metrics are used for?
  2. What are the new metrics now?
  3. How was this tested?


public boolean subscribeToUriGlobCollection = false;
public XdsServerMetricsProvider _xdsServerMetricsProvider = new NoOpXdsServerMetricsProvider();
public XdsClientOtelMetricsProvider _xdsClientOtelMetricsProvider = new NoOpXdsClientOtelMetricsProvider();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For public variables, we shouldn't prefix with _, that's the convention for private ones

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have followed the existing logic same as _xdsServerMetricsProvider which is basically for recording the xdsServer latency.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but this is new code and should follow better standards.. that last one was likely incorrectly committed.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.


public boolean subscribeToUriGlobCollection = false;
public XdsServerMetricsProvider _xdsServerMetricsProvider = new NoOpXdsServerMetricsProvider();
public XdsClientOtelMetricsProvider _xdsClientOtelMetricsProvider = new NoOpXdsClientOtelMetricsProvider();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also add docstring above this variable to explain how it is used here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

Comment on lines 310 to 313
// Get the client name from global prefix
String clientName = getGlobalPrefix(null);
xdsClientJmx.setClientName(clientName);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it needed to set here? Do we need to set it anywhere else?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I set the client name because I want to use it in the XdsClientImpl file to record the server latency for this specific client. The clientName is one of the dimensions used in Otel for this metric.

*/
public class NoOpXdsClientOtelMetricsProvider implements XdsClientOtelMetricsProvider {

@Override

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add docstring {@inheritDoc} for all of these methods

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

Comment on lines 67 to 73
public void setClientName(String clientName) {
_clientName = clientName;
}

public String getClientName() {
return _clientName;
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing docstring

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

}

_xdsClientJmx = new XdsClientJmx(serverMetricsProvider);
_xdsClientJmx = new XdsClientJmx(serverMetricsProvider, xdsClientOtelMetricsProvider);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we pass noop here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don’t think we should do that. The xdsClientOtelMetricsProvider comes from the configuration and will always be either an XdsClientOtelTracker instance or a noop instance. Even if a null value is passed through constructor overloading, XdsClientJmx already handles that case internally and creates a new noop instance.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If that is the case, why in line 241 do we perform the check? Should we have that moved up before this line then?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That check is just for the backward compatibility. If XdsClientImpl constructor is called without XdsClientOtelMetricsProvider.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, but i'm saying the usage of the actual parameter is a bit confusing though.. It is possible one value in line 239, then another after 241. We should just standardize that and have line 241 happen first.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Updated.

handleResourceRemoval(removedResources, type);
}

private void handleResourceUpdate(Map<String, ? extends ResourceUpdate> updates, ResourceType type)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this method need to be updated? Do we need to handle null or empty string here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mainly needed to update the trackServerLatencyHelper function so that server latency can be recorded with the clientName, which will now be fetched using xdsClientJmx.getClientName(). While backtracking the flow, I found that the clientName can be extracted from the handleResourceUpdate function. I also added null and empty string checks in D2ClientJmxManager at the point where the client name is extracted and where the setClientName function is called.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What makes some paths pass in NO_VALUE and others not?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

xdsStreamReadyTimeout,
config.subscribeToUriGlobCollection,
config._xdsServerMetricsProvider,
config._xdsClientOtelMetricsProvider,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are missing tests for this

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are not any tests written for this file. Should I add a new test file for this?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, please do. Sometimes old files won't have it, though it's good practice to have coverage. If we're going through adding new features, this is a good time to add some coverage - at the very least to test our change.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests seem to not do anything. Can we use a test impl of the XdsClientOtelMetricsProvider to verify the methods were called with the expected input variables?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

@aadityaraj7769
Copy link
Author

Can we please add some context into the PR description around

  1. What the Xds metrics are used for?
  2. What are the new metrics now?
  3. How was this tested?

Added in the PR description

@Kevin-Ngo
Copy link

Kevin-Ngo commented Dec 8, 2025

Since this is OSS, we can't really link internal google docs. We should remove it from the PR description, same with references to internal things, such as container etc.

Comment on lines 308 to 311
final String jmxName = String.format("%s-XdsClientJmx", getGlobalPrefix(null));

// Get the client name from global prefix
String clientName = getGlobalPrefix(null);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing i've noticed also is that getGlobalPrefix is also called on line 308. Can we just call that method one time and reuse it?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

xdsStreamReadyTimeout,
config.subscribeToUriGlobCollection,
config._xdsServerMetricsProvider,
config._xdsClientOtelMetricsProvider,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, please do. Sometimes old files won't have it, though it's good practice to have coverage. If we're going through adding new features, this is a good time to add some coverage - at the very least to test our change.

}

_xdsClientJmx = new XdsClientJmx(serverMetricsProvider);
_xdsClientJmx = new XdsClientJmx(serverMetricsProvider, xdsClientOtelMetricsProvider);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If that is the case, why in line 241 do we perform the check? Should we have that moved up before this line then?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we turn these tests into Parameterized ones? E.g.

  @Test
  public void testRecordConnectionLost() {
    String clientName = "test-client-1";
    _testProvider.recordConnectionLost(clientName);

    assertEquals(_testProvider.getCallCount("recordConnectionLost"), 1);
    assertEquals(_testProvider.getLastClientName("recordConnectionLost"), clientName);
  }

  @Test
  public void testRecordConnectionClosed() {
    String clientName = "test-client-2";
    _testProvider.recordConnectionClosed(clientName);

    assertEquals(_testProvider.getCallCount("recordConnectionClosed"), 1);
    assertEquals(_testProvider.getLastClientName("recordConnectionClosed"), clientName);
  }

can be combined into a single parameterized test

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

@aadityaraj7769
Copy link
Author

Since this is OSS, we can't really link internal google docs. We should remove it from the PR description, same with references to internal things, such as container etc.

Updated.

// Get the client name from global prefix
String clientName = getGlobalPrefix(null);

final String jmxName = String.format("%s-XdsClientJmx", clientName);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: jmxName has nothing really to do with line:313-line:315, and could probably be declared after.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

private final AtomicInteger _resourceInvalidCount = new AtomicInteger();
private final XdsServerMetricsProvider _xdsServerMetricsProvider;
private final XdsClientOtelMetricsProvider _xdsClientOtelMetricsProvider;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: unnecessary extra endline here. Additionally, we really should create a common util that exposes "-" in case future sensors will also need to reference it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed extra endline and created new file D2JmxConstants which expose "-".

*
* @param clientName the name to identify this XDS client
*/
public void setClientName(String clientName) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Is there a reason for this to be able to be set multiple times for a single instance? If not, we should have it be a one-way set. i.e. via atomic string or something.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Used atomic string.

{
return _xdsClient.getActiveInitialWaitTimeMillis();
waitTime = _xdsClient.getActiveInitialWaitTimeMillis();
_xdsClientOtelMetricsProvider.updateActiveInitialWaitTime(_clientName, waitTime);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

technically, can't all these usages of _clientName pass a null or empty _clientName if it's not set or improperly set?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

}

_xdsClientJmx = new XdsClientJmx(serverMetricsProvider);
_xdsClientJmx = new XdsClientJmx(serverMetricsProvider, xdsClientOtelMetricsProvider);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, but i'm saying the usage of the actual parameter is a bit confusing though.. It is possible one value in line 239, then another after 241. We should just standardize that and have line 241 happen first.

Comment on lines 1 to 15
/*
Copyright (c) 2023 LinkedIn Corp.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this necessary? This doesn't seem to be a standard across the files.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

verify(fixture._wildcardResourceWatcher).onRemoval(CLUSTER_RESOURCE_NAME);
verify(fixture._clusterSubscriber).onRemoval();
verify(fixture._uriMapWildcardSubscriber).onRemoval(CLUSTER_RESOURCE_NAME);
verifyZeroInteractions(fixture._serverMetricsProvider);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we verify our metrics aren't used here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

verify(fixture._resourceWatcher, times(0)).onChanged(D2_URI_MAP_GLOB_COLLECTION_UPDATE_WITH_DATA1);
verify(fixture._wildcardResourceWatcher, times(0)).onChanged(any(),
eq(D2_URI_MAP_GLOB_COLLECTION_UPDATE_WITH_DATA1));
verifyZeroInteractions(fixture._serverMetricsProvider);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically anywhere that _serverMetricsProvider is used, since this is a dual implementation, we should have the same assertions to leverage the test cases.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

Comment on lines 1 to 8
/*
Copyright (c) 2025 LinkedIn Corp.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also licensing here

/**
* Common constants used across D2 JMX and metrics components.
*/
public final class D2JmxConstants

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be just a common utils file instead of D2JmxContants? The NO_VALUE will be used by possibly multiple sensors, so it should just be something like OpenTelemetryConstants.

Copy link

@Kevin-Ngo Kevin-Ngo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly LGTM (aside from minor improvement suggestions - which should be carried out) from Metrics standpoint. Melt ship - please have sensor owner review it to confirm it's capturing in the right places and has the correct testing.

@aadityaraj7769 aadityaraj7769 force-pushed the adiraj/xdsclient-otel-migration branch from 512b8e2 to f710f7e Compare December 18, 2025 09:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants