-
Notifications
You must be signed in to change notification settings - Fork 558
[WIP]: Add OpenTelemetry metrics support for XDS Client #1127
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
[WIP]: Add OpenTelemetry metrics support for XDS Client #1127
Conversation
|
Can we please add some context into the PR description around
|
|
|
||
| public boolean subscribeToUriGlobCollection = false; | ||
| public XdsServerMetricsProvider _xdsServerMetricsProvider = new NoOpXdsServerMetricsProvider(); | ||
| public XdsClientOtelMetricsProvider _xdsClientOtelMetricsProvider = new NoOpXdsClientOtelMetricsProvider(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For public variables, we shouldn't prefix with _, that's the convention for private ones
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have followed the existing logic same as _xdsServerMetricsProvider which is basically for recording the xdsServer latency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but this is new code and should follow better standards.. that last one was likely incorrectly committed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
|
|
||
| public boolean subscribeToUriGlobCollection = false; | ||
| public XdsServerMetricsProvider _xdsServerMetricsProvider = new NoOpXdsServerMetricsProvider(); | ||
| public XdsClientOtelMetricsProvider _xdsClientOtelMetricsProvider = new NoOpXdsClientOtelMetricsProvider(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we also add docstring above this variable to explain how it is used here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added.
| // Get the client name from global prefix | ||
| String clientName = getGlobalPrefix(null); | ||
| xdsClientJmx.setClientName(clientName); | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is it needed to set here? Do we need to set it anywhere else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I set the client name because I want to use it in the XdsClientImpl file to record the server latency for this specific client. The clientName is one of the dimensions used in Otel for this metric.
| */ | ||
| public class NoOpXdsClientOtelMetricsProvider implements XdsClientOtelMetricsProvider { | ||
|
|
||
| @Override |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add docstring {@inheritDoc} for all of these methods
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added.
| public void setClientName(String clientName) { | ||
| _clientName = clientName; | ||
| } | ||
|
|
||
| public String getClientName() { | ||
| return _clientName; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing docstring
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added.
| } | ||
|
|
||
| _xdsClientJmx = new XdsClientJmx(serverMetricsProvider); | ||
| _xdsClientJmx = new XdsClientJmx(serverMetricsProvider, xdsClientOtelMetricsProvider); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we pass noop here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don’t think we should do that. The xdsClientOtelMetricsProvider comes from the configuration and will always be either an XdsClientOtelTracker instance or a noop instance. Even if a null value is passed through constructor overloading, XdsClientJmx already handles that case internally and creates a new noop instance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If that is the case, why in line 241 do we perform the check? Should we have that moved up before this line then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That check is just for the backward compatibility. If XdsClientImpl constructor is called without XdsClientOtelMetricsProvider.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, but i'm saying the usage of the actual parameter is a bit confusing though.. It is possible one value in line 239, then another after 241. We should just standardize that and have line 241 happen first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. Updated.
| handleResourceRemoval(removedResources, type); | ||
| } | ||
|
|
||
| private void handleResourceUpdate(Map<String, ? extends ResourceUpdate> updates, ResourceType type) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does this method need to be updated? Do we need to handle null or empty string here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mainly needed to update the trackServerLatencyHelper function so that server latency can be recorded with the clientName, which will now be fetched using xdsClientJmx.getClientName(). While backtracking the flow, I found that the clientName can be extracted from the handleResourceUpdate function. I also added null and empty string checks in D2ClientJmxManager at the point where the client name is extracted and where the setClientName function is called.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What makes some paths pass in NO_VALUE and others not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
| xdsStreamReadyTimeout, | ||
| config.subscribeToUriGlobCollection, | ||
| config._xdsServerMetricsProvider, | ||
| config._xdsClientOtelMetricsProvider, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are missing tests for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are not any tests written for this file. Should I add a new test file for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, please do. Sometimes old files won't have it, though it's good practice to have coverage. If we're going through adding new features, this is a good time to add some coverage - at the very least to test our change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These tests seem to not do anything. Can we use a test impl of the XdsClientOtelMetricsProvider to verify the methods were called with the expected input variables?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added.
Added in the PR description |
|
Since this is OSS, we can't really link internal google docs. We should remove it from the PR description, same with references to internal things, such as |
| final String jmxName = String.format("%s-XdsClientJmx", getGlobalPrefix(null)); | ||
|
|
||
| // Get the client name from global prefix | ||
| String clientName = getGlobalPrefix(null); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing i've noticed also is that getGlobalPrefix is also called on line 308. Can we just call that method one time and reuse it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
| xdsStreamReadyTimeout, | ||
| config.subscribeToUriGlobCollection, | ||
| config._xdsServerMetricsProvider, | ||
| config._xdsClientOtelMetricsProvider, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, please do. Sometimes old files won't have it, though it's good practice to have coverage. If we're going through adding new features, this is a good time to add some coverage - at the very least to test our change.
| } | ||
|
|
||
| _xdsClientJmx = new XdsClientJmx(serverMetricsProvider); | ||
| _xdsClientJmx = new XdsClientJmx(serverMetricsProvider, xdsClientOtelMetricsProvider); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If that is the case, why in line 241 do we perform the check? Should we have that moved up before this line then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we turn these tests into Parameterized ones? E.g.
@Test
public void testRecordConnectionLost() {
String clientName = "test-client-1";
_testProvider.recordConnectionLost(clientName);
assertEquals(_testProvider.getCallCount("recordConnectionLost"), 1);
assertEquals(_testProvider.getLastClientName("recordConnectionLost"), clientName);
}
@Test
public void testRecordConnectionClosed() {
String clientName = "test-client-2";
_testProvider.recordConnectionClosed(clientName);
assertEquals(_testProvider.getCallCount("recordConnectionClosed"), 1);
assertEquals(_testProvider.getLastClientName("recordConnectionClosed"), clientName);
}
can be combined into a single parameterized test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
Updated. |
| // Get the client name from global prefix | ||
| String clientName = getGlobalPrefix(null); | ||
|
|
||
| final String jmxName = String.format("%s-XdsClientJmx", clientName); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: jmxName has nothing really to do with line:313-line:315, and could probably be declared after.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
| private final AtomicInteger _resourceInvalidCount = new AtomicInteger(); | ||
| private final XdsServerMetricsProvider _xdsServerMetricsProvider; | ||
| private final XdsClientOtelMetricsProvider _xdsClientOtelMetricsProvider; | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: unnecessary extra endline here. Additionally, we really should create a common util that exposes "-" in case future sensors will also need to reference it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed extra endline and created new file D2JmxConstants which expose "-".
| * | ||
| * @param clientName the name to identify this XDS client | ||
| */ | ||
| public void setClientName(String clientName) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Is there a reason for this to be able to be set multiple times for a single instance? If not, we should have it be a one-way set. i.e. via atomic string or something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Used atomic string.
| { | ||
| return _xdsClient.getActiveInitialWaitTimeMillis(); | ||
| waitTime = _xdsClient.getActiveInitialWaitTimeMillis(); | ||
| _xdsClientOtelMetricsProvider.updateActiveInitialWaitTime(_clientName, waitTime); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
technically, can't all these usages of _clientName pass a null or empty _clientName if it's not set or improperly set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
| } | ||
|
|
||
| _xdsClientJmx = new XdsClientJmx(serverMetricsProvider); | ||
| _xdsClientJmx = new XdsClientJmx(serverMetricsProvider, xdsClientOtelMetricsProvider); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, but i'm saying the usage of the actual parameter is a bit confusing though.. It is possible one value in line 239, then another after 241. We should just standardize that and have line 241 happen first.
| /* | ||
| Copyright (c) 2023 LinkedIn Corp. | ||
| Licensed under the Apache License, Version 2.0 (the "License"); | ||
| you may not use this file except in compliance with the License. | ||
| You may obtain a copy of the License at | ||
| http://www.apache.org/licenses/LICENSE-2.0 | ||
| Unless required by applicable law or agreed to in writing, software | ||
| distributed under the License is distributed on an "AS IS" BASIS, | ||
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| See the License for the specific language governing permissions and | ||
| limitations under the License. | ||
| */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this necessary? This doesn't seem to be a standard across the files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
| verify(fixture._wildcardResourceWatcher).onRemoval(CLUSTER_RESOURCE_NAME); | ||
| verify(fixture._clusterSubscriber).onRemoval(); | ||
| verify(fixture._uriMapWildcardSubscriber).onRemoval(CLUSTER_RESOURCE_NAME); | ||
| verifyZeroInteractions(fixture._serverMetricsProvider); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we verify our metrics aren't used here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
| verify(fixture._resourceWatcher, times(0)).onChanged(D2_URI_MAP_GLOB_COLLECTION_UPDATE_WITH_DATA1); | ||
| verify(fixture._wildcardResourceWatcher, times(0)).onChanged(any(), | ||
| eq(D2_URI_MAP_GLOB_COLLECTION_UPDATE_WITH_DATA1)); | ||
| verifyZeroInteractions(fixture._serverMetricsProvider); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically anywhere that _serverMetricsProvider is used, since this is a dual implementation, we should have the same assertions to leverage the test cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
| /* | ||
| Copyright (c) 2025 LinkedIn Corp. | ||
| Licensed under the Apache License, Version 2.0 (the "License"); | ||
| you may not use this file except in compliance with the License. | ||
| You may obtain a copy of the License at | ||
| http://www.apache.org/licenses/LICENSE-2.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also licensing here
| /** | ||
| * Common constants used across D2 JMX and metrics components. | ||
| */ | ||
| public final class D2JmxConstants |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be just a common utils file instead of D2JmxContants? The NO_VALUE will be used by possibly multiple sensors, so it should just be something like OpenTelemetryConstants.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly LGTM (aside from minor improvement suggestions - which should be carried out) from Metrics standpoint. Melt ship - please have sensor owner review it to confirm it's capturing in the right places and has the correct testing.
512b8e2 to
f710f7e
Compare
Summary
This PR adds OpenTelemetry (OTel) metrics instrumentation to the XDS Client.
Changes
New Interface: Added
XdsClientOtelMetricsProviderinterface for collecting XDS client metrics via OpenTelemetryNo-op Implementation: Added
NoOpXdsClientOtelMetricsProvideras default implementation when metrics are disabledIntegration:
XdsClientImplconstructor with dependency injection patternXdsClientJmxfor JMX integrationD2ClientBuilderviasetXdsClientOtelMetricsProvider()Metrics Tracked:
Backward Compatibility
About XdsSensor
The XdsClient Sensor tracks metrics for xDS-based service discovery and client-side operations. It monitors connection events, request/response counts, and latency distributions for xDS service calls. This sensor enables client-side observability for distributed service communications using xDS.
New Otel Metrics
Counters
Histogram
Gauges