-
Notifications
You must be signed in to change notification settings - Fork 152
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[#1973] feat(server,coordinator): Refactor metrics system to reduce periodic reporting load #1991
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
left comment
client-spark/spark2-shaded/pom.xml
Outdated
@@ -216,6 +216,10 @@ | |||
<pattern>picocli</pattern> | |||
<shadedPattern>${rss.shade.packageName}.picocli</shadedPattern> | |||
</relocation> | |||
<relocation> | |||
<pattern>com.codahale.metrics</pattern> | |||
<shadedPattern>${rss.shade.packageName}.com.codahale.metrics</shadedPattern> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
check the dependencies of this is shaded too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you like to add /api/xxx/metrics /api/xxx/prometheus/metrics result to the Description
of this PR.
public Gauge addGauge(String name, String help, String[] labels) { | ||
return Gauge.build().name(name).labelNames(labels).help(help).register(collectorRegistry); | ||
} | ||
|
||
public synchronized <T> void addGauge(String name, com.codahale.metrics.Gauge<T> metric) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
registerGaugeIfAbsent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
} | ||
} | ||
|
||
public synchronized <T> void addGauge( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
registerCachedGaugeIfAbsent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
pom.xml
Outdated
@@ -105,6 +105,7 @@ | |||
<skipITs>${skipTests}</skipITs> | |||
<skipBuildImage>true</skipBuildImage> | |||
<snakeyaml.version>2.2</snakeyaml.version> | |||
<prometheus.version>0.8.0</prometheus.version> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use ${prometheus.simpleclient.version}?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kuszz It looks a great feature, with this, we can easy to add Gauge to report size of collection, left a last question inline, besides this, there are no any other suggestion from me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. @kuszz Thanks for your contribution!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two high-level questions:
- how will this new metrics system integrate with existing systems? I think a lot of people are already using prometheus to collect metrics data. It would be great if we can make a drop-in replacement for that. cc @xianjingfeng maybe you have some input about this part
- is there a plan to replace other existing metrics with the new one?
@@ -44,6 +51,8 @@ public MetricsManager(CollectorRegistry collectorRegistry, Map<String, String> d | |||
} else { | |||
this.collectorRegistry = collectorRegistry; | |||
} | |||
metricRegistry = new MetricRegistry(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: -> this.metricRegistry = xxx
.
It's clear to explicitly use this.xx
when updating the instance field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -513,4 +503,8 @@ private static void setUpMetrics(ShuffleServerConf serverConf) { | |||
.labelNames("app_id") | |||
.register(metricsManager.getCollectorRegistry()); | |||
} | |||
|
|||
public static MetricsManager getMetricsManager() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should expose metrics manager directly.
Instead, you could add methods in ShuffleServerMetrics to register new metrics, such as:
public static void registerEventQueueSize(com.codahale.metrics.Gauge<Integer> eventQueueSize) {
// do all kinds of validation if needed.
metricsManager.registerGaugeIfAbsent(EVENT_QUEUE_SIZE, eventQueueSize);
}
Or we can expose the metrics names and the registerMetrics
method combined.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, I removed this func
private static final String LABEL_SEPARATOR = ":"; | ||
private static final String NAME_SEPARATOR = "_"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems these two are never used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -112,4 +150,8 @@ public Summary.Child addLabeledSummary(String name) { | |||
} | |||
return builder.register(collectorRegistry).labels(defaultLabelValues); | |||
} | |||
|
|||
public com.codahale.metrics.Gauge getGauge(String name) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Never used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, this func display how to obtain the value of a gauge registered in new way
@@ -66,14 +75,43 @@ public Counter.Child addLabeledCounter(String name) { | |||
return c.labels(this.defaultLabelValues); | |||
} | |||
|
|||
@Deprecated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to add a javadoc to indicate which version this method will be removed and what alternative should be used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed the annotation, the new way can not add labels, so it cannot completely replace the old method
server/src/main/java/org/apache/uniffle/server/ShuffleServerMetrics.java
Show resolved
Hide resolved
This PR does not affect the original collection logic. https://www.robustperception.io/exposing-dropwizard-metrics-to-prometheus/ |
This is good to know. But it seems it doesn't support labels? Or it could be supported in another way? |
It seems that there is no way. |
4a22c8e
to
5c018b3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @kuszz, I think this is in the right direction. Just some coding style issues to be addressed.
It would be great if you can add some UTs for the new gauge metric as well.
common/src/main/java/org/apache/uniffle/common/metrics/MetricsManager.java
Outdated
Show resolved
Hide resolved
@@ -47,6 +50,7 @@ public MetricsManager(CollectorRegistry collectorRegistry, Map<String, String> d | |||
this.defaultLabelNames = defaultLabels.keySet().toArray(new String[0]); | |||
this.defaultLabelValues = | |||
Arrays.stream(defaultLabelNames).map(defaultLabels::get).toArray(String[]::new); | |||
this.gaugeMap = new ConcurrentHashMap<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use this.gaugeMap = JavaUtils.newConcurrentMap();
This is a performance issue in JDK8 for concurrent hash map, so we have to wrap it with a special one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, I have modified it
common/src/main/java/org/apache/uniffle/common/metrics/MetricsManager.java
Outdated
Show resolved
Hide resolved
common/src/main/java/org/apache/uniffle/common/metrics/MetricsManager.java
Outdated
Show resolved
Hide resolved
@@ -112,4 +142,10 @@ public Summary.Child addLabeledSummary(String name) { | |||
} | |||
return builder.register(collectorRegistry).labels(defaultLabelValues); | |||
} | |||
|
|||
public void unRegister() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm. This signature doesn't seem right.
It should be something like:
public void unregisterSupplierGauge(String name)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
common/src/main/java/org/apache/uniffle/common/metrics/SupplierGauge.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/apache/uniffle/server/DefaultFlushEventHandler.java
Outdated
Show resolved
Hide resolved
common/src/main/java/org/apache/uniffle/common/metrics/MetricsManager.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally lgtm, except one minor comment.
@maobaolong please take a look as well in case you have more comments.
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #1991 +/- ##
============================================
- Coverage 52.77% 52.53% -0.24%
- Complexity 2498 2925 +427
============================================
Files 398 447 +49
Lines 18135 23546 +5411
Branches 1660 2196 +536
============================================
+ Hits 9570 12370 +2800
- Misses 7981 10383 +2402
- Partials 584 793 +209 ☔ View full report in Codecov by Sentry. |
public void unregisterAllSupplierGauge() { | ||
for (SupplierGauge gauge : supplierGaugeMap.values()) { | ||
collectorRegistry.unregister(gauge); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, I think we should clear the supplierGaugeMap as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kuszz @advancedxy It LGTM, I've give some suggestion off the github, and I checked all of my suggestion are addressed.
LGTM on my side too. @kuszz would you mind to rebase the latest master and make the CI happy. |
@advancedxy rust ci reports 404 error |
hmmm, it seems all the CI are failing? @zuston would you mind to take a look at the rust CI failure? |
@advancedxy If it is difficult to fix the rust related CI checks, we can ignore it by this for now? #2041 |
I'm fine to ignore that first, cc @zuston |
Thanks to @maobaolong, the rust ci checked is disabled temporary. @kuszz please rebase the master and retry CI, sorry for the inconvenience. |
@advancedxy done |
Merged, thanks @kuszz. |
…ric type and add requireBufferCount metrics (#2113) ### What changes were proposed in this pull request? Refactor SupplierGauge to support generic type and add requireBufferCount metrics ### Why are the changes needed? Fix: #1991 ### Does this PR introduce _any_ user-facing change? Add new metrics named `require_buffer_count` ### How was this patch tested? Tested through dashboard server metrics popup page. <img width="945" alt="image" src="https://github.com/user-attachments/assets/2fbbb1d1-7f1f-41ac-ab41-adba55790005"> --------- Co-authored-by: xianjingfeng <[email protected]>
### What changes were proposed in this pull request? Add another method to add gauge metric, we can use lambda to describe a gauge metric. ### Why are the changes needed? Fix: apache#1973 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? UTs. (cherry picked from commit 8dfa2a7)
…t generic type and add requireBufferCount metrics (apache#2113) ### What changes were proposed in this pull request? Refactor SupplierGauge to support generic type and add requireBufferCount metrics ### Why are the changes needed? Fix: apache#1991 ### Does this PR introduce _any_ user-facing change? Add new metrics named `require_buffer_count` ### How was this patch tested? Tested through dashboard server metrics popup page. <img width="945" alt="image" src="https://github.com/user-attachments/assets/2fbbb1d1-7f1f-41ac-ab41-adba55790005"> --------- Co-authored-by: xianjingfeng <[email protected]> (cherry picked from commit d1aa51b)
What changes were proposed in this pull request?
Add another method to add gauge metric, we can use lambda to describe a gauge metric.
Why are the changes needed?
Fix: #1973
Does this PR introduce any user-facing change?
No.
How was this patch tested?
UTs.