[#1973] feat(server,coordinator): Refactor metrics system to reduce periodic reporting load #1991

Merged: 1 commit merged on Aug 14, 2024

Contributor

@kuszz kuszz commented Jul 31, 2024

What changes were proposed in this pull request?

Add another way to register a gauge metric, so that a gauge can be described with a lambda.
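
As a rough illustration of the idea (not the code from this PR), a lambda-described gauge could look like the stdlib-only sketch below. `LambdaGaugeRegistry`, its `read` method, and the metric name `event_queue_size` are hypothetical stand-ins; the actual change builds on the Prometheus `CollectorRegistry` and the gauge types discussed in the review.

```java
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Supplier;

// Hypothetical stand-in for a metrics registry; not the Uniffle MetricsManager.
class LambdaGaugeRegistry {
  private final Map<String, Supplier<Number>> gauges = new ConcurrentHashMap<>();

  // Register the gauge only if no gauge with this name exists yet.
  // Returns true if this call registered it.
  boolean registerGaugeIfAbsent(String name, Supplier<Number> gauge) {
    return gauges.putIfAbsent(name, gauge) == null;
  }

  // Evaluate the lambda on demand, for example when a scrape request arrives.
  Number read(String name) {
    Supplier<Number> gauge = gauges.get(name);
    if (gauge == null) {
      throw new IllegalArgumentException("Unknown gauge: " + name);
    }
    return gauge.get();
  }
}

class LambdaGaugeDemo {
  public static void main(String[] args) {
    Queue<String> eventQueue = new ConcurrentLinkedQueue<>();
    LambdaGaugeRegistry registry = new LambdaGaugeRegistry();

    // The gauge is described once; no periodic task has to push the value.
    registry.registerGaugeIfAbsent("event_queue_size", eventQueue::size);

    eventQueue.add("a");
    eventQueue.add("b");
    // The value is computed at read time, so it reflects the two added events.
    System.out.println(registry.read("event_queue_size"));
  }
}
```

The design point is that the value is pulled when it is read, rather than being pushed by a periodic reporting task.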

Why are the changes needed?

Fix: #1973

Does this PR introduce any user-facing change?

No.

How was this patch tested?

UTs.

@rickyma rickyma changed the title [#1973] feat(server,coordinator): Refactor metrics system to reduce periodically report metrics heavy behavior [#1973] feat(server,coordinator): Refactor metrics system to reduce periodic reporting load Jul 31, 2024

github-actions bot commented Jul 31, 2024

Test Results

 2 792 files  ±0   2 792 suites  ±0   5h 51m 4s ⏱️ +47s
   988 tests ±0     987 ✅ ±0   1 💤 ±0  0 ❌ ±0 
12 403 runs  ±0  12 388 ✅ ±0  15 💤 ±0  0 ❌ ±0 

Results for commit 29aec85. ± Comparison against base commit 3673b55.

♻️ This comment has been updated with latest results.

@kuszz kuszz closed this Aug 1, 2024
@kuszz kuszz reopened this Aug 1, 2024

Member

@maobaolong maobaolong left a comment

left comment

@@ -216,6 +216,10 @@
<pattern>picocli</pattern>
<shadedPattern>${rss.shade.packageName}.picocli</shadedPattern>
</relocation>
<relocation>
<pattern>com.codahale.metrics</pattern>
<shadedPattern>${rss.shade.packageName}.com.codahale.metrics</shadedPattern>

Member

Please check whether the dependencies of this library are shaded too.

Contributor Author

image

Member

@maobaolong maobaolong left a comment

Could you add the output of /api/xxx/metrics and /api/xxx/prometheus/metrics to the description of this PR?

public Gauge addGauge(String name, String help, String[] labels) {
return Gauge.build().name(name).labelNames(labels).help(help).register(collectorRegistry);
}

public synchronized <T> void addGauge(String name, com.codahale.metrics.Gauge<T> metric) {

Member

Suggest renaming this to registerGaugeIfAbsent.

Contributor Author

done

}
}

public synchronized <T> void addGauge(

Member

Suggest renaming this to registerCachedGaugeIfAbsent.

Contributor Author

done
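
The "cached" variant named above presumably follows the idea behind Dropwizard's `CachedGauge`, which recomputes a value only after a timeout has elapsed. A minimal stdlib-only sketch of that caching idea, assuming a hypothetical `CachedSupplier` class (not a class from this PR):

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical helper: caches the value of an expensive supplier for a fixed time.
class CachedSupplier<T> implements Supplier<T> {
  private final Supplier<T> delegate;
  private final long timeoutNanos;
  private final LongSupplier clock; // injectable so the sketch can be tested
  private boolean loaded;
  private long loadedAtNanos;
  private T value;

  CachedSupplier(Supplier<T> delegate, long timeout, TimeUnit unit, LongSupplier clock) {
    this.delegate = delegate;
    this.timeoutNanos = unit.toNanos(timeout);
    this.clock = clock;
  }

  CachedSupplier(Supplier<T> delegate, long timeout, TimeUnit unit) {
    this(delegate, timeout, unit, System::nanoTime);
  }

  @Override
  public synchronized T get() {
    long now = clock.getAsLong();
    // Recompute on first use, or once the cached value is older than the timeout.
    if (!loaded || now - loadedAtNanos >= timeoutNanos) {
      value = delegate.get();
      loadedAtNanos = now;
      loaded = true;
    }
    return value;
  }
}

class CachedSupplierDemo {
  public static void main(String[] args) {
    AtomicInteger computations = new AtomicInteger();
    Supplier<Integer> gauge =
        new CachedSupplier<>(computations::incrementAndGet, 10, TimeUnit.SECONDS);
    gauge.get();
    gauge.get(); // served from the cache; the delegate is not called again
    System.out.println("computations = " + computations.get());
  }
}
```

Caching like this is mainly useful when computing the gauge value is expensive and scrapes are frequent.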

pom.xml Outdated
@@ -105,6 +105,7 @@
<skipITs>${skipTests}</skipITs>
<skipBuildImage>true</skipBuildImage>
<snakeyaml.version>2.2</snakeyaml.version>
<prometheus.version>0.8.0</prometheus.version>

Member

Why not use ${prometheus.simpleclient.version}?

Contributor Author

done

Member

@maobaolong maobaolong left a comment

@kuszz This looks like a great feature. With it, we can easily add a Gauge that reports the size of a collection. I left one last question inline; other than that, I have no further suggestions.

Member

@maobaolong maobaolong left a comment

LGTM. @kuszz Thanks for your contribution!

Contributor

@advancedxy advancedxy left a comment

Two high-level questions:

  1. How will this new metrics system integrate with existing systems? I think a lot of people are already using Prometheus to collect metrics data. It would be great if we could make it a drop-in replacement for that. cc @xianjingfeng, maybe you have some input on this part.
  2. Is there a plan to replace the other existing metrics with the new one?

@@ -44,6 +51,8 @@ public MetricsManager(CollectorRegistry collectorRegistry, Map<String, String> d
} else {
this.collectorRegistry = collectorRegistry;
}
metricRegistry = new MetricRegistry();

Contributor

Nit: use this.metricRegistry = xxx here.

It's clearer to explicitly use this.xx when updating an instance field.

Contributor Author

done

@@ -513,4 +503,8 @@ private static void setUpMetrics(ShuffleServerConf serverConf) {
.labelNames("app_id")
.register(metricsManager.getCollectorRegistry());
}

public static MetricsManager getMetricsManager() {

Contributor

I don't think we should expose metrics manager directly.
Instead, you could add methods in ShuffleServerMetrics to register new metrics, such as:

  public static void registerEventQueueSize(com.codahale.metrics.Gauge<Integer> eventQueueSize) {
    // do all kinds of validation if needed.
    metricsManager.registerGaugeIfAbsent(EVENT_QUEUE_SIZE, eventQueueSize);
  }

Or we can expose the metrics names and the registerMetrics method combined.

Contributor Author

Thanks, I removed this function.

Comment on lines 41 to 42
private static final String LABEL_SEPARATOR = ":";
private static final String NAME_SEPARATOR = "_";

Contributor

Seems these two are never used?

Contributor Author

done

@@ -112,4 +150,8 @@ public Summary.Child addLabeledSummary(String name) {
}
return builder.register(collectorRegistry).labels(defaultLabelValues);
}

public com.codahale.metrics.Gauge getGauge(String name) {

Contributor

Never used?

Contributor Author

Yes, this function shows how to obtain the value of a gauge registered the new way.

@@ -66,14 +75,43 @@ public Counter.Child addLabeledCounter(String name) {
return c.labels(this.defaultLabelValues);
}

@Deprecated

Contributor

I would like a Javadoc added here that indicates in which version this method will be removed and which alternative should be used.

Contributor Author

I removed the annotation. The new way cannot add labels, so it cannot completely replace the old method.

@xianjingfeng
Member

  1. How will this new metrics system integrate with existing systems? I think a lot of people are already using Prometheus to collect metrics data. It would be great if we could make it a drop-in replacement for that. cc @xianjingfeng, maybe you have some input on this part.

This PR does not affect the original collection logic. https://www.robustperception.io/exposing-dropwizard-metrics-to-prometheus/

@advancedxy
Contributor

  1. How will this new metrics system integrate with existing systems? I think a lot of people are already using Prometheus to collect metrics data. It would be great if we could make it a drop-in replacement for that. cc @xianjingfeng, maybe you have some input on this part.

This PR does not affect the original collection logic. https://www.robustperception.io/exposing-dropwizard-metrics-to-prometheus/

This is good to know. But it seems it doesn't support labels? Or it could be supported in another way?

@xianjingfeng
Member

This is good to know. But it seems it doesn't support labels? Or it could be supported in another way?

It seems that there is no way.
https://groups.google.com/g/prometheus-users/c/6SwcjeAYkiE/m/VvOSd7nSAgAJ

@kuszz kuszz force-pushed the fixMetricSystem branch 2 times, most recently from 4a22c8e to 5c018b3 Compare August 7, 2024 10:46
@kuszz kuszz closed this Aug 8, 2024
@kuszz kuszz reopened this Aug 8, 2024
@kuszz kuszz closed this Aug 8, 2024
@kuszz kuszz reopened this Aug 8, 2024
@kuszz kuszz closed this Aug 8, 2024
@kuszz kuszz reopened this Aug 8, 2024

Contributor

@advancedxy advancedxy left a comment

Thanks @kuszz, I think this is in the right direction. Just some coding style issues to be addressed.

It would be great if you can add some UTs for the new gauge metric as well.

@@ -47,6 +50,7 @@ public MetricsManager(CollectorRegistry collectorRegistry, Map<String, String> d
this.defaultLabelNames = defaultLabels.keySet().toArray(new String[0]);
this.defaultLabelValues =
Arrays.stream(defaultLabelNames).map(defaultLabels::get).toArray(String[]::new);
this.gaugeMap = new ConcurrentHashMap<>();

Contributor

Use this.gaugeMap = JavaUtils.newConcurrentMap();

There is a performance issue with ConcurrentHashMap in JDK 8, so we have to wrap it with a special one.

Contributor Author

thanks, I have modified it
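
For context, the JDK 8 issue referred to above is most likely the known behavior of `ConcurrentHashMap.computeIfAbsent`, which can take a lock even when the key is already present. What `JavaUtils.newConcurrentMap()` does internally is not shown in this thread; a common workaround, sketched here with a hypothetical `FastComputeIfAbsentMap` class, is to try a plain `get` first:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical wrapper; only assumed to resemble the "special" map mentioned above.
class FastComputeIfAbsentMap<K, V> extends ConcurrentHashMap<K, V> {
  private static final long serialVersionUID = 1L;

  @Override
  public V computeIfAbsent(K key, Function<? super K, ? extends V> mappingFunction) {
    // Lock-free fast path: most calls hit an existing key.
    V existing = get(key);
    if (existing != null) {
      return existing;
    }
    // Slow path: fall back to the atomic JDK implementation.
    return super.computeIfAbsent(key, mappingFunction);
  }
}

class FastComputeIfAbsentMapDemo {
  public static void main(String[] args) {
    ConcurrentHashMap<String, Integer> gaugeMap = new FastComputeIfAbsentMap<>();
    gaugeMap.computeIfAbsent("event_queue_size", k -> 1);
    // The key exists now, so the mapping function is not invoked again.
    gaugeMap.computeIfAbsent("event_queue_size", k -> 2);
    System.out.println(gaugeMap);
  }
}
```

The atomicity guarantee is unchanged because the slow path still delegates to the JDK implementation; only reads of existing keys skip it.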

@@ -112,4 +142,10 @@ public Summary.Child addLabeledSummary(String name) {
}
return builder.register(collectorRegistry).labels(defaultLabelValues);
}

public void unRegister() {

Contributor

hmm. This signature doesn't seem right.

It should be something like:

public void unregisterSupplierGauge(String name)

?

Contributor Author

done

Contributor

@advancedxy advancedxy left a comment

Generally lgtm, except one minor comment.

@maobaolong please take a look as well in case you have more comments.

@codecov-commenter

Codecov Report

Attention: Patch coverage is 38.88889% with 22 lines in your changes missing coverage. Please review.

Project coverage is 52.53%. Comparing base (5ddcc28) to head (e42cd9f).
Report is 37 commits behind head on master.

| Files | Patch % | Lines |
|---|---|---|
| ...g/apache/uniffle/common/metrics/SupplierGauge.java | 0.00% | 15 Missing ⚠️ |
| .../apache/uniffle/common/metrics/MetricsManager.java | 12.50% | 7 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##             master    #1991      +/-   ##
============================================
- Coverage     52.77%   52.53%   -0.24%     
- Complexity     2498     2925     +427     
============================================
  Files           398      447      +49     
  Lines         18135    23546    +5411     
  Branches       1660     2196     +536     
============================================
+ Hits           9570    12370    +2800     
- Misses         7981    10383    +2402     
- Partials        584      793     +209     


@kuszz kuszz force-pushed the fixMetricSystem branch from e42cd9f to a893f50 Compare August 9, 2024 07:29
public void unregisterAllSupplierGauge() {
for (SupplierGauge gauge : supplierGaugeMap.values()) {
collectorRegistry.unregister(gauge);
}

Contributor

hmm, I think we should clear the supplierGaugeMap as well?

Contributor Author

done
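
The concern in this thread is that unregistering the gauges from the collector registry while leaving them in `supplierGaugeMap` would leave stale entries behind. A stdlib-only sketch of why the map should be cleared too, with `FakeCollectorRegistry` and `GaugeBook` as hypothetical stand-ins for the Prometheus registry and the `MetricsManager` bookkeeping:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical stand-in for the Prometheus CollectorRegistry.
class FakeCollectorRegistry {
  final Set<Object> collectors = new HashSet<>();

  void register(Object collector) {
    collectors.add(collector);
  }

  void unregister(Object collector) {
    collectors.remove(collector);
  }
}

// Hypothetical stand-in for the gauge bookkeeping in MetricsManager.
class GaugeBook {
  private final FakeCollectorRegistry registry = new FakeCollectorRegistry();
  private final Map<String, Supplier<Number>> supplierGaugeMap = new ConcurrentHashMap<>();

  synchronized boolean registerIfAbsent(String name, Supplier<Number> gauge) {
    if (supplierGaugeMap.containsKey(name)) {
      return false;
    }
    supplierGaugeMap.put(name, gauge);
    registry.register(gauge);
    return true;
  }

  synchronized void unregisterAll() {
    for (Supplier<Number> gauge : supplierGaugeMap.values()) {
      registry.unregister(gauge);
    }
    // Without this, a later registerIfAbsent would be rejected by a stale entry.
    supplierGaugeMap.clear();
  }

  synchronized int registeredCount() {
    return registry.collectors.size();
  }
}

class GaugeBookDemo {
  public static void main(String[] args) {
    GaugeBook book = new GaugeBook();
    book.registerIfAbsent("event_queue_size", () -> 1);
    book.unregisterAll();
    // Re-registration works because the name map was cleared as well.
    System.out.println(book.registerIfAbsent("event_queue_size", () -> 2));
  }
}
```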

@kuszz kuszz force-pushed the fixMetricSystem branch from c5bc70f to e03a243 Compare August 9, 2024 08:44

Member

@maobaolong maobaolong left a comment

@kuszz @advancedxy LGTM. I gave some suggestions outside of GitHub, and I have checked that all of them are addressed.

@advancedxy
Contributor

LGTM on my side too. @kuszz, would you mind rebasing onto the latest master to make the CI happy?

@kuszz kuszz closed this Aug 12, 2024
@kuszz kuszz reopened this Aug 12, 2024
@kuszz kuszz closed this Aug 12, 2024
@kuszz kuszz reopened this Aug 12, 2024
@kuszz kuszz closed this Aug 13, 2024
@kuszz kuszz reopened this Aug 13, 2024
@kuszz
Contributor Author

kuszz commented Aug 13, 2024

LGTM on my side too. @kuszz, would you mind rebasing onto the latest master to make the CI happy?

@advancedxy The Rust CI reports a 404 error.

@advancedxy
Contributor

Hmm, it seems all the CI runs are failing? @zuston, would you mind taking a look at the Rust CI failure?

@maobaolong
Member

@advancedxy If the Rust-related CI checks are difficult to fix, can we skip them for now via #2041?

@advancedxy
Contributor

I'm fine with ignoring that for now. cc @zuston

@advancedxy
Contributor

Thanks to @maobaolong, the Rust CI check is temporarily disabled. @kuszz, please rebase onto master and retry the CI. Sorry for the inconvenience.

@kuszz
Contributor Author

kuszz commented Aug 14, 2024

Thanks to @maobaolong, the Rust CI check is temporarily disabled. @kuszz, please rebase onto master and retry the CI. Sorry for the inconvenience.

@advancedxy done

@advancedxy advancedxy merged commit 8dfa2a7 into apache:master Aug 14, 2024
41 checks passed
@advancedxy
Contributor

Merged, thanks @kuszz.

zuston pushed a commit that referenced this pull request Sep 18, 2024
…ric type and add requireBufferCount metrics (#2113)

### What changes were proposed in this pull request?

Refactor SupplierGauge to support generic type and add requireBufferCount metrics

### Why are the changes needed?

Fix: #1991

### Does this PR introduce _any_ user-facing change?

Add new metrics named `require_buffer_count`

### How was this patch tested?

Tested through dashboard server metrics popup page.

<img width="945" alt="image" src="https://github.com/user-attachments/assets/2fbbb1d1-7f1f-41ac-ab41-adba55790005">


---------

Co-authored-by: xianjingfeng <[email protected]>
maobaolong pushed a commit to maobaolong/incubator-uniffle that referenced this pull request Sep 24, 2024
(cherry picked from commit 8dfa2a7)
maobaolong added a commit to maobaolong/incubator-uniffle that referenced this pull request Nov 4, 2024
…t generic type and add requireBufferCount metrics (apache#2113)

(cherry picked from commit d1aa51b)
Development

Successfully merging this pull request may close these issues.

[Improvement] Refactor metrics system to reduce periodically report metrics heavy behavior
5 participants