
Consider rewriting scraper goroutine scheduling #778

Open

serathius opened this issue May 29, 2021 · 13 comments
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@serathius
Contributor

The Metrics Server scraper is responsible for parallel metric collection. Every cycle it lists all the nodes in the cluster and creates a goroutine to scrape each node. Each goroutine waits a random amount of time (to avoid scraping all nodes at once), collects metrics, and then exits. Problems with this approach:

  • goroutine churn - each cycle allocates and deletes goroutines
  • memory churn - all data needs to be copied into one batch before saving to storage
  • bursty CPU and IO - scraping is executed only at the beginning of each cycle
  • freshness - saving metrics requires waiting for the slowest node

What would you like to be added:

Instead of listing nodes every cycle and churning goroutines, we should maintain one goroutine per node and use an informer event handler to add/remove goroutines as nodes are added or removed. A similar approach is used by Prometheus.

Things to consider:

  • Measuring performance improvement in scraper
  • How storage would consume partial data, when it should discard stale data, and the performance overhead caused by that
  • How to test the code and prevent goroutine leaks

Why is this needed:

I have listed some problems, but we should test them properly before committing to this. We should create benchmarks and validate scraper and storage performance before merging any code, as in the worst case we could end up with more complicated code without addressing any of the problems.

/cc @yangjunmyfm192085 @dgrisonnet

/kind feature

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label May 29, 2021
@yangjunmyfm192085
Contributor

Sounds good. We really need to consider and test this properly before committing to it. When do we plan to start?
After #777 is merged?

@serathius
Contributor Author

I don't think this is necessarily blocked by #777, as they are pretty independent changes. I think we can start performance testing.

@yangjunmyfm192085
Contributor

OK, let me take a look and try to do it.

@dgrisonnet
Member

How storage would consume partial data, when it should discard stale data, and the performance overhead caused by that

I don't think this is possible with the current storage implementation.
To throw out a suggestion to work around that, maybe we could have one store per node. I don't really see any drawbacks to that, except that getting PodMetrics will become more complex.
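A minimal sketch of that one-store-per-node idea might look like the following. All names here (`shardedStorage`, `nodeBatch`, `TotalCPU`) are hypothetical, and the real storage would hold full node and pod metrics rather than a single counter; the point is only to show how partial writes and node removal become trivial while reads have to fan out.

```go
package main

import (
	"fmt"
	"sync"
)

// nodeBatch stands in for one node's latest scrape result.
type nodeBatch struct {
	cpuMillis int64
}

// shardedStorage keeps one independent store per node, so a node's
// goroutine can commit its partial result without waiting for the rest
// of the cluster. Deleting a node simply drops its shard.
type shardedStorage struct {
	mu     sync.RWMutex
	shards map[string]nodeBatch
}

func newShardedStorage() *shardedStorage {
	return &shardedStorage{shards: make(map[string]nodeBatch)}
}

func (s *shardedStorage) Store(node string, b nodeBatch) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.shards[node] = b
}

func (s *shardedStorage) Delete(node string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.shards, node)
}

// TotalCPU illustrates the read-side cost: queries now have to fan out
// across every shard instead of reading one consolidated batch.
func (s *shardedStorage) TotalCPU() int64 {
	s.mu.RLock()
	defer s.mu.RUnlock()
	var total int64
	for _, b := range s.shards {
		total += b.cpuMillis
	}
	return total
}

func main() {
	st := newShardedStorage()
	st.Store("node-a", nodeBatch{cpuMillis: 100})
	st.Store("node-b", nodeBatch{cpuMillis: 50})
	st.Delete("node-b")
	fmt.Println(st.TotalCPU())
}
```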

How to test the code and prevent goroutine leaks

We could check for leaks with: https://github.com/uber-go/goleak
Also, I don't know if metrics-server registers the client_golang Go collector, but it could give us insight into the number of goroutines that are currently running via the go_goroutines metric.
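As a rough standard-library stand-in for what goleak checks, a test can compare `runtime.NumGoroutine()` before and after exercising the scraper. This is only an illustration of the idea; `goleak.VerifyNone(t)` is more robust because it inspects goroutine stacks rather than bare counts.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// leaked reports whether running fn left extra goroutines behind.
// It retries for a short grace period so goroutines that are in the
// middle of exiting are not reported as leaks.
func leaked(fn func()) bool {
	before := runtime.NumGoroutine()
	fn()
	for i := 0; i < 20; i++ {
		if runtime.NumGoroutine() <= before {
			return false
		}
		time.Sleep(10 * time.Millisecond)
	}
	return runtime.NumGoroutine() > before
}

func main() {
	// a function that cleans up after itself should not leak
	fmt.Println(leaked(func() {
		done := make(chan struct{})
		go func() { close(done) }()
		<-done
	}))
}
```

A per-node scraper design would pair this kind of check with an explicit shutdown path, so a test can add nodes, remove them, shut the scraper down, and then assert the goroutine count returned to its baseline.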

@yangjunmyfm192085
Contributor

/assign

@yangjunmyfm192085
Contributor

yangjunmyfm192085 commented Jun 19, 2021

I have been working on this issue recently, and I have a few questions:

  • Do we need to handle both scraping and storage in the goroutine, or is scraping alone enough?
  • If we handle both scraping and storing in the goroutine, do we need to modify the store's logic?
  • When using informer event handlers to add/remove goroutines, do we need to handle update events?
    /cc @serathius @dgrisonnet

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 17, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 17, 2021
@dgrisonnet
Member

/remove-lifecycle rotten
/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Oct 18, 2021
@serathius
Contributor Author

This should implement the "Redesign goroutine management to reduce scheduling cost (will require redesigning storage)" point from #857.

@jasonliu747

/remove-lifecycle frozen
@yangjunmyfm192085 Hi, thanks for your contribution, any update on this issue?

@k8s-ci-robot k8s-ci-robot removed the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Feb 22, 2022
@serathius
Contributor Author

serathius commented Feb 22, 2022

/lifecycle frozen
/cc @shuaich

@k8s-ci-robot k8s-ci-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Feb 22, 2022
@yangjunmyfm192085
Contributor

yangjunmyfm192085 commented Feb 23, 2022

Sorry, I delayed this issue due to lack of time and other commitments. I made the changes at https://github.com/yangjunmyfm192085/metrics-server/tree/rewriting-scraper but have not submitted a PR yet. I will submit the PR as soon as possible, and would appreciate help reviewing whether the changes are reasonable. If there is no problem, I will do a comparison test. @serathius
Of course, @shuaich is also welcome to participate.
