
Consider rewriting scraper goroutine scheduling #778

Open

serathius opened this issue May 29, 2021 · 13 comments
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@serathius
Contributor

The Metrics Server scraper is responsible for parallel metric collection. Every cycle it lists all the nodes in the cluster and creates a goroutine to scrape each node. Each goroutine waits a random amount of time (to avoid scraping all nodes at once), collects metrics, and then exits. Problems with this approach:

  • goroutine churn - each cycle allocates and deletes goroutines
  • memory churn - all data needs to be copied into one batch before saving to storage
  • bursty CPU and IO - scraping is executed only at the beginning of each cycle
  • freshness - saving metrics requires waiting for the slowest node

What would you like to be added:

Instead of listing nodes every cycle and churning goroutines, we should maintain one goroutine per node and use an informer event handler to add/remove goroutines as nodes are added or removed. A similar approach is used by Prometheus.

Things to consider:

  • Measuring performance improvement in scraper
  • How storage would consume partial data, when it should discard stale data, and the performance overhead caused by that
  • How to test the code and prevent goroutine leaks

Why is this needed:

I have listed some problems, but we should test them properly before committing to this. We should create benchmarks and validate scraper and storage performance before merging any code, as in the worst case we could end up with more complicated code without addressing any of the problems.

/cc @yangjunmyfm192085 @dgrisonnet

/kind feature

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label May 29, 2021
@yangjunmyfm192085
Contributor

Sounds good. We really need to consider and test this properly before committing to it. When do we plan to start?
After #777 is merged?

@serathius
Contributor Author

I don't think this is necessarily blocked by #777, as they are pretty independent changes. I think we can start performance testing.

@yangjunmyfm192085
Contributor

OK, let me take a look and try to do it.

@dgrisonnet
Member

How storage would consume partial data, when it should discard stale data, and the performance overhead caused by that

I don't think this is possible with the current storage implementation.
To throw out a suggestion to work around that, maybe we could have one store per node. I don't really see any drawbacks to that, except that getting PodMetrics will become more complex.
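A minimal sketch of that one-store-per-node idea might look like the following. All names here (`shardedStorage`, `nodeBatch`, `TotalCPU`) are hypothetical, and the real storage would hold full node and pod metrics rather than a single counter; the point is only to show how partial writes and node removal become trivial while reads have to fan out.

```go
package main

import (
	"fmt"
	"sync"
)

// nodeBatch stands in for one node's latest scrape result.
type nodeBatch struct {
	cpuMillis int64
}

// shardedStorage keeps one independent store per node, so a node's
// goroutine can commit its partial result without waiting for the rest
// of the cluster. Deleting a node simply drops its shard.
type shardedStorage struct {
	mu     sync.RWMutex
	shards map[string]nodeBatch
}

func newShardedStorage() *shardedStorage {
	return &shardedStorage{shards: make(map[string]nodeBatch)}
}

func (s *shardedStorage) Store(node string, b nodeBatch) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.shards[node] = b
}

func (s *shardedStorage) Delete(node string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.shards, node)
}

// TotalCPU illustrates the read-side cost: queries now have to fan out
// across every shard instead of reading one consolidated batch.
func (s *shardedStorage) TotalCPU() int64 {
	s.mu.RLock()
	defer s.mu.RUnlock()
	var total int64
	for _, b := range s.shards {
		total += b.cpuMillis
	}
	return total
}

func main() {
	st := newShardedStorage()
	st.Store("node-a", nodeBatch{cpuMillis: 100})
	st.Store("node-b", nodeBatch{cpuMillis: 50})
	st.Delete("node-b")
	fmt.Println(st.TotalCPU())
}
```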

How to test the code and prevent goroutine leaks

We could check for leaks with: https://github.com/uber-go/goleak
Also, I don't know if metrics-server registers the client_golang Go collector, but it could give us insight into the number of goroutines that are currently running via the go_goroutines metric.
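As a rough standard-library stand-in for what goleak checks, a test can compare `runtime.NumGoroutine()` before and after exercising the scraper. This is only an illustration of the idea; `goleak.VerifyNone(t)` is more robust because it inspects goroutine stacks rather than bare counts.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// leaked reports whether running fn left extra goroutines behind.
// It retries for a short grace period so goroutines that are in the
// middle of exiting are not reported as leaks.
func leaked(fn func()) bool {
	before := runtime.NumGoroutine()
	fn()
	for i := 0; i < 20; i++ {
		if runtime.NumGoroutine() <= before {
			return false
		}
		time.Sleep(10 * time.Millisecond)
	}
	return runtime.NumGoroutine() > before
}

func main() {
	// a function that cleans up after itself should not leak
	fmt.Println(leaked(func() {
		done := make(chan struct{})
		go func() { close(done) }()
		<-done
	}))
}
```

A per-node scraper design would pair this kind of check with an explicit shutdown path, so a test can add nodes, remove them, shut the scraper down, and then assert the goroutine count returned to its baseline.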

@yangjunmyfm192085
Contributor

/assign

@yangjunmyfm192085
Contributor

yangjunmyfm192085 commented Jun 19, 2021

I have been working on this issue recently, and I have a few questions:

  • Do we need to handle both scraping and storage in the goroutine, or is scraping alone enough?
  • If we handle both scraping and storing in the goroutine, do we need to modify the store's logic?
  • When using informer event handlers to add/remove goroutines, do we need to handle update events?
    /cc @serathius @dgrisonnet

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 17, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 17, 2021
@dgrisonnet
Member

/remove-lifecycle rotten
/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Oct 18, 2021
@serathius
Contributor Author

This should implement the "Redesign goroutine management to reduce scheduling cost (will require redesigning storage)" point from #857.

@jasonliu747

/remove-lifecycle frozen
@yangjunmyfm192085 Hi, thanks for your contribution, any update on this issue?

@k8s-ci-robot k8s-ci-robot removed the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Feb 22, 2022
@serathius
Contributor Author

serathius commented Feb 22, 2022

/lifecycle frozen
/cc @shuaich

@k8s-ci-robot k8s-ci-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Feb 22, 2022
@yangjunmyfm192085
Contributor

yangjunmyfm192085 commented Feb 23, 2022

Sorry, I delayed this issue due to lack of time and other commitments. I made the changes at https://github.com/yangjunmyfm192085/metrics-server/tree/rewriting-scraper but have not submitted a PR yet. I will submit the PR as soon as possible, and would appreciate help reviewing whether the changes are reasonable. If there is no problem, I will do a comparison test. @serathius
Of course, @shuaich is also welcome to participate.
