rewriting scraper goroutine scheduling #972
Conversation
Force-pushed f7ec7d3 to a579b29
I did a comparative performance test for the modification of metrics-server.
Results: after the modification, memory usage dropped from 45Mi (v0.6.1) to 40Mi. The number of nodes may be small; from the test results, there is no obvious difference in CPU and memory usage.
pkg/scraper/scraper.go
Outdated
	nodeLister:    nodeLister,
	kubeletClient: client,
	scrapeTimeout: scrapeTimeout,
func NewManageNodeScraper(client client.KubeletMetricsGetter, scrapeTimeout time.Duration, metricResolution time.Duration, store storage.Storage) *manageNodeScraper {
Renaming to ManagedNodeScraper is not needed; I don't think we will want to have both implementations in the same codebase.
OK, I will update it.
pkg/scraper/scraper.go
Outdated
m.nodeLock.Lock()
defer m.nodeLock.Unlock()
if working, exists := m.isWorking[node.UID]; exists && working {
	klog.V(1).ErrorS(fmt.Errorf("Scrape in node is already running"), "", "node", klog.KObj(node))
You should not call verbose V(1) with ErrorS. Let's change it to an Info log.
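For reference, a minimal sketch of the requested change using k8s.io/klog/v2; the message text mirrors the diff above, and the node value is a stand-in for klog.KObj(node):

```go
package main

import (
	"flag"

	"k8s.io/klog/v2"
)

func main() {
	klog.InitFlags(nil)
	flag.Set("v", "1") // enable verbosity level 1 for the demo
	flag.Parse()

	// Report "scrape already running" as a verbose info message instead
	// of pairing V(1) with ErrorS, as the review asks.
	klog.V(1).InfoS("Scrape is already running on node", "node", "node-a")
	klog.Flush()
}
```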
pkg/scraper/scraper.go
Outdated
ticker := time.NewTicker(m.metricResolution)
defer ticker.Stop()

res, _ := m.ScrapeData(node)
Why return an error from ScrapeData if it's silenced everywhere?
I will update it.
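A sketch of what surfacing that error could look like; scrapeData here is a stand-in for the PR's ScrapeData method, not the actual implementation:

```go
package main

import (
	"errors"
	"fmt"
)

// scrapeData stands in for the PR's ScrapeData; it can fail, e.g. when
// the kubelet is unreachable.
func scrapeData(node string) (string, error) {
	return "", errors.New("kubelet unreachable")
}

func main() {
	// Instead of res, _ := m.ScrapeData(node), check the error so the
	// per-node loop can log the failure and skip storing for this tick.
	res, err := scrapeData("node-a")
	if err != nil {
		fmt.Println("failed to scrape node-a:", err)
		return
	}
	fmt.Println("storing metrics:", res)
}
```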
pkg/scraper/scraper.go
Outdated
	}
}()
if _, exists := m.isWorking[node.UID]; !exists || !m.isWorking[node.UID] {
- if _, exists := m.isWorking[node.UID]; !exists || !m.isWorking[node.UID] {
+ if !m.isWorking[node.UID] {
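The simplification works because a Go map lookup for a missing key returns the value type's zero value, so the explicit existence check adds nothing here; a tiny illustration:

```go
package main

import "fmt"

func main() {
	isWorking := map[string]bool{"node-a": true}

	// A missing key yields the zero value (false), which is exactly
	// what the longer !exists || !isWorking[...] check encodes.
	fmt.Println(isWorking["node-a"])  // true
	fmt.Println(isWorking["unknown"]) // false
}
```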
pkg/scraper/scraper.go
Outdated
startTime := myClock.Now()
baseCtx, cancel := context.WithCancel(context.Background())
Why use WithCancel if cancel isn't used anywhere? Also, two lines below you use WithTimeout, which also gives you a cancel.
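For illustration, a sketch of the suggested simplification: a single context.WithTimeout already returns a cancel function, so no separate WithCancel parent is needed (the durations are made up for the demo):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// scrapeOnce sketches the reviewer's point: WithTimeout alone already
// returns a cancel function, so a separate WithCancel parent context
// is unnecessary.
func scrapeOnce(scrapeTimeout time.Duration) {
	ctx, cancel := context.WithTimeout(context.Background(), scrapeTimeout)
	defer cancel() // always release the context's resources

	select {
	case <-time.After(10 * time.Millisecond): // stand-in for the kubelet request
		fmt.Println("scrape finished")
	case <-ctx.Done():
		fmt.Println("scrape timed out:", ctx.Err())
	}
}

func main() {
	scrapeOnce(100 * time.Millisecond)
}
```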
pkg/scraper/scraper.go
Outdated
	requestTotal.WithLabelValues("false").Inc()
	return nil, err
}
requestTotal.WithLabelValues("true").Inc()
return ms, nil
res := &storage.MetricsBatch{
This merging is not needed at all.
pkg/scraper/scraper.go
Outdated
klog.V(6).InfoS("Storing metrics from node", "node", klog.KObj(node))
m.storage.Store(res)
}
func (m *manageNodeScraper) UpdateNodeScraper(node *corev1.Node) error {
Not sure how this differs from AddNodeScraper.
I have defined three functions, AddNodeScraper/UpdateNodeScraper/DeleteNodeScraper, based on the three events in pkg/server/config.go, but I am not sure whether we need to handle the update event. If not, it should be deleted; it is similar to AddNodeScraper.
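For context, a hedged sketch of how such event wiring typically looks with a client-go node informer; the nodeScraper interface and its signatures are inferred from the PR's diff and may not match the actual code in pkg/server/config.go:

```go
package server

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
	"k8s.io/klog/v2"
)

// nodeScraper abstracts the three functions discussed above; the
// signatures are inferred from the PR's diff and may not match exactly.
type nodeScraper interface {
	AddNodeScraper(*corev1.Node) error
	UpdateNodeScraper(*corev1.Node) error
	DeleteNodeScraper(*corev1.Node)
}

// registerNodeHandlers shows how node informer events might be mapped
// onto the scraper. This wiring is an assumption for illustration.
func registerNodeHandlers(informer cache.SharedIndexInformer, s nodeScraper) {
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			if node, ok := obj.(*corev1.Node); ok {
				if err := s.AddNodeScraper(node); err != nil {
					klog.ErrorS(err, "Failed to start scraping", "node", klog.KObj(node))
				}
			}
		},
		UpdateFunc: func(_, newObj interface{}) {
			if node, ok := newObj.(*corev1.Node); ok {
				if err := s.UpdateNodeScraper(node); err != nil {
					klog.ErrorS(err, "Failed to update node scraper", "node", klog.KObj(node))
				}
			}
		},
		DeleteFunc: func(obj interface{}) {
			if node, ok := obj.(*corev1.Node); ok {
				s.DeleteNodeScraper(node)
			}
		},
	})
}
```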
pkg/scraper/scraper.go
Outdated
func (m *manageNodeScraper) DeleteNodeScraper(node *corev1.Node) {
	m.nodeLock.Lock()
	defer m.nodeLock.Unlock()
	if working, exists := m.isWorking[node.UID]; exists && working {
- if working, exists := m.isWorking[node.UID]; exists && working {
+ if m.isWorking[node.UID] {
pkg/scraper/scraper.go
Outdated
if _, exists := m.stop[node.UID]; exists {
	m.stop[node.UID] <- struct{}{}
- if _, exists := m.stop[node.UID]; exists {
- 	m.stop[node.UID] <- struct{}{}
+ if stop, exists := m.stop[node.UID]; exists {
+ 	stop <- struct{}{}
pkg/storage/node.go
Outdated
@@ -69,8 +72,8 @@ func (s *nodeStorage) GetMetrics(nodes ...*corev1.Node) ([]metrics.NodeMetrics,
 }

 func (s *nodeStorage) Store(batch *MetricsBatch) {
-	lastNodes := make(map[string]MetricsPoint, len(batch.Nodes))
-	prevNodes := make(map[string]MetricsPoint, len(batch.Nodes))
+	lastNodes := make(map[string]MetricsPoint, len(batch.Nodes)+len(s.last))
No need to allocate space for batch.Nodes, as the size of the stored data should stay around the same level.
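Illustrating that point: when the node count is stable, a capacity hint from the previous cycle's map alone is enough, and under-sizing is safe since maps grow as needed (MetricsPoint here is a stand-in for the real storage type):

```go
package main

import "fmt"

// MetricsPoint stands in for the storage type in pkg/storage.
type MetricsPoint struct{ CumulativeCPUUsed uint64 }

func main() {
	// s.last from the previous scrape cycle; in steady state its length
	// already approximates the number of nodes being scraped.
	last := map[string]MetricsPoint{"node-a": {100}, "node-b": {200}}

	// Capacity hint from len(last) alone, per the review comment.
	lastNodes := make(map[string]MetricsPoint, len(last))
	for name, p := range last {
		lastNodes[name] = p
	}
	fmt.Println(len(lastNodes))
}
```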
pkg/storage/node.go
Outdated
@@ -96,6 +99,17 @@ func (s *nodeStorage) Store(batch *MetricsBatch) {
			}
		}
	}
	newTimeStamp := time.Now()
This code is pretty inefficient: every time we scrape a node, we scan all nodes to check whether their metrics are outdated. This gives us O(n^2) time and memory complexity to scrape and store metrics from n nodes.
If this was not picked up in the tests you prepared, it means they were not accurate. Please look into writing a benchmark for this.
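A benchmark along these lines might look like the sketch below; it models the flagged pattern (rescan all entries on every single-node store) with stand-in types rather than the actual metrics-server storage:

```go
package storage

import (
	"fmt"
	"strconv"
	"testing"
	"time"
)

// point stands in for the PR's MetricsPoint.
type point struct{ ts time.Time }

// storeAndExpire models the flagged pattern: every single-node store
// rescans all n entries for outdated metrics, i.e. O(n) per scrape and
// O(n^2) across one full scrape cycle of n nodes.
func storeAndExpire(last map[string]point, node string, now time.Time, ttl time.Duration) {
	last[node] = point{ts: now}
	for name, p := range last {
		if now.Sub(p.ts) > ttl {
			delete(last, name)
		}
	}
}

// BenchmarkStoreAndExpire measures how per-store cost grows with the
// number of nodes; per-op time rising with n exposes the O(n) scan.
func BenchmarkStoreAndExpire(b *testing.B) {
	for _, n := range []int{100, 1000, 5000} {
		b.Run(fmt.Sprintf("nodes=%d", n), func(b *testing.B) {
			now := time.Now()
			last := make(map[string]point, n)
			for i := 0; i < n; i++ {
				last["node-"+strconv.Itoa(i)] = point{ts: now}
			}
			b.ResetTimer()
			for i := 0; i < b.N; i++ {
				storeAndExpire(last, "node-"+strconv.Itoa(i%n), now, time.Minute)
			}
		})
	}
}
```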
OK. I will optimize it
Let's focus on creating a benchmark that would detect the difference.
Comparing the current code with these results makes me think that we need to write a proper benchmark to reliably measure improvement. Note: I treat this PR as a proof of concept that can be used to compare performance. @yangjunmyfm192085 no need to update/fix the tests. /cc @shuaich
@serathius: GitHub didn't allow me to request PR reviews from the following users: shuaich. Note that only kubernetes-sigs members and repo collaborators can review this PR, and authors cannot review their own PRs.
Force-pushed ca43dde to 6f2b94e
/test pull-metrics-server-test-e2e
Force-pushed 6f2b94e to 7adfa24
Force-pushed 0195d7e to 18b92b1
Force-pushed 18b92b1 to e73d3a8
/retest
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs, so this bot triages issues and PRs according to its lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
/remove-lifecycle stale
Force-pushed e73d3a8 to 4cdc172
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: yangjunmyfm192085. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing /approve in a comment.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs, so this bot triages issues and PRs according to its lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
@yangjunmyfm192085: PR needs rebase.
The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs, so this bot triages PRs according to its lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs, so this bot triages PRs according to its lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /close
@k8s-triage-robot: Closed this PR.
/reopen
@jasonliu747: You can't reopen an issue/PR unless you authored it or you are a collaborator.
|
func (s *server) tick(ctx context.Context, startTime time.Time) {
	s.tickStatusMux.Lock()
	s.tickLastStart = startTime
Correct me if I am wrong. In this implementation, probeMetricCollectionTimely() will always return an error, right?
What this PR does / why we need it:
Consider rewriting scraper goroutine scheduling. We should test it properly before committing to it: create benchmarks and validate scraper and storage performance before merging any code, since in the worst case we could end up with more complicated code without addressing any of the problems. (A rough sketch of the per-node scheduling idea follows at the end.)
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged): Fixes #778
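For context, a minimal sketch of the per-node goroutine scheduling idea this PR explores: one ticker-driven scrape loop and stop channel per node instead of a single global scrape pass. All names here are illustrative, not the PR's actual API:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// nodeScrapers sketches one ticker-driven scrape loop per node with a
// stop channel, instead of a single global scrape pass.
type nodeScrapers struct {
	mu   sync.Mutex
	stop map[string]chan struct{}
}

// add starts a scrape loop for node unless one is already running.
func (s *nodeScrapers) add(node string, resolution time.Duration, scrape func(string)) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.stop[node]; ok {
		return // already running
	}
	stop := make(chan struct{})
	s.stop[node] = stop
	go func() {
		ticker := time.NewTicker(resolution)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				scrape(node)
			case <-stop:
				return
			}
		}
	}()
}

// remove stops the node's loop; closing the channel (rather than
// sending on it) cannot block even if the goroutine already exited.
func (s *nodeScrapers) remove(node string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if stop, ok := s.stop[node]; ok {
		close(stop)
		delete(s.stop, node)
	}
}

func main() {
	s := &nodeScrapers{stop: map[string]chan struct{}{}}
	s.add("node-a", 50*time.Millisecond, func(n string) { fmt.Println("scraping", n) })
	time.Sleep(180 * time.Millisecond)
	s.remove("node-a")
}
```

Note the design choice in remove: closing the stop channel is a broadcast that never blocks, whereas the send used in the PR's diff can block if the per-node goroutine has already returned.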