
Conversation

@zimulala (Contributor) commented Apr 17, 2025

What problem does this PR solve?

The current bottleneck in diag data pulling is that the metrics with the largest storage footprint either take a very long time to export or fail to export outright.

What is changed and how it works?

  • To address potential failures when exporting metrics with large data volumes:
    • When a request retrieves too many samples or hits a Prometheus OOM, mitigation strategies include:
      • Reducing the time range covered by each query request. The current minimum is 1 minute (this setting works for metrics with fewer than 8 million series). A sketch of this range splitting appears after this list.
    • To address excessive concurrency (the current default is 5):
      • Set an upper limit that does not exceed Prometheus's default of 20.
    • Optimized connection pool management (rather than relying on the default http.Client):
      • By customizing the connection pool size and layering the timeouts, blocked connections no longer stall other requests, improving concurrency stability. A sketch of such a tuned http.Client also appears after this list.
  • When a single metric has a large data volume, pulling it can be slow:
    • Enable parallel processing when pulling an individual metric's data, although this may lead to OOM issues.
  • Regarding OOM problems caused by pulling top-heavy metrics:
    • When the metrics-low-priority flag is set, separate the pulling of top-heavy metrics from normal metrics. This keeps normal metric pulling unaffected while still attempting to retrieve the top-heavy metrics.
      • If pulling top-heavy metrics fails, consider reducing concurrency and shrinking the time range of each request.
      • This may make data retrieval very slow.
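
As a rough illustration of the range splitting and concurrency cap described above (a sketch only, not the code in this PR; queryChunk is a hypothetical placeholder for the actual Prometheus range-query call):

```go
package pull

import (
	"context"
	"sync"
	"time"
)

// minChunk is the smallest time range a single request is allowed to cover.
const minChunk = time.Minute

// maxConcurrency caps in-flight requests; Prometheus's default query
// concurrency is 20, so the limit never goes above that.
const maxConcurrency = 20

// splitRange cuts [start, end) into sub-ranges of at most chunk,
// never smaller than minChunk.
func splitRange(start, end time.Time, chunk time.Duration) [][2]time.Time {
	if chunk < minChunk {
		chunk = minChunk
	}
	var out [][2]time.Time
	for cur := start; cur.Before(end); cur = cur.Add(chunk) {
		next := cur.Add(chunk)
		if next.After(end) {
			next = end
		}
		out = append(out, [2]time.Time{cur, next})
	}
	return out
}

// queryChunk is a hypothetical placeholder for the real Prometheus
// range-query call used by diag.
func queryChunk(ctx context.Context, metric string, from, to time.Time) error {
	_ = ctx
	_, _, _ = metric, from, to
	return nil
}

// pullMetric fetches one metric in parallel sub-ranges, bounded by a semaphore.
func pullMetric(ctx context.Context, metric string, start, end time.Time, chunk time.Duration, concurrency int) {
	if concurrency <= 0 {
		concurrency = 1
	}
	if concurrency > maxConcurrency {
		concurrency = maxConcurrency
	}
	sem := make(chan struct{}, concurrency)
	var wg sync.WaitGroup
	for _, r := range splitRange(start, end, chunk) {
		wg.Add(1)
		sem <- struct{}{}
		go func(from, to time.Time) {
			defer wg.Done()
			defer func() { <-sem }()
			// On "too many samples" or OOM errors, the caller could retry
			// with a smaller chunk (down to minChunk) and lower concurrency.
			_ = queryChunk(ctx, metric, from, to)
		}(r[0], r[1])
	}
	wg.Wait()
}
```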
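
And a minimal sketch of the kind of connection-pool sizing and layered timeouts mentioned above; every value here is an illustrative assumption, not necessarily what this PR configures:

```go
package pull

import (
	"net"
	"net/http"
	"time"
)

// newHTTPClient builds an http.Client with an explicitly sized connection
// pool and layered timeouts, so one blocked connection cannot stall the
// whole pulling pipeline. All values are illustrative.
func newHTTPClient() *http.Client {
	transport := &http.Transport{
		// Connection pool sizing.
		MaxIdleConns:        100,
		MaxIdleConnsPerHost: 20,
		MaxConnsPerHost:     20,
		IdleConnTimeout:     90 * time.Second,
		// Layered timeouts: dialing, the TLS handshake, and waiting for
		// response headers each get their own budget.
		DialContext: (&net.Dialer{
			Timeout:   10 * time.Second,
			KeepAlive: 30 * time.Second,
		}).DialContext,
		TLSHandshakeTimeout:   10 * time.Second,
		ResponseHeaderTimeout: 30 * time.Second,
		ExpectContinueTimeout: 1 * time.Second,
	}
	return &http.Client{
		Transport: transport,
		// Overall per-request ceiling; long range queries should rely on
		// context deadlines if they need more time.
		Timeout: 5 * time.Minute,
	}
}
```

Separating the dial, TLS, and response-header budgets keeps a single stalled connection from consuming the whole per-request timeout.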

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
result
  • No code

Code changes

  • Has exported function/method change
  • Has exported variable/fields change
  • Has interface methods change
  • Has persistent data change

Side effects

  • Possible performance regression
  • Increased code complexity
  • Breaking backward compatibility

Related changes

  • Need to cherry-pick to the release branch
  • Need to update the documentation

@CLAassistant commented Apr 17, 2025

CLA assistant check
All committers have signed the CLA.

cmd.Flags().StringSliceVar(&ext, "exclude", nil, "types of data not to collect")
cmd.Flags().StringSliceVar(&cOpt.MetricsFilter, "metricsfilter", nil, "prefix of metrics to collect")
cmd.Flags().StringSliceVar(&cOpt.MetricsExclude, "metricsexclude", []string{"node_interrupts_total"}, "prefix of metrics to exclude")
cmd.Flags().StringSliceVar(&cOpt.MetricsLowPriority, "metrics-low-priority", []string{"tidb_tikvclient_request_seconds_bucket"},
@zimulala (Contributor Author) commented:
This needs to be considered: whether to separate the data pulling of high-volume metrics from regular metrics by default.
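
For reference, a rough sketch of how the --metrics-low-priority prefixes could drive a two-pass pull (partitionMetrics is a hypothetical name, not the PR's actual implementation):

```go
package pull

import "strings"

// partitionMetrics splits metric names into a normal set and a low-priority
// set based on the prefixes given via --metrics-low-priority. This only
// illustrates the two-pass idea described in the PR.
func partitionMetrics(all, lowPriorityPrefixes []string) (normal, lowPriority []string) {
	for _, m := range all {
		isLow := false
		for _, p := range lowPriorityPrefixes {
			if strings.HasPrefix(m, p) {
				isLow = true
				break
			}
		}
		if isLow {
			lowPriority = append(lowPriority, m)
		} else {
			normal = append(normal, m)
		}
	}
	return normal, lowPriority
}
```

Pulling the normal set first keeps regular metrics unaffected; the low-priority set can then be retried with lower concurrency and smaller time ranges, at the cost of slower retrieval.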

@zimulala zimulala force-pushed the zimuxia/conncurrency branch from 180a406 to 4e78028 on April 18, 2025 01:46
@zimulala (Contributor Author) commented:

PTAL @XuHuaiyu

(2 similar comments omitted.)

@zimulala (Contributor Author) commented:
The static-tests issue will be fixed in #485

@zimulala (Contributor Author) commented:
PTAL @XuHuaiyu
