
Conversation

@zimulala (Contributor) commented Apr 17, 2025

What problem does this PR solve?

The current bottleneck in diag data pulling is that the metrics with the largest storage footprint either take a very long time to export or fail to export outright.

What is changed and how it works?

  • To address potential failures when exporting metrics with large data volumes:
    • When a request retrieves too many samples or hits a Prometheus OOM, mitigation strategies include:
      • Reducing the time range covered by each query request. The current minimum is 1 minute (this setting works for metrics with fewer than 8 million series). A sketch of this range splitting appears after this list.
    • To address excessive concurrency (the current default is 5):
      • Set an upper limit that does not exceed Prometheus's default of 20.
    • Optimized connection pool management (rather than relying on the default http.Client):
      • By customizing the connection pool size and layering the timeouts, blocked connections no longer stall other requests, improving concurrency stability. A sketch of such a tuned http.Client also appears after this list.
  • When a single metric has a large data volume, pulling it can be slow:
    • Enable parallel processing when pulling an individual metric's data, although this may lead to OOM issues.
  • Regarding OOM problems caused by pulling top-heavy metrics:
    • When the metrics-low-priority flag is set, separate the pulling of top-heavy metrics from normal metrics. This keeps normal metric pulling unaffected while still attempting to retrieve the top-heavy metrics.
      • If pulling top-heavy metrics fails, consider reducing concurrency and shrinking the time range of each request.
      • This may make data retrieval very slow.
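
As a rough illustration of the range splitting and concurrency cap described above (a sketch only, not the code in this PR; queryChunk is a hypothetical placeholder for the actual Prometheus range-query call):

```go
package pull

import (
	"context"
	"sync"
	"time"
)

// minChunk is the smallest time range a single request is allowed to cover.
const minChunk = time.Minute

// maxConcurrency caps in-flight requests; Prometheus's default query
// concurrency is 20, so the limit never goes above that.
const maxConcurrency = 20

// splitRange cuts [start, end) into sub-ranges of at most chunk,
// never smaller than minChunk.
func splitRange(start, end time.Time, chunk time.Duration) [][2]time.Time {
	if chunk < minChunk {
		chunk = minChunk
	}
	var out [][2]time.Time
	for cur := start; cur.Before(end); cur = cur.Add(chunk) {
		next := cur.Add(chunk)
		if next.After(end) {
			next = end
		}
		out = append(out, [2]time.Time{cur, next})
	}
	return out
}

// queryChunk is a hypothetical placeholder for the real Prometheus
// range-query call used by diag.
func queryChunk(ctx context.Context, metric string, from, to time.Time) error {
	_ = ctx
	_, _, _ = metric, from, to
	return nil
}

// pullMetric fetches one metric in parallel sub-ranges, bounded by a semaphore.
func pullMetric(ctx context.Context, metric string, start, end time.Time, chunk time.Duration, concurrency int) {
	if concurrency <= 0 {
		concurrency = 1
	}
	if concurrency > maxConcurrency {
		concurrency = maxConcurrency
	}
	sem := make(chan struct{}, concurrency)
	var wg sync.WaitGroup
	for _, r := range splitRange(start, end, chunk) {
		wg.Add(1)
		sem <- struct{}{}
		go func(from, to time.Time) {
			defer wg.Done()
			defer func() { <-sem }()
			// On "too many samples" or OOM errors, the caller could retry
			// with a smaller chunk (down to minChunk) and lower concurrency.
			_ = queryChunk(ctx, metric, from, to)
		}(r[0], r[1])
	}
	wg.Wait()
}
```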
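
And a minimal sketch of the kind of connection-pool sizing and layered timeouts mentioned above; every value here is an illustrative assumption, not necessarily what this PR configures:

```go
package pull

import (
	"net"
	"net/http"
	"time"
)

// newHTTPClient builds an http.Client with an explicitly sized connection
// pool and layered timeouts, so one blocked connection cannot stall the
// whole pulling pipeline. All values are illustrative.
func newHTTPClient() *http.Client {
	transport := &http.Transport{
		// Connection pool sizing.
		MaxIdleConns:        100,
		MaxIdleConnsPerHost: 20,
		MaxConnsPerHost:     20,
		IdleConnTimeout:     90 * time.Second,
		// Layered timeouts: dialing, the TLS handshake, and waiting for
		// response headers each get their own budget.
		DialContext: (&net.Dialer{
			Timeout:   10 * time.Second,
			KeepAlive: 30 * time.Second,
		}).DialContext,
		TLSHandshakeTimeout:   10 * time.Second,
		ResponseHeaderTimeout: 30 * time.Second,
		ExpectContinueTimeout: 1 * time.Second,
	}
	return &http.Client{
		Transport: transport,
		// Overall per-request ceiling; long range queries should rely on
		// context deadlines if they need more time.
		Timeout: 5 * time.Minute,
	}
}
```

Separating the dial, TLS, and response-header budgets keeps a single stalled connection from consuming the whole per-request timeout.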

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
result
  • No code

Code changes

  • Has exported function/method change
  • Has exported variable/fields change
  • Has interface methods change
  • Has persistent data change

Side effects

  • Possible performance regression
  • Increased code complexity
  • Breaking backward compatibility

Related changes

  • Need to cherry-pick to the release branch
  • Need to update the documentation

@CLAassistant commented Apr 17, 2025

CLA assistant check
All committers have signed the CLA.

cmd.Flags().StringSliceVar(&ext, "exclude", nil, "types of data not to collect")
cmd.Flags().StringSliceVar(&cOpt.MetricsFilter, "metricsfilter", nil, "prefix of metrics to collect")
cmd.Flags().StringSliceVar(&cOpt.MetricsExclude, "metricsexclude", []string{"node_interrupts_total"}, "prefix of metrics to exclude")
cmd.Flags().StringSliceVar(&cOpt.MetricsLowPriority, "metrics-low-priority", []string{"tidb_tikvclient_request_seconds_bucket"},
@zimulala (Contributor Author) commented:
This needs to be considered: whether to separate the data pulling of high-volume metrics from regular metrics by default.
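
For reference, a rough sketch of how the --metrics-low-priority prefixes could drive a two-pass pull (partitionMetrics is a hypothetical name, not the PR's actual implementation):

```go
package pull

import "strings"

// partitionMetrics splits metric names into a normal set and a low-priority
// set based on the prefixes given via --metrics-low-priority. This only
// illustrates the two-pass idea described in the PR.
func partitionMetrics(all, lowPriorityPrefixes []string) (normal, lowPriority []string) {
	for _, m := range all {
		isLow := false
		for _, p := range lowPriorityPrefixes {
			if strings.HasPrefix(m, p) {
				isLow = true
				break
			}
		}
		if isLow {
			lowPriority = append(lowPriority, m)
		} else {
			normal = append(normal, m)
		}
	}
	return normal, lowPriority
}
```

Pulling the normal set first keeps regular metrics unaffected; the low-priority set can then be retried with lower concurrency and smaller time ranges, at the cost of slower retrieval.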

@zimulala zimulala force-pushed the zimuxia/conncurrency branch from 180a406 to 4e78028 on April 18, 2025 01:46
@zimulala (Contributor Author) commented:

PTAL @XuHuaiyu

(2 similar comments omitted.)

@zimulala (Contributor Author) commented:
The static-tests issue will be fixed in #485

@zimulala (Contributor Author) commented:
PTAL @XuHuaiyu
