Monitoring node runs out of RAM and CPU resources as the number of tables and the data in them grows #2429
Monitoring node that is still alive: https://eu-west-1.console.aws.amazon.com/ec2/home?region=eu-west-1#InstanceDetails:instanceId=i-022788c782a7a759c
@fruch, @roydahan this is ^ the same problem observed here: enterprise-2024.2/reproducers/scale-5000-tables-test#3
@vponomaryov does the monitoring server match the memory space requirement?
I've found that we have some table-specific metrics, and they are enabled by default. Is that on purpose?
Yes.
I've found the answer: scylladb/scylladb#13293. Yes, it was deliberate, and @tzach, you got the benchmark you asked for back then :) we have
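For anyone who wants to quantify how much of the cardinality comes from those table-specific series, here is a minimal sketch that queries Prometheus over its HTTP API. It assumes Prometheus listens on the default port 9090 and that the per-table series carry a `cf` (table) label, as the Scylla dashboards use; adjust the matchers if your setup differs.

```python
import json
import urllib.parse
import urllib.request

PROM_URL = "http://localhost:9090"  # assumption: default monitoring-stack Prometheus port


def instant_query(expr):
    """Run an instant PromQL query and return the first returned value (or 0)."""
    url = PROM_URL + "/api/v1/query?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url) as resp:
        result = json.load(resp)["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0


# Count all active scylla_* series vs. those carrying the per-table `cf` label.
total = instant_query('count({__name__=~"scylla_.*"})')
per_table = instant_query('count({__name__=~"scylla_.*", cf!=""})')
print(f"scylla series total:     {total:,.0f}")
print(f"of which table-specific: {per_table:,.0f} ({per_table / max(total, 1) * 100:.0f}%)")
```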
The following instance type was used for the monitoring node in the test run referenced in this bug report:
Please fetch the TSDB status page from the Prometheus UI; it will help us analyze this.
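If it is more convenient than copying from the UI, the same data is available from Prometheus' HTTP API on reasonably recent versions. A minimal sketch, assuming Prometheus listens on the default port 9090 on the monitoring node:

```python
import json
import urllib.request

PROM_URL = "http://localhost:9090"  # assumption: default Prometheus port on the monitoring node

# /api/v1/status/tsdb returns the head stats and cardinality tables
# that the "TSDB Status" page in the UI renders.
with urllib.request.urlopen(PROM_URL + "/api/v1/status/tsdb") as resp:
    data = json.load(resp)["data"]

print("head series:", data["headStats"]["numSeries"])
print("top series count by metric name:")
for entry in data["seriesCountByMetricName"]:
    print(f'  {entry["name"]}: {entry["value"]}')
```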
TSDB Status: Head Stats
Head Cardinality Stats:
- Top 10 label names with value count
- Top 10 series count by metric names
- Top 10 label names with high memory usage
- Top 10 series count by label value pairs
Resource usage on the monitoring node:
There's really nothing we can do: part of the metrics count is proportional to the number of nodes multiplied by the number of tables. I will need to come up with a better metrics-prediction formula (though it will always be difficult). When there are many tables, we can use a bigger monitoring node or disable the per-table metrics.
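For the prediction formula, the shape is roughly series ≈ nodes × shards × (per-shard metrics + tables × per-table metrics). A back-of-the-envelope sketch; every constant below is an illustrative assumption, not a measured Scylla number:

```python
# Rough cardinality/memory estimate. Plug in real values from the TSDB status page.
nodes = 6
shards_per_node = 14         # assumption: shard count roughly tracks vCPUs
tables = 5000                # target scale of this test
per_shard_metrics = 1500     # assumption: node/shard-scoped series per shard
per_table_metrics = 10       # assumption: table-scoped series per table per shard
bytes_per_series = 8 * 1024  # assumption: rough in-memory cost of one active series

node_series = nodes * shards_per_node * per_shard_metrics
table_series = nodes * shards_per_node * tables * per_table_metrics
total_series = node_series + table_series

print(f"node/shard series:  {node_series:,}")
print(f"per-table series:   {table_series:,}")
print(f"total series:       {total_series:,}")
print(f"approx head memory: {total_series * bytes_per_series / 2**30:.1f} GiB")
```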
Installation details
Panel Name: any
Dashboard Name: any
Scylla-Monitoring Version: 4.8.0
Scylla-Version: 2024.2.0~rc3-20241004.89f8638e9e9b
Monitor node instance type: m6i.xlarge
Running a test that creates tables in batches of 125, we observe constant memory and CPU utilization growth:
The same applies to disk utilization:
Result of the top command:
DB nodes load:
The DB nodes load screenshot shows the batch pattern: each "tooth" is the population of 125 tables.
Argus: scylla-staging/valerii/vp-scale-5000-tables-test#3
CI job: https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/valerii/job/vp-scale-5000-tables-test/3