
Monitoring node runs out of RAM and CPU resources as the number of tables and the data in them grow #2429

Open
vponomaryov opened this issue Dec 2, 2024 · 10 comments

vponomaryov commented Dec 2, 2024

Installation details
Panel Name: any
Dashboard Name: any
Scylla-Monitoring Version: 4.8.0
Scylla-Version: 2024.2.0~rc3-20241004.89f8638e9e9b
Monitor node instance type: m6i.xlarge

Running a test that creates tables in batches of 125, we observe constant memory and CPU utilization growth:

[screenshot: monitoring node memory and CPU utilization growth]

The same applies to disk utilization:

[screenshot: monitoring node disk utilization growth]

Output of the top command:

Tasks: 134 total,   1 running, 133 sleeping,   0 stopped,   0 zombie
%Cpu(s): 25.6 us,  0.2 sy,  0.0 ni, 74.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  15717.2 total,   1244.2 free,  12641.1 used,   1831.9 buff/cache
MiB Swap:  20480.0 total,  16750.0 free,   3730.0 used.   2393.5 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                                                                                                
   5527 ubuntu    20   0  113.7g  12.1g 570644 S 100.3  79.1   8266:01 prometheus                                                                                                                                                                                             
   9710 scylla    20   0   16.0t  76860  20480 S   1.0   0.5  92:07.13 scylla                                                                                                                                                                                                 
    414 root      20   0 1949744  17860   8192 S   0.3   0.1   4:16.46 containerd                                                                                                                                                                                             
   2977 root      20   0 2134828  32928  14080 S   0.3   0.2   3:13.23 dockerd                                                                                                                                                                                                
   5508 root      20   0 1238716   6408   3456 S   0.3   0.0   1:22.68 containerd-shim                                                                                                                                                                                        
   9718 scylla-+  20   0 1266796  25560  11904 S   0.3   0.2   4:57.53 scylla-manager                                                                                                                                                                                         
  57000 root      20   0 1319948  24704  16768 S   0.3   0.2   0:00.04 snapd                                                                                                                                                                                                  
      1 root      20   0  167584   6480   4048 S   0.0   0.0   0:23.68 systemd                                                                                                                                                                                                
      2 root      20   0       0      0      0 S   0.0   0.0   0:00.06 kthreadd                                                                                                                                                                                               
      3 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_gp     

DB nodes load:
[screenshot: DB nodes load]

The batch pattern can be seen in the DB nodes' load screenshot:
each tooth corresponds to the population of one batch of 125 tables.

Argus: scylla-staging/valerii/vp-scale-5000-tables-test#3
CI job: https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/valerii/job/vp-scale-5000-tables-test/3
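
For reference, a minimal sketch of the table-creation pattern described above, assuming the Python cassandra-driver; the keyspace/table names are hypothetical, and the real test lives in the linked CI job:

```python
# Minimal sketch of the batched table-creation pattern (assumptions:
# cassandra-driver installed, node IP taken from this report).
from cassandra.cluster import Cluster

BATCH_SIZE = 125
TOTAL_TABLES = 5000

cluster = Cluster(["10.4.4.193"])  # one of the DB nodes in this report
session = cluster.connect()
session.execute(
    "CREATE KEYSPACE IF NOT EXISTS feeds WITH replication = "
    "{'class': 'NetworkTopologyStrategy', 'replication_factor': 3}"
)

for start in range(0, TOTAL_TABLES, BATCH_SIZE):
    for i in range(start, start + BATCH_SIZE):
        session.execute(
            f"CREATE TABLE IF NOT EXISTS feeds.table_{i} "
            "(pk int PRIMARY KEY, v text)"
        )
    # ...the real test populates each new batch of tables here,
    # producing the "teeth" visible in the load graph...
```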

vponomaryov added the bug label on Dec 2, 2024

tzach commented Dec 2, 2024

@vponomaryov does the monitoring server match the Memory Space requirement? https://monitoring.docs.scylladb.com/stable/install/monitoring-stack.html#calculating-prometheus-minimal-memory-space-requirement


fruch commented Dec 2, 2024

@amnonh

I've found that we have some table-specific metrics, like scylla_column_family_memtable_row_hits.
A quick tour of the Scylla code turned up the flag that enables them:
https://github.com/scylladb/scylladb/blob/acd643bd75468703150b2e23b1bbf05a3e95e42d/db/config.cc#L1012

It defaults to on.

Is that on purpose?
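
For anyone reproducing this, a quick way to gauge the per-table series count is to scrape a single node directly; a sketch assuming the default Scylla Prometheus endpoint on port 9180 and one of the node IPs from this report:

```python
# Count per-table series exposed by one Scylla node; assumes the default
# Prometheus metrics endpoint on port 9180 and the requests package.
import requests

text = requests.get("http://10.4.4.193:9180/metrics", timeout=30).text
per_table = [line for line in text.splitlines()
             if line.startswith("scylla_column_family_")]
print(f"{len(per_table)} scylla_column_family_* series on this node")
```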


mykaul commented Dec 2, 2024

> @amnonh
>
> I've found that we have some table-specific metrics, like scylla_column_family_memtable_row_hits. A quick tour of the Scylla code turned up the flag that enables them: https://github.com/scylladb/scylladb/blob/acd643bd75468703150b2e23b1bbf05a3e95e42d/db/config.cc#L1012
>
> It defaults to on.
>
> Is that on purpose?

Yes.


fruch commented Dec 2, 2024

I've found the answer: scylladb/scylladb#13293

Yes, it was deliberate.

And @tzach, you got the benchmark you asked for back then :)
It's bad, and the calculator from https://monitoring.docs.scylladb.com/stable/install/monitoring-stack.html#calculating-prometheus-minimal-memory-space-requirement doesn't help much when you have 5000+ tables.

We have a t3.large for the monitor, which may not be exactly what the calculator suggests, but two years ago it was working OK for this case...

vponomaryov commented

> @vponomaryov does the monitoring server match the Memory Space requirement? https://monitoring.docs.scylladb.com/stable/install/monitoring-stack.html#calculating-prometheus-minimal-memory-space-requirement

> We have a t3.large for the monitor, which may not be exactly what the calculator suggests, but two years ago it was working OK for this case...

The test run used for this bug report used the following instance type for the monitoring node: m6i.xlarge.


mykaul commented Dec 3, 2024

> @vponomaryov does the monitoring server match the Memory Space requirement? https://monitoring.docs.scylladb.com/stable/install/monitoring-stack.html#calculating-prometheus-minimal-memory-space-requirement

> We have a t3.large for the monitor, which may not be exactly what the calculator suggests, but two years ago it was working OK for this case...

> The test run used for this bug report used the following instance type for the monitoring node: m6i.xlarge.

Please fetch the TSDB status page from the Prometheus UI; it will help us analyze this.


vponomaryov commented Dec 3, 2024

> Please fetch the TSDB status page from the Prometheus UI; it will help us analyze this.

TSDB Status

Head Stats

| Number of Series | Number of Chunks | Number of Label Pairs | Current Min Time | Current Max Time |
|---|---|---|---|---|
| 2843270 | 15940315 | 12293 | 2024-12-01T06:00:00.714Z (1733032800714) | 2024-12-01T09:37:40.845Z (1733045860845) |

Head Cardinality Stats

Top 10 label names with value count:

| Name | Count |
|---|---|
| `__name__` | 1193 |
| cf | 10056 |
| le | 143 |
| type | 115 |
| devices | 83 |
| handler | 51 |
| collector | 46 |
| name | 35 |
| cpu | 32 |
| shard | 30 |

Top 10 series count by metric names:

| Name | Count |
|---|---|
| scylla_column_family_write_latency_bucket | 1366170 |
| scylla_column_family_read_latency_bucket | 679835 |
| wlatencyaks | 55646 |
| wlatencyp95ks | 55646 |
| wlatencyp99ks | 55646 |
| scylla_column_family_cache_hit_rate | 50280 |
| scylla_column_family_live_sstable | 50280 |
| scylla_column_family_total_disk_space | 50280 |
| scylla_column_family_live_disk_space | 50280 |
| rlatencyp99ks | 27724 |

Top 10 label names with high memory usage:

| Name | Bytes |
|---|---|
| `__name__` | 106236467 |
| cf | 44598208 |
| cluster | 28081620 |
| instance | 26033983 |
| le | 25698553 |
| dc | 25301132 |
| job | 15793954 |
| ks | 13335682 |
| by | 2367290 |
| class | 339838 |

Top 10 series count by label value pairs:

| Name | Count |
|---|---|
| dc=eu-west-1 | 2810750 |
| cluster=my-cluster | 2808162 |
| ks=feeds | 2664829 |
| job=scylla | 2556887 |
| `__name__`=scylla_column_family_write_latency_bucket | 1366170 |
| instance=10.4.4.193 | 853470 |
| instance=10.4.6.77 | 853400 |
| instance=10.4.4.64 | 853271 |
| `__name__`=scylla_column_family_read_latency_bucket | 679835 |
| instance=10.4.6.145 | 132761 |
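
The same stats can also be pulled programmatically from Prometheus's HTTP API; a sketch, where "monitor-node" is a placeholder for the monitoring host and 9090 is the default Prometheus port:

```python
# Fetch TSDB head stats from Prometheus; /api/v1/status/tsdb is a standard
# Prometheus HTTP API endpoint.
import requests

resp = requests.get("http://monitor-node:9090/api/v1/status/tsdb", timeout=30)
data = resp.json()["data"]
print("head series:", data["headStats"]["numSeries"])
for entry in data["seriesCountByMetricName"]:
    print(entry["name"], entry["value"])
```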


vponomaryov commented Dec 4, 2024

Setting the enable_node_aggregated_table_metrics: false Scylla config option removed the problem in scylla-staging/valerii/vp-scale-5000-tables-test#4.

Resource usage on the monitoring node:

[screenshot: monitoring node resource usage]


amnonh commented Dec 7, 2024

There's really nothing we can do. Part of the metric count is proportional to the number of nodes multiplied by the number of tables. I will need to come up with a better metrics-prediction formula (though it will always be difficult). When there are many tables, we can use a bigger monitoring node or disable the per-table metrics.
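
As a rough illustration of that proportionality (a sketch only; every constant below is a guess for this cluster, not the documented calculator):

```python
# Back-of-envelope series/memory estimate; all constants are guesses.
# Illustrates how series count scales with nodes * tables.
NODES = 3                      # DB nodes in this test
TABLES = 5000
PER_TABLE_SERIES = 140         # guess: series per table per node (latency histograms dominate)
BASE_SERIES_PER_NODE = 60_000  # guess: series that do not depend on table count
BYTES_PER_SERIES = 4 * 1024    # guess: resident bytes per active head series

series = NODES * (BASE_SERIES_PER_NODE + TABLES * PER_TABLE_SERIES)
ram_gib = series * BYTES_PER_SERIES / 2**30
print(f"~{series / 1e6:.1f}M series, ~{ram_gib:.0f} GiB of Prometheus RAM")
# With these guesses: ~2.3M series, in the ballpark of the 2.8M head series
# reported above.
```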
