
test(perf): grow shrink cluster 3 nodes in parallel #7504

Closed

Conversation

@soyacz (Contributor) commented May 29, 2024

We want to verify how fast we can double cluster capacity and shrink it back.

Testing

PR pre-checks (self review)

  • I added the relevant backport labels
  • I didn't leave commented-out/debugging code

Reminders

  • Add New configuration option and document them (in sdcm/sct_config.py)
  • Add unit tests to cover my changes (under unit-test/ folder)
  • Update the Readme/doc folder relevant to this change (if needed)

@soyacz requested a review from fruch May 29, 2024 06:19
@@ -330,7 +330,7 @@ def run_workload(self, stress_cmd, nemesis=False, sub_type=None):
         stress_queue = self.run_stress_thread(stress_cmd=stress_cmd, stress_num=1, stats_aggregate_cmds=False)
         if nemesis:
             interval = self.params.get('nemesis_interval')
-            time.sleep(interval * 60)  # Sleeping one interval (in minutes) before starting the nemesis
+            time.sleep(5 * 60)  # Sleeping one interval (in minutes) before starting the nemesis
Contributor

This is just for testing, right?

Contributor Author

I'm trying to get it working ASAP, so I took some shortcuts here and there. I'll remove them later.

@@ -4013,6 +4041,14 @@ def disrupt_grow_shrink_cluster(self):
         self._grow_cluster(rack=None)
         self._shrink_cluster(rack=None)

+    def disrupt_grow_shrink_cluster_parallel(self):
Contributor

Later we might want to use the same nemesis, but have it configurable based on Scylla capabilities and SCT configuration.
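As a rough illustration of that suggestion, a minimal sketch of a config-driven dispatch; the `parallel_node_operations` option name is hypothetical and would need a matching entry in sdcm/sct_config.py, plus a check that the cluster actually supports it:

```python
# Hypothetical sketch: a single nemesis that takes the parallel path only when
# the SCT configuration (and the Scylla capabilities) allow it.
def disrupt_grow_shrink_cluster(self):
    if self.params.get('parallel_node_operations'):  # hypothetical config option
        self._grow_cluster_parallel(rack=None)
    else:
        self._grow_cluster(rack=None)
    self._shrink_cluster(rack=None)
```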

try:
    with adaptive_timeout(Operations.NEW_NODE, node=self.cluster.nodes[0], timeout=timeout):
        self.cluster.wait_for_init(node_list=new_nodes, timeout=timeout, check_node_health=False)
    self.cluster.set_seeds()
Contributor

Messing with the seeds isn't really needed anymore; it only matters just before starting a node.

Contributor Author

OK, I'll remove it. Note that I trimmed the original procedure (e.g. setting the nemesis target), and I'm not sure about it yet.
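For reference, a minimal sketch of the excerpt above with the seed handling dropped, as the review suggests (the surrounding try/except of the method is omitted here):

```python
# Same block as above, minus the seed update the reviewer says is unnecessary.
with adaptive_timeout(Operations.NEW_NODE, node=self.cluster.nodes[0], timeout=timeout):
    self.cluster.wait_for_init(node_list=new_nodes, timeout=timeout, check_node_health=False)
# self.cluster.set_seeds() removed: seeds only matter just before a node starts
```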

@soyacz (Contributor Author) commented May 29, 2024

BTW, I'm taking another shortcut: I disabled the multi-rack scenario, as adding multi-rack support would take more time to implement.
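To make that shortcut explicit, a hypothetical guard could fail fast on multi-rack setups until support is added; `racks_count` is an assumed attribute here, while `UnsupportedNemesis` is the usual SCT way to skip a nemesis that doesn't apply:

```python
# Hypothetical guard: skip the parallel grow/shrink nemesis on multi-rack clusters
# until multi-rack support is implemented.
if getattr(self.cluster, 'racks_count', 1) > 1:  # 'racks_count' is an assumed attribute
    raise UnsupportedNemesis("parallel grow/shrink is not supported on multi-rack clusters yet")
```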

self.steady_state_latency()
self.has_steady_run = True
self._grow_cluster_parallel(rack=None)
self._shrink_cluster(rack=None)
Contributor

Here we need the part that waits for the cluster to become balanced and then doubles the load. While doubling the load we should also measure latency.

I think this is more important to deal with than the parallel decommission for now.
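A minimal sketch of that flow, for orientation only: `_wait_for_tablets_balanced` is the helper added later in this PR, while `_double_load_and_measure_latency` is a hypothetical stand-in for whatever stress/latency machinery the test already provides:

```python
def disrupt_grow_shrink_cluster_parallel(self):
    self.steady_state_latency()
    self.has_steady_run = True
    self._grow_cluster_parallel(rack=None)
    self._wait_for_tablets_balanced()        # wait until tablet load balancing settles
    self._double_load_and_measure_latency()  # hypothetical: double the load, record latency
    self._shrink_cluster(rack=None)
```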

@soyacz force-pushed the scale-tablets-cluster-perf-master branch 19 times, most recently from c2f16aa to 73b517e on June 5, 2024 11:43
@soyacz force-pushed the scale-tablets-cluster-perf-master branch 3 times, most recently from f1ee2ad to 69e5f92 on June 7, 2024 06:46
stress_cmd_r: "cassandra-stress read no-warmup cl=QUORUM duration=800m -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=3)' -mode cql3 native -rate 'threads=250 fixed=10310/s' -col 'size=FIXED(128) n=FIXED(8)' -pop 'dist=gauss(1..650000000,325000000,9750000)' "
stress_cmd_m: "cassandra-stress mixed no-warmup cl=QUORUM duration=800m -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=3)' -mode cql3 native -rate 'threads=250 fixed=8750/s' -col 'size=FIXED(128) n=FIXED(8)' -pop 'dist=gauss(1..650000000,325000000,6500000)' "
# Enterprise cmd's
stress_cmd_w: "cql-stress-cassandra-stress write no-warmup cl=QUORUM duration=2850m -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=3)' -mode cql3 native -rate 'threads=250 fixed=21000/s' -col 'size=FIXED(128) n=FIXED(8)' -pop 'dist=gauss(1..650000000,325000000,9750000)' "
Contributor

Let's split this into two cases: one with c-s and one with cql-stress.

InfoEvent(message=f"Start grow cluster on {add_nodes_number} nodes").publish()
self.add_new_nodes_parallel(count=add_nodes_number, rack=rack)
self.log.info("Finish cluster grow")
time.sleep(300) # TODO: currently, just in case, to be removed
Contributor

Consider whether we can remove this.
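One possible way to drop it, assuming the `_wait_for_tablets_balanced()` helper introduced later in this PR covers the same settling period (a sketch, not the PR's final code):

```python
InfoEvent(message=f"Start grow cluster on {add_nodes_number} nodes").publish()
self.add_new_nodes_parallel(count=add_nodes_number, rack=rack)
self.log.info("Finish cluster grow")
self._wait_for_tablets_balanced()  # replaces the fixed 300 s sleep
```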

@@ -5004,6 +5077,22 @@ def disrupt_disable_binary_gossip_execute_major_compaction(self):
             self.target_node.restart_scylla_server()
             raise

+    def _wait_for_tablets_balanced(self):
Contributor

I would put it into sdcm.utils
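A minimal sketch of how it could live in sdcm.utils as a free function rather than a nemesis method; `is_balanced` is a hypothetical callable standing in for whatever balance check the PR performs, not an actual SCT or Scylla API:

```python
import time


def wait_for_tablets_balanced(node, is_balanced, timeout=3600, poll_interval=60):
    """Poll `is_balanced(node)` until it returns True or `timeout` seconds elapse."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if is_balanced(node):
            return
        time.sleep(poll_interval)
    raise TimeoutError(f"tablets not balanced within {timeout}s on node {node}")
```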

soyacz added 4 commits June 26, 2024 14:51
We want to verify how fast we can double cluster capacity and shrink it
back.
Added waiting for tablet balancing and doubling of the load.
We want to try decommissioning nodes in parallel.
Scalability test with 80% CPU load on the OSS version.
@soyacz force-pushed the scale-tablets-cluster-perf-master branch 2 times, most recently from 8623cd4 to dbbcef2 on June 27, 2024 08:13
Using cql-stress to verify whether the problems relate to the Java driver or are ScyllaDB-server based.
@soyacz force-pushed the scale-tablets-cluster-perf-master branch from dbbcef2 to c39df1b on June 27, 2024 08:59
@fruch (Contributor) commented Jul 18, 2024

@soyacz

can we close this one?

@soyacz (Contributor Author) commented Jul 18, 2024

> @soyacz
>
> can we close this one?

yes

@soyacz closed this Jul 18, 2024