
test(perf): grow shrink cluster 3 nodes in parallel #7504

Closed

Conversation

@soyacz (Contributor) commented May 29, 2024

We want to verify how fast we can double cluster capacity and shrink it back.

Testing

PR pre-checks (self review)

  • I added the relevant backport labels
  • I didn't leave commented-out/debugging code

Reminders

  • Add New configuration option and document them (in sdcm/sct_config.py)
  • Add unit tests to cover my changes (under unit-test/ folder)
  • Update the Readme/doc folder relevant to this change (if needed)

@soyacz requested a review from fruch May 29, 2024 06:19
@@ -330,7 +330,7 @@ def run_workload(self, stress_cmd, nemesis=False, sub_type=None):
         stress_queue = self.run_stress_thread(stress_cmd=stress_cmd, stress_num=1, stats_aggregate_cmds=False)
         if nemesis:
             interval = self.params.get('nemesis_interval')
-            time.sleep(interval * 60)  # Sleeping one interval (in minutes) before starting the nemesis
+            time.sleep(5 * 60)  # Sleeping one interval (in minutes) before starting the nemesis
Contributor

This is just for testing, right?

Contributor Author

I'm trying to get it working ASAP, so I took some shortcuts here and there. I'll remove them later.

@@ -4013,6 +4041,14 @@ def disrupt_grow_shrink_cluster(self):
         self._grow_cluster(rack=None)
         self._shrink_cluster(rack=None)

+    def disrupt_grow_shrink_cluster_parallel(self):
Contributor

Later we might want to use the same nemesis, but have it configurable based on Scylla capabilities and SCT configuration.
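As a rough illustration of that suggestion, a minimal sketch of a config-driven dispatch; the `parallel_node_operations` option name is hypothetical and would need a matching entry in sdcm/sct_config.py, plus a check that the cluster actually supports it:

```python
# Hypothetical sketch: a single nemesis that takes the parallel path only when
# the SCT configuration (and the Scylla capabilities) allow it.
def disrupt_grow_shrink_cluster(self):
    if self.params.get('parallel_node_operations'):  # hypothetical config option
        self._grow_cluster_parallel(rack=None)
    else:
        self._grow_cluster(rack=None)
    self._shrink_cluster(rack=None)
```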

try:
    with adaptive_timeout(Operations.NEW_NODE, node=self.cluster.nodes[0], timeout=timeout):
        self.cluster.wait_for_init(node_list=new_nodes, timeout=timeout, check_node_health=False)
    self.cluster.set_seeds()
Contributor

Messing with the seeds isn't really needed anymore; it only matters just before starting a node.

Contributor Author

OK, I'll remove it. Note that I trimmed the original procedure (e.g. setting the nemesis target), and I'm not sure about it yet.
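For reference, a minimal sketch of the excerpt above with the seed handling dropped, as the review suggests (the surrounding try/except of the method is omitted here):

```python
# Same block as above, minus the seed update the reviewer says is unnecessary.
with adaptive_timeout(Operations.NEW_NODE, node=self.cluster.nodes[0], timeout=timeout):
    self.cluster.wait_for_init(node_list=new_nodes, timeout=timeout, check_node_health=False)
# self.cluster.set_seeds() removed: seeds only matter just before a node starts
```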

@soyacz (Contributor Author) commented May 29, 2024

BTW, I'm taking another shortcut: I disabled the multi-rack scenario, as adding multi-rack support would take more time to implement.
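To make that shortcut explicit, a hypothetical guard could fail fast on multi-rack setups until support is added; `racks_count` is an assumed attribute here, while `UnsupportedNemesis` is the usual SCT way to skip a nemesis that doesn't apply:

```python
# Hypothetical guard: skip the parallel grow/shrink nemesis on multi-rack clusters
# until multi-rack support is implemented.
if getattr(self.cluster, 'racks_count', 1) > 1:  # 'racks_count' is an assumed attribute
    raise UnsupportedNemesis("parallel grow/shrink is not supported on multi-rack clusters yet")
```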

self.steady_state_latency()
self.has_steady_run = True
self._grow_cluster_parallel(rack=None)
self._shrink_cluster(rack=None)
Contributor

Here we need the part that waits for the cluster to become balanced and then doubles the load. While doubling the load we should also measure latency.

I think this is more important to deal with than the parallel decommission for now.
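A minimal sketch of that flow, for orientation only: `_wait_for_tablets_balanced` is the helper added later in this PR, while `_double_load_and_measure_latency` is a hypothetical stand-in for whatever stress/latency machinery the test already provides:

```python
def disrupt_grow_shrink_cluster_parallel(self):
    self.steady_state_latency()
    self.has_steady_run = True
    self._grow_cluster_parallel(rack=None)
    self._wait_for_tablets_balanced()        # wait until tablet load balancing settles
    self._double_load_and_measure_latency()  # hypothetical: double the load, record latency
    self._shrink_cluster(rack=None)
```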

@soyacz force-pushed the scale-tablets-cluster-perf-master branch 19 times, most recently from c2f16aa to 73b517e on June 5, 2024 11:43
@soyacz force-pushed the scale-tablets-cluster-perf-master branch 3 times, most recently from f1ee2ad to 69e5f92 on June 7, 2024 06:46
stress_cmd_r: "cassandra-stress read no-warmup cl=QUORUM duration=800m -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=3)' -mode cql3 native -rate 'threads=250 fixed=10310/s' -col 'size=FIXED(128) n=FIXED(8)' -pop 'dist=gauss(1..650000000,325000000,9750000)' "
stress_cmd_m: "cassandra-stress mixed no-warmup cl=QUORUM duration=800m -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=3)' -mode cql3 native -rate 'threads=250 fixed=8750/s' -col 'size=FIXED(128) n=FIXED(8)' -pop 'dist=gauss(1..650000000,325000000,6500000)' "
# Enterprise cmd's
stress_cmd_w: "cql-stress-cassandra-stress write no-warmup cl=QUORUM duration=2850m -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=3)' -mode cql3 native -rate 'threads=250 fixed=21000/s' -col 'size=FIXED(128) n=FIXED(8)' -pop 'dist=gauss(1..650000000,325000000,9750000)' "
Contributor

Let's split this into two cases: one with c-s and one with cql-stress.

InfoEvent(message=f"Start grow cluster on {add_nodes_number} nodes").publish()
self.add_new_nodes_parallel(count=add_nodes_number, rack=rack)
self.log.info("Finish cluster grow")
time.sleep(300) # TODO: currently, just in case, to be removed
Contributor

Consider whether we can remove this.
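One possible way to drop it, assuming the `_wait_for_tablets_balanced()` helper introduced later in this PR covers the same settling period (a sketch, not the PR's final code):

```python
InfoEvent(message=f"Start grow cluster on {add_nodes_number} nodes").publish()
self.add_new_nodes_parallel(count=add_nodes_number, rack=rack)
self.log.info("Finish cluster grow")
self._wait_for_tablets_balanced()  # replaces the fixed 300 s sleep
```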

@@ -5004,6 +5077,22 @@ def disrupt_disable_binary_gossip_execute_major_compaction(self):
             self.target_node.restart_scylla_server()
             raise

+    def _wait_for_tablets_balanced(self):
Contributor

I would put it into sdcm.utils
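A minimal sketch of how it could live in sdcm.utils as a free function rather than a nemesis method; `is_balanced` is a hypothetical callable standing in for whatever balance check the PR performs, not an actual SCT or Scylla API:

```python
import time


def wait_for_tablets_balanced(node, is_balanced, timeout=3600, poll_interval=60):
    """Poll `is_balanced(node)` until it returns True or `timeout` seconds elapse."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if is_balanced(node):
            return
        time.sleep(poll_interval)
    raise TimeoutError(f"tablets not balanced within {timeout}s on node {node}")
```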

soyacz added 4 commits June 26, 2024 14:51
We want to verify how fast we can double cluster capacity and shrink it
back.
Added waiting for tablet balancing and doubling of the load.
We want to try decommissioning nodes in parallel.
Scalability test with 80% CPU load on the OSS version.
@soyacz force-pushed the scale-tablets-cluster-perf-master branch 2 times, most recently from 8623cd4 to dbbcef2 on June 27, 2024 08:13
Using cql-stress to verify whether the problems relate to the Java driver or are ScyllaDB-server based.
@soyacz force-pushed the scale-tablets-cluster-perf-master branch from dbbcef2 to c39df1b on June 27, 2024 08:59
@fruch (Contributor) commented Jul 18, 2024

@soyacz

can we close this one?

@soyacz (Contributor Author) commented Jul 18, 2024

> @soyacz
>
> can we close this one?

yes

@soyacz closed this Jul 18, 2024