test(perf): grow shrink cluster 3 nodes in parallel #7504
Conversation
@@ -330,7 +330,7 @@ def run_workload(self, stress_cmd, nemesis=False, sub_type=None):
     stress_queue = self.run_stress_thread(stress_cmd=stress_cmd, stress_num=1, stats_aggregate_cmds=False)
     if nemesis:
         interval = self.params.get('nemesis_interval')
-        time.sleep(interval * 60)  # Sleeping one interval (in minutes) before starting the nemesis
+        time.sleep(5 * 60)  # Sleeping one interval (in minutes) before starting the nemesis
This is just for testing, right?
I'm trying to get it working ASAP, so I took some shortcuts here and there. I'll remove them later.
@@ -4013,6 +4041,14 @@ def disrupt_grow_shrink_cluster(self):
         self._grow_cluster(rack=None)
         self._shrink_cluster(rack=None)

+    def disrupt_grow_shrink_cluster_parallel(self):
Later we might want to use the same nemesis, but have it configurable based on Scylla capabilities and the SCT configuration.
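E.g. something along these lines (a sketch only; `parallel_node_operations` and `_supports_parallel_topology_changes` are hypothetical names, the grow/shrink helpers are the ones already in this PR):

def disrupt_grow_shrink_cluster(self):
    # Hypothetical dispatch: 'parallel_node_operations' (SCT option) and
    # _supports_parallel_topology_changes() (capability probe) are illustrative
    # names, not existing SCT APIs.
    if self.params.get('parallel_node_operations') and self._supports_parallel_topology_changes():
        self._grow_cluster_parallel(rack=None)
    else:
        self._grow_cluster(rack=None)
    self._shrink_cluster(rack=None)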
try:
    with adaptive_timeout(Operations.NEW_NODE, node=self.cluster.nodes[0], timeout=timeout):
        self.cluster.wait_for_init(node_list=new_nodes, timeout=timeout, check_node_health=False)
    self.cluster.set_seeds()
Messing with the seeds isn't really needed anymore, only just before starting a node.
OK, I'll remove it. Note that I trimmed the original procedure (e.g. setting the nemesis target) and I'm not sure about it yet.
Btw, I'm taking another shortcut: I disabled the multi-rack scenario, as adding multi-rack support would take more time to implement.
self.steady_state_latency()
self.has_steady_run = True
self._grow_cluster_parallel(rack=None)
self._shrink_cluster(rack=None)
We need here the part that waits for the cluster to become balanced and then doubles the load.
While doubling the load we should also measure latency.
I think that's more important to deal with than the parallel decommission for now.
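Roughly this flow (a sketch; `_run_doubled_load_with_latency_measurement` is a hypothetical helper name, the other calls already appear in this PR):

def disrupt_grow_shrink_cluster_parallel(self):
    self.steady_state_latency()
    self.has_steady_run = True
    self._grow_cluster_parallel(rack=None)
    # Wait until the tablet load balancer has spread data across the new nodes,
    # so the following measurement reflects a balanced cluster.
    self._wait_for_tablets_balanced()
    # Hypothetical helper: run the doubled stress load while collecting latency,
    # analogous to steady_state_latency() above.
    self._run_doubled_load_with_latency_measurement()
    self._shrink_cluster(rack=None)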
stress_cmd_r: "cassandra-stress read no-warmup cl=QUORUM duration=800m -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=3)' -mode cql3 native -rate 'threads=250 fixed=10310/s' -col 'size=FIXED(128) n=FIXED(8)' -pop 'dist=gauss(1..650000000,325000000,9750000)' "
stress_cmd_m: "cassandra-stress mixed no-warmup cl=QUORUM duration=800m -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=3)' -mode cql3 native -rate 'threads=250 fixed=8750/s' -col 'size=FIXED(128) n=FIXED(8)' -pop 'dist=gauss(1..650000000,325000000,6500000)' "
# Enterprise cmd's
stress_cmd_w: "cql-stress-cassandra-stress write no-warmup cl=QUORUM duration=2850m -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=3)' -mode cql3 native -rate 'threads=250 fixed=21000/s' -col 'size=FIXED(128) n=FIXED(8)' -pop 'dist=gauss(1..650000000,325000000,9750000)' "
Let's split this into two cases, one with c-s and one with cql-stress.
InfoEvent(message=f"Start grow cluster on {add_nodes_number} nodes").publish()
self.add_new_nodes_parallel(count=add_nodes_number, rack=rack)
self.log.info("Finish cluster grow")
time.sleep(300)  # TODO: currently, just in case, to be removed
Consider whether we can remove this sleep.
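For example, the fixed delay could be replaced with an explicit wait for the condition the test actually cares about; a sketch, assuming the tablet-balance wait discussed below is available at this point:

InfoEvent(message=f"Start grow cluster on {add_nodes_number} nodes").publish()
self.add_new_nodes_parallel(count=add_nodes_number, rack=rack)
self.log.info("Finish cluster grow")
# Instead of an arbitrary 300 s sleep, wait for the tablet load balancer
# to finish redistributing data onto the new nodes.
self._wait_for_tablets_balanced()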
@@ -5004,6 +5077,22 @@ def disrupt_disable_binary_gossip_execute_major_compaction(self):
             self.target_node.restart_scylla_server()
             raise

+    def _wait_for_tablets_balanced(self):
I would put it into sdcm.utils
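If it moves into sdcm.utils it would likely become a free function taking a node; a minimal sketch of the polling structure, assuming a CQL session via the cluster, where the `system.tablets` check is a placeholder for whatever balance condition the PR ends up using:

# hypothetical sdcm/utils helper, sketch only
import logging
import time

LOGGER = logging.getLogger(__name__)

def wait_for_tablets_balanced(node, timeout: int = 3600, poll_interval: int = 60) -> None:
    """Block until no tablet migrations appear to be in flight."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        with node.parent_cluster.cql_connection_patient(node) as session:
            # Placeholder check: a tablet row with a non-null 'stage' is still being moved.
            rows = session.execute("SELECT stage FROM system.tablets")
            if all(row.stage is None for row in rows):
                LOGGER.info("Tablets are balanced")
                return
        LOGGER.info("Tablets still rebalancing, retrying in %s s", poll_interval)
        time.sleep(poll_interval)
    raise TimeoutError(f"Tablets did not balance within {timeout} s")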
We want to verify how fast we can double cluster capacity and shrink it back.
Added waiting for tablet balance and doubling the load.
We want to try decommissioning nodes in parallel.
Scalability test with 80% CPU load on the OSS version.
Using cql-stress to verify whether the problems relate to the Java driver or are ScyllaDB server based.
Can we close this one?
Yes.