fix(nemesis.py): abort nemesis on stress command failure #9585

dimakr · 2024-12-18T19:41:41Z

Abort the nemesis flow early when a stress command fails, as continuing it is often invalid due to the cluster being in an unexpected state.
For example, if the _prepare_test_table routine fails, subsequent steps that depend on the test table (or attempt disruptions on top of it) will also fail.

This change adds a check for results of stress command triggered within the nemesis code, ensuring the nemesis halts early if the stress command is unsuccessful.

Fixes: #8722

Testing

GrowShrinkClusterNemesis, both c-s and s-b loads are executed during the nemesis
TruncateMonkey, both c-s and s-b loads are executed during the nemesis
Tested locally to ensure/simulate a failed stress command - added an invalid parameter to cassandra-stress command in Nemesis._prepare_test_table (here

scylla-cluster-tests/sdcm/nemesis.py

Line 2072 in 532ed3f

stress_cmd = "cassandra-stress write n=400000 cl=QUORUM -mode native cql3 " \

), which makes the stress command fail.
Executed PR-provision-test configuration for TruncateMonkey nemesis class. As a result the disrupt_truncate disruption was started few times, but interrupted early as the stress command in _prepare_test_table
was failing:

< t:2024-12-20 22:04:24,417 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > sdcm.exceptions.NemesisStressFailure: Aborting 'TruncateMonkey' nemesis as 'cassandra-stress write whoa n=400000 cl=QUORUM -mode native cql3 -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=3)' -log interval=5' stress command failed with the following errors:
< t:2024-12-20 22:04:24,417 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >   on node 'PR-provision-test-dmitriy-loader-node-a2eaf144-2': ['Stress command completed with bad status 1: ']
< t:2024-12-20 22:04:24,417 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >   on node 'PR-provision-test-dmitriy-loader-node-a2eaf144-1': ['Stress command completed with bad status 1: ']

PR pre-checks (self review)

I added the relevant backport labels
I didn't leave commented-out/debugging code

Reminders

Add New configuration option and document them (in sdcm/sct_config.py)
Add unit tests to cover my changes (under unit-test/ folder)
Update the Readme/doc folder relevant to this change (if needed)

fruch · 2024-12-18T22:35:59Z

@dimakr

can you run individual nemesis on at least two of the changed nemesis, to show case it's working (it can be AWS or docker backend if they are relevant)

vponomaryov · 2024-12-19T09:24:13Z

sdcm/nemesis.py

@@ -4422,6 +4438,7 @@ def run_write_scylla_bench_load(write_cmd):
                    regex=".*sstable - Error while linking SSTable.*filesystem error: stat failed: No such file or directory.*"):
                write_thread = self.tester.run_stress_thread(stress_cmd=write_cmd, stop_test_on_failure=False)
                self.tester.verify_stress_thread(write_thread)
+                self.stop_nemesis_on_stress_errors(write_thread)


Here we run the same command on each of the existing loaders.
In this case we are totally ok to get n_loaders - 1 failures.
We need result at least from some of them.

Probably it is the case for some other nemesis.

It runs a single command on multiple loaders just because of the default parameter round_robin=False defined in the self.tester.run_stress_thread.

So, probably we should have some passes and some fails checks.

Changed the check in the new method so that a nemesis would be aborted if all 'n_loadersstress commands failed. Checked this for both cases -round_robin` enabled and disabled.

Abort the nemesis flow early when a stress command fails, as continuing it is often invalid due to the cluster being in an unexpected state. For example, if the _prepare_test_table routine fails, subsequent steps that depend on the test table (or attempt disruptions on top of it) will also fail. This change adds a check for results of stress command triggered within the nemesis code, ensuring the nemesis halts early if the stress command is unsuccessful. Fixes: scylladb#8722

dimakr · 2024-12-22T18:57:00Z

can you run individual nemesis on at least two of the changed nemesis, to show case it's working (it can be AWS or docker backend if they are relevant)

Executed longevity-5gb-1h-nemesis config for GrowShrinkClusterNemesis and TruncateMonkey nemeses. Linked the jobs in the PR descriptions. In both cases the c-s and s-b commands are executed (and stop_nemesis_on_error check is performed) during the nemesis.

fruch · 2024-12-23T06:17:18Z

sdcm/scylla_bench_thread.py

@@ -251,7 +251,7 @@ def _run_stress(self, loader, loader_idx, cpu_idx):  # pylint: disable=too-many-
            except Exception as exc:  # pylint: disable=broad-except  # noqa: BLE001
                self.configure_event_on_failure(stress_event=scylla_bench_event, exc=exc)

-        return loader, result
+        return loader, result, scylla_bench_event


Maybe we need to align this on all stress commands ?

This new stop_nemesis_on_stress_errors method is now applied to results of stress commands that are triggered from nemesis.py. There are 3 commands that are executed this way in nemesis code - c-s, s-b, and cql-stress-cassandra-stress.
All of them (with this^ change included) are returning tuple of (loader, result, event).

I'm not sure, if/which stress commands are also triggered from nemesis.py.

I think regardless of what is being used right now, it would be good to align them to return identically from this function.

Agree, would be less confusion if we had single return type for all cases.

I've just set down with @dimakr, and we agreed that there would be a follow up for this change

that would incorporate stop_nemesis_on_stress_errors() into verify verify_results() and align the signatures of get_results() and verify_results()

fruch · 2024-12-23T06:18:21Z

sdcm/nemesis.py

+            self.stop_nemesis_on_stress_errors(cs_thread)
+
+    def stop_nemesis_on_stress_errors(self, stress_thread: DockerBasedStressThread) -> None:
+        stress_results = super(stress_thread.__class__, stress_thread).get_results()


Why is the super call needed here ? What are we trying to skip ? And why ?

This is to directly call the get_results method to get all results from threads. Some command specific get_results implementation would filter out failed commands. E.g. in c-s the get_results performs:

def get_results(self) -> list[dict | None]: ret = [] results = super().get_results() for _, result, event in results: if not result: # Silently skip if stress command threw error, since it was already reported in _run_stress continue ... return ret

It's a bit specific c-s, we should at least put a comment explaining it, it's quite unusual to skip up the MRO like that, I barely understand the reason it's c-s is doing that, let's at least point to the reason we jump it.

github-actions bot assigned dimakr Dec 18, 2024

vponomaryov reviewed Dec 19, 2024

View reviewed changes

dimakr force-pushed the disrupt_truncate_check_results_sct8722 branch from 3aeb089 to 6de0904 Compare December 20, 2024 10:05

dimakr force-pushed the disrupt_truncate_check_results_sct8722 branch from 6de0904 to bf20854 Compare December 20, 2024 17:46

dimakr marked this pull request as ready for review December 22, 2024 19:03

dimakr requested review from vponomaryov and fruch December 22, 2024 19:03

fruch added backport/2024.1 Need backport to 2024.1 backport/2024.2 Need backport to 2024.2 backport/6.2 labels Dec 23, 2024

fruch reviewed Dec 23, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(nemesis.py): abort nemesis on stress command failure #9585

fix(nemesis.py): abort nemesis on stress command failure #9585

dimakr commented Dec 18, 2024 •

edited

Loading

fruch commented Dec 18, 2024

vponomaryov Dec 19, 2024 •

edited

Loading

dimakr Dec 22, 2024

dimakr commented Dec 22, 2024

fruch Dec 23, 2024

dimakr Dec 23, 2024

fruch Dec 23, 2024

vponomaryov Dec 24, 2024

fruch Dec 24, 2024

fruch Dec 23, 2024

dimakr Dec 23, 2024

fruch Dec 23, 2024

fix(nemesis.py): abort nemesis on stress command failure #9585

Are you sure you want to change the base?

fix(nemesis.py): abort nemesis on stress command failure #9585

Conversation

dimakr commented Dec 18, 2024 • edited Loading

Testing

PR pre-checks (self review)

Reminders

fruch commented Dec 18, 2024

vponomaryov Dec 19, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dimakr commented Dec 22, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dimakr commented Dec 18, 2024 •

edited

Loading

vponomaryov Dec 19, 2024 •

edited

Loading