-
Notifications
You must be signed in to change notification settings - Fork 96
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(nemesis.py): abort nemesis on stress command failure #9585
base: master
Are you sure you want to change the base?
fix(nemesis.py): abort nemesis on stress command failure #9585
Conversation
can you run individual nemesis on at least two of the changed nemesis, to show case it's working (it can be AWS or docker backend if they are relevant) |
@@ -4422,6 +4438,7 @@ def run_write_scylla_bench_load(write_cmd): | |||
regex=".*sstable - Error while linking SSTable.*filesystem error: stat failed: No such file or directory.*"): | |||
write_thread = self.tester.run_stress_thread(stress_cmd=write_cmd, stop_test_on_failure=False) | |||
self.tester.verify_stress_thread(write_thread) | |||
self.stop_nemesis_on_stress_errors(write_thread) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we run the same command on each of the existing loaders.
In this case we are totally ok to get n_loaders - 1
failures.
We need result at least from some of them.
Probably it is the case for some other nemesis.
It runs a single command on multiple loaders just because of the default parameter round_robin=False
defined in the self.tester.run_stress_thread
.
So, probably we should have some passes
and some fails
checks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed the check in the new method so that a nemesis would be aborted if all 'n_loadersstress commands failed. Checked this for both cases -
round_robin` enabled and disabled.
3aeb089
to
6de0904
Compare
Abort the nemesis flow early when a stress command fails, as continuing it is often invalid due to the cluster being in an unexpected state. For example, if the _prepare_test_table routine fails, subsequent steps that depend on the test table (or attempt disruptions on top of it) will also fail. This change adds a check for results of stress command triggered within the nemesis code, ensuring the nemesis halts early if the stress command is unsuccessful. Fixes: scylladb#8722
6de0904
to
bf20854
Compare
Executed |
@@ -251,7 +251,7 @@ def _run_stress(self, loader, loader_idx, cpu_idx): # pylint: disable=too-many- | |||
except Exception as exc: # pylint: disable=broad-except # noqa: BLE001 | |||
self.configure_event_on_failure(stress_event=scylla_bench_event, exc=exc) | |||
|
|||
return loader, result | |||
return loader, result, scylla_bench_event |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we need to align this on all stress commands ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This new stop_nemesis_on_stress_errors
method is now applied to results of stress commands that are triggered from nemesis.py. There are 3 commands that are executed this way in nemesis code - c-s, s-b, and cql-stress-cassandra-stress.
All of them (with this^ change included) are returning tuple of (loader, result, event).
I'm not sure, if/which stress commands are also triggered from nemesis.py.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think regardless of what is being used right now, it would be good to align them to return identically from this function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, would be less confusion if we had single return type for all cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've just set down with @dimakr, and we agreed that there would be a follow up for this change
that would incorporate stop_nemesis_on_stress_errors()
into verify verify_results()
and align the signatures of get_results()
and verify_results()
self.stop_nemesis_on_stress_errors(cs_thread) | ||
|
||
def stop_nemesis_on_stress_errors(self, stress_thread: DockerBasedStressThread) -> None: | ||
stress_results = super(stress_thread.__class__, stress_thread).get_results() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the super call needed here ? What are we trying to skip ? And why ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is to directly call the get_results
method to get all results from threads. Some command specific get_results
implementation would filter out failed commands. E.g. in c-s the get_results
performs:
def get_results(self) -> list[dict | None]:
ret = []
results = super().get_results()
for _, result, event in results:
if not result:
# Silently skip if stress command threw error, since it was already reported in _run_stress
continue
...
return ret
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a bit specific c-s, we should at least put a comment explaining it, it's quite unusual to skip up the MRO like that, I barely understand the reason it's c-s is doing that, let's at least point to the reason we jump it.
Abort the nemesis flow early when a stress command fails, as continuing it is often invalid due to the cluster being in an unexpected state.
For example, if the _prepare_test_table routine fails, subsequent steps that depend on the test table (or attempt disruptions on top of it) will also fail.
This change adds a check for results of stress command triggered within the nemesis code, ensuring the nemesis halts early if the stress command is unsuccessful.
Fixes: #8722
Testing
scylla-cluster-tests/sdcm/nemesis.py
Line 2072 in 532ed3f
Executed PR-provision-test configuration for TruncateMonkey nemesis class. As a result the disrupt_truncate disruption was started few times, but interrupted early as the stress command in _prepare_test_table
was failing:
PR pre-checks (self review)
backport
labelsReminders
sdcm/sct_config.py
)unit-test/
folder)