Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(nemesis.py): abort nemesis on stress command failure #9585

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

dimakr
Copy link
Contributor

@dimakr dimakr commented Dec 18, 2024

Abort the nemesis flow early when a stress command fails, as continuing it is often invalid due to the cluster being in an unexpected state.
For example, if the _prepare_test_table routine fails, subsequent steps that depend on the test table (or attempt disruptions on top of it) will also fail.

This change adds a check for results of stress command triggered within the nemesis code, ensuring the nemesis halts early if the stress command is unsuccessful.

Fixes: #8722

Testing

< t:2024-12-20 22:04:24,417 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > sdcm.exceptions.NemesisStressFailure: Aborting 'TruncateMonkey' nemesis as 'cassandra-stress write whoa n=400000 cl=QUORUM -mode native cql3 -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=3)' -log interval=5' stress command failed with the following errors:
< t:2024-12-20 22:04:24,417 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >   on node 'PR-provision-test-dmitriy-loader-node-a2eaf144-2': ['Stress command completed with bad status 1: ']
< t:2024-12-20 22:04:24,417 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >   on node 'PR-provision-test-dmitriy-loader-node-a2eaf144-1': ['Stress command completed with bad status 1: ']

PR pre-checks (self review)

  • I added the relevant backport labels
  • I didn't leave commented-out/debugging code

Reminders

  • Add New configuration option and document them (in sdcm/sct_config.py)
  • Add unit tests to cover my changes (under unit-test/ folder)
  • Update the Readme/doc folder relevant to this change (if needed)

@fruch
Copy link
Contributor

fruch commented Dec 18, 2024

@dimakr

can you run individual nemesis on at least two of the changed nemesis, to show case it's working (it can be AWS or docker backend if they are relevant)

@@ -4422,6 +4438,7 @@ def run_write_scylla_bench_load(write_cmd):
regex=".*sstable - Error while linking SSTable.*filesystem error: stat failed: No such file or directory.*"):
write_thread = self.tester.run_stress_thread(stress_cmd=write_cmd, stop_test_on_failure=False)
self.tester.verify_stress_thread(write_thread)
self.stop_nemesis_on_stress_errors(write_thread)
Copy link
Contributor

@vponomaryov vponomaryov Dec 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we run the same command on each of the existing loaders.
In this case we are totally ok to get n_loaders - 1 failures.
We need result at least from some of them.

Probably it is the case for some other nemesis.

It runs a single command on multiple loaders just because of the default parameter round_robin=False defined in the self.tester.run_stress_thread.

So, probably we should have some passes and some fails checks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed the check in the new method so that a nemesis would be aborted if all 'n_loadersstress commands failed. Checked this for both cases -round_robin` enabled and disabled.

@dimakr dimakr force-pushed the disrupt_truncate_check_results_sct8722 branch from 3aeb089 to 6de0904 Compare December 20, 2024 10:05
Abort the nemesis flow early when a stress command fails, as continuing it is
often invalid due to the cluster being in an unexpected state. For example, if
the _prepare_test_table routine fails, subsequent steps that depend on the test
table (or attempt disruptions on top of it) will also fail.

This change adds a check for results of stress command triggered within the nemesis
code, ensuring the nemesis halts early if the stress command is unsuccessful.

Fixes: scylladb#8722
@dimakr dimakr force-pushed the disrupt_truncate_check_results_sct8722 branch from 6de0904 to bf20854 Compare December 20, 2024 17:46
@dimakr
Copy link
Contributor Author

dimakr commented Dec 22, 2024

can you run individual nemesis on at least two of the changed nemesis, to show case it's working (it can be AWS or docker backend if they are relevant)

Executed longevity-5gb-1h-nemesis config for GrowShrinkClusterNemesis and TruncateMonkey nemeses. Linked the jobs in the PR descriptions. In both cases the c-s and s-b commands are executed (and stop_nemesis_on_error check is performed) during the nemesis.

@dimakr dimakr marked this pull request as ready for review December 22, 2024 19:03
@dimakr dimakr requested review from vponomaryov and fruch December 22, 2024 19:03
@fruch fruch added backport/2024.1 Need backport to 2024.1 backport/2024.2 Need backport to 2024.2 backport/6.2 labels Dec 23, 2024
@@ -251,7 +251,7 @@ def _run_stress(self, loader, loader_idx, cpu_idx): # pylint: disable=too-many-
except Exception as exc: # pylint: disable=broad-except # noqa: BLE001
self.configure_event_on_failure(stress_event=scylla_bench_event, exc=exc)

return loader, result
return loader, result, scylla_bench_event
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we need to align this on all stress commands ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This new stop_nemesis_on_stress_errors method is now applied to results of stress commands that are triggered from nemesis.py. There are 3 commands that are executed this way in nemesis code - c-s, s-b, and cql-stress-cassandra-stress.
All of them (with this^ change included) are returning tuple of (loader, result, event).

I'm not sure, if/which stress commands are also triggered from nemesis.py.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think regardless of what is being used right now, it would be good to align them to return identically from this function.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, would be less confusion if we had single return type for all cases.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've just set down with @dimakr, and we agreed that there would be a follow up for this change

that would incorporate stop_nemesis_on_stress_errors() into verify verify_results() and align the signatures of get_results() and verify_results()

self.stop_nemesis_on_stress_errors(cs_thread)

def stop_nemesis_on_stress_errors(self, stress_thread: DockerBasedStressThread) -> None:
stress_results = super(stress_thread.__class__, stress_thread).get_results()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the super call needed here ? What are we trying to skip ? And why ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to directly call the get_results method to get all results from threads. Some command specific get_results implementation would filter out failed commands. E.g. in c-s the get_results performs:

    def get_results(self) -> list[dict | None]:
        ret = []
        results = super().get_results()

        for _, result, event in results:
            if not result:
                # Silently skip if stress command threw error, since it was already reported in _run_stress
                continue
...
        return ret

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit specific c-s, we should at least put a comment explaining it, it's quite unusual to skip up the MRO like that, I barely understand the reason it's c-s is doing that, let's at least point to the reason we jump it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/6.2 backport/2024.1 Need backport to 2024.1 backport/2024.2 Need backport to 2024.2
Projects
None yet
3 participants