sdcm.nemesis.Nemesis._target_node_pool value is not always synchronised with cluster.data_nodes value when parallel nemesis run #9448
Comments
@aleksbykov I see you added _target_node_pool recently.
@juliayakovlev, this fix should help: #9429
As I mentioned in the issue description, when I tested fix #9425, which sets a running_nemesis attribute on a new node in order to prevent that node from being chosen as a target node, I found that it does not solve the problem. This happened because the _target_node_pool was not synchronised.
Yes, I see. The problem existed before as well, because the list is not thread-safe, but with the target pools it has become more visible.
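To make the thread-safety point concrete, here is a minimal sketch (plain Python with hypothetical node names, not SCT code) of how one nemesis thread working from a stale copy of the shared pool can still pick a node that another thread has already removed:

```python
import random
import threading
import time

# Shared pool of candidate target nodes; a plain list, as in the issue.
target_node_pool = ["node-1", "node-2", "node-3", "node-16"]

def terminator():
    # Simulates AddRemoveDc terminating the node it added earlier.
    time.sleep(0.01)
    target_node_pool.remove("node-16")  # node no longer exists

def chooser():
    # Simulates another nemesis choosing a target from a stale view.
    snapshot = list(target_node_pool)   # copied before the removal
    time.sleep(0.02)                    # terminator() runs in this window
    target = random.choice(snapshot)    # can still return dead "node-16"
    print("chosen target:", target)

t1 = threading.Thread(target=terminator)
t2 = threading.Thread(target=chooser)
t1.start(); t2.start()
t1.join(); t2.join()
```

Any check-then-use pattern on the shared list has the same window, regardless of whether individual list operations happen to be atomic under the GIL.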
I hit the problem in a different scenario, but the reason may be the same. In the tests below we decommission 3 nodes, one by one. I noticed that fewer than 3 nodes were decommissioned, because a previously terminated node was chosen for decommissioning. See examples from the log in the "Discussion" tab: https://argus.scylladb.com/tests/scylla-cluster-tests/c7f51531-0a18-4499-9153-fad31baec19f
Another run with the same problem: https://argus.scylladb.com/tests/scylla-cluster-tests/73f09e4d-0bde-47ed-a0d2-3094210a99bc
Opened a separate issue: #9496
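For the decommission scenario, one defensive pattern, sketched here with hypothetical helpers rather than the actual SCT API, is to re-derive the candidate list from the live cluster view on every iteration instead of reusing a list captured up front:

```python
import random

# Hypothetical stand-ins for this sketch, not the SCT API.
class Cluster:
    def __init__(self, nodes):
        self.data_nodes = nodes

def node_is_live(node):
    return not node.endswith("-terminated")

def decommission(cluster, node):
    print("decommissioning", node)
    cluster.data_nodes.remove(node)

def decommission_three(cluster):
    for _ in range(3):
        # Re-derive candidates from the live cluster view on every
        # iteration, so an already-terminated node can never be
        # picked from a stale, pre-captured list.
        candidates = [n for n in cluster.data_nodes if node_is_live(n)]
        decommission(cluster, random.choice(candidates))

decommission_three(Cluster(["node-1", "node-2", "node-3", "node-4"]))
```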
Packages
Scylla version: 6.3.0~dev-20241125.cb6c55209aa3 with build-id 401a7e10e591ff86514130b3090d1eecf5d284ec
Kernel Version: 6.8.0-1019-aws
Issue description
While testing the fix for #8401, I found that the value of the recently added variable sdcm.nemesis.Nemesis._target_node_pool is not always synchronised with the cluster.data_nodes value when nemeses run in parallel. This causes parallel nemesis failures.
For testing I added printouts.
The test was run with 3 DB nodes.
Parallel nemeses: AddRemoveDc and AddDropColumnMonkey.
A new node, longevity-100gb-4h-fix-add--db-node-a41fd37c-16, was added by AddRemoveDc, and the node's running_nemesis was set to AddRemoveDc.
cluster.data_nodes keeps this new node (AddRemoveDcNemesis thread):
_target_node_pool does not have this info (AddRemoveDcNemesis thread):
The new node was terminated at the end of the nemesis:
cluster.data_nodes and _target_node_pool from the AddRemoveDcNemesis thread (the terminated node does not exist anymore):
But from the AddDropColumnMonkey thread the terminated node is still visible in _target_node_pool. As a result, the terminated node was chosen as the target node and the nemesis failed:
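One possible direction for a fix, sketched below with hypothetical class and method names rather than the actual SCT code: keep the pool behind a lock and intersect it with the authoritative cluster.data_nodes at selection time, so a node terminated by another nemesis thread can never be chosen.

```python
import random
import threading

class Cluster:  # minimal stub for the sketch
    def __init__(self, nodes):
        self.data_nodes = nodes

class TargetPool:
    """Lock-protected pool that revalidates against cluster.data_nodes."""

    def __init__(self, cluster):
        self._cluster = cluster
        self._lock = threading.Lock()
        self._pool = list(cluster.data_nodes)

    def remove(self, node):
        with self._lock:
            if node in self._pool:
                self._pool.remove(node)

    def choose_target(self):
        with self._lock:
            # Drop entries that no longer exist in the cluster view,
            # so stale references to terminated nodes cannot win.
            live = set(self._cluster.data_nodes)
            self._pool = [n for n in self._pool if n in live]
            return random.choice(self._pool)

cluster = Cluster(["node-1", "node-2", "node-16"])
pool = TargetPool(cluster)
cluster.data_nodes.remove("node-16")  # terminated by another nemesis
print(pool.choose_target())           # never returns "node-16"
```

The same lock (or a snapshot taken under it) would also need to guard any iteration over the pool in other nemesis threads.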
Installation details
Cluster size: 3 nodes (i4i.2xlarge)
Scylla Nodes used in this run:
OS / Image: ami-0324a1ff38ee9ec75 (aws: undefined_region)
Test: staging-longevity-100gb-4h
Test id: a41fd37c-6230-4f53-9376-2fc20c5f3a1c
Test name: scylla-staging/yulia/staging-longevity-100gb-4h
Test method: longevity_test.LongevityTest.test_custom_time
Test config file(s):
Logs and commands
$ hydra investigate show-monitor a41fd37c-6230-4f53-9376-2fc20c5f3a1c
$ hydra investigate show-logs a41fd37c-6230-4f53-9376-2fc20c5f3a1c
Logs:
Jenkins job URL
Argus