
sdcm.nemesis.Nemesis._target_node_pool value is not always synchronised with cluster.data_nodes value when parallel nemesis run #9448

Closed
juliayakovlev opened this issue Dec 2, 2024 · 6 comments · Fixed by #9517
Labels: Bug (Something isn't working right)

Comments

@juliayakovlev
Contributor

juliayakovlev commented Dec 2, 2024

Packages

Scylla version: 6.3.0~dev-20241125.cb6c55209aa3 with build-id 401a7e10e591ff86514130b3090d1eecf5d284ec

Kernel Version: 6.8.0-1019-aws

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

While testing the fix for #8401 I found that the value of the recently added sdcm.nemesis.Nemesis._target_node_pool variable is not always synchronised with cluster.data_nodes when nemeses run in parallel.
This causes parallel nemesis runs to fail.
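
To make the failure mode concrete, here is a minimal sketch of the race (simplified, hypothetical classes; not the real SCT code): each nemesis keeps its own snapshot in `_target_node_pool`, so a topology change made by one thread is not reflected in another thread's pool unless that thread happens to refresh it at exactly the right time.

```python
class Cluster:
    """Simplified stand-in for the shared cluster object."""
    def __init__(self, nodes):
        self.data_nodes = list(nodes)  # single source of truth, shared by all threads


class Nemesis:
    """Simplified stand-in; each nemesis thread keeps its own pool snapshot."""
    def __init__(self, name, cluster):
        self.name = name
        self.cluster = cluster
        self._target_node_pool = list(cluster.data_nodes)

    def refresh_pool(self):
        # Snapshot of the shared list; it goes stale unless it is refreshed
        # after every topology change made by any thread.
        self._target_node_pool = list(self.cluster.data_nodes)


cluster = Cluster(["node-1", "node-2", "node-3"])
add_remove_dc = Nemesis("AddRemoveDc", cluster)
add_drop_column = Nemesis("AddDropColumnMonkey", cluster)

cluster.data_nodes.append("node-16")   # AddRemoveDc thread adds a new node
add_drop_column.refresh_pool()         # AddDropColumnMonkey happens to refresh now

cluster.data_nodes.remove("node-16")   # AddRemoveDc decommissions node-16
add_remove_dc.refresh_pool()           # only the acting thread refreshes again

print(add_drop_column._target_node_pool)
# ['node-1', 'node-2', 'node-3', 'node-16']  <- terminated node still in the pool
```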

For testing I added debug printouts.

The test was run with 3 DB nodes and the parallel nemeses AddRemoveDc and AddDropColumnMonkey.

A new node, longevity-100gb-4h-fix-add--db-node-a41fd37c-16, was added by AddRemoveDc and the node's running nemesis attribute was set to AddRemoveDc.
cluster.data_nodes includes this new node (AddRemoveDcNemesis thread):

< t:2024-11-26 09:37:50,610 f:nemesis.py      l:4761 c:sdcm.nemesis         p:INFO  > sdcm.nemesis.AddRemoveDcNemesis: Nodes after set from cluster.data_nodes: [('longevity-100gb-4h-fix-add--db-node-a41fd37c-1', None), ('longevity-100gb-4h-fix-add--db-node-a41fd37c-2', 'AddRemoveDc'), ('longevity-100gb-4h-fix-add--db-node-a41fd37c-3', None), ('longevity-100gb-4h-fix-add--db-node-a41fd37c-16', 'AddRemoveDc')]

_target_node_pool does not include this node (AddRemoveDcNemesis thread):

< t:2024-11-26 09:37:50,610 f:nemesis.py      l:4759 c:sdcm.nemesis         p:INFO  > sdcm.nemesis.AddRemoveDcNemesis: Nodes after set from _target_node_pool: [('longevity-100gb-4h-fix-add--db-node-a41fd37c-1', None), ('longevity-100gb-4h-fix-add--db-node-a41fd37c-2', 'AddRemoveDc'), ('longevity-100gb-4h-fix-add--db-node-a41fd37c-3', None)]

The new node was terminated at the end of the nemesis:

< t:2024-11-26 09:39:50,992 f:cluster.py      l:5029 c:sdcm.cluster         p:INFO  > Decommission Node longevity-100gb-4h-fix-add--db-node-a41fd37c-16 [63.33.213.160 | 10.4.8.224] PASS
< t:2024-11-26 09:40:16,249 f:cluster.py      l:1231 c:sdcm.cluster         p:INFO  > Node longevity-100gb-4h-fix-add--db-node-a41fd37c-16 [63.33.213.160 | 10.4.8.224] destroyed

In the cluster.data_nodes variable and the _target_node_pool of the AddRemoveDcNemesis thread, the terminated node no longer exists:

< t:2024-11-26 09:40:18,317 f:nemesis.py      l:4784 c:sdcm.nemesis         p:INFO  > sdcm.nemesis.AddRemoveDcNemesis: Nodes after unset from _target_node_pool: [('longevity-100gb-4h-fix-add--db-node-a41fd37c-1', None), ('longevity-100gb-4h-fix-add--db-node-a41fd37c-2', 'AddRemoveDc'), ('longevity-100gb-4h-fix-add--db-node-a41fd37c-3', None)]

< t:2024-11-26 09:40:18,317 f:nemesis.py      l:4786 c:sdcm.nemesis         p:INFO  > sdcm.nemesis.AddRemoveDcNemesis: Nodes after unset from cluster.data_nodes: [('longevity-100gb-4h-fix-add--db-node-a41fd37c-1', None), ('longevity-100gb-4h-fix-add--db-node-a41fd37c-2', 'AddRemoveDc'), ('longevity-100gb-4h-fix-add--db-node-a41fd37c-3', None)]

< t:2024-11-26 09:40:55,860 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2024-11-26 09:40:55.859: (DisruptionEvent Severity.NORMAL) period_type=end event_id=348927a5-cdee-4e3c-8300-f82ab42af77f duration=5m38s: nemesis_name=AddRemoveDc target_node=Node longevity-100gb-4h-fix-add--db-node-a41fd37c-2 [18.201.16.43 | 10.4.10.62]

But in the AddDropColumnMonkey thread the terminated node is still visible in _target_node_pool.
As a result, the terminated node was chosen as the target node and the nemesis failed:

< t:2024-11-26 09:41:37,006 f:nemesis.py      l:423  c:sdcm.nemesis         p:INFO  > sdcm.nemesis.AddDropColumnMonkey: _target_node_pool: [('longevity-100gb-4h-fix-add--db-node-a41fd37c-1', None), ('longevity-100gb-4h-fix-add--db-node-a41fd37c-2', None), ('longevity-100gb-4h-fix-add--db-node-a41fd37c-3', None), ('longevity-100gb-4h-fix-add--db-node-a41fd37c-16', None)]
< t:2024-11-26 09:41:37,040 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2024-11-26 09:41:37.035: (DisruptionEvent Severity.NORMAL) period_type=begin event_id=bf7849d6-e7cb-4bee-99d0-cb573c533272: nemesis_name=AddDropColumn target_node=Node longevity-100gb-4h-fix-add--db-node-a41fd37c-16 [63.33.213.160 | 10.4.8.224]

< t:2024-11-26 10:10:03,838 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2024-11-26 10:10:03.836: (DisruptionEvent Severity.ERROR) period_type=end event_id=bf7849d6-e7cb-4bee-99d0-cb573c533272 duration=28m26s: nemesis_name=AddDropColumn target_node=Node longevity-100gb-4h-fix-add--db-node-a41fd37c-16 [63.33.213.160 | 10.4.8.224] errors=Failed to run a command due to exception!
< t:2024-11-26 10:10:03,838 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 
< t:2024-11-26 10:10:03,838 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > Command: '/usr/bin/cqlsh --no-color   --request-timeout=120 --connect-timeout=60 --ssl -e "describe keyspace1.standard1" 10.4.8.224'
< t:2024-11-26 10:10:03,838 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 
< t:2024-11-26 10:10:03,838 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > Stdout:
< t:2024-11-26 10:10:03,838 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 
< t:2024-11-26 10:10:03,838 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 
< t:2024-11-26 10:10:03,838 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 
< t:2024-11-26 10:10:03,838 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > Stderr:
< t:2024-11-26 10:10:03,838 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 
< t:2024-11-26 10:10:03,838 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 
< t:2024-11-26 10:10:03,838 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 
< t:2024-11-26 10:10:03,838 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > Exception:  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 593, in run
< t:2024-11-26 10:10:03,838 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >     self.connect()
< t:2024-11-26 10:10:03,838 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >   File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 529, in connect
< t:2024-11-26 10:10:03,838 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >     raise ConnectTimeout(ex_msg) from exc
< t:2024-11-26 10:10:03,838 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 
< t:2024-11-26 10:10:03,838 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > Failed to connect in 60 seconds, last error: (ConnectError)Error connecting to host '10.4.8.224:22' - No route to host

Impact


How frequently does it reproduce?


Installation details

Cluster size: 3 nodes (i4i.2xlarge)

Scylla Nodes used in this run:

  • longevity-100gb-4h-fix-add--db-node-a41fd37c-9 (54.73.37.130 | 10.4.10.163) (shards: 7)
  • longevity-100gb-4h-fix-add--db-node-a41fd37c-8 (52.211.156.150 | 10.4.8.239) (shards: 7)
  • longevity-100gb-4h-fix-add--db-node-a41fd37c-7 (18.201.46.105 | 10.4.11.16) (shards: 7)
  • longevity-100gb-4h-fix-add--db-node-a41fd37c-6 (3.248.184.128 | 10.4.11.225) (shards: 7)
  • longevity-100gb-4h-fix-add--db-node-a41fd37c-5 (54.75.133.106 | 10.4.8.114) (shards: 7)
  • longevity-100gb-4h-fix-add--db-node-a41fd37c-4 (18.201.92.174 | 10.4.10.247) (shards: 7)
  • longevity-100gb-4h-fix-add--db-node-a41fd37c-3 (3.250.205.43 | 10.4.11.78) (shards: 7)
  • longevity-100gb-4h-fix-add--db-node-a41fd37c-26 (3.249.173.193 | 10.4.9.190) (shards: 7)
  • longevity-100gb-4h-fix-add--db-node-a41fd37c-25 (3.254.132.170 | 10.4.11.145) (shards: 7)
  • longevity-100gb-4h-fix-add--db-node-a41fd37c-24 (54.194.92.253 | 10.4.11.166) (shards: 7)
  • longevity-100gb-4h-fix-add--db-node-a41fd37c-23 (54.171.118.217 | 10.4.8.222) (shards: 7)
  • longevity-100gb-4h-fix-add--db-node-a41fd37c-22 (63.33.69.200 | 10.4.8.47) (shards: 7)
  • longevity-100gb-4h-fix-add--db-node-a41fd37c-21 (54.194.112.146 | 10.4.8.40) (shards: 7)
  • longevity-100gb-4h-fix-add--db-node-a41fd37c-20 (34.251.35.214 | 10.4.8.69) (shards: 7)
  • longevity-100gb-4h-fix-add--db-node-a41fd37c-2 (18.201.16.43 | 10.4.10.62) (shards: 7)
  • longevity-100gb-4h-fix-add--db-node-a41fd37c-19 (54.74.142.215 | 10.4.8.202) (shards: 7)
  • longevity-100gb-4h-fix-add--db-node-a41fd37c-18 (18.201.69.15 | 10.4.9.118) (shards: 7)
  • longevity-100gb-4h-fix-add--db-node-a41fd37c-17 (54.217.153.201 | 10.4.9.250) (shards: 7)
  • longevity-100gb-4h-fix-add--db-node-a41fd37c-16 (63.33.213.160 | 10.4.8.224) (shards: 7)
  • longevity-100gb-4h-fix-add--db-node-a41fd37c-15 (54.217.167.110 | 10.4.11.126) (shards: 7)
  • longevity-100gb-4h-fix-add--db-node-a41fd37c-14 (3.249.173.241 | 10.4.9.44) (shards: 7)
  • longevity-100gb-4h-fix-add--db-node-a41fd37c-13 (54.217.146.48 | 10.4.8.211) (shards: 7)
  • longevity-100gb-4h-fix-add--db-node-a41fd37c-12 (34.245.208.18 | 10.4.8.94) (shards: 7)
  • longevity-100gb-4h-fix-add--db-node-a41fd37c-11 (3.248.199.225 | 10.4.8.166) (shards: 7)
  • longevity-100gb-4h-fix-add--db-node-a41fd37c-10 (18.202.233.242 | 10.4.8.51) (shards: 7)
  • longevity-100gb-4h-fix-add--db-node-a41fd37c-1 (54.170.46.37 | 10.4.9.64) (shards: 7)

OS / Image: ami-0324a1ff38ee9ec75 (aws: undefined_region)

Test: staging-longevity-100gb-4h
Test id: a41fd37c-6230-4f53-9376-2fc20c5f3a1c
Test name: scylla-staging/yulia/staging-longevity-100gb-4h
Test method: longevity_test.LongevityTest.test_custom_time
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor a41fd37c-6230-4f53-9376-2fc20c5f3a1c
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs a41fd37c-6230-4f53-9376-2fc20c5f3a1c

Logs:

Jenkins job URL
Argus

@juliayakovlev
Contributor Author

juliayakovlev commented Dec 2, 2024

@aleksbykov I see you recently added support for the _target_node_pool variable. Can you please take a look at this issue?

@aleksbykov
Contributor

aleksbykov commented Dec 3, 2024

@juliayakovlev , This fix: #9429 should help

@juliayakovlev
Contributor Author

juliayakovlev commented Dec 3, 2024

@juliayakovlev , This fix: #9429 should help

As I mentioned in the issue description, when I tested fix #9425, which sets the running nemesis attribute on a new node in order to prevent this new node from being chosen as a target node, I found that it does not solve the problem. That happened because _target_node_pool was not synchronised.
Please rebase your fix and try to run the test with the printouts. The issue does not always cause a nemesis failure, but the printouts may show whether the problem still happens.

@aleksbykov
Contributor

Yes, I see. The problem existed before as well, because the list is not thread-safe, but with the target pools it became more visible.
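
As an illustration of that point (a sketch only, not the actual change in #9429 or #9517; all names here are assumptions), one way to avoid the stale per-thread copy is to stop caching the pool per nemesis and instead derive it from cluster.data_nodes on demand, under a shared lock:

```python
import threading


class Cluster:
    """Simplified stand-in: topology changes and target selection share one lock."""
    def __init__(self, nodes):
        self._lock = threading.Lock()
        self.data_nodes = list(nodes)
        self.running_nemesis = {}  # node name -> nemesis name that claimed it

    def add_node(self, node):
        with self._lock:
            self.data_nodes.append(node)

    def remove_node(self, node):
        with self._lock:
            self.data_nodes.remove(node)
            self.running_nemesis.pop(node, None)

    def claim_target(self, nemesis_name):
        # The candidate pool is always computed from the shared list, so a node
        # decommissioned by another thread can never be handed out as a target.
        with self._lock:
            free = [n for n in self.data_nodes if n not in self.running_nemesis]
            if not free:
                return None
            target = free[0]
            self.running_nemesis[target] = nemesis_name
            return target
```

With this shape the per-nemesis `_target_node_pool` snapshot disappears entirely; every nemesis asks the cluster for a free node at the moment it needs one.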

@aleksbykov aleksbykov added the Bug Something isn't working right label Dec 3, 2024
@juliayakovlev
Contributor Author

juliayakovlev commented Dec 5, 2024

I hit the problem in a different scenario, but the reason may be the same.
@aleksbykov can you please check whether it is the same reason?

In the tests below we decommission 3 nodes, one by one. I noticed that fewer than 3 nodes were decommissioned. The reason is that a previously terminated node was chosen for decommissioning. See examples from the log in the "Discussion" tab:

https://argus.scylladb.com/tests/scylla-cluster-tests/c7f51531-0a18-4499-9153-fad31baec19f
Destroyed node was chosen as target node for decommission:

< t:2024-12-05 01:32:06,712 f:cluster.py      l:1237 c:sdcm.cluster         p:INFO  > Node perf-latency-nemesis-ubuntu-db-node-c7f51531-1 [98.84.120.246 | 10.12.2.226] destroyed

< t:2024-12-05 02:44:37,496 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2024-12-05 02:44:37.494: (InfoEvent Severity.NORMAL) period_type=not-set event_id=c776fe21-9f61-4529-bef8-7592bd7ea177: message=StartEvent - ShrinkCluster started decommissioning a node Node perf-latency-nemesis-ubuntu-db-node-c7f51531-1 [98.84.120.246 | 10.12.2.226]

< t:2024-12-05 02:51:49,400 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2024-12-05 02:51:49.385: (NodetoolEvent Severity.NORMAL) period_type=begin event_id=f0bbdb59-6b85-479a-b6ef-5b0f413a85a8: nodetool_command=decommission node=perf-latency-nemesis-ubuntu-db-node-c7f51531-1

https://argus.scylladb.com/tests/scylla-cluster-tests/73f09e4d-0bde-47ed-a0d2-3094210a99bc

Another runs with same problem:
https://argus.scylladb.com/tests/scylla-cluster-tests/41d3fe6b-e837-46df-b528-42fe34f51ca6
https://argus.scylladb.com/tests/scylla-cluster-tests/a3e64331-4a89-475b-9717-dc8f026733f5

@juliayakovlev
Contributor Author


Opened separate issue #9496
