
sdcm.nemesis.Nemesis._target_node_pool value is not always synchronised with cluster.data_nodes value when parallel nemesis run #9448

Closed
juliayakovlev opened this issue Dec 2, 2024 · 6 comments · Fixed by #9517
Labels: Bug (Something isn't working right)

Comments

@juliayakovlev
Contributor

juliayakovlev commented Dec 2, 2024

Packages

Scylla version: 6.3.0~dev-20241125.cb6c55209aa3 with build-id 401a7e10e591ff86514130b3090d1eecf5d284ec

Kernel Version: 6.8.0-1019-aws

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

While testing the fix for #8401 I found that the value of the recently added sdcm.nemesis.Nemesis._target_node_pool variable is not always synchronised with cluster.data_nodes when nemeses run in parallel.
This causes parallel nemesis runs to fail.
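
To make the failure mode concrete, here is a minimal sketch of the race (simplified, hypothetical classes; not the real SCT code): each nemesis keeps its own snapshot in `_target_node_pool`, so a topology change made by one thread is not reflected in another thread's pool unless that thread happens to refresh it at exactly the right time.

```python
class Cluster:
    """Simplified stand-in for the shared cluster object."""
    def __init__(self, nodes):
        self.data_nodes = list(nodes)  # single source of truth, shared by all threads


class Nemesis:
    """Simplified stand-in; each nemesis thread keeps its own pool snapshot."""
    def __init__(self, name, cluster):
        self.name = name
        self.cluster = cluster
        self._target_node_pool = list(cluster.data_nodes)

    def refresh_pool(self):
        # Snapshot of the shared list; it goes stale unless it is refreshed
        # after every topology change made by any thread.
        self._target_node_pool = list(self.cluster.data_nodes)


cluster = Cluster(["node-1", "node-2", "node-3"])
add_remove_dc = Nemesis("AddRemoveDc", cluster)
add_drop_column = Nemesis("AddDropColumnMonkey", cluster)

cluster.data_nodes.append("node-16")   # AddRemoveDc thread adds a new node
add_drop_column.refresh_pool()         # AddDropColumnMonkey happens to refresh now

cluster.data_nodes.remove("node-16")   # AddRemoveDc decommissions node-16
add_remove_dc.refresh_pool()           # only the acting thread refreshes again

print(add_drop_column._target_node_pool)
# ['node-1', 'node-2', 'node-3', 'node-16']  <- terminated node still in the pool
```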

For testing I added debug printouts.

The test was run with 3 DB nodes and the parallel nemeses AddRemoveDc and AddDropColumnMonkey.

A new node, longevity-100gb-4h-fix-add--db-node-a41fd37c-16, was added by AddRemoveDc and the node's running nemesis attribute was set to AddRemoveDc.
cluster.data_nodes includes this new node (AddRemoveDcNemesis thread):

< t:2024-11-26 09:37:50,610 f:nemesis.py      l:4761 c:sdcm.nemesis         p:INFO  > sdcm.nemesis.AddRemoveDcNemesis: Nodes after set from cluster.data_nodes: [('longevity-100gb-4h-fix-add--db-node-a41fd37c-1', None), ('longevity-100gb-4h-fix-add--db-node-a41fd37c-2', 'AddRemoveDc'), ('longevity-100gb-4h-fix-add--db-node-a41fd37c-3', None), ('longevity-100gb-4h-fix-add--db-node-a41fd37c-16', 'AddRemoveDc')]

_target_node_pool does not include this node (AddRemoveDcNemesis thread):

< t:2024-11-26 09:37:50,610 f:nemesis.py      l:4759 c:sdcm.nemesis         p:INFO  > sdcm.nemesis.AddRemoveDcNemesis: Nodes after set from _target_node_pool: [('longevity-100gb-4h-fix-add--db-node-a41fd37c-1', None), ('longevity-100gb-4h-fix-add--db-node-a41fd37c-2', 'AddRemoveDc'), ('longevity-100gb-4h-fix-add--db-node-a41fd37c-3', None)]

The new node was terminated at the end of the nemesis:

< t:2024-11-26 09:39:50,992 f:cluster.py      l:5029 c:sdcm.cluster         p:INFO  > Decommission Node longevity-100gb-4h-fix-add--db-node-a41fd37c-16 [63.33.213.160 | 10.4.8.224] PASS
< t:2024-11-26 09:40:16,249 f:cluster.py      l:1231 c:sdcm.cluster         p:INFO  > Node longevity-100gb-4h-fix-add--db-node-a41fd37c-16 [63.33.213.160 | 10.4.8.224] destroyed

In the cluster.data_nodes variable and the _target_node_pool of the AddRemoveDcNemesis thread, the terminated node no longer exists:

< t:2024-11-26 09:40:18,317 f:nemesis.py      l:4784 c:sdcm.nemesis         p:INFO  > sdcm.nemesis.AddRemoveDcNemesis: Nodes after unset from _target_node_pool: [('longevity-100gb-4h-fix-add--db-node-a41fd37c-1', None), ('longevity-100gb-4h-fix-add--db-node-a41fd37c-2', 'AddRemoveDc'), ('longevity-100gb-4h-fix-add--db-node-a41fd37c-3', None)]

< t:2024-11-26 09:40:18,317 f:nemesis.py      l:4786 c:sdcm.nemesis         p:INFO  > sdcm.nemesis.AddRemoveDcNemesis: Nodes after unset from cluster.data_nodes: [('longevity-100gb-4h-fix-add--db-node-a41fd37c-1', None), ('longevity-100gb-4h-fix-add--db-node-a41fd37c-2', 'AddRemoveDc'), ('longevity-100gb-4h-fix-add--db-node-a41fd37c-3', None)]

< t:2024-11-26 09:40:55,860 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2024-11-26 09:40:55.859: (DisruptionEvent Severity.NORMAL) period_type=end event_id=348927a5-cdee-4e3c-8300-f82ab42af77f duration=5m38s: nemesis_name=AddRemoveDc target_node=Node longevity-100gb-4h-fix-add--db-node-a41fd37c-2 [18.201.16.43 | 10.4.10.62]

But in the AddDropColumnMonkey thread the terminated node is still visible in _target_node_pool.
As a result, the terminated node was chosen as the target node and the nemesis failed:

< t:2024-11-26 09:41:37,006 f:nemesis.py      l:423  c:sdcm.nemesis         p:INFO  > sdcm.nemesis.AddDropColumnMonkey: _target_node_pool: [('longevity-100gb-4h-fix-add--db-node-a41fd37c-1', None), ('longevity-100gb-4h-fix-add--db-node-a41fd37c-2', None), ('longevity-100gb-4h-fix-add--db-node-a41fd37c-3', None), ('longevity-100gb-4h-fix-add--db-node-a41fd37c-16', None)]
< t:2024-11-26 09:41:37,040 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2024-11-26 09:41:37.035: (DisruptionEvent Severity.NORMAL) period_type=begin event_id=bf7849d6-e7cb-4bee-99d0-cb573c533272: nemesis_name=AddDropColumn target_node=Node longevity-100gb-4h-fix-add--db-node-a41fd37c-16 [63.33.213.160 | 10.4.8.224]

< t:2024-11-26 10:10:03,838 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2024-11-26 10:10:03.836: (DisruptionEvent Severity.ERROR) period_type=end event_id=bf7849d6-e7cb-4bee-99d0-cb573c533272 duration=28m26s: nemesis_name=AddDropColumn target_node=Node longevity-100gb-4h-fix-add--db-node-a41fd37c-16 [63.33.213.160 | 10.4.8.224] errors=Failed to run a command due to exception!
< t:2024-11-26 10:10:03,838 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 
< t:2024-11-26 10:10:03,838 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > Command: '/usr/bin/cqlsh --no-color   --request-timeout=120 --connect-timeout=60 --ssl -e "describe keyspace1.standard1" 10.4.8.224'
< t:2024-11-26 10:10:03,838 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 
< t:2024-11-26 10:10:03,838 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > Stdout:
< t:2024-11-26 10:10:03,838 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 
< t:2024-11-26 10:10:03,838 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 
< t:2024-11-26 10:10:03,838 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 
< t:2024-11-26 10:10:03,838 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > Stderr:
< t:2024-11-26 10:10:03,838 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 
< t:2024-11-26 10:10:03,838 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 
< t:2024-11-26 10:10:03,838 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 
< t:2024-11-26 10:10:03,838 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > Exception:  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 593, in run
< t:2024-11-26 10:10:03,838 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >     self.connect()
< t:2024-11-26 10:10:03,838 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >   File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 529, in connect
< t:2024-11-26 10:10:03,838 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >     raise ConnectTimeout(ex_msg) from exc
< t:2024-11-26 10:10:03,838 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 
< t:2024-11-26 10:10:03,838 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > Failed to connect in 60 seconds, last error: (ConnectError)Error connecting to host '10.4.8.224:22' - No route to host

Impact


How frequently does it reproduce?


Installation details

Cluster size: 3 nodes (i4i.2xlarge)

Scylla Nodes used in this run:

  • longevity-100gb-4h-fix-add--db-node-a41fd37c-9 (54.73.37.130 | 10.4.10.163) (shards: 7)
  • longevity-100gb-4h-fix-add--db-node-a41fd37c-8 (52.211.156.150 | 10.4.8.239) (shards: 7)
  • longevity-100gb-4h-fix-add--db-node-a41fd37c-7 (18.201.46.105 | 10.4.11.16) (shards: 7)
  • longevity-100gb-4h-fix-add--db-node-a41fd37c-6 (3.248.184.128 | 10.4.11.225) (shards: 7)
  • longevity-100gb-4h-fix-add--db-node-a41fd37c-5 (54.75.133.106 | 10.4.8.114) (shards: 7)
  • longevity-100gb-4h-fix-add--db-node-a41fd37c-4 (18.201.92.174 | 10.4.10.247) (shards: 7)
  • longevity-100gb-4h-fix-add--db-node-a41fd37c-3 (3.250.205.43 | 10.4.11.78) (shards: 7)
  • longevity-100gb-4h-fix-add--db-node-a41fd37c-26 (3.249.173.193 | 10.4.9.190) (shards: 7)
  • longevity-100gb-4h-fix-add--db-node-a41fd37c-25 (3.254.132.170 | 10.4.11.145) (shards: 7)
  • longevity-100gb-4h-fix-add--db-node-a41fd37c-24 (54.194.92.253 | 10.4.11.166) (shards: 7)
  • longevity-100gb-4h-fix-add--db-node-a41fd37c-23 (54.171.118.217 | 10.4.8.222) (shards: 7)
  • longevity-100gb-4h-fix-add--db-node-a41fd37c-22 (63.33.69.200 | 10.4.8.47) (shards: 7)
  • longevity-100gb-4h-fix-add--db-node-a41fd37c-21 (54.194.112.146 | 10.4.8.40) (shards: 7)
  • longevity-100gb-4h-fix-add--db-node-a41fd37c-20 (34.251.35.214 | 10.4.8.69) (shards: 7)
  • longevity-100gb-4h-fix-add--db-node-a41fd37c-2 (18.201.16.43 | 10.4.10.62) (shards: 7)
  • longevity-100gb-4h-fix-add--db-node-a41fd37c-19 (54.74.142.215 | 10.4.8.202) (shards: 7)
  • longevity-100gb-4h-fix-add--db-node-a41fd37c-18 (18.201.69.15 | 10.4.9.118) (shards: 7)
  • longevity-100gb-4h-fix-add--db-node-a41fd37c-17 (54.217.153.201 | 10.4.9.250) (shards: 7)
  • longevity-100gb-4h-fix-add--db-node-a41fd37c-16 (63.33.213.160 | 10.4.8.224) (shards: 7)
  • longevity-100gb-4h-fix-add--db-node-a41fd37c-15 (54.217.167.110 | 10.4.11.126) (shards: 7)
  • longevity-100gb-4h-fix-add--db-node-a41fd37c-14 (3.249.173.241 | 10.4.9.44) (shards: 7)
  • longevity-100gb-4h-fix-add--db-node-a41fd37c-13 (54.217.146.48 | 10.4.8.211) (shards: 7)
  • longevity-100gb-4h-fix-add--db-node-a41fd37c-12 (34.245.208.18 | 10.4.8.94) (shards: 7)
  • longevity-100gb-4h-fix-add--db-node-a41fd37c-11 (3.248.199.225 | 10.4.8.166) (shards: 7)
  • longevity-100gb-4h-fix-add--db-node-a41fd37c-10 (18.202.233.242 | 10.4.8.51) (shards: 7)
  • longevity-100gb-4h-fix-add--db-node-a41fd37c-1 (54.170.46.37 | 10.4.9.64) (shards: 7)

OS / Image: ami-0324a1ff38ee9ec75 (aws: undefined_region)

Test: staging-longevity-100gb-4h
Test id: a41fd37c-6230-4f53-9376-2fc20c5f3a1c
Test name: scylla-staging/yulia/staging-longevity-100gb-4h
Test method: longevity_test.LongevityTest.test_custom_time
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor a41fd37c-6230-4f53-9376-2fc20c5f3a1c
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs a41fd37c-6230-4f53-9376-2fc20c5f3a1c

Logs:

Jenkins job URL
Argus

@juliayakovlev
Contributor Author

juliayakovlev commented Dec 2, 2024

@aleksbykov I see you recently added support for the _target_node_pool variable. Can you please take a look at this issue?

@aleksbykov
Contributor

aleksbykov commented Dec 3, 2024

@juliayakovlev , This fix: #9429 should help

@juliayakovlev
Contributor Author

juliayakovlev commented Dec 3, 2024

@juliayakovlev , This fix: #9429 should help

As I mentioned in the issue description, when I tested fix #9425, which sets the running nemesis attribute on a new node in order to prevent this new node from being chosen as a target node, I found that it does not solve the problem. That happened because _target_node_pool was not synchronised.
Please rebase your fix and try to run the test with the printouts. The issue does not always cause a nemesis failure, but the printouts may show whether the problem still happens.

@aleksbykov
Contributor

Yes, I see. The problem existed before as well, because the list is not thread-safe, but with the target pools it became more visible.
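
As an illustration of that point (a sketch only, not the actual change in #9429 or #9517; all names here are assumptions), one way to avoid the stale per-thread copy is to stop caching the pool per nemesis and instead derive it from cluster.data_nodes on demand, under a shared lock:

```python
import threading


class Cluster:
    """Simplified stand-in: topology changes and target selection share one lock."""
    def __init__(self, nodes):
        self._lock = threading.Lock()
        self.data_nodes = list(nodes)
        self.running_nemesis = {}  # node name -> nemesis name that claimed it

    def add_node(self, node):
        with self._lock:
            self.data_nodes.append(node)

    def remove_node(self, node):
        with self._lock:
            self.data_nodes.remove(node)
            self.running_nemesis.pop(node, None)

    def claim_target(self, nemesis_name):
        # The candidate pool is always computed from the shared list, so a node
        # decommissioned by another thread can never be handed out as a target.
        with self._lock:
            free = [n for n in self.data_nodes if n not in self.running_nemesis]
            if not free:
                return None
            target = free[0]
            self.running_nemesis[target] = nemesis_name
            return target
```

With this shape the per-nemesis `_target_node_pool` snapshot disappears entirely; every nemesis asks the cluster for a free node at the moment it needs one.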

@aleksbykov aleksbykov added the Bug Something isn't working right label Dec 3, 2024
@juliayakovlev
Contributor Author

juliayakovlev commented Dec 5, 2024

I hit the problem in a different scenario, but the reason may be the same.
@aleksbykov can you please check whether it is the same reason?

In the tests below we decommission 3 nodes, one by one. I noticed that fewer than 3 nodes were decommissioned. The reason is that a previously terminated node was chosen for decommissioning. See examples from the log in the "Discussion" tab:

https://argus.scylladb.com/tests/scylla-cluster-tests/c7f51531-0a18-4499-9153-fad31baec19f
Destroyed node was chosen as target node for decommission:

< t:2024-12-05 01:32:06,712 f:cluster.py      l:1237 c:sdcm.cluster         p:INFO  > Node perf-latency-nemesis-ubuntu-db-node-c7f51531-1 [98.84.120.246 | 10.12.2.226] destroyed

< t:2024-12-05 02:44:37,496 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2024-12-05 02:44:37.494: (InfoEvent Severity.NORMAL) period_type=not-set event_id=c776fe21-9f61-4529-bef8-7592bd7ea177: message=StartEvent - ShrinkCluster started decommissioning a node Node perf-latency-nemesis-ubuntu-db-node-c7f51531-1 [98.84.120.246 | 10.12.2.226]

< t:2024-12-05 02:51:49,400 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2024-12-05 02:51:49.385: (NodetoolEvent Severity.NORMAL) period_type=begin event_id=f0bbdb59-6b85-479a-b6ef-5b0f413a85a8: nodetool_command=decommission node=perf-latency-nemesis-ubuntu-db-node-c7f51531-1

https://argus.scylladb.com/tests/scylla-cluster-tests/73f09e4d-0bde-47ed-a0d2-3094210a99bc

Another runs with same problem:
https://argus.scylladb.com/tests/scylla-cluster-tests/41d3fe6b-e837-46df-b528-42fe34f51ca6
https://argus.scylladb.com/tests/scylla-cluster-tests/a3e64331-4a89-475b-9717-dc8f026733f5

@juliayakovlev
Contributor Author


Opened separate issue #9496
