
health check fails in check_group0_tokenring_consistency if one of the nodes is down #9599

Open
2 tasks
yarongilor opened this issue Dec 22, 2024 · 2 comments
Assignees
Labels
area/elastic cloud Issues related to the elastic cloud project

Comments


yarongilor commented Dec 22, 2024

Packages

Scylla version: 2024.3.0~dev-20241218.42cc7a4f12de with build-id 0ee8a26c08783c18bd6dead5ba27a9e622efa885

Kernel Version: 6.8.0-1021-aws

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

node-1 runs out of disk space (no-space-left error), following a core dump:

DatabaseLogEvent
ERROR
disrupt_switch_between_password_authenticator_and_saslauthd_authenticator_and_back
2024-12-19 20:46:23.828
Received: 2024-12-19 20:46:23.804
one-time
elasticity-test-nemesis-master-db-node-90bfa08f-1
2024-12-19 20:46:23.828 <2024-12-19 20:46:23.804>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=14c23b02-53e1-4bfa-b976-a93a9abb930c: type=NO_SPACE_ERROR regex=No space left on device line_number=10491 node=elasticity-test-nemesis-master-db-node-90bfa08f-1
2024-12-19T20:46:23.804+00:00 elasticity-test-nemesis-master-db-node-90bfa08f-1      !ERR | scylla[10468]:  [shard 0:main] storage_service - Shutting down communications due to I/O errors until operator intervention: Disk error: std::system_error (error system:28, No space left on device)
0x1f26dfd

The node fails to restart scylla service as well:

Nemesis Information
Class: Sisyphus
Name: disrupt_switch_between_password_authenticator_and_saslauthd_authenticator_and_back
Status: Failed
Failure reason
Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5446, in wrapper
    result = method(*args[1:], **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 1032, in disrupt_switch_between_password_authenticator_and_saslauthd_authenticator_and_back
    update_authenticator(self.cluster.nodes, orig_auth)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 2432, in update_authenticator
    node.restart_scylla_server()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 2535, in restart_scylla_server
    self.restart_service(service_name='scylla-server', timeout=timeout * 2)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 2461, in restart_service
    self._service_cmd(service_name=service_name, cmd='restart', timeout=timeout, ignore_status=ignore_status)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 2445, in _service_cmd
    return self.remoter.run(cmd, timeout=timeout, ignore_status=ignore_status)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 653, in run
    result = _run()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/decorators.py", line 72, in inner
    return func(*args, **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 644, in _run
    return self._run_execute(cmd, timeout, ignore_status, verbose, new_session, watchers)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 577, in _run_execute
    result = connection.run(**command_kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 625, in run
    return self._complete_run(channel, exception, timeout_reached, timeout, result, warn, stdout, stderr)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 660, in _complete_run
    raise UnexpectedExit(result)
sdcm.remote.libssh2_client.exceptions.UnexpectedExit: Encountered a bad command exit code!

Command: 'sudo systemctl restart scylla-server.service'

Exit code: 1

Stdout:



Stderr:

Job for scylla-server.service failed because the control process exited with error code.
See "systemctl status scylla-server.service" and "journalctl -xeu scylla-server.service" for details.

The nemesis thread cannot get host id and fails permanently (the test keeps running without nemesis):

ThreadFailedEvent
ERROR
no nemesis
2024-12-19 21:06:47.838
one-time
2024-12-19 21:06:47.838: (ThreadFailedEvent Severity.ERROR) period_type=one-time event_id=281ad081-a1a5-4cd9-ab8e-6bc2f7bc949b: message='NoneType' object has no attribute 'get'
Traceback (most recent call last):
File "/home/ubuntu/scylla-cluster-tests/sdcm/sct_events/decorators.py", line 26, in wrapper
return func(*args, **kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 487, in run
self.disrupt()
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 6643, in disrupt
self.call_next_nemesis()
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 2054, in call_next_nemesis
self.execute_disrupt_method(disrupt_method=next(self.disruptions_cycle))
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 1980, in execute_disrupt_method
disrupt_method()
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5504, in wrapper
args[0].cluster.check_cluster_health()
File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 4388, in check_cluster_health
node.check_node_health()
File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 2738, in check_node_health
event = next(events, None)
File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/raft/__init__.py", line 331, in check_group0_tokenring_consistency
self._node.name, self._node.host_id)
File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 483, in host_id
return self.parent_cluster.get_nodetool_info(self, ignore_status=True, publish_event=False).get("ID")
AttributeError: 'NoneType' object has no attribute 'get'

get_nodetool_info likely failed to run against the down node and returned None, so the subsequent .get() call raised the AttributeError.
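One way to avoid this AttributeError is to guard against the None return before calling .get(). The sketch below is illustrative only, not the actual SCT code: the Node class is stripped down, and DownCluster is a hypothetical stand-in for a cluster whose nodetool calls fail.

```python
class Node:
    """Simplified stand-in for SCT's node class (illustration only)."""

    def __init__(self, parent_cluster, name="node-1"):
        self.parent_cluster = parent_cluster
        self.name = name

    @property
    def host_id(self):
        # get_nodetool_info can return None when nodetool cannot run
        # (e.g. the node is down), so check before calling .get().
        info = self.parent_cluster.get_nodetool_info(
            self, ignore_status=True, publish_event=False)
        return info.get("ID") if info is not None else None


class DownCluster:
    """Hypothetical cluster whose nodetool calls always fail."""

    def get_nodetool_info(self, node, ignore_status=True, publish_event=False):
        return None  # mimics an unreachable node


node = Node(DownCluster())
print(node.host_id)  # None instead of raising AttributeError
```

The caller then has to decide what an unknown host id means; for a health check, a None host id is itself a signal that the node should be treated as down rather than compared.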

Impact


How frequently does it reproduce?


Installation details

Cluster size: 3 nodes (i4i.large)

Scylla Nodes used in this run:

  • elasticity-test-nemesis-master-db-node-90bfa08f-3 (54.217.164.63 | 10.4.14.82) (shards: 2)
  • elasticity-test-nemesis-master-db-node-90bfa08f-2 (34.246.246.186 | 10.4.14.242) (shards: 2)
  • elasticity-test-nemesis-master-db-node-90bfa08f-1 (54.228.89.37 | 10.4.14.150) (shards: 2)

OS / Image: ami-0f14cab4bda57c2b2 (aws: undefined_region)

Test: byo-longevity-test-yg2
Test id: 90bfa08f-2a3d-4ba9-b443-13ce00925638
Test name: scylla-staging/yarongilor/byo-longevity-test-yg2
Test method: longevity_test.LongevityTest.test_custom_time
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 90bfa08f-2a3d-4ba9-b443-13ce00925638
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 90bfa08f-2a3d-4ba9-b443-13ce00925638

Logs:

Jenkins job URL
Argus

yarongilor (author) commented:

It's also a problem to investigate the core dump of node-1, since no logs are available.

2024-12-19 20:03:42.060 <2024-12-19 20:02:39.000>: (CoreDumpEvent Severity.ERROR) period_type=one-time event_id=5e8a0fc5-f118-4a96-b7df-fb582051f7ad during_nemesis=NetworkBlock node=Node elasticity-test-nemesis-master-db-node-90bfa08f-1 [54.228.89.37 | 10.4.14.150]
corefile_url=https://storage.cloud.google.com/upload.scylladb.com/core.systemd-network.998.93c36ffde86d40fcae8d05712785715b.489.1734638559000000./core.systemd-network.998.93c36ffde86d40fcae8d05712785715b.489.1734638559000000.zst
backtrace=           PID: 489 (systemd-network)
UID: 998 (systemd-network)
GID: 998 (systemd-network)
Signal: 11 (SEGV)
Timestamp: Thu 2024-12-19 20:02:39 UTC (1min 0s ago)
Command Line: /usr/lib/systemd/systemd-networkd
Executable: /usr/lib/systemd/systemd-networkd
Control Group: /system.slice/systemd-networkd.service
Unit: systemd-networkd.service
Slice: system.slice
Boot ID: 93c36ffde86d40fcae8d05712785715b
Machine ID: ec2758c51f893c87a0ff556267587598
Hostname: elasticity-test-nemesis-master-db-node-90bfa08f-1
Storage: /var/lib/systemd/coredump/core.systemd-network.998.93c36ffde86d40fcae8d05712785715b.489.1734638559000000.zst (present)
Size on Disk: 775.4K
Package: systemd/255.4-1ubuntu8.4
build-id: ba7817d4a0f1beb10af1c6dd425b467ec7491dd0
Message: Process 489 (systemd-network) of user 998 dumped core.
Module libzstd.so.1 from deb libzstd-1.5.5+dfsg2-2build1.1.amd64
Module libsystemd-shared-255.so from deb systemd-255.4-1ubuntu8.4.amd64
Module systemd-networkd from deb systemd-255.4-1ubuntu8.4.amd64
Stack trace of thread 489:
#0  0x00007e1dae9c4122 _hashmap_iterate (libsystemd-shared-255.so + 0x1c4122)
#1  0x00005ae542d99175 n/a (systemd-networkd + 0xe7175)
#2  0x00005ae542d992ca n/a (systemd-networkd + 0xe72ca)
#3  0x00005ae542d9918d n/a (systemd-networkd + 0xe718d)
#4  0x00005ae542d992ca n/a (systemd-networkd + 0xe72ca)
#5  0x00005ae542d9918d n/a (systemd-networkd + 0xe718d)
#6  0x00005ae542d992ca n/a (systemd-networkd + 0xe72ca)
#7  0x00005ae542d9918d n/a (systemd-networkd + 0xe718d)
#8  0x00005ae542d992ca n/a (systemd-networkd + 0xe72ca)
#9  0x00005ae542d9918d n/a (systemd-networkd + 0xe718d)
#10 0x00005ae542d992ca n/a (systemd-networkd + 0xe72ca)
#11 0x00005ae542d9918d n/a (systemd-networkd + 0xe718d)
#12 0x00005ae542d992ca n/a (systemd-networkd + 0xe72ca)
#13 0x00005ae542d9918d n/a (systemd-networkd + 0xe718d)
#14 0x00005ae542d992ca n/a (systemd-networkd + 0xe72ca)
#15 0x00005ae542d9918d n/a (systemd-networkd + 0xe718d)
#16 0x00005ae542d992ca n/a (systemd-networkd + 0xe72ca)
#17 0x00005ae542d9918d n/a (systemd-networkd + 0xe718d)
#18 0x00005ae542d992ca n/a (systemd-networkd + 0xe72ca)
#19 0x00005ae542d9918d n/a (systemd-networkd + 0xe718d)
#20 0x00005ae542d992ca n/a (systemd-networkd + 0xe72ca)
#21 0x00005ae542d9918d n/a (systemd-networkd + 0xe718d)
#22 0x00005ae542d992ca n/a (systemd-networkd + 0xe72ca)
#23 0x00005ae542d9918d n/a (systemd-networkd + 0xe718d)
#24 0x00005ae542d992ca n/a (systemd-networkd + 0xe72ca)
#25 0x00005ae542d9918d n/a (systemd-networkd + 0xe718d)
#26 0x00005ae542d992ca n/a (systemd-networkd + 0xe72ca)
#27 0x00005ae542d9918d n/a (systemd-networkd + 0xe718d)
#28 0x00005ae542d992ca n/a (systemd-networkd + 0xe72ca)
#29 0x00005ae542d9918d n/a (systemd-networkd + 0xe718d)
#30 0x00005ae542d992ca n/a (systemd-networkd + 0xe72ca)
#31 0x00005ae542d9918d n/a (systemd-networkd + 0xe718d)
#32 0x00005ae542d992ca n/a (systemd-networkd + 0xe72ca)
#33 0x00005ae542d9918d n/a (systemd-networkd + 0xe718d)
#34 0x00005ae542d992ca n/a (systemd-networkd + 0xe72ca)
#35 0x00005ae542d9918d n/a (systemd-networkd + 0xe718d)
#36 0x00005ae542d992ca n/a (systemd-networkd + 0xe72ca)
#37 0x00005ae542d9918d n/a (systemd-networkd + 0xe718d)
#38 0x00005ae542d992ca n/a (systemd-networkd + 0xe72ca)
#39 0x00005ae542d9918d n/a (systemd-networkd + 0xe718d)
#40 0x00005ae542d992ca n/a (systemd-networkd + 0xe72ca)
#41 0x00005ae542d9918d n/a (systemd-networkd + 0xe718d)
#42 0x00005ae542d992ca n/a (systemd-networkd + 0xe72ca)
#43 0x00005ae542d9918d n/a (systemd-networkd + 0xe718d)
#44 0x00005ae542d992ca n/a (systemd-networkd + 0xe72ca)
#45 0x00005ae542d9918d n/a (systemd-networkd + 0xe718d)
#46 0x00005ae542d992ca n/a (systemd-networkd + 0xe72ca)
#47 0x00005ae542d9918d n/a (systemd-networkd + 0xe718d)
#48 0x00005ae542d992ca n/a (systemd-networkd + 0xe72ca)
#49 0x00005ae542d9918d n/a (systemd-networkd + 0xe718d)
#50 0x00005ae542d992ca n/a (systemd-networkd + 0xe72ca)
#51 0x00005ae542d9918d n/a (systemd-networkd + 0xe718d)
#52 0x00005ae542d992ca n/a (systemd-networkd + 0xe72ca)
#53 0x00005ae542d9918d n/a (systemd-networkd + 0xe718d)
#54 0x00005ae542d992ca n/a (systemd-networkd + 0xe72ca)
#55 0x00005ae542d9918d n/a (systemd-networkd + 0xe718d)
#56 0x00005ae542d992ca n/a (systemd-networkd + 0xe72ca)
#57 0x00005ae542d9918d n/a (systemd-networkd + 0xe718d)
#58 0x00005ae542d992ca n/a (systemd-networkd + 0xe72ca)
#59 0x00005ae542d9918d n/a (systemd-networkd + 0xe718d)
#60 0x00005ae542d992ca n/a (systemd-networkd + 0xe72ca)
#61 0x00005ae542d9918d n/a (systemd-networkd + 0xe718d)
#62 0x00005ae542d992ca n/a (systemd-networkd + 0xe72ca)
#63 0x00005ae542d9918d n/a (systemd-networkd + 0xe718d)
ELF object binary architecture: AMD x86-64
Info about modules can be found in SCT logs by search for 'Coredump Modules info'
download_instructions:
gsutil cp gs://upload.scylladb.com/core.systemd-network.998.93c36ffde86d40fcae8d05712785715b.489.1734638559000000./core.systemd-network.998.93c36ffde86d40fcae8d05712785715b.489.1734638559000000.zst .
unzstd core.systemd-network.998.93c36ffde86d40fcae8d05712785715b.489.1734638559000000.zst

@fruch fruch changed the title Nemesis thread fails in health-check where a db node reaches 100% disk utilization health check fails in check_group0_tokenring_consistency if one of the nodes is down Dec 22, 2024

fruch commented Dec 22, 2024

@yarongilor

The coredump is a known issue and is already handled in SCT code; I would advise rebasing.

As for the missing logs, it seems to be:

java.io.StreamCorruptedException: invalid stream header: 636F7272

That is an issue being chased in #9444; we don't yet know its root cause, and it's not related to this issue.

As for the issue at hand, it has nothing to do with full disk utilization. The check code assumes all nodes are up during the check, while one isn't; that should be fixed.
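The fix fruch describes could look roughly like the sketch below — a hypothetical, simplified function and node object, not the actual SCT implementation — where the consistency check collects host ids only from nodes that actually report one, instead of assuming every node is up:

```python
def check_group0_tokenring_consistency(nodes):
    """Simplified sketch: compare host ids only across reachable nodes,
    skipping any node that cannot report a host id (i.e. is down)."""
    reachable = []
    skipped = []
    for node in nodes:
        host_id = node.host_id  # None when the node is down
        if host_id is None:
            skipped.append(node.name)
            continue  # skip down node rather than raising
        reachable.append((node.name, host_id))
    return reachable, skipped


class FakeNode:
    """Hypothetical node object for demonstration."""

    def __init__(self, name, host_id):
        self.name = name
        self.host_id = host_id


nodes = [FakeNode("node-1", None),       # the down node
         FakeNode("node-2", "hid-2"),
         FakeNode("node-3", "hid-3")]
reachable, skipped = check_group0_tokenring_consistency(nodes)
print(skipped)  # ['node-1']
```

Logging the skipped nodes (rather than silently ignoring them) would preserve the health-check signal that a node is unreachable while keeping the consistency comparison from crashing.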

@yarongilor yarongilor added the area/elastic cloud Issues related to the elastic cloud project label Dec 22, 2024