Failed to add node in parallel (no tablets) #19523
What's the backtrace, decoded?
In the attached file.
This is a new assertion, added in 18f5d6f cc @gleb-cloudius
How can the cluster size be 3 when there are 6 nodes?
So it complains that it doesn't have an IP mapping for itself. No other node got this error.
According to the topology coordinator, bootstrap completed successfully -- including the final global barrier in
So apparently, our node crashed when handling that last command -- which removes transition state and moves it to normal? But the crash happens in
@gleb-cloudius why is wait_for_ip excluding the node being worked on? Maybe it makes sense; we assume that it has its own IP (although in this case it apparently turns out that it doesn't...)
This could mean that the bootstrapping node is behind, slower to apply commands, but -- shouldn't we include the bootstrapping node in global barriers?
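To illustrate the pattern being questioned here, a minimal sketch of a wait_for_ip-style check that skips the node being worked on; the names, types and polling logic are stand-ins, not the actual ScyllaDB implementation:

```cpp
// Illustrative only: a simplified wait_for_ip-style loop that polls an address
// map for every node involved in the transition but skips the node currently
// being worked on.
#include <chrono>
#include <string>
#include <thread>
#include <unordered_map>
#include <unordered_set>

using node_id = std::string;
using ip_addr = std::string;

bool wait_for_ip_sketch(const std::unordered_set<node_id>& nodes,
                        const node_id& node_being_worked_on,
                        const std::unordered_map<node_id, ip_addr>& address_map,
                        std::chrono::seconds timeout) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (std::chrono::steady_clock::now() < deadline) {
        bool all_mapped = true;
        for (const auto& id : nodes) {
            if (id == node_being_worked_on) {
                // The assumption under discussion: the node being worked on
                // surely knows its own IP, so it is not checked here.
                continue;
            }
            if (address_map.find(id) == address_map.end()) {
                all_mapped = false;
                break;
            }
        }
        if (all_mapped) {
            return true;
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
    return false; // some mapping never showed up within the timeout
}
```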
Ah, could the IP mapping have expired in
The fix for #16849 was bc42a5a:

```cpp
for (const auto& [id, rs]: t.new_nodes) {
    _group0->modifiable_address_map().set_nonexpiring(id);
}
```

This however assumes that every node will observe a state in which every joining node appears in
But I think command merging or snapshot transfers could break this assumption. A node might theoretically skip some state transitions (if it is not included in global barriers between each transition), so e.g. it might never see the moment it was in
Could this be the cause of why a bootstrapping node lost its own mapping?
Another hypothesis: could it be possible that the mapping expired even before it reached the state where it was in
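To make the expiry hypothesis above concrete, here is a small self-contained toy model of an address map with expiring entries (purely illustrative; this is not ScyllaDB's actual raft address map): an entry created as expiring silently disappears once its TTL passes, while a set_nonexpiring-style call pins it, which is how a node that only ever had its own mapping as an expiring entry can later find it gone.

```cpp
// Toy model of an address map with expiring entries (illustrative only; not
// ScyllaDB's real raft address map). A mapping created as "expiring" vanishes
// if it is looked up after the TTL has passed, while a non-expiring entry
// survives arbitrarily long queueing delays.
#include <chrono>
#include <optional>
#include <string>
#include <unordered_map>

using steady = std::chrono::steady_clock;

struct entry {
    std::string ip;
    std::optional<steady::time_point> expires_at; // nullopt => non-expiring
};

class address_map_model {
    std::unordered_map<std::string, entry> _entries;
public:
    void set_expiring(const std::string& id, std::string ip, std::chrono::seconds ttl) {
        _entries[id] = entry{std::move(ip), steady::now() + ttl};
    }
    void set_nonexpiring(const std::string& id) {
        auto it = _entries.find(id);
        if (it != _entries.end()) {
            it->second.expires_at.reset(); // pin the mapping forever
        }
    }
    std::optional<std::string> find_ip(const std::string& id) const {
        auto it = _entries.find(id);
        if (it == _entries.end()) {
            return std::nullopt;
        }
        if (it->second.expires_at && *it->second.expires_at <= steady::now()) {
            return std::nullopt; // expired: the failure mode discussed above
        }
        return it->second.ip;
    }
};
```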
Well yeah -- judging from the timestamps:
so if the mapping was created before
@gleb-cloudius I think we need a fix for #16849 that doesn't depend on timing.
@mykaul this should be a blocker for 2024.2 (concurrent bootstrap doesn't work correctly with this issue)
The fix is for all other nodes, not the joining node. The joining node will not be a part of group0 when it is in the
…since the local node cannot disappear' from ScyllaDB

A node may wait in the topology coordinator queue for a while before being joined. Since the local address is added as an expiring entry to the raft address map, it may expire in the meantime and the bootstrap will fail. The series makes the entry non-expiring.

Fixes #19523
Needs to be backported to 6.0 since the bug may cause bootstrap to fail.
(cherry picked from commit 5d8f08c)
(cherry picked from commit 3f136cf)
Refs #19557
Closes #19574

* github.com:scylladb/scylladb:
test: add test that checks that local address cannot expire between join request placement and its processing
storage_service: make node's entry non-expiring in raft address map
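A minimal sketch of the idea in this series (set_nonexpiring mirrors the call quoted earlier in the thread; the surrounding types and function are stand-ins, not the real ScyllaDB code path): pin the local node's own address-map entry at the moment the join request is placed, so it cannot expire while the request waits in the topology coordinator queue.

```cpp
// Sketch of the fix's idea with stand-in types (not the real ScyllaDB code):
// the joining node pins its own entry as non-expiring when it places the join
// request, removing the dependence on how long the request sits in the queue.
#include <string>
#include <unordered_map>
#include <utility>

struct address_map_stub {
    // id -> (ip, expiring?); a stand-in for the raft address map
    std::unordered_map<std::string, std::pair<std::string, bool>> entries;

    void set_nonexpiring(const std::string& id) {
        if (auto it = entries.find(id); it != entries.end()) {
            it->second.second = false; // pin the mapping
        }
    }
};

void place_join_request(address_map_stub& addr_map,
                        const std::string& my_host_id,
                        const std::string& my_ip) {
    // The local mapping would normally be added as expiring; pin it before the
    // request can wait in the topology coordinator queue past the TTL.
    addr_map.entries[my_host_id] = {my_ip, /*expiring=*/true};
    addr_map.set_nonexpiring(my_host_id);
    // ... submit the join request to the topology coordinator ...
}
```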
I repeated the test and it failed again, but this time with a different error (still a failure during adding a node in parallel, no tablets).
And the referenced node (the one that was first to be added to the cluster) crashed during that time:
(I couldn't decode the backtrace as backtrace.scylladb.com didn't respond; no core dump was collected)
Packages
Scylla version:
Kernel Version:
Installation details
Cluster size: 3 nodes (i3en.2xlarge)
Scylla Nodes used in this run:
OS / Image:
Test:
Logs and commands
Logs: No logs captured during this run.
This ran on a version without the fix.
Yes, this is due to missing AMIs; closing back until retested.
Packages
Scylla version: 6.1.0~dev-20240625.c80dc5715668 with build-id bf0032dbaafe5e4d3e01ece0dcb7785d2ec7a098
Kernel Version: 5.15.0-1063-aws
Issue description
I tried the elasticity test with parallel node bootstrap (3 nodes), but without tablets, as requested in a comment.
During the addition of the third node, right after
an error showed up:
and
Aborting on shard 0, in scheduling group gossip.
error: coredump-info.txt
and later, when the node was restarted, it failed to join the cluster due to an existing node with this IP address:
Impact
Failed to add the node
How frequently does it reproduce?
First time seen
Installation details
Cluster size: 3 nodes (i4i.2xlarge)
Scylla Nodes used in this run:
OS / Image: ami-05027cfce81f4fa63 (aws: undefined_region)
Test: scylla-master-perf-regression-latency-650gb-grow-shrink
Test id: ace73ad8-be62-4544-8861-45aa083cb8d3
Test name: scylla-staging/lukasz/scylla-master-perf-regression-latency-650gb-grow-shrink
Test config file(s):
Logs and commands
$ hydra investigate show-monitor ace73ad8-be62-4544-8861-45aa083cb8d3
$ hydra investigate show-logs ace73ad8-be62-4544-8861-45aa083cb8d3
Logs:
Jenkins job URL
Argus
| Date | Log type | Link |
| --- | --- | --- |
| 20190101_010101 | prometheus | https://cloudius-jenkins-test.s3.amazonaws.com/ace73ad8-be62-4544-8861-45aa083cb8d3/prometheus_snapshot_20240627_104445.tar.gz |
| 20190101_010101 | prometheus | https://cloudius-jenkins-test.s3.amazonaws.com/ace73ad8-be62-4544-8861-45aa083cb8d3/prometheus_snapshot_20240627_134404.tar.gz |
| 20240627_104344 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/ace73ad8-be62-4544-8861-45aa083cb8d3/20240627_104344/grafana-screenshot-scylla-master-perf-regression-latency-650gb-grow-shrink-scylla-per-server-metrics-nemesis-20240627_104415-perf-latency-grow-shrink-ubuntu-monitor-node-ace73ad8-1.png |
| 20240627_115534 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/ace73ad8-be62-4544-8861-45aa083cb8d3/20240627_115534/grafana-screenshot-overview-20240627_115535-perf-latency-grow-shrink-ubuntu-monitor-node-ace73ad8-1.png |
| 20240627_115534 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/ace73ad8-be62-4544-8861-45aa083cb8d3/20240627_115534/grafana-screenshot-scylla-master-perf-regression-latency-650gb-grow-shrink-scylla-per-server-metrics-nemesis-20240627_115541-perf-latency-grow-shrink-ubuntu-monitor-node-ace73ad8-1.png |
| 20240627_134340 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/ace73ad8-be62-4544-8861-45aa083cb8d3/20240627_134340/grafana-screenshot-overview-20240627_134340-perf-latency-grow-shrink-ubuntu-monitor-node-ace73ad8-1.png |
| 20240627_134340 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/ace73ad8-be62-4544-8861-45aa083cb8d3/20240627_134340/grafana-screenshot-scylla-master-perf-regression-latency-650gb-grow-shrink-scylla-per-server-metrics-nemesis-20240627_134347-perf-latency-grow-shrink-ubuntu-monitor-node-ace73ad8-1.png |
| 20240627_134535 | db-cluster | https://cloudius-jenkins-test.s3.amazonaws.com/ace73ad8-be62-4544-8861-45aa083cb8d3/20240627_134535/db-cluster-ace73ad8.tar.gz |
| 20240627_134535 | loader-set | https://cloudius-jenkins-test.s3.amazonaws.com/ace73ad8-be62-4544-8861-45aa083cb8d3/20240627_134535/loader-set-ace73ad8.tar.gz |
| 20240627_134535 | monitor-set | https://cloudius-jenkins-test.s3.amazonaws.com/ace73ad8-be62-4544-8861-45aa083cb8d3/20240627_134535/monitor-set-ace73ad8.tar.gz |
| 20240627_134535 | sct | https://cloudius-jenkins-test.s3.amazonaws.com/ace73ad8-be62-4544-8861-45aa083cb8d3/20240627_134535/sct-ace73ad8.log.tar.gz |
| 20240627_134535 | event | https://cloudius-jenkins-test.s3.amazonaws.com/ace73ad8-be62-4544-8861-45aa083cb8d3/20240627_134535/sct-runner-events-ace73ad8.tar.gz |