-
Notifications
You must be signed in to change notification settings - Fork 34
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce tablets to SM #3753
Introduce tablets to SM #3753
Conversation
Ref #16509 (in scylladb)
Vnode keyspaces have per keyspace ring, and tablet keyspaces have per table ring.
This allows for checking which keyspaces have vnodes/tablets replication.
Used scylla nightly version has all options used by SM repair enabled. Investigated issues:
Note: all issues mentioned above are really flaky (meaning that the test passes sometimes - especially when ignoring erros from querying table size). After setting up the scenario and running repair multiple times, it is possible that it fails, succeeds and then fails again. Other:
|
@asias @karol-kokoszka FYI this is the PR preparing SM for tablets - tests are running and I'm investigating some encountered issues described above. |
8525fa6
to
1a5fa11
Compare
Tests against Scylla nightly with TABLETS DISABLED passed for all services (link here). |
Tests against Scylla nightly with TABLETS:
|
From the logs in restore tables tests we can see that tablet table is repaired:
yet it doesn't contain all required rows afterwards:
It's worth to note that schema is recreated via CQL on the restore destination cluster and that both cluster support tablets. |
I also run the tests with just raft-topology enabled and they passed - meaning that enabling tablets messes up the tests. |
@bhalevy this PR may be interesting for you. |
|
SM is supposed to ignore system_auth_v2. It is a raft managed table which does not need repair. |
This sounds like a node was down during repair, e.g., a node is restarted. Did you try some basic tests like starting a cluster manually, insert data, repair with SM. Did you see similar repair errors?
|
@asias you are right, it looks like some node is getting segfault and then restarts. This are example logs:
|
It looks like it happens after querying DescribeRing of table created for test purposes. |
Is there a way to know which tables are raft based and should be skipped by SM repair? |
Here are logs from mentioned test scenario: |
Changed to ready for review.
It is unsure if the problem lies in SM test env setup or something on Scylla side or in some misunderstanding of tablet features. |
Decoded Segmentation fault:
|
It allows for easy and efficient ring describing which works with older Scylla versions and tablet and vnode tables.
d657e08
to
367fcd3
Compare
@dani-tweig @karol-kokoszka good news! It looks like after the segfault fix the tests are mostly passing. |
@Michal-Leszczynski @bhalevy , FYI |
This is required for tablet tables which have per table ring description.
This is required for tablet tables which have per table ring description.
6a3b386
to
27fa6fb
Compare
Some tests assumed that SM is going to send thousands of ranges to be repaired and that it is easy to interrupt this process in the middle without any synchronization. Tablet tables have far fewer ranges (e.g. 8), so they require dedicated synchronization for those cases.
27fa6fb
to
e3e0de6
Compare
@karol-kokoszka could you take a look at this PR? |
I like the change with introducing nightly build. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Even though there were some problems regarding dtests that doesn't look connected to tablets, I'm going to merge it so that the master dtests and sct will have a chance of passing. |
Testing against
There was a little mix up with different PRs being opened and closed regarding tablet related functionalities. This one is supposed to prepare backup and repair services for tablet tables and test them.
Fixes #3759
Fixes #3704
Fixes #3703
Fixes #3713