[Placeholder] Support repair on tablets enabled environment. #3704

Closed
Tracked by #3516
karol-kokoszka opened this issue Jan 30, 2024 · 6 comments · Fixed by #3753

@karol-kokoszka
Collaborator

This is a placeholder for the issue that will add support for repair in tablet-enabled environments.

The design doc for tablet repair and its integration with SM is here: https://docs.google.com/document/d/1ATh5i4EMTJYnWERZ-BDH7lpgqwHe0HE-zxPWycW70So/edit#heading=h.6ubymuu4oedu

This issue must be updated once the API on the Scylla side is confirmed (cc: @tgrabiec @bhalevy).

@Michal-Leszczynski
Collaborator

PR that added the describe ring endpoint for tablet tables (issue resolved).

@Michal-Leszczynski
Collaborator

Repair description from core POV.

@Michal-Leszczynski
Collaborator

As mentioned in the tablet repair doc, sensible intensity and parallel values might be different for regular and tablet tables, so SM needs to make it possible to set them separately.
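One way to picture separate settings is a small lookup keyed on the table type. The following is a minimal sketch only: `Settings`, `Config`, and `For` are hypothetical names, not Scylla Manager's actual types, and the numeric values are illustrative rather than recommended.

```go
package main

import "fmt"

// Settings holds the two repair tuning knobs discussed above.
// Hypothetical type, not taken from the Scylla Manager code base.
type Settings struct {
	Intensity float64
	Parallel  int
}

// Config keeps one set of values for regular (vnode) tables and
// another for tablet tables, so they can be tuned independently.
type Config struct {
	Vnode  Settings
	Tablet Settings
}

// For returns the settings that apply to a table of the given kind.
func (c Config) For(isTablet bool) Settings {
	if isTablet {
		return c.Tablet
	}
	return c.Vnode
}

func main() {
	// Illustrative values only.
	cfg := Config{
		Vnode:  Settings{Intensity: 1, Parallel: 0},
		Tablet: Settings{Intensity: 2, Parallel: 1},
	}
	fmt.Printf("vnode:  %+v\n", cfg.For(false))
	fmt.Printf("tablet: %+v\n", cfg.For(true))
}
```

Keeping the two sets in one struct means a single repair task could still cover a keyspace mix of both table kinds, picking the values per table.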

@tgrabiec

> Repair description from core POV.

FYI, we decided not to implement this spec in full for 6.0, so we're going for "Using the old APIs (option 2)" described in the doc.

@ShlomiBalalis

It's not just repair; backup also fails on tablet-enabled clusters:

< t:2024-03-23 12:12:10,862 f:remote_base.py  l:521  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "sudo sctool backup -c fbb525d8-fe74-4b5a-be5f-04840bba0c72 --location s3:manager-backup-tests-us-east-1 "...
2024-03-23 12:12:13.359: (DisruptionEvent Severity.ERROR) period_type=end event_id=26d5c1c2-ef33-4d19-bb12-cd5f6ce24da3 duration=48s: nemesis_name=MgmtBackup target_node=Node longevity-large-partitions-8h-maste-db-node-839a80cc-4 [34.246.173.247 | 10.4.1.100] (seed: True) errors=Encountered an error on sctool command: backup -c fbb525d8-fe74-4b5a-be5f-04840bba0c72 --location s3:manager-backup-tests-us-east-1 : Encountered a bad command exit code!

Command: 'sudo sctool backup -c fbb525d8-fe74-4b5a-be5f-04840bba0c72 --location s3:manager-backup-tests-us-east-1 '

Exit code: 1

Stdout:

Stderr:

Error: create backup target: keyspace scylla_bench: get ring description: giving up after 5 attempts: agent [HTTP 500] std::runtime_error (The keyspace scylla_bench has tablet table. Query describe_ring with the table parameter!)
Trace ID: VfaaEpMzT_CdKPXlgfTL6w (grep in scylla-manager logs)
2024-03-23 09:11:54,733 f:remote_base.py  l:521  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "sudo sctool repair -c fbb525d8-fe74-4b5a-be5f-04840bba0c72"...
2024-03-23 09:11:57.778: (DisruptionEvent Severity.ERROR) period_type=end event_id=8da842a0-2a91-4324-acc6-9fc460b9a684 duration=32s: nemesis_name=MgmtRepairCli target_node=Node longevity-large-partitions-8h-maste-db-node-839a80cc-1 [54.229.174.228 | 10.4.2.196] (seed: True) errors=Encountered an error on sctool command: repair -c fbb525d8-fe74-4b5a-be5f-04840bba0c72: Encountered a bad command exit code!

Command: 'sudo sctool repair -c fbb525d8-fe74-4b5a-be5f-04840bba0c72'

Exit code: 1

Stdout:

Stderr:

Error: create repair target: create repair plan: keyspace scylla_bench: get ring description: giving up after 5 attempts: agent [HTTP 500] std::runtime_error (The keyspace scylla_bench has tablet table. Query describe_ring with the table parameter!)
Trace ID: _R3HUAwZRTS2at54oQ57pA (grep in scylla-manager logs)
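
Both errors say the same thing: for a keyspace with tablet tables, the ring must be described per table rather than per keyspace. A rough sketch of what such a request could look like is below. The endpoint path, the `table` query parameter name, the base address, and the table name `test` are all assumptions inferred from the error text, and `describeRingURL` is a hypothetical helper, not Scylla Manager code.

```go
package main

import (
	"fmt"
	"net/url"
)

// describeRingURL builds a describe ring request URL.
// ASSUMPTION: the path and the "table" query parameter are inferred
// from the error message above, not from confirmed API documentation.
// An empty table yields the keyspace-level request that fails when the
// keyspace contains tablet tables.
func describeRingURL(base, keyspace, table string) string {
	u := base + "/storage_service/describe_ring/" + url.PathEscape(keyspace)
	if table != "" {
		u += "?" + url.Values{"table": {table}}.Encode()
	}
	return u
}

func main() {
	base := "http://127.0.0.1:10000" // assumed local API address
	// Keyspace-level request (the one rejected in the logs above).
	fmt.Println(describeRingURL(base, "scylla_bench", ""))
	// Per-table request; "test" is a placeholder table name.
	fmt.Println(describeRingURL(base, "scylla_bench", "test"))
}
```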

@Michal-Leszczynski
Collaborator

@ShlomiBalalis that's because the PR introducing tablet support hasn't been merged yet. It passes all tests from the SM repo, but it failed on a few dtests. The errors don't appear to be related to tablets but rather to the general setup. Could you take a look at them?
