Release 3.3.1 #3963

karol-kokoszka · 2024-08-09T21:44:49Z

Prep for release 3.3.1

Contains:
#3905
#3906
#3910
#3898
#3917
#3923
#3924
#3926
#3938
#3934
#3918
#3885
#3942
#3954
#3931
#3941
#3943
#3955
#3957
#3960

Please make sure that:

Code is split to commits that address a single change
Commit messages are informative
Commit titles have module prefix
Commit titles have issue nr. suffix

Since scylladb/scylladb@1577aa8 (5.2), the replace_address and replace_address_first_boot options are deprecated: their use is discouraged and triggers a warning. Instead, the user should use replace_node_first_boot, which identifies the replaced node by its host_id rather than its IP address.

Tablets went from an experimental feature to the regular enable_tablets feature. This commit updates the test env setup according to this change. It also removes things that are no longer needed:
- separate scylla.yaml and .properties for the other cluster
- control of raft topology, as it is always enabled in 6.0 and we don't care about it in previous versions

…sIntegration
This commit:
- extends the graceful stop timeout so that it catches all repair jobs that finished after ctx cancel
- clears tablet ranges of a not fully repaired table before starting a new repair (as tablet tables always resume repair from scratch)
- makes it so that no redundant ranges are tolerated

Even though taking state and appending ranges is done under mutex, it is still possible that (from SM POV) the first insert could reach SM DB later than the second insert. This would result in overwriting some successfully repaired ranges and re-repairing them in the next run. Fixes #3919

This fixes flakiness like:
2024-07-10T16:09:57.1335875Z === RUN TestRcloneStoppingTransferIntegration
{"host": "[2001:0DB9:200::11]:10001", "method": "POST", "uri": "/agent/rclone/core/bwlimit", "duration": "1ms", "status": 200, "bytes": 93}
{"host": "[2001:0DB9:200::11]:10001", "method": "POST", "uri": "/agent/rclone/sync/copydir?_async=true", "duration": "0ms", "status": 200, "bytes": 21}
{"host": "[2001:0DB9:200::11]:10001", "method": "POST", "uri": "/agent/rclone/job/info", "duration": "1000ms", "status": 200, "bytes": 556}
{"host": "[2001:0DB9:200::11]:10001", "method": "POST", "uri": "/agent/rclone/job/stop", "duration": "0ms", "status": 200, "bytes": 3}
{"host": "[2001:0DB9:200::11]:10001", "method": "POST", "uri": "/agent/rclone/job/info", "duration": "1000ms", "status": 200, "bytes": 549}
{"host": "[2001:0DB9:200::11]:10001", "method": "POST", "uri": "/agent/rclone/core/bwlimit", "duration": "9ms", "status": 200, "bytes": 79}
2024-07-10T16:10:02.5696715Z --- FAIL: TestRcloneStoppingTransferIntegration (5.44s)
2024-07-10T16:10:02.5716548Z panic: runtime error: index out of range [0] with length 0 [recovered]

Previously the test scenario was:
- backup src
- restore to dst
- backup dst
- restore to *dst*
It was changed to (as roundtrip suggests):
- backup src
- restore to dst
- backup dst
- restore to *src*
Also, the src schema is dropped right after the backup, so the drop should propagate to all nodes before the last restore takes place. Fixes #3939

…status This commit adds dedicated errors that can be used to check task status. Previously the status was derived from the error returned by the task, which made it impossible to tell whether a returned context.Canceled originated from pausing the task or from the task execution itself. Fixes #3884

karol-kokoszka and others added 30 commits August 9, 2024 23:16

refactor(scyllaclient_test): add log on tablet keyspace for improved …

56f74d4

…visibility

feat(.github): add script for generating workflows and badges

f55414f

This commit adds a script ('.github/cfg/main.go') which generates workflows and prints GitHub badges according to './integration-test-cfg.yaml' and './integration-test-core.yaml'.

feat(.github): generate workflow files and badges from cfg/main.go

848cd72

fix(repair_test): remove specific error from repair temporary network…

03d8a1b

… outage

fix(repair_test): difference batching and merged ranges optimization …

efd4a79

…in no opt test

fix(docs): update broken links

12a6dcb

Fixes #3878

feat(docs): unlink old release docs from index

e663658

fix(docs): keep prod-cluster name convention

1fba2ca

Ref #3879

fix(docs): remove deprecated flags from repair example

674792c

Ref #3879

fix(docs): use sstables with UUID in backup specification

42fa61c

Ref #3916

feat(.github/cfg): print badges as reference links and table rows

68e0434

fix(README): add separate section for Scylla integration status

840ba22

This section is generated by the .github/cfg/main.go script (except for the limitations, which should be added manually). Fixes #3872

refactor(repair): improve tablet load balancing logging

35f8710

refactor(backup): improve tablet load balancing logging

4d3a291

fix(repair_test): ensure enough token ranges for repair tests

45a7d6d

Test case "repair temporary network outage" could fail (because the repair finished before network outage) when there were only 3 replica sets to repair (possible for 4 initial tablets).

feat(swagger): scylla_v2, add enable_tablets config field

901fa9a

feat(swagger): agent, add enable_tablets to node info

80000e9

feat(scyllaclient): fill node info with enable_tablets

4e66e53

feat(restore_test): add a way to compare table rows not only by a count

9adcefc

feat(restore_test): test vnode ks backup into tablet ks restore

f0cd919

Fixes #3886

fix(pkg): don't wrap errors as plain strings

fec6c47

Wrapping with errors.Errorf("%s", err) flattens the cause into a plain string, which makes it impossible to later check the nested error cause with the errors.Is function. Wrapping with fmt.Errorf("%w", err), on the other hand, keeps the cause in the error chain. Fixes #3925

Remove ansible/restore

3bd1ec9

Since it's no longer supported, remove it. Signed-off-by: Yaniv Kaul <[email protected]>

feat(swagger): scylla-manager, add labels to cluster and task related…

1db785a

… objects

feat(schema): add labels to cluster and scheduler_task

8acafbd

Also, remove deprecated tags field.

karol-kokoszka and others added 27 commits August 9, 2024 23:30

backup: reindex when resuming on deduplicate stage

dac8e95

scyllaclient: method to get host's rack

a559dde

config-cache: extend node info with dc and rack

2a6a706

metrics: add rack and datacenter label to healthcheck metrics

be14567

healthcheck: update rack and dc on healthcheck metrics

80d890f

fix(repair_test): reduce the amount of tablets to gain speed

ffe67f3

Fixes #3929

feat(.github/cfg): delete old workflows on generation

8af9785

feat(.github/cfg): don't test tablets on Scylla 2024.1 as they are st…

9ef25cc

…ill experimental there

fix(backup): allow re-indexing on stage schema

5002573

Even though stage schema does not need indexed snapshot dirs, they are required in the next stage deduplicate.

feat(swagger): agent, add /rclone/operations/deletepaths endpoint

77aca63

The purpose of this endpoint is to delete files in batches, instead of deleting them one by one. It should improve purge and deduplication backup stages performance.

feat(rcserver): implement operations/deletepaths endpoint

2630169

feat(rcserver): add operations/deletepaths to supported calls

1ccae9d

feat(scyllaclient): add method to delete multiple files in batches

b19d214

Previously it was only possible to delete files one by one. Now it's possible to delete many files in a single API call (RcloneDeletePaths), or in batches of given size (RcloneDeletePathsInBatches).

fix(backup): don't deduplicate the same file twice

44e3c18

Fixes #3945

feat(purger): delete files in batches (#3928)

5a22ac3

feat(backup): delete files in batches (#3928)

4b241f0

Fixes #3928

fix(restore_test): decrease min disk space

d033ef5

Tests were sometimes failing because gh actions machines had less disk space than the default amount.

cluster: update to include all hosts into connection validity check

01e4a68

feat(scheduler_test): test task status outside of maintenance window

f2e6d5e

feat(scheduler_test): test task returned internal context error

7040dce

This discovers #3884.

fix(scyllaclient): create clients under mutex in cache provider

4160465

Ref #3935 (comment)

fix(cluster): remove duplicates when combining Host and KnownHosts

65031e8

Ref #3935 (comment)

bugfix(backup): delete deduplicated files based on crc32

b0becc5

bugfix(backup): include explicitly deduplicated files into progress

5953471

int-test(backup): update skipped in test hooks from different place

0a2c898

karol-kokoszka requested a review from Michal-Leszczynski as a code owner August 9, 2024 21:44

karol-kokoszka merged commit 932a84b into branch-3.3 Aug 9, 2024
50 of 51 checks passed

karol-kokoszka deleted the release-3.3.1 branch August 9, 2024 22:49
