Release 3.3.1 #3963

Merged
merged 80 commits into from
Aug 9, 2024
827c0f4
swagger(scylla): remove endpoint getting replace_address config value
karol-kokoszka Jun 25, 2024
33f22a3
swagger(scylla): remove endpoint getting replace_address_first_boot c…
karol-kokoszka Jun 25, 2024
bc5b7ad
fix(testing): fix testing env tablets setup
Michal-Leszczynski Jun 24, 2024
56f74d4
refactor(scyllaclient_test): add log on tablet keyspace for improved …
Michal-Leszczynski Jun 24, 2024
f55414f
feat(.github): add script for generating workflows and badges
Michal-Leszczynski Jun 27, 2024
848cd72
feat(.github): generate workflow files and badges from cfg/main.go
Michal-Leszczynski Jun 27, 2024
473cece
fix(repair_test): remove flakiness of TestServiceRepairResumeAllRange…
Michal-Leszczynski Jul 2, 2024
03d8a1b
fix(repair_test): remove specific error from repair temporary network…
Michal-Leszczynski Jul 2, 2024
efd4a79
fix(repair_test): difference batching and merged ranges optimization …
Michal-Leszczynski Jul 2, 2024
12a6dcb
fix(docs): update broken links
Michal-Leszczynski Jul 3, 2024
e663658
feat(docs): unlink old release docs from index
Michal-Leszczynski Jul 3, 2024
1fba2ca
fix(docs): keep prod-cluster name convention
Michal-Leszczynski Jul 3, 2024
674792c
fix(docs): remove deprecated flags from repair example
Michal-Leszczynski Jul 3, 2024
42fa61c
fix(docs): use sstables with UUID in backup specification
Michal-Leszczynski Jul 3, 2024
68e0434
feat(.github/cfg): print badges as reference links and table rows
Michal-Leszczynski Jul 5, 2024
840ba22
fix(README): add separate section for Scylla integration status
Michal-Leszczynski Jul 5, 2024
110ddb7
fix(repair): insert repair state under mutex
Michal-Leszczynski Jul 9, 2024
35f8710
refactor(repair): improve tablet load balancing logging
Michal-Leszczynski Jul 10, 2024
4d3a291
refactor(backup): improve tablet load balancing logging
Michal-Leszczynski Jul 10, 2024
45a7d6d
fix(repair_test): ensure enough token ranges for repair tests
Michal-Leszczynski Jul 10, 2024
ecbffd1
fix(scyllaclient_test): make sure to wait for transfer stop
Michal-Leszczynski Jul 11, 2024
901fa9a
feat(swagger): scylla_v2, add enable_tablets config field
Michal-Leszczynski Jul 9, 2024
80000e9
feat(swagger): agent, add enable_tablets to node info
Michal-Leszczynski Jul 9, 2024
4e66e53
feat(scyllaclient): fill node info with enable_tablets
Michal-Leszczynski Jul 9, 2024
9adcefc
feat(restore_test): add a way to compare table rows not only by a count
Michal-Leszczynski Jul 9, 2024
f0cd919
feat(restore_test): test vnode ks backup into tablet ks restore
Michal-Leszczynski Jul 9, 2024
fec6c47
fix(pkg): don't wrap errors as plain strings
Michal-Leszczynski Jul 10, 2024
3bd1ec9
Remove ansible/restore
mykaul Jul 23, 2024
1db785a
feat(swagger): scylla-manager, add labels to cluster and task related…
Michal-Leszczynski Jul 9, 2024
8acafbd
feat(schema): add labels to cluster and scheduler_task
Michal-Leszczynski Jul 17, 2024
4477274
feat(flag): add Label type
Michal-Leszczynski Jul 17, 2024
a75ee4a
feat(flag): add --label flag to TaskBase
Michal-Leszczynski Jul 17, 2024
7533739
feat(managerclient): render labels in TaskListItems
Michal-Leszczynski Jul 17, 2024
46f5c8b
feat(managerclient): render labels in TaskInfo
Michal-Leszczynski Jul 17, 2024
0ece5ed
feat(clusteradd): add --label flag
Michal-Leszczynski Jul 18, 2024
02c4143
feat(clusterupdate): add --label flag
Michal-Leszczynski Jul 18, 2024
8af0049
feat(managerclient): render labels in ClusterSlice
Michal-Leszczynski Jul 18, 2024
5d7c3f3
feat(docs): add --label docs
Michal-Leszczynski Jul 17, 2024
f51dd30
refactor(backup): move token ownership check outside getLiveNodes
Michal-Leszczynski Jul 8, 2024
7c422c9
refactor(backup): move validation and unit creation from GetTarget to…
Michal-Leszczynski Jul 8, 2024
95d3160
fix(backup): don't backup views
Michal-Leszczynski Jul 11, 2024
4cb852b
feat(backup_test): test that views are not backed up
Michal-Leszczynski Jul 4, 2024
599f208
feat(docs): backup, explain table backup criteria
Michal-Leszczynski Jul 8, 2024
5cc90bf
refactor(backup): make GetTarget a wrapper for targetFromProperties
Michal-Leszczynski Jul 23, 2024
4a38ac5
rcserver: allow to cat .crc32 checksum files
karol-kokoszka Jun 18, 2024
c84dc6c
backup: add explicit deduplication stage
karol-kokoszka Jun 18, 2024
09cc1ba
restore: update versioning tests to break .crc32 files
karol-kokoszka Jul 4, 2024
aedc236
CI: cluster to use integer generation version
karol-kokoszka Jul 5, 2024
9333972
Makefile: added restore integration test to integration-test target
karol-kokoszka Jul 5, 2024
362f3e5
backup: add deduplicate to resumable states
karol-kokoszka Jul 12, 2024
c070ed3
sstable: return error instead of panic on SSTable ID extract
karol-kokoszka Jul 12, 2024
c93a9ad
backup: aggregate size of deduplicated content
karol-kokoszka Jul 15, 2024
641c337
tests: cover pause/resume on backup's deduplication stage
karol-kokoszka Jul 19, 2024
dac8e95
backup: reindex when resuming on deduplicate stage
karol-kokoszka Jul 19, 2024
a559dde
scyllaclient: method to get host's rack
karol-kokoszka Jul 24, 2024
2a6a706
config-cache: extend node info with dc and rack
karol-kokoszka Jul 24, 2024
be14567
metrics: add rack and datacenter label to healthcheck metrics
karol-kokoszka Jul 25, 2024
80d890f
healthcheck: update rack and dc on healthcheck metrics
karol-kokoszka Jul 25, 2024
1941d89
fix(restore_test): fix TestRestoreSchemaRoundtripIntegration flakiness
Michal-Leszczynski Aug 6, 2024
ffe67f3
fix(repair_test): reduce the amount of tablets to gain speed
Michal-Leszczynski Jul 16, 2024
8af9785
feat(.github/cfg): delete old workflows on generation
Michal-Leszczynski Jul 16, 2024
9ef25cc
feat(.github/cfg): don't test tablets on Scylla 2024.1 as they are st…
Michal-Leszczynski Jul 16, 2024
5002573
fix(backup): allow re-indexing on stage schema
Michal-Leszczynski Jul 24, 2024
77aca63
feat(swagger): agent, add /rclone/operations/deletepaths endpoint
Michal-Leszczynski Jul 24, 2024
2630169
feat(rcserver): implement operations/deletepaths endpoint
Michal-Leszczynski Jul 24, 2024
1ccae9d
feat(rcserver): add operations/deletepaths to supported calls
Michal-Leszczynski Jul 24, 2024
b19d214
feat(scyllaclient): add method to delete multiple files in batches
Michal-Leszczynski Jul 24, 2024
44e3c18
fix(backup): don't deduplicate the same file twice
Michal-Leszczynski Aug 4, 2024
5a22ac3
feat(purger): delete files in batches (#3928)
Michal-Leszczynski Jul 24, 2024
4b241f0
feat(backup): delete files in batches (#3928)
Michal-Leszczynski Aug 5, 2024
d033ef5
fix(restore_test): decrease min disk space
Michal-Leszczynski Aug 5, 2024
01e4a68
cluster: update to include all hosts into connection validity check
karol-kokoszka Jul 25, 2024
f2e6d5e
feat(scheduler_test): test task status outside of maintenance window
Michal-Leszczynski Aug 6, 2024
3388840
fix(scheduler): use dedicated context cause errors for checking task …
Michal-Leszczynski Aug 6, 2024
7040dce
feat(scheduler_test): test task returned internal context error
Michal-Leszczynski Aug 6, 2024
4160465
fix(scyllaclient): create clients under mutex in cache provider
Michal-Leszczynski Aug 7, 2024
65031e8
fix(cluster): remove duplicates when combining Host and KnownHosts
Michal-Leszczynski Aug 7, 2024
b0becc5
bugfix(backup): delete deduplicated files based on crc32
karol-kokoszka Aug 9, 2024
5953471
bugfix(backup): include explicitly deduplicated files into progress
karol-kokoszka Aug 9, 2024
0a2c898
int-test(backup): update skipped in test hooks from different place
karol-kokoszka Aug 9, 2024
fix(repair_test): remove flakiness of TestServiceRepairResumeAllRangesIntegration

This commit:
- extends the graceful stop timeout so that it covers all repair jobs that finish after the context is cancelled
- clears the recorded ranges of tablet tables that were not fully repaired before starting a new repair (tablet tables always resume repair from scratch)
- no longer tolerates any redundant ranges
Michal-Leszczynski authored and karol-kokoszka committed Aug 9, 2024
commit 473cece1e9e465a50052053f8a4e4103f6990b70
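The "no redundant ranges" rule in the commit message relies on counting each repair job's ranges only once, even when the status request for that job is retried. A minimal standalone sketch of that idea follows. `jobRegistry`, `TokenRange`, and the key format are hypothetical stand-ins invented for illustration, not Scylla Manager types; the test itself uses plain maps guarded by mutexes.

```go
package main

import (
	"fmt"
	"sync"
)

// TokenRange is a hypothetical stand-in for scyllaclient.TokenRange.
type TokenRange struct {
	StartToken, EndToken int64
}

// jobRegistry is a hypothetical helper, not part of Scylla Manager.
// It remembers which ranges each repair job covers and registers them
// as done at most once per job.
type jobRegistry struct {
	mu   sync.Mutex
	jobs map[string][]TokenRange
	done []TokenRange
}

func newJobRegistry() *jobRegistry {
	return &jobRegistry{jobs: make(map[string][]TokenRange)}
}

// Schedule records the ranges covered by the job identified by key.
func (r *jobRegistry) Schedule(key string, ranges []TokenRange) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.jobs[key] = ranges
}

// MarkSuccessful registers the job's ranges as done on the first call for
// key and deletes the job entry, so a retried status poll is a no-op.
// It reports whether any ranges were registered.
func (r *jobRegistry) MarkSuccessful(key string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	ranges, ok := r.jobs[key]
	if !ok {
		return false
	}
	delete(r.jobs, key)
	r.done = append(r.done, ranges...)
	return true
}

// DoneCount returns the number of ranges registered as done.
func (r *jobRegistry) DoneCount() int {
	r.mu.Lock()
	defer r.mu.Unlock()
	return len(r.done)
}

func main() {
	reg := newJobRegistry()
	reg.Schedule("host1/42", []TokenRange{{0, 10}, {10, 20}})
	fmt.Println(reg.MarkSuccessful("host1/42")) // first status poll: true
	fmt.Println(reg.MarkSuccessful("host1/42")) // retried status poll: false
	fmt.Println(reg.DoneCount())                // 2
}
```

Deleting the entry under the same lock that reads it is what makes the second poll harmless; the ranges cannot be appended twice.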
134 changes: 68 additions & 66 deletions pkg/service/repair/service_repair_integration_test.go
@@ -883,7 +883,10 @@ func TestServiceRepairResumeAllRangesIntegration(t *testing.T) {
 	session := CreateScyllaManagerDBSession(t)
 	clusterSession := CreateSessionAndDropAllKeyspaces(t, c)
 	cfg := repair.DefaultConfig()
-	cfg.GracefulStopTimeout = time.Second
+	// Make sure that all repairs finish within the graceful stop timeout,
+	// so that we avoid situations that interceptor marked range as done,
+	// but SM won't save it in its DB.
+	cfg.GracefulStopTimeout = time.Minute
 	h := newRepairWithClusterSessionTestHelper(t, session, hrt, c, cfg)

 	const (
@@ -961,36 +964,33 @@ func TestServiceRepairResumeAllRangesIntegration(t *testing.T) {

 	// Tools for performing a repair with 4 pauses
 	var (
-		cnt             = atomic.Int64{}
+		reqCnt          = atomic.Int64{}
+		rspCnt          = atomic.Int64{}
+		stopErrInject   = atomic.Bool{}
 		stop1Ctx, stop1 = context.WithCancel(ctx)
 		stop2Ctx, stop2 = context.WithCancel(ctx)
 		stop3Ctx, stop3 = context.WithCancel(ctx)
 		stop4Ctx, stop4 = context.WithCancel(ctx)
-		stopCnt1        = 150
-		stopCnt2        = 250
-		stopCnt3        = 400
-		stopCnt4        = 500
+		stopCnt1        = 50
+		stopCnt2        = 75
+		stopCnt3        = 100
+		stopCnt4        = 125
 	)

-	running := atomic.Bool{}
 	// Repair request
 	h.Hrt.SetInterceptor(httpx.RoundTripperFunc(func(req *http.Request) (*http.Response, error) {
 		if repairEndpointRegexp.MatchString(req.URL.Path) && req.Method == http.MethodPost {
-			switch int(cnt.Add(1)) {
+			switch int(reqCnt.Add(1)) {
 			case stopCnt1:
-				running.Store(false)
 				stop1()
 				t.Log("First repair pause")
 			case stopCnt2:
-				running.Store(false)
 				stop2()
 				t.Log("Second repair pause")
 			case stopCnt3:
-				running.Store(false)
 				stop3()
 				t.Log("Third repair pause")
 			case stopCnt4:
-				running.Store(false)
 				stop4()
 				t.Log("Fourth repair pause")
 			}
@@ -1027,63 +1027,33 @@ func TestServiceRepairResumeAllRangesIntegration(t *testing.T) {

 		// Response to repair status
 		if repairEndpointRegexp.MatchString(resp.Request.URL.Path) && resp.Request.Method == http.MethodGet {
-			// Inject 5% errors on all runs except the last one.
+			// Inject errors on all runs except the last one.
 			// This helps to test repair error resilience.
-			// Also, return errors for requests after the pause, so that's
-			// easier to look for redundant ranges.
-			if i := cnt.Load(); !running.Load() || i < int64(stopCnt4) && i%20 == 0 {
+			if !stopErrInject.Load() && rspCnt.Add(1)%20 == 0 {
 				resp.Body = io.NopCloser(bytes.NewBufferString(fmt.Sprintf("%q", scyllaclient.CommandFailed)))
 				return
 			}

 			status := string(body)
 			if status == "\"SUCCESSFUL\"" {
 				muJS.Lock()
-				tr := jobSpec[resp.Request.Host+resp.Request.URL.Query()["id"][0]]
-				muJS.Unlock()
-
-				if tr.FullTable == "" {
-					t.Logf("This is strange %s", resp.Request.Host+resp.Request.URL.Query()["id"][0])
-					return
+				defer muJS.Unlock()
+
+				k := resp.Request.Host + resp.Request.URL.Query()["id"][0]
+				if tr, ok := jobSpec[k]; ok {
+					// Make sure that retries don't result in counting redundant ranges
+					delete(jobSpec, k)
+					// Register done ranges
+					muDR.Lock()
+					dr := doneRanges[tr.FullTable]
+					dr = append(dr, tr.Ranges...)
+					doneRanges[tr.FullTable] = dr
+					muDR.Unlock()
 				}
-
-				// Register done ranges
-				muDR.Lock()
-				dr := doneRanges[tr.FullTable]
-				dr = append(dr, tr.Ranges...)
-				doneRanges[tr.FullTable] = dr
-				muDR.Unlock()
 			}
 		}
 	})

-	Print("When: run first repair with context cancel")
-	running.Store(true)
-	if err := h.runRegularRepair(stop1Ctx, props); err == nil {
-		t.Fatal("Repair failed without error")
-	}
-
-	Print("When: run second repair with context cancel")
-	h.RunID = uuid.NewTime()
-	running.Store(true)
-	if err := h.runRegularRepair(stop2Ctx, props); err == nil {
-		t.Fatal("Repair failed without error")
-	}
-
-	Print("When: run third repair with context cancel")
-	h.RunID = uuid.NewTime()
-	running.Store(true)
-	if err := h.runRegularRepair(stop3Ctx, props); err == nil {
-		t.Fatal("Repair failed without error")
-	}
-
-	Print("When: run fourth repair with context cancel")
-	h.RunID = uuid.NewTime()
-	running.Store(true)
-	if err := h.runRegularRepair(stop4Ctx, props); err == nil {
-		t.Fatal("Repair failed without error")
-	}
-
 	validate := func(tab string, tr []scyllaclient.TokenRange) (redundant int, err error) {
 		sort.Slice(tr, func(i, j int) bool {
 			return tr[i].StartToken < tr[j].StartToken
@@ -1113,33 +1083,65 @@ func TestServiceRepairResumeAllRangesIntegration(t *testing.T) {
 		return redundant, nil
 	}

+	clearTabletRanges := func(doneRanges map[string][]scyllaclient.TokenRange, ringDescriber scyllaclient.RingDescriber) {
+		var clearKeys []string
+		for tab, dr := range doneRanges {
+			_, err := validate(tab, dr)
+			if err != nil && ringDescriber.IsTabletKeyspace(strings.Split(tab, ".")[0]) {
+				clearKeys = append(clearKeys, tab)
+			}
+		}
+		for _, k := range clearKeys {
+			delete(doneRanges, k)
+		}
+	}
+
+	Print("When: run first repair with context cancel")
+	if err := h.runRegularRepair(stop1Ctx, props); err == nil {
+		t.Fatal("Repair failed without error")
+	}
+
 	// Tablet tables don't support resuming repair, so in order to check if repair
 	// started from scratch, we need to remove all repaired ranges up to this point.
 	ringDescriber := scyllaclient.NewRingDescriber(ctx, h.Client)
-	for tab, _ := range doneRanges {
-		_, err := validate(tab, doneRanges[tab])
-		if err != nil && ringDescriber.IsTabletKeyspace(strings.Split(tab, ".")[0]) {
-			doneRanges[tab] = nil
-		}
+
+	Print("When: run second repair with context cancel")
+	h.RunID = uuid.NewTime()
+	clearTabletRanges(doneRanges, ringDescriber)
+	if err := h.runRegularRepair(stop2Ctx, props); err == nil {
+		t.Fatal("Repair failed without error")
 	}

+	Print("When: run third repair with context cancel")
+	clearTabletRanges(doneRanges, ringDescriber)
+	h.RunID = uuid.NewTime()
+	if err := h.runRegularRepair(stop3Ctx, props); err == nil {
+		t.Fatal("Repair failed without error")
+	}
+
+	Print("When: run fourth repair with context cancel")
+	clearTabletRanges(doneRanges, ringDescriber)
+	h.RunID = uuid.NewTime()
+	if err := h.runRegularRepair(stop4Ctx, props); err == nil {
+		t.Fatal("Repair failed without error")
+	}
+
 	Print("When: run fifth repair till it finishes")
+	clearTabletRanges(doneRanges, ringDescriber)
 	h.RunID = uuid.NewTime()
-	running.Store(true)
+	stopErrInject.Store(true)
 	if err := h.runRegularRepair(ctx, props); err != nil {
 		t.Fatalf("Repair failed: %s", err)
 	}

 	Print("When: validate all, continuous ranges")
 	for tab, tr := range doneRanges {
 		t.Logf("Checking table %s", tab)

 		r, err := validate(tab, tr)
 		if err != nil {
 			t.Fatal(err)
 		}
-		if r > len(tr)/20 {
-			t.Fatalf("Expected less than 5 percent (%d) of redundant ranges per table (%d)", r, len(tr))
+		if r > 0 {
+			t.Fatalf("Expected no redundant ranges in %s, got %d (out of total %d)", tab, r, len(tr))
 		}
 	}
 }
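
The error injection in this diff fails every 20th repair status response until a flag switches injection off for the final run. A minimal standalone sketch of that pattern follows. `errInjector` and `countFailures` are hypothetical names introduced here for illustration; the test keeps `rspCnt` and `stopErrInject` as local variables instead of a struct.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// errInjector is a hypothetical helper that mirrors the rspCnt and
// stopErrInject variables used by the test in the diff.
type errInjector struct {
	rspCnt atomic.Int64 // status responses counted while injection is on
	stop   atomic.Bool  // set before the final run to disable injection
}

// ShouldFail reports whether the current status response should be turned
// into a failure: every 20th response, as long as injection is enabled.
// Once stop is set, the counter is no longer advanced.
func (e *errInjector) ShouldFail() bool {
	return !e.stop.Load() && e.rspCnt.Add(1)%20 == 0
}

// countFailures calls ShouldFail n times and returns how many of those
// calls asked for a failure.
func countFailures(e *errInjector, n int) int {
	failed := 0
	for i := 0; i < n; i++ {
		if e.ShouldFail() {
			failed++
		}
	}
	return failed
}

func main() {
	var inj errInjector
	fmt.Println(countFailures(&inj, 100)) // 5: every 20th response fails
	inj.stop.Store(true)
	fmt.Println(countFailures(&inj, 100)) // 0: injection disabled
}
```

Keeping a separate response counter means the failure rate no longer depends on how many repair requests were sent, which is the role `reqCnt` keeps for triggering the pauses.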