Skip to content

Cheksum comparison#3885

Merged
karol-kokoszka merged 10 commits intomasterfrom
cheksum_comparison
Jul 24, 2024
Merged

Cheksum comparison#3885
karol-kokoszka merged 10 commits intomasterfrom
cheksum_comparison

Conversation

@karol-kokoszka
Copy link
Copy Markdown
Collaborator

@karol-kokoszka karol-kokoszka commented Jun 18, 2024

Fixes #3827

This PR adds explicit stage to the backup process - deduplication.
It follows https://docs.google.com/document/d/1EtGlF6UGNy34D_7QsnCheaukp3UwVObZU56PBdd0CQ8/edit#heading=h.jl2qbpcarwp9

It is done to mitigate the problem with current deduplication/checksum comparison which causes SM to create a checksum of every SSTable file and to compare it with the checksum from remote.

This PR changes restore of versioned snapshot as well to corrupt CRC32 files.


Please make sure that:

  • Code is split to commits that address a single change
  • Commit messages are informative
  • Commit titles have module prefix
  • Commit titles have issue nr. suffix

@karol-kokoszka karol-kokoszka force-pushed the cheksum_comparison branch 2 times, most recently from 056d2bd to 10b6287 Compare June 18, 2024 15:41
@karol-kokoszka
Copy link
Copy Markdown
Collaborator Author

karol-kokoszka commented Jun 18, 2024

Integration tests are failing. To be checked.

Done

@karol-kokoszka karol-kokoszka marked this pull request as draft June 19, 2024 09:53
@karol-kokoszka karol-kokoszka force-pushed the cheksum_comparison branch 2 times, most recently from 3755bf4 to 0efd86e Compare July 5, 2024 14:47
@karol-kokoszka karol-kokoszka marked this pull request as ready for review July 5, 2024 14:49
@karol-kokoszka karol-kokoszka force-pushed the cheksum_comparison branch 3 times, most recently from bc165a5 to a172c63 Compare July 5, 2024 15:37
@karol-kokoszka
Copy link
Copy Markdown
Collaborator Author

This jenkins build is expected to trigger DTests and SCT.
https://jenkins.scylladb.com/view/scylla-manager/job/manager-master/job/manager-build/718/

Let's check if we are asserting the backup progress.

@karol-kokoszka
Copy link
Copy Markdown
Collaborator Author

Copy link
Copy Markdown
Collaborator

@Michal-Leszczynski Michal-Leszczynski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!

In the past there was an idea of making it possible to delete files in batches (I don't think that it is currently supported in our rclone fork). This seems like a simple change on rclone side that would allow us to save maaaany API calls (both in backup purger, and now in backup deduplication).

@karol-kokoszka maybe we should include it in the scope of this effort (perhaps in a separate series of rclone/SM PRs)?

Comment thread pkg/service/backup/worker_deduplicate.go
Comment thread pkg/service/backup/worker_deduplicate.go
Comment thread pkg/service/backup/worker_deduplicate.go Outdated
Comment thread pkg/service/backup/worker_deduplicate.go Outdated
Comment thread pkg/service/backup/worker_deduplicate.go
Comment thread pkg/service/backup/worker_deduplicate.go Outdated
Comment thread pkg/service/restore/helper_integration_test.go Outdated
Comment thread pkg/sstable/sstable_naming.go Outdated
Comment thread pkg/service/restore/service_restore_integration_test.go
Comment thread pkg/service/restore/service_restore_integration_test.go
@karol-kokoszka
Copy link
Copy Markdown
Collaborator Author

@Michal-Leszczynski please re-review.

Copy link
Copy Markdown
Collaborator

@Michal-Leszczynski Michal-Leszczynski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR adds the deduplication stage, but it does not change the rclone options of the upload stage - it still has checksum. Is this expected?

In theory it shouldn't be a big deal, but we could disable the checksum comparison in rclone, so that when we checked that the crc32 doesn't match, we don't compare the hashes for nothing.

Comment thread pkg/service/backup/worker_deduplicate.go Outdated
Comment thread pkg/service/backup/worker_deduplicate.go Outdated
Comment thread pkg/service/backup/worker_deduplicate.go
@karol-kokoszka karol-kokoszka force-pushed the cheksum_comparison branch 3 times, most recently from 63b23de to cdcf63b Compare July 17, 2024 15:33
operations/cat on RClone server can be executed against /backup/meta prefix and files with .crc32 suffix
@karol-kokoszka karol-kokoszka force-pushed the cheksum_comparison branch 2 times, most recently from d702da2 to 0dd57ff Compare July 18, 2024 10:26
@karol-kokoszka
Copy link
Copy Markdown
Collaborator Author

Some sample logs (after changing from DEBUG to INFO)

Jul 18 09:04:14 manager-regression-master-monitor-node-d5777710-1 scylla-manager[10584]: {"L":"INFO","T":"2024-07-18T09:04:14.845Z","N":"backup.deduplicate","M":"Removing local snapshot file (deduplication based on generation UUID)","host":"10.15.1.78","file":"data:system_distributed_everywhere/cdc_generation_descriptions_v2-234d2227dd633d37ac5fc013e2ea9e6e/snapshots/sm_20240718090409UTC/me-3ghx_0net_4rw9s2bhesqa79ne6z-big-CompressionInfo.db","_trace_id":"LkFf76dqQ-miLcvXFxGsMQ"}
Jul 18 09:04:14 manager-regression-master-monitor-node-d5777710-1 scylla-manager[10584]: {"L":"INFO","T":"2024-07-18T09:04:14.845Z","N":"backup.deduplicate","M":"Removing local snapshot file (deduplication based on generation UUID)","host":"10.15.1.78","file":"data:system_distributed_everywhere/cdc_generation_descriptions_v2-234d2227dd633d37ac5fc013e2ea9e6e/snapshots/sm_20240718090409UTC/me-3ghx_0net_4rw9s2bhesqa79ne6z-big-Data.db","_trace_id":"LkFf76dqQ-miLcvXFxGsMQ"}
Jul 18 09:04:14 manager-regression-master-monitor-node-d5777710-1 scylla-manager[10584]: {"L":"INFO","T":"2024-07-18T09:04:14.846Z","N":"backup.deduplicate","M":"Removing local snapshot file (deduplication based on generation UUID)","host":"10.15.1.78","file":"data:system_distributed_everywhere/cdc_generation_descriptions_v2-234d2227dd633d37ac5fc013e2ea9e6e/snapshots/sm_20240718090409UTC/me-3ghx_0net_4rw9s2bhesqa79ne6z-big-Digest.crc32","_trace_id":"LkFf76dqQ-miLcvXFxGsMQ"}
Jul 18 09:04:14 manager-regression-master-monitor-node-d5777710-1 scylla-manager[10584]: {"L":"INFO","T":"2024-07-18T09:04:14.846Z","N":"backup.deduplicate","M":"Removing local snapshot file (deduplication based on generation UUID)","host":"10.15.1.78","file":"data:system_distributed_everywhere/cdc_generation_descriptions_v2-234d2227dd633d37ac5fc013e2ea9e6e/snapshots/sm_20240718090409UTC/me-3ghx_0net_4rw9s2bhesqa79ne6z-big-Filter.db","_trace_id":"LkFf76dqQ-miLcvXFxGsMQ"}
Jul 18 09:04:14 manager-regression-master-monitor-node-d5777710-1 scylla-manager[10584]: {"L":"INFO","T":"2024-07-18T09:04:14.847Z","N":"backup.deduplicate","M":"Removing local snapshot file (deduplication based on generation UUID)","host":"10.15.1.78","file":"data:system_distributed_everywhere/cdc_generation_descriptions_v2-234d2227dd633d37ac5fc013e2ea9e6e/snapshots/sm_20240718090409UTC/me-3ghx_0net_4rw9s2bhesqa79ne6z-big-Index.db","_trace_id":"LkFf76dqQ-miLcvXFxGsMQ"}
Jul 18 09:04:14 manager-regression-master-monitor-node-d5777710-1 scylla-manager[10584]: {"L":"INFO","T":"2024-07-18T09:04:14.847Z","N":"backup.deduplicate","M":"Removing local snapshot file (deduplication based on generation UUID)","host":"10.15.1.78","file":"data:system_distributed_everywhere/cdc_generation_descriptions_v2-234d2227dd633d37ac5fc013e2ea9e6e/snapshots/sm_20240718090409UTC/me-3ghx_0net_4rw9s2bhesqa79ne6z-big-Scylla.db","_trace_id":"LkFf76dqQ-miLcvXFxGsMQ"}
Jul 18 09:04:14 manager-regression-master-monitor-node-d5777710-1 scylla-manager[10584]: {"L":"INFO","T":"2024-07-18T09:04:14.847Z","N":"backup.deduplicate","M":"Removing local snapshot file (deduplication based on generation UUID)","host":"10.15.1.78","file":"data:system_distributed_everywhere/cdc_generation_descriptions_v2-234d2227dd633d37ac5fc013e2ea9e6e/snapshots/sm_20240718090409UTC/me-3ghx_0net_4rw9s2bhesqa79ne6z-big-Statistics.db","_trace_id":"LkFf76dqQ-miLcvXFxGsMQ"}
Jul 18 09:04:14 manager-regression-master-monitor-node-d5777710-1 scylla-manager[10584]: {"L":"INFO","T":"2024-07-18T09:04:14.848Z","N":"backup.deduplicate","M":"Removing local snapshot file (deduplication based on generation

Overall, it will be logged on debug level as there may be plenty of these entries during the backup.

Copy link
Copy Markdown
Collaborator

@Michal-Leszczynski Michal-Leszczynski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, there is a small important thing missing.
AFAIU, the d.Progress.files is initialized only during the INDEX stage. This means that pausing and resuming backup during DEDUPLICATION stage would result in d.Progress.files being empty and the whole process falling to pieces.

This is the reason why some stages allow reindexing on resume:

		// Skip completed stages
		if run.PrevID != uuid.Nil {
			// Allow reindexing if previous state is manifest creation or upload
			if stage == StageIndex && (prevStage == StageManifest || prevStage == StageUpload) { //nolint: revive
				// continue
			} else if prevStage.Index() > stage.Index() {
				return nil
			}
		}

DEDUPLICATE should be one of those stages as well.

What's funny is that it looks like checking for reindexing is bugged (by me), as the moved SCHEMA stage should also be reindexed on resume (as it is before UPLOAD, which supposedly requires reindexation - although I think that it only impacts progress recording/display there).

Comment thread pkg/service/backup/worker_deduplicate.go
@Michal-Leszczynski
Copy link
Copy Markdown
Collaborator

Perhaps it would be good to have a test which checks backup pause/resume during each stage.

Comment thread pkg/service/backup/worker_deduplicate.go Outdated
@karol-kokoszka
Copy link
Copy Markdown
Collaborator Author

Actually, there is a small important thing missing.
AFAIU, the d.Progress.files is initialized only during the INDEX stage. This means that pausing and resuming backup during DEDUPLICATION stage would result in d.Progress.files being empty and the whole process falling to pieces.

Nice catch !
It should be reindexed indeed ! Added integration test to service_deduplicate_integration_test that proves it.

@karol-kokoszka
Copy link
Copy Markdown
Collaborator Author

@Michal-Leszczynski please re-review

Comment thread pkg/service/backup/worker_deduplicate.go Outdated
@Michal-Leszczynski
Copy link
Copy Markdown
Collaborator

@karol-kokoszka kudos for adding dedicated deduplication test. I re-review the changes and left some small comments.

@karol-kokoszka karol-kokoszka merged commit 50ad87c into master Jul 24, 2024
@karol-kokoszka karol-kokoszka deleted the cheksum_comparison branch July 24, 2024 08:00
@karol-kokoszka karol-kokoszka mentioned this pull request Aug 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add explicit deduplicate stage to SM (remove basing on .crc32 or UUID as generation ID)

2 participants