Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

archival: Add replica validator #24626

Open
wants to merge 1 commit into
base: dev
Choose a base branch
from

Conversation

Lazin
Copy link
Contributor

@Lazin Lazin commented Dec 20, 2024

The validator is used to check if replica state is not diverged from the previous leader.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.3.x
  • v24.2.x
  • v24.1.x

Release Notes

  • none

@Lazin Lazin marked this pull request as draft December 20, 2024 16:47
@Lazin Lazin force-pushed the pr/detect-ot-problems branch from 94ba70b to 8a839b2 Compare December 20, 2024 16:48
The validator is used to check if replica state is not diverged from
the previous leader.

Signed-off-by: Evgeny Lazin <[email protected]>
@Lazin Lazin force-pushed the pr/detect-ot-problems branch from 8a839b2 to 4126158 Compare December 20, 2024 16:52
@Lazin Lazin marked this pull request as ready for review December 23, 2024 09:59
@Lazin Lazin requested review from dotnwat and andrwng December 23, 2024 09:59
@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Dec 23, 2024

Retry command for Build#60061

please wait until all jobs are finished before running the slash command



/ci-repeat 1
tests/rptest/tests/topic_recovery_test.py::TopicRecoveryTest.test_missing_segment@{"cloud_storage_type":2}
tests/rptest/tests/topic_recovery_test.py::TopicRecoveryTest.test_missing_segment@{"cloud_storage_type":1}

Copy link
Contributor

@nvartolomei nvartolomei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea looks all right 👍

A bit concerned that we would allow the uploads to proceed if cloud_storage_disable_upload_consistency_checks is set. First of all, in which case would it be fine to allow uploads to proceed rather than potentially going and fixing the manifest first?

Also, hah, what would "fixing" mean?

Other concern is that we overload this setting with too much, it is also global so large blast radius.

We should probably have a per-partition mechanism to allow inconsistencies only at specific offsets. (Not for this PR of course)

// Get the last uploaded segment and try to translate one of its offsets
// and compare the results.
auto last_segment = manifest.last_segment();
if (last_segment.has_value() && local_so < last_segment->base_offset) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

<=?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Missing test too.

// Last segment exists in the manifest and can be translated using
// local offset translation state.

auto expected_delta = last_segment->delta_offset;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it make sense to also validate delta_offset_end?

/// over cluster replicas. If the archiver starts on a replica
/// with inconsistent offset translator state it should be able
/// to detect this and report the problem.
class replica_state_validator {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this need to be a class? i think it would be better as a free function with a struct result

a bit unusual for a validator class to do work in constructor and then have no other use

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the method could accept a log object and a manifest object only, no need for a full blown partition

will make testing easier, less includes too


bool has_anomalies() const noexcept;

const std::deque<replica_state_anomaly> get_anomalies() const noexcept;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dead code?

#include "cluster/fwd.h"
#include "cluster/partition.h"

#include <seastar/core/shared_ptr.hh>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unused include?


#pragma once

#include "cluster/fwd.h"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unused?

@vbotbuildovich
Copy link
Collaborator

CI test results

test results on build#60061
test_id test_kind job_url test_status passed
rptest.tests.datalake.partition_movement_test.PartitionMovementTest.test_cross_core_movements.cloud_storage_type=CloudStorageType.S3 ducktape https://buildkite.com/redpanda/redpanda/builds/60061#0193f354-c71d-4784-a89e-0cad15ee2297 FLAKY 3/6
rptest.tests.topic_recovery_test.TopicRecoveryTest.test_missing_segment.cloud_storage_type=CloudStorageType.ABS ducktape https://buildkite.com/redpanda/redpanda/builds/60061#0193f339-2d8b-4499-b1ab-d2a39e71b79b FAIL 0/1
rptest.tests.topic_recovery_test.TopicRecoveryTest.test_missing_segment.cloud_storage_type=CloudStorageType.ABS ducktape https://buildkite.com/redpanda/redpanda/builds/60061#0193f354-c71c-4c6e-bfb2-897453dcec9c FAIL 0/1
rptest.tests.topic_recovery_test.TopicRecoveryTest.test_missing_segment.cloud_storage_type=CloudStorageType.S3 ducktape https://buildkite.com/redpanda/redpanda/builds/60061#0193f339-2d8f-4b72-940a-45ce3fa8a38f FAIL 0/1
rptest.tests.topic_recovery_test.TopicRecoveryTest.test_missing_segment.cloud_storage_type=CloudStorageType.S3 ducktape https://buildkite.com/redpanda/redpanda/builds/60061#0193f354-c71d-4bac-9ed3-daf0ecbf9191 FAIL 0/1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants