Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CORE-627] archival: Convert assertion to exception #24580

Open
wants to merge 2 commits into
base: dev
Choose a base branch
from

Conversation

Lazin
Copy link
Contributor

@Lazin Lazin commented Dec 16, 2024

The situation in which the assertion is triggered can only be caused by a race and is retrieable. There is no need to crash the whole process in that case.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.3.x
  • v24.2.x
  • v24.1.x

Release Notes

  • none

@Lazin Lazin requested a review from WillemKauf December 16, 2024 12:17
@Lazin Lazin changed the title archival: Convert assertion to exception [CORE-627] archival: Convert assertion to exception Dec 16, 2024
WillemKauf
WillemKauf previously approved these changes Dec 16, 2024
@piyushredpanda
Copy link
Contributor

Can we have tests that make things run in parallel so this doesn't break in future, @Lazin ?

@Lazin
Copy link
Contributor Author

Lazin commented Dec 16, 2024

Can we have tests that make things run in parallel so this doesn't break in future, @Lazin ?

We do have such tests. In that case rm_stm was in a wired state and the partition move was in-progress. It's quite difficult to reproduce in a test.

@vbotbuildovich
Copy link
Collaborator

Retry command for Build#59810

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/cloud_retention_test.py::CloudRetentionTest.test_cloud_retention@{"cloud_storage_type":2,"max_consume_rate_mb":null}

@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Dec 16, 2024

CI test results

test results on build#59810
test_id test_kind job_url test_status passed
rptest.tests.cloud_retention_test.CloudRetentionTest.test_cloud_retention.max_consume_rate_mb=None.cloud_storage_type=CloudStorageType.ABS ducktape https://buildkite.com/redpanda/redpanda/builds/59810#0193d063-ea2b-4e58-bc0d-a6317ce56c06 FAIL 0/6
rptest.tests.datalake.partition_movement_test.PartitionMovementTest.test_cross_core_movements.cloud_storage_type=CloudStorageType.S3 ducktape https://buildkite.com/redpanda/redpanda/builds/59810#0193d063-ea2c-4551-b228-abbb9dcc9e85 FLAKY 3/6
rptest.tests.datalake.partition_movement_test.PartitionMovementTest.test_cross_core_movements.cloud_storage_type=CloudStorageType.S3 ducktape https://buildkite.com/redpanda/redpanda/builds/59810#0193d077-8738-4195-84b9-cca78cf8d762 FLAKY 4/6
rptest.tests.e2e_shadow_indexing_test.EndToEndShadowIndexingTestCompactedTopic.test_write.cloud_storage_type=CloudStorageType.ABS ducktape https://buildkite.com/redpanda/redpanda/builds/59810#0193d063-ea2b-4e58-bc0d-a6317ce56c06 FLAKY 5/6
test results on build#59929
test_id test_kind job_url test_status passed
gtest_raft_rpunit.gtest_raft_rpunit unit https://buildkite.com/redpanda/redpanda/builds/59929#0193db64-98b5-4d9b-ba92-47cb6c380e18 FLAKY 1/2
rptest.tests.cloud_retention_test.CloudRetentionTest.test_cloud_retention.max_consume_rate_mb=None.cloud_storage_type=CloudStorageType.ABS ducktape https://buildkite.com/redpanda/redpanda/builds/59929#0193dbc1-4178-4a5e-b415-b751d2860a53 FAIL 0/6
rptest.tests.datalake.partition_movement_test.PartitionMovementTest.test_cross_core_movements.cloud_storage_type=CloudStorageType.S3 ducktape https://buildkite.com/redpanda/redpanda/builds/59929#0193dbbe-1885-4f1c-b06c-37ef09b1ee02 FLAKY 4/6
rptest.tests.enterprise_features_license_test.EnterpriseFeaturesTest.test_enable_features.feature=Feature.oidc.install_license=False.disable_trial=False ducktape https://buildkite.com/redpanda/redpanda/builds/59929#0193dbc1-417a-44a8-8a70-b76ba5ed944f FAIL 0/1

@Lazin
Copy link
Contributor Author

Lazin commented Dec 16, 2024

/ci-repeat 1
tests/rptest/tests/cloud_retention_test.py::CloudRetentionTest.test_cloud_retention@{"cloud_storage_type":2,"max_consume_rate_mb":null}

@Lazin
Copy link
Contributor Author

Lazin commented Dec 17, 2024

All jobs have failed after few minutes due to some infra issue. Restarted.

WillemKauf
WillemKauf previously approved these changes Dec 18, 2024
upl->file_offset <= upl->final_file_offset,
"Invalid upload candidate {}",
upl);
if (upl->sources.size() == 1 && upl->file_offset > upl->final_file_offset) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: why does upl->sources.size() need to equal 1 for this race condition?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess that here it's not necessary to check this because sources.size() is always 1 (this path never merges data from several segments). Potentially, the upload may contain multiple segments. In this case it's totally fine for it to have any start/stop file offsets because the upload begins in one segment and ends in another segment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can removes this because it's unnecessary and is probably confusing. WDYT?

Copy link
Contributor

@WillemKauf WillemKauf Dec 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I would remove it if it's unnecessary/confusing.

The comment below (// Normally this shouldn't happen...) might also be better if it were altered to highlight under what circumstances this path would be taken.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, please take a look

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! 👍

The situation in which the assertion is triggered can only be caused
by a race and is retrieable. There is no need to crash the whole process
in that case.

Signed-off-by: Evgeny Lazin <[email protected]>
@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Dec 19, 2024

Retry command for Build#59929

please wait until all jobs are finished before running the slash command


/ci-repeat 1
tests/rptest/tests/enterprise_features_license_test.py::EnterpriseFeaturesTest.test_enable_features@{"disable_trial":false,"feature":5,"install_license":false}
tests/rptest/tests/cloud_retention_test.py::CloudRetentionTest.test_cloud_retention@{"cloud_storage_type":2,"max_consume_rate_mb":null}

@Lazin
Copy link
Contributor Author

Lazin commented Dec 19, 2024

/ci-repeat 1
tests/rptest/tests/enterprise_features_license_test.py::EnterpriseFeaturesTest.test_enable_features@{"disable_trial":false,"feature":5,"install_license":false}
tests/rptest/tests/cloud_retention_test.py::CloudRetentionTest.test_cloud_retention@{"cloud_storage_type":2,"max_consume_rate_mb":null}

@Lazin Lazin enabled auto-merge December 20, 2024 12:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants