wdberkeley
Contributor

@wdberkeley wdberkeley commented Sep 24, 2025

Add retry logic to reconciler's metastore add_objects calls to handle transport errors. Previously, the reconciler would immediately fail and abandon the reconciliation round on any metastore error.

Fixes CORE-13427

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.2.x
  • v25.1.x
  • v24.3.x

Release Notes

  • none

…errors

Add retry logic to reconciler's metastore add_objects calls to handle
transport errors. Previously, the reconciler would immediately fail and
abandon the reconciliation round on any metastore error.
@Copilot Copilot AI review requested due to automatic review settings September 24, 2025 23:39
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR adds a retry mechanism to the cloud topics reconciler to handle transport errors during metastore operations. Previously, the reconciler would fail immediately on any metastore error and abandon the reconciliation round. The change introduces resilience by retrying specifically on transport errors while preserving immediate failure behavior for other error types.

Key changes:

  • Added add_objects_with_retry method with timeout and backoff logic for transport error retries
  • Modified existing commit_objects method to use the new retry wrapper
  • Changed log level from error to warn for add_objects failures since they're now retryable
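The retry behavior summarized above can be sketched roughly as follows. This is a simplified, synchronous illustration using only the standard library, not the Seastar coroutine and retry_chain_node code in the PR. The names `add_error`, `add_objects_with_retry` (as a free function) and the callback signature are assumptions made for the example, and the doubling backoff follows a reviewer's remark that the backoff is exponential.

```cpp
#include <chrono>
#include <functional>
#include <optional>
#include <thread>

// Hypothetical error codes; the real metastore error type is not shown here.
enum class add_error { transport_error, invalid_request };

// Calls `add_objects` until it succeeds, fails with a non-retryable error,
// or the timeout expires. Returns std::nullopt on success, otherwise the
// error that ended the attempts.
inline std::optional<add_error> add_objects_with_retry(
  const std::function<std::optional<add_error>()>& add_objects,
  std::chrono::milliseconds timeout,
  std::chrono::milliseconds backoff) {
    using clock = std::chrono::steady_clock;
    const auto deadline = clock::now() + timeout;
    std::optional<add_error> last;
    do {
        last = add_objects();
        if (!last) {
            return std::nullopt; // success
        }
        if (*last != add_error::transport_error) {
            return last; // non-retryable: fail immediately
        }
        std::this_thread::sleep_for(backoff);
        backoff *= 2; // assumed exponential backoff
    } while (clock::now() < deadline);
    return last; // timed out while still seeing transport errors
}
```

With the values visible in the review snippet, a caller would pass a timeout of 5s and an initial backoff of 100ms.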

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| src/v/cloud_topics/reconciler/reconciler.h | Added declaration for new add_objects_with_retry method |
| src/v/cloud_topics/reconciler/reconciler.cc | Implemented retry logic and integrated it into commit_objects method |
| src/v/cloud_topics/reconciler/BUILD | Added dependency on retry_chain_node utility |

Contributor

@rockwotj rockwotj left a comment

Just one question, otherwise LGTM

co_return std::unexpected(add_result.error());
}

co_await ss::sleep_abortable(permit.delay, rtc.root_abort_source());
Contributor

This will throw if aborted; everywhere else we use std::unexpected. Should we do that here too?

Contributor Author

It's inside the reconciler round's try-catch so if it aborts (presumably because of shutdown) it'll abandon the round, which I think is the right thing to do.

Member

agree it's fine, but it gets a little hard to review when we are mixing std::expected and exceptions. a reasonable pattern might be to have a top-level try/catch which treats exceptions as surprises.
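
A minimal sketch of that pattern, under stated assumptions: inner steps report expected failures by value, and a single top-level wrapper converts any exception into an error value. All names here (`round_error`, `commit_step`, `run_round`) are hypothetical, and `std::variant` stands in for `std::expected` so the example builds as C++17.

```cpp
#include <exception>
#include <stdexcept>
#include <variant>

// Hypothetical error type; std::monostate in the variant means success.
enum class round_error { metastore_failure, unexpected_exception };
using round_result = std::variant<std::monostate, round_error>;

// Inner step: reports expected failures by value, but may still throw,
// for example when a sleep is aborted during shutdown.
inline round_result commit_step(bool fail, bool abort) {
    if (abort) {
        throw std::runtime_error("sleep aborted");
    }
    if (fail) {
        return round_error::metastore_failure;
    }
    return std::monostate{};
}

// Top-level wrapper: the only place where exceptions are caught. Anything
// thrown below is treated as a surprise and mapped to one error value.
inline round_result run_round(bool fail, bool abort) noexcept {
    try {
        return commit_step(fail, abort);
    } catch (const std::exception&) {
        return round_error::unexpected_exception;
    } catch (...) {
        return round_error::unexpected_exception;
    }
}
```

The design choice is that reviewers only need to check one catch site, while everything beneath it can be read as value-returning code.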

std::unique_ptr<l1::metastore::object_metadata_builder> meta_builder,
l1::metastore::term_offset_map_t terms) {
static constexpr auto timeout = 5s;
static constexpr auto backoff = 100ms;
Contributor

We may have some existing config parameters that can be used here.

Contributor Author

Poked around a bit and couldn't find any config properties or general parameters in cloud topics for this. Did you have something in mind? Maybe there are some numbers used in multiple places that we could factor out into configuration?

retry_chain_node rtc(_as, ss::lowres_clock::now() + timeout, backoff);
retry_chain_logger ctxlog(lg, rtc, "add_objects");
for (auto permit = rtc.retry(); permit.is_allowed; permit = rtc.retry()) {
auto add_result = co_await _metastore->add_objects(
Contributor

Should the retry live inside metastore->add_objects instead? That code should have a better "understanding" of how the requests should be retried.

Contributor

Requests are idempotent, and Andrew considered this to be the simplest thing: https://redpandadata.atlassian.net/browse/CORE-13427?focusedCommentId=89779

Contributor

this is not ideal because this way we're baking some assumptions about the metastore and its commands into the implementation of the reconciler (specifically, the fact that the command is idempotent and that the backoff is exponential), but maybe it's not a major concern at this moment

"Non-retryable error adding objects to the L1 metastore: {}",
add_result.error());
co_return std::unexpected(add_result.error());
}
Contributor

The fact that we got a metastore transport_error doesn't mean that the add_objects request wasn't applied to the metastore. Is add_objects idempotent?

Contributor

Yes it is
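
To illustrate why idempotency matters for this exchange, here is a toy sketch of an idempotent add. It is purely an assumption for illustration: the thread does not show how the real L1 metastore achieves idempotency, and `toy_metastore` and `add_object` are hypothetical names.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical in-memory store keyed by object ID.
class toy_metastore {
public:
    // Adding the same object ID twice leaves the state unchanged, so a
    // caller that retries after a lost response cannot create duplicates.
    // Returns true if the object was newly inserted, false if it existed.
    bool add_object(const std::string& id, long size_bytes) {
        return _objects.emplace(id, size_bytes).second;
    }

    std::size_t size() const { return _objects.size(); }

private:
    std::map<std::string, long> _objects;
};
```

If the first request was applied but its response was lost to a transport error, the retried call hits the existing key and the stored state is the same as if the request had been sent once.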

@vbotbuildovich
Collaborator

CI test results

test results on build#72901
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ShadowLinkingReplicationTests | test_replication_with_failures | null | integration | https://buildkite.com/redpanda/redpanda/builds/72901#0199818c-33ec-478e-95e4-778742970c58 | FLAKY | 20/21 | upstream reliability is '99.78902953586498'. current run reliability is '95.23809523809523'. drift is 4.55093 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_replication_with_failures |
| ClusterRateQuotaTest | test_client_group_consume_rate_throttle_mechanism | null | integration | https://buildkite.com/redpanda/redpanda/builds/72901#0199818c-33e9-48bf-9e52-63a9454fb0b9 | FLAKY | 13/21 | upstream reliability is '81.34328358208955'. current run reliability is '61.904761904761905'. drift is 19.43852 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterRateQuotaTest&test_method=test_client_group_consume_rate_throttle_mechanism |
| ClusterRateQuotaTest | test_client_group_produce_rate_throttle_mechanism | null | integration | https://buildkite.com/redpanda/redpanda/builds/72901#0199818c-33eb-4867-8084-a7228f54d523 | FLAKY | 13/21 | upstream reliability is '83.04033092037228'. current run reliability is '61.904761904761905'. drift is 21.13557 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterRateQuotaTest&test_method=test_client_group_produce_rate_throttle_mechanism |
| ClusterRateQuotaTest | test_client_response_throttle_mechanism | null | integration | https://buildkite.com/redpanda/redpanda/builds/72901#0199818c-33ee-4c38-ad45-d08ff032aac7 | FLAKY | 9/21 | upstream reliability is '81.81818181818183'. current run reliability is '42.857142857142854'. drift is 38.96104 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterRateQuotaTest&test_method=test_client_response_throttle_mechanism |
| ClusterRateQuotaTest | test_client_response_throttle_mechanism_applies_to_next_request | null | integration | https://buildkite.com/redpanda/redpanda/builds/72901#0199818c-33ef-4054-aa6d-9945b1f7c954 | FLAKY | 14/21 | upstream reliability is '82.39999999999999'. current run reliability is '66.66666666666666'. drift is 15.73333 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterRateQuotaTest&test_method=test_client_response_throttle_mechanism_applies_to_next_request |
| DatalakeCustomPartitioningTest | test_many_partitions | {"catalog_type": "rest_jdbc", "cloud_storage_type": 1} | integration | https://buildkite.com/redpanda/redpanda/builds/72901#0199818c-33e9-48bf-9e52-63a9454fb0b9 | FLAKY | 14/21 | upstream reliability is '100.0'. current run reliability is '66.66666666666666'. drift is 33.33333 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DatalakeCustomPartitioningTest&test_method=test_many_partitions |
| RandomNodeOperationsTest | test_node_operations | {"cloud_storage_type": 2, "compaction_mode": "sliding_window", "enable_failures": false, "mixed_versions": true, "with_iceberg": false} | integration | https://buildkite.com/redpanda/redpanda/builds/72901#0199818c-f448-4701-a09f-77116ef6312e | FLAKY | 20/21 | upstream reliability is '98.54545454545455'. current run reliability is '95.23809523809523'. drift is 3.30736 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RandomNodeOperationsTest&test_method=test_node_operations |
| DisablingPartitionsTest | test_disable | null | integration | https://buildkite.com/redpanda/redpanda/builds/72901#0199818c-33f1-46e8-9875-1cc43ad4e0be | FLAKY | 14/21 | upstream reliability is '82.59493670886076'. current run reliability is '66.66666666666666'. drift is 15.92827 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DisablingPartitionsTest&test_method=test_disable |
| WriteCachingFailureInjectionTest | test_unavoidable_data_loss | null | integration | https://buildkite.com/redpanda/redpanda/builds/72901#0199818c-f446-404c-89d5-63b5d48b2298 | FLAKY | 20/21 | upstream reliability is '95.23809523809523'. current run reliability is '95.23809523809523'. drift is 0.0 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=WriteCachingFailureInjectionTest&test_method=test_unavoidable_data_loss |
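
The figures in the report are consistent with simple arithmetic: the current run reliability matches passed runs divided by total runs, and the drift matches upstream reliability minus current reliability. The sketch below reproduces that arithmetic. The function names are hypothetical, and the pass rule (drift no greater than the allowed drift) is an assumption inferred from the report text, not taken from the bot's source.

```cpp
#include <cmath>

// Pass rate of the current run, as a percentage (e.g. 20 of 21 runs).
inline double reliability(int passed, int total) {
    return 100.0 * passed / total;
}

// Drift as it appears in the report: upstream reliability minus current.
inline double drift(double upstream, double current) {
    return upstream - current;
}

// Assumed rule: the test is expected to pass while drift stays within
// the allowed drift (50 in this report).
inline bool within_allowed_drift(double d, double allowed) {
    return d <= allowed;
}
```

For the first row, `drift(99.78902953586498, reliability(20, 21))` is about 4.55093, which matches the reported value.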

@dotnwat
Member

dotnwat commented Sep 25, 2025

every test passed with the exception of a flaky upgrade test that has nothing to do with cloud topics or the reconciler.

@dotnwat dotnwat merged commit a26ec3c into redpanda-data:dev Sep 25, 2025
16 of 19 checks passed

6 participants