-
Notifications
You must be signed in to change notification settings - Fork 678
cloud_topics/reconciler: add retry mechanism for transport errors #27724
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…errors Add retry logic to reconciler's metastore add_objects calls to handle transport errors. Previously, the reconciler would immediately fail and abandon the reconciliation round on any metastore error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds a retry mechanism to the cloud topics reconciler to handle transport errors during metastore operations. Previously, the reconciler would fail immediately on any metastore error and abandon the reconciliation round. The change introduces resilience by retrying specifically on transport errors while preserving immediate failure behavior for other error types.
Key changes:
- Added
add_objects_with_retry
method with timeout and backoff logic for transport error retries - Modified existing
commit_objects
method to use the new retry wrapper - Changed log level from error to warn for add_objects failures since they're now retryable
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
File | Description |
---|---|
src/v/cloud_topics/reconciler/reconciler.h |
Added declaration for new add_objects_with_retry method |
src/v/cloud_topics/reconciler/reconciler.cc |
Implemented retry logic and integrated it into commit_objects method |
src/v/cloud_topics/reconciler/BUILD |
Added dependency on retry_chain_node utility |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just one question, otherwise LGTM
co_return std::unexpected(add_result.error()); | ||
} | ||
|
||
co_await ss::sleep_abortable(permit.delay, rtc.root_abort_source()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will throw if aborted, everywhere else we use std::unexpected
. Should we do that too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's inside the reconciler round's try-catch so if it aborts (presumably because of shutdown) it'll abandon the round, which I think is the right thing to do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree it's fine, but it's get a little hard to review when we are just mixing std::expected and exceptions. a reasonable pattern might be to have a top-level try/catch which treats exceptions as surprises.
std::unique_ptr<l1::metastore::object_metadata_builder> meta_builder, | ||
l1::metastore::term_offset_map_t terms) { | ||
static constexpr auto timeout = 5s; | ||
static constexpr auto backoff = 100ms; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may have some existing config parameters that can be used here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Poked around a bit and couldn't find any config properties or general parameters in cloud topics for this. Did you have something in mind? Maybe there are some numbers that could be applied multiple places we could factor out into configuration?
retry_chain_node rtc(_as, ss::lowres_clock::now() + timeout, backoff); | ||
retry_chain_logger ctxlog(lg, rtc, "add_objects"); | ||
for (auto permit = rtc.retry(); permit.is_allowed; permit = rtc.retry()) { | ||
auto add_result = co_await _metastore->add_objects( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the code inside the metastore->add_objects retry? It should have better "understanding" on how the requests should be retried.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Requests are idempotent, and Andrew considered this to be the simplest thing: https://redpandadata.atlassian.net/browse/CORE-13427?focusedCommentId=89779
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not ideal because this way we're baking in some assumptions about the metastore and it's commands into the implementation of the reconciler (specifically, the fact that the command is idempotent and that the backoff is exponential) but maybe it's not a major concern at this moment
"Non-retryable error adding objects to the L1 metastore: {}", | ||
add_result.error()); | ||
co_return std::unexpected(add_result.error()); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fact that we got metastore transport_error doesn't mean that the add_objects request wasn't applied to the metastore. Is add_objects idempotent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it is
CI test resultstest results on build#72901
|
every test passed with the exception of a flaky upgrade test that has nothing to do with cloud topics or the reconciler. |
Add retry logic to reconciler's metastore add_objects calls to handle transport errors. Previously, the reconciler would immediately fail and abandon the reconciliation round on any metastore error.
Fixes CORE-13427
Backports Required
Release Notes