-
Notifications
You must be signed in to change notification settings - Fork 678
[CORE-13370] archival: Fence spillover command #27714
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks reasonable. is there a link to an issue we can add or a high level description of the situation that motivated the change? or why we didn't do it before, or whatever.
e7399d9
to
b05f01c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR refactors the archival system to use a fenced spillover command approach to prevent race conditions. The change centralizes fence creation logic and adds fencing support specifically to the spillover operation.
- Introduces a new
emit_rw_fence()
method to centralize fence creation logic - Refactors multiple methods to use the centralized fence creation instead of duplicated code
- Adds proper fencing to the spillover command with fence reset between iterations
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
File | Description |
---|---|
src/v/cluster/archival/ntp_archiver_service.h | Declares new emit_rw_fence() method for centralized fence creation |
src/v/cluster/archival/ntp_archiver_service.cc | Implements emit_rw_fence() and refactors multiple methods to use it, adds fencing to spillover operations |
I think we ignored it before because it's replicated in a loop. The issue is that on CI it could also attempt to upload/replicates a spillover manifest which is slightly off in some cases. |
CI test resultstest results on build#72948
test results on build#73057
|
_rtclog.warn, | ||
"Failed to replicate spillover command: {}", | ||
error.message()); | ||
break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
given the change in control flow here, i think a bit more exposition in the commit message would be helpful. i.e. the fact that we break out of the loop on replication failure, whether that's usually attributable to the fence, why it's ok to bail at this point, what happens next, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
Also, extract fence initialization into a method in the ntp_archiver to avoid code duplication. There is a change in the control flow in the 'apply_spillover' method. Previously, the spillover wouldn't stop in case of replication error causing the error to be repeated. The loop would use manifest to create a spillover manifest and replicate the command with archival STM. The replicate method waits until the command is applied and propagates the error back to the loop. In case of error the error was printed and the loop continued. Since the state of the manifest didn't change the loop would produce the same manifesta and the same command causing new failure. This commit breaks if the spillover command can't be applied. This guarantees forward progress. Signed-off-by: Evgeny Lazin <[email protected]>
b05f01c
to
35dc6d6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
/backport v25.2.x |
/backport v25.1.x |
/backport v24.3.x |
Failed to create a backport PR to v25.1.x branch. I tried:
|
Failed to create a backport PR to v25.2.x branch. I tried:
|
Failed to create a backport PR to v24.3.x branch. I tried:
|
Replicate spillover command with the fence to avoid races.
Backports Required
Release Notes