rm_stm: async loop over intrusive list fix #27694
Changes from all commits:

```diff
@@ -14,6 +14,7 @@
 #include "cluster/producer_state_manager.h"
 #include "cluster/rm_stm_types.h"
 #include "cluster/snapshot.h"
+#include "cluster/tx_errc.h"
 #include "cluster/tx_gateway_frontend.h"
 #include "cluster/types.h"
 #include "container/chunked_hash_map.h"
```
```diff
@@ -1632,15 +1633,25 @@ void rm_stm::maybe_rearm_autoabort_timer(time_point_type deadline) {
 }
 
 ss::future<tx::errc> rm_stm::abort_all_txes() {
+    static constexpr uint max_concurrency = 5u;
     if (!co_await sync(_sync_timeout())) {
         co_return tx::errc::stale;
     }
 
     tx::errc last_err = tx::errc::none;
 
+    // snap the intrusive list produced_ids before yielding the cpu
+    chunked_vector<model::producer_identity> producer_ids_to_expire{
+      std::from_range,
+      std::ranges::views::transform(
+        _active_tx_producers,
+        [](const auto& producer) { return producer.id(); })};
+
     co_await ss::max_concurrent_for_each(
-      _active_tx_producers, 5, [this, &last_err](const auto& producer) {
-          return mark_expired(producer.id()).then([&last_err](tx::errc res) {
+      std::move(producer_ids_to_expire),
+      max_concurrency,
+      [this, &last_err](const auto producer_id) {
+          return mark_expired(producer_id).then([&last_err](tx::errc res) {
           if (res != tx::errc::none) {
               last_err = res;
           }
```

Review thread on lines 1641 to 1654:

> Does this PR fix a real bug?

> Not one that we've seen strike.

> But is there a bug? It's still a bug if you've never seen a crash but can describe how it might occur. Merely saying "it is dangerous" isn't quite enough?
> I'd like to get @bashtanov's thoughts on this. This was originally added for migration purposes, but I vaguely recall a discussion in the PR about the potential unsafe iteration (no synchronization when this method is called). I reviewed that PR, but it's not the same PR that added this method. Alexey, do you remember anything about it? I'm not able to find that discussion.
> Found the discussion: #26380 (comment)
> mark_expired holds a lock on every individual invocation, but is it not possible to have an interleaved modification to the list in between invocations of mark_expired?
> Yeah, for sure (that was the original concern), but I'm not sure what the conclusion of that thread was (I went silent :)). The fix makes sense, but I'd like to check with @bashtanov in case we are overlooking something.