-
Notifications
You must be signed in to change notification settings - Fork 601
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
tx/producer eviction: fix a bug with incorrect eviction using stale pids #24852
Conversation
6d8dfa9
to
7ac2a8a
Compare
/ci-repeat 3 |
7ac2a8a
to
fd562bb
Compare
/ci-repeat 3 |
be2a94b
to
2b9437f
Compare
@@ -133,6 +133,7 @@ constexpr error_code map_tx_errc(cluster::tx::errc ec) { | |||
case cluster::tx::errc::partition_not_exists: | |||
case cluster::tx::errc::not_coordinator: | |||
case cluster::tx::errc::stale: | |||
case cluster::tx::errc::producer_creation_error: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This error mapping needs more 👀 .. the intent was to map to a retriable error code that works with both idempotent and transactional producers and not_coordinator seems the safest. It provides some natural backoff + retry and meanwhile the producers may get evicted to make space for the incoming request.
CI test resultstest results on build#60912
test results on build#60939
test results on build#60982
test results on build#60998
|
.. in the presence of evictions
Currently exceptions are thrown which are propagated as generic (and confusing) RPC server errors which are prone to misinterpretation by the callers.
Today this is caught by the rpc_server and propagated as a server error, instead this should be retriable from the caller side.
2b9437f
to
4cdbc1a
Compare
src/v/cluster/rm_stm.cc
Outdated
@@ -1663,6 +1663,7 @@ void rm_stm::apply_fence(model::producer_identity pid, model::record_batch b) { | |||
} | |||
|
|||
ss::future<> rm_stm::do_apply(const model::record_batch& b) { | |||
auto units = co_await _state_lock.hold_read_lock(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think this lock shouldn't be here. Apply should always be possible and the state should not be modified outside of apply method
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the state should not be modified outside of apply method
I believe this is still true here (ie, the state is not modified outside of apply). This additional coordination is solely to prevent undesirable interactions between prefix truncation (which resets all internal state) and concurrently executing apply fibers.
4cdbc1a
to
9fea974
Compare
…cers After 4ee6b02 cleanup happens asynchronously after eviction. If there is a request for a new producer_id and the associated producer got evicted, clean it up to make room for a new producer.
pid is captured in the lambda could be a stale if the pid got fenced (with an epoch bump). The change forces to provide a pid as a part of the eviction hook, which would be the current pid at the time of eviction.
9fea974
to
cba29a0
Compare
/backport v24.3.x |
/backport v24.2.x |
Failed to create a backport PR to v24.2.x branch. I tried:
|
Failed to create a backport PR to v24.3.x branch. I tried:
|
[24.2.x][backport] tx/producer eviction: fix a bug with incorrect eviction using stale pids #24852
pid is currently captured in the lambda could become stale if it got fenced
(with an epoch bump). The change forces to provide a pid as a part of
the eviction hook, which would be the current pid at the time of
eviction.
Backports Required
Release Notes
Bug Fixes