tx/producer eviction: fix a bug with incorrect eviction using stale pids #24852

bharathv · 2025-01-17T00:32:58Z

The pid currently captured in the lambda could become stale if the producer
got fenced (with an epoch bump). This change requires a pid to be passed
as part of the eviction hook, which is the current pid at the time of
eviction.
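
To illustrate the failure mode described above, here is a minimal sketch of a hook that captures a pid at registration time versus one that receives the pid as an argument at eviction time. The `producer_identity` and `producer` types and the `evict_after_fence` helper are hypothetical stand-ins written for this example; they are not the actual `model::producer_identity` or producer state classes.

```cpp
#include <cstdint>
#include <functional>
#include <utility>

// Hypothetical stand-in for a producer identity: an id plus an epoch.
struct producer_identity {
    int64_t id;
    int16_t epoch;
};

// Hypothetical producer state, assumed for illustration only.
class producer {
public:
    // The hook receives the pid as an argument instead of relying on a capture.
    using eviction_hook = std::function<void(producer_identity)>;

    producer(producer_identity pid, eviction_hook hook)
      : _pid(pid)
      , _hook(std::move(hook)) {}

    // Fencing bumps the epoch, so any previously copied pid is now stale.
    void fence() { ++_pid.epoch; }

    // Eviction passes the pid that is current at this moment.
    void evict() { _hook(_pid); }

private:
    producer_identity _pid;
    eviction_hook _hook;
};

// Returns {pid seen through the capture, pid seen through the hook argument}.
std::pair<producer_identity, producer_identity> evict_after_fence() {
    producer_identity initial{42, 0};
    producer_identity from_capture{};
    producer_identity from_argument{};

    // `initial` is copied into the lambda when the hook is registered (old pattern);
    // `current` is supplied by the producer when it is evicted (new pattern).
    producer p(
      initial,
      [initial, &from_capture, &from_argument](producer_identity current) {
          from_capture = initial;
          from_argument = current;
      });

    p.fence();
    p.evict();
    return {from_capture, from_argument};
}
```

In this sketch the captured copy still carries epoch 0 after the fence, while the argument carries epoch 1, so cleanup keyed on the captured value would target the wrong pid.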

Backports Required

Release Notes

Bug Fixes

Fixes an issue where transactions incorrectly time out due to incorrect cleanup of evicted producers.

bharathv · 2025-01-17T00:45:39Z

/ci-repeat 3

bharathv · 2025-01-17T02:53:14Z

/ci-repeat 3

bharathv · 2025-01-17T16:24:35Z

src/v/kafka/server/errors.h

@@ -133,6 +133,7 @@ constexpr error_code map_tx_errc(cluster::tx::errc ec) {
    case cluster::tx::errc::partition_not_exists:
    case cluster::tx::errc::not_coordinator:
    case cluster::tx::errc::stale:
+    case cluster::tx::errc::producer_creation_error:


This error mapping needs more 👀. The intent was to map to a retriable error code that works with both idempotent and transactional producers, and not_coordinator seems the safest. It provides some natural backoff and retry, and in the meantime producers may get evicted to make space for the incoming request.
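
As a rough sketch of the mapping discussed in this comment, the snippet below groups the new error code with the other cases that fall through to `not_coordinator`. The enums `tx_errc` and `kafka_error_code` are trimmed-down, hypothetical stand-ins for `cluster::tx::errc` and the Kafka error code type; only the grouping of cases mirrors the diff above.

```cpp
// Hypothetical, trimmed-down stand-ins; the real enums have many more values.
enum class tx_errc {
    none,
    partition_not_exists,
    not_coordinator,
    stale,
    producer_creation_error,
    unknown_server_error,
};

enum class kafka_error_code {
    none,
    not_coordinator,
    unknown_server_error,
};

constexpr kafka_error_code map_tx_errc(tx_errc ec) {
    switch (ec) {
    case tx_errc::none:
        return kafka_error_code::none;
    // All of these surface to the client as not_coordinator, which clients
    // retry after a backoff.
    case tx_errc::partition_not_exists:
    case tx_errc::not_coordinator:
    case tx_errc::stale:
    case tx_errc::producer_creation_error:
        return kafka_error_code::not_coordinator;
    case tx_errc::unknown_server_error:
        return kafka_error_code::unknown_server_error;
    }
    // Unreachable for valid enum values; keeps the function total.
    return kafka_error_code::unknown_server_error;
}

static_assert(
  map_tx_errc(tx_errc::producer_creation_error)
    == kafka_error_code::not_coordinator,
  "producer creation failures should map to a retriable error");
```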

vbotbuildovich · 2025-01-17T19:17:55Z

CI test results

test results on build#60912

test_id	test_kind	job_url	test_status	passed
rptest.tests.cloud_storage_scrubber_test.CloudStorageScrubberTest.test_scrubber.cloud_storage_type=CloudStorageType.S3	ducktape	https://buildkite.com/redpanda/redpanda/builds/60912#01947540-383b-439c-83bd-150502b71043	FLAKY	1/2
rptest.tests.partition_reassignments_test.PartitionReassignmentsTest.test_reassignments_kafka_cli	ducktape	https://buildkite.com/redpanda/redpanda/builds/60912#01947545-3702-4106-a6af-a793b53fa0cb	FLAKY	1/2

test results on build#60939

test_id	test_kind	job_url	test_status	passed
rptest.tests.datalake.datalake_e2e_test.DatalakeE2ETests.test_topic_lifecycle.cloud_storage_type=CloudStorageType.S3.filesystem_catalog_mode=True	ducktape	https://buildkite.com/redpanda/redpanda/builds/60939#019481d2-6cbb-4dc1-beee-44fc89e4027e	FLAKY	1/2

test results on build#60982

test_id	test_kind	job_url	test_status	passed
rptest.tests.topic_creation_test.TopicRecreateTest.test_topic_recreation_while_producing.workload=IDEMPOTENT.cleanup_policy=compact	ducktape	https://buildkite.com/redpanda/redpanda/builds/60982#0194879f-5d76-40aa-b25d-334a0907eb7c	FLAKY	1/2

test results on build#60998

test_id	test_kind	job_url	test_status	passed
rptest.tests.partition_balancer_test.PartitionBalancerTest.test_unavailable_nodes	ducktape	https://buildkite.com/redpanda/redpanda/builds/60998#01948a1b-b97c-4462-98f3-ef1a3b77b015	FLAKY	1/2

mmaslankaprv · 2025-01-20T08:16:49Z

src/v/cluster/rm_stm.cc

@@ -1663,6 +1663,7 @@ void rm_stm::apply_fence(model::producer_identity pid, model::record_batch b) {
 }

 ss::future<> rm_stm::do_apply(const model::record_batch& b) {
+    auto units = co_await _state_lock.hold_read_lock();


I think this lock shouldn't be here. Apply should always be possible, and the state should not be modified outside of the apply method.

bharathv · 2025-01-21

> the state should not be modified outside of apply method

I believe this is still true here (i.e., the state is not modified outside of apply). This additional coordination is solely to prevent undesirable interactions between prefix truncation (which resets all internal state) and concurrently executing apply fibers.
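
The coordination described in this reply can be modeled roughly as follows. This is a single-threaded sketch under stated assumptions: `rw_gate` and `state_machine` are hypothetical names invented for the example, the gate is a simple counter rather than Seastar's lock, and the try-style methods stand in for what would be awaited in the real code.

```cpp
#include <cstddef>
#include <map>

// Hypothetical single-threaded read/write gate; not the lock used in rm_stm.
class rw_gate {
public:
    bool try_read() {
        if (_writer) {
            return false;
        }
        ++_readers;
        return true;
    }
    void release_read() { --_readers; }

    bool try_write() {
        if (_writer || _readers > 0) {
            return false;
        }
        _writer = true;
        return true;
    }
    void release_write() { _writer = false; }

private:
    int _readers = 0;
    bool _writer = false;
};

// Hypothetical state machine: apply mutates state while holding the read side,
// prefix truncation resets all state while holding the write side.
class state_machine {
public:
    // Starts an apply; refused only while a reset is in progress.
    bool begin_apply() { return _gate.try_read(); }
    void apply_batch(long producer_id, int seq) { _last_seq[producer_id] = seq; }
    void end_apply() { _gate.release_read(); }

    // Resets all state; refused while any apply is still in flight.
    bool try_prefix_truncate() {
        if (!_gate.try_write()) {
            return false;
        }
        _last_seq.clear();
        _gate.release_write();
        return true;
    }

    std::size_t tracked() const { return _last_seq.size(); }

private:
    rw_gate _gate;
    std::map<long, int> _last_seq;
};
```

State is still modified only inside apply; the gate merely keeps a reset from interleaving with an apply that has started but not finished.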

vbotbuildovich · 2025-01-21T20:29:28Z

/backport v24.3.x

vbotbuildovich · 2025-01-21T20:29:28Z

/backport v24.2.x

vbotbuildovich · 2025-01-21T20:30:34Z

Failed to create a backport PR to v24.2.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-24852-v24.2.x-102 remotes/upstream/v24.2.x
git cherry-pick -x 64e47ca809 f264502e97 176241e6ac 21781c7fc8 79362faf97 cba29a05aa

Workflow run logs.

vbotbuildovich · 2025-01-21T20:30:37Z

Failed to create a backport PR to v24.3.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-24852-v24.3.x-838 remotes/upstream/v24.3.x
git cherry-pick -x 64e47ca809 f264502e97 176241e6ac 21781c7fc8 79362faf97 cba29a05aa

Workflow run logs.

github-actions bot added the area/redpanda label Jan 17, 2025

bharathv force-pushed the fix_eviction_2 branch from 6d8dfa9 to 7ac2a8a Compare January 17, 2025 00:43

bharathv force-pushed the fix_eviction_2 branch from 7ac2a8a to fd562bb Compare January 17, 2025 02:52

bharathv requested review from mmaslankaprv, ztlpn and bashtanov January 17, 2025 05:58

bharathv marked this pull request as ready for review January 17, 2025 06:36

bharathv force-pushed the fix_eviction_2 branch 2 times, most recently from be2a94b to 2b9437f Compare January 17, 2025 16:00

bharathv commented Jan 17, 2025

bharathv added 3 commits January 19, 2025 18:05

tx/ducktape: add test to ensure fencing producers can make progress

64e47ca

.. in the presence of evictions

tx/errc: add a new error code for propagating producer errors

f264502

Currently, exceptions are thrown and propagated as generic (and confusing) RPC server errors, which callers are prone to misinterpret.

tx/errc: propagate correct ec on cache full errors

176241e

Today this is caught by the rpc_server and propagated as a server error; instead, it should be retriable from the caller side.

bharathv force-pushed the fix_eviction_2 branch from 2b9437f to 4cdbc1a Compare January 20, 2025 02:05

bharathv enabled auto-merge January 20, 2025 06:13

mmaslankaprv reviewed Jan 20, 2025

mmaslankaprv reviewed Jan 20, 2025

bharathv force-pushed the fix_eviction_2 branch from 4cdbc1a to 9fea974 Compare January 21, 2025 05:35

bharathv requested a review from mmaslankaprv January 21, 2025 05:35

bharathv added 3 commits January 21, 2025 09:01

tx/rm_stm: early cleanup evicted producers to make room for new producers

21781c7

After 4ee6b02, cleanup happens asynchronously after eviction. If there is a request for a new producer_id and the associated producer got evicted, clean it up to make room for a new producer.
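
A minimal sketch of the idea in this commit message, assuming a capped registry in which eviction only marks an entry and removal normally happens later. The `producer_registry` class, its methods, and the local `errc` enum are hypothetical names for illustration; returning an error code when the registry is full is likewise an assumption of the sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Hypothetical result type for this sketch.
enum class errc { success, producer_creation_error };

// Hypothetical capped registry of producers keyed by producer id.
class producer_registry {
public:
    explicit producer_registry(std::size_t max_producers)
      : _max(max_producers) {}

    // Eviction only marks the entry; removal is normally deferred.
    void mark_evicted(int64_t id) {
        auto it = _producers.find(id);
        if (it != _producers.end()) {
            it->second.evicted = true;
        }
    }

    errc get_or_create(int64_t id) {
        auto it = _producers.find(id);
        if (it != _producers.end()) {
            if (!it->second.evicted) {
                return errc::success;
            }
            // The requested producer was evicted but not yet cleaned up:
            // remove it now so the slot can be reused immediately.
            _producers.erase(it);
        }
        if (_producers.size() >= _max) {
            return errc::producer_creation_error;
        }
        _producers.emplace(id, entry{});
        return errc::success;
    }

    std::size_t size() const { return _producers.size(); }

private:
    struct entry {
        bool evicted = false;
    };

    std::size_t _max;
    std::unordered_map<int64_t, entry> _producers;
};
```

With a capacity of one, a request for a second producer id fails, but a new request for the evicted id succeeds because its stale entry is cleaned up on demand.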

rm_stm: fix incorrect eviction attempts using a stale pid

79362fa

The pid captured in the lambda could be stale if the pid got fenced (with an epoch bump). This change requires a pid to be passed as part of the eviction hook, which is the current pid at the time of eviction.

producer_state/hook: add a note about hook existence and a todo

cba29a0

.. to remove it

bharathv force-pushed the fix_eviction_2 branch from 9fea974 to cba29a0 Compare January 21, 2025 17:01

mmaslankaprv approved these changes Jan 21, 2025

bharathv merged commit a43f07b into redpanda-data:dev Jan 21, 2025
18 checks passed

This was referenced Jan 21, 2025

[v24.2.x] tx/producer eviction: fix a bug with incorrect eviction using stale pids #24875

Open

[v24.3.x] tx/producer eviction: fix a bug with incorrect eviction using stale pids #24876

Open

bharathv mentioned this pull request Jan 21, 2025

Transaction do_abort_tx request: not found, assuming already aborted. #24468

Closed

mmaslankaprv added a commit that referenced this pull request Jan 24, 2025

Merge pull request #24878 from bharathv/24-2x-bp-tx-ec

542abcd

[24.2.x][backport] tx/producer eviction: fix a bug with incorrect eviction using stale pids #24852
