tx/producer eviction: fix a bug with incorrect eviction using stale pids #24852

Merged
merged 6 commits into from
Jan 21, 2025

Contributor

@bharathv bharathv commented Jan 17, 2025

The pid currently captured in the lambda could become stale if the producer
got fenced (with an epoch bump). This change requires a pid to be passed to
the eviction hook as an argument, which is the current pid at the time of
eviction.
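
A minimal sketch of the idea, not the actual Redpanda code: `producer_identity` here is a simplified stand-in for `model::producer_identity`, and `producer_state`, `eviction_hook`, and `evicted_pid_after_fence` are hypothetical names chosen for illustration. The hook takes the pid as a parameter, so it sees the epoch as of eviction time rather than whatever was copied when the lambda was created.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>

// Simplified stand-in for model::producer_identity: a producer id plus an epoch.
struct producer_identity {
    std::int64_t id;
    std::int16_t epoch;
};

// Hypothetical producer state, not the real class.
class producer_state {
public:
    // The hook takes the pid as an argument instead of capturing one.
    using eviction_hook = std::function<void(producer_identity)>;

    producer_state(producer_identity pid, eviction_hook hook)
      : _pid(pid)
      , _hook(std::move(hook)) {
        assert(_hook && "an eviction hook is required");
    }

    // Fencing bumps the epoch; any pid copied before this call is now stale.
    void fence() { ++_pid.epoch; }

    // Hand the hook the pid as it is right now.
    void evict() { _hook(_pid); }

private:
    producer_identity _pid;
    eviction_hook _hook;
};

// Fences a producer once, evicts it, and returns the pid the hook received.
inline producer_identity evicted_pid_after_fence(producer_identity initial) {
    producer_identity seen{-1, -1};
    producer_state p(
      initial, [&seen](producer_identity current) { seen = current; });
    p.fence();
    p.evict();
    return seen;
}
```

Had the lambda captured `initial` by value instead, the cleanup would have run against epoch 0 even though the producer was at epoch 1 by the time it was evicted.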

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.3.x
  • v24.2.x
  • v24.1.x

Release Notes

Bug Fixes

  • Fixes an issue where transactions incorrectly time out due to incorrect cleanup of evicted producers.

@bharathv
Contributor Author

/ci-repeat 3

@bharathv
Contributor Author

/ci-repeat 3

@bharathv bharathv marked this pull request as ready for review January 17, 2025 06:36
@bharathv bharathv force-pushed the fix_eviction_2 branch 2 times, most recently from be2a94b to 2b9437f Compare January 17, 2025 16:00
@@ -133,6 +133,7 @@ constexpr error_code map_tx_errc(cluster::tx::errc ec) {
case cluster::tx::errc::partition_not_exists:
case cluster::tx::errc::not_coordinator:
case cluster::tx::errc::stale:
case cluster::tx::errc::producer_creation_error:
Contributor Author

This error mapping needs more 👀. The intent was to map to a retriable error code that works with both idempotent and transactional producers, and not_coordinator seems the safest. It provides some natural backoff and retry, and in the meantime producers may get evicted to make space for the incoming request.
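
A rough sketch of the mapping described in this comment. The two enums are trimmed-down, hypothetical versions that model only the values visible in the hunk plus `none`; the hunk does not show the return statement, so the `not_coordinator` result is taken from the comment rather than from the diff.

```cpp
// Hypothetical, reduced enums; the real ones have many more values.
enum class tx_errc {
    none,
    partition_not_exists,
    not_coordinator,
    stale,
    producer_creation_error,
};

enum class kafka_error_code {
    none,
    not_coordinator,
    unknown_server_error,
};

constexpr kafka_error_code map_tx_errc(tx_errc ec) {
    switch (ec) {
    case tx_errc::none:
        return kafka_error_code::none;
    case tx_errc::partition_not_exists:
    case tx_errc::not_coordinator:
    case tx_errc::stale:
    case tx_errc::producer_creation_error:
        // Assumed from the review comment: one retriable code that both
        // idempotent and transactional producers handle by backing off.
        return kafka_error_code::not_coordinator;
    }
    // Anything unmapped falls back to a generic server error.
    return kafka_error_code::unknown_server_error;
}
```

Grouping the new case with the existing fall-through labels keeps the client-visible behavior identical across all four conditions.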

@vbotbuildovich
Collaborator

vbotbuildovich commented Jan 17, 2025

CI test results

test results on build#60912

| test_id | test_kind | job_url | test_status | passed |
| --- | --- | --- | --- | --- |
| rptest.tests.cloud_storage_scrubber_test.CloudStorageScrubberTest.test_scrubber.cloud_storage_type=CloudStorageType.S3 | ducktape | https://buildkite.com/redpanda/redpanda/builds/60912#01947540-383b-439c-83bd-150502b71043 | FLAKY | 1/2 |
| rptest.tests.partition_reassignments_test.PartitionReassignmentsTest.test_reassignments_kafka_cli | ducktape | https://buildkite.com/redpanda/redpanda/builds/60912#01947545-3702-4106-a6af-a793b53fa0cb | FLAKY | 1/2 |

test results on build#60939

| test_id | test_kind | job_url | test_status | passed |
| --- | --- | --- | --- | --- |
| rptest.tests.datalake.datalake_e2e_test.DatalakeE2ETests.test_topic_lifecycle.cloud_storage_type=CloudStorageType.S3.filesystem_catalog_mode=True | ducktape | https://buildkite.com/redpanda/redpanda/builds/60939#019481d2-6cbb-4dc1-beee-44fc89e4027e | FLAKY | 1/2 |

test results on build#60982

| test_id | test_kind | job_url | test_status | passed |
| --- | --- | --- | --- | --- |
| rptest.tests.topic_creation_test.TopicRecreateTest.test_topic_recreation_while_producing.workload=IDEMPOTENT.cleanup_policy=compact | ducktape | https://buildkite.com/redpanda/redpanda/builds/60982#0194879f-5d76-40aa-b25d-334a0907eb7c | FLAKY | 1/2 |

test results on build#60998

| test_id | test_kind | job_url | test_status | passed |
| --- | --- | --- | --- | --- |
| rptest.tests.partition_balancer_test.PartitionBalancerTest.test_unavailable_nodes | ducktape | https://buildkite.com/redpanda/redpanda/builds/60998#01948a1b-b97c-4462-98f3-ef1a3b77b015 | FLAKY | 1/2 |

Currently, exceptions are thrown and propagated as generic (and
confusing) RPC server errors, which callers are prone to misinterpret.
Today this is caught by the rpc_server and propagated as a server error;
instead, it should be retriable from the caller side.
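
One way to picture the change, as a sketch under assumptions rather than the actual implementation: `producer_table`, `try_add`, and `is_retriable` are hypothetical names, and the capacity limit is an assumed reason for creation to fail. The point is only that the failure comes back as a value the caller can inspect and retry on, instead of an exception that surfaces as a generic server error.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>

// Reduced, hypothetical error enum for this sketch.
enum class errc { none, producer_creation_error };

// Hypothetical bounded table of producer ids.
class producer_table {
public:
    explicit producer_table(std::size_t max_producers)
      : _max(max_producers) {}

    // Reports a full table through the return value instead of throwing.
    errc try_add(std::int64_t id) {
        if (_ids.count(id) > 0) {
            return errc::none; // already registered
        }
        if (_ids.size() >= _max) {
            return errc::producer_creation_error;
        }
        _ids.insert(id);
        return errc::none;
    }

    // Frees a slot, as an eviction would.
    void evict(std::int64_t id) { _ids.erase(id); }

private:
    std::size_t _max;
    std::unordered_set<std::int64_t> _ids;
};

// Caller-side view: this code means "back off and try again".
constexpr bool is_retriable(errc ec) {
    return ec == errc::producer_creation_error;
}
```

A caller that sees a retriable code can back off, and a later attempt succeeds once an eviction has freed a slot.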
@@ -1663,6 +1663,7 @@ void rm_stm::apply_fence(model::producer_identity pid, model::record_batch b) {
}

ss::future<> rm_stm::do_apply(const model::record_batch& b) {
auto units = co_await _state_lock.hold_read_lock();
Member

I think this lock shouldn't be here. Apply should always be possible, and the state should not be modified outside of the apply method.

Contributor Author

the state should not be modified outside of apply method

I believe this is still true here (i.e., the state is not modified outside of apply). This additional coordination is solely to prevent undesirable interactions between prefix truncation (which resets all internal state) and concurrently executing apply fibers.
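
The coordination described here can be sketched with standard C++ primitives. This is an analogy, not the rm_stm code: the real implementation uses Seastar fibers and its own lock type, while this version uses `std::shared_mutex` and threads-style locking, and `state_machine`, `apply`, and `reset_on_prefix_truncate` are hypothetical names. Apply takes the lock in shared mode, so applies never wait on each other; only the reset takes it exclusively.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <mutex>
#include <shared_mutex>
#include <string>
#include <utility>

// Hypothetical state machine; a std::map stands in for the internal state.
class state_machine {
public:
    // Shared mode: many applies may run at once, but none overlaps a reset.
    void apply(std::int64_t offset, std::string value) {
        std::shared_lock<std::shared_mutex> units(_state_lock);
        // With real threads, shared holders still need to serialize
        // writes to the map itself.
        std::lock_guard<std::mutex> guard(_map_mutex);
        _state[offset] = std::move(value);
    }

    // Exclusive mode: waits for in-flight applies, then wipes everything.
    void reset_on_prefix_truncate() {
        std::unique_lock<std::shared_mutex> units(_state_lock);
        _state.clear();
    }

    std::size_t size() {
        std::shared_lock<std::shared_mutex> units(_state_lock);
        std::lock_guard<std::mutex> guard(_map_mutex);
        return _state.size();
    }

private:
    std::shared_mutex _state_lock;
    std::mutex _map_mutex;
    std::map<std::int64_t, std::string> _state;
};
```

State is still only written inside `apply`; the shared lock adds no new writer, it just keeps a reset from landing in the middle of one.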

…cers

After 4ee6b02, cleanup happens asynchronously after eviction. If
there is a request for a new producer_id and the associated producer has
been evicted, clean it up to make room for a new producer.
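
A small sketch of that lookup path, assuming a simple id-keyed table: `producer`, `producer_registry`, `mark_evicted`, and `get_or_create` are hypothetical names, and the `evicted` flag stands in for whatever state the real code uses to track a producer whose asynchronous cleanup has not run yet.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <unordered_map>

// Hypothetical producer record.
struct producer {
    std::int64_t id = 0;
    bool evicted = false; // set at eviction; table removal happens later
};

// Hypothetical registry keyed by producer id.
class producer_registry {
public:
    // Eviction only flags the entry; the entry lingers until cleanup.
    void mark_evicted(std::int64_t id) {
        auto it = _producers.find(id);
        if (it != _producers.end()) {
            it->second->evicted = true;
        }
    }

    std::shared_ptr<producer> get_or_create(std::int64_t id) {
        auto it = _producers.find(id);
        if (it != _producers.end()) {
            if (!it->second->evicted) {
                return it->second;
            }
            // Evicted but not yet cleaned up: remove it now so a fresh
            // producer can take its place.
            _producers.erase(it);
        }
        auto p = std::make_shared<producer>();
        p->id = id;
        _producers.emplace(id, p);
        return p;
    }

    std::size_t size() const { return _producers.size(); }

private:
    std::unordered_map<std::int64_t, std::shared_ptr<producer>> _producers;
};
```

The request path does the cleanup itself rather than waiting for the background pass, so a lingering evicted entry never blocks a new producer with the same id.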
The pid captured in the lambda could be stale if the pid got fenced
(with an epoch bump). This change requires a pid to be passed to
the eviction hook as an argument, which is the current pid at the time of
eviction.
@bharathv bharathv merged commit a43f07b into redpanda-data:dev Jan 21, 2025
18 checks passed
@vbotbuildovich
Collaborator

/backport v24.3.x

@vbotbuildovich
Collaborator

/backport v24.2.x

@vbotbuildovich
Collaborator

Failed to create a backport PR to v24.2.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-24852-v24.2.x-102 remotes/upstream/v24.2.x
git cherry-pick -x 64e47ca809 f264502e97 176241e6ac 21781c7fc8 79362faf97 cba29a05aa

Workflow run logs.

@vbotbuildovich
Collaborator

Failed to create a backport PR to v24.3.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-24852-v24.3.x-838 remotes/upstream/v24.3.x
git cherry-pick -x 64e47ca809 f264502e97 176241e6ac 21781c7fc8 79362faf97 cba29a05aa

Workflow run logs.
