
Conversation

joe-redpanda (Contributor) commented Sep 23, 2025

It is dangerous to suspend while looping over intrusive lists because destruction of the intrusive list element can invalidate an iterator on the intrusive list.

Here, the resolution is to snapshot the producer_identities which need to be cancelled into an owning collection before crossing a suspension point. The snapshotted producer identities will subsequently be used to expire transactions, which is an asynchronous operation.
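
For illustration, here is a minimal sketch of that pattern under assumed names (producer_state, producer_identity, and snapshot_ids are illustrative stand-ins, not the actual rm_stm types): copy the small identities out of the intrusive list into an owning container before the first suspension point, then drive the asynchronous expiry from the copy.

    // Hedged sketch of the snapshot-before-suspend pattern. producer_state,
    // producer_identity and snapshot_ids are illustrative stand-ins, not
    // the actual Redpanda types.
    #include <boost/intrusive/list.hpp>

    #include <cstdint>
    #include <vector>

    struct producer_identity {
        int64_t id;
        int16_t epoch;
    };

    struct producer_state : boost::intrusive::list_base_hook<> {
        producer_identity pid;
        // ... large producer bookkeeping elided ...
    };

    using active_list = boost::intrusive::list<producer_state>;

    // Copy only the identities; the resulting vector owns its elements, so
    // a concurrent unlink/clear of the intrusive list cannot invalidate it.
    std::vector<producer_identity> snapshot_ids(const active_list& active) {
        std::vector<producer_identity> ids;
        ids.reserve(active.size());
        for (const auto& p : active) {
            ids.push_back(p.pid);
        }
        return ids; // safe to iterate across suspension points
    }

The snapshot is then what the concurrency-limited loop iterates, as in the diff below.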

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.2.x
  • v25.1.x
  • v24.3.x

Release Notes

Bug Fixes

  • fixes possible race condition in aborting all transactions

joe-redpanda requested review from Copilot, bashtanov and bharathv and removed the request for Copilot, September 23, 2025 22:17
Comment on lines 1641 to +1654

      co_await ss::max_concurrent_for_each(
    -   _active_tx_producers, 5, [this, &last_err](const auto& producer) {
    -       return mark_expired(producer.id()).then([&last_err](tx::errc res) {
    +   std::move(producer_ids_to_expire),
    +   max_concurrency,
    +   [this, &last_err](const auto producer_id) {
    +       return mark_expired(producer_id).then([&last_err](tx::errc res) {
Member

> It is dangerous to suspend while looping over intrusive lists

does this PR fix a real bug?

Contributor Author

Not one that we've seen strike

Member

But is there a bug? It's still a bug if you've never seen a crash but can describe how it might occur. Merely saying "it is dangerous" isn't quite enough?

Contributor

rm_stm::abort_all_txes does not hold any lock itself, and its caller holds the partition produce lock. So rm_stm::reset_producers() can run concurrently and invalidate iterators.
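
To make that race concrete, a single-threaded toy (node and reset_producers are stand-ins; in Redpanda the interleaving happens across Seastar task switches, not a direct call):

    // Toy illustration of the race: one fiber iterates the intrusive list
    // and suspends; during the suspension another fiber runs
    // reset_producers(), which clears the list and leaves the first
    // fiber's iterator dangling.
    #include <boost/intrusive/list.hpp>

    #include <iostream>

    struct node : boost::intrusive::list_base_hook<> {
        int id = 0;
    };
    using ilist = boost::intrusive::list<node>;

    void reset_producers(ilist& l) {
        l.clear(); // unlinks every element; iterators into l now dangle
    }

    int main() {
        node a, b;
        a.id = 1;
        b.id = 2;
        ilist l;
        l.push_back(a);
        l.push_back(b);

        auto it = l.begin(); // points at a
        // --- imagine: co_await mark_expired(it->id) suspends here ---
        reset_producers(l); // runs during the suspension
        // --- on resume, ++it or *it would be undefined behavior ---
        std::cout << "iterator now dangles; do not touch it\n";
    }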

@vbotbuildovich (Collaborator) commented Sep 24, 2025

CI test results

Test results on build #72792 (all integration tests):

  • EndToEndCloudTopicsTest.test_write: FLAKY, passed 13/21. Upstream reliability is 87.88%, current run reliability is 61.90%; drift is 25.97 against an allowed drift of 50, so the test should PASS.
    Job: https://buildkite.com/redpanda/redpanda/builds/72792#019978d7-4482-4204-8e29-d66b6165c89f
    History: https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=EndToEndCloudTopicsTest&test_method=test_write
  • RandomNodeOperationsTest.test_node_operations {"cloud_storage_type": 1, "compaction_mode": "sliding_window", "enable_failures": false, "mixed_versions": true, "with_iceberg": false}: FLAKY, passed 20/21. Upstream reliability is 98.80%, current run reliability is 95.24%; drift is 3.56 against an allowed drift of 50, so the test should PASS.
    Job: https://buildkite.com/redpanda/redpanda/builds/72792#019978d7-4478-4e53-b74b-54003908ab6b
    History: https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RandomNodeOperationsTest&test_method=test_node_operations
  • RandomNodeOperationsTest.test_node_operations {"cloud_storage_type": 1, "compaction_mode": "sliding_window", "enable_failures": true, "mixed_versions": true, "with_iceberg": false}: FLAKY, passed 18/21. Upstream reliability is 98.51%, current run reliability is 85.71%; drift is 12.80 against an allowed drift of 50, so the test should PASS.
    Job: https://buildkite.com/redpanda/redpanda/builds/72792#019978d7-4480-4ecb-9b5f-5af431f3b23c
    History: https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RandomNodeOperationsTest&test_method=test_node_operations

Test results on build #72980 (all integration tests):

  • ShadowLinkingReplicationTests.test_replication_basic {"shuffle_leadership": false, "source_cluster_spec": {"cluster_type": "kafka", "kafka_quorum": "COMBINED_KRAFT", "kafka_version": "3.8.0"}}: FLAKY, passed 19/21. Upstream reliability is 100.00%, current run reliability is 90.48%; drift is 9.52 against an allowed drift of 50, so the test should PASS.
    Job: https://buildkite.com/redpanda/redpanda/builds/72980#0199821c-3f3f-4b7a-9418-cc55099f8bfd
    History: https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_replication_basic
  • ClusterRateQuotaTest.test_client_group_consume_rate_throttle_mechanism: FLAKY, passed 17/21. Upstream reliability is 93.94%, current run reliability is 80.95%; drift is 12.99 against an allowed drift of 50, so the test should PASS.
    Job: https://buildkite.com/redpanda/redpanda/builds/72980#0199821c-3f39-4d68-b5e8-cbbaf37979ad
    History: https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterRateQuotaTest&test_method=test_client_group_consume_rate_throttle_mechanism
  • ClusterRateQuotaTest.test_client_response_throttle_mechanism: FLAKY, passed 13/21. Upstream reliability is 81.76%, current run reliability is 61.90%; drift is 19.86 against an allowed drift of 50, so the test should PASS.
    Job: https://buildkite.com/redpanda/redpanda/builds/72980#01998218-cec3-43a3-90a5-ab11164072ba
    History: https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterRateQuotaTest&test_method=test_client_response_throttle_mechanism
  • DatalakeCustomPartitioningTest.test_many_partitions {"catalog_type": "rest_jdbc", "cloud_storage_type": 1}: FLAKY, passed 9/21. Upstream reliability is 100.00%, current run reliability is 42.86%; drift is 57.14, which exceeds the allowed drift of 50, so the test should FAIL.
    Job: https://buildkite.com/redpanda/redpanda/builds/72980#01998218-cebf-46fb-9ead-79ebaad3b459
    History: https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DatalakeCustomPartitioningTest&test_method=test_many_partitions
  • DisablingPartitionsTest.test_disable: FLAKY, passed 9/21. Upstream reliability is 85.81%, current run reliability is 42.86%; drift is 42.95 against an allowed drift of 50, so the test should PASS.
    Job: https://buildkite.com/redpanda/redpanda/builds/72980#0199821c-3f3f-4b7a-9418-cc55099f8bfd
    History: https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DisablingPartitionsTest&test_method=test_disable
  • RecoveryModeTest.test_rolling_restart: FLAKY, passed 13/21. Upstream reliability is 94.37%, current run reliability is 61.90%; drift is 32.46 against an allowed drift of 50, so the test should PASS.
    Job: https://buildkite.com/redpanda/redpanda/builds/72980#0199821c-3f39-4d68-b5e8-cbbaf37979ad
    History: https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RecoveryModeTest&test_method=test_rolling_restart
  • WriteCachingFailureInjectionE2ETest.test_crash_all {"use_transactions": false}: FLAKY, passed 17/21. Upstream reliability is 91.87%, current run reliability is 80.95%; drift is 10.92 against an allowed drift of 50, so the test should PASS.
    Job: https://buildkite.com/redpanda/redpanda/builds/72980#0199821c-3f3c-48bf-9c9f-cf4da10d87a9
    History: https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=WriteCachingFailureInjectionE2ETest&test_method=test_crash_all
  • WriteCachingFailureInjectionTest.test_unavoidable_data_loss: FLAKY, passed 17/21. Upstream reliability is 95.21%, current run reliability is 80.95%; drift is 14.26 against an allowed drift of 50, so the test should PASS.
    Job: https://buildkite.com/redpanda/redpanda/builds/72980#0199821c-3f39-4d68-b5e8-cbbaf37979ad
    History: https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=WriteCachingFailureInjectionTest&test_method=test_unavoidable_data_loss

@WillemKauf (Contributor)

Could this be equivalently safe with

    co_await ss::max_concurrent_for_each(
      _active_tx_producers,
      5,
      [this, &last_err](const producer_state& producer) {
          if (!producer._active_transaction_hook.is_linked()) {
              return ss::now();
          }
          return mark_expired(producer.id()).then([&last_err](tx::errc res) {
              if (res != tx::errc::none) {
                  last_err = res;
              }
          });
      });

    }

    ss::future<tx::errc> rm_stm::abort_all_txes() {
        static constexpr uint max_concurrency = 5u;
Contributor

I'd like to get @bashtanov's thoughts on this. This was originally added for migration purposes, but I vaguely recall a discussion in the PR about the potentially unsafe iteration (no synchronization when this method is called). I reviewed that PR but it's not the same PR that added this method. Alexey, do you remember anything about it? I'm not able to find that discussion.

Contributor

Found the discussion

#26380 (comment)

Contributor Author

mark_expired holds a lock on every individual invocation, but is it not possible to have an interleaved modification to the list in between invocations of mark_expired?

Contributor

ya for sure (that was the original concern).. but I'm not sure what the conclusion of the thread was (I went silent :)). The fix makes sense but I'd like to check with @bashtanov in case we are overlooking something.

@joe-redpanda (Contributor Author)

> Could this be equivalently safe with
>
>     co_await ss::max_concurrent_for_each(
>       _active_tx_producers,
>       5,
>       [this, &last_err](const producer_state& producer) {
>           if (!producer._active_transaction_hook.is_linked()) {
>               return ss::now();
>           }
>           return mark_expired(producer.id()).then([&last_err](tx::errc res) {
>               if (res != tx::errc::none) {
>                   last_err = res;
>               }
>           });
>       });

I don't agree. If the producer can be unlinked, that implies the list is changing under our nose. There must be an iterator that max_concurrent_for_each is using to generate batches of size max_concurrency; can we guarantee that the iterator used for work generation is not hit by iterator invalidation?
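
As a sketch of that concern (this is not Seastar's ss::max_concurrent_for_each, only a toy model of its shape): any concurrency-limited for_each must keep its own iterator into the range alive across completions, and a per-element is_linked() check runs too late to protect that iterator.

    // Toy model of a concurrency-limited for_each: a synchronous stand-in
    // where "suspend" is a callback during which other code may mutate the
    // list. Not Seastar's implementation, only the shape of the hazard.
    #include <functional>
    #include <list>

    template <typename T>
    void for_each_limited(std::list<T>& range,
                          const std::function<void(T&)>& work,
                          const std::function<void()>& suspend) {
        auto it = range.begin();
        while (it != range.end()) {
            work(*it);
            suspend(); // in the async version other fibers run here and
                       // may erase *it or clear the list entirely
            ++it;      // if *it was erased above this advances a dangling
                       // iterator: undefined behavior, no matter what
                       // is_linked() check ran inside work()
        }
    }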

    tx::errc last_err = tx::errc::none;

    // snap the intrusive list producer_ids before yielding the cpu
    chunked_vector<model::producer_identity> producer_ids_to_expire;
Member

maybe we should just change the type of the _active_tx_producers container, @bharathv?

Contributor

umm would that help? Any container is prone to iterator invalidation?

Member

oh no, i just meant that if we are going to make a copy of the container to iterate over anyway, we could align the data structure types. you're right, i'm not suggesting magic

Contributor

IIRC my reasoning was that the function is only called when the partition is made read-only, so no further control batches can be applied and no individual erases can happen. reset_producers(), which erases the whole container, however, can still be called, so your concern is valid.

Contributor Author

the intrusive list is still offering value here: we're not copying the list members (which are quite large, afaict), only the ids (int64 + int16)
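
Rough numbers behind that point, assuming a layout of int64 id plus int16 epoch (the real model::producer_identity may differ):

    // Back-of-envelope: an identity of (int64 id, int16 epoch) occupies at
    // most 16 bytes with padding, so snapshotting N producers copies ~16N
    // bytes rather than N full producer_state objects.
    #include <cstdint>
    #include <type_traits>

    struct producer_identity_like {
        int64_t id;
        int16_t epoch;
    };

    static_assert(sizeof(producer_identity_like) <= 16);
    static_assert(std::is_trivially_copyable_v<producer_identity_like>);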

Member

sgtm!

bashtanov previously approved these changes Sep 25, 2025

Comment on lines 1645 to 1649

    producer_ids_to_expire.reserve(_active_tx_producers.size());
    std::ranges::transform(
      _active_tx_producers,
      std::back_inserter(producer_ids_to_expire),
      [](const auto& producer) { return producer.id(); });
Contributor

nit: we can use something like

    chunked_vector<model::producer_identity> producer_ids_to_expire{
      std::from_range,
      _active_tx_producers | std::views::transform([](const auto& producer) {
          return producer.id();
      })};

It'll reserve automatically.
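
For reference, a standalone C++23 demo of that construction; std::vector stands in for chunked_vector here, and chunked_vector's support for std::from_range is taken on faith from the suggestion above:

    #include <cstdint>
    #include <ranges>
    #include <vector>

    struct producer {
        int64_t id_;
        int64_t id() const { return id_; }
    };

    int main() {
        std::vector<producer> active{{1}, {2}, {3}};
        // from_range constructor: the transform view over a vector is a
        // sized range, so the destination can allocate once up front with
        // no reserve() + back_inserter.
        std::vector<int64_t> ids{
          std::from_range,
          active
            | std::views::transform([](const auto& p) { return p.id(); })};
        return ids.size() == 3 ? 0 : 1;
    }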

Contributor Author

That's awesome, thanks

Copilot AI review requested due to automatic review settings September 25, 2025 17:19
Copilot AI (Contributor) left a comment

Pull Request Overview

This PR fixes a race condition in the abort_all_txes() function by preventing iterator invalidation when looping over an intrusive list. The fix ensures that asynchronous operations don't corrupt the intrusive list iteration by snapshotting producer identities before any suspension points.

Key changes:

  • Snapshot producer identities from the intrusive list into an owning collection before yielding CPU
  • Use the snapshotted collection for asynchronous transaction expiration operations
  • Add necessary header include for transaction error codes

@vbotbuildovich (Collaborator)

Retry command for Build#72980

Please wait until all jobs are finished before running the slash command.

    /ci-repeat 1
    tests/rptest/tests/datalake/custom_partitioning_test.py::DatalakeCustomPartitioningTest.test_many_partitions@{"catalog_type":"rest_jdbc","cloud_storage_type":1}

joe-redpanda merged commit d606432 into redpanda-data:dev on Sep 26, 2025. 17 checks passed.
@vbotbuildovich (Collaborator)

/backport v25.2.x

@vbotbuildovich (Collaborator)

/backport v25.1.x

@vbotbuildovich (Collaborator)

/backport v24.3.x

@vbotbuildovich (Collaborator)

Failed to create a backport PR to the v24.3.x branch. I tried:

    git remote add upstream https://github.com/redpanda-data/redpanda.git
    git fetch --all
    git checkout -b backport-pr-27694-v24.3.x-322 remotes/upstream/v24.3.x
    git cherry-pick -x d256479640

Workflow run logs.

@vbotbuildovich (Collaborator)

Failed to create a backport PR to the v25.1.x branch. I tried:

    git remote add upstream https://github.com/redpanda-data/redpanda.git
    git fetch --all
    git checkout -b backport-pr-27694-v25.1.x-41 remotes/upstream/v25.1.x
    git cherry-pick -x d256479640

Workflow run logs.
