
Conversation

joe-redpanda (Contributor) commented Sep 23, 2025

It is dangerous to suspend while looping over intrusive lists because destruction of the intrusive list element can invalidate an iterator on the intrusive list.

Here, the resolution is to snapshot the producer_identities which need to be cancelled into an owning collection before crossing a suspension point. The snapshotted producer identities will subsequently be used to expire transactions, which is an asynchronous operation.
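
For illustration, here is a minimal sketch of that pattern under assumed names (producer_state, producer_identity, and snapshot_ids are illustrative stand-ins, not the actual rm_stm types): copy the small identities out of the intrusive list into an owning container before the first suspension point, then drive the asynchronous expiry from the copy.

    // Hedged sketch of the snapshot-before-suspend pattern. producer_state,
    // producer_identity and snapshot_ids are illustrative stand-ins, not
    // the actual Redpanda types.
    #include <boost/intrusive/list.hpp>

    #include <cstdint>
    #include <vector>

    struct producer_identity {
        int64_t id;
        int16_t epoch;
    };

    struct producer_state : boost::intrusive::list_base_hook<> {
        producer_identity pid;
        // ... large producer bookkeeping elided ...
    };

    using active_list = boost::intrusive::list<producer_state>;

    // Copy only the identities; the resulting vector owns its elements, so
    // a concurrent unlink/clear of the intrusive list cannot invalidate it.
    std::vector<producer_identity> snapshot_ids(const active_list& active) {
        std::vector<producer_identity> ids;
        ids.reserve(active.size());
        for (const auto& p : active) {
            ids.push_back(p.pid);
        }
        return ids; // safe to iterate across suspension points
    }

The snapshot is then what the concurrency-limited loop iterates, as in the diff below.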

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.2.x
  • v25.1.x
  • v24.3.x

Release Notes

Bug Fixes

  • fixes possible race condition in aborting all transactions

joe-redpanda requested review from Copilot, bashtanov and bharathv and removed the request for Copilot, September 23, 2025 22:17
Comment on lines 1641 to +1654

      co_await ss::max_concurrent_for_each(
    -   _active_tx_producers, 5, [this, &last_err](const auto& producer) {
    -       return mark_expired(producer.id()).then([&last_err](tx::errc res) {
    +   std::move(producer_ids_to_expire),
    +   max_concurrency,
    +   [this, &last_err](const auto producer_id) {
    +       return mark_expired(producer_id).then([&last_err](tx::errc res) {
Member

> It is dangerous to suspend while looping over intrusive lists

does this PR fix a real bug?

Contributor Author

Not one that we've seen strike

Member

But is there a bug? It's still a bug if you've never seen a crash but can describe how it might occur. Merely saying "it is dangerous" isn't quite enough?

Contributor

rm_stm::abort_all_txes does not hold any lock itself, and its caller holds the partition produce lock. So rm_stm::reset_producers() can run concurrently and invalidate iterators.
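
To make that race concrete, a single-threaded toy (node and reset_producers are stand-ins; in Redpanda the interleaving happens across Seastar task switches, not a direct call):

    // Toy illustration of the race: one fiber iterates the intrusive list
    // and suspends; during the suspension another fiber runs
    // reset_producers(), which clears the list and leaves the first
    // fiber's iterator dangling.
    #include <boost/intrusive/list.hpp>

    #include <iostream>

    struct node : boost::intrusive::list_base_hook<> {
        int id = 0;
    };
    using ilist = boost::intrusive::list<node>;

    void reset_producers(ilist& l) {
        l.clear(); // unlinks every element; iterators into l now dangle
    }

    int main() {
        node a, b;
        a.id = 1;
        b.id = 2;
        ilist l;
        l.push_back(a);
        l.push_back(b);

        auto it = l.begin(); // points at a
        // --- imagine: co_await mark_expired(it->id) suspends here ---
        reset_producers(l); // runs during the suspension
        // --- on resume, ++it or *it would be undefined behavior ---
        std::cout << "iterator now dangles; do not touch it\n";
    }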

@vbotbuildovich (Collaborator) commented Sep 24, 2025

CI test results

Test results on build #72792 (all integration tests):

  • EndToEndCloudTopicsTest.test_write: FLAKY, passed 13/21. Upstream reliability is 87.88%, current run reliability is 61.90%; drift is 25.97 against an allowed drift of 50, so the test should PASS.
    Job: https://buildkite.com/redpanda/redpanda/builds/72792#019978d7-4482-4204-8e29-d66b6165c89f
    History: https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=EndToEndCloudTopicsTest&test_method=test_write
  • RandomNodeOperationsTest.test_node_operations {"cloud_storage_type": 1, "compaction_mode": "sliding_window", "enable_failures": false, "mixed_versions": true, "with_iceberg": false}: FLAKY, passed 20/21. Upstream reliability is 98.80%, current run reliability is 95.24%; drift is 3.56 against an allowed drift of 50, so the test should PASS.
    Job: https://buildkite.com/redpanda/redpanda/builds/72792#019978d7-4478-4e53-b74b-54003908ab6b
    History: https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RandomNodeOperationsTest&test_method=test_node_operations
  • RandomNodeOperationsTest.test_node_operations {"cloud_storage_type": 1, "compaction_mode": "sliding_window", "enable_failures": true, "mixed_versions": true, "with_iceberg": false}: FLAKY, passed 18/21. Upstream reliability is 98.51%, current run reliability is 85.71%; drift is 12.80 against an allowed drift of 50, so the test should PASS.
    Job: https://buildkite.com/redpanda/redpanda/builds/72792#019978d7-4480-4ecb-9b5f-5af431f3b23c
    History: https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RandomNodeOperationsTest&test_method=test_node_operations

Test results on build #72980 (all integration tests):

  • ShadowLinkingReplicationTests.test_replication_basic {"shuffle_leadership": false, "source_cluster_spec": {"cluster_type": "kafka", "kafka_quorum": "COMBINED_KRAFT", "kafka_version": "3.8.0"}}: FLAKY, passed 19/21. Upstream reliability is 100.00%, current run reliability is 90.48%; drift is 9.52 against an allowed drift of 50, so the test should PASS.
    Job: https://buildkite.com/redpanda/redpanda/builds/72980#0199821c-3f3f-4b7a-9418-cc55099f8bfd
    History: https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_replication_basic
  • ClusterRateQuotaTest.test_client_group_consume_rate_throttle_mechanism: FLAKY, passed 17/21. Upstream reliability is 93.94%, current run reliability is 80.95%; drift is 12.99 against an allowed drift of 50, so the test should PASS.
    Job: https://buildkite.com/redpanda/redpanda/builds/72980#0199821c-3f39-4d68-b5e8-cbbaf37979ad
    History: https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterRateQuotaTest&test_method=test_client_group_consume_rate_throttle_mechanism
  • ClusterRateQuotaTest.test_client_response_throttle_mechanism: FLAKY, passed 13/21. Upstream reliability is 81.76%, current run reliability is 61.90%; drift is 19.86 against an allowed drift of 50, so the test should PASS.
    Job: https://buildkite.com/redpanda/redpanda/builds/72980#01998218-cec3-43a3-90a5-ab11164072ba
    History: https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterRateQuotaTest&test_method=test_client_response_throttle_mechanism
  • DatalakeCustomPartitioningTest.test_many_partitions {"catalog_type": "rest_jdbc", "cloud_storage_type": 1}: FLAKY, passed 9/21. Upstream reliability is 100.00%, current run reliability is 42.86%; drift is 57.14, which exceeds the allowed drift of 50, so the test should FAIL.
    Job: https://buildkite.com/redpanda/redpanda/builds/72980#01998218-cebf-46fb-9ead-79ebaad3b459
    History: https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DatalakeCustomPartitioningTest&test_method=test_many_partitions
  • DisablingPartitionsTest.test_disable: FLAKY, passed 9/21. Upstream reliability is 85.81%, current run reliability is 42.86%; drift is 42.95 against an allowed drift of 50, so the test should PASS.
    Job: https://buildkite.com/redpanda/redpanda/builds/72980#0199821c-3f3f-4b7a-9418-cc55099f8bfd
    History: https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DisablingPartitionsTest&test_method=test_disable
  • RecoveryModeTest.test_rolling_restart: FLAKY, passed 13/21. Upstream reliability is 94.37%, current run reliability is 61.90%; drift is 32.46 against an allowed drift of 50, so the test should PASS.
    Job: https://buildkite.com/redpanda/redpanda/builds/72980#0199821c-3f39-4d68-b5e8-cbbaf37979ad
    History: https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RecoveryModeTest&test_method=test_rolling_restart
  • WriteCachingFailureInjectionE2ETest.test_crash_all {"use_transactions": false}: FLAKY, passed 17/21. Upstream reliability is 91.87%, current run reliability is 80.95%; drift is 10.92 against an allowed drift of 50, so the test should PASS.
    Job: https://buildkite.com/redpanda/redpanda/builds/72980#0199821c-3f3c-48bf-9c9f-cf4da10d87a9
    History: https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=WriteCachingFailureInjectionE2ETest&test_method=test_crash_all
  • WriteCachingFailureInjectionTest.test_unavoidable_data_loss: FLAKY, passed 17/21. Upstream reliability is 95.21%, current run reliability is 80.95%; drift is 14.26 against an allowed drift of 50, so the test should PASS.
    Job: https://buildkite.com/redpanda/redpanda/builds/72980#0199821c-3f39-4d68-b5e8-cbbaf37979ad
    History: https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=WriteCachingFailureInjectionTest&test_method=test_unavoidable_data_loss

@WillemKauf (Contributor)

Could this be equivalently safe with

    co_await ss::max_concurrent_for_each(
      _active_tx_producers,
      5,
      [this, &last_err](const producer_state& producer) {
          if (!producer._active_transaction_hook.is_linked()) {
              return ss::now();
          }
          return mark_expired(producer.id()).then([&last_err](tx::errc res) {
              if (res != tx::errc::none) {
                  last_err = res;
              }
          });
      });

    }

    ss::future<tx::errc> rm_stm::abort_all_txes() {
        static constexpr uint max_concurrency = 5u;
Contributor

I'd like to get @bashtanov's thoughts on this. This was originally added for migration purposes, but I vaguely recall a discussion in the PR about the potentially unsafe iteration (no synchronization when this method is called). I reviewed that PR but it's not the same PR that added this method. Alexey, do you remember anything about it? I'm not able to find that discussion.

Contributor

Found the discussion

#26380 (comment)

Contributor Author

mark_expired holds a lock on every individual invocation, but is it not possible to have an interleaved modification to the list in between invocations of mark_expired?

Contributor

ya for sure (that was the original concern).. but I'm not sure what the conclusion of the thread was (I went silent :)). The fix makes sense but I'd like to check with @bashtanov in case we are overlooking something.

@joe-redpanda (Contributor Author)

> Could this be equivalently safe with
>
>     co_await ss::max_concurrent_for_each(
>       _active_tx_producers,
>       5,
>       [this, &last_err](const producer_state& producer) {
>           if (!producer._active_transaction_hook.is_linked()) {
>               return ss::now();
>           }
>           return mark_expired(producer.id()).then([&last_err](tx::errc res) {
>               if (res != tx::errc::none) {
>                   last_err = res;
>               }
>           });
>       });

I don't agree. If the producer can be unlinked, that implies the list is changing under our nose. There must be an iterator that max_concurrent_for_each is using to generate batches of size max_concurrency; can we guarantee that the iterator used for work generation is not hit by iterator invalidation?
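
As a sketch of that concern (this is not Seastar's ss::max_concurrent_for_each, only a toy model of its shape): any concurrency-limited for_each must keep its own iterator into the range alive across completions, and a per-element is_linked() check runs too late to protect that iterator.

    // Toy model of a concurrency-limited for_each: a synchronous stand-in
    // where "suspend" is a callback during which other code may mutate the
    // list. Not Seastar's implementation, only the shape of the hazard.
    #include <functional>
    #include <list>

    template <typename T>
    void for_each_limited(std::list<T>& range,
                          const std::function<void(T&)>& work,
                          const std::function<void()>& suspend) {
        auto it = range.begin();
        while (it != range.end()) {
            work(*it);
            suspend(); // in the async version other fibers run here and
                       // may erase *it or clear the list entirely
            ++it;      // if *it was erased above this advances a dangling
                       // iterator: undefined behavior, no matter what
                       // is_linked() check ran inside work()
        }
    }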

    tx::errc last_err = tx::errc::none;

    // snap the intrusive list producer_ids before yielding the cpu
    chunked_vector<model::producer_identity> producer_ids_to_expire;
Member

maybe we should just change the type of the _active_tx_producers container, @bharathv?

Contributor

umm would that help? Any container is prone to iterator invalidation?

Member

oh no, i just meant that if we are going to make a copy of the container to iterate over anyway, we could align the data structure types. you're right, i'm not suggesting magic

Contributor

IIRC my reasoning was that the function is only called when the partition is made read-only, so no further control batches can be applied and no individual erases can happen. reset_producers(), which erases the whole container, however, can still be called, so your concern is valid.

Contributor Author

the intrusive list is still offering value here: we're not copying the list members (which are quite large, afaict), only the ids (int64 + int16)
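
Rough numbers behind that point, assuming a layout of int64 id plus int16 epoch (the real model::producer_identity may differ):

    // Back-of-envelope: an identity of (int64 id, int16 epoch) occupies at
    // most 16 bytes with padding, so snapshotting N producers copies ~16N
    // bytes rather than N full producer_state objects.
    #include <cstdint>
    #include <type_traits>

    struct producer_identity_like {
        int64_t id;
        int16_t epoch;
    };

    static_assert(sizeof(producer_identity_like) <= 16);
    static_assert(std::is_trivially_copyable_v<producer_identity_like>);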

Member

sgtm!

bashtanov previously approved these changes Sep 25, 2025

Comment on lines 1645 to 1649

    producer_ids_to_expire.reserve(_active_tx_producers.size());
    std::ranges::transform(
      _active_tx_producers,
      std::back_inserter(producer_ids_to_expire),
      [](const auto& producer) { return producer.id(); });
Contributor

nit: we can use something like

    chunked_vector<model::producer_identity> producer_ids_to_expire{
      std::from_range,
      _active_tx_producers | std::views::transform([](const auto& producer) {
          return producer.id();
      })};

It'll reserve automatically.
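
For reference, a standalone C++23 demo of that construction; std::vector stands in for chunked_vector here, and chunked_vector's support for std::from_range is taken on faith from the suggestion above:

    #include <cstdint>
    #include <ranges>
    #include <vector>

    struct producer {
        int64_t id_;
        int64_t id() const { return id_; }
    };

    int main() {
        std::vector<producer> active{{1}, {2}, {3}};
        // from_range constructor: the transform view over a vector is a
        // sized range, so the destination can allocate once up front with
        // no reserve() + back_inserter.
        std::vector<int64_t> ids{
          std::from_range,
          active
            | std::views::transform([](const auto& p) { return p.id(); })};
        return ids.size() == 3 ? 0 : 1;
    }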

Contributor Author

That's awesome, thanks

Copilot AI review requested due to automatic review settings September 25, 2025 17:19
Copilot AI (Contributor) left a comment

Pull Request Overview

This PR fixes a race condition in the abort_all_txes() function by preventing iterator invalidation when looping over an intrusive list. The fix ensures that asynchronous operations don't corrupt the intrusive list iteration by snapshotting producer identities before any suspension points.

Key changes:

  • Snapshot producer identities from the intrusive list into an owning collection before yielding CPU
  • Use the snapshotted collection for asynchronous transaction expiration operations
  • Add necessary header include for transaction error codes

@vbotbuildovich (Collaborator)

Retry command for Build#72980

Please wait until all jobs are finished before running the slash command.

    /ci-repeat 1
    tests/rptest/tests/datalake/custom_partitioning_test.py::DatalakeCustomPartitioningTest.test_many_partitions@{"catalog_type":"rest_jdbc","cloud_storage_type":1}

joe-redpanda merged commit d606432 into redpanda-data:dev on Sep 26, 2025. 17 checks passed.
@vbotbuildovich (Collaborator)

/backport v25.2.x

@vbotbuildovich (Collaborator)

/backport v25.1.x

@vbotbuildovich (Collaborator)

/backport v24.3.x

@vbotbuildovich (Collaborator)

Failed to create a backport PR to the v24.3.x branch. I tried:

    git remote add upstream https://github.com/redpanda-data/redpanda.git
    git fetch --all
    git checkout -b backport-pr-27694-v24.3.x-322 remotes/upstream/v24.3.x
    git cherry-pick -x d256479640

Workflow run logs.

@vbotbuildovich (Collaborator)

Failed to create a backport PR to the v25.1.x branch. I tried:

    git remote add upstream https://github.com/redpanda-data/redpanda.git
    git fetch --all
    git checkout -b backport-pr-27694-v25.1.x-41 remotes/upstream/v25.1.x
    git cherry-pick -x d256479640

Workflow run logs.
