[backport] [v23.2.x] rm_stm: fix fence_pid_epoch cleanup #17880 #18120
Closed
bharathv wants to merge 2 commits into redpanda-data:v23.2.x
fence_pid_epoch maps a producer id to its latest epoch. The current cleanup code does not do an epoch check before cleaning up the pid state. This can result in removing the state related to the latest epoch. Consider the following series of events:

[x, y] = pid[id=x, epoch=y]
[1, 0] begin_tx - fence_pid_epoch[1] = 0
[1, 1] begin_tx - fence_pid_epoch[1] = 1
evict [1, 0]
erase(fence_pid_epoch[1]) ==> removes the entry holding epoch 1

This leaves the state inconsistent and stalls the transaction, because the partition cannot make progress until it verifies the epoch.

This is a long-standing bug that was exposed by racy evictions.

(cherry picked from commit 996e138)
(cherry picked from commit 4a94208)
ztlpn reviewed on May 6, 2024
    rpk = RpkTool(self.redpanda)
    rpk.cluster_config_set("max_concurrent_producer_ids",
                           str(max_concurrent_pids))
    sleep(5)
Contributor
Should we wait for at least 10s to guarantee that eviction ran at least once? (Is the relevant config property abort_timed_out_transactions_interval_ms?)
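The concern here is that a fixed sleep may be shorter than the eviction interval. One alternative, sketched below under assumptions, is to poll for an observable condition with a timeout instead of sleeping for a fixed duration. `wait_until` is a small local helper written for this illustration, and `eviction_ran` is a hypothetical stand-in for whatever check the test could actually perform; neither is taken from the PR.

```python
import time


def wait_until(condition, timeout_sec, backoff_sec=0.5):
    """Poll `condition` until it returns True or `timeout_sec` elapses.

    Returns True if the condition held before the deadline, else False.
    """
    deadline = time.monotonic() + timeout_sec
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(backoff_sec)


# Hypothetical stand-in for "eviction has run at least once":
# here it simply becomes true on the third poll.
polls = {"count": 0}


def eviction_ran():
    polls["count"] += 1
    return polls["count"] >= 3


# The timeout should comfortably exceed the eviction interval.
ok = wait_until(eviction_ran, timeout_sec=2, backoff_sec=0.01)
print(ok)
```

With a real check in place of `eviction_ran`, the test would wait only as long as needed and fail explicitly if eviction never ran.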
Member
i think we can close this as
Contributor
v23.2.x is supported for about a month or so more.
Contributor
Closing as v23.2.x goes end-of-support.
fence_pid_epoch maps a producer id to its latest epoch. The current cleanup code does not do an epoch check before cleaning up the pid state. This can result in removing the state related to the latest epoch. Consider the following series of events:
[x, y] = pid[id=x, epoch=y]
[1, 0] begin_tx - fence_pid_epoch[1] = 0
[1, 1] begin_tx - fence_pid_epoch[1] = 1
evict [1, 0]
erase(fence_pid_epoch[1]) ==> removes the entry holding epoch 1
This leaves the state inconsistent and stalls the transaction, because the partition cannot make progress until it verifies the epoch.
This is a long-standing bug that was exposed by racy evictions.
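The event sequence above can be replayed with a minimal Python model. This is only a sketch: the real code lives in the C++ rm_stm, and the `Pid` class, the function names, and the equality-based guard in `evict_checked` are assumptions made for illustration, not the actual patch.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Pid:
    """Hypothetical stand-in for a producer identity [id, epoch]."""
    id: int
    epoch: int


def begin_tx(fence_pid_epoch: Dict[int, int], pid: Pid) -> None:
    # Record the latest epoch seen for this producer id.
    fence_pid_epoch[pid.id] = pid.epoch


def evict_unchecked(fence_pid_epoch: Dict[int, int], pid: Pid) -> None:
    # Models the bug: erase by id only, regardless of the stored epoch.
    fence_pid_epoch.pop(pid.id, None)


def evict_checked(fence_pid_epoch: Dict[int, int], pid: Pid) -> None:
    # Assumed guard: erase only when the stored epoch belongs to the
    # pid being evicted, so a newer epoch is left untouched.
    if fence_pid_epoch.get(pid.id) == pid.epoch:
        del fence_pid_epoch[pid.id]


def replay(evict) -> Dict[int, int]:
    # Replays: [1, 0] begin_tx, [1, 1] begin_tx, evict [1, 0].
    fence: Dict[int, int] = {}
    begin_tx(fence, Pid(1, 0))
    begin_tx(fence, Pid(1, 1))
    evict(fence, Pid(1, 0))
    return fence


print(replay(evict_unchecked))  # {}      (latest epoch lost)
print(replay(evict_checked))    # {1: 1}  (latest epoch preserved)
```

In the unchecked variant, evicting the stale [1, 0] wipes the entry that by then records epoch 1, which corresponds to the stall described above.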
Note: this code is going to be revamped soon, and the plan is to add a self-contained unit test fixture that supports transactions end-to-end, which should provide better test coverage.
Fixes #17891
Backports Required
Release Notes