
Conversation

@FrancisGodinho

Problem

During broker upgrades, the sendOffsetsToTransaction call would sometimes hang. Logs showed that it continuously returned errorCode=51 (CONCURRENT_TRANSACTIONS). The test would eventually hit its timeout and fail. This happened for every version upgrade and occurred in roughly 30% of the runs.

Resolution

The problem above left the producer in a broken state that did not resolve itself even after 5-10 minutes of waiting (including a few minutes past the transaction.max.ms time). I tried multiple solutions, including waiting for extended periods and retrying sendOffsetsToTransaction whenever the timeout occurred.

Unfortunately, the producer remained permanently stuck and kept receiving errorCode=51. For this case, the Kafka documentation recommends closing the previous producer and creating a new one: https://kafka.apache.org/documentation/#usingtransactions

Reusing the old transactional.id kept leading to the stuck state, so this fix creates a brand-new producer with a new ID and then rewinds the consumer offsets to preserve exactly-once semantics (EOS).
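
In condensed form, the recovery path looks roughly like the sketch below (comments are added here for clarity; producer, consumer, parsedArgs, producerNumber, and the helper methods are the ones that appear in the patched copier code quoted later in this thread):

} catch (KafkaException e) {
    // The producer is wedged on CONCURRENT_TRANSACTIONS, so abandon it entirely.
    try {
        producer.close(Duration.ofSeconds(0));
    } catch (Exception ignore) {
        // best effort: the old producer may already be unusable
    }
    // A fresh transactional.id avoids inheriting the stuck transaction state.
    parsedArgs.getAttrs().put("transactionalId",
            parsedArgs.getString("transactionalId") + producerNumber++);
    producer = createProducer(parsedArgs);
    producer.initTransactions();
    // Rewind to the last committed offsets so no records are lost or duplicated.
    resetToLastCommittedPositions(consumer);
}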

Testing and Validation

Previously, running the test for a single version upgrade would fail within the first 5-10 runs. After the fix, I ran it 40 times continuously with 0 failures. I also ran the full test (all versions) about 5 times, with 9/9 cases passing each time.

@github-actions bot added the triage (PRs from the community), tools, and small (Small PRs) labels on Dec 16, 2025
self.perform_upgrade(from_kafka_version)

- copier_timeout_sec = 180
+ copier_timeout_sec = 360
Author

Note: due to the timeouts and the re-creation of the producer, this copier_timeout needed to be increased. I experimented a bit and found that 360s was consistently reliable.

Member

As I described in https://issues.apache.org/jira/browse/KAFKA-20000, the performance regression is caused by the backoff logic. Therefore, I suggest fixing the underlying issue instead of increasing the timeout.


Kindly asking: is this something to consider? If so, I would add a test for this adjustment.

Thanks.

Member

@Pankraz76 thanks for the effort. As Justine suggested, hardcoding the timeout is a bit coarse-grained. Please refer to KAFKA-20000 for more discussion.

@FrancisGodinho
Author

@chia7712 can you take a look when you get a chance please?

@github-actions bot removed the triage (PRs from the community) label on Dec 16, 2025
@chia7712
Member

@FrancisGodinho thanks for your patch. I have identified some underlying issues in e2e and TV2. Addressing them should allow us to achieve more stable transaction behavior. Please check https://issues.apache.org/jira/browse/KAFKA-19999 and https://issues.apache.org/jira/browse/KAFKA-20000 for more details.

@Pankraz76 left a comment

+1

@FrancisGodinho
Author

@Pankraz76 thanks for the comments, can you re-review please?

@Pankraz76 left a comment

The issue is very well documented, thanks for the effort.

self.perform_upgrade(from_kafka_version)

- copier_timeout_sec = 180
+ copier_timeout_sec = 360


Suggested change
copier_timeout_sec = 360
copier_timeout_sec = 360

Sorry, again, this is something for SCA to handle, taking the off-topic changes out of the way upfront.

Spotless and rewrite are both ready to fix this on their own.

Comment on lines +395 to +400
// in case the producer gets stuck here, create a new one and continue the loop
try { producer.close(Duration.ofSeconds(0)); } catch (Exception ignore) {}
parsedArgs.getAttrs().put("transactionalId", parsedArgs.getString("transactionalId") + producerNumber++);
producer = createProducer(parsedArgs);
producer.initTransactions();
resetToLastCommittedPositions(consumer);


Suggested change
- // in case the producer gets stuck here, create a new one and continue the loop
- try { producer.close(Duration.ofSeconds(0)); } catch (Exception ignore) {}
- parsedArgs.getAttrs().put("transactionalId", parsedArgs.getString("transactionalId") + producerNumber++);
- producer = createProducer(parsedArgs);
- producer.initTransactions();
- resetToLastCommittedPositions(consumer);
+ circuitBreaker(); // in case the producer gets stuck here, create a new one and continue the loop

This concern could get a dedicated method, applying the single-responsibility principle and giving each piece its own focus. Here it is just about breaking the circuit; how that is actually done is an implementation detail that may well change.
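
For illustration, that extraction could look roughly like the sketch below; the helper name and signature are hypothetical, and the types are assumed to match the tool's existing createProducer signature, while producerNumber, createProducer, and resetToLastCommittedPositions are taken from the quoted diff:

} catch (KafkaException e) {
    // Break the livelock: discard the stuck producer and start over with a fresh one.
    producer = recreateProducerAndRewind(producer, consumer, parsedArgs);
}

private static KafkaProducer<String, String> recreateProducerAndRewind(
        KafkaProducer<String, String> oldProducer,
        KafkaConsumer<String, String> consumer,
        Namespace parsedArgs) {
    try {
        oldProducer.close(Duration.ofSeconds(0));
    } catch (Exception ignore) {
        // best effort: the old producer may already be unusable
    }
    // New transactional.id so the replacement producer does not inherit the stuck state.
    parsedArgs.getAttrs().put("transactionalId",
            parsedArgs.getString("transactionalId") + producerNumber++);
    KafkaProducer<String, String> newProducer = createProducer(parsedArgs);
    newProducer.initTransactions();
    resetToLastCommittedPositions(consumer);
    return newProducer;
}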

Pankraz76 pushed commits to Pankraz76/kafka that referenced this pull request on Dec 18, 2025
Contributor

@chickenchickenlove left a comment

Thanks for your work 🙇‍♂️
I left some minor comments in this PR and Jira as well.

However, if you decide to implement this on the server side instead of the client side, please feel free to ignore my review comments. Also, it seems that the cause of TimeoutException via TV2 is resolved by this PR (https://github.com/apache/kafka/pulls?q=is%3Apr+is%3Aclosed+KAFKA-19999). If so, please feel free to ignore my review.

- producer.sendOffsetsToTransaction(consumerPositions(consumer), groupMetadata);
+ try {
+     producer.sendOffsetsToTransaction(consumerPositions(consumer), groupMetadata);
+ } catch (KafkaException e) {
Contributor

IMHO, shouldn't we focus on handling the TimeoutException specifically?

AFAIK, CONCURRENT_TRANSACTIONS eventually manifests as a TimeoutException on the client side. I'm concerned that retrying on such a broad exception type might mask other underlying errors; in those cases we need to clearly identify the root cause to take appropriate action.

What do you think?
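
A sketch of that narrower handling (recoverFromStuckProducer is a hypothetical stand-in for the re-creation logic, and TimeoutException is org.apache.kafka.common.errors.TimeoutException):

try {
    producer.sendOffsetsToTransaction(consumerPositions(consumer), groupMetadata);
} catch (TimeoutException e) {
    // Only the timeout path (e.g. CONCURRENT_TRANSACTIONS surfacing as a client-side
    // timeout) triggers producer re-creation; any other KafkaException still propagates.
    recoverFromStuckProducer();
}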

} catch (KafkaException e) {
// in case the producer gets stuck here, create a new one and continue the loop
try { producer.close(Duration.ofSeconds(0)); } catch (Exception ignore) {}
parsedArgs.getAttrs().put("transactionalId", parsedArgs.getString("transactionalId") + producerNumber++);
Contributor

Is there a safer way to generate a globally unique transactionalId to avoid collisions?

Simply appending an incrementing counter (producerNumber++) to the user-provided ID seems risky in a shared cluster environment. If the generated ID happens to match an existing transactionalId of a running production application, it could trigger the fencing mechanism and unintentionally abort that application's active transactions.

Perhaps appending a UUID or a random suffix would be a safer approach to ensure uniqueness?
What do you think?
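
For example, a random suffix via java.util.UUID (a sketch built on the variables in the quoted diff; the suffix format is just an assumption):

// Append a random suffix rather than an incrementing counter, so the generated
// transactional.id cannot collide with another instance that uses the same base ID.
parsedArgs.getAttrs().put("transactionalId",
        parsedArgs.getString("transactionalId") + "-" + java.util.UUID.randomUUID());
producer = createProducer(parsedArgs);
producer.initTransactions();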

@chia7712
Member

chia7712 commented Jan 3, 2026

Also, it seems that the cause of TimeoutException via TV2 is resolved by this PR

Agreed, the root cause is the livelock, but fixing the retry is still valuable. +1 to trying the server-side approach. Let's keep this PR and handle the server-side fix in a separate PR.

@chickenchickenlove
Contributor

@FrancisGodinho , thanks for the update!

Following @chia7712's suggestion to try the server-side approach (retrying internally on the broker side), I wanted to check on ownership for the follow-up work.

Are you planning to implement the server-side fix as a separate PR? If so, I’m happy to help with reviews/testing. If you’d prefer to focus on the current client-side PR, I can pick up the follow-up server-side PR (or co-author it) to help move KAFKA-20000 forward.

Totally up to you!
I just wanted to express my willingness to help if needed.
Let me know what works best for you!

CC. @chia7712

@FrancisGodinho
Author

@chickenchickenlove @chia7712 yeah, I think retrying on the server side would be better as well, since it means less logic on the client. I'd like to take a stab at it since I will have some time this week, but I'll let you know if I need help/guidance (thanks for offering!).

I'll implement the server-side changes as a separate PR (since it's also a separate ticket) and ping you once I've made some progress!
