KAFKA-19925: Fix transaction timeout handling during broker upgrades #21161
First changed file (the system test):

@@ -179,7 +179,7 @@ def copy_messages_transactionally_during_upgrade(self, input_topic, output_topic

         self.perform_upgrade(from_kafka_version)

-        copier_timeout_sec = 180
+        copier_timeout_sec = 360
Review comment (with a suggested change) on the copier_timeout_sec line: sorry, again, this is something for SCA. Taking the off-topic items away upfront; spotless and rewrite are both ready to be fixed on their own.
         for copier in copiers:
             wait_until(lambda: copier.is_done,
                        timeout_sec=copier_timeout_sec,
Second changed file (the transactional message copier tool):

@@ -308,9 +308,11 @@ public static void runEventLoop(Namespace parsedArgs) {

         String consumerGroup = parsedArgs.getString("consumerGroup");

-        final KafkaProducer<String, String> producer = createProducer(parsedArgs);
+        KafkaProducer<String, String> producer = createProducer(parsedArgs);
         final KafkaConsumer<String, String> consumer = createConsumer(parsedArgs);
+        int producerNumber = 0;

         final AtomicLong remainingMessages = new AtomicLong(
             parsedArgs.getInt("maxMessages") == -1 ? Long.MAX_VALUE : parsedArgs.getInt("maxMessages"));
@@ -387,7 +389,17 @@ public void onPartitionsAssigned(Collection<TopicPartition> partitions) {

             long messagesSentWithinCurrentTxn = records.count();

             ConsumerGroupMetadata groupMetadata = useGroupMetadata ? consumer.groupMetadata() : new ConsumerGroupMetadata(consumerGroup);
-            producer.sendOffsetsToTransaction(consumerPositions(consumer), groupMetadata);
+            try {
+                producer.sendOffsetsToTransaction(consumerPositions(consumer), groupMetadata);
+            } catch (KafkaException e) {
Contributor comment on the new catch (KafkaException e) clause: IMHO, shouldn't we focus on handling the … ? AFAIK, … What do you think?
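If the concern is that KafkaException is broader than necessary here, a narrower catch might look like the sketch below; this is only an assumption about the reviewer's point, and it reuses the variables from the PR's event loop:

```java
// Sketch only: recover solely from a timed-out offset commit and let other
// KafkaException subtypes (e.g. ProducerFencedException) propagate as before.
try {
    producer.sendOffsetsToTransaction(consumerPositions(consumer), groupMetadata);
} catch (org.apache.kafka.common.errors.TimeoutException e) {
    // recreate the producer and continue the loop, as the PR does for KafkaException
}
```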
+                // in case the producer gets stuck here, create a new one and continue the loop
+                try { producer.close(Duration.ofSeconds(0)); } catch (Exception ignore) { }
+                parsedArgs.getAttrs().put("transactionalId", parsedArgs.getString("transactionalId") + producerNumber++);
Contributor comment on the new transactionalId line: Is there a safer way to generate a globally unique transactionalId and avoid collisions? Simply appending an incremental number may still produce duplicates; perhaps appending a UUID or a random suffix would be a safer approach to ensure uniqueness?
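A rough sketch of that suggestion follows; the helper name uniqueTransactionalId is illustrative and not part of the PR:

```java
import java.util.UUID;

// Illustrative helper: build a practically collision-free transactional.id
// instead of appending an incrementing counter.
static String uniqueTransactionalId(String baseTransactionalId) {
    return baseTransactionalId + "-" + UUID.randomUUID();
}

// e.g. parsedArgs.getAttrs().put("transactionalId",
//          uniqueTransactionalId(parsedArgs.getString("transactionalId")));
```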
+                producer = createProducer(parsedArgs);
+                producer.initTransactions();
+                resetToLastCommittedPositions(consumer);
Review comment on lines +395 to +400 (with a suggested change): a dedicated method could be given to this concern, applying the single responsibility principle so each part gets its own focus. Here it is just about breaking the circuit; how that is actually done seems to be a (more or less arbitrary) implementation detail.
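A minimal sketch of such an extraction, assuming the PR's surrounding runEventLoop context (createProducer, resetToLastCommittedPositions, the producerNumber counter); the name recreateProducer and the exact signature are illustrative only:

```java
// Illustrative only: pull the "break the circuit" recovery out of the event loop.
private static KafkaProducer<String, String> recreateProducer(Namespace parsedArgs,
                                                              KafkaProducer<String, String> stuckProducer,
                                                              KafkaConsumer<String, String> consumer,
                                                              int producerNumber) {
    // Close the stuck producer without waiting; failures while closing no longer matter.
    try {
        stuckProducer.close(Duration.ofSeconds(0));
    } catch (Exception ignore) { }
    // Give the replacement its own transactional.id, as the PR does.
    parsedArgs.getAttrs().put("transactionalId", parsedArgs.getString("transactionalId") + producerNumber);
    KafkaProducer<String, String> producer = createProducer(parsedArgs);
    producer.initTransactions();
    // Rewind the consumer so the abandoned batch is reprocessed by the new producer.
    resetToLastCommittedPositions(consumer);
    return producer;
}
```

The catch block would then reduce to producer = recreateProducer(parsedArgs, producer, consumer, producerNumber++); continue;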
+                continue;
+            }

             if (enableRandomAborts && random.nextInt() % 3 == 0) {
                 abortTransactionAndResetPosition(producer, consumer);
Comment on the copier_timeout_sec change: note that due to the timeouts and the re-creation of the producer, this copier timeout needed to be increased. I experimented a bit and found that 360s was a consistently reliable value.
Review comment: As I described in https://issues.apache.org/jira/browse/KAFKA-20000, the performance regression is caused by the backoff logic. Therefore, I suggest fixing the underlying issue instead of increasing the timeout.
Review comment: kindly asking whether this is something to consider? If so, I would add a test for this adjustment. Thanks.
Review comment: @Pankraz76 thanks for the effort. As Justine suggested, hardcoding the timeout is a bit coarse-grained. Please refer to KAFKA-20000 for more discussion.