Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

transaction expiration and abort failing with result: tx::errc::not_coordinator #24445

Open
vsarunas opened this issue Dec 5, 2024 · 2 comments
Labels
kind/bug Something isn't working

Comments

@vsarunas
Copy link

vsarunas commented Dec 5, 2024

Version & Environment

rpm version
Version:     v24.2.10
Git ref:     74404e718c
Build date:  2024-11-08T14:41:25Z
OS/Arch:     linux/amd64
Go version:  go1.23.1

Redpanda Cluster
  node-0  v24.2.10 - 74404e718c3c7ab1bd21a86e718b243a3166327b
  node-1  v24.2.10 - 74404e718c3c7ab1bd21a86e718b243a3166327b
  node-2  v24.2.10 - 74404e718c3c7ab1bd21a86e718b243a3166327b

OS: Ubuntu 22.04.4 LTS (GNU/Linux 5.15.0-1054-kvm x86_64)

What went wrong?

  1. The problem reproduced on 29th and process-identifier-assignment+process-identifier-status group was failing until transactional_id_expiration_ms was reduced to 5 seconds and then increased again.
  2. node-3 had for over a day errors of failing log messages with tx::errc::not_coordinator, even when transaction list showed that it should be the coordinator; but system remained to work, I assume because the application kept on running and coordinator remained on another broker.

node-1 / broker 0

Nov 29 09:03:36 demo-redpanda-1 rpk[471715]: INFO  2024-11-29 09:03:36,185 [shard 2:main] storage - disk_log_impl.cc:2475 - Removing "/var/lib/redpanda/data/kafka/cache-assignment-pending-demo/10_5913/19-10-v1.log" (remove_prefix_full_segments, {offset_tracker:{term:10, base_offset:19, committed_offset:26, dirty_offset:26}, compacted_segment=0, finished_self_compaction=0, finished_windowed_compaction=0, generation=17, reader={/var/lib/redpanda/data/kafka/cache-assignment-pending-demo/10_5913/19-10-v1.log, (2373 bytes)}, writer=nullptr, cache={cache_size=8, dirty tracker: {min: -9223372036854775808, max: -9223372036854775808}}, compaction_index:nullopt, closed=0, tombstone=0, index={file:/var/lib/redpanda/data/kafka/cache-assignment-pending-demo/10_5913/19-10-v1.base_index, offsets:19, index:{header_bitflags:0, base_offset:19, max_offset:26, base_timestamp:{timestamp: 1732179784596}, max_timestamp:{timestamp: 1732266206467}, batch_timestamps_are_monotonic:1, with_offset:true, non_data_timestamps:0, broker_timestamp:{{timestamp: 1732266206476}}, num_compactible_records_appended:{7}, index(1,1,1)}, step:32768, needs_persistence:0}})
Nov 29 09:08:53 demo-redpanda-1 rpk[471715]: WARN  2024-11-29 09:08:53,785 [shard 0:main] tx - rm_partition_frontend.cc:164 - got error rpc::errc:4 on remote begin tx
Nov 29 09:08:54 demo-redpanda-1 rpk[471715]: WARN  2024-11-29 09:08:54,106 [shard 0:main] tx - rm_partition_frontend.cc:164 - got error rpc::errc:4 on remote begin tx
Nov 29 09:08:54 demo-redpanda-1 rpk[471715]: WARN  2024-11-29 09:08:54,427 [shard 0:main] tx - rm_partition_frontend.cc:164 - got error rpc::errc:4 on remote begin tx
Nov 29 09:08:54 demo-redpanda-1 rpk[471715]: WARN  2024-11-29 09:08:54,747 [shard 0:main] tx - rm_partition_frontend.cc:164 - got error rpc::errc:4 on remote begin tx
Nov 29 09:08:55 demo-redpanda-1 rpk[471715]: WARN  2024-11-29 09:08:55,068 [shard 0:main] tx - rm_partition_frontend.cc:164 - got error rpc::errc:4 on remote begin tx
Nov 29 09:08:55 demo-redpanda-1 rpk[471715]: WARN  2024-11-29 09:08:55,388 [shard 0:main] tx - rm_partition_frontend.cc:164 - got error rpc::errc:4 on remote begin tx
Nov 29 09:08:55 demo-redpanda-1 rpk[471715]: WARN  2024-11-29 09:08:55,709 [shard 0:main] tx - rm_partition_frontend.cc:164 - got error rpc::errc:4 on remote begin tx

node-2 / broker 1

Nov 29 04:40:06 demo-repanda-2 rpk[516933]: INFO  2024-11-29 04:40:06,975 [shard 1:main] tx - tx_gateway_frontend.cc:513 - [tx_id=demo@demo-core-2@core@default+incidents+Augmentation] trying to abort transaction. pid: {producer_identity: id=13020, epoch=142} sequence: 0
Nov 29 04:40:06 demo-repanda-2 rpk[516933]: INFO  2024-11-29 04:40:06,976 [shard 1:main] tx - tx_gateway_frontend.cc:513 - [tx_id=demo@demo-core-2@core@default+incidents+Augmentation] trying to abort transaction. pid: {producer_identity: id=13020, epoch=142} sequence: 0
Nov 29 04:40:06 demo-repanda-2 rpk[516933]: INFO  2024-11-29 04:40:06,976 [shard 1:main] tx - tx_gateway_frontend.cc:513 - [tx_id=demo@demo-core-2@core@default+incidents+Augmentation] trying to abort transaction. pid: {producer_identity: id=13020, epoch=142} sequence: 0
Nov 29 04:40:06 demo-repanda-2 rpk[516933]: INFO  2024-11-29 04:40:06,977 [shard 1:main] tx - tx_gateway_frontend.cc:513 - [tx_id=demo@demo-core-2@core@default+incidents+Augmentation] trying to abort transaction. pid: {producer_identity: id=13020, epoch=142} sequence: 0
Nov 29 04:40:06 demo-repanda-2 rpk[516933]: INFO  2024-11-29 04:40:06,977 [shard 1:main] tx - tx_gateway_frontend.cc:513 - [tx_id=demo@demo-core-2@core@default+incidents+Augmentation] trying to abort transaction. pid: {producer_identity: id=13020, epoch=142} sequence: 0
Nov 29 04:40:07 demo-repanda-2 rpk[516933]: INFO  2024-11-29 04:40:07,923 [shard 1:main] tx - tx_gateway_frontend.cc:2497 - [tx_id=demo@demo-core-2@core@default+incidents+Augmentation] producer {producer_identity: id=13020, epoch=142} is fenced of by {producer_identity: id=13020, epoch=143}
Nov 29 04:41:08 demo-repanda-2 rpk[516933]: INFO  2024-11-29 04:41:08,976 [shard 1:main] tx - tx_gateway_frontend.cc:513 - [tx_id=demo@demo-core-2@core@default+incidents+Augmentation] trying to abort transaction. pid: {producer_identity: id=13020, epoch=144} sequence: 0
Nov 29 04:41:08 demo-repanda-2 rpk[516933]: INFO  2024-11-29 04:41:08,976 [shard 1:main] tx - tx_gateway_frontend.cc:578 - [tx_id=demo@demo-core-2@core@default+incidents+Augmentation] found transaction {id: demo@demo-core-2@core@default+incidents+Augmentation, status: ongoing, pid: {producer_identity: id=13020, epoch=144}, last_pid: {producer_identity: id=-1,
 epoch=-1}, etag: 37, seq: 0, partitions: } to abort
Nov 29 04:41:08 demo-repanda-2 rpk[516933]: INFO  2024-11-29 04:41:08,995 [shard 1:main] tx - tx_gateway_frontend.cc:2497 - [tx_id=demo@demo-core-2@core@default+incidents+Augmentation] producer {producer_identity: id=13020, epoch=144} is fenced of by {producer_identity: id=13020, epoch=145}
Nov 29 04:41:09 demo-repanda-2 rpk[516933]: INFO  2024-11-29 04:41:09,175 [shard 1:main] tx - tx_gateway_frontend.cc:513 - [tx_id=demo@demo-core-2@core@default+incidents+Augmentation] trying to abort transaction. pid: {producer_identity: id=13020, epoch=144} sequence: 0
Nov 29 04:41:09 demo-repanda-2 rpk[516933]: INFO  2024-11-29 04:41:09,176 [shard 1:main] tx - tx_gateway_frontend.cc:513 - [tx_id=demo@demo-core-2@core@default+incidents+Augmentation] trying to abort transaction. pid: {producer_identity: id=13020, epoch=144} sequence: 0
Nov 29 04:41:09 demo-repanda-2 rpk[516933]: message repeated 2 times: [ INFO  2024-11-29 04:41:09,176 [shard 1:main] tx - tx_gateway_frontend.cc:513 - [tx_id=demo@demo-core-2@core@default+incidents+Augmentation] trying to abort transaction. pid: {producer_identity: id=13020, epoch=144} sequence: 0]
Nov 29 04:41:09 demo-repanda-2 rpk[516933]: INFO  2024-11-29 04:41:09,178 [shard 1:main] tx - tx_gateway_frontend.cc:513 - [tx_id=demo@demo-core-2@core@default+incidents+Augmentation] trying to abort transaction. pid: {producer_identity: id=13020, epoch=144} sequence: 0
Nov 29 04:41:10 demo-repanda-2 rpk[516933]: INFO  2024-11-29 04:41:10,117 [shard 1:main] tx - tx_gateway_frontend.cc:2497 - [tx_id=demo@demo-core-2@core@default+incidents+Augmentation] producer {producer_identity: id=13020, epoch=144} is fenced of by {producer_identity: id=13020, epoch=145}
Nov 29 04:42:11 demo-repanda-2 rpk[516933]: INFO  2024-11-29 04:42:11,177 [shard 1:main] tx - tx_gateway_frontend.cc:513 - [tx_id=demo@demo-core-2@core@default+incidents+Augmentation] trying to abort transaction. pid: {producer_identity: id=13020, epoch=146} sequence: 0
Nov 29 04:42:11 demo-repanda-2 rpk[516933]: INFO  2024-11-29 04:42:11,177 [shard 1:main] tx - tx_gateway_frontend.cc:578 - [tx_id=demo@demo-core-2@core@default+incidents+Augmentation] found transaction {id: demo@demo-core-2@core@default+incidents+Augmentation, status: ongoing, pid: {producer_identity: id=13020, epoch=146}, last_pid: {producer_identity: id=-1,
 epoch=-1}, etag: 37, seq: 0, partitions: } to abort
Nov 29 04:42:11 demo-repanda-2 rpk[516933]: INFO  2024-11-29 04:42:11,191 [shard 1:main] tx - tx_gateway_frontend.cc:2497 - [tx_id=demo@demo-core-2@core@default+incidents+Augmentation] producer {producer_identity: id=13020, epoch=146} is fenced of by {producer_identity: id=13020, epoch=147}
Nov 29 04:42:11 demo-repanda-2 rpk[516933]: INFO  2024-11-29 04:42:11,373 [shard 1:main] tx - tx_gateway_frontend.cc:513 - [tx_id=demo@demo-core-2@core@default+incidents+Augmentation] trying to abort transaction. pid: {producer_identity: id=13020, epoch=146} sequence: 0
Nov 29 04:42:11 demo-repanda-2 rpk[516933]: INFO  2024-11-29 04:42:11,373 [shard 1:main] tx - tx_gateway_frontend.cc:513 - [tx_id=demo@demo-core-2@core@default+incidents+Augmentation] trying to abort transaction. pid: {producer_identity: id=13020, epoch=146} sequence: 0
Nov 29 04:42:11 demo-repanda-2 rpk[516933]: INFO  2024-11-29 04:42:11,374 [shard 1:main] tx - tx_gateway_frontend.cc:513 - [tx_id=demo@demo-core-2@core@default+incidents+Augmentation] trying to abort transaction. pid: {producer_identity: id=13020, epoch=146} sequence: 0
Nov 29 04:42:11 demo-repanda-2 rpk[516933]: INFO  2024-11-29 04:42:11,374 [shard 1:main] tx - tx_gateway_frontend.cc:513 - [tx_id=demo@demo-core-2@core@default+incidents+Augmentation] trying to abort transaction. pid: {producer_identity: id=13020, epoch=146} sequence: 0
Nov 29 04:42:11 demo-repanda-2 rpk[516933]: INFO  2024-11-29 04:42:11,375 [shard 1:main] tx - tx_gateway_frontend.cc:513 - [tx_id=demo@demo-core-2@core@default+incidents+Augmentation] trying to abort transaction. pid: {producer_identity: id=13020, epoch=146} sequence: 0
Nov 29 04:42:11 demo-repanda-2 rpk[516933]: INFO  2024-11-29 04:42:11,375 [shard 1:main] tx - tx_gateway_frontend.cc:513 - [tx_id=demo@demo-core-2@core@default+incidents+Augmentation] trying to abort transaction. pid: {producer_identity: id=13020, epoch=146} sequence: 0
Nov 29 04:42:12 demo-repanda-2 rpk[516933]: INFO  2024-11-29 04:42:12,310 [shard 1:main] tx - tx_gateway_frontend.cc:2497 - [tx_id=demo@demo-core-2@core@default+incidents+Augmentation] producer {producer_identity: id=13020, epoch=146} is fenced of by {producer_identity: id=13020, epoch=147}
[...]
Nov 29 11:00:11 demo-repanda-2 rpk[516933]: INFO  2024-11-29 11:00:11,535 [shard 1:main] kafka - server.cc:167 - Disconnected 10.200.118.1:59833 (applying protocol, Connection reset by peer)
Nov 29 11:00:11 demo-repanda-2 rpk[516933]: INFO  2024-11-29 11:00:11,535 [shard 2:main] kafka - server.cc:167 - Disconnected 10.200.118.1:59840 (applying protocol, Connection reset by peer)
Nov 29 11:02:34 demo-repanda-2 rpk[516933]: INFO  2024-11-29 11:02:34,644 [shard 2:main] storage - disk_log_impl.cc:2475 - Removing "/var/lib/redpanda/data/kafka/cache-assignment-pending-demo/9_5913/35-10-v1.log" (remove_prefix_full_segments, {offset_tracker:{term:10, base_offset:35, committed_offset:36, dirty_offset:36}, compacted_segment=0, finished_self_compaction=0, finished_windowed_compaction=0, generation=5, reader={/var/lib/redpanda/data/kafka/cache-assignment-pending-demo/9_5913/35-10-v1.log, (466 bytes)}, writer=nullptr, cache={cache_size=0, dirty tracker: {min: -9223372036854775808, max: -9223372036854775808}}, compaction_index:nullopt, closed=0, tombstone=0, index={file:/var/lib/redpanda/data/kafka/cache-assignment-pending-demo/9_5913/35-10-v1.base_index, offsets:35, index:{header_bitflags:0, base_offset:35, max_offset:36, base_timestamp:{timestamp: 1732273351093}, max_timestamp:{timestamp: 1732273351093}, batch_timestamps_are_monotonic:1, with_offset:true, non_data_timestamps:0, broker_timestamp:{{timestamp: 1732273351108}}, num_compactible_records_appended:{1}, index(1,1,1)}, step:32768, needs_persistence:0}})
Nov 29 11:05:45 demo-repanda-2 rpk[516933]: INFO  2024-11-29 11:05:45,444 [shard 3:main] storage-gc - segment.cc:631 - Marking compaction index /var/lib/redpanda/data/kafka/__consumer_offsets/2_19/90331319-33-v1.compaction_index as incomplete
Nov 29 11:15:57 demo-repanda-2 rpk[516933]: INFO  2024-11-29 11:15:57,218 [shard 0:main] kafka - server.cc:167 - Disconnected 10.200.118.1:45350 (applying protocol, Connection reset by peer)
Nov 29 11:15:57 demo-repanda-2 rpk[516933]: INFO  2024-11-29 11:15:57,219 [shard 2:main] kafka - server.cc:167 - Disconnected 10.200.118.1:45248 (applying protocol, Connection reset by peer)
Nov 29 11:15:57 demo-repanda-2 rpk[516933]: INFO  2024-11-29 11:15:57,219 [shard 3:main] kafka - server.cc:167 - Disconnected 10.200.118.1:45260 (applying protocol, Connection reset by peer)
Nov 29 11:15:57 demo-repanda-2 rpk[516933]: INFO  2024-11-29 11:15:57,219 [shard 0:main] kafka - server.cc:167 - Disconnected 10.200.118.1:45214 (applying protocol, Connection reset by peer)
Nov 29 11:15:57 demo-repanda-2 rpk[516933]: INFO  2024-11-29 11:15:57,219 [shard 3:main] kafka - server.cc:167 - Disconnected 10.200.118.1:45208 (applying protocol, Connection reset by peer)
Nov 29 11:15:57 demo-repanda-2 rpk[516933]: INFO  2024-11-29 11:15:57,219 [shard 0:main] kafka - server.cc:167 - Disconnected 10.200.118.1:45200 (applying protocol, Connection reset by peer)
Nov 29 11:15:57 demo-repanda-2 rpk[516933]: INFO  2024-11-29 11:15:57,219 [shard 0:main] kafka - server.cc:167 - Disconnected 10.200.118.1:45186 (applying protocol, Connection reset by peer)
Nov 29 11:15:57 demo-repanda-2 rpk[516933]: INFO  2024-11-29 11:15:57,219 [shard 3:main] kafka - server.cc:167 - Disconnected 10.200.118.1:45166 (applying protocol, Connection reset by peer)
Nov 29 11:15:57 demo-repanda-2 rpk[516933]: INFO  2024-11-29 11:15:57,219 [shard 3:main] kafka - server.cc:167 - Disconnected 10.200.118.1:45154 (applying protocol, Connection reset by peer)
Nov 29 11:15:57 demo-repanda-2 rpk[516933]: INFO  2024-11-29 11:15:57,219 [shard 3:main] kafka - server.cc:167 - Disconnected 10.200.118.1:45144 (applying protocol, Connection reset by peer)
Nov 29 11:15:57 demo-repanda-2 rpk[516933]: INFO  2024-11-29 11:15:57,220 [shard 2:main] kafka - server.cc:167 - Disconnected 10.200.118.1:45104 (applying protocol, Connection reset by peer)
Nov 29 11:15:57 demo-repanda-2 rpk[516933]: INFO  2024-11-29 11:15:57,220 [shard 0:main] kafka - server.cc:167 - Disconnected 10.200.118.1:45032 (applying protocol, Connection reset by peer)
Nov 29 11:15:57 demo-repanda-2 rpk[516933]: INFO  2024-11-29 11:15:57,220 [shard 2:main] kafka - server.cc:167 - Disconnected 10.200.118.1:45076 (applying protocol, Connection reset by peer)
Nov 29 11:15:57 demo-repanda-2 rpk[516933]: INFO  2024-11-29 11:15:57,220 [shard 1:main] kafka - server.cc:167 - Disconnected 10.200.118.1:45018 (applying protocol, Connection reset by peer)
Nov 29 11:15:57 demo-repanda-2 rpk[516933]: INFO  2024-11-29 11:15:57,220 [shard 3:main] kafka - server.cc:167 - Disconnected 10.200.118.1:45134 (applying protocol, Connection reset by peer)
Nov 29 11:15:57 demo-repanda-2 rpk[516933]: INFO  2024-11-29 11:15:57,220 [shard 2:main] kafka - server.cc:167 - Disconnected 10.200.118.1:45008 (applying protocol, Connection reset by peer)
Nov 29 11:15:57 demo-repanda-2 rpk[516933]: INFO  2024-11-29 11:15:57,220 [shard 1:main] kafka - server.cc:167 - Disconnected 10.200.118.1:45002 (applying protocol, Connection reset by peer)
Nov 29 11:15:57 demo-repanda-2 rpk[516933]: INFO  2024-11-29 11:15:57,220 [shard 1:main] kafka - server.cc:167 - Disconnected 10.200.118.1:44960 (applying protocol, Connection reset by peer)
Nov 29 11:15:57 demo-repanda-2 rpk[516933]: INFO  2024-11-29 11:15:57,220 [shard 0:main] kafka - server.cc:167 - Disconnected 10.200.118.1:44990 (applying protocol, Connection reset by peer)
Nov 29 11:15:57 demo-repanda-2 rpk[516933]: INFO  2024-11-29 11:15:57,220 [shard 3:main] kafka - server.cc:167 - Disconnected 10.200.118.1:44936 (applying protocol, Connection reset by peer)
Nov 29 11:15:57 demo-repanda-2 rpk[516933]: INFO  2024-11-29 11:15:57,220 [shard 1:main] kafka - server.cc:167 - Disconnected 10.200.118.1:44934 (applying protocol, Connection reset by peer)
Nov 29 11:15:57 demo-repanda-2 rpk[516933]: INFO  2024-11-29 11:15:57,220 [shard 2:main] kafka - server.cc:167 - Disconnected 10.200.118.1:60214 (applying protocol, Connection reset by peer)
Nov 29 11:16:12 demo-repanda-2 rpk[516933]: INFO  2024-11-29 11:16:12,364 [shard 2:main] tx - tx_gateway_frontend.cc:513 - [tx_id=demo@demo-core-1@core@default+metadata-entry+Augmentation] trying to abort transaction. pid: {producer_identity: id=13010, epoch=91} sequence: 1
Nov 29 11:16:12 demo-repanda-2 rpk[516933]: INFO  2024-11-29 11:16:12,364 [shard 2:main] tx - tx_gateway_frontend.cc:578 - [tx_id=demo@demo-core-1@core@default+metadata-entry+Augmentation] found transaction {id: demo@demo-core-1@core@default+metadata-entry+Augmentation, status: ongoing, pid: {producer_identity: id=13010, epoch=91}, last_pid: {producer_identity: id=-1, epoch=-1}, etag: 39, seq: 1, partitions: } to abort
Nov 29 11:16:12 demo-repanda-2 rpk[516933]: INFO  2024-11-29 11:16:12,464 [shard 2:main] tx - tx_gateway_frontend.cc:513 - [tx_id=demo@demo-core-1@core@default+metadata-entry+Augmentation] trying to abort transaction. pid: {producer_identity: id=13010, epoch=91} sequence: 1
Nov 29 11:16:16 demo-repanda-2 rpk[516933]: INFO  2024-11-29 11:16:16,902 [shard 1:main] kafka - server.cc:167 - Disconnected 10.200.118.1:60980 (applying protocol, Connection reset by peer)
Nov 29 11:16:16 demo-repanda-2 rpk[516933]: INFO  2024-11-29 11:16:16,902 [shard 1:main] kafka - server.cc:167 - Disconnected 10.200.118.1:60943 (applying protocol, Connection reset by peer)
Nov 29 11:16:16 demo-repanda-2 rpk[516933]: INFO  2024-11-29 11:16:16,902 [shard 2:main] kafka - server.cc:167 - Disconnected 10.200.118.1:60946 (applying protocol, Connection reset by peer)
Nov 29 11:17:03 demo-repanda-2 rpk[516933]: INFO  2024-11-29 11:17:03,680 [shard 1:main] tx - tx_gateway_frontend.cc:513 - [tx_id=demo@demo-core-2@core@default+incidents+Augmentation] trying to abort transaction. pid: {producer_identity: id=13020, epoch=148} sequence: 0
Nov 29 11:17:03 demo-repanda-2 rpk[516933]: INFO  2024-11-29 11:17:03,680 [shard 1:main] tx - tx_gateway_frontend.cc:578 - [tx_id=demo@demo-core-2@core@default+incidents+Augmentation] found transaction {id: demo@demo-core-2@core@default+incidents+Augmentation, status: ongoing, pid: {producer_identity: id=13020, epoch=148}, last_pid: {producer_identity: id=-1, epoch=-1}, etag: 37, seq: 0, partitions: } to abort
Nov 29 11:17:03 demo-repanda-2 rpk[516933]: INFO  2024-11-29 11:17:03,689 [shard 1:main] tx - tx_gateway_frontend.cc:2497 - [tx_id=demo@demo-core-2@core@default+incidents+Augmentation] producer {producer_identity: id=13020, epoch=148} is fenced of by {producer_identity: id=13020, epoch=149}
Nov 29 11:17:03 demo-repanda-2 rpk[516933]: INFO  2024-11-29 11:17:03,866 [shard 1:main] tx - tx_gateway_frontend.cc:513 - [tx_id=demo@demo-core-2@core@default+incidents+Augmentation] trying to abort transaction. pid: {producer_identity: id=13020, epoch=148} sequence: 0
Nov 29 11:17:03 demo-repanda-2 rpk[516933]: INFO  2024-11-29 11:17:03,867 [shard 1:main] tx - tx_gateway_frontend.cc:513 - [tx_id=demo@demo-core-2@core@default+incidents+Augmentation] trying to abort transaction. pid: {producer_identity: id=13020, epoch=148} sequence: 0
Nov 29 11:17:03 demo-repanda-2 rpk[516933]: message repeated 4 times: [ INFO  2024-11-29 11:17:03,867 [shard 1:main] tx - tx_gateway_frontend.cc:513 - [tx_id=demo@demo-core-2@core@default+incidents+Augmentation] trying to abort transaction. pid: {producer_identity: id=13020, epoch=148} sequence: 0]
Nov 29 11:17:03 demo-repanda-2 rpk[516933]: INFO  2024-11-29 11:17:03,868 [shard 1:main] tx - tx_gateway_frontend.cc:513 - [tx_id=demo@demo-core-2@core@default+incidents+Augmentation] trying to abort transaction. pid: {producer_identity: id=13020, epoch=148} sequence: 0
Nov 29 11:17:03 demo-repanda-2 rpk[516933]: message repeated 3 times: [ INFO  2024-11-29 11:17:03,868 [shard 1:main] tx - tx_gateway_frontend.cc:513 - [tx_id=demo@demo-core-2@core@default+incidents+Augmentation] trying to abort transaction. pid: {producer_identity: id=13020, epoch=148} sequence: 0]
Nov 29 11:17:03 demo-repanda-2 rpk[516933]: INFO  2024-11-29 11:17:03,877 [shard 1:main] tx - tx_gateway_frontend.cc:513 - [tx_id=demo@demo-core-2@core@default+incidents+Augmentation] trying to abort transaction. pid: {producer_identity: id=13020, epoch=148} sequence: 0
Nov 29 11:17:03 demo-repanda-2 rpk[516933]: INFO  2024-11-29 11:17:03,878 [shard 1:main] tx - tx_gateway_frontend.cc:513 - [tx_id=demo@demo-core-2@core@default+incidents+Augmentation] trying to abort transaction. pid: {producer_identity: id=13020, epoch=148} sequence: 0
Nov 29 11:17:03 demo-repanda-2 rpk[516933]: INFO  2024-11-29 11:17:03,878 [shard 1:main] tx - tx_gateway_frontend.cc:513 - [tx_id=demo@demo-core-2@core@default+incidents+Augmentation] trying to abort transaction. pid: {producer_identity: id=13020, epoch=148} sequence: 0
Nov 29 11:17:03 demo-repanda-2 rpk[516933]: INFO  2024-11-29 11:17:03,879 [shard 1:main] tx - tx_gateway_frontend.cc:513 - [tx_id=demo@demo-core-2@core@default+incidents+Augmentation] trying to abort transaction. pid: {producer_identity: id=13020, epoch=148} sequence: 0
Nov 29 11:17:04 demo-repanda-2 rpk[516933]: INFO  2024-11-29 11:17:04,782 [shard 1:main] tx - tx_gateway_frontend.cc:2497 - [tx_id=demo@demo-core-2@core@default+incidents+Augmentation] producer {producer_identity: id=13020, epoch=148} is fenced of by {producer_identity: id=13020, epoch=149}
Nov 29 11:18:05 demo-repanda-2 rpk[516933]: INFO  2024-11-29 11:18:05,871 [shard 1:main] tx - tx_gateway_frontend.cc:513 - [tx_id=demo@demo-core-2@core@default+incidents+Augmentation] trying to abort transaction. pid: {producer_identity: id=13020, epoch=150} sequence: 0

node-3 / broker 2

Nov 27 02:44:45 demo-redpanda-3 rpk[500671]: INFO  2024-11-27 02:44:45,756 [shard 3:main] rpc - transport.cc:214 - RPC timeout (3000 ms) to {host: 10.200.230.21, port: 33145}, method: raftgen::heartbeat_v2, correlation id: 5524273, 10 in flight, time since: {init: 3000 ms, enqueue: 3000 ms, memory_reserved: 3000 ms dispatch: 3000 ms, written: unset}, flushed:
 false
Nov 27 02:44:45 demo-redpanda-3 rpk[500671]: INFO  2024-11-27 02:44:45,833 [shard 0:main] rpc - transport.cc:214 - RPC timeout (100 ms) to {host: 10.200.230.21, port: 33145}, method: node_status_rpc::node_status, correlation id: 5009658, 1 in flight, time since: {init: 100 ms, enqueue: 100 ms, memory_reserved: 100 ms dispatch: 100 ms, written: unset}, flushed
: false
Nov 27 02:44:45 demo-redpanda-3 rpk[500671]: INFO  2024-11-27 02:44:45,846 [shard 2:main] rpc - transport.cc:214 - RPC timeout (3000 ms) to {host: 10.200.230.21, port: 33145}, method: raftgen::heartbeat_v2, correlation id: 3818598, 10 in flight, time since: {init: 3000 ms, enqueue: 3000 ms, memory_reserved: 3000 ms dispatch: 291 ms, written: unset}, flushed: 
false
Nov 27 02:44:46 demo-redpanda-3 rpk[500671]: INFO  2024-11-27 02:44:46,034 [shard 0:main] rpc - transport.cc:214 - RPC timeout (100 ms) to {host: 10.200.230.21, port: 33145}, method: node_status_rpc::node_status, correlation id: 5009659, 1 in flight, time since: {init: 100 ms, enqueue: 100 ms, memory_reserved: 100 ms dispatch: 100 ms, written: unset}, flushed
: false
Nov 27 02:44:46 demo-redpanda-3 rpk[500671]: INFO  2024-11-27 02:44:46,034 [shard 1:main] rpc - transport.cc:214 - RPC timeout (3000 ms) to {host: 10.200.230.21, port: 33145}, method: raftgen::heartbeat_v2, correlation id: 3980079, 10 in flight, time since: {init: 3000 ms, enqueue: 3000 ms, memory_reserved: 3000 ms dispatch: 299 ms, written: unset}, flushed: 
false
Nov 27 02:44:46 demo-redpanda-3 rpk[500671]: INFO  2024-11-27 02:44:46,054 [shard 0:main] rpc - transport.cc:214 - RPC timeout (3000 ms) to {host: 10.200.230.21, port: 33145}, method: raftgen::heartbeat_v2, correlation id: 3983955, 11 in flight, time since: {init: 3000 ms, enqueue: 3000 ms, memory_reserved: 3000 ms dispatch: 599 ms, written: unset}, flushed: 
false
Nov 27 02:44:46 demo-redpanda-3 rpk[500671]: INFO  2024-11-27 02:44:46,056 [shard 3:main] rpc - transport.cc:214 - RPC timeout (3000 ms) to {host: 10.200.230.21, port: 33145}, method: raftgen::heartbeat_v2, correlation id: 5524274, 10 in flight, time since: {init: 3000 ms, enqueue: 3000 ms, memory_reserved: 3000 ms dispatch: 3000 ms, written: unset}, flushed:
 false
Nov 27 02:44:46 demo-redpanda-3 rpk[500671]: INFO  2024-11-27 02:44:46,147 [shard 2:main] rpc - transport.cc:214 - RPC timeout (3000 ms) to {host: 10.200.230.21, port: 33145}, method: raftgen::heartbeat_v2, correlation id: 3818599, 10 in flight, time since: {init: 3000 ms, enqueue: 3000 ms, memory_reserved: 3000 ms dispatch: 291 ms, written: unset}, flushed: 
false
Nov 27 02:44:46 demo-redpanda-3 rpk[500671]: INFO  2024-11-27 02:44:46,234 [shard 0:main] rpc - transport.cc:214 - RPC timeout (100 ms) to {host: 10.200.230.21, port: 33145}, method: node_status_rpc::node_status, correlation id: 5009660, 1 in flight, time since: {init: 100 ms, enqueue: 100 ms, memory_reserved: 100 ms dispatch: 100 ms, written: unset}, flushed
: false
Nov 27 02:44:46 demo-redpanda-3 rpk[500671]: INFO  2024-11-27 02:44:46,334 [shard 1:main] rpc - transport.cc:214 - RPC timeout (3000 ms) to {host: 10.200.230.21, port: 33145}, method: raftgen::heartbeat_v2, correlation id: 3980080, 10 in flight, time since: {init: 3000 ms, enqueue: 3000 ms, memory_reserved: 3000 ms dispatch: 299 ms, written: unset}, flushed: 
false
Nov 27 02:44:46 demo-redpanda-3 rpk[500671]: INFO  2024-11-27 02:44:46,354 [shard 0:main] rpc - transport.cc:214 - RPC timeout (3000 ms) to {host: 10.200.230.21, port: 33145}, method: raftgen::heartbeat_v2, correlation id: 3983956, 11 in flight, time since: {init: 3000 ms, enqueue: 3000 ms, memory_reserved: 3000 ms dispatch: 599 ms, written: unset}, flushed: 
false
Nov 27 02:44:46 demo-redpanda-3 rpk[500671]: INFO  2024-11-27 02:44:46,356 [shard 3:main] rpc - transport.cc:214 - RPC timeout (3000 ms) to {host: 10.200.230.21, port: 33145}, method: raftgen::heartbeat_v2, correlation id: 5524275, 10 in flight, time since: {init: 3000 ms, enqueue: 3000 ms, memory_reserved: 3000 ms dispatch: 3000 ms, written: unset}, flushed:
 false
Nov 27 02:44:46 demo-redpanda-3 rpk[500671]: INFO  2024-11-27 02:44:46,435 [shard 0:main] rpc - transport.cc:214 - RPC timeout (100 ms) to {host: 10.200.230.21, port: 33145}, method: node_status_rpc::node_status, correlation id: 5009661, 1 in flight, time since: {init: 100 ms, enqueue: 100 ms, memory_reserved: 100 ms dispatch: 100 ms, written: unset}, flushed
: false
Nov 27 02:44:46 demo-redpanda-3 rpk[500671]: INFO  2024-11-27 02:44:46,448 [shard 2:main] rpc - transport.cc:214 - RPC timeout (3000 ms) to {host: 10.200.230.21, port: 33145}, method: raftgen::heartbeat_v2, correlation id: 3818600, 10 in flight, time since: {init: 3000 ms, enqueue: 3000 ms, memory_reserved: 3000 ms dispatch: 291 ms, written: unset}, flushed: 
false
Nov 27 02:44:46 demo-redpanda-3 rpk[500671]: INFO  2024-11-27 02:44:46,534 [shard 3:main] tx - [N:process-identifier-assignment S:Empty G:173] group.cc:3181 - attempting expiration of producer: {producer_identity: id=2005, epoch=0} transaction
Nov 27 02:44:46 demo-redpanda-3 rpk[500671]: WARN  2024-11-27 02:44:46,534 [shard 3:main] tx - [N:process-identifier-assignment S:Empty G:173] group.cc:3190 - producer {producer_identity: id=2005, epoch=0} transaction expiration result: tx::errc::not_coordinator
Nov 27 02:44:46 demo-redpanda-3 rpk[500671]: INFO  2024-11-27 02:44:46,534 [shard 3:main] tx - [N:process-identifier-assignment S:Empty G:173] group.cc:3181 - attempting expiration of producer: {producer_identity: id=2035, epoch=0} transaction
Nov 27 02:44:46 demo-redpanda-3 rpk[500671]: WARN  2024-11-27 02:44:46,534 [shard 3:main] tx - [N:process-identifier-assignment S:Empty G:173] group.cc:3190 - producer {producer_identity: id=2035, epoch=0} transaction expiration result: tx::errc::not_coordinator
Nov 27 02:44:46 demo-redpanda-3 rpk[500671]: INFO  2024-11-27 02:44:46,534 [shard 3:main] tx - [N:process-identifier-assignment S:Empty G:173] group.cc:3181 - attempting expiration of producer: {producer_identity: id=2048, epoch=2} transaction
Nov 27 02:44:46 demo-redpanda-3 rpk[500671]: WARN  2024-11-27 02:44:46,534 [shard 3:main] tx - [N:process-identifier-assignment S:Empty G:173] group.cc:3190 - producer {producer_identity: id=2048, epoch=2} transaction expiration result: tx::errc::not_coordinator
Nov 27 02:44:46 demo-redpanda-3 rpk[500671]: INFO  2024-11-27 02:44:46,534 [shard 3:main] tx - [N:process-identifier-assignment S:Empty G:173] group.cc:3181 - attempting expiration of producer: {producer_identity: id=2084, epoch=0} transaction
Nov 27 02:44:46 demo-redpanda-3 rpk[500671]: WARN  2024-11-27 02:44:46,534 [shard 3:main] tx - [N:process-identifier-assignment S:Empty G:173] group.cc:3190 - producer {producer_identity: id=2084, epoch=0} transaction expiration result: tx::errc::not_coordinator
Nov 27 02:44:46 demo-redpanda-3 rpk[500671]: INFO  2024-11-27 02:44:46,534 [shard 3:main] tx - [N:process-identifier-assignment S:Empty G:173] group.cc:3181 - attempting expiration of producer: {producer_identity: id=2105, epoch=0} transaction
Nov 27 02:44:46 demo-redpanda-3 rpk[500671]: WARN  2024-11-27 02:44:46,534 [shard 3:main] tx - [N:process-identifier-assignment S:Empty G:173] group.cc:3190 - producer {producer_identity: id=2105, epoch=0} transaction expiration result: tx::errc::not_coordinator
Nov 27 02:44:46 demo-redpanda-3 rpk[500671]: INFO  2024-11-27 02:44:46,534 [shard 3:main] tx - [N:process-identifier-assignment S:Empty G:173] group.cc:3181 - attempting expiration of producer: {producer_identity: id=2131, epoch=0} transaction
Nov 27 02:44:46 demo-redpanda-3 rpk[500671]: WARN  2024-11-27 02:44:46,534 [shard 3:main] tx - [N:process-identifier-assignment S:Empty G:173] group.cc:3190 - producer {producer_identity: id=2131, epoch=0} transaction expiration result: tx::errc::not_coordinator
Nov 27 02:44:46 demo-redpanda-3 rpk[500671]: INFO  2024-11-27 02:44:46,534 [shard 3:main] tx - [N:process-identifier-assignment S:Empty G:173] group.cc:3181 - attempting expiration of producer: {producer_identity: id=2144, epoch=0} transaction

There where a lot of errors; number of errors per date / hour:

ubuntu@demo-redpanda-3:~$ cat /var/log/syslog.1  | grep "not_coordinator" | cut -b 1-9 | uniq -c
   7900 Nov 25 02
  12008 Nov 25 03
   3951 Nov 27 02
  12960 Nov 27 03
  12265 Nov 27 04
  11160 Nov 27 05
  11160 Nov 27 06
  11160 Nov 27 07
  11160 Nov 27 08
  11160 Nov 27 09
  11160 Nov 27 10
  11160 Nov 27 11
  11160 Nov 27 12
  11160 Nov 27 13
  11160 Nov 27 14
  11160 Nov 27 15
  11160 Nov 27 16
  11160 Nov 27 17
  11160 Nov 27 18
  11160 Nov 27 19
  11160 Nov 27 20
  11160 Nov 27 21
  11160 Nov 27 22
  11160 Nov 27 23
  11160 Nov 28 00
  11160 Nov 28 01
  11160 Nov 28 02
  11160 Nov 28 03
  11129 Nov 28 04
  11160 Nov 28 05
  11160 Nov 28 06
  11160 Nov 28 07
  11160 Nov 28 08
  11160 Nov 28 09
  11160 Nov 28 10
  11160 Nov 28 11
  11160 Nov 28 12
  11160 Nov 28 13
  11160 Nov 28 14
  11160 Nov 28 15
  11160 Nov 28 16
  11160 Nov 28 17
  11160 Nov 28 18
  11160 Nov 28 19
  11160 Nov 28 20
  11160 Nov 28 21
  11160 Nov 28 22
  11160 Nov 28 23
  11160 Nov 29 00
  11160 Nov 29 01
  11160 Nov 29 02
  11160 Nov 29 03
  11160 Nov 29 04
  11160 Nov 29 05
  11160 Nov 29 06
  11160 Nov 29 07
  11160 Nov 29 08
  11160 Nov 29 09
  11160 Nov 29 10
   1085 Nov 29 11
 rpk cluster txn describe
COORDINATOR  TRANSACTIONAL-ID                                                         PRODUCER-ID  PRODUCER-EPOCH  STATE           START-TIMESTAMP           TIMEOUT  TOPICS
1            demo@demo-core-2@core@default+incidents+Augmentation                     13020        157             CompleteAbort   2024-11-29T11:21:12.485Z  1m0s
2            process-identifier-assignment+process-identifier-status                  15001        51              CompleteAbort   2024-11-29T11:41:09.503Z  1m0s     
@vsarunas vsarunas added the kind/bug Something isn't working label Dec 5, 2024
@dotnwat
Copy link
Member

dotnwat commented Dec 5, 2024

@mmaslankaprv replication team?

@mmaslankaprv
Copy link
Member

Hello, sorry for the delay here, can you please provide debug bundle please, it can be collected with rpk debug bundle

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants