The ManagedBalancedConsumer is not resilient to kafka node failures #517
Comments
Thanks for noting this @thebigw4lrus. The …
+1 this is causing problems for us.
Locally, I get this error occasionally when I take down a node in my three-node cluster. I wonder if anyone else has seen this particular traceback. It seems like a race caused by the heartbeat being sent before the consumer has joined the group.
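For illustration, here is a minimal sketch of gating heartbeats on group membership, so a heartbeat can never race the JoinGroup/SyncGroup handshake. This is not pykafka's actual implementation; the class, the `send_heartbeat` callable, and the method names are all hypothetical.

```python
import threading


class GatedHeartbeat:
    """Hypothetical sketch: only send heartbeats once the consumer has
    joined the group, and close the gate during rebalances. Not pykafka's
    real code; `send_heartbeat` is an assumed callable that issues a
    HeartbeatRequest to the group coordinator."""

    def __init__(self, send_heartbeat, interval=3.0):
        self._send_heartbeat = send_heartbeat
        self._interval = interval
        self._joined = threading.Event()  # set once JoinGroup/SyncGroup succeed
        self._stop = threading.Event()

    def on_join_complete(self):
        self._joined.set()

    def on_rebalance_start(self):
        # Clear the gate so no heartbeat races the new generation ID.
        self._joined.clear()

    def run(self):
        # Tick every `interval` seconds until stopped; skip heartbeats
        # while the consumer is not (yet) a group member.
        while not self._stop.wait(self._interval):
            if self._joined.is_set():
                self._send_heartbeat()
```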
More local logs with comments inline.
Cluster update complete; the next heartbeat fails with …
Theory: the generation ID is incremented not only when the Join Group phase is complete, but also when partition leadership in the Kafka cluster changes.
I put together some thoughts in #568, though that branch doesn't yet solve the issue.
Same issue with error code 25, but I didn't take down any Kafka node; it just runs into it.
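For reference, in the Kafka protocol error code 25 is UNKNOWN_MEMBER_ID and 22 is ILLEGAL_GENERATION; both mean the consumer's membership state is stale, so retrying the heartbeat cannot succeed and the consumer must rejoin the group. A sketch of the dispatch a client could perform on a heartbeat error (illustrative only, not pykafka's actual handler):

```python
# Kafka protocol error codes relevant to heartbeats (values from the
# Kafka protocol spec; the handler below is an illustrative sketch).
ILLEGAL_GENERATION = 22
UNKNOWN_MEMBER_ID = 25
REBALANCE_IN_PROGRESS = 27


def handle_heartbeat_error(error_code, rejoin_group, retry_heartbeat):
    """Decide whether a heartbeat failure means 'rejoin' or 'retry'."""
    if error_code in (ILLEGAL_GENERATION, UNKNOWN_MEMBER_ID,
                      REBALANCE_IN_PROGRESS):
        # Our generation ID or member ID is stale: re-run the
        # JoinGroup/SyncGroup handshake to obtain fresh ones.
        rejoin_group()
    else:
        # Other errors (e.g. the coordinator moved) may be transient
        # and can be retried against the cluster.
        retry_heartbeat()
```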
I wonder how we should handle …
@messense the simple answer is to …
Parse.ly does not currently use the …
PyKafka version: master (commit 9ae97cb)
Kafka version: 0.9.0
We have Kafka configured as a 3-node cluster, with a producer producing messages and a managed consumer consuming them. It turned out that whenever a Kafka node fails for any reason (including when we stop one for maintenance), the client is not able to recover.
This is the way we reproduce this problem:
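A minimal sketch of such a setup, assuming placeholder broker addresses, topic, and group names (the exact reproduction steps may differ):

```python
from pykafka import KafkaClient

# Placeholder addresses for the 3-node cluster described above.
client = KafkaClient(hosts="node1:9092,node2:9092,node3:9092")
topic = client.topics[b"test-topic"]  # placeholder topic name

# managed=True yields a ManagedBalancedConsumer, which relies on the
# broker-side group coordinator (Kafka 0.9+) instead of ZooKeeper.
consumer = topic.get_balanced_consumer(
    consumer_group=b"test-group",
    managed=True,
)

# With a producer writing to the topic, stop one broker (e.g. for
# maintenance) while this loop runs; the heartbeat error below then
# surfaces.
for message in consumer:
    if message is not None:
        print(message.offset, message.value)
```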
And the error that we are actually seeing is:
So it seems that the heartbeat method is not able to handle a broker going down. We expected the heartbeat method (and the client in general) to be resilient to a Kafka node failure, given that the other 2 nodes remain in a good state.
We also tried pykafka 2.2.1 and saw a similar problem, except that SocketDisconnectedError was raised by this method. We managed to fix that by choosing a random broker for each retry. The latest pykafka version, however, deals with broker failures slightly differently.
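A simplified sketch of that 2.2.1-era workaround, where `make_request(broker)` stands in for whatever coordinator request was failing (this is not pykafka's actual retry logic):

```python
import random
import time

from pykafka.exceptions import SocketDisconnectedError


def call_with_broker_failover(brokers, make_request, retries=3, backoff=1.0):
    """Sketch: on a socket disconnect, retry the request against a
    randomly chosen broker instead of pinning to the one that just
    died. `make_request(broker)` is an assumed callable."""
    last_exc = None
    for attempt in range(retries):
        broker = random.choice(brokers)  # pick a (hopefully live) broker
        try:
            return make_request(broker)
        except SocketDisconnectedError as exc:
            last_exc = exc
            time.sleep(backoff * (attempt + 1))  # linear backoff between tries
    raise last_exc
```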