Kafka fails to start after node migration in GKE #599

swimand · 2022-04-21T06:22:19Z

GCP automatically migrates nodes when a cluster needs more resources or during certain updates. After a migration the pods receive new ports, which I believe causes a mismatch and leads to a failed connection between the ZooKeeper and Kafka pods. As you can see in the following log excerpt, the Kafka pod finds the ZooKeeper service but fails to connect:

[main-SendThread(cp-zookeeper-headless:2181)] INFO org.apache.zookeeper.ClientCnxn - Socket connection established, initiating session, client: /10.102.2.98:53984, server: cp-zookeeper-headless/10.102.2.98:2181
[main] ERROR io.confluent.admin.utils.ClusterStatus - Timed out waiting for connection to Zookeeper server [cp-zookeeper-headless:2181].
[main-SendThread(cp-zookeeper-headless:2181)] WARN org.apache.zookeeper.ClientCnxn - Client session timed out, have not heard from server in 40001ms for sessionid 0x0
[main] INFO org.apache.zookeeper.ZooKeeper - Session: 0x0 closed

Is there any way to avoid this issue when running the cp-helm-charts in GKE? Am I maybe missing some configuration?
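
The log above shows the TCP socket being established while the ZooKeeper session never completes. One way to narrow that down is to check whether the endpoint behind the service name actually answers as a ZooKeeper server. The following is a minimal sketch, not part of the chart: the helper name `zk_four_letter` is hypothetical, and it assumes the chosen four-letter command is allowed by the server's `4lw.commands.whitelist` setting (on recent ZooKeeper versions often only `srvr` is enabled by default, which is why it is the default here).

```python
import socket


def zk_four_letter(host: str, port: int = 2181, cmd: str = "srvr",
                   timeout: float = 5.0) -> str:
    """Send a ZooKeeper four-letter command and return the raw reply.

    Returns an empty string if the peer accepts the connection but sends
    nothing back before closing or before the timeout expires.
    """
    chunks = []
    # create_connection applies the timeout to connect and to later reads.
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        while True:
            try:
                data = sock.recv(4096)
            except socket.timeout:
                break  # peer stayed silent: treat as "no reply"
            if not data:
                break  # peer closed the connection after replying
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")
```

Calling `zk_four_letter("cp-zookeeper-headless")` from a pod in the same namespace would be the intended use. An empty reply, despite a successful connect, would suggest that whatever is listening on that address is not a healthy ZooKeeper server, which matches the symptom in the log.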

toraxe · 2022-11-29T07:26:45Z

I have the same problem, but with an on-prem solution that uses automatic migration.
Did you find any solution or workaround, @swimand?

swimand · 2022-11-29T07:57:15Z

Sadly, no. As far as I can see, the only method that works is to manually delete the pods so that they are forced to get new addresses to the individual services. Because of this, and the unstructured startup sequence, we are looking into moving to Bitnami's charts instead, as they seem to be configured for a more stable run. We are also creating dedicated node pools for Kafka so that it does not need to migrate so often (theoretically only on node updates).
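
The manual workaround of deleting the pods can be scripted. This is a rough sketch only: the function names are hypothetical, and the namespace and label selector (`app=cp-kafka`) are assumptions that should be checked against the labels your release actually uses (for example with `kubectl get pods --show-labels`).

```python
import subprocess
from typing import List


def build_delete_cmd(namespace: str, selector: str) -> List[str]:
    """Build the kubectl command that deletes all pods matching a label selector."""
    return ["kubectl", "delete", "pod",
            "--namespace", namespace,
            "--selector", selector]


def delete_pods(namespace: str, selector: str, dry_run: bool = True) -> List[str]:
    """Delete matching pods so their controller recreates them.

    With dry_run=True (the default) nothing is executed and the command is
    only returned, so it can be inspected first.
    """
    cmd = build_delete_cmd(namespace, selector)
    if not dry_run:
        # check=True raises CalledProcessError if kubectl exits non-zero.
        subprocess.run(cmd, check=True)
    return cmd
```

For example, `delete_pods("kafka", "app=cp-kafka", dry_run=False)` would delete the Kafka pods in a namespace called `kafka` (an assumed name) and let the StatefulSet bring them back up.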
