Description
We are facing an issue similar to #2914 (Force disable the creation of Kafka Admin Client).
We are using Spring Cloud Stream with the Kafka Streams binder. This is what our configuration looks like:
```yaml
spring:
  cloud:
    stream:
      bindings:
        <input>:
          destination: <destination>
      kafka:
        bindings:
          <input>:
            consumer:
              dlq-name: <name>
              configuration:
                offset-reset-strategy: latest
                group.id: <id>
        streams:
          binder:
            auto-create-topics: false
            configuration:
              session.timeout.ms: 12000
              max.poll.interval.ms: 5000
              heartbeat.interval.ms: 40000
              reconnect.backoff.ms: 500
              reconnect.backoff.max.ms: 2000
              retry.backoff.ms: 1000
              retry.backoff.max.ms: 2000
```
We are still seeing an AdminClient being instantiated, and as a result we are seeing authentication-related failures. We use SASL/SSL authentication. What is the internal root cause of this? Why does an AdminClient get instantiated even with auto-create-topics disabled?
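For context, here is a minimal sketch of how SASL/SSL properties can be supplied at the Kafka Streams binder level, since any client the binder creates, including an AdminClient, would pick them up. These are standard Kafka client properties; the mechanism and JAAS line below are placeholders, not our actual values:

```yaml
spring:
  cloud:
    stream:
      kafka:
        streams:
          binder:
            configuration:
              # Standard Kafka client security properties, passed through
              # to each client the binder instantiates.
              security.protocol: SASL_SSL
              sasl.mechanism: PLAIN   # placeholder mechanism
              sasl.jaas.config: >-
                org.apache.kafka.common.security.plain.PlainLoginModule required
                username="<user>" password="<password>";
```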
Also, we are seeing DNS lookup issues from our K8s clusters when resolving Kafka broker hostnames to IP addresses while our consumer application is running. The connection breaks whenever there is a DNS maintenance or change, and the consumer goes into an idle/limbo state and stops consuming messages from the Kafka topic, even though the application appears to be running without any error logs. We find ourselves having to restart the AKS pods manually, and this is causing significant delays and slowing productivity.
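One mitigation we are considering (untested; the values below are illustrative, and these are standard Kafka client properties rather than anything we currently set) is tightening the client's DNS and connection-setup behaviour at the binder level:

```yaml
spring:
  cloud:
    stream:
      kafka:
        streams:
          binder:
            configuration:
              # Re-resolve broker hostnames and try every returned IP.
              client.dns.lookup: use_all_dns_ips
              # Bound how long a single connection attempt may hang.
              socket.connection.setup.timeout.ms: 10000
              socket.connection.setup.timeout.max.ms: 30000
```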
We have implemented an auto-healing mechanism, but we couldn't test it in lower environments because we were unable to replicate the scenario of DNS failing to resolve the broker host IPs, so we are effectively waiting for the issue to show up in prod to find out whether the auto-heal actually works. The way we implemented it is with a liveness health check against the /actuator/health/binders/kstream endpoint that checks the binder status; we restart the pod if it remains in DOWN status for 6 minutes. We are currently not sure whether the K8s liveness probe actually checks the connection to the broker, or whether it only hits the endpoint and reads its status, in which case I believe the pod wouldn't get restarted.
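For illustration, the probe looks roughly like the sketch below (the port and timing values are assumptions; failureThreshold x periodSeconds yields the ~6-minute window):

```yaml
# Hypothetical liveness probe; management port and timings are assumptions.
livenessProbe:
  httpGet:
    path: /actuator/health/binders/kstream
    port: 8080
  periodSeconds: 30
  failureThreshold: 12   # 12 x 30s = 6 minutes of DOWN before a restart
  timeoutSeconds: 5
```

Our understanding is that Spring Boot's health endpoint returns HTTP 503 when the reported status is DOWN, and the kubelet treats any non-2xx/3xx response as a probe failure, so the probe itself should reflect whatever status the binder health indicator reports; the open question for us is whether that indicator actually goes DOWN when broker connectivity is lost.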
Do you have any thoughts on these concerns? I would like to hear your suggestions on troubleshooting, or on any other implementations we could consider, to build resilience into our application when there's a network outage.
Please let me know if there are any questions or clarifications you need.
Thanks,
Santhosh