
AdminClient and DNS lookup issue #3138

@imskrishnan


@sobychacko

We are using Spring Cloud Stream with the Kafka Streams binder. This is what our configuration looks like:

```yaml
spring:
  cloud:
    stream:
      bindings:
        <input>:
          destination: <destination>
      kafka:
        bindings:
          <input>:
            consumer:
              dlq-name: <name>
              configuration:
                offset-reset-strategy: latest
                group.id: <id>
        streams:
          binder:
            auto-create-topics: false
          configuration:
            session.timeout.ms: 12000
            max.poll.interval.ms: 5000
            heartbeat.interval.ms: 40000
            reconnect.backoff.ms: 500
            reconnect.backoff.max.ms: 2000
            retry.backoff.ms: 1000
            retry.backoff.max.ms: 2000
```
1. We are still seeing an AdminClient being instantiated, and as a result we are hitting authentication-related errors (we authenticate with SASL/SSL). What is the root cause of this internally? Why is an AdminClient created at all when `auto-create-topics` is false? A sketch of the binder-level security configuration we are considering follows this list.

2. We are also seeing DNS lookup failures from our K8s clusters when the running consumer application tries to resolve the Kafka broker hostnames to IP addresses. The connection breaks whenever there is DNS maintenance or a DNS change, and the consumer goes into an idle/limbo state and stops consuming messages from the Kafka topic, even though the application appears to be running and logs no errors. We end up having to restart the AKS pods manually, which is causing significant delays and slowing productivity.
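For item 1, this is roughly the direction we are considering: supplying the SASL/SSL client properties at the binder `configuration` level, on the assumption that anything set there is propagated to every client the binder creates, including the internal AdminClient. All values below are placeholders, not our actual settings, and `PLAIN` is just an example mechanism.

```yaml
spring:
  cloud:
    stream:
      kafka:
        streams:
          binder:
            configuration:
              # Assumption: properties set here reach the AdminClient created by
              # the binder, not only the Kafka Streams consumer/producer clients.
              security.protocol: SASL_SSL
              sasl.mechanism: PLAIN
              sasl.jaas.config: >-
                org.apache.kafka.common.security.plain.PlainLoginModule required
                username="<username>" password="<password>";
              ssl.truststore.location: <path-to-truststore>
              ssl.truststore.password: <truststore-password>
```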

We have implemented an auto-healing mechanism, but we could not test it in lower environments because we were unable to reproduce the specific scenario of DNS failing to resolve the broker hostnames. So we are effectively waiting for the issue to show up in production to find out whether the auto-heal mechanism actually works. The way we implemented it is with a Kubernetes liveness probe against the /actuator/health/binders/kstream endpoint that checks the binder status; we restart the pod if the status stays DOWN for 6 minutes (a sketch of the probe follows this paragraph). What we are not sure about is whether this liveness check actually exercises the connection to the broker, or whether it only hits the endpoint and reports a cached status, in which case I believe the pod would never get restarted.
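For clarity, this is roughly what the probe looks like. The port and timings are illustrative, and it assumes the actuator health endpoint returns a non-2xx response (e.g. 503) when the binder health indicator reports DOWN, which is what makes the kubelet count the probe as failed.

```yaml
livenessProbe:
  httpGet:
    path: /actuator/health/binders/kstream   # drill into the binder health component
    port: 8080                               # assumed management/server port
  periodSeconds: 60
  failureThreshold: 6     # ~6 minutes of consecutive failures before the kubelet restarts the container
  timeoutSeconds: 5
```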

Do you have any thoughts on these concerns? I would appreciate any troubleshooting suggestions or alternative implementations we could consider to make the application more resilient when there is a network or DNS outage.

Please let me know if there are any questions or clarifications you need.

Thanks,
Santhosh
