
AdminClient and DNS lookup issue #3138

@imskrishnan


@sobychacko

We are using Spring Cloud Stream with the Kafka Streams binder. This is what our configuration looks like:

```yaml
spring:
  cloud:
    stream:
      bindings:
        <input>:
          destination: <destination>
      kafka:
        bindings:
          <input>:
            consumer:
              dlq-name: <name>
              configuration:
                offset-reset-strategy: latest
                group.id: <id>
        streams:
          binder:
            auto-create-topics: false
          configuration:
            session.timeout.ms: 12000
            max.poll.interval.ms: 5000
            heartbeat.interval.ms: 40000
            reconnect.backoff.ms: 500
            reconnect.backoff.max.ms: 2000
            retry.backoff.ms: 1000
            retry.backoff.max.ms: 2000
```
1. We are still seeing an AdminClient being instantiated, and as a result we are hitting authentication-related errors (we authenticate with SASL/SSL). What is the root cause of this internally? Why is an AdminClient created at all when `auto-create-topics` is false? A sketch of the binder-level security configuration we are considering follows this list.

2. We are also seeing DNS lookup failures from our K8s clusters when the running consumer application tries to resolve the Kafka broker hostnames to IP addresses. The connection breaks whenever there is DNS maintenance or a DNS change, and the consumer goes into an idle/limbo state and stops consuming messages from the Kafka topic, even though the application appears to be running and logs no errors. We end up having to restart the AKS pods manually, which is causing significant delays and slowing productivity.
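For item 1, this is roughly the direction we are considering: supplying the SASL/SSL client properties at the binder `configuration` level, on the assumption that anything set there is propagated to every client the binder creates, including the internal AdminClient. All values below are placeholders, not our actual settings, and `PLAIN` is just an example mechanism.

```yaml
spring:
  cloud:
    stream:
      kafka:
        streams:
          binder:
            configuration:
              # Assumption: properties set here reach the AdminClient created by
              # the binder, not only the Kafka Streams consumer/producer clients.
              security.protocol: SASL_SSL
              sasl.mechanism: PLAIN
              sasl.jaas.config: >-
                org.apache.kafka.common.security.plain.PlainLoginModule required
                username="<username>" password="<password>";
              ssl.truststore.location: <path-to-truststore>
              ssl.truststore.password: <truststore-password>
```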

We have implemented an auto-healing mechanism, but we could not test it in lower environments because we were unable to reproduce the specific scenario of DNS failing to resolve the broker hostnames. So we are effectively waiting for the issue to show up in production to find out whether the auto-heal mechanism actually works. The way we implemented it is with a Kubernetes liveness probe against the /actuator/health/binders/kstream endpoint that checks the binder status; we restart the pod if the status stays DOWN for 6 minutes (a sketch of the probe follows this paragraph). What we are not sure about is whether this liveness check actually exercises the connection to the broker, or whether it only hits the endpoint and reports a cached status, in which case I believe the pod would never get restarted.
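For clarity, this is roughly what the probe looks like. The port and timings are illustrative, and it assumes the actuator health endpoint returns a non-2xx response (e.g. 503) when the binder health indicator reports DOWN, which is what makes the kubelet count the probe as failed.

```yaml
livenessProbe:
  httpGet:
    path: /actuator/health/binders/kstream   # drill into the binder health component
    port: 8080                               # assumed management/server port
  periodSeconds: 60
  failureThreshold: 6     # ~6 minutes of consecutive failures before the kubelet restarts the container
  timeoutSeconds: 5
```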

Do you have any thoughts on these concerns? I would appreciate any troubleshooting suggestions or alternative implementations we could consider to make the application more resilient when there is a network or DNS outage.

Please let me know if there are any questions or clarifications you need.

Thanks,
Santhosh
