Improve Resilience of CoreDNS Manager Operator for Disaster Scenarios #9

Open
nikolay-udovik opened this issue Jun 11, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@nikolay-udovik
Contributor

Is your feature request related to a problem? Please describe.
During redundancy tests (described here), I found that overall failover works, but disaster scenarios expose problems: the CoreDNS Manager Operator is not resilient enough. In rare cases CoreDNS stops serving zones, and service only recovers after a restart of CoreDNS or the operator. The operator should monitor CoreDNS events or run resolution checks, and trigger a CoreDNS reload when a problem occurs. In some failover tests, name resolution was also disrupted by load balancer behavior; more failover and scale tests are needed to investigate this. All tests were run on k3s.

Describe the solution you'd like
I would like the operator to watch CoreDNS events and perform resolve monitoring. If an issue is detected, it should automatically trigger a reload of CoreDNS to ensure it continues serving zones properly.
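To make the request concrete, here is a minimal sketch of the resolve-monitoring idea in Python. It is not the operator's actual code: `ResolveMonitor`, `resolves`, the `reload` hook, and the threshold of three consecutive failures are all assumptions chosen for illustration. How the reload is actually performed is left to the caller.

```python
import socket
from dataclasses import dataclass
from typing import Callable


def resolves(name: str) -> bool:
    """Return True if `name` resolves via the system resolver.

    Assumption: the host running this check is configured to use the
    CoreDNS instance under test as its resolver.
    """
    try:
        socket.getaddrinfo(name, None)
        return True
    except socket.gaierror:
        return False


@dataclass
class ResolveMonitor:
    """Counts consecutive probe failures and calls `reload` at a threshold."""

    probe: Callable[[], bool]    # returns True when the test name resolves
    reload: Callable[[], None]   # hypothetical hook that reloads CoreDNS
    failure_threshold: int = 3   # assumed value, not taken from the issue
    consecutive_failures: int = 0

    def check_once(self) -> bool:
        """Run one probe; return True if a reload was triggered."""
        try:
            healthy = self.probe()
        except Exception:
            # A probe that raises is treated the same as a failed lookup.
            healthy = False

        if healthy:
            self.consecutive_failures = 0
            return False

        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.reload()
            # Start counting again so one outage does not cause a reload
            # on every subsequent check.
            self.consecutive_failures = 0
            return True
        return False
```

A caller would run `check_once()` on a timer, for example with `probe=lambda: resolves("some.zone.example")`, where the name is a placeholder for a record in a managed zone. Requiring several consecutive failures before reloading is a design choice meant to avoid reacting to a single transient lookup failure, such as the load balancer disruptions mentioned above.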

Describe alternatives you've considered

  • Manually reloading CoreDNS when an issue is detected.
  • Using external monitoring tools to watch CoreDNS and trigger reloads.

Additional context
These issues were found during extensive failover tests on a k3s cluster. Improving the resilience of the CoreDNS Manager Operator will ensure more reliable DNS service in air-gapped environments.

@monkale-io added the enhancement label Jun 12, 2024
@monkale-io changed the title from "Title: Improve Resilience of CoreDNS Manager Operator for Disaster Scenarios" to "Improve Resilience of CoreDNS Manager Operator for Disaster Scenarios" Jun 12, 2024
@nikolay-udovik
Contributor Author

NOTE: If you have at least three masters, increasing CoreDNS replicas to six will significantly improve disaster recovery.
