You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Is your feature request related to a problem? Please describe.
During redundancy tests (described here), I found that while overall failover works, there are problems in disaster scenarios. The CoreDNS Manager Operator is not resilient enough and could be improved. There are rare situations where CoreDNS stops serving zones, which requires a restart of CoreDNS or the operator. It should monitor CoreDNS events or perform resolve monitoring. If a problem occurs, it should trigger a reload of CoreDNS. In my failover tests, sometimes name resolution was disrupted due to load balancer behavior. More failover and scale tests are needed to investigate this behavior. All tests were done using k3s.
Describe the solution you'd like
I would like the operator to watch CoreDNS events and perform resolve monitoring. If an issue is detected, it should automatically trigger a reload of CoreDNS to ensure it continues serving zones properly.
Describe alternatives you've considered
Manually reloading CoreDNS when an issue is detected.
Using external monitoring tools to watch CoreDNS and trigger reloads.
Additional context
These issues were found during extensive failover tests on a k3s cluster. Improving the resilience of the CoreDNS Manager Operator will ensure more reliable DNS service in air-gapped environments.
The text was updated successfully, but these errors were encountered:
monkale-io
changed the title
Title: Improve Resilience of CoreDNS Manager Operator for Disaster Scenarios
Improve Resilience of CoreDNS Manager Operator for Disaster Scenarios
Jun 12, 2024
Is your feature request related to a problem? Please describe.
During redundancy tests (described here), I found that while overall failover works, there are problems in disaster scenarios. The CoreDNS Manager Operator is not resilient enough and could be improved. There are rare situations where CoreDNS stops serving zones, which requires a restart of CoreDNS or the operator. It should monitor CoreDNS events or perform resolve monitoring. If a problem occurs, it should trigger a reload of CoreDNS. In my failover tests, sometimes name resolution was disrupted due to load balancer behavior. More failover and scale tests are needed to investigate this behavior. All tests were done using k3s.
Describe the solution you'd like
I would like the operator to watch CoreDNS events and perform resolve monitoring. If an issue is detected, it should automatically trigger a reload of CoreDNS to ensure it continues serving zones properly.
Describe alternatives you've considered
Additional context
These issues were found during extensive failover tests on a k3s cluster. Improving the resilience of the CoreDNS Manager Operator will ensure more reliable DNS service in air-gapped environments.
The text was updated successfully, but these errors were encountered: