ApiListener fails to reconnect on heavily loaded system #10376
Comments
I experimented with sending a 'HUP' signal to trigger a daemon configuration reload on one of the instances that was sitting without a connection to the parent endpoint. This caused the daemon to shut down the stalled 'ApiListener' and start a new listener, which then reconnected successfully to the parent.
I did not try '/usr/lib/icinga2/safe-reload' or 'systemctl reload icinga2.service', but both ultimately cause 'SIGHUP' to be delivered, so they will likely work too.
Describe the bug
On a very heavily loaded system the ApiListener may disconnect, presumably due to a timeout. When this happens a reconnect is started, but it times out before the connection is established. After this point no further reconnect attempts are made until the icinga2 daemon is restarted.
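For contrast with the behaviour described above, here is a minimal, language-neutral sketch of the retry behaviour I would expect: a timed-out connection attempt is treated like any other failed attempt and is followed by another attempt. This is not Icinga 2 code (which is C++). The function name 'reconnect_until_connected', its parameters, and the 10 second default interval are all hypothetical and chosen only for illustration.

```python
import time
from typing import Callable, Optional


def reconnect_until_connected(
    connect: Callable[[], None],
    retry_interval: float = 10.0,  # placeholder value, not taken from Icinga 2
    max_attempts: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call connect() until it succeeds and return the number of attempts.

    Any OSError counts as a failed attempt. TimeoutError is a subclass of
    OSError, so a timeout schedules another attempt instead of ending the loop.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            connect()
            return attempts
        except OSError as exc:
            # Give up only if the caller set an explicit attempt limit.
            if max_attempts is not None and attempts >= max_attempts:
                raise
            print(f"attempt {attempts} failed: {exc!r}; retrying")
            sleep(retry_interval)


if __name__ == "__main__":
    state = {"calls": 0}

    def slow_parent() -> None:
        # Simulated endpoint: times out twice, then accepts the connection.
        state["calls"] += 1
        if state["calls"] < 3:
            raise TimeoutError("connect timed out")

    made = reconnect_until_connected(slow_parent, retry_interval=0.0)
    print(f"connected after {made} attempts")
```

The point of the sketch is only that the timeout path and the retry path are the same path. In the failing instances, the timeout appears to end the reconnect cycle instead.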
To Reproduce
I cannot give a step-by-step guide to reproducing this, other than having a system that occasionally gets itself into an extremely high load situation.
Note that the remote endpoint is in a container on the same heavily loaded computer, which explains its slowness in responding to the connection request. Many other connections to that same remote endpoint continue to work, which suggests to me that the problem lies in the local endpoint and not the remote one.
Expected behavior
The endpoints should reconnect after the disruption. The following log is from another instance that did reconnect successfully after the same high load situation.
Your Environment
Include as many relevant details about the environment you experienced the problem in
Version used ('icinga2 --version'): Observed on multiple versions of icinga2
version: r2.14.5-1
version: r2.13.6-1
version: r2.12.3-1
Operating System and version: linux amd64. Ubuntu 24.04 kernel with various Debian/Ubuntu versions running in LXC containers.
Enabled features ('icinga2 feature list'): Enabled features: api checker mainlog
Config validation ('icinga2 daemon -C'):
If you run multiple Icinga 2 instances, the 'zones.conf' file (or 'icinga2 object list --type Endpoint' and 'icinga2 object list --type Zone') from all affected nodes.
Additional context
This failure mode is observed on the same systems where I have observed #10355. I think they are two separate issues: the ApiListener problem occurs both with and without the fix for #10355 applied.
I am currently running a custom build of 2.14.5-1 and have added extra logging to 'lib/remote/apilistener.cpp' to trace the path that the failing instances take through 'ApiListener::NewClientHandlerInternal'. The execution continues as far as either 'SendMessage' or 'async_flush' for 'RoleClient'.
That causes a 'system_error' to be thrown which gets caught at
I did not log the 'systemError.code()' value.
I am happy to add more debug and/or test possible fixes.