ApiListener fails to reconnect on heavily loaded system #10376

oldelvet · 2025-03-13T11:42:13Z

Describe the bug

On a very heavily loaded system the ApiListener may disconnect presumably due to timeout. When this happens a reconnect is started but times out before the connection establishes. After this point no more attempts to reconnect are made until the icinga2 daemon is restarted.

[2025-03-13 00:26:31 +0000] information/Application: We jumped forward in time: 74.132 seconds
[2025-03-13 00:38:07 +0000] information/Application: We jumped forward in time: 713.587 seconds
[2025-03-13 00:39:50 +0000] warning/ApiListener: Removing API client for endpoint 'shirehall'. 0 API clients left.
[2025-03-13 00:39:51 +0000] warning/JsonRpcConnection: API client disconnected for identity 'shirehall'
[2025-03-13 00:40:02 +0000] information/WorkQueue: #6 (ApiListener, RelayQueue) items: 0, rate: 0.6/s (36/min 36/5min 36/15min);
[2025-03-13 00:40:02 +0000] information/WorkQueue: #7 (ApiListener, SyncQueue) items: 0, rate: 0/s (0/min 0/5min 0/15min);
[2025-03-13 00:40:03 +0000] information/ApiListener: Reconnecting to endpoint 'shirehall' via host 'shirehall' and port '5665'
[2025-03-13 00:40:44 +0000] critical/ApiListener: Timeout while reconnecting to endpoint 'shirehall' via host 'shirehall' and port '5665', cancelling attempt
[2025-03-13 00:41:04 +0000] information/ApiListener: New client connection for identity 'shirehall' to [xxxx:xxxx:xxxx:xxxx::xx]:5665
[2025-03-13 00:43:05 +0000] information/ConfigObject: Dumping program state to file '/var/lib/icinga2/icinga2.state'
[2025-03-13 00:45:02 +0000] information/WorkQueue: #6 (ApiListener, RelayQueue) items: 0, rate: 0.466667/s (28/min 65/5min 101/15min);
[2025-03-13 00:45:02 +0000] information/WorkQueue: #7 (ApiListener, SyncQueue) items: 0, rate: 0/s (0/min 0/5min 0/15min);

To Reproduce

I cannot give a step by step guide to reproducing other than having a system that occasionally gets itself into a extremely high load situation.

Note that the remote endpoint is in a container on the same heavily loaded computer so that will explain the slowness in responding to the connection request. Many other connections to that same remote endpoint continue so it does suggest to me that the problem lies in the local endpoint and not remote.

Expected behavior

The endpoints should reconnect after the disruption. The following log is from another that did reconnect successfully after the same high load situation.

[2025-03-13 00:23:36 +0000] information/WorkQueue: #7 (ApiListener, SyncQueue) items: 0, rate: 0/s (0/min 0/5min 0/15min);
[2025-03-13 00:24:52 +0000] information/Application: We jumped forward in time: 20.5464 seconds
[2025-03-13 00:39:55 +0000] information/WorkQueue: #6 (ApiListener, RelayQueue) items: 0, rate: 0.966667/s (58/min 126/5min 290/15min);
[2025-03-13 00:39:55 +0000] information/ConfigObject: Dumping program state to file '/var/lib/icinga2/icinga2.state'
[2025-03-13 00:39:57 +0000] information/Application: We jumped forward in time: 904.955 seconds
[2025-03-13 00:40:00 +0000] information/JsonRpcConnection: No messages for identity 'shirehall' have been received in the last 60 seconds.
[2025-03-13 00:40:05 +0000] information/WorkQueue: #7 (ApiListener, SyncQueue) items: 0, rate: 0/s (0/min 0/5min 0/15min);
[2025-03-13 00:40:06 +0000] warning/ApiListener: Removing API client for endpoint 'shirehall'. 0 API clients left.
[2025-03-13 00:40:06 +0000] warning/JsonRpcConnection: API client disconnected for identity 'shirehall'
[2025-03-13 00:40:15 +0000] information/ApiListener: Reconnecting to endpoint 'shirehall' via host 'shirehall' and port '5665'
[2025-03-13 00:40:44 +0000] critical/ApiListener: Timeout while reconnecting to endpoint 'shirehall' via host 'shirehall' and port '5665', cancelling attempt
[2025-03-13 00:41:06 +0000] information/ApiListener: New client connection for identity 'shirehall' to [xxxx:xxxx:xxxx:xxxx::xx]:5665
[2025-03-13 00:41:06 +0000] information/JsonRpcConnection: Requesting new certificate for this Icinga instance from endpoint 'shirehall'.
[2025-03-13 00:41:06 +0000] information/ApiListener: Sending config updates for endpoint 'shirehall' in zone 'shirehall'.
[2025-03-13 00:41:06 +0000] information/ApiListener: Finished sending config file updates for endpoint 'shirehall' in zone 'shirehall'.
[2025-03-13 00:41:06 +0000] information/ApiListener: Syncing runtime objects to endpoint 'shirehall'.
[2025-03-13 00:41:06 +0000] information/ApiListener: Finished reconnecting to endpoint 'shirehall' via host 'shirehall' and port '5665'
[2025-03-13 00:41:25 +0000] information/ApiListener: Finished syncing runtime objects to endpoint 'shirehall'.
[2025-03-13 00:41:25 +0000] information/ApiListener: Finished sending runtime config updates for endpoint 'shirehall' in zone 'shirehall'.
[2025-03-13 00:41:25 +0000] information/ApiListener: Sending replay log for endpoint 'shirehall' in zone 'shirehall'.
[2025-03-13 00:41:26 +0000] information/ApiListener: Replayed 14 messages.
[2025-03-13 00:41:26 +0000] information/ApiListener: Finished sending replay log for endpoint 'shirehall' in zone 'shirehall'.
[2025-03-13 00:41:26 +0000] information/ApiListener: Finished syncing endpoint 'shirehall' in zone 'shirehall'.

Your Environment

Include as many relevant details about the environment you experienced the problem in

Version used (icinga2 --version):

Observed on multiple versions of icinga2

version: r2.14.5-1
version: r2.13.6-1
version: r2.12.3-1

Operating System and version:

linux amd64. Ubuntu 24.04 kernel with various Debian/Ubuntu versions running in LXC containers.

Enabled features (icinga2 feature list):

Enabled features: api checker mainlog

Config validation (icinga2 daemon -C):
If you run multiple Icinga 2 instances, the zones.conf file (or icinga2 object list --type Endpoint and icinga2 object list --type Zone) from all affected nodes.

root@alpha:~# cat /etc/icinga2/zones.conf
/*

Generated by Icinga 2 node setup commands

on 2020-10-20 20:25:28 +0000
*/

object Endpoint "shirehall" {
host = "shirehall"
port = "5665"
}

object Zone "shirehall" {
endpoints = [ "shirehall" ]
}

object Endpoint "alpha" {
}

object Zone "alpha" {
endpoints = [ "alpha" ]
parent = "shirehall"
}

object Zone "global-templates" {
global = true
}

object Zone "director-global" {
global = true
}

Additional context

This failure mode is observed on the same systems where I have observed #10355 . I think that they are two separate issues and the ApiListener problem occurs both with and without the fix for #10355 applied.

I am currently running a custom build of 2.14.5-1 and have added additional logging into 'lib/remote/apilistener.cpp' to trace the path that the failing instances take through 'ApiListener::NewClientHandlerInternal'. The execution continues as far as either 'SendMessage' or 'async_flush' for 'RoleClient'

if (role == RoleClient) {
JsonRpc::SendMessage(client, new Dictionary({
{ "jsonrpc", "2.0" },
{ "method", "icinga::Hello" },
{ "params", new Dictionary({
{ "version", (double)l_AppVersionInt },
{ "capabilities", (double)l_MyCapabilities }
}) }
}), yc);
client->async_flush(yc);
ctype = ClientJsonRpc;

That causes a 'system_error' to be thrown which gets caught at

} catch (const boost::system::system_error& systemError) {
if (systemError.code() == boost::asio::error::operation_aborted) {
shutDownIfNeeded.Cancel();
}
throw;

I did not log the 'systemError.code()' value.

I am happy to add more debug and/or test possible fixes.

The text was updated successfully, but these errors were encountered:

oldelvet · 2025-03-13T12:35:24Z

I experimented with sending a 'HUP' signal to perform a daemon configuration reload on one of the instances that was sat without a connection to the parent endpoint. This caused the daemon to shutdown the stalled 'ApiListener' and start a new listener that then reconnected successfully to the parent.

root@workshop:# kill -HUP 727
root@workshop:# [2025-03-13 12:22:37 +0000] information/Application: Received request to shut down.
[2025-03-13 12:22:38 +0000] information/Application: Shutting down...
[2025-03-13 12:22:38 +0000] information/CheckerComponent: 'checker' stopped.
[2025-03-13 12:22:38 +0000] information/ApiListener: 'api' stopped.
[2025-03-13 12:22:38 +0000] information/FileLogger: 'main-log' started.
[2025-03-13 12:22:38 +0000] information/ApiListener: 'api' started.
[2025-03-13 12:22:38 +0000] information/ApiListener: Started new listener on '[::]:5665'
[2025-03-13 12:22:38 +0000] information/CheckerComponent: 'checker' started.
[2025-03-13 12:22:38 +0000] information/ConfigItem: Activated all objects.
[2025-03-13 12:22:39 +0000] information/ApiListener: Reconnecting to endpoint 'shirehall' via host 'shirehall' and port '5665'
[2025-03-13 12:22:39 +0000] information/ApiListener: New client connection for identity 'shirehall' to [xxxx:xxxx:xxxx:xxxx::xx]:5665
[2025-03-13 12:22:39 +0000] information/JsonRpcConnection: Requesting new certificate for this Icinga instance from endpoint 'shirehall'.
[2025-03-13 12:22:39 +0000] information/ApiListener: Sending config updates for endpoint 'shirehall' in zone 'shirehall'.
[2025-03-13 12:22:39 +0000] information/ApiListener: Finished sending config file updates for endpoint 'shirehall' in zone 'shirehall'.
[2025-03-13 12:22:39 +0000] information/ApiListener: Syncing runtime objects to endpoint 'shirehall'.
[2025-03-13 12:22:39 +0000] information/ApiListener: Finished reconnecting to endpoint 'shirehall' via host 'shirehall' and port '5665'
[2025-03-13 12:22:39 +0000] information/ApiListener: Finished syncing runtime objects to endpoint 'shirehall'.
[2025-03-13 12:22:39 +0000] information/ApiListener: Finished sending runtime config updates for endpoint 'shirehall' in zone 'shirehall'.
[2025-03-13 12:22:39 +0000] information/ApiListener: Sending replay log for endpoint 'shirehall' in zone 'shirehall'.
[2025-03-13 12:22:39 +0000] information/ApiListener: Replayed 8084 messages.
[2025-03-13 12:22:39 +0000] information/ApiListener: Finished sending replay log for endpoint 'shirehall' in zone 'shirehall'.
[2025-03-13 12:22:39 +0000] information/ApiListener: Finished syncing endpoint 'shirehall' in zone 'shirehall'.
[2025-03-13 12:22:41 +0000] information/ApiListener: Applying config update from endpoint 'shirehall' of zone 'shirehall'.
[2025-03-13 12:22:41 +0000] information/ApiListener: Our production configuration is more recent than the received configuration update. Ignoring configuration file update for path '/var/lib/icinga2/api/zones-stage/global-templates'. Current timestamp '2025-02-17 14:03:51 +0000' (1739801031.214035) >= received timestamp '2025-02-17 14:03:51 +0000' (1739801031.214035).

I did not try '/usr/lib/icinga2/safe-reload' or 'systemctl reload icinga2.service' but they both ultimately cause the 'SIGHUP' to be delivered so will likely work too.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ApiListener fails to reconnect on heavily loaded system #10376

ApiListener fails to reconnect on heavily loaded system #10376

oldelvet commented Mar 13, 2025

oldelvet commented Mar 13, 2025

ApiListener fails to reconnect on heavily loaded system #10376

ApiListener fails to reconnect on heavily loaded system #10376

Comments

oldelvet commented Mar 13, 2025

Describe the bug

To Reproduce

Expected behavior

Your Environment

Additional context

oldelvet commented Mar 13, 2025