
Cloudstack 4.20 UI cannot be operated, all timeouts. Restart the server to recover, it happens every 10 days. There are a lot of duplicate logs in the log, and one record exceeds 3MB #10578

Open
ztskycn opened this issue Mar 17, 2025 · 15 comments


ztskycn commented Mar 17, 2025

Description:
I'm experiencing a critical issue with CloudStack 4.20 where all API endpoints become unresponsive approximately every 10 days. The only temporary resolution is to restart the CloudStack management server.

Observed Behavior:

API requests timeout/fail completely after ~10 days of uptime

No explicit ERROR messages in logs prior to outage

Found an unusually large INFO-level log entry (3MB per line) that might be relevant

Attached log file: [filename.log]

Environment:

CloudStack Version: 4.20.0.0

OS: Ubuntu 24.04

Steps to Reproduce:

Start CloudStack management server

Operate normally for ~10 days

API services become unavailable without obvious triggers

Expected Behavior:
API endpoints should remain available continuously without requiring manual restarts.

Additional Context:

The large INFO-level log entry repeats periodically (full content attached)

No observed resource exhaustion (CPU/MEM) before outages

Problem persists across multiple maintenance windows

Troubleshooting Attempted:

Reviewed standard error logs - no smoking gun

Monitored system resources - no apparent bottlenecks

Server restart temporarily resolves the issue

Request:
Please help investigate:

Potential memory leaks or thread blocking in the 4.20 codebase

Significance of the oversized INFO log entries

Update to Original Issue:
Further analysis of the oversized INFO log reveals repetitive entries related to createVPCOffering API calls. The JSON payload in these logs appears to be abnormally large (3MB per line) and contains repetitive configuration data.

Key Log Excerpt Pattern:
INFO [c.c.a.ApiServlet] (qtp123456789-42:) {cmd="createVPCOffering", ... JSON payload (3MB) ...}

[image attached to the issue]


boring-cyborg bot commented Mar 17, 2025

Thanks for opening your first issue here! Be sure to follow the issue template!

@DaanHoogland
Contributor

@ztskycn, you will probably find a reason above the log line you are referring to that would explain the issue. createVPCOffering is a one-off call that should not be repeated. It might be that it fails and a retry is done?
There is no log attached, so please add it.

@ztskycn ztskycn changed the title API Service Becomes Unresponsive Every ~10 Days in CloudStack 4.20 - Requires Server Restart Cloudstack 4.20 UI cannot be operated, all timeouts. Restart the server to recover, it happens every 10 days. There are a lot of duplicate logs in the log, and one record exceeds 3MB Mar 20, 2025
@weizhouapache
Member

With the upgrade to Log4j 2.x (#7131), the content of all API responses is logged.

This issue is caused by the massive response of the listApis API.
2025-03-18 14:17:53,683 INFO [c.c.a.ApiServlet] (qtp2038105753-4077:[ctx-c7d0b173, ctx-8784ed60]) (logid:d5c7cecd) (userId=2 accountId=2 sessionId=node0q691iwxlhj07l65hzs2gg3ru2) 15.178 -- GET command=listApis&response=json&sessionkey=Sc6JtXZo8btFy6yP7WA9888BXIs 200 {"listapisresponse":{"count":812,"api":[{"name":"createVPCOffering","description":"Creates VPC offering","isasync":true,"related":"updateVPCOffering,listVPCOfferings","params":[{"name":"serviceproviderlist","description":"provider to service mapping. If not specified, the provider for the service will be mapped to the default provider on the physical network","type":"map","length":255,"required":false},

I am not sure if it is configurable.
Maybe @JoaoJandre can give some advice.

@bernardodemarco
Collaborator

with upgrade to the log4j 2.x (#7131) , the content of all API responses are logged.

@weizhouapache, yes, I believe #10567 is handling that (cc @DaanHoogland, @gpordeus)

@weizhouapache
Member

with upgrade to the log4j 2.x (#7131) , the content of all API responses are logged.

@weizhouapache, yes, I believe #10567 is handling that (cc @DaanHoogland, @gpordeus)

thanks @bernardodemarco

I did not look into the PR.
Does it introduce any changes to management-server.log, other than fixing apilog.log?

@JoaoJandre
Contributor

with upgrade to the log4j 2.x (#7131) , the content of all API responses are logged.

this issue is caused by massive response of API listApis. [long log excerpt omitted; quoted in full above]

I am not sure if it is configurable. maybe @JoaoJandre can give some advise

Yes, @weizhouapache @ztskycn, this was changed (by mistake) by PR #7131. Originally, the API response logs were written to a specific logger/appender/file. #10567 should fix this so that the default behavior goes back to what it used to be.

One workaround to limit the size of your log lines is to wrap the pattern of the management log appender's PatternLayout in the %maxLen converter. It would look something like this:

      <RollingFile name="FILE" append="true" fileName="/var/log/cloudstack/management/management-server.log" filePattern="/var/log/cloudstack/management/management-server.log.%d{yyyy-MM-dd}.gz">
         <ThresholdFilter level="TRACE" onMatch="ACCEPT" onMismatch="DENY"/>
         <Policies>
            <TimeBasedTriggeringPolicy/>
         </Policies>
         <PatternLayout pattern="%maxLen{%d{DEFAULT} %-5p [%c{1.}] (%t:%x) (logid:%X{logcontextid}) %m%ex{filters(${filters})}}{1000}%n"/>
      </RollingFile>

Using this configuration, each log line would have at most 1000 characters. This number should be tweaked so that you can still read important log information but those enormous JSON logs are truncated. See https://logging.apache.org/log4j/2.x/manual/pattern-layout.html#converter-max-len for more information on how to configure your logs.
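An alternative sketch, assuming the logger category resolves to `com.cloud.api.ApiServlet` (inferred from the abbreviated `[c.c.a.ApiServlet]` in the excerpts above; verify against your installation's log4j2 configuration), is to raise that logger's level so the INFO-level request/response lines are suppressed entirely:

```xml
<!-- Hypothetical addition to the <Loggers> section of the log4j2 config.
     The category name com.cloud.api.ApiServlet is inferred from the
     abbreviated [c.c.a.ApiServlet] seen in the log excerpts. -->
<Logger name="com.cloud.api.ApiServlet" level="WARN" additivity="false">
   <AppenderRef ref="FILE"/>
</Logger>
```

Note that this also hides the legitimate one-line INFO request entries from that class, so the %maxLen truncation above may be the gentler option.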

In any case, is your server going down because of no disk space left? If not, I fail to see how the log size would affect this.


ztskycn commented Mar 21, 2025


I still can't find the specific problem. Is there any way to troubleshoot it further? And if the log entries are too large, why are such large entries being produced in the first place?

@JoaoJandre
Contributor


I still can't find the specific problem. Is there any way to troubleshoot this problem? If the log is too large, why is there such a large log?

As @weizhouapache and I explained, these large logs should not be printed to this file; this was a bug introduced in version 4.20.0.0, and version 4.21.0.0 should fix it. My last message has a workaround to limit the size of any log line to 1000 characters. I'm sure that if you're willing to dig into the log4j2 documentation, there may be other workarounds as well.

Regarding the issue at hand: you reported that the UI recovers after a restart. This indicates to me that the logs are not the real culprit, since if the logs were filling your management server's storage, a simple restart would not fix the issue; you would need to manually free some space by deleting files.

Therefore, the first step I would take is to pinpoint when the services became unavailable and look for errors in the logs during that period.
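As a rough sketch of that first step (the file path, class names, and timestamps below are illustrative only, not from the real deployment), the suspected outage window can be isolated with standard text tools:

```shell
# Create a small sample log to demonstrate the approach; the path, class
# names, and timestamps are illustrative, not from a real deployment.
cat > /tmp/sample-management-server.log <<'EOF'
2025-03-28 16:50:01,000 INFO  [c.c.a.ApiServlet] (qtp-1:) normal request
2025-03-28 16:54:35,644 WARN  [c.c.a.ApiServer] (qtp-2:) request timed out
2025-03-28 16:55:10,002 ERROR [o.a.c.Example] (qtp-3:) unexpected exception
EOF

# Keep only WARN/ERROR lines inside the suspected outage window
# (here 16:54-16:56); $2 is the timestamp field, $3 the level field.
awk '$2 >= "16:54" && $2 < "16:56" && ($3 == "WARN" || $3 == "ERROR")' \
    /tmp/sample-management-server.log
```

Narrowing to the minutes just before the first failed UI action usually surfaces the first real error much faster than reading the whole file.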

@DaanHoogland DaanHoogland added this to the 4.20.1 milestone Mar 24, 2025
@DaanHoogland
Contributor

@JoaoJandre , could an internal log buffer be the culprit? (as the log size is reported to increase over time.)

@JoaoJandre
Contributor

@JoaoJandre , could an internal log buffer be the culprit? (as the log size is reported to increase over time.)

@DaanHoogland I am not sure, but before making any guesses I would first analyze the logs and the host's condition when the problem happens.

@ztskycn When this happens, what is the CPU usage of the host? RAM? storage?

Also, again, what are the logs saying?


ztskycn commented Mar 28, 2025

@JoaoJandre, could an internal log buffer be the culprit? (as the log size is reported to increase over time.)

@DaanHoogland I am not sure, but before making any guesses I would first analyze the logs and the host's condition when the problem happens.

@ztskycn When this happens, what is the CPU usage of the host? RAM? Storage?

Also, again, what are the logs saying?

CPU and memory usage are generally normal, and no obvious errors have been found in the logs. One detail about my setup: I use a custom network offering with only three services (DHCP, DNS, and user data).

@weizhouapache
Member

There are generally no major problems with the CPU and memory, and no obvious errors have been found in the logs. In one scenario, I now use a custom network computing solution (only three components: dhcp, dns, and user-static data).

@ztskycn
just to confirm, which is the issue: 1, 2, 3, or 4?
(1) management server is down or unreachable
(2) cloudstack-management service is not running
(3) service is running but cloudstack GUI is not working
(4) service and UI are running, but user actions fail.

@ztskycn
Author

ztskycn commented Mar 28, 2025

There are generally no major problems with the CPU and memory, and no obvious errors have been found in the logs. In one scenario, I now use a custom network computing solution (only three components: dhcp, dns, and user-static data).

@ztskycn just to confirm, what is the issue, 1,2,3 or 4 ? (1) management server is down or unreachable (2) cloudstack-management service is not running (3) service is running but cloudstack GUI is not working (4) service and UI are running, but user actions fail.

Going by your definitions above, it is (4).

Most UI pages can be opened, but a few cannot. Any operation that is attempted immediately reports a timeout.

@weizhouapache
Member

@ztskycn just to confirm, what is the issue, 1,2,3 or 4 ? (1) management server is down or unreachable (2) cloudstack-management service is not running (3) service is running but cloudstack GUI is not working (4) service and UI are running, but user actions fail.

Your definition above should be 4.

Most of the UI interfaces can be opened, but a few cannot be opened. If operations are involved, timeout will be reported directly.

if so, I think it should NOT be caused by the logs.

Can you share an example of the logs?

@gpordeus
Collaborator

@ztskycn Are the errors in the log similar to this? The problem description is similar to one I discovered recently; I'm in the middle of searching for where the connection leak is.

2025-03-28 16:54:35,644 WARN  [c.c.v.SystemVmLoadScanner$1] (secstorage-1:[ctx-992586eb]) (logid:9f55b800) Unexpected exception Unable to find on DB, due to: cloud - Connection is not available, request timed out after 30000ms (total=50, active=50, idle=0, waiting=19) com.cloud.utils.exception.CloudRuntimeException: Unable to find on DB, due to: cloud - Connection is not available, request timed out after 30000ms (total=50, active=50, idle=0, waiting=19)
	at com.cloud.utils.db.GenericDaoBase.searchIncludingRemoved(GenericDaoBase.java:441)
	at com.cloud.utils.db.GenericDaoBase.searchIncludingRemoved(GenericDaoBase.java:368)
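For context, the `(total=50, active=50, idle=0, waiting=19)` figures in that excerpt are HikariCP's pool-exhaustion message: every connection is checked out and other threads are queued waiting. HikariCP ships a built-in leak detector, `leakDetectionThreshold`, which logs "Apparent connection leak detected" with the stack trace of any thread holding a connection longer than the threshold. Whether and under what key name CloudStack's `db.properties` exposes this setting is an assumption to verify against your build; in plain HikariCP terms the setting looks like:

```properties
# Plain HikariCP setting (illustrative; the corresponding CloudStack
# db.properties key, if one exists, is an assumption -- check your build).
# Logs "Apparent connection leak detected" with a stack trace when a
# connection is held longer than the threshold, in milliseconds.
leakDetectionThreshold=60000
```

That stack trace would point directly at the code path holding connections, which is exactly what a leak hunt like this needs.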
