
Cloudstack 4.20 UI cannot be operated, all timeouts. Restart the server to recover, it happens every 10 days. There are a lot of duplicate logs in the log, and one record exceeds 3MB #10578

Open
ztskycn opened this issue Mar 17, 2025 · 15 comments


ztskycn commented Mar 17, 2025

Description:
I'm experiencing a critical issue with CloudStack 4.20 where all API endpoints become unresponsive approximately every 10 days. The only temporary resolution is to restart the CloudStack management server.

Observed Behavior:

API requests timeout/fail completely after ~10 days of uptime

No explicit ERROR messages in logs prior to outage

Found an unusually large INFO-level log entry (3MB per line) that might be relevant

Attached log file: [filename.log]

Environment:

CloudStack Version: 4.20.0.0

OS: Ubuntu 24.04

Steps to Reproduce:

Start CloudStack management server

Operate normally for ~10 days

API services become unavailable without obvious triggers

Expected Behavior:
API endpoints should remain available continuously without requiring manual restarts.

Additional Context:

The large INFO-level log entry repeats periodically (full content attached)

No observed resource exhaustion (CPU/MEM) before outages

Problem persists across multiple maintenance windows

Troubleshooting Attempted:

Reviewed standard error logs - no smoking gun

Monitored system resources - no apparent bottlenecks

Server restart temporarily resolves the issue

Request:
Please help investigate:

Potential memory leaks or thread blocking in the 4.20 codebase

Significance of the oversized INFO log entries

Update to Original Issue:
Further analysis of the oversized INFO log reveals repetitive entries related to createVPCOffering API calls. The JSON payload in these logs appears to be abnormally large (3MB per line) and contains repetitive configuration data.

Key Log Excerpt Pattern:
INFO [c.c.a.ApiServlet] (qtp123456789-42:) {cmd="createVPCOffering", ... JSON payload (3MB) ...}

[image attached to the issue]


boring-cyborg bot commented Mar 17, 2025

Thanks for opening your first issue here! Be sure to follow the issue template!

@DaanHoogland
Contributor

@ztskycn, you will probably find a reason above the log line you are referring to that would explain the issue. createVPCOffering is a one-off call that should not be repeated. It might be that it fails and a retry is done?
There is no log attached, so please add it.

@ztskycn ztskycn changed the title API Service Becomes Unresponsive Every ~10 Days in CloudStack 4.20 - Requires Server Restart Cloudstack 4.20 UI cannot be operated, all timeouts. Restart the server to recover, it happens every 10 days. There are a lot of duplicate logs in the log, and one record exceeds 3MB Mar 20, 2025
@weizhouapache
Member

With the upgrade to Log4j 2.x (#7131), the content of all API responses is logged.

This issue is caused by the massive response of the listApis API.
2025-03-18 14:17:53,683 INFO [c.c.a.ApiServlet] (qtp2038105753-4077:[ctx-c7d0b173, ctx-8784ed60]) (logid:d5c7cecd) (userId=2 accountId=2 sessionId=node0q691iwxlhj07l65hzs2gg3ru2) 15.178 -- GET command=listApis&response=json&sessionkey=Sc6JtXZo8btFy6yP7WA9888BXIs 200 {"listapisresponse":{"count":812,"api":[{"name":"createVPCOffering","description":"Creates VPC offering","isasync":true,"related":"updateVPCOffering,listVPCOfferings","params":[{"name":"serviceproviderlist","description":"provider to service mapping. If not specified, the provider for the service will be mapped to the default provider on the physical network","type":"map","length":255,"required":false},

I am not sure if it is configurable.
Maybe @JoaoJandre can give some advice.

@bernardodemarco
Collaborator

with upgrade to the log4j 2.x (#7131) , the content of all API responses are logged.

@weizhouapache, yes, I believe #10567 is handling that (cc @DaanHoogland, @gpordeus)

@weizhouapache
Member

with upgrade to the log4j 2.x (#7131) , the content of all API responses are logged.

@weizhouapache, yes, I believe #10567 is handling that (cc @DaanHoogland, @gpordeus)

thanks @bernardodemarco

I did not look into the PR.
Does it introduce any changes to management-server.log, other than fixing apilog.log?

@JoaoJandre
Contributor

with upgrade to the log4j 2.x (#7131) , the content of all API responses are logged.

this issue is caused by massive response of API listApis. [long log excerpt omitted; quoted in full above]

I am not sure if it is configurable. maybe @JoaoJandre can give some advise

Yes, @weizhouapache @ztskycn, this was changed (by mistake) by PR #7131. Originally, the API response logs were written to a specific logger/appender/file. #10567 should fix this so that the default behavior goes back to what it used to be.

One workaround to limit the size of your log lines is to wrap the pattern of the management log appender's PatternLayout in the %maxLen converter. It would look something like this:

      <RollingFile name="FILE" append="true" fileName="/var/log/cloudstack/management/management-server.log" filePattern="/var/log/cloudstack/management/management-server.log.%d{yyyy-MM-dd}.gz">
         <ThresholdFilter level="TRACE" onMatch="ACCEPT" onMismatch="DENY"/>
         <Policies>
            <TimeBasedTriggeringPolicy/>
         </Policies>
         <PatternLayout pattern="%maxLen{%d{DEFAULT} %-5p [%c{1.}] (%t:%x) (logid:%X{logcontextid}) %m%ex{filters(${filters})}}{1000}%n"/>
      </RollingFile>

Using this configuration, each log line would have at most 1000 characters. This number should be tweaked so that you can still read important log information but those enormous JSON logs are truncated. See https://logging.apache.org/log4j/2.x/manual/pattern-layout.html#converter-max-len for more information on how to configure your logs.
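An alternative sketch, assuming the logger category resolves to `com.cloud.api.ApiServlet` (inferred from the abbreviated `[c.c.a.ApiServlet]` in the excerpts above; verify against your installation's log4j2 configuration), is to raise that logger's level so the INFO-level request/response lines are suppressed entirely:

```xml
<!-- Hypothetical addition to the <Loggers> section of the log4j2 config.
     The category name com.cloud.api.ApiServlet is inferred from the
     abbreviated [c.c.a.ApiServlet] seen in the log excerpts. -->
<Logger name="com.cloud.api.ApiServlet" level="WARN" additivity="false">
   <AppenderRef ref="FILE"/>
</Logger>
```

Note that this also hides the legitimate one-line INFO request entries from that class, so the %maxLen truncation above may be the gentler option.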

In any case, is your server going down because of no disk space left? If not, I fail to see how the log size would affect this.


ztskycn commented Mar 21, 2025


I still can't find the specific problem. Is there any way to troubleshoot it further? And if the log entries are too large, why are such large entries being produced in the first place?

@JoaoJandre
Contributor


I still can't find the specific problem. Is there any way to troubleshoot this problem? If the log is too large, why is there such a large log?

As @weizhouapache and I explained, these large logs should not be printed to this file; this was a bug introduced in version 4.20.0.0, and version 4.21.0.0 should fix it. My last message has a workaround to limit the size of any log line to 1000 characters. I'm sure that if you're willing to dig into the log4j2 documentation, there may be other workarounds as well.

Regarding the issue at hand: you reported that the UI recovers after a restart. This indicates to me that the logs are not the real culprit, since if the logs were filling your management server's storage, a simple restart would not fix the issue; you would need to manually free some space by deleting files.

Therefore, the first step I would take is to pinpoint when the services became unavailable and look for errors in the logs during that period.
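As a rough sketch of that first step (the file path, class names, and timestamps below are illustrative only, not from the real deployment), the suspected outage window can be isolated with standard text tools:

```shell
# Create a small sample log to demonstrate the approach; the path, class
# names, and timestamps are illustrative, not from a real deployment.
cat > /tmp/sample-management-server.log <<'EOF'
2025-03-28 16:50:01,000 INFO  [c.c.a.ApiServlet] (qtp-1:) normal request
2025-03-28 16:54:35,644 WARN  [c.c.a.ApiServer] (qtp-2:) request timed out
2025-03-28 16:55:10,002 ERROR [o.a.c.Example] (qtp-3:) unexpected exception
EOF

# Keep only WARN/ERROR lines inside the suspected outage window
# (here 16:54-16:56); $2 is the timestamp field, $3 the level field.
awk '$2 >= "16:54" && $2 < "16:56" && ($3 == "WARN" || $3 == "ERROR")' \
    /tmp/sample-management-server.log
```

Narrowing to the minutes just before the first failed UI action usually surfaces the first real error much faster than reading the whole file.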

@DaanHoogland DaanHoogland added this to the 4.20.1 milestone Mar 24, 2025
@DaanHoogland
Contributor

@JoaoJandre , could an internal log buffer be the culprit? (as the log size is reported to increase over time.)

@JoaoJandre
Contributor

@JoaoJandre , could an internal log buffer be the culprit? (as the log size is reported to increase over time.)

@DaanHoogland I am not sure, but before making any guesses I would first analyze the logs and the host's condition when the problem happens.

@ztskycn When this happens, what is the CPU usage of the host? RAM? storage?

Also, again, what are the logs saying?


ztskycn commented Mar 28, 2025

@JoaoJandre, could an internal log buffer be the culprit? (as the log size is reported to increase over time.)

@DaanHoogland I am not sure, but before making any guesses I would first analyze the logs and the host's condition when the problem happens.

@ztskycn When this happens, what is the CPU usage of the host? RAM? Storage?

Also, again, what are the logs saying?

CPU and memory usage are generally normal, and no obvious errors have been found in the logs. One detail about my setup: I use a custom network offering with only three services (DHCP, DNS, and user data).

@weizhouapache
Member

There are generally no major problems with the CPU and memory, and no obvious errors have been found in the logs. In one scenario, I now use a custom network computing solution (only three components: dhcp, dns, and user-static data).

@ztskycn
just to confirm, which is the issue: 1, 2, 3, or 4?
(1) management server is down or unreachable
(2) cloudstack-management service is not running
(3) service is running but cloudstack GUI is not working
(4) service and UI are running, but user actions fail.

@ztskycn
Author

ztskycn commented Mar 28, 2025

There are generally no major problems with the CPU and memory, and no obvious errors have been found in the logs. In one scenario, I now use a custom network computing solution (only three components: dhcp, dns, and user-static data).

@ztskycn just to confirm, what is the issue, 1,2,3 or 4 ? (1) management server is down or unreachable (2) cloudstack-management service is not running (3) service is running but cloudstack GUI is not working (4) service and UI are running, but user actions fail.

Going by your definitions above, it is (4).

Most UI pages can be opened, but a few cannot. Any operation that is attempted immediately reports a timeout.

@weizhouapache
Member

@ztskycn just to confirm, what is the issue, 1,2,3 or 4 ? (1) management server is down or unreachable (2) cloudstack-management service is not running (3) service is running but cloudstack GUI is not working (4) service and UI are running, but user actions fail.

Your definition above should be 4.

Most of the UI interfaces can be opened, but a few cannot be opened. If operations are involved, timeout will be reported directly.

if so, I think it should NOT be caused by the logs.

Can you share an example of the logs?

@gpordeus
Collaborator

@ztskycn Are the errors in the log similar to this? The problem description is similar to one I discovered recently; I'm in the middle of searching for where the connection leak is.

2025-03-28 16:54:35,644 WARN  [c.c.v.SystemVmLoadScanner$1] (secstorage-1:[ctx-992586eb]) (logid:9f55b800) Unexpected exception Unable to find on DB, due to: cloud - Connection is not available, request timed out after 30000ms (total=50, active=50, idle=0, waiting=19) com.cloud.utils.exception.CloudRuntimeException: Unable to find on DB, due to: cloud - Connection is not available, request timed out after 30000ms (total=50, active=50, idle=0, waiting=19)
	at com.cloud.utils.db.GenericDaoBase.searchIncludingRemoved(GenericDaoBase.java:441)
	at com.cloud.utils.db.GenericDaoBase.searchIncludingRemoved(GenericDaoBase.java:368)
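For context, the `(total=50, active=50, idle=0, waiting=19)` figures in that excerpt are HikariCP's pool-exhaustion message: every connection is checked out and other threads are queued waiting. HikariCP ships a built-in leak detector, `leakDetectionThreshold`, which logs "Apparent connection leak detected" with the stack trace of any thread holding a connection longer than the threshold. Whether and under what key name CloudStack's `db.properties` exposes this setting is an assumption to verify against your build; in plain HikariCP terms the setting looks like:

```properties
# Plain HikariCP setting (illustrative; the corresponding CloudStack
# db.properties key, if one exists, is an assumption -- check your build).
# Logs "Apparent connection leak detected" with a stack trace when a
# connection is held longer than the threshold, in milliseconds.
leakDetectionThreshold=60000
```

That stack trace would point directly at the code path holding connections, which is exactly what a leak hunt like this needs.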
