Resource leak with DD_RUNTIME_METRICS_ENABLED=true #5862
Comments
Hello. Thanks for the report, though this one is going to be hard to investigate given how long it takes to reproduce the issue. I doubt anything in the runtime metrics themselves could cause this behavior, so I would suspect either the .NET EventPipe or the Datadog DogStatsD client. The best tool to investigate this would be a perf trace, but it's very hard to interpret if …
Thanks. The dump file from container trace-metrics-idle-1 is here.
Thanks for the dump file, @npubl629. We've started inspecting it, but we haven't found anything obvious yet. We'll update here when we make progress.
Describe the bug
A resource leak with DD_RUNTIME_METRICS_ENABLED=true. This bug is most apparent when viewing the CPU usage of a service that has been running for several weeks without a reboot. It is unclear whether or not there is a memory leak associated with this bug.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
The CPU and memory usage of this library should be stable when the service is idle or serving basic status endpoints.
Screenshots
I have had the services from the example repo running for over 3 weeks, and the issue is clear. See this public Datadog dashboard, where CPU usage rises steadily for the containers trace-metrics-idle-1 and trace-metrics-healthcheck-1. Unfortunately, the trace-nometrics-idle-1 container never started, but the difference in CPU usage among the three healthcheck containers still shows the problem.
This chart compares container CPU usage across the three healthcheck containers. Note that the blue and green lines are increasing, while the orange line stays fairly stable. Also note that the CPU usage of the idle service (which serves no requests) has more than doubled in three weeks.
Runtime environment:
Additional context
My team discovered this issue in a few of our services that serve almost no traffic and are restarted very infrequently. The rise in CPU usage appears to outpace the rise in memory usage, but given how long it takes for the issue to appear, as well as various GC settings, it is difficult to determine whether a memory leak occurs and, if so, where. If necessary, I can provide redacted charts of our internal services' metrics.