emergency memory cleanup of harvester #974

Draft
wants to merge 1 commit into base: develop

Conversation

nr-swilloughby
Contributor

This is a feature addition I was holding back until I got confirmation that it was in fact the needed solution. I now think it is best to go ahead and add it as a default-off feature anyway: it gives us a way to mitigate a memory issue when one emerges with no other apparent cause or available solution, and it can serve as a quick stop-gap for the customer until a better solution is found.

This came out of the work on issue #906, a reported memory leak apparently caused by the agent holding onto log data longer than normal. So far it affects only one application, for one customer, under one set of circumstances (a Kubernetes environment where memory constraints are a real issue).

One possible reason log event data could be retained past harvest cycles is a problem delivering it to the New Relic back-end collector, since the agent waits for events to be delivered before dumping them. In that situation, a network issue or other external problem could indirectly cause the instrumented application's memory to grow too large to be viable.

Above all, we never want instrumentation to unduly affect the operation of the application itself. So if an application reaches the point where there seems to be no other alternative, it stands to reason that we should discard the accumulated event data in the harvester so the app can continue running.

This PR introduces an API call that lets the application set a maximum heap size. If the heap exceeds that value, all of the harvester's accumulated data is dropped and an emergency garbage collection and memory release are requested. See the documentation for the function in the PR's deltas for more details.
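
For illustration only, here is a minimal sketch of the kind of watchdog this describes. Every name in it (`harvester`, `discardAll`, `watchHeap`, the 512 MiB limit) is hypothetical, not the API this PR actually adds (see the PR's deltas for that). It only sketches one plausible mechanism: poll the heap with `runtime.ReadMemStats`, and when the limit is exceeded, drop the buffered harvest data and request a GC plus memory release via `debug.FreeOSMemory`.

```go
// Hypothetical sketch of an emergency heap watchdog; the real API and
// names are in this PR's deltas. Assumes a harvester type that can
// discard its buffered event data on demand.
package main

import (
	"runtime"
	"runtime/debug"
	"time"
)

type harvester struct {
	// ... buffered log events, spans, metrics, etc.
}

// discardAll drops all accumulated harvest data (hypothetical helper).
func (h *harvester) discardAll() { /* clear internal buffers */ }

// watchHeap polls heap usage and, when it crosses limitBytes, drops the
// harvester's data and asks the runtime to GC and return freed memory
// to the operating system.
func watchHeap(h *harvester, limitBytes uint64, interval time.Duration) {
	var m runtime.MemStats
	for range time.Tick(interval) {
		runtime.ReadMemStats(&m)
		if m.HeapAlloc > limitBytes {
			h.discardAll()
			// Forces a GC and attempts to return as much memory
			// as possible to the OS.
			debug.FreeOSMemory()
		}
	}
}

func main() {
	h := &harvester{}
	go watchHeap(h, 512*1024*1024, 10*time.Second) // example: 512 MiB limit
	select {}                                      // stand-in for the application's real work
}
```

The point of the default-off design is that this loop costs nothing unless the application opts in by setting a limit.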

@nr-swilloughby
Contributor Author

I think we should look at whether we want to allow more control over what memory is released here. The only case we've found so far seems to be caused by memory issues outside the agent itself, and we're just providing a tool to help an application let go of resources to avoid a worse problem.
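
As a discussion aid only, one shape that finer-grained control could take is a bitmask selecting which categories of harvester data to release, so an application could shed, say, log events while keeping trace data. None of these names (`CleanupScope`, `discard`, the category constants) are part of this PR; this is purely a sketch of the design question.

```go
// Hypothetical discussion aid: selective release of harvester data via a
// bitmask of categories. Not part of this PR.
package main

type CleanupScope uint

const (
	CleanupLogs CleanupScope = 1 << iota
	CleanupSpans
	CleanupCustomEvents

	CleanupAll CleanupScope = CleanupLogs | CleanupSpans | CleanupCustomEvents
)

type harvester struct {
	logs, spans, customEvents []any // stand-ins for buffered data
}

// discard drops only the selected categories of buffered data.
func (h *harvester) discard(scope CleanupScope) {
	if scope&CleanupLogs != 0 {
		h.logs = nil
	}
	if scope&CleanupSpans != 0 {
		h.spans = nil
	}
	if scope&CleanupCustomEvents != 0 {
		h.customEvents = nil
	}
}

func main() {
	h := &harvester{}
	h.discard(CleanupLogs) // shed only log data in an emergency
	_ = h
}
```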
