I've been working on this in the background for a while; I just wanted to have an issue on the board for it.

What we've been seeing, and trying to mitigate:

- Pods had a JS heap limit that was larger than the container memory limit, so pods were getting OOMKilled before the runtime did any garbage collection.
- Setting the heap size below the container limit (3 GB) resulted in heap allocation failures, since that heap wasn't large enough, and drove CPU usage way up because we were constantly garbage collecting.
- Setting a 7 GB heap resulted in fewer heap allocation failures and improved CPU utilization, but we were still getting the occasional container kill on very large short-term heap allocations.
- Setting a 10 GB heap seems to fix most of the heap allocation failures, aside from some outliers (see the config sketch below).
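
For concreteness, a minimal sketch of the kind of configuration this implies, assuming the service runs on Node.js and the heap is set via `NODE_OPTIONS`; the container name, image, and numbers are illustrative, not our actual manifest:

```yaml
# Hypothetical container spec fragment (name, image, and numbers are illustrative).
# The idea is to keep the V8 old-space limit comfortably below the container
# memory limit so GC kicks in before the kubelet OOMKills the pod.
containers:
  - name: api
    image: example/api:latest
    env:
      - name: NODE_OPTIONS
        value: "--max-old-space-size=10240"   # ~10 GB JS heap (value is in MB)
    resources:
      requests:
        memory: "12Gi"
      limits:
        memory: "12Gi"   # headroom above the heap for native memory, buffers, and stacks
```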
Still tracking down some of our other pod restarts -- not all of them appear to be memory related, and we're still seeing roughly two pod restarts per day with the latest resource limit increase.
Instability patterns that we're seeing regularly:
- Pods fail their health check, which removes them from the load-balancing pool. All requests then go to the other API server, overwhelming it and causing it to become unhealthy as well. We swap back and forth like this throughout the day.
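
For context on the mechanism, assuming the "health check" here is the Kubernetes readiness probe: when the probe fails, the pod is dropped from the Service's endpoints, so all traffic shifts to the remaining API server. A hypothetical probe fragment (path, port, and thresholds are illustrative, not our actual config):

```yaml
# Hypothetical readinessProbe fragment (path, port, and thresholds are illustrative).
# When this probe fails failureThreshold times in a row, Kubernetes removes the pod
# from the Service endpoints, so all traffic shifts to the remaining API server.
readinessProbe:
  httpGet:
    path: /healthz       # assumed health endpoint
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 5      # a long GC pause can make a healthy pod miss this window
  failureThreshold: 3    # three consecutive misses drops the pod from the pool
```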