I've been working on this in the background for a while; I just wanted to have an issue on the board for it.

What we've been seeing, and trying to mitigate:

- Pods had a JS heap limit that was larger than the container memory limit, so pods were getting OOMKilled before the runtime did any garbage collection.
- Setting the heap size below the container limit (3 GB) resulted in heap allocation failures, since that heap wasn't large enough, and drove CPU usage way up because we were constantly garbage collecting.
- Setting a 7 GB heap resulted in fewer heap allocation failures and improved CPU utilization, but we were still getting the occasional container kill on very large short-term heap allocations.
- Setting a 10 GB heap seems to fix most of the heap allocation failures, aside from some outliers (see the config sketch below).
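
For concreteness, a minimal sketch of the kind of configuration this implies, assuming the service runs on Node.js and the heap is set via `NODE_OPTIONS`; the container name, image, and numbers are illustrative, not our actual manifest:

```yaml
# Hypothetical container spec fragment (name, image, and numbers are illustrative).
# The idea is to keep the V8 old-space limit comfortably below the container
# memory limit so GC kicks in before the kubelet OOMKills the pod.
containers:
  - name: api
    image: example/api:latest
    env:
      - name: NODE_OPTIONS
        value: "--max-old-space-size=10240"   # ~10 GB JS heap (value is in MB)
    resources:
      requests:
        memory: "12Gi"
      limits:
        memory: "12Gi"   # headroom above the heap for native memory, buffers, and stacks
```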
Still tracking down some of our other pod restarts -- not all of them appear to be memory related, and we're still seeing roughly two pod restarts per day with the latest resource limit increase.
Instability patterns that we're seeing regularly:
- Pods fail their health check, which removes them from the load-balancing pool. All requests then go to the other API server, overwhelming it and causing it to become unhealthy as well. We swap back and forth like this throughout the day.
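
For context on the mechanism, assuming the "health check" here is the Kubernetes readiness probe: when the probe fails, the pod is dropped from the Service's endpoints, so all traffic shifts to the remaining API server. A hypothetical probe fragment (path, port, and thresholds are illustrative, not our actual config):

```yaml
# Hypothetical readinessProbe fragment (path, port, and thresholds are illustrative).
# When this probe fails failureThreshold times in a row, Kubernetes removes the pod
# from the Service endpoints, so all traffic shifts to the remaining API server.
readinessProbe:
  httpGet:
    path: /healthz       # assumed health endpoint
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 5      # a long GC pause can make a healthy pod miss this window
  failureThreshold: 3    # three consecutive misses drops the pod from the pool
```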