-
Notifications
You must be signed in to change notification settings - Fork 843
feat: add memory usage health check #4802
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
feat: add memory usage health check #4802
Conversation
|
|
||
| maxDiskSpaceThreshold = 50 | ||
| maxDiskSpaceThreshold = 50 | ||
| maxMemorySpaceThreshold = 50 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can change this if we'd like. I just copied all the values for memory, but I understand memory is not disk!
config/flags.go
Outdated
| fs.Uint64(SystemTrackerRequiredAvailableMemoryPercentageKey, 3, "Minimum percentage (between 0 and 50) of available system memory, under which the node will shutdown.") | ||
| fs.Uint64(SystemTrackerWarningAvailableMemoryPercentageKey, 10, fmt.Sprintf("Warning threshold for the percentage (between 0 and 50) of available system memory, under which the node will be considered unhealthy. Must be >= [%s]", SystemTrackerRequiredAvailableMemoryPercentageKey)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here
Signed-off-by: Jonathan Oppenheimer <[email protected]>
maru-ava
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there prior art suggesting this is a good idea? I usually think of OOM conditions as something that something at the OS level would be handling, even more so if running in e.g. kubernetes.
I am not familiar with Kubernetes (at all), but it doesn't seem to me to be a bad idea that nodes should be at least aware of available memory, so we can react, and gracefully shutdown (it seems like it'd be the same rationale for disk space, but maybe this is managed differently in Kubernetes and I am not aware). It's not like we're managing memory (reduce our usage) or anything here, it's all passive. Perhaps @StephenButtolph has additional thoughts? Am I correct to read your comment as thinking this is unnecessary? |
Yeah, I'm not clear on the value of this. |
In a world where system memory gets extremely / dangerously low on a node to the point that a node can no longer operate, what would currently happen? I assumed it would just crash as we have no way to monitor memory, but I'm inferring I'm wrong? |
I think it really depends on the runtime environment of the node. Unlike disk, memory utilization can change quite dynamically. On a bare metal node, a momentary condition might cause the proposed health check to shutdown the node before the kernel VMM (virtual memory manager) has a chance to OOM kill a misbehaving process and free up memory. I'm not sure why we would want to do that. For kube, I don't think we'd want to this at all. Kube already has ways of declaring memory requests/limits and orchestrating scheduled workloads accordingly. Adding another mechanism just adds potential for conflict. That suggests to me that if we do want to add this feature that it should be disabled by default since it should really only be enabled for bare metal nodes. I think there might be more value in warning about a low memory condition rather than responding to it with node shutdown. What am I missing? |
So I am incorrect! That makes a lot of sense. Prematurely killing a node would certainly be a bad idea. You're not missing anything, I was missing how systems properly deal with low memory conditions. Rather than close than PR right now, @StephenButtolph as the one who created the issue do you disagree with any of the above? If not, I will close both this PR, and the original issue as not planned. |
|
The original PR to introduce this logic was here: #1315. It was closed because of concern around K8s support. I agree that we probably never want to FATAL the node due to these memory readings... I do think it could be possible for us to someday report unhealthy... Regardless, I don't think this PR will be able to be merged as is. Whether we should close the issue as a whole or not... 🤷 |
Signed-off-by: Jonathan Oppenheimer <[email protected]>
Thanks for the context in the additional issue! I have modified this PR to
I'm laughing at the "someday." If this PR is still miles away from being mergeable, I would suggest closing this PR and the issue, or at least edit the issue to be far more accurate (as it is not currently). cc @maru-ava |
Why this should be merged
This closes a long standing issue #1314. For context, @alarso16 recently made a PR to convert the disk health check to percentage based: #4770, which reminded me of this old issue.
His PR provided an excellent template to implement the memory percentage check -- not much creativity went into this PR -- most of the relevant code blocks are copy and pasted, with configs changed. I considered trying to generalize the code to make it less repetitive, but didn't think it was worth the effort. If other's disagree, I can take a pass at it.
How this works
This PR adds the memory usage health check that monitors system memory availability and takes action when memory becomes critically low, following the exact same pattern as the existing disk space health check.
/healthAPI endpoint, which returns the current memory statisticsTwo new flags control the thresholds:
--system-tracker-memory-required-available-percentage(default: 3) - Minimum available memory percentage before shutdown--system-tracker-memory-warning-available-percentage(default: 10) - Warning threshold percentageBoth thresholds must be between 0-50% (again, same as the disk)
How this was tested
CI, and a new test,
TestGetMemoryConfigNeed to be documented in RELEASES.md?
Yes