
Conversation

@JonathanOppenheimer (Member) commented Jan 3, 2026

Why this should be merged

This closes long-standing issue #1314. For context, @alarso16 recently made a PR to convert the disk health check to be percentage based: #4770, which reminded me of this old issue.

His PR provided an excellent template for implementing the memory percentage check -- not much creativity went into this PR -- most of the relevant code blocks are copied and pasted, with the configs changed. I considered trying to generalize the code to make it less repetitive, but didn't think it was worth the effort. If others disagree, I can take a pass at it.

How this works

This PR adds a memory usage health check that monitors system memory availability and takes action when memory becomes critically low, following the exact same pattern as the existing disk space health check.

  • Shutdown threshold (default: 3%): When available memory drops below this percentage, the node logs a fatal error and gracefully shuts down
  • Warning threshold (default: 10%): When available memory drops below this percentage (but above the shutdown threshold), the node reports as unhealthy but continues operating
  • The health status is queryable via the /health API endpoint, which returns the current memory statistics

Two new flags control the thresholds:

  • --system-tracker-memory-required-available-percentage (default: 3) - Minimum available memory percentage before shutdown
  • --system-tracker-memory-warning-available-percentage (default: 10) - Warning threshold percentage

Both thresholds must be between 0% and 50% (again, the same bounds as the disk check). A rough sketch of this logic is shown below.
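
To make the flow concrete, here is a minimal sketch of the dual-threshold logic described above. All identifiers here are hypothetical, not taken from the PR: the health-check shape HealthCheck(ctx) (details, error), the availablePercentage callback (which could be backed by something like gopsutil's mem.VirtualMemory), and the shutdown hook are stand-ins; the actual implementation mirrors the existing disk space checker.

```go
// Hypothetical sketch of the dual-threshold memory check described above.
// These identifiers are illustrative only; the real code follows the
// existing disk space health check.
package memhealth

import (
	"context"
	"fmt"
)

type memoryThresholds struct {
	requiredAvailablePercent float64 // default 3: below this, shut the node down
	warningAvailablePercent  float64 // default 10: below this, report unhealthy
}

type memoryChecker struct {
	thresholds memoryThresholds

	// availablePercentage reports available system memory as a percentage of
	// total memory (for example via gopsutil's mem.VirtualMemory).
	availablePercentage func() (float64, error)

	// shutdown is invoked once the required threshold is breached; in the PR
	// this corresponds to logging a fatal error and gracefully shutting down.
	shutdown func(reason string)
}

// HealthCheck returns the current memory statistics, plus an error whenever
// the node should be reported unhealthy via the /health endpoint.
func (m *memoryChecker) HealthCheck(context.Context) (interface{}, error) {
	avail, err := m.availablePercentage()
	if err != nil {
		return nil, err
	}

	details := map[string]interface{}{
		"availableMemoryPercentage": avail,
	}

	switch {
	case avail < m.thresholds.requiredAvailablePercent:
		// Fatal condition: trigger a graceful shutdown and report unhealthy.
		m.shutdown(fmt.Sprintf("available memory %.2f%% < required %.2f%%",
			avail, m.thresholds.requiredAvailablePercent))
		return details, fmt.Errorf("available memory %.2f%% < required %.2f%%",
			avail, m.thresholds.requiredAvailablePercent)
	case avail < m.thresholds.warningAvailablePercent:
		// Unhealthy, but the node keeps operating.
		return details, fmt.Errorf("available memory %.2f%% < warning threshold %.2f%%",
			avail, m.thresholds.warningAvailablePercent)
	default:
		return details, nil
	}
}
```

Calling the shutdown hook from inside the check is a simplification for the sketch; in practice the health reporting and the shutdown decision may live in separate components.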

How this was tested

CI, and a new test, TestGetMemoryConfig

Need to be documented in RELEASES.md?

Yes

@JonathanOppenheimer self-assigned this Jan 3, 2026
@JonathanOppenheimer requested a review from a team as a code owner January 3, 2026 00:08
@JonathanOppenheimer added the enhancement (New feature or request) and monitoring (This primarily focuses on logs, metrics, and/or tracing) labels Jan 3, 2026

maxDiskSpaceThreshold   = 50
maxMemorySpaceThreshold = 50
@JonathanOppenheimer (Member Author)

We can change this if we'd like. I just copied the disk values over for memory, but I understand memory is not disk!

config/flags.go Outdated
Comment on lines 364 to 365
fs.Uint64(SystemTrackerRequiredAvailableMemoryPercentageKey, 3, "Minimum percentage (between 0 and 50) of available system memory, under which the node will shutdown.")
fs.Uint64(SystemTrackerWarningAvailableMemoryPercentageKey, 10, fmt.Sprintf("Warning threshold for the percentage (between 0 and 50) of available system memory, under which the node will be considered unhealthy. Must be >= [%s]", SystemTrackerRequiredAvailableMemoryPercentageKey))
@JonathanOppenheimer (Member Author)

Same here
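
For reference, the constraints these flag descriptions encode -- both percentages at most 50, and the warning threshold at least the required threshold -- might be validated roughly as follows. This is a hypothetical sketch with invented names, not code from the PR.

```go
// Hypothetical sketch of the validation implied by the flag descriptions
// above. Names are illustrative, not taken from the PR.
package memhealth

import "fmt"

const maxMemorySpaceThreshold = 50

type MemoryConfig struct {
	RequiredAvailablePercentage uint64 // shutdown threshold, default 3
	WarningAvailablePercentage  uint64 // warning threshold, default 10
}

// Verify checks the documented constraints; the lower bound of 0 is implicit
// because the percentages are unsigned.
func (c MemoryConfig) Verify() error {
	switch {
	case c.RequiredAvailablePercentage > maxMemorySpaceThreshold:
		return fmt.Errorf("required available memory percentage %d exceeds the maximum of %d",
			c.RequiredAvailablePercentage, maxMemorySpaceThreshold)
	case c.WarningAvailablePercentage > maxMemorySpaceThreshold:
		return fmt.Errorf("warning available memory percentage %d exceeds the maximum of %d",
			c.WarningAvailablePercentage, maxMemorySpaceThreshold)
	case c.WarningAvailablePercentage < c.RequiredAvailablePercentage:
		return fmt.Errorf("warning threshold %d must be >= required threshold %d",
			c.WarningAvailablePercentage, c.RequiredAvailablePercentage)
	default:
		return nil
	}
}
```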

@JonathanOppenheimer linked an issue Jan 3, 2026 that may be closed by this pull request
@maru-ava (Contributor) left a comment

Is there prior art suggesting this is a good idea? I usually think of OOM conditions as something that would be handled at the OS level, even more so if running in e.g. Kubernetes.

@JonathanOppenheimer (Member Author)

Is there prior art suggesting this is a good idea? I usually think of OOM conditions as something that would be handled at the OS level, even more so if running in e.g. Kubernetes.

I am not familiar with Kubernetes (at all), but it doesn't seem like a bad idea for nodes to at least be aware of available memory so we can react and gracefully shut down (it seems like the same rationale as for disk space, but maybe this is managed differently in Kubernetes and I am not aware). It's not like we're managing memory (reducing our usage) or anything here; it's all passive. Perhaps @StephenButtolph has additional thoughts? Am I correct to read your comment as thinking this is unnecessary?

@maru-ava (Contributor) commented Jan 7, 2026

Am I correct to read your comment as thinking this is unnecessary?

Yeah, I'm not clear on the value of this.

@JonathanOppenheimer (Member Author)

Am I correct to read your comment as thinking this is unnecessary?

Yeah, I'm not clear on the value of this.

In a world where system memory gets extremely / dangerously low on a node to the point that a node can no longer operate, what would currently happen? I assumed it would just crash as we have no way to monitor memory, but I'm inferring I'm wrong?

@maru-ava (Contributor) commented Jan 7, 2026

In a world where system memory gets extremely / dangerously low on a node to the point that a node can no longer operate, what would currently happen? I assumed it would just crash as we have no way to monitor memory, but I'm inferring I'm wrong?

I think it really depends on the runtime environment of the node. Unlike disk, memory utilization can change quite dynamically.

On a bare metal node, a momentary condition might cause the proposed health check to shutdown the node before the kernel VMM (virtual memory manager) has a chance to OOM kill a misbehaving process and free up memory. I'm not sure why we would want to do that.

For kube, I don't think we'd want to do this at all. Kube already has ways of declaring memory requests/limits and orchestrating scheduled workloads accordingly. Adding another mechanism just adds potential for conflict. That suggests to me that if we do want to add this feature, it should be disabled by default, since it should really only be enabled for bare metal nodes.

I think there might be more value in warning about a low memory condition rather than responding to it with node shutdown.

What am I missing?
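
A warn-only variant along these lines might look roughly like the following. This is a hypothetical sketch, not code from the PR or from AvalancheGo: it surfaces the condition through the health report and leaves any remediation to the environment.

```go
// Hypothetical warn-only variant: report unhealthy when available memory is
// low, but never shut the node down. Not code from the PR or AvalancheGo.
package memhealth

import (
	"context"
	"fmt"
)

type memoryWarner struct {
	warningAvailablePercent float64
	availablePercentage     func() (float64, error)
}

func (m *memoryWarner) HealthCheck(context.Context) (interface{}, error) {
	avail, err := m.availablePercentage()
	if err != nil {
		return nil, err
	}
	details := map[string]interface{}{"availableMemoryPercentage": avail}
	if avail < m.warningAvailablePercent {
		// Surface the condition via /health and leave remediation to the
		// environment (kernel OOM killer, kube eviction, or an operator).
		return details, fmt.Errorf("available memory %.2f%% < warning threshold %.2f%%",
			avail, m.warningAvailablePercent)
	}
	return details, nil
}
```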

@JonathanOppenheimer (Member Author)

I think it really depends on the runtime environment of the node. Unlike disk, memory utilization can change quite dynamically.

On a bare metal node, a momentary condition might cause the proposed health check to shutdown the node before the kernel VMM (virtual memory manager) has a chance to OOM kill a misbehaving process and free up memory. I'm not sure why we would want to do that.

For kube, I don't think we'd want to this at all. Kube already has ways of declaring memory requests/limits and orchestrating scheduled workloads accordingly. Adding another mechanism just adds potential for conflict. That suggests to me that if we do want to add this feature that it should be disabled by default since it should really only be enabled for bare metal nodes.

I think there might be more value in warning about a low memory condition rather than responding to it with node shutdown.

What am I missing?

So I am incorrect! That makes a lot of sense. Prematurely killing a node would certainly be a bad idea. You're not missing anything; I was missing how systems properly deal with low-memory conditions. Rather than close the PR right now: @StephenButtolph, as the one who created the issue, do you disagree with any of the above? If not, I will close both this PR and the original issue as not planned.

@StephenButtolph (Contributor)

The original PR to introduce this logic was here: #1315. It was closed because of concern around K8s support.

I agree that we probably never want to FATAL the node due to these memory readings... I do think it could be possible for us to someday report unhealthy...

Regardless, I don't think this PR will be able to be merged as is. Whether we should close the issue as a whole or not... 🤷

@JonathanOppenheimer (Member Author) commented Jan 13, 2026

The original PR to introduce this logic was here: #1315. It was closed because of concern around K8s support.

I agree that we probably never want to FATAL the node due to these memory readings... I do think it could be possible for us to someday report unhealthy...

Regardless, I don't think this PR will be able to be merged as is. Whether we should close the issue as a whole or not... 🤷

Thanks for the context in the additional issue! I have modified this PR to

I'm laughing at the "someday." If this PR is still miles away from being mergeable, I would suggest closing this PR and the issue, or at least editing the issue to be far more accurate (as it currently is not).

cc @maru-ava


Labels

enhancement (New feature or request), monitoring (This primarily focuses on logs, metrics, and/or tracing)

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

Add memory usage health check

5 participants