
Conversation

@davidkoski (Collaborator) requested a review from @awni · October 1, 2025 20:07

@davidkoski (Collaborator, Author):

Note: CI will fail until #278 is merged

Comment on lines 72 to 74
/// - **After first token**: +~1MB intermediates → cache grows as buffers are recycled
/// - **After 100 tokens**: Cache may be ~500MB (accumulated smaller buffers)
/// - **After 500 tokens**: Cache may be ~9.9GB (buffers of various sizes waiting for reuse)
@awni (Member):

That step from 500MB to 9.9GB seems rather large..

@davidkoski (Collaborator, Author):

If the cache is unbounded (well, bounded only by physical memory size) then you can easily get a lot of recycled buffers in there. And it grows roughly as N^2, since each new token uses 1 more element in one of the dimensions.

Anyway, these were numbers I had in my notes but they may not be entirely accurate. Do you think this is misleading?
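
(A worst-case, back-of-the-envelope sketch of that N^2 effect, for illustration only: the model dimensions below are made up, and it assumes no cached buffer is ever reused for a different sequence length, so a real allocator will do better than this.)

```swift
// Hypothetical model dimensions -- not taken from the PR.
let layers = 32
let kvHeads = 8
let headDim = 128
let bytesPerElement = 2                                              // float16
let perTokenRow = layers * kvHeads * headDim * bytesPerElement * 2   // K and V

// If the buffers for a length-t KV cache are never reused at length t+1,
// the cache accumulates the whole series: sum_{t=1..N} t ~ N^2 / 2 rows.
func cachedBytes(afterTokens n: Int) -> Int {
    (1 ... n).reduce(0) { $0 + $1 * perTokenRow }
}

print(cachedBytes(afterTokens: 100))   // ~0.66 GB with these made-up dimensions
print(cachedBytes(afterTokens: 500))   // ~16 GB -- about 25x more, not 5x
```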

@awni (Member):

I'm not worried about it being misleading... I'm more wondering why it's growing so fast.

If you are using a stepped KV cache (like in Python) -- I can't recall -- then the unused cache entries shouldn't grow on every token, but only every, say, 256 tokens.
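
(A minimal sketch of the stepped idea, loosely mirroring what mlx-lm's KVCache does: the backing buffers grow in fixed increments, so their shapes change at most once per step rather than on every token. The names and structure below are illustrative, not an actual API.)

```swift
struct SteppedKVCacheSketch {
    let step = 256
    private(set) var capacity = 0   // allocated sequence length
    private(set) var offset = 0     // tokens written so far

    /// Returns true when the backing buffers would be (re)allocated with a new shape.
    mutating func append(tokens n: Int) -> Bool {
        offset += n
        guard offset > capacity else {
            return false   // slice into existing buffers; no new shape, no cache churn
        }
        // Round capacity up to the next multiple of `step`.
        capacity = ((offset + step - 1) / step) * step
        return true
    }
}

// During a token-by-token decode, reallocation happens only every 256 tokens.
var cache = SteppedKVCacheSketch()
let reallocations = (0 ..< 1000).filter { _ in cache.append(tokens: 1) }.count
print(reallocations)   // 4, not 1000
```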

@awni (Member):

If you are seeing the cache numbers grow on every token for a regular decode then that would be a performance issue worth investigating 🤔

I just checked mlx-lm and the memory cache size is roughly constant in groups of 256 tokens.

@davidkoski (Collaborator, Author):

Actually I bet these numbers are from before the KVCache was added -- the notes are pretty old and KVCache was added on the Swift side in August 2024, so a good 6+ months after launch.

Comment on lines 188 to 191
/// During model inference, this can grow significantly as buffers of various
/// sizes accumulate from intermediate computations. Each token generation
/// may need slightly larger buffers, causing smaller cached buffers to
/// remain unused while new, larger buffers are allocated.
@awni (Member):

I think this is a little bit too specific to autoregressive LLMs, which have variable shapes. It would be good to rephrase it in a way that is less specific.

The core issue is that if the shapes of the data are changing frequently (for example, during inference with a language model) then the cache can accumulate unused buffers.
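
(A small Swift experiment along these lines shows the difference, assuming the mlx-swift `GPU.cacheMemory` and `GPU.clearCache()` names; adjust if the bindings differ.)

```swift
import MLX
import MLXRandom

// Fixed shapes: the same buffer sizes get recycled, so the cache stays roughly flat.
GPU.clearCache()
for _ in 0 ..< 100 {
    eval(MLXRandom.normal([1024, 1024]) * 2)
}
print("fixed shapes, cache:", GPU.cacheMemory)

// Growing shapes: each iteration needs a slightly larger buffer, so the smaller
// cached buffers pile up unused. This is the pattern LLM decode happens to hit.
GPU.clearCache()
for n in 1 ... 100 {
    eval(MLXRandom.normal([n * 64, 1024]) * 2)
}
print("growing shapes, cache:", GPU.cacheMemory)
```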

Comment on lines 268 to 270
/// **Important**: The policy is applied on allocation, not when buffers
/// are returned to the cache. This means you may observe cache sizes
/// temporarily exceeding the limit until the next allocation triggers cleanup.
@awni (Member):

I wonder if we should change that behavior 🤔

@awni (Member):

Actually looking at the code, the comment there seems incorrect: https://github.com/ml-explore/mlx/blob/main/mlx/backend/metal/allocator.cpp#L187-L189

Maybe better to remove it.

@davidkoski (Collaborator, Author):

Possibly. I wonder if we want different behavior for Swift and Python? For example, most Swift apps probably want a limit set on the cache by default -- they have jetsam limits to worry about.

We could have some #defines that control this during the build.

I don't know if when the policy is applied matters, but I documented it as it might be surprising if you were trying to debug something.

@awni (Member) · Oct 10, 2025:

> I don't know if when the policy is applied matters, but I documented it as it might be surprising if you were trying to debug something.

That's a good point. This is more accurate:

The cache limit will go into effect on the next deallocation. Because of that you may observe the cache size temporarily exceeding the requested limit. To immediately clear the cache, use clear_cache.
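
(On the Swift side, that suggested wording would correspond to something like the sketch below, assuming the `GPU.set(cacheLimit:)`, `GPU.cacheMemory`, and `GPU.clearCache()` names; a sketch under those naming assumptions, not a prescription.)

```swift
import MLX

// Request a cache limit; it is enforced on subsequent deallocations, so
// GPU.cacheMemory may briefly exceed it right after this call.
GPU.set(cacheLimit: 256 * 1024 * 1024)

// ... run some work ...

// To reclaim cached buffers immediately, rather than waiting for the next
// deallocation, clear the cache explicitly.
GPU.clearCache()
print("cache after clear:", GPU.cacheMemory)   // expected to be 0
```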

@awni (Member) left a review:

Very nice, thanks for adding the documentation on that!

One general comment: many of the comments are quite specific to LLM inference. I would rephrase some of them to mention the core issue, which is computations where the shapes are changing frequently, and then point to LLM inference as an example of that.

@davidkoski (Collaborator, Author):

> One general comment: many of the comments are quite specific to LLM inference. I would rephrase some of them to mention the core issue, which is computations where the shapes are changing frequently, and then point to LLM inference as an example of that.

OK, good idea!
