document memory buffers #280
Conversation
davidkoski commented Oct 1, 2025
- see also LLMEval: Memory usage mlx-swift-examples#17
Note: CI will fail until #278 is merged
01c2719 to 3fa66db
Source/MLX/GPU.swift
Outdated
/// - **After first token**: +~1MB intermediates → cache grows as buffers are recycled
/// - **After 100 tokens**: Cache may be ~500MB (accumulated smaller buffers)
/// - **After 500 tokens**: Cache may be ~9.9GB (buffers of various sizes waiting for reuse)
That step from 500MB to 9.9GB seems rather large..
If the cache is unbounded (well, left to track the physical memory size) then you can easily get a lot of recycled buffers in there. And the growth is related to N^2, since each new token adds 1 to one of the dimensions.
Anyway, these were numbers I had in my notes, but they may not be entirely accurate. Do you think this is misleading?
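To make the N^2 intuition concrete, here is a rough back-of-envelope sketch with made-up numbers (not measurements from any actual model): if each decode step allocates a buffer whose size grows with the current token count and none of the earlier, smaller buffers can be reused, the retained cache is roughly the triangular sum.

```swift
// Made-up numbers for illustration only: assume step t allocates a buffer of
// t * bytesPerToken and that none of the earlier, smaller buffers are reused.
let bytesPerToken = 64 * 1024   // hypothetical per-step increment
let tokens = 500
let retained = (1 ... tokens).reduce(0) { $0 + $1 * bytesPerToken }
let gib = Double(retained) / Double(1 << 30)
print("~\(gib) GiB retained after \(tokens) tokens")
// The triangular sum (500 * 501 / 2 increments of 64 KiB) is about 7.6 GiB,
// the same order of magnitude as the 9.9 GB figure in the doc comment above.
```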
I'm not worried about misleading.. I'm more wondering why it's growing so fast.
If you are using a stepped KV cache (like in Python), which I can't recall, then the unused cache entries shouldn't grow on every token but only every 256 tokens or so.
If you are seeing the cache numbers grow on every token for a regular decode then that would be a performance issue worth investigating 🤔
I just checked mlx-lm and the memory cache size is roughly constant in groups of 256 tokens.
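Here is a small sketch of that stepping idea (illustrative only, not the actual mlx-lm or mlx-swift KVCache code): rounding the cache capacity up to a multiple of the step keeps the buffer shape constant for 256 tokens at a time, so the allocator can keep reusing the same cached buffer instead of accumulating a slightly larger one on every token.

```swift
// Illustrative only; not the actual KVCache implementation.
let step = 256

/// Round the requested sequence length up to the next multiple of `step`.
func roundedCapacity(for length: Int) -> Int {
    ((length + step - 1) / step) * step
}

// The capacity (and therefore the buffer shape) changes only every 256 tokens.
print(roundedCapacity(for: 1), roundedCapacity(for: 256), roundedCapacity(for: 257))
// prints: 256 256 512
```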
Actually, I bet these numbers are from before the KVCache was added -- the notes are pretty old, and KVCache was added on the Swift side in August 2024, a good 6+ months after launch.
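If we want to refresh those numbers, something like the sketch below could log cache growth during a generation run. It assumes the mlx-swift GPU API exposes a snapshot with a `cacheMemory` byte count (as the documentation in this PR describes); the decode step itself is a placeholder supplied by the caller, for example one step of the LLMEval generation loop.

```swift
import MLX

// Sketch for re-measuring cache growth with today's KVCache. Assumes
// GPU.snapshot() returns a snapshot with a `cacheMemory` byte count.
func logCacheGrowth(steps: Int, decodeStep: () -> Void) {
    for token in 1 ... steps {
        decodeStep()   // one token of generation, supplied by the caller
        if token % 100 == 0 {
            let cached = GPU.snapshot().cacheMemory
            print("token \(token): cache \(cached / (1 << 20)) MB")
        }
    }
}
```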
Source/MLX/GPU.swift
Outdated
/// During model inference, this can grow significantly as buffers of various
/// sizes accumulate from intermediate computations. Each token generation
/// may need slightly larger buffers, causing smaller cached buffers to
/// remain unused while new, larger buffers are allocated.
I think this is a little too specific to autoregressive LLMs, which have variable shapes. It would be good to rephrase it in a way that is less specific.
The core issue is that if the shapes of the data change frequently (for example, during inference with a language model), then the cache can accumulate unused buffers.
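A minimal sketch of that framing, assuming the `GPU.set(cacheLimit:)` API this PR documents plus the `MLXArray.zeros` and `eval` helpers: when the shapes change on every step, older cached buffers stop matching new allocations, so bounding the cache keeps the accumulation in check.

```swift
import MLX

// Bound the buffer cache (in bytes) for workloads whose shapes change frequently.
GPU.set(cacheLimit: 128 * 1024 * 1024)   // keep at most ~128 MB of cached buffers

for n in 1 ... 512 {
    // a new shape on every iteration, much like token-by-token decoding
    let x = MLXArray.zeros([1, n, 64])
    eval(x.sum())
}
```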
Source/MLX/GPU.swift
Outdated
/// **Important**: The policy is applied on allocation, not when buffers
/// are returned to the cache. This means you may observe cache sizes
/// temporarily exceeding the limit until the next allocation triggers cleanup.
I wonder if we should change that behavior 🤔
Actually looking at the code, the comment there seems incorrect: https://github.com/ml-explore/mlx/blob/main/mlx/backend/metal/allocator.cpp#L187-L189
Maybe better to remove it.
Possibly. I wonder if we want different behavior for Swift and Python? For example, most Swift apps probably want a limit set on the cache by default -- they have jetsam limits to worry about.
We could have some #defines that control this during the build.
I don't know whether it matters when the policy is applied, but I documented it since it might be surprising if you were trying to debug something.
> I don't know whether it matters when the policy is applied, but I documented it since it might be surprising if you were trying to debug something.
That's a good point. This is more accurate:
The cache limit will go into effect on the next deallocation. Because of that, you may observe the cache size temporarily exceeding the requested limit. To immediately clear the cache, use `clear_cache`.
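On the Swift side, a tiny sketch of that distinction, assuming the counterparts `GPU.set(cacheLimit:)` and `GPU.clearCache()` exist as they do for the Python `clear_cache`:

```swift
import MLX

// The new limit is enforced as buffers are returned to the cache, so the
// reported cache size can briefly exceed it; clearCache() releases the
// currently cached buffers immediately.
GPU.set(cacheLimit: 0)
GPU.clearCache()
```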
Very nice, thanks for adding the documentation on that!
One general comment: many of the comments are quite specific to LLM inference. I would rephrase some of them to mention the core issue, which is computations where the shapes change frequently, and then point to LLM inference as an example of that.
OK, good idea!