
Conversation

@davidkoski (Collaborator) requested a review from @awni · October 1, 2025 20:07

@davidkoski (Collaborator, Author):

Note: CI will fail until #278 is merged

Comment on lines 72 to 74
/// - **After first token**: +~1MB intermediates → cache grows as buffers are recycled
/// - **After 100 tokens**: Cache may be ~500MB (accumulated smaller buffers)
/// - **After 500 tokens**: Cache may be ~9.9GB (buffers of various sizes waiting for reuse)
@awni (Member):

That step from 500MB to 9.9GB seems rather large..

@davidkoski (Collaborator, Author):

If the cache is unbounded (well, bounded only by physical memory size) then you can easily get a lot of recycled buffers in there. And it grows roughly as N^2, since each new token uses 1 more element in one of the dimensions.

Anyway, these were numbers I had in my notes but they may not be entirely accurate. Do you think this is misleading?
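
(A worst-case, back-of-the-envelope sketch of that N^2 effect, for illustration only: the model dimensions below are made up, and it assumes no cached buffer is ever reused for a different sequence length, so a real allocator will do better than this.)

```swift
// Hypothetical model dimensions -- not taken from the PR.
let layers = 32
let kvHeads = 8
let headDim = 128
let bytesPerElement = 2                                              // float16
let perTokenRow = layers * kvHeads * headDim * bytesPerElement * 2   // K and V

// If the buffers for a length-t KV cache are never reused at length t+1,
// the cache accumulates the whole series: sum_{t=1..N} t ~ N^2 / 2 rows.
func cachedBytes(afterTokens n: Int) -> Int {
    (1 ... n).reduce(0) { $0 + $1 * perTokenRow }
}

print(cachedBytes(afterTokens: 100))   // ~0.66 GB with these made-up dimensions
print(cachedBytes(afterTokens: 500))   // ~16 GB -- about 25x more, not 5x
```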

@awni (Member):

I'm not worried about it being misleading... I'm more wondering why it's growing so fast.

If you are using a stepped KV cache (like in Python) -- I can't recall -- then the unused cache entries shouldn't grow on every token, but only every, say, 256 tokens.
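
(A minimal sketch of the stepped idea, loosely mirroring what mlx-lm's KVCache does: the backing buffers grow in fixed increments, so their shapes change at most once per step rather than on every token. The names and structure below are illustrative, not an actual API.)

```swift
struct SteppedKVCacheSketch {
    let step = 256
    private(set) var capacity = 0   // allocated sequence length
    private(set) var offset = 0     // tokens written so far

    /// Returns true when the backing buffers would be (re)allocated with a new shape.
    mutating func append(tokens n: Int) -> Bool {
        offset += n
        guard offset > capacity else {
            return false   // slice into existing buffers; no new shape, no cache churn
        }
        // Round capacity up to the next multiple of `step`.
        capacity = ((offset + step - 1) / step) * step
        return true
    }
}

// During a token-by-token decode, reallocation happens only every 256 tokens.
var cache = SteppedKVCacheSketch()
let reallocations = (0 ..< 1000).filter { _ in cache.append(tokens: 1) }.count
print(reallocations)   // 4, not 1000
```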

@awni (Member):

If you are seeing the cache numbers grow on every token for a regular decode then that would be a performance issue worth investigating 🤔

I just checked mlx-lm and the memory cache size is roughly constant in groups of 256 tokens.

@davidkoski (Collaborator, Author):

Actually I bet these numbers are from before the KVCache was added -- the notes are pretty old and KVCache was added on the Swift side in August 2024, so a good 6+ months after launch.

Comment on lines 188 to 191
/// During model inference, this can grow significantly as buffers of various
/// sizes accumulate from intermediate computations. Each token generation
/// may need slightly larger buffers, causing smaller cached buffers to
/// remain unused while new, larger buffers are allocated.
@awni (Member):

I think this is a little bit too specific to autoregressive LLMs, which have variable shapes. It would be good to rephrase it in a way that is less specific.

The core issue is that if the shapes of the data are changing frequently (for example, during inference with a language model) then the cache can accumulate unused buffers.
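
(A small Swift experiment along these lines shows the difference, assuming the mlx-swift `GPU.cacheMemory` and `GPU.clearCache()` names; adjust if the bindings differ.)

```swift
import MLX
import MLXRandom

// Fixed shapes: the same buffer sizes get recycled, so the cache stays roughly flat.
GPU.clearCache()
for _ in 0 ..< 100 {
    eval(MLXRandom.normal([1024, 1024]) * 2)
}
print("fixed shapes, cache:", GPU.cacheMemory)

// Growing shapes: each iteration needs a slightly larger buffer, so the smaller
// cached buffers pile up unused. This is the pattern LLM decode happens to hit.
GPU.clearCache()
for n in 1 ... 100 {
    eval(MLXRandom.normal([n * 64, 1024]) * 2)
}
print("growing shapes, cache:", GPU.cacheMemory)
```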

Comment on lines 268 to 270
/// **Important**: The policy is applied on allocation, not when buffers
/// are returned to the cache. This means you may observe cache sizes
/// temporarily exceeding the limit until the next allocation triggers cleanup.
@awni (Member):

I wonder if we should change that behavior 🤔

@awni (Member):

Actually looking at the code, the comment there seems incorrect: https://github.com/ml-explore/mlx/blob/main/mlx/backend/metal/allocator.cpp#L187-L189

Maybe better to remove it.

@davidkoski (Collaborator, Author):

Possibly. I wonder if we want different behavior for Swift and Python? For example, most Swift apps probably want a limit set on the cache by default -- they have jetsam limits to worry about.

We could have some #defines that control this during the build.

I don't know if when the policy is applied matters, but I documented it as it might be surprising if you were trying to debug something.

@awni (Member) · Oct 10, 2025:

> I don't know if when the policy is applied matters, but I documented it as it might be surprising if you were trying to debug something.

That's a good point. This is more accurate:

The cache limit will go into effect on the next deallocation. Because of that you may observe the cache size temporarily exceeding the requested limit. To immediately clear the cache, use clear_cache.
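
(On the Swift side, that suggested wording would correspond to something like the sketch below, assuming the `GPU.set(cacheLimit:)`, `GPU.cacheMemory`, and `GPU.clearCache()` names; a sketch under those naming assumptions, not a prescription.)

```swift
import MLX

// Request a cache limit; it is enforced on subsequent deallocations, so
// GPU.cacheMemory may briefly exceed it right after this call.
GPU.set(cacheLimit: 256 * 1024 * 1024)

// ... run some work ...

// To reclaim cached buffers immediately, rather than waiting for the next
// deallocation, clear the cache explicitly.
GPU.clearCache()
print("cache after clear:", GPU.cacheMemory)   // expected to be 0
```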

@awni (Member) left a review:

Very nice, thanks for adding the documentation on that!

One general comment: many of the comments are quite specific to LLM inference. I would rephrase some of them to mention the core issue, which is computations where the shapes are changing frequently, and then point to LLM inference as an example of that.

@davidkoski (Collaborator, Author):

> One general comment: many of the comments are quite specific to LLM inference. I would rephrase some of them to mention the core issue, which is computations where the shapes are changing frequently, and then point to LLM inference as an example of that.

OK, good idea!
