
Reuse KV cache of prefixes #484

Draft · wants to merge 11 commits into main
Conversation

@tohtana (Contributor) commented May 27, 2024

This PR implements reuse of the KV cache across multiple requests. Set enable_prefix_cache to True in RaggedInferenceEngineConfig to enable this feature.

config = RaggedInferenceEngineConfig(enable_prefix_cache=True)

This feature keeps KV cache blocks alive as long as free space is available. When a new request's prompt has a prefix that matches existing KV cache blocks, FastGen reuses those blocks, and a single block can be shared by multiple requests. This drastically reduces prompt computation and KV cache memory usage when many requests share common prefixes.
Note that the cache lookup adds some overhead, so you may want to disable this feature when prompts have little overlap.
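The block-reuse idea above can be sketched as follows. This is a minimal, illustrative model of block-aligned prefix matching, not the actual FastGen implementation; the class and method names (`PrefixKVCache`, `match_prefix`, `insert`) and the block size are assumptions made up for this sketch.

```python
# Hypothetical sketch of block-level prefix caching.
# Real engines store KV tensors per block; here a block is just an integer id.
BLOCK_SIZE = 4  # tokens per KV cache block (illustrative value)


class PrefixKVCache:
    def __init__(self):
        # Maps a block-aligned token prefix (as a tuple) to a cached block id.
        self._blocks = {}
        self._next_block_id = 0

    def insert(self, tokens):
        """Register every full block of this prompt for reuse by later requests."""
        for end in range(BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE):
            key = tuple(tokens[:end])
            if key not in self._blocks:
                self._blocks[key] = self._next_block_id
                self._next_block_id += 1

    def match_prefix(self, tokens):
        """Return (cached block ids, number of prompt tokens covered).

        Only the covered tokens can skip prompt computation; the rest of the
        prompt must still be processed normally.
        """
        block_ids = []
        covered = 0
        for end in range(BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE):
            key = tuple(tokens[:end])
            if key not in self._blocks:
                break  # longest cached prefix found
            block_ids.append(self._blocks[key])
            covered = end
        return block_ids, covered


cache = PrefixKVCache()
cache.insert([1, 2, 3, 4, 5, 6, 7, 8])  # first request fills the cache
# A second request sharing the first block [1, 2, 3, 4] reuses it:
ids, covered = cache.match_prefix([1, 2, 3, 4, 5, 6, 9, 10])
# ids == [0], covered == 4
```

This also shows why the lookup has overhead: every request hashes its prefix block by block before decoding starts, which is wasted work when prompts rarely overlap.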

Here is a benchmark result using this feature; the benchmark prompts share the same prefix.
[benchmark figure]

When the prompts are short and generation is long, the benefit is smaller.
[benchmark figure]

@tohtana tohtana changed the title Tohtana/cache prefix Reuse KV cache of prefix May 27, 2024
@tohtana tohtana changed the title Reuse KV cache of prefix Reuse KV cache of prefixes May 27, 2024
@tohtana tohtana marked this pull request as draft May 27, 2024 20:39