This PR implements reuse of the KV cache across multiple requests. You can set `enable_prefix_cache` to `True` in `RaggedInferenceEngineConfig` to enable this feature.

This feature keeps KV cache blocks alive as long as there is free space. When a new request has a prefix that matches existing KV cache blocks, FastGen reuses them, and the same blocks can be shared by multiple requests. When many requests have common prefixes, this drastically reduces both the prompt computation and the KV cache memory usage.
Note that looking up the cache adds some overhead. You can disable this feature when prompts don't share much of a common prefix.
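To illustrate the lookup described above, here is a toy, self-contained model of prefix-based block reuse. It is not FastGen's actual implementation; the block size, class, and method names are all illustrative.

```python
BLOCK_SIZE = 4  # tokens per KV cache block (illustrative; real block sizes differ)

class PrefixCache:
    """Toy model of prefix-based KV block reuse (not FastGen's actual code)."""

    def __init__(self):
        # Maps a block-aligned token prefix (as a tuple) to a cached block id.
        self._blocks = {}
        self._next_id = 0

    def insert(self, tokens):
        """Register the full blocks of this prompt so later requests can reuse them."""
        for end in range(BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE):
            key = tuple(tokens[:end])
            if key not in self._blocks:
                self._blocks[key] = self._next_id
                self._next_id += 1

    def match_prefix(self, tokens):
        """Return (reused_block_ids, num_cached_tokens) for the longest cached prefix."""
        reused, cached = [], 0
        for end in range(BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE):
            block_id = self._blocks.get(tuple(tokens[:end]))
            if block_id is None:
                break  # this is where the lookup overhead stops paying off
            reused.append(block_id)
            cached = end
        return reused, cached

cache = PrefixCache()
cache.insert(list(range(10)))  # first request caches two full blocks (tokens 0-3, 0-7)
blocks, cached = cache.match_prefix(list(range(10)) + [99])
# the second request shares an 8-token prefix, so only the tail needs prompt compute
```

When prompts share long prefixes, `match_prefix` skips recomputation for most of the prompt; when they do not, the dictionary lookups are pure overhead, which is why the option can be disabled.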
Here is a benchmark result using this feature. We used prompts that share the same prefix. When the prompts are short and the generations are long, the benefit is smaller.