Details about cache #155
-
What is the purpose of the `ExLlamaCache` cache? Are multiple generations independent of each other if I use the same cache?
Replies: 1 comment
-
The cache is roughly equivalent to the `past_key_values` returned from a HF model, only instead of being a list of variable-length tensors, it's an object wrapping a list of fixed-length tensors along with an index to track how much of the cache is being used.

Like the `past_key_values` list, the cache stores keys and values from one forward pass so they can be reused in the next. This way you never have to run any token through the model more than once, as long as you're building the sequence sequentially (as you typically would in causal LM).

The model itself is stateless (though not thread-safe) because any state is held in the cache and the generator. This way you can use multiple
ExLlamaGene…
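
To make the idea concrete, here is a rough toy sketch in plain Python. It is not ExLlama's actual code: `ToyCache`, `toy_forward`, and the field names are hypothetical stand-ins, and the "keys" and "values" are just made-up numbers. It only illustrates the two points above: the storage is preallocated at a fixed length with an index marking how much is in use, and the forward function keeps no state of its own, so two separate caches hold two independent sequences.

```python
class ToyCache:
    """Hypothetical stand-in for a fixed-length KV cache (not ExLlama's class)."""

    def __init__(self, num_layers, max_seq_len):
        self.max_seq_len = max_seq_len
        # Storage is allocated once, at full length, for every layer.
        self.keys = [[None] * max_seq_len for _ in range(num_layers)]
        self.values = [[None] * max_seq_len for _ in range(num_layers)]
        # Index tracking how much of the cache is currently in use.
        self.current_seq_len = 0


def toy_forward(token, cache):
    """Stateless toy 'model': everything it remembers lives in `cache`."""
    pos = cache.current_seq_len
    if pos >= cache.max_seq_len:
        raise ValueError("cache is full")
    for layer in range(len(cache.keys)):
        # Fake per-layer key/value for the new token (illustrative numbers only).
        cache.keys[layer][pos] = token * (layer + 1)
        cache.values[layer][pos] = token + layer
    cache.current_seq_len = pos + 1
    # Reuse every cached position instead of re-running earlier tokens.
    return sum(cache.values[-1][:cache.current_seq_len])


# Two caches, one stateless forward function: the sequences do not interfere.
cache_a = ToyCache(num_layers=2, max_seq_len=8)
cache_b = ToyCache(num_layers=2, max_seq_len=8)

out_a = [toy_forward(t, cache_a) for t in [1, 2, 3]]
out_b = [toy_forward(t, cache_b) for t in [10, 20]]

print(out_a, cache_a.current_seq_len)
print(out_b, cache_b.current_seq_len)
```

Each token is passed to `toy_forward` exactly once, and the only thing that changes between calls is the cache that was handed in. Feeding both sequences into the same `ToyCache` would instead append them one after the other, so in this sketch independence comes from using separate caches.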