
Details about cache #155

Answered by turboderp
giorgiopiatti asked this question in Q&A

The cache is roughly equivalent to the past_key_values returned from a HF model, only instead of being a list of variable-length tensors, it's an object wrapping a list of fixed-length tensors along with an index to track how much of the cache is being used.

Like the past_key_values list, the cache stores the keys and values from one forward pass so they can be reused in the next. This way no token ever has to be run through the model more than once, as long as you're building the sequence sequentially (as you typically would with a causal LM).
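As a rough sketch of the idea in plain Python (the names `FixedLenCache` and `current_seq_len` are illustrative, and the real cache wraps preallocated, fixed-length GPU tensors rather than lists):

```python
# Illustrative sketch only: a cache wrapping fixed-length storage plus an
# index tracking how much of it is in use, instead of HF-style
# variable-length past_key_values tensors that grow every step.
class FixedLenCache:
    def __init__(self, max_seq_len):
        # Fixed-length storage, allocated once up front.
        self.keys = [None] * max_seq_len
        self.values = [None] * max_seq_len
        self.max_seq_len = max_seq_len
        self.current_seq_len = 0  # how many positions are actually filled

    def append(self, k, v):
        # Each forward pass writes its keys/values at the current index,
        # so earlier tokens never need to be recomputed.
        assert self.current_seq_len < self.max_seq_len
        self.keys[self.current_seq_len] = k
        self.values[self.current_seq_len] = v
        self.current_seq_len += 1

cache = FixedLenCache(max_seq_len=2048)
cache.append("k0", "v0")  # first forward pass
cache.append("k1", "v1")  # next token reuses k0/v0 instead of recomputing
```

The buffers themselves never change size; only `current_seq_len` moves, which is what makes the storage cheap to reuse between generations.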

The model itself is stateless (though not thread-safe), since all state is held in the cache and the generator. This way you can use multiple ExLlamaGene…
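A minimal illustration of that separation, with hypothetical names (a real generator holds sampling state as well): because the forward function only reads the model and mutates its cache, several caches can drive the same model for independent sequences.

```python
# Hypothetical sketch: all per-sequence state lives in the cache object,
# so one set of read-only weights can serve several sequences.
class Cache:
    def __init__(self):
        self.tokens = []  # stand-in for stored keys/values

def forward(model, cache, token):
    # `model` is never mutated; only `cache` changes.
    cache.tokens.append(token)
    return model["vocab_size"]  # stand-in for returned logits

model = {"vocab_size": 32000}     # shared weights, read-only
cache_a, cache_b = Cache(), Cache()
forward(model, cache_a, 101)
forward(model, cache_a, 102)      # sequence A continues
forward(model, cache_b, 7)        # sequence B, same model, separate state
```

Sharing one model between caches like this is safe as long as the forward passes don't run concurrently, which is the "stateless but not thread-safe" distinction made above.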

Answer selected by giorgiopiatti