
Details about cache #155

Answered by turboderp
giorgiopiatti asked this question in Q&A

The cache is roughly equivalent to the past_key_values returned from a HF model, only instead of being a list of variable-length tensors, it's an object wrapping a list of fixed-length tensors along with an index to track how much of the cache is being used.

Like the past_key_values list, the cache stores the keys and values from one forward pass so they can be reused in the next. This way no token ever has to be run through the model more than once, as long as you're building the sequence sequentially (as you typically would with a causal LM).
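As a rough sketch of the idea in plain Python (the names `FixedLenCache` and `current_seq_len` are illustrative, and the real cache wraps preallocated, fixed-length GPU tensors rather than lists):

```python
# Illustrative sketch only: a cache wrapping fixed-length storage plus an
# index tracking how much of it is in use, instead of HF-style
# variable-length past_key_values tensors that grow every step.
class FixedLenCache:
    def __init__(self, max_seq_len):
        # Fixed-length storage, allocated once up front.
        self.keys = [None] * max_seq_len
        self.values = [None] * max_seq_len
        self.max_seq_len = max_seq_len
        self.current_seq_len = 0  # how many positions are actually filled

    def append(self, k, v):
        # Each forward pass writes its keys/values at the current index,
        # so earlier tokens never need to be recomputed.
        assert self.current_seq_len < self.max_seq_len
        self.keys[self.current_seq_len] = k
        self.values[self.current_seq_len] = v
        self.current_seq_len += 1

cache = FixedLenCache(max_seq_len=2048)
cache.append("k0", "v0")  # first forward pass
cache.append("k1", "v1")  # next token reuses k0/v0 instead of recomputing
```

The buffers themselves never change size; only `current_seq_len` moves, which is what makes the storage cheap to reuse between generations.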

The model itself is stateless (though not thread-safe), since all state is held in the cache and the generator. This way you can use multiple ExLlamaGene…
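A minimal illustration of that separation, with hypothetical names (a real generator holds sampling state as well): because the forward function only reads the model and mutates its cache, several caches can drive the same model for independent sequences.

```python
# Hypothetical sketch: all per-sequence state lives in the cache object,
# so one set of read-only weights can serve several sequences.
class Cache:
    def __init__(self):
        self.tokens = []  # stand-in for stored keys/values

def forward(model, cache, token):
    # `model` is never mutated; only `cache` changes.
    cache.tokens.append(token)
    return model["vocab_size"]  # stand-in for returned logits

model = {"vocab_size": 32000}     # shared weights, read-only
cache_a, cache_b = Cache(), Cache()
forward(model, cache_a, 101)
forward(model, cache_a, 102)      # sequence A continues
forward(model, cache_b, 7)        # sequence B, same model, separate state
```

Sharing one model between caches like this is safe as long as the forward passes don't run concurrently, which is the "stateless but not thread-safe" distinction made above.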

Answer selected by giorgiopiatti