Cache Hidden State #194
-
Yes, this is valid. You would need to recompute the K and V projections and position embeddings at every attention step, but you'd only need to do that once per transformer block, so you'd still save close to 50% of the VRAM used by the K/V cache, for maybe a 20% (?) drop in performance. Worth considering, I think. It probably wouldn't make sense on 70B, though (or 34B when it releases). With GQA the K and V vectors together are only 9/8 the size of the hidden state.
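As a rough sketch of what that would look like for a single layer (hypothetical names, one attention head, no GQA; `rope` is just a stand-in for whatever position-embedding function the model uses):

```python
import torch
import torch.nn.functional as F

def attend_with_hidden_cache(x_new, hidden_cache, wq, wk, wv, rope=lambda t: t):
    """One decoding step that caches hidden states instead of K/V.

    x_new:        (batch, 1, d_model) hidden state of the newest token
    hidden_cache: (batch, past_len, d_model) cached layer inputs so far
    wq, wk, wv:   (d_model, d_model) projection weights (single head here)
    rope:         callable applying position embeddings (identity by default)
    """
    # The cache stores hidden states: one vector per token instead of
    # separate K and V vectors.
    hidden_cache = torch.cat([hidden_cache, x_new], dim=1)

    # The extra cost: K and V (and their position embeddings) are recomputed
    # over the whole context at every step, once per transformer block.
    q = rope(x_new @ wq)
    k = rope(hidden_cache @ wk)
    v = hidden_cache @ wv

    scores = q @ k.transpose(-2, -1) / (k.shape[-1] ** 0.5)
    attn = F.softmax(scores, dim=-1)
    return attn @ v, hidden_cache
```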
-
I think you'd probably lose too much information. But either way FP8 presents some challenges, like only being supported in tensor cores.
Yeah, you could multiply either Q or K by a factor, or their product, either way. And because of the softmax normalization this would work much like temperature scaling, emphasizing strong query-key matches and de-emphasizing the weaker ones.
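Toy illustration of that last point (made-up numbers): scaling the query-key scores before the softmax behaves like a temperature on the attention weights.

```python
import torch
import torch.nn.functional as F

scores = torch.tensor([2.0, 1.0, 0.5, 0.1])  # pretend q·k for four positions

for scale in (0.5, 1.0, 2.0):
    # scale > 1 sharpens the distribution (strong matches dominate even more),
    # scale < 1 flattens it (weak matches get relatively more weight).
    print(scale, F.softmax(scores * scale, dim=-1))
```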
Double it? How so?
-
I don't feel the same way; I think the model is really resilient. As long as the output scores look a certain way, the model seems to run fine.
Well, even in the case of GQA, aren't you caching the key and value states separately? If you just cache the hidden state you will still see memory savings, not that you need it, but combined with the speedup of GQA you're basically getting 2x memory on top of that, for a very low price.
-
Yes, and the values are the same size as the hidden state. But the keys are only 1/8 the size of the hidden state with GQA (at least in Llama-2). So if you store the hidden state vector instead of K and V vectors, you only save 1/9 of the total VRAM needed for the cache. With regular MHA you would save half, though.
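Spelling out the arithmetic from this comment (just restating the quoted ratios, per token and per layer, not a measurement):

```python
d = 1.0  # size of the hidden state, used as the unit

# Regular MHA: K and V are each the same size as the hidden state.
mha_kv = d + d                   # 2.0
mha_saving = 1 - d / mha_kv      # 0.50 -> caching the hidden state saves half

# GQA as described above (Llama-2): V ~ d, K ~ d/8, so K+V ~ 9/8 of d.
gqa_kv = d + d / 8               # 1.125
gqa_saving = 1 - d / gqa_kv      # ~0.11 -> only about 1/9 of the cache saved

print(f"MHA saving: {mha_saving:.0%}, GQA saving: {gqa_saving:.1%}")
```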
-
You could cache just the hidden state here: exllama/model.py, line 557 (commit e8a544f).