Cache Hidden State #194
-
Yes, this is valid. You would need to recompute the K and V projections and position embeddings at every attention step, but you'd only need to do that once per transformer block, so you'd still save close to 50% of the VRAM used by the K/V cache, for maybe a 20% (?) drop in performance. Worth considering, I think. It probably wouldn't make sense on 70B, though (or 34B when it releases). With GQA the K and V vectors together are only 9/8 the size of the hidden state.
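As a rough sketch of what that would look like for a single layer (hypothetical names, one attention head, no GQA; `rope` is just a stand-in for whatever position-embedding function the model uses):

```python
import torch
import torch.nn.functional as F

def attend_with_hidden_cache(x_new, hidden_cache, wq, wk, wv, rope=lambda t: t):
    """One decoding step that caches hidden states instead of K/V.

    x_new:        (batch, 1, d_model) hidden state of the newest token
    hidden_cache: (batch, past_len, d_model) cached layer inputs so far
    wq, wk, wv:   (d_model, d_model) projection weights (single head here)
    rope:         callable applying position embeddings (identity by default)
    """
    # The cache stores hidden states: one vector per token instead of
    # separate K and V vectors.
    hidden_cache = torch.cat([hidden_cache, x_new], dim=1)

    # The extra cost: K and V (and their position embeddings) are recomputed
    # over the whole context at every step, once per transformer block.
    q = rope(x_new @ wq)
    k = rope(hidden_cache @ wk)
    v = hidden_cache @ wv

    scores = q @ k.transpose(-2, -1) / (k.shape[-1] ** 0.5)
    attn = F.softmax(scores, dim=-1)
    return attn @ v, hidden_cache
```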
-
I think you'd probably lose too much information. But either way FP8 presents some challenges, like only being supported in tensor cores.
Yeah, you could multiply either Q or K by a factor, or their product, either way. And because of the softmax normalization this would work much like temperature scaling, emphasizing strong query-key matches and de-emphasizing the weaker ones.
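Toy illustration of that last point (made-up numbers): scaling the query-key scores before the softmax behaves like a temperature on the attention weights.

```python
import torch
import torch.nn.functional as F

scores = torch.tensor([2.0, 1.0, 0.5, 0.1])  # pretend q·k for four positions

for scale in (0.5, 1.0, 2.0):
    # scale > 1 sharpens the distribution (strong matches dominate even more),
    # scale < 1 flattens it (weak matches get relatively more weight).
    print(scale, F.softmax(scores * scale, dim=-1))
```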
Double it? How so?
-
I don't feel the same way; I think the model is really resilient. As long as the output scores look a certain way, the model seems to run fine.
Well, even in the case of GQA, aren't you caching the key and value states separately? If you just cache the hidden state you will still see memory savings, not that you need it, but combined with the speedup of GQA you're basically getting 2x memory on top of that, for a very low price.
-
Yes, and the values are the same size as the hidden state. But the keys are only 1/8 the size of the hidden state with GQA (at least in Llama-2). So if you store the hidden state vector instead of K and V vectors, you only save 1/9 of the total VRAM needed for the cache. With regular MHA you would save half, though.
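Spelling out the arithmetic from this comment (just restating the quoted ratios, per token and per layer, not a measurement):

```python
d = 1.0  # size of the hidden state, used as the unit

# Regular MHA: K and V are each the same size as the hidden state.
mha_kv = d + d                   # 2.0
mha_saving = 1 - d / mha_kv      # 0.50 -> caching the hidden state saves half

# GQA as described above (Llama-2): V ~ d, K ~ d/8, so K+V ~ 9/8 of d.
gqa_kv = d + d / 8               # 1.125
gqa_saving = 1 - d / gqa_kv      # ~0.11 -> only about 1/9 of the cache saved

print(f"MHA saving: {mha_saving:.0%}, GQA saving: {gqa_saving:.1%}")
```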
-
You could cache just the hidden state here: exllama/model.py, line 557 (commit e8a544f).