Replies: 3 comments 1 reply
-
To partially answer my own question, the modified GPTQ that turboderp's working on for ExLlama v2 is looking really promising even down to 3 bits. 3B, 7B, and 13B models have only been tested informally so far, but going by early results, each step up in parameter size is notably more resistant to quantization loss than the last, and 3-bit 13B already looks like it could be a winner. Assuming the trend continues, I wouldn't be surprised if 3-bit 70B using the new quantization method equals or even outperforms current GPTQ 4-bit ungrouped, and if so that's a very respectable memory save. Also, I have seen one report that P100 performance is acceptable with ExLlama (unlike the P40), though mixing cards from different generations can be sketchy. Regardless, it still looks like it may be viable, eventually.
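To put a rough number on that memory save, here's some quick napkin math (flat bits-per-weight over the raw parameter count, so it ignores quantization scales/zeros and any unquantized layers; ballpark only, not actual quant file sizes):

```python
# Approximate weight-only memory at a given average bits per weight.
# Real quant files run a bit larger due to scales/zeros and unquantized layers.
GIB = 1024 ** 3

def weight_gib(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8 / GIB

for name, n_params in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    for bpw in (3.0, 4.0):
        print(f"{name} @ {bpw:.1f} bpw: ~{weight_gib(n_params, bpw):.1f} GiB")
```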
-
Have you tried the 22B merges? The couple I used seemed alright as a midpoint.
I'd have to turn it back on again to check. In SD (Stable Diffusion), I'm finding that just enabling the upcast-attention-to-FP32 setting recovers most of the speed on a P40, and at that point running full precision and keeping the model in FP32 makes no difference while using the xformers optimizer. People have also had luck pairing a P40 with a faster card and splitting the model; they still got respectable speeds in ExLlama.
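For anyone wondering what "upcast attention to FP32" boils down to, here's a minimal generic PyTorch sketch (not the actual SD/xformers code path): only the attention math gets done in FP32, and everything else stays in half precision.

```python
import torch

def attention_upcast_fp32(q, k, v):
    """Naive scaled dot-product attention with the math upcast to FP32.

    q, k, v: (batch, heads, seq, head_dim) tensors, typically FP16 on the GPU.
    """
    orig_dtype = q.dtype
    q, k, v = q.float(), k.float(), v.float()           # upcast just for the attention math
    scale = q.shape[-1] ** -0.5
    attn = torch.softmax((q @ k.transpose(-2, -1)) * scale, dim=-1)
    return (attn @ v).to(orig_dtype)                     # cast back to the model's dtype
```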
-
I have 2 P100s that I'm looking to run for inference. I'm thinking about using the dual-GPU setup to run the Mixtral MoE model, but I'm not super familiar with the new terminology (i.e. Q4, Q5) and all the different finetunes coming out. Did anyone run into the issue of PyTorch disabling FP16 on the P100? Given the current developments with exl2 and the P100's FP16 performance, what configuration do you guys suggest?
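Not a recommendation, but as a generic starting point, here's a sketch of splitting a model across two 16GB cards with Hugging Face transformers/accelerate. The model ID and memory caps are placeholders, and it assumes a checkpoint/quant format that transformers can actually load:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "some-org/some-mixtral-quant"     # placeholder: pick something that fits 2x16GB

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",                        # let accelerate spread layers across both GPUs
    max_memory={0: "15GiB", 1: "15GiB"},      # leave a little headroom on each 16GB P100
    torch_dtype=torch.float16,                # the P100 has usable FP16, unlike the P40
)

inputs = tokenizer("Hello", return_tensors="pt").to("cuda:0")   # inputs go to the first GPU
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```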
-
I just tried Oobabooga -> ExLlama yesterday and it works fine on a P100. I just had to disable flash attention; I suppose it needs tensor cores or something, which is weird. But bottom line: ExLlama works on a P100.
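For reference, a quick way to check whether a card is even a candidate for flash-attn; the sm_80 cutoff below is an assumption based on the published requirements of recent flash-attn releases, and the P100 is sm_60, so falling back to regular attention there is expected:

```python
import torch

for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    name = torch.cuda.get_device_name(i)
    ok = (major, minor) >= (8, 0)             # sm_80 = Ampere or newer (assumed cutoff)
    print(f"GPU {i}: {name} (sm_{major}{minor}) -> flash-attn candidate: {ok}")
```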
-
So, P40s have already been discussed, and despite the nice 24GB chunk of VRAM, they unfortunately aren't viable with ExLlama on account of their abysmal FP16 performance.
I was looking at card specs earlier and realized something interesting: P100s, despite being slightly older and from the same generation as P40s, actually have very good FP16 performance!
Early Pascal (P100) runs at a 2:1 FP16:FP32 ratio, which is great on paper for ExLlama. Later Pascal runs at a really awful 1:64 ratio, meaning FP16 math is completely unviable. Volta/Turing also run at a 2:1 ratio, while Ampere and Lovelace/Hopper both run at just 1:1. In absolute terms, Nvidia claims 18.7 TFLOP/s for FP16 on a P100, whereas a 3090 is listed at 29-35 TFLOP/s, so a 3090 is a little less than twice as fast.
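If you want to sanity-check those spec-sheet ratios on your own card, a crude matmul micro-benchmark like this will do it; the matrix size and iteration count are arbitrary, and on tensor-core cards the FP16 figure will land well above the plain CUDA-core ratios quoted above:

```python
import time
import torch

def matmul_tflops(dtype, n=4096, iters=20):
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    for _ in range(3):                         # warm-up
        a @ b
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    return 2 * n ** 3 * iters / (time.time() - t0) / 1e12

fp32 = matmul_tflops(torch.float32)
fp16 = matmul_tflops(torch.float16)
print(f"FP32: {fp32:.1f} TFLOP/s, FP16: {fp16:.1f} TFLOP/s, ratio {fp16 / fp32:.2f}:1")
```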
Memory bandwidth on the P100 is also excellent on account of using HBM2, listed at 732 GB/s, which is not that far off the ~900-1000 GB/s of a 3090/4090.
16GB P100s are dirt cheap on eBay; for example, here's a current listing (no affiliation) of at least 100 cards for sale at $150 each.
A caveat is that software support for FP16 on the P100 is reportedly spotty; for example, PyTorch apparently disabled FP16 math on these cards, citing "numerical instability". It's unclear to me whether that's really meaningful, or whether it just means slightly more rounding error, which probably wouldn't make any practical difference for LLM inference. (Also, would PyTorch weirdness here actually matter for ExLlama, since most of the math is being done by a custom kernel anyway?)
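One way to put a number on the rounding-error question would be something like this, comparing an FP16 matmul against an FP32 reference on the actual card; it only probes matmul, not a full inference pipeline, so it's a rough indicator at best:

```python
import torch

torch.manual_seed(0)
a32 = torch.randn(2048, 2048, device="cuda")
b32 = torch.randn(2048, 2048, device="cuda")

ref = a32 @ b32                                # FP32 reference
out = (a32.half() @ b32.half()).float()        # same matmul computed in FP16

rel_err = (out - ref).abs().max() / ref.abs().max()
print(f"max relative error of FP16 matmul vs FP32: {rel_err.item():.2e}")
```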
Of course, all that means nothing if the available VRAM doesn't pass certain thresholds. Right now Meta withholding LLaMA 2 34B puts single-24GB-card users in an awkward position: LLaMA 2 13B is arguably not that far off LLaMA 1 33B, leaving a lot of unused VRAM, yet it takes quite a bit extra to fit 70B. Adding a second 16GB card, for 40GB total, by my napkin math gives enough VRAM to almost run 70B (i.e. it might load, but if it did, context would be extremely limited). I know @turboderp has been working on quantization improvements for ExLlama v2, and by my math it would only take something in the ballpark of a 10-15% reduction in overall quant size to make room for 70B at full context with such a hardware configuration. I understand if you can't promise anything yet, but is at least that much a realistic hope for ExLlama v2?
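For context on the napkin math, the KV-cache cost at full context is the swing factor; the layer/head counts below are my assumptions for a LLaMA-2-70B-shaped model, and whether a given backend actually exploits grouped-query attention for its cache is a separate question:

```python
# Per-token FP16 KV-cache cost for an assumed 70B-class model, with and
# without taking advantage of grouped-query attention (GQA).
n_layers, n_heads, n_kv_heads, head_dim = 80, 64, 8, 128   # assumed LLaMA-2-70B shape
bytes_fp16 = 2
context = 4096

for label, heads in [("full MHA cache", n_heads), ("GQA cache", n_kv_heads)]:
    per_token = 2 * n_layers * heads * head_dim * bytes_fp16          # K and V
    total_gib = per_token * context / 1024 ** 3
    print(f"{label}: {per_token / 1024:.0f} KiB/token, ~{total_gib:.1f} GiB at {context} context")
```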
Anyway, I was just curious whether anyone had thoughts on this, and especially whether anyone has already tried running ExLlama on a P100. Seeing how popular 3090s and 4090s are, the possibility of running 70B well just by adding a cheap $150 secondary card, instead of spending 4-5× as much on a second 3090 or 10× as much on a second 4090, could make this a very practical low-budget way to get 70B running on a system like mine.