Would 65b work on a five-3060 gpu crypto mining rack? #151
-
The question says it all. I'd just like your opinion on this, as I have one at home and would like to know if there are any limitations. I've seen you've tried it on a 2-GPU setup, but I'm assuming mine would have much worse performance.
Replies: 2 comments 12 replies
-
I've been running 65B daily for a while, on 4x A4000 (16GB).
-
I would expect 65B to work on a minimum of 4x 12 GB cards using exllama. There's some overhead per card, though, so you probably won't be able to push context quite as far as, say, 2x 24 GB cards (apparently that'll go to around 4k). Going up to 5 cards will most likely more than make up for that.

There's also PCI-E bandwidth to consider: a mining rack is probably on risers running at something like PCI-E 2.0 x1. Exllama's implementation doesn't seem to use much PCI-E bandwidth, though, so it might not really matter. I think turboderp said it's something on the order of 16 kB per token per second, which is still a very tiny fraction of even PCI-E 2.0 x1's bandwidth.

What I suggest doing is running the benchmark and fiddling with the split value until it stops crashing. I'd start with something like:

I've currently got a silly setup with a 3090 + 3080, because I wanted to see if it could handle 33B at 8k context. The answer is yes, but it's right on the edge; I found the bounds of stability were 11.41 and 12.17 on the first card (i.e.
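To put the "tiny fraction" claim above in concrete terms, here's a quick back-of-envelope sketch. The ~16 kB/token figure is the one quoted from turboderp; the 40 tokens/s generation rate and the ~500 MB/s figure for PCI-E 2.0 x1 are my own assumptions, not measurements from this setup:

```python
# Rough sanity check: inter-GPU traffic vs. PCI-E 2.0 x1 bandwidth.
BYTES_PER_TOKEN = 16 * 1024      # ~16 kB/token (estimate quoted above)
TOKENS_PER_SEC = 40              # assumed generation rate, for illustration
PCIE2_X1_BPS = 500 * 1000**2     # PCI-E 2.0 x1 is roughly 500 MB/s per direction

traffic = BYTES_PER_TOKEN * TOKENS_PER_SEC   # bytes/s crossing the riser
fraction = traffic / PCIE2_X1_BPS
print(f"{traffic / 1e6:.2f} MB/s -> {fraction:.4%} of PCI-E 2.0 x1")
```

Even with these generous assumptions the link is well under 1% utilized, which is why the x1 risers on a mining rack shouldn't be the bottleneck for inference.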
Using the ooba webui, I run it with the following args in a startup script:
python server.py --listen --api --verbose --chat --xformers --loader exllama --model TheBloke_airoboros-65B-gpt4-1.4-GPTQ --gpu-split 8,8,8,8 --max_seq_len 4096 --alpha_value 2
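For a five-3060 rack you'd adapt the --gpu-split value, which lists how many GB of model weights to place on each card; each card also needs free VRAM left over for activations and the context cache. Here's a small hypothetical helper (check_split and the 3 GB headroom figure are my own illustration, not part of ooba or exllama) for sanity-checking a split before launching:

```python
# Hypothetical sanity check for a --gpu-split value.
# split_gb: GB of weights assigned per card; card_gb: total VRAM per card.
# headroom_gb is an assumed reserve for activations + context cache.
def check_split(split_gb, card_gb, headroom_gb=3.0):
    """Return True if every card keeps at least headroom_gb of VRAM free."""
    return all(s + headroom_gb <= c for s, c in zip(split_gb, card_gb))

# The 8,8,8,8 split above on 4x 16 GB A4000s leaves ~8 GB spare per card:
print(check_split([8, 8, 8, 8], [16, 16, 16, 16]))

# An illustrative five-way split for 12 GB 3060s (35 GB total for weights):
print(check_split([7, 7, 7, 7, 7], [12, 12, 12, 12, 12]))
```

As the reply above says, treat any starting split as a guess and fiddle with the values in the real benchmark until it stops crashing; the actual usable headroom depends on context length and the loader.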