Would 65b work on a five-3060 gpu crypto mining rack? #151
-
The question says it all. I'd just like your opinion on this, as I have one at home and would like to know if there are any limitations. I've seen you've tried it on a 2-GPU setup, but I'm assuming mine would have much worse performance.
Replies: 2 comments 12 replies
-
I've been running 65B daily for a while, on 4x A4000 (16GB).
-
I would expect 65B to work on a minimum of 4x 12 GB cards using exllama. There's some overhead per card, though, so you probably won't be able to push context quite as far as, say, 2x 24 GB cards (apparently that'll go to around 4k). Going up to 5 cards will most likely more than make up for that.

There's also PCI-E bandwidth to consider: a mining rack is probably on risers running at something like PCI-E 2.0 x1. Exllama's implementation doesn't seem to use much PCI-E bandwidth, though, so it might not really matter. I think turboderp said it's something on the order of 16 kB per token per second, which is still a very tiny fraction of even PCI-E 2.0 x1's bandwidth.

What I suggest doing is running the benchmark and fiddling with the split value until it stops crashing. I'd start with something like:

I've currently got a silly setup with a 3090 + 3080, because I wanted to see if it could handle 33B at 8k context. The answer is yes, but it's right on the edge; I found the bounds of stability were 11.41 and 12.17 on the first card (i.e.
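To put the "tiny fraction" claim above in concrete terms, here's a quick back-of-envelope sketch. The ~16 kB/token figure is the one quoted from turboderp; the 40 tokens/s generation rate and the ~500 MB/s figure for PCI-E 2.0 x1 are my own assumptions, not measurements from this setup:

```python
# Rough sanity check: inter-GPU traffic vs. PCI-E 2.0 x1 bandwidth.
BYTES_PER_TOKEN = 16 * 1024      # ~16 kB/token (estimate quoted above)
TOKENS_PER_SEC = 40              # assumed generation rate, for illustration
PCIE2_X1_BPS = 500 * 1000**2     # PCI-E 2.0 x1 is roughly 500 MB/s per direction

traffic = BYTES_PER_TOKEN * TOKENS_PER_SEC   # bytes/s crossing the riser
fraction = traffic / PCIE2_X1_BPS
print(f"{traffic / 1e6:.2f} MB/s -> {fraction:.4%} of PCI-E 2.0 x1")
```

Even with these generous assumptions the link is well under 1% utilized, which is why the x1 risers on a mining rack shouldn't be the bottleneck for inference.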
Using the ooba webui, I run it with the following args in a startup script:
python server.py --listen --api --verbose --chat --xformers --loader exllama --model TheBloke_airoboros-65B-gpt4-1.4-GPTQ --gpu-split 8,8,8,8 --max_seq_len 4096 --alpha_value 2
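For a five-3060 rack you'd adapt the --gpu-split value, which lists how many GB of model weights to place on each card; each card also needs free VRAM left over for activations and the context cache. Here's a small hypothetical helper (check_split and the 3 GB headroom figure are my own illustration, not part of ooba or exllama) for sanity-checking a split before launching:

```python
# Hypothetical sanity check for a --gpu-split value.
# split_gb: GB of weights assigned per card; card_gb: total VRAM per card.
# headroom_gb is an assumed reserve for activations + context cache.
def check_split(split_gb, card_gb, headroom_gb=3.0):
    """Return True if every card keeps at least headroom_gb of VRAM free."""
    return all(s + headroom_gb <= c for s, c in zip(split_gb, card_gb))

# The 8,8,8,8 split above on 4x 16 GB A4000s leaves ~8 GB spare per card:
print(check_split([8, 8, 8, 8], [16, 16, 16, 16]))

# An illustrative five-way split for 12 GB 3060s (35 GB total for weights):
print(check_split([7, 7, 7, 7, 7], [12, 12, 12, 12, 12]))
```

As the reply above says, treat any starting split as a guess and fiddle with the values in the real benchmark until it stops crashing; the actual usable headroom depends on context length and the loader.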