65B working on multi-gpu #39
ortegaalfredo started this conversation in General · 3 comments, 6 replies
-
Moving this here instead of just closing the issue. I'm very happy to hear that it's working out for people, especially that you can get usable performance from multiple GPUs. Could I ask what CPU you're using?
-
I tried to get llama-65b to work on a g5.12x (4 GPUs with 24 GB VRAM each), but it gave me an OOM error. Any clue how to get it to work?
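For context on why this OOMs, here is a back-of-envelope VRAM estimate. All numbers are assumptions for illustration (4-bit weights, LLaMA-65B's 80 layers and 8192 hidden size, a 2048-token fp16 KV cache); real usage also varies with GPTQ group size, activation buffers, and the loader used:

```python
# Rough VRAM estimate for a 4-bit LLaMA-65B (hypothetical
# illustration; actual usage depends on group size, activation
# buffers, and the loader).

N_PARAMS = 65e9        # total parameters
BITS = 4               # GPTQ 4-bit quantization
N_LAYERS = 80          # LLaMA-65B transformer layers
HIDDEN = 8192          # LLaMA-65B hidden size
CTX = 2048             # context length
KV_BYTES = 2           # fp16 keys/values

def weight_gb():
    # quantized weights: params * bits / 8 bytes
    return N_PARAMS * BITS / 8 / 1e9

def kv_cache_gb():
    # keys + values, per layer, per token, fp16
    return N_LAYERS * 2 * HIDDEN * CTX * KV_BYTES / 1e9

total = weight_gb() + kv_cache_gb()
print(f"weights ~{weight_gb():.1f} GB, KV cache ~{kv_cache_gb():.1f} GB, "
      f"total ~{total:.1f} GB")
# → weights ~32.5 GB, KV cache ~5.4 GB, total ~37.9 GB
```

Since ~38 GB fits comfortably in the aggregate 96 GB of a g5.12x (and even in 2x24 GB), an OOM there usually means the split is uneven, e.g. the whole checkpoint landing on GPU 0 during load, so explicit per-GPU memory limits in whatever loader is used are worth trying.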
-
This is not an issue, just reporting that it works great with Guanaco-65B-GPTQ-4bit.act-order.safetensors from TheBloke using 2x3090s. Speed is great, about 15 t/s.