Qwen 2 build.py multi-GPU with 2 different GPUs issue #98
Never mind, I understand now; I'm using an already quantized model. But now, here:
The code gets Killed randomly |
Now after removing .cpu() in weights.py I get this error after loading the weights:
Although there is still enough VRAM available on my other GPU (and the model does load on both GPUs). Then, after that, this happens: Also, after increasing swap memory, there is still the same error. The issue seems to be that at the beginning it recognizes both GPUs, but after the weights are loaded it only recognizes 1 of my GPUs. |
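As a side note, a minimal sketch (assuming PyTorch with CUDA is available inside the container) to confirm at that point how many GPUs the process actually sees and how much free VRAM each one reports:

# Print every GPU visible to this process plus its free/total memory.
# Assumes PyTorch is installed; run it in the same environment that loads the weights.
import torch
print("visible GPUs:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(i, torch.cuda.get_device_name(i), f"{free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")

If this already reports only one device after the weights load, the problem is visibility (e.g. CUDA_VISIBLE_DEVICES), not memory.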
try to add |
Built it successfully now; I got 2 files in /app/tensorrt_llm/examples/qwen2/trt_engines/fp16/1-gpu. Does it automatically work with both GPUs used together? |
Yes, you can run it like this: mpirun -n 2 --allow-run-as-root \
python3 run.py --max_new_tokens=50 \
--tokenizer_dir /root/.cache/huggingface/hub/models--Qwen--Qwen1.5-72B-Chat-GPTQ-Int4/snapshots/b8665876947e59ffb3fbf5b5caa9bd354e885a13 \
--engine_dir=xxxxxx |
When I run the engine in llama-index I get this error, although I'm running it in the same Docker container:
Any idea? |
What's your GPU? |
I know, but llama-index has a TensorRT-LLM library. I ran a Llama-13B engine before. Or are you saying that Qwen2 is not supported in llama-index anyway?
These are my GPUs:
|
This command:
Gives: |
use |
I found that you use two different GPUs, an A100 and a 4090. |
But the model only fits on both together. It's a little over 40 GB; is that a problem? |
I mean make the 3090 able to run half the model if you enable |
That is what I already did |
I mean: first build (build with the A100):
python3 build.py --use_weight_only \
--weight_only_precision int4_gptq \
--per_group \
--hf_model_dir /root/.cache/huggingface/hub/models--Qwen--Qwen1.5-72B-Chat-GPTQ-Int4/snapshots/b8665876947e59ffb3fbf5b5caa9bd354e885a13 \
--quant_ckpt_path /root/.cache/huggingface/hub/models--Qwen--Qwen1.5-72B-Chat-GPTQ-Int4/snapshots/b8665876947e59ffb3fbf5b5caa9bd354e885a13 \
--world_size 2 --tp_size 2 --pp_size 1 \
--output_dir trt_engines/fp16/1-gpu/
second build (build with the 3090):
CUDA_VISIBLE_DEVICES=1 \
python3 build.py --use_weight_only \
--weight_only_precision int4_gptq \
--per_group \
--hf_model_dir /root/.cache/huggingface/hub/models--Qwen--Qwen1.5-72B-Chat-GPTQ-Int4/snapshots/b8665876947e59ffb3fbf5b5caa9bd354e885a13 \
--quant_ckpt_path /root/.cache/huggingface/hub/models--Qwen--Qwen1.5-72B-Chat-GPTQ-Int4/snapshots/b8665876947e59ffb3fbf5b5caa9bd354e885a13 \
--world_size 2 --tp_size 2 --pp_size 1 \
--output_dir trt_engines/fp16/2-gpu/
move the second rank's engine into the first directory:
mv trt_engines/fp16/2-gpu/rank1.engine trt_engines/fp16/1-gpu/
run:
mpirun -n 2 --allow-run-as-root \
python3 run.py --max_output_len=50 \
--tokenizer_dir /root/.cache/huggingface/hub/models--Qwen--Qwen1.5-72B-Chat-GPTQ-Int4/snapshots/b8665876947e59ffb3fbf5b5caa9bd354e885a13 \
--engine_dir=xxxxxx |
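Before launching run.py, a quick way to sanity-check that each MPI rank really lands on a different card (a sketch only; check_ranks.py is a hypothetical helper, and it assumes mpi4py and PyTorch are available plus a rank % device_count mapping, which is not confirmed from the source):

# check_ranks.py - run with: mpirun -n 2 --allow-run-as-root python3 check_ranks.py
from mpi4py import MPI
import torch

rank = MPI.COMM_WORLD.Get_rank()
# Assumption: the runtime assigns GPUs as rank % number_of_visible_devices.
dev = rank % torch.cuda.device_count()
print(f"rank {rank}: sees {torch.cuda.device_count()} GPU(s), would use cuda:{dev} = {torch.cuda.get_device_name(dev)}")

If both ranks report the same device on a two-GPU box, the run will oversubscribe one card.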
I tried it. Running the first command successfully makes an engine. The second command (for the 4090) gives an error:
Also, this command starts loading all the VRAM on the A100, not the 4090; I don't know if that is supposed to happen. |
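One possibility worth ruling out (this is an assumption on my part, not something visible in the output): by default CUDA enumerates devices "fastest first", so device index 1 under CUDA_VISIBLE_DEVICES is not guaranteed to be the same card that nvidia-smi shows as GPU 1. A small sketch to check the mapping:

# Check which physical card index 0 maps to once CUDA_VISIBLE_DEVICES=1 is applied.
# CUDA_DEVICE_ORDER=PCI_BUS_ID forces CUDA to use the same ordering as nvidia-smi.
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
import torch  # imported after setting the env vars so they take effect
print(torch.cuda.get_device_name(0))  # should name the 4090 if it is GPU 1 in nvidia-smi

If this prints the A100, exporting CUDA_DEVICE_ORDER=PCI_BUS_ID before the second build (or choosing the other index) should make CUDA_VISIBLE_DEVICES select the card you intend.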
Has anyone had success building with 2 different GPUs? |
Model: Qwen1.5-72B-Chat-GPTQ-Int4
python3 gptq_convert.py --hf_model_dir /root/.cache/huggingface/hub/models--Qwen--Qwen1.5-72B-Chat-GPTQ-Int4/snapshots/b8665876947e59ffb3fbf5b5caa9bd354e885a13 --tokenizer_dir /root/.cache/huggingface/hub/models--Qwen--Qwen1.5-72B-Chat-GPTQ-Int4/snapshots/b8665876947e59ffb3fbf5b5caa9bd354e885a13
Only 1 of the 2 GPUs is loading the model, and then I get a CUDA out of memory error:
torch.cuda.OutOfMemoryError: CUDA out of memory.
I already tried:
When I run the model with transformers normally, it distributes it correctly.
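For reference, this is roughly the transformers path that does distribute it (a sketch assuming the accelerate-backed device_map="auto" loading; the max_memory caps are illustrative placeholders, not tuned values):

# Shard the GPTQ checkpoint across both GPUs with transformers/accelerate.
# The memory caps below are hypothetical examples, not recommendations.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen1.5-72B-Chat-GPTQ-Int4",
    device_map="auto",                    # let accelerate split the layers over GPU 0 and GPU 1
    max_memory={0: "70GiB", 1: "22GiB"},  # example caps for an 80 GB card plus a 24 GB card
)
print(model.hf_device_map)                # shows which layers ended up on which GPU

Presumably the gptq_convert.py path does not shard automatically in the same way, which would explain why only one GPU fills up there.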
Does anyone know a fix?