
Triton's GPU memory usage is twice that of TensorRT-LLM #51

Open

lyc728 opened this issue Dec 22, 2023 · 20 comments

lyc728 commented Dec 22, 2023

I'm testing Qwen-72B built with --weight_only_precision int4, loaded across 4 GPUs at about 12 GB each. But when Triton runs inference, each GPU uses around 28 GB. Why is the gap so large?

Tlntin (Owner) commented Dec 22, 2023

Once in-flight batching is enabled on the Triton side, it deliberately grabs as much GPU memory as it can to support higher concurrency.
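
(For reference: newer tensorrtllm_backend versions expose a config.pbtxt parameter that caps how much free GPU memory the KV-cache pool may claim. A hedged sketch in this thread's config.pbtxt style; the exact key and its availability depend on your backend version, so verify against your config.pbtxt template:)

parameters: {
  key: "kv_cache_free_gpu_mem_fraction"
  value: {
    # Fraction of free GPU memory the KV-cache pool may claim;
    # lowering it trades peak concurrency for a smaller footprint.
    string_value: "0.5"
  }
}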

lyc728 (Author) commented Dec 22, 2023

But I've set it to V1 here; does that still use in-flight batching?
parameters: {
key: "gpt_model_type"
value: {
#string_value: "inflight_fused_batching"
string_value: "V1"
}
}

Tlntin (Owner) commented Dec 22, 2023

Did you enable in-flight batching when you built the engine? If so, try turning it off, and try setting batch-size to 1 at build time. If that still doesn't help, you can ask on the official Triton repo.
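
(For illustration, a rebuild along these lines; the flag names follow the TensorRT-LLM qwen example's build.py of that era and should be checked against your local script:)

# Sketch: rebuild without in-flight batching support and with batch size 1.
# Drop --use_inflight_batching if you had added it, and cap the batch size.
python build.py --hf_model_dir /data/llms/Qwen-72B-Chat/ \
  --dtype float16 \
  --use_weight_only \
  --weight_only_precision int4 \
  --max_batch_size 1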

lyc728 (Author) commented Dec 23, 2023

I'm using two GPUs, but the port ends up already in use:
root@4f2bb9ebf657:/models/tensorrtllm_backend/Qwen-TensorRT/qwen# mpirun -n 2 --allow-run-as-root python api.py
Loading engine from /models/tensorrtllm_backend/tensorrt_llm/examples/Qwen-72B_16k/trt_engines/qwen_float16_tp2_rank0.engine
Loading engine from /models/tensorrtllm_backend/tensorrt_llm/examples/Qwen-72B_16k/trt_engines/qwen_float16_tp2_rank1.engine
INFO: Started server process [22325]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO: Started server process [22326]
INFO: Waiting for application startup.
INFO: Application startup complete.
ERROR: [Errno 98] error while attempting to bind on address ('0.0.0.0', 8000): address already in use
INFO: Waiting for application shutdown.
INFO: Application shutdown complete.

Tlntin (Owner) commented Dec 23, 2023

api.py doesn't support multi-GPU 😅
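
(A minimal sketch of how a multi-rank entry point could avoid the port clash, assuming mpi4py, fastapi, and uvicorn are installed; this is not the repo's actual api.py, just the shape of a fix:)

# Sketch: only MPI rank 0 binds the HTTP port, so `mpirun -n 2` no longer
# starts two uvicorn servers on the same address.
from mpi4py import MPI
import uvicorn
from fastapi import FastAPI

app = FastAPI()

if MPI.COMM_WORLD.Get_rank() == 0:
    # Rank 0 serves HTTP; it would also need to broadcast requests to peers.
    uvicorn.run(app, host="0.0.0.0", port=8000)
else:
    # Non-zero ranks stay alive for tensor-parallel inference work.
    MPI.COMM_WORLD.Barrier()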

lyc728 (Author) commented Dec 25, 2023

Have you run into this problem? It runs fine from the terminal, but errors out when launched from the IDE.
I also added the environment variables LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/mpi/lib;OPAL_PREFIX=/opt/hpcx/ompi


Sorry! You were supposed to get help about:
mpi_init:startup:internal-failure
But I couldn't open the help file:
/build-result/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi/share/openmpi/help-mpi-runtime.txt: No such file or directory. Sorry!

*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[4f2bb9ebf657:67032] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

Tlntin (Owner) commented Dec 25, 2023

Did you build trt-llm manually?

lyc728 (Author) commented Dec 25, 2023

No, I just followed your blog post: pip install git+https://github.com/NVIDIA/TensorRT-LLM.git@release/0.5.0

Tlntin (Owner) commented Dec 25, 2023

Oh, I see. This is probably caused by an MPI library upgrade. You can try the following:

  • Install the MPI library manually:
apt update
apt install libopenmpi-dev
pip install https://github.com/Shixiaowei02/mpi4py/tarball/fix-setuptools-version
  • Then install TensorRT-LLM as above:
pip install git+https://github.com/NVIDIA/TensorRT-LLM.git@release/0.5.0

lyc728 (Author) commented Dec 25, 2023

But it works from my terminal; it only fails in VS Code, so now I suspect it's the environment variables.
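
(If it is the environment, one hedged workaround, assuming a bash shell inside the container, is to persist the variables so the shell VS Code spawns inherits them:)

# Sketch: persist the MPI paths quoted earlier so every new shell,
# including VS Code's integrated terminal, picks them up.
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/mpi/lib' >> ~/.bashrc
echo 'export OPAL_PREFIX=/opt/hpcx/ompi' >> ~/.bashrc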

Tlntin (Owner) commented Dec 25, 2023

Ah, OK. Then I'm not sure either.

lyc728 (Author) commented Dec 26, 2023

trt-llm 0.6.1 with the main-branch qwen-trt-llm. For the build I used:

python build.py --hf_model_dir /data/llms/Qwen-72B-Chat/ \
  --dtype float16 \
  --remove_input_padding \
  --use_gpt_attention_plugin float16 \
  --use_gemm_plugin float16 \
  --enable_context_fmha \
  --use_weight_only \
  --rotary_base 1000000 \
  --weight_only_precision int4 \
  --output_dir /data/qwen_test/examples/qwen/out_engine_72B_2gpu_12/ \
  --world_size 2 \
  --tp_size 2

I ran run.py unmodified: mpirun -n 2 --allow-run-as-root python3 run.py
Now it fails with the errors below. Have you seen this?
(screenshots: 企业微信截图_17035551832733, 企业微信截图_17035552172418)

Tlntin (Owner) commented Dec 26, 2023

You can try 0.5.0; 0.6.1 hasn't been tested thoroughly yet and has some hidden issues.

lyc728 (Author) commented Dec 26, 2023

Hi, could you please confirm that these two branches have been aligned and fully tested?
(screenshots: 企业微信截图_17035562317727, 企业微信截图_17035562502026)

Tlntin (Owner) commented Dec 26, 2023

Yes, they're aligned. 0.6.1 doesn't look like a stable release yet, and there is no matching Triton build for it. The latest Triton has jumped trt-llm straight to 0.7.0, which frankly has some pitfalls, so I'd suggest staying on 0.5.0 for now.

lyc728 (Author) commented Dec 26, 2023

Hi, one more thing to confirm: with NVIDIA/TensorRT-LLM v0.6.0 and above versus your 0.5.0 version, the TRT inference results differ. Do you know what's different between them?

Tlntin (Owner) commented Dec 26, 2023

It's most likely a parameter-configuration issue. I changed the default parameters once on my side and aligned them with the original model. See the corresponding commit.
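
(To make "aligned defaults" concrete, a hedged Python sketch of the sampling parameters that usually have to match before outputs from two engine versions can be compared; the field names mirror common TensorRT-LLM SamplingConfig fields and are assumptions, not this repo's exact defaults:)

# Sketch: with greedy decoding and neutral penalties, any remaining output
# difference points to the build itself (weights, plugins, rotary_base)
# rather than sampling randomness.
aligned_sampling = dict(
    temperature=1.0,         # 1.0 = no temperature scaling
    top_k=1,                 # greedy, deterministic decoding
    top_p=0.0,               # disabled while top_k=1
    repetition_penalty=1.0,  # neutral
    max_new_tokens=512,      # keep identical across both versions
)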

lyc728 (Author) commented Dec 26, 2023

OK, I'll give it another try.

Tlntin (Owner) commented Dec 29, 2023

I took another look at your run command. Your engine output path is /data/qwen_test/examples/qwen/out_engine_72B_2gpu_12/ and your model path is /data/llms/Qwen-72B-Chat/, both different from the defaults I set. So you can't run run.py as-is; you need to pass run.py the HF tokenizer path and the TRT engine path, otherwise it falls back to the defaults and fails.
I just tested the latest main branch locally and it works, so this is a usage issue on your side.
I suggest changing the run.py command to:

mpirun -n 2 --allow-run-as-root python3 run.py --engine_dir /data/qwen_test/examples/qwen/out_engine_72B_2gpu_12/ \
  --tokenizer_dir /data/llms/Qwen-72B-Chat/

Tlntin closed this as completed Dec 29, 2023
Tlntin (Owner) commented Jun 12, 2024

@lyc728 After testing the latest trt-llm 0.10.0 with its matching tritonserver, the excessive GPU memory usage problem is fixed. You can try it; usage is basically the same as the current 0.8.0.

Tlntin reopened this Jun 12, 2024