Converting the official Qwen-xxB-Chat-Int4 to TRT: with greedy search on both sides, is it normal that the TRT and torch results differ? #57
Comments
It might be normal. Could you share a concrete example? The mismatch may simply be caused by different inference parameters.
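As a concrete way to rule out sampling settings, the torch-side run can be pinned to plain greedy decoding. This is a minimal sketch assuming the standard Hugging Face `transformers` API; the path and prompt are hypothetical, and it skips Qwen's chat template for brevity, so it is only meant to show which generation parameters to align, not to be the exact reproduction script from this issue.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical local checkpoint path and prompt, for illustration only.
# Loading an Int4 GPTQ checkpoint this way also requires auto-gptq/optimum.
model_dir = "Qwen-7B-Chat-Int4"
prompt = "你好"

tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_dir, device_map="auto", trust_remote_code=True
).eval()

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Pin everything to greedy search so the torch reference matches the
# decoding mode used on the TensorRT-LLM side (Qwen's default
# generation_config enables sampling).
output_ids = model.generate(
    **inputs,
    do_sample=False,        # disable sampling
    num_beams=1,            # plain greedy, no beam search
    repetition_penalty=1.0,
    max_new_tokens=128,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```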
A question: I used eight 40GB A100s to convert the 72B model to fp16 with rotary_base=1000000, max_input_len=12000, max_output_len=2048. When debugging, the output right after gpt_attention shows a fairly large error compared with torch fp16. What could be the cause?
When debugging, check whether the seq_length passed to the attention is correct; it should be 32k.
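For reference, the model's native context length can be read straight from the Hugging Face checkpoint config. A minimal sketch; the `seq_length` field name follows the Qwen config layout and the local path is hypothetical:

```python
from transformers import AutoConfig

# Hypothetical local checkpoint path.
cfg = AutoConfig.from_pretrained("Qwen-72B-Chat", trust_remote_code=True)

# Qwen-style configs expose the native context window as `seq_length`;
# fall back to `max_position_embeddings` for other config layouts.
seq_len = getattr(cfg, "seq_length", None) or getattr(cfg, "max_position_embeddings", None)
print(f"native seq_length: {seq_len}")  # expected to be 32768 (32k) for Qwen-72B
```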
To add: I tested the 7B model. Converted to fp16 it is consistent; the intermediate values and the final output match torch fp16.
Could slight differences in the gpt_attention part between 7B and 72B be causing this?
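To quantify "the error is large" when comparing the dumped gpt_attention output against the torch fp16 activations, a small diff script along these lines can help. The file names are hypothetical; it only assumes both sides were saved as numpy arrays of the same shape:

```python
import numpy as np

# Hypothetical dump files: one from the TRT-LLM debug output, one from torch.
trt_out = np.load("trt_gpt_attention_out.npy").astype(np.float32)
torch_out = np.load("torch_attention_out.npy").astype(np.float32)

assert trt_out.shape == torch_out.shape, (trt_out.shape, torch_out.shape)

abs_err = np.abs(trt_out - torch_out)
rel_err = abs_err / (np.abs(torch_out) + 1e-6)

print(f"max abs err:  {abs_err.max():.6f}")
print(f"mean abs err: {abs_err.mean():.6f}")
print(f"max rel err:  {rel_err.max():.6f}")
# A large, systematic error (not just a few outliers) usually points to a
# wrong shape or parameter (e.g. seq_length) rather than fp16 rounding noise.
```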
Converting the official Qwen-xxB-Chat-Int4 to TRT: with greedy search on both sides, is it normal that the TRT and torch results differ?
python build.py --hf_model_dir Qwen-7B-Chat-Int4/ \
                --quant_ckpt_path Qwen-7B-Chat-Int4/ \
                --dtype float16 \
                --remove_input_padding \
                --use_gpt_attention_plugin float16 \
                --enable_context_fmha \
                --use_gemm_plugin float16 \
                --use_weight_only \
                --weight_only_precision int4_gptq \
                --per_group \
                --world_size 1 \
                --tp_size 1 \
                --output_dir models/7B-int4/1_fp16-gpu
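Once the engine built above and the torch model both generate under greedy decoding, locating the first divergent token narrows down whether the mismatch starts immediately or only after many steps. A minimal sketch; the two token-id lists are hypothetical stand-ins for the outputs of the respective runs:

```python
from typing import Sequence

def first_divergence(trt_ids: Sequence[int], torch_ids: Sequence[int]) -> int:
    """Return the index of the first token where the two greedy outputs differ,
    or -1 if one sequence is a prefix of the other (or they are identical)."""
    for i, (a, b) in enumerate(zip(trt_ids, torch_ids)):
        if a != b:
            return i
    return -1

# Hypothetical output token ids from the TRT-LLM engine and the torch run.
trt_ids = [11, 42, 7, 305, 19]
torch_ids = [11, 42, 7, 306, 22]
print(f"first divergent token index: {first_divergence(trt_ids, torch_ids)}")
```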