Garbage response when input is longer than 4096 tokens on Llama-3.1-8B-Instruct #624
Comments
Since you didn't share the full reproduction steps, including how you converted the checkpoint, the request you actually used, and the commit/version/docker image, I tried the long-context evaluation task of TensorRT-LLM on the latest main branch (535c9cc) and I cannot reproduce the accuracy issue. The following are my steps (using an 8k input):

```
python ./examples/quantization/quantize.py --model_dir Meta-Llama-3.1-8B/ \
    --dtype bfloat16 \
    --qformat int4_awq \
    --awq_block_size 128 \
    --output_dir /tmp/llama-3.1/trt_ckpts/int4_awq/ \
    --calib_size 32

python -m tensorrt_llm.commands.build --checkpoint_dir /tmp/llama-3.1/trt_ckpts/int4_awq/ \
    --output_dir /tmp/llama-3.1/trt_engines/int4_awq/ \
    --gpt_attention_plugin bfloat16 \
    --gemm_plugin bfloat16 \
    --max_num_tokens 131072 \
    --max_input_len 131072 \
    --max_seq_len 131072 \
    --use_paged_context_fmha enable \
    --workers 1

python3 examples/infinitebench/construct_synthetic_dataset.py --test_case build_passkey --test_level 0

python examples/eval_long_context.py --task passkey \
    --engine_dir /tmp/llama-3.1/trt_engines/int4_awq/ \
    --tokenizer_dir Meta-Llama-3.1-8B/ \
    --stop_idx 10 \
    --max_input_length 8192 \
    --enable_chunked_context \
    --max_tokens_in_paged_kv_cache 131136
```

and the results look like:

```
[11/21/2024-09:35:49] [TRT-LLM] [I] Load engine takes: 4.858942270278931 sec
[11/21/2024-09:35:49] [TRT-LLM] [I] ==== Evaluation ====
[11/21/2024-09:35:49] [TRT-LLM] [I] # examples: 275
[11/21/2024-09:35:49] [TRT-LLM] [I] Start index: 0
[11/21/2024-09:35:49] [TRT-LLM] [I] Stop index: 10
[11/21/2024-09:35:49] [TRT-LLM] [I] Max tokens: 6
[11/21/2024-09:35:58] [TRT-LLM] [I] Compute the score
10it [00:00, 26329.59it/s]
[11/21/2024-09:35:58] [TRT-LLM] [I] Evaluation takes: 8.512326717376709 sec.
[11/21/2024-09:35:58] [TRT-LLM] [I] accuracy of 10 examples: 1.0
[TensorRT-LLM][INFO] Refreshed the MPI local session
```

Can you try the evaluation task first?
Hey @byshiue, these are my quantisation arguments:

```
python quantize.py --model_dir /Meta-Llama-3.1-8B-Instruct \
    --output_dir /Meta-Llama-3.1-8B-Instruct-AWQ \
    --dtype bfloat16 \
    --qformat int4_awq \
    --awq_block_size 64
```

The container tag I am using is
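Not part of the original exchange, but a quick cross-check of the failing request may help before comparing quantisation settings. Below is a minimal sketch for counting prompt tokens, assuming the Hugging Face tokenizer shipped with the checkpoint and a hypothetical long_prompt.txt holding the problematic prompt:

```python
# Sketch only: confirm the request really crosses the 4096-token boundary.
# "long_prompt.txt" is a hypothetical file holding the failing prompt.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/Meta-Llama-3.1-8B-Instruct")
with open("long_prompt.txt") as f:
    prompt = f.read()

n_tokens = len(tokenizer(prompt).input_ids)
print(f"prompt length: {n_tokens} tokens")
```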
System Info
NVIDIA A100 40 GB
Who can help?
@byshiue @ka
Reproduction
```
trtllm-build --checkpoint_dir Meta-Llama-3.1-8B-Instruct-AWQ \
    --output_dir Meta-Llama-3.1-8B-Instruct-AWQ-TRTLLM \
    --gpt_attention_plugin bfloat16 \
    --gemm_plugin bfloat16 \
    --max_num_tokens 131072 \
    --max_input_len 131072 \
    --max_seq_len 131072 \
    --use_paged_context_fmha enable \
    --workers 8
```
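One way to rule out a build mismatch is to read back the config.json that trtllm-build writes next to the engine. A minimal sketch follows; the key names ("build_config", "plugin_config", ...) are assumptions based on recent TensorRT-LLM releases and may differ between versions:

```python
# Sketch only: print the context-length-related settings the engine was
# actually built with. Key layout is an assumption for recent TRT-LLM builds.
import json

with open("Meta-Llama-3.1-8B-Instruct-AWQ-TRTLLM/config.json") as f:
    cfg = json.load(f)

build = cfg.get("build_config", {})
for key in ("max_input_len", "max_seq_len", "max_num_tokens"):
    print(key, "=", build.get(key))
print("use_paged_context_fmha =", build.get("plugin_config", {}).get("use_paged_context_fmha"))
```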
Expected behavior
Llama 3.1 should be able to handle up to 131072 tokens, and according to the example here, NVIDIA demonstrated this to be possible, at least on the 405B-parameter variant.
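A quick way to confirm the 128k claim at the checkpoint level is to inspect the Hugging Face config. A minimal sketch, assuming the local Meta-Llama-3.1-8B-Instruct checkpoint directory from the commands above:

```python
# Sketch: Llama 3.1 advertises a 131072-token context via
# max_position_embeddings plus a "llama3" rope_scaling entry in its config.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("Meta-Llama-3.1-8B-Instruct")
print("max_position_embeddings:", cfg.max_position_embeddings)
print("rope_scaling:", getattr(cfg, "rope_scaling", None))
```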
Actual behavior
The model returns garbage responses once the input is longer than 4096 tokens.
Additional notes
I am using the inflight_batcher_llm repository and I have tried toggling `enable_chunked_context` on and off.
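For completeness, a minimal sketch of how the failure could be reproduced directly against the Triton server started from inflight_batcher_llm. The model name, port, and field names (text_input, max_tokens) assume the default tensorrtllm_backend setup and its HTTP generate endpoint, and may differ in other configurations:

```python
# Sketch only: send a >4096-token prompt to the Triton generate endpoint and
# inspect the output. Endpoint and field names assume defaults and may differ.
import requests

long_prompt = "word " * 6000  # roughly 6k tokens, well past the 4096 boundary

resp = requests.post(
    "http://localhost:8000/v2/models/ensemble/generate",
    json={"text_input": long_prompt, "max_tokens": 32},
    timeout=300,
)
print(resp.json())
```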