
Garbage response when input is longer than 4096 tokens on Llama-3.1-8B-Instruct #624

Open
winstxnhdw opened this issue Oct 18, 2024 · 2 comments
Labels
bug Something isn't working

@winstxnhdw

System Info

NVIDIA A100 40 GB

Who can help?

@byshiue @ka

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

trtllm-build --checkpoint_dir Meta-Llama-3.1-8B-Instruct-AWQ \
             --output_dir Meta-Llama-3.1-8B-Instruct-AWQ-TRTLLM \
             --gpt_attention_plugin bfloat16 \
             --gemm_plugin bfloat16 \
             --max_num_tokens 131072 \
             --max_input_len 131072 \
             --max_seq_len 131072 \
             --use_paged_context_fmha enable \
             --workers 8

Expected behavior

Llama 3.1 should be able to handle up to 131072 tokens, and according to the example here, NVIDIA has demonstrated this to be possible, at least on the 405B-parameter variant.

Actual behavior

{
	"batch_index": 0,
	"context_logits": 0.0,
	"cum_log_probs": 0.0,
	"generation_logits": 0.0,
	"model_name": "ensemble",
	"model_version": "1",
	"output_log_probs": [
		0.0,
		0.0,
		0.0,
		0.0,
		0.0,
		0.0,
		0.0,
		0.0,
		0.0,
		0.0,
		0.0,
		0.0,
		0.0,
		0.0,
		0.0,
		0.0,
		0.0,
		0.0,
		0.0,
		0.0,
		0.0,
		0.0,
		0.0,
		0.0,
		0.0,
		0.0,
		0.0,
		0.0,
		0.0,
		0.0,
		0.0,
		0.0,
		0.0,
		0.0,
		0.0,
		0.0,
		0.0,
		0.0,
		0.0,
               // ... I have truncated most of the logits here for readability
	],
	"sequence_end": false,
	"sequence_id": 0,
	"sequence_start": false,
	"text_output": "atorettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenette
nettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenette
nettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenettenetten"
}

Additional notes

I am using the inflight_batcher_llm repository, and I have tried toggling enable_chunked_context both on and off.
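For anyone reproducing this, a minimal sketch of toggling that flag with the fill_template.py helper from tensorrtllm_backend, assuming the standard inflight_batcher_llm model layout (the engine path and the other parameter values are placeholders):

# Hypothetical sketch: parameter names are assumed from the inflight_batcher_llm config.pbtxt template
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt \
        batching_strategy:inflight_fused_batching,engine_dir:/Meta-Llama-3.1-8B-Instruct-AWQ-TRTLLM,enable_chunked_context:true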

@winstxnhdw added the bug (Something isn't working) label on Oct 18, 2024
@byshiue (Collaborator) commented Nov 21, 2024

Since you haven't shared the full reproduction steps, including how you convert the checkpoint, the actual request you send, and the commit/version/docker image, I tried the long-context evaluation task of TensorRT-LLM on the latest main branch (535c9cc) and I cannot reproduce the accuracy issue. The following are my steps (using an 8k input):

python ./examples/quantization/quantize.py --model_dir Meta-Llama-3.1-8B/ \
                                   --dtype bfloat16 \
                                   --qformat int4_awq \
                                   --awq_block_size 128 \
                                   --output_dir /tmp/llama-3.1/trt_ckpts/int4_awq/ \
                                   --calib_size 32

python -m tensorrt_llm.commands.build --checkpoint_dir /tmp/llama-3.1/trt_ckpts/int4_awq/ \
             --output_dir /tmp/llama-3.1/trt_engines/int4_awq/ \
             --gpt_attention_plugin bfloat16 \
             --gemm_plugin bfloat16 \
             --max_num_tokens 131072 \
             --max_input_len 131072 \
             --max_seq_len 131072 \
             --use_paged_context_fmha enable \
             --workers 1

python3 examples/infinitebench/construct_synthetic_dataset.py --test_case build_passkey --test_level 0
python examples/eval_long_context.py  --task passkey \
                                      --engine_dir /tmp/llama-3.1/trt_engines/int4_awq/ \
                                      --tokenizer_dir Meta-Llama-3.1-8B/ \
                                      --stop_idx 10 \
                                      --max_input_length 8192 \
                                      --enable_chunked_context \
                                      --max_tokens_in_paged_kv_cache 131136

and the results look like this:

[11/21/2024-09:35:49] [TRT-LLM] [I] Load engine takes: 4.858942270278931 sec
[11/21/2024-09:35:49] [TRT-LLM] [I] ==== Evaluation ====
[11/21/2024-09:35:49] [TRT-LLM] [I] # examples: 275
[11/21/2024-09:35:49] [TRT-LLM] [I] Start index: 0
[11/21/2024-09:35:49] [TRT-LLM] [I] Stop index: 10
[11/21/2024-09:35:49] [TRT-LLM] [I] Max tokens: 6
[11/21/2024-09:35:58] [TRT-LLM] [I] Compute the score
10it [00:00, 26329.59it/s]
[11/21/2024-09:35:58] [TRT-LLM] [I] Evaluation takes: 8.512326717376709 sec.
[11/21/2024-09:35:58] [TRT-LLM] [I] accuracy of 10 examples: 1.0
[TensorRT-LLM][INFO] Refreshed the MPI local session

Can you try the evaluation task first?

@winstxnhdw (Author)

Hey @byshiue,

These are my quantisation arguments:

python quantize.py --model_dir /Meta-Llama-3.1-8B-Instruct \
                   --output_dir /Meta-Llama-3.1-8B-Instruct-AWQ \
                   --dtype bfloat16 \
                   --qformat int4_awq \
                   --awq_block_size 64

The container tag I am using is 24.10-trtllm-python-py3. I am not able to run the evaluation task because my company's proxy blocks MPI from being installed. You should be able to replicate the issue with any long-context input, for example a request like the one sketched below. I am also using the instruct model, not the base model.
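In case it helps, a minimal sketch of the kind of request that triggers it, using the Triton HTTP generate endpoint from the inflight_batcher_llm setup (host, prompt, and max_tokens are placeholders):

# Hypothetical example: any sufficiently long text_input (> 4096 tokens) yields the garbage output
curl -s -X POST localhost:8000/v2/models/ensemble/generate \
     -d '{"text_input": "<prompt longer than 4096 tokens>", "max_tokens": 64}'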
