Description
When deploying Mistral 7B Instruct v0.2 to a SageMaker endpoint (ml.g5.12xlarge) using the TensorRT-LLM backend (just-in-time compilation), I noticed that some of the serving parameters get overwritten.
Specifically, I used the following set of serving properties (a deployment sketch follows the list):
"SERVING_ENGINE": "MPI",
"OPTION_TENSOR_PARALLEL_DEGREE": "1",
"OPTION_MAX_ROLLING_BATCH_SIZE": "16",
"OPTION_ROLLING_BATCH":"trtllm",
"OPTION_MAX_INPUT_LEN":"2048",
"OPTION_MAX_OUTPUT_LEN":"16",
"OPTION_BATCH_SCHEDULER_POLICY": "max_utilization"
CloudWatch logs state the following:
max_input_len is 2048 is larger than max_seq_len 16, clipping it to max_seq_len
max_num_tokens (256) shouldn't be greater than max_seq_len * max_batch_size (256), specifying to max_seq_len * max_batch_size (256).
The documentation lists the default value of max_num_tokens as 16384.
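Reading the two log lines together, the engine build appears to apply roughly the following arithmetic (a minimal sketch; the variable names are mine, inferred from the logs, not actual TensorRT-LLM internals):

```python
# Values taken from the serving properties and the CloudWatch log lines above.
max_batch_size = 16             # OPTION_MAX_ROLLING_BATCH_SIZE
requested_max_input_len = 2048  # OPTION_MAX_INPUT_LEN
max_seq_len = 16                # what the build ends up with, per the log

# "max_input_len is 2048 is larger than max_seq_len 16, clipping it to max_seq_len"
effective_max_input_len = min(requested_max_input_len, max_seq_len)  # -> 16

# "max_num_tokens (256) shouldn't be greater than max_seq_len * max_batch_size (256)"
max_num_tokens = 256
effective_max_num_tokens = min(max_num_tokens, max_seq_len * max_batch_size)  # -> 256
```

Note that 256 is exactly max_seq_len * max_batch_size (16 x 16), so max_num_tokens already sits far below the documented 16384 default before this check runs.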
Expected Behavior
The serving parameters should retain the values supplied in the serving properties (in particular, max_input_len = 2048).
Error Message
When submitting inference requests: this model is compiled to take up to 16 tokens. But actual tokens is 987 > 16. Please set with option.max_input_len=987
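For completeness, a request along these lines triggers the error whenever the prompt tokenizes to more than 16 tokens (the endpoint name is a placeholder and the payload assumes the usual LMI inputs/parameters JSON schema):

```python
import json
import boto3

runtime = boto3.client("sagemaker-runtime", region_name="us-west-2")

# Any prompt longer than 16 tokens hits the limit; mine was ~987 tokens.
payload = {
    "inputs": "<prompt of roughly 987 tokens>",
    "parameters": {"max_new_tokens": 16},
}

response = runtime.invoke_endpoint(
    EndpointName="mistral-7b-instruct-trtllm",  # placeholder endpoint name
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(response["Body"].read().decode())
```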
How to Reproduce?
Deploy the model with the serving properties listed in the description, then submit an inference request whose prompt is longer than 16 tokens.
Environment Info
Docker image: 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.29.0-tensorrtllm0.11.0-cu124