Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
```python
except ClientResponseError as e:
    result_content_str = e.response_content.decode()
    is_out_of_context_length = e.status == 400 and (
        "context length" in result_content_str or "max_tokens" in result_content_str
```
Can we double-check that these are the same error patterns vLLM will throw for the responses endpoint as for chat completions?
Responses pattern — there seem to be two paths, harmony or not:

Harmony:
https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/engine/serving.py#L921
skips _preprocess_chat and calls _validate_generator_input, which produces the "max_model_len" error pattern (https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/responses/serving.py#L301)

Non-harmony:
calls _preprocess_chat, which produces "context length", and later calls _validate_generator_input as well
https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/responses/serving.py#L613

Chat completions pattern:
_preprocess_chat: https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/chat_completion/serving.py#L297
calls _validate_input: https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/engine/serving.py#L921
resulting in "context length" in the error message.
"max_tokens" seems to come from here: https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/engine/serving.py#L959

So I think we should keep "context length" and "max_tokens", and also check for "max_model_len" for responses.
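If that reading is right, the check could be extended roughly like this (a sketch only; the helper name and the `response_content` attribute follow the snippet quoted above and are not the actual client API):

```python
# Sketch: consolidate the context-overflow patterns vLLM may return for either
# endpoint. "context length" / "max_tokens" come from chat completions and the
# non-harmony responses path; "max_model_len" comes from the harmony responses path.
OUT_OF_CONTEXT_PATTERNS = ("context length", "max_tokens", "max_model_len")


def is_out_of_context_error(status: int, response_content: bytes) -> bool:
    """Return True if a 400 error body looks like a vLLM context-overflow error."""
    body = response_content.decode()
    return status == 400 and any(p in body for p in OUT_OF_CONTEXT_PATTERNS)
```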
Some testing results: vLLM logs showing the responses API output format.
Support responses-native models. Currently we convert responses requests into chat completions requests; this change enables using the responses endpoint directly.
Basic tests are working; needs further testing.
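For illustration, the difference between the two paths might look like this with the OpenAI client pointed at a local vLLM server (base URL and model name are placeholders, not part of this change):

```python
# Illustrative only: contrasts the existing conversion path with calling the
# responses endpoint directly against an OpenAI-compatible vLLM server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Previous behavior: responses-style requests were translated into chat completions.
chat = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Hello"}],
)

# With this change: talk to the responses endpoint directly.
resp = client.responses.create(
    model="my-model",
    input="Hello",
)
```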
#521