
Expose vLLM logprobs in model output #3491

Open
CoolFish88 opened this issue Oct 1, 2024 · 3 comments
Labels
enhancement New feature or request

Comments

@CoolFish88

Description

vLLM's sampling parameters include a richer set of options than those currently exposed, among which logprobs is particularly useful.

When testing by adding the logprobs option to the request payload, the model output schema was unchanged ({"generated_text": "model_output"}), suggesting the option was not propagated to the output.

Will this change the current API? How?

Probably by enriching the output schema.
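
A minimal sketch of what an enriched output could look like (field names are illustrative only, not an existing schema):

    {
      "generated_text": "model_output",
      "logprobs": [
        {"token": "model", "logprob": -0.12},
        {"token": "_output", "logprob": -1.05}
      ]
    }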

Who will benefit from this enhancement?

Anyone who wants log probabilities returned alongside model predictions.

References

  • This thread provides a starting point for tackling this issue.
CoolFish88 added the enhancement (New feature or request) label on Oct 1, 2024
@frankfliu (Contributor)

@sindhuvahinis

@CoolFish88 (Author)

Found this while looking into CloudWatch logs:

The following parameters are not supported by vllm with rolling batch: {'max_tokens', 'seed', 'logprobs', 'temperature'}

@siddvenk (Contributor) commented Oct 2, 2024

What is the payload you are using to invoke the endpoint?

We do expose generation parameters that can be included in the inference request. Details are in https://docs.djl.ai/master/docs/serving/serving/docs/lmi/user_guides/lmi_input_output_schema.html.

We have slightly different names for some of the generation/sampling parameters, because our API unifies different inference backends such as vllm, tensorrt-llm, huggingface accelerate, and transformers-neuronx.

If you want to use a different API schema, we provide documentation on writing your own input/output parsers: https://docs.djl.ai/master/docs/serving/serving/docs/lmi/user_guides/lmi_input_output_schema.html#custom-pre-and-post-processing.

We also support the OpenAI chat completions schema for chat-type models: https://docs.djl.ai/master/docs/serving/serving/docs/lmi/user_guides/chat_input_output_schema.html.
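
As a rough illustration (a sketch only; verify the exact field names and supported parameters in the linked schema doc for your container version), the unified schema takes sampling options under "parameters", e.g. max_new_tokens rather than vLLM's max_tokens:

    {
      "inputs": "What is Deep Java Library?",
      "parameters": {
        "max_new_tokens": 64,
        "temperature": 0.7,
        "details": true
      }
    }

Based on the linked schema doc, setting details to true should return token-level information, including per-token log probabilities, in a details object alongside generated_text; the exact fields are described there.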
