server : fix logprobs, make it OAI-compatible #10783
Conversation
result.tok = ids[i];
result.text_to_send = common_token_to_piece(ctx, result.tok, params_base.special);
result.prob = 1.0f; // set later
Here is the branch for speculative decoding. I'm not sure how I can get token probs here. Could you give me some clues? @ggerganov
(Or we can skip this for now if it's complicated)
Think we need to update common_sampler_sample_and_accept_n() to return the probs. But let's fix this later.
Based on the example response, it appears that this implementation returns probabilities (between 0 and 1) rather than log probabilities. Also, just to clarify, are these probabilities taken before or after the samplers are applied?
Ok, I didn't notice that, thanks for spotting it. It's a simple fix, just a matter of applying std::log to the values.
It's the prob after the samplers are applied.
That is wrong and needs to be changed. Post-sampler probs are essentially worthless. Imagine you are doing greedy sampling: the post-sampler probability of the chosen token is always 1. But with the current logic, that is exactly what would be returned. Samplers destroy information in order to make a choice. The whole point of requesting logprobs is to get that information. We already know which choice was made (the token in the result), so the distorted probability distribution left over by the sampler chain is useless. Pretty much every standard use case for logprobs (confidence estimation, beam search guidance) breaks down if the probs are post-sampling.
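To make the argument concrete with made-up numbers: suppose the model's pre-sampling distribution over the three most likely tokens is

$$
p(\text{the}) = 0.6, \qquad p(\text{a}) = 0.3, \qquad p(\text{an}) = 0.1 .
$$

Greedy sampling keeps only the chosen token and renormalizes, so the post-sampler distribution becomes $p(\text{the}) = 1$, and the reported post-sampler "probability" of the chosen token is 1.0, erasing the model's actual 0.6 confidence.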
Tbh I'm not even the one who wrote this part of the code originally, so I didn't notice that. Judging by the comments in vllm-project/vllm#9453, this is supposed to be pre-sampling, as you said. It's just a few lines of code to modify, not a big deal.
Done, it's now returning the pre-sampling probs.
Great stuff! This will make the server much more useful for research-type work. I have tested many inference engines, and all of them have at least one bug related to obtaining logprobs, which led me to hack together my own server to work with them. Glad I'll finally be able to ditch that!
@@ -664,3 +694,33 @@ static json format_logit_bias(const std::vector<llama_logit_bias> & logit_bias)
static std::string safe_json_to_str(json data) {
    return data.dump(-1, ' ', false, json::error_handler_t::replace);
}

static std::vector<llama_token_data> get_token_probabilities(llama_context * ctx, int idx) {
NOTE: because llama_sampler_init_softmax is deprecated, I need to copy its code here.
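For reference, a minimal sketch of what such a helper can look like when the essence of the deprecated softmax sampler is inlined (it assumes the usual llama.h, <algorithm> and <cmath> includes of the server code; the exact code in the PR may differ):

```cpp
// Sketch: build a llama_token_data entry for every token in the vocabulary,
// sort by logit in descending order, and apply a softmax so that .p holds
// the pre-sampling probability of each token at output position idx.
static std::vector<llama_token_data> get_token_probabilities(llama_context * ctx, int idx) {
    const float * logits  = llama_get_logits_ith(ctx, idx);
    const int     n_vocab = llama_n_vocab(llama_get_model(ctx));

    std::vector<llama_token_data> cur(n_vocab);
    for (llama_token token_id = 0; token_id < n_vocab; token_id++) {
        cur[token_id] = llama_token_data{ token_id, logits[token_id], 0.0f };
    }

    // sort by logit, highest first
    std::sort(cur.begin(), cur.end(),
              [](const llama_token_data & a, const llama_token_data & b) { return a.logit > b.logit; });

    // softmax, subtracting the max logit for numerical stability
    const float max_logit = cur[0].logit;
    float cum_sum = 0.0f;
    for (auto & td : cur) {
        td.p = expf(td.logit - max_logit);
        cum_sum += td.p;
    }
    for (auto & td : cur) {
        td.p /= cum_sum;
    }

    return cur;
}
```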
This function is extremely slow because it operates on the entire vocabulary. Having this enabled by default for n_probs > 0 can cause many headaches, so at the very least outputting logprobs has to be gated by a parameter, enabled only if the user really knows they want it. In the majority of cases, llama-server users want the final token probability for visualization in the UI, and this should not affect performance in a significant way.
Yeah, I think we can add a gated parameter --max-n-probs and default it to 1.
Hmm, sorry, I misunderstood it a bit. The option should be --probs-multiple-tokens or -pmt instead of a max n_probs as I said. Having a max n_probs means nothing, because the softmax is run on n_vocab tokens anyway.
I added --multi-token-probs
However, I would expect the probabilities to be effectively the same if only, say, the top 1000 tokens are considered in the calculation. In fact I believe the min-p sampler optimization added by @JohannesGaessler does something like this to determine if it is necessary to calculate a full softmax (correct me if I am wrong, I only looked at it briefly a while ago).
The min-p sampler optimization that I did exploits that a ratio of probabilities is equivalent to a difference in logits so min-p sampling can just be done relative to the maximum logit without having to explicitly calculate the softmax of all tokens. It should be fine to apply loose min-p sampling prior to calculating the token probabilities since the vast majority of tokens will effectively have 0 probability.
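Spelling out the identity being used here: for two tokens sharing the same softmax normalizer, the ratio of probabilities reduces to a difference of logits,

$$
\frac{p_i}{p_{\max}} = \frac{e^{\ell_i} / \sum_j e^{\ell_j}}{e^{\ell_{\max}} / \sum_j e^{\ell_j}} = e^{\ell_i - \ell_{\max}},
$$

so a min-p style cutoff $p_i \ge p_{\min} \cdot p_{\max}$ is equivalent to $\ell_i \ge \ell_{\max} + \ln p_{\min}$ and can be checked without computing the full softmax.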
Not sure if "gating" this functionality is a good idea. This is what the OpenAI API does. If llama.cpp claims OAI compatibility, it shouldn't come with fine print saying "...but only if you enable this ad-hoc CLI flag".
@ggerganov On second thought, I think most users know what they're doing (and the consequences) when they request n_probs for multiple tokens. I think we don't really need --multi-token-probs for now. Is it ok if I remove it?
Yes, --multi-token-probs is not necessary. What you described in your #10783 (comment) seems good. Ideally, we should add the min-p optimization, but we can do that in another PR if necessary.
Removed in 8734df7
Regarding logprobs before vs. after samplers: my opinion is that they should reflect the probabilities after samplers, since by applying samplers you are intentionally distorting the probability distribution that the model puts out. I don't see why the log likelihood of the original, undistorted distribution would be useful in that scenario. The only real use case I see is doing greedy sampling while still obtaining the probabilities of the original distribution, which should be possible by setting the temperature to a negative value. But I ultimately don't feel strongly about this.
What we are specifically talking about is the perplexity of a token sequence as it's being generated. The corresponding probability distribution is post-samplers. I don't see why the perplexity of such a sequence using the pre-samplers distribution would be a meaningful number.
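For reference, the quantity under discussion is

$$
\mathrm{PPL} = \exp\!\left(-\frac{1}{N} \sum_{i=1}^{N} \log p(x_i \mid x_{<i})\right),
$$

i.e. perplexity computed from the per-token probabilities of the generated sequence; the disagreement is only about whether $p(x_i \mid x_{<i})$ is taken before or after the sampler chain.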
Based on the feedback so far, I think we have to support both raw logprobs and post-sampling probs:
Does that sound OK?
I think that would be confusing; my intuitive expectation would be that a logprob is simply the log of the corresponding prob.
To make sure we are on the same page, do these definitions sound correct to you:
So 2. and 3. are essentially the same thing. The raw logits (1) would require too much data to be transported over the network and top logits don't have many applications, so I think these should not be exposed through the API in any way. The post-sampling probs (4) are what we have exposed up to now, and the suggestion is to keep this as an option. If I understand correctly, you are suggesting to call (2)
Sorry, that's what I meant, I agree that this is the correct relationship between probs and logprobs.
If we want to have an OAI-compatible API, we should return the same quantities under the same names. So for the OAI-compatible API, (3) is what should be returned. For our own API we are not bound to whatever choices OAI made. We could choose to define the fields differently. I do not mean to use both at the same time.
Yes, I agree with having (4) as an option.
Regardless of how one feels perplexity should be computed, perplexity isn't the only application of logprobs. For confidence estimation, we absolutely want the pre-sampler probabilities, because otherwise samplers become a self-fulfilling prophecy where the confidence goes up because the choices are narrowed down by the samplers. In the common case where only a single token remains, this would give 100% confidence, which is patently absurd because that "confidence" is manufactured by truncation.
Providing more information in extra fields sounds good, but I hope we can all agree that the fields that exist in the OAI API should behave exactly as they do there.
Hmm, ok, so to settle the discussion between post- and pre-sampling probs, I think we can support both, and the rule is:
- By default, the returned logprobs are pre-sampling (matching OAI behavior).
- If the request sets "post_sampling_probs": true, post-sampling probs are returned instead.
I assume you mean
@ggerganov Can you please re-review this PR? Thank you!
examples/server/server.cpp
cur_p->data[i].p
});
// break if we have all the necessary data
if (result.probs.size() == n_probs && found_sampled_tok) {
This seems incorrect - if the sampled token is from the tail of the distribution, we will never break here.
Hmm, yeah, it's quite messy to do everything in one loop. I think it's better to split this into 2 different loops, and then optimize the one that takes the prob of the sampled token later on (in another PR).
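A rough sketch of the two-loop version (the names n_probs, cur_p, result and the fields of the prob entries follow the surrounding diff and are assumptions; the final code in the PR may differ):

```cpp
// 1st loop: collect the top-n_probs candidates, assuming cur_p is sorted by
// probability (highest first)
for (size_t i = 0; i < cur_p->size && i < (size_t) n_probs; i++) {
    result.probs.push_back({
        cur_p->data[i].id,
        common_token_to_piece(ctx, cur_p->data[i].id, params_base.special),
        cur_p->data[i].p,
    });
}

// 2nd loop: find the probability of the token that was actually sampled, even
// if it lies outside the top-n_probs entries
// (this is the pass that can be optimized later, e.g. with a lookup by id)
for (size_t i = 0; i < cur_p->size; i++) {
    if (cur_p->data[i].id == result.tok) {
        result.prob = cur_p->data[i].p;
        break;
    }
}
```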
* server : fix logprobs, make it openai-compatible
* update docs
* add std::log
* return pre-sampling p
* sort before apply softmax
* add comment
* fix test
* set p for sampled token
* update docs
* add --multi-token-probs
* update docs
* add `post_sampling_probs` option
* update docs [no ci]
* remove --multi-token-probs
* "top_probs" with "post_sampling_probs"
* resolve review comments
* rename struct token_prob to prob_info
* correct comment placement
* fix setting prob for sampled token
Fix #10733
The /v1/chat/completions endpoint now has full support for logprobs. See the OAI docs here. Please note that these are pre-sampling probs (i.e. no sampling method is applied).

Example using the official openai library:
For the non-OAI-compatible endpoint /completion, the completion_probabilities field is also changed.

Request:
Response:
For post-sampling probs, add "post_sampling_probs": true to the request. Example response: