server : fix logprobs, make it OAI-compatible #10783
Conversation
result.tok = ids[i];
result.text_to_send = common_token_to_piece(ctx, result.tok, params_base.special);
result.prob = 1.0f; // set later
Here is the branch for speculative decoding. I'm not sure how I can get token probs here. Could you give me some clues? @ggerganov
(Or we can skip this for now if it's complicated)
Think we need to update common_sampler_sample_and_accept_n() to return the probs. But let's fix this later.
Based on the example response, it appears that this implementation returns probabilities (between 0 and 1) rather than log probabilities. Also, just to clarify, are these probabilities taken before or after the samplers are applied?
Ok, I didn't notice that, thanks for spotting it. It's a simple fix, just a matter of applying std::log to the values.
It's the prob after the samplers are applied.
That is wrong and needs to be changed. Post-sampler probs are essentially worthless. Imagine you are doing greedy sampling: the post-sampler probability of the chosen token is always 1. But with the current logic, that is exactly what would be returned. Samplers destroy information in order to make a choice. The whole point of requesting logprobs is to get that information. We already know which choice was made (the token in the result), so the distorted probability distribution left over by the sampler chain is useless. Pretty much every standard use case for logprobs (confidence estimation, beam search guidance) breaks down if the probs are post-sampling.
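To make the argument concrete with made-up numbers: suppose the model's pre-sampling distribution over the three most likely tokens is

$$
p(\text{the}) = 0.6, \qquad p(\text{a}) = 0.3, \qquad p(\text{an}) = 0.1 .
$$

Greedy sampling keeps only the chosen token and renormalizes, so the post-sampler distribution becomes $p(\text{the}) = 1$, and the reported post-sampler "probability" of the chosen token is 1.0, erasing the model's actual 0.6 confidence.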
Tbh I'm not even the one who wrote this part of the code originally, so I didn't notice that. Judging by the comments in vllm-project/vllm#9453, this is supposed to be pre-sampling, as you said. It's just a few lines of code to modify, not a big deal.
Done, it's now returning the pre-sampling probs.
Great stuff! This will make the server much more useful for research-type work. I have tested many inference engines, and all of them have at least one bug related to obtaining logprobs, which led me to hack together my own server to work with them. Glad I'll finally be able to ditch that!
@@ -664,3 +694,33 @@ static json format_logit_bias(const std::vector<llama_logit_bias> & logit_bias)
static std::string safe_json_to_str(json data) {
    return data.dump(-1, ' ', false, json::error_handler_t::replace);
}

static std::vector<llama_token_data> get_token_probabilities(llama_context * ctx, int idx) {
NOTE: because llama_sampler_init_softmax is deprecated, I need to copy its code here.
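For reference, a minimal sketch of what such a helper can look like when the essence of the deprecated softmax sampler is inlined (it assumes the usual llama.h, <algorithm> and <cmath> includes of the server code; the exact code in the PR may differ):

```cpp
// Sketch: build a llama_token_data entry for every token in the vocabulary,
// sort by logit in descending order, and apply a softmax so that .p holds
// the pre-sampling probability of each token at output position idx.
static std::vector<llama_token_data> get_token_probabilities(llama_context * ctx, int idx) {
    const float * logits  = llama_get_logits_ith(ctx, idx);
    const int     n_vocab = llama_n_vocab(llama_get_model(ctx));

    std::vector<llama_token_data> cur(n_vocab);
    for (llama_token token_id = 0; token_id < n_vocab; token_id++) {
        cur[token_id] = llama_token_data{ token_id, logits[token_id], 0.0f };
    }

    // sort by logit, highest first
    std::sort(cur.begin(), cur.end(),
              [](const llama_token_data & a, const llama_token_data & b) { return a.logit > b.logit; });

    // softmax, subtracting the max logit for numerical stability
    const float max_logit = cur[0].logit;
    float cum_sum = 0.0f;
    for (auto & td : cur) {
        td.p = expf(td.logit - max_logit);
        cum_sum += td.p;
    }
    for (auto & td : cur) {
        td.p /= cum_sum;
    }

    return cur;
}
```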
This function is extremely slow because it operates on the entire vocabulary. Having this enabled by default for n_probs > 0 can cause many headaches, so at the very least outputting logprobs has to be gated by a parameter, enabled only if the user really knows they want it. In the majority of cases, llama-server users want the final token probability for visualization in the UI, and this should not affect performance in a significant way.
Yeah, I think we can add a gated parameter --max-n-probs and default it to 1.
Hmm, sorry, I misunderstood it a bit. The option should be --probs-multiple-tokens or -pmt instead of a max n_probs as I said. Having a max n_probs means nothing, because the softmax is run on n_vocab tokens anyway.
I added --multi-token-probs
However, I would expect the probabilities to be effectively the same if only, say, the top 1000 tokens are considered in the calculation. In fact I believe the min-p sampler optimization added by @JohannesGaessler does something like this to determine if it is necessary to calculate a full softmax (correct me if I am wrong, I only looked at it briefly a while ago).
The min-p sampler optimization that I did exploits that a ratio of probabilities is equivalent to a difference in logits so min-p sampling can just be done relative to the maximum logit without having to explicitly calculate the softmax of all tokens. It should be fine to apply loose min-p sampling prior to calculating the token probabilities since the vast majority of tokens will effectively have 0 probability.
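Spelling out the identity being used here: for two tokens sharing the same softmax normalizer, the ratio of probabilities reduces to a difference of logits,

$$
\frac{p_i}{p_{\max}} = \frac{e^{\ell_i} / \sum_j e^{\ell_j}}{e^{\ell_{\max}} / \sum_j e^{\ell_j}} = e^{\ell_i - \ell_{\max}},
$$

so a min-p style cutoff $p_i \ge p_{\min} \cdot p_{\max}$ is equivalent to $\ell_i \ge \ell_{\max} + \ln p_{\min}$ and can be checked without computing the full softmax.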
Not sure if "gating" this functionality is a good idea. This is what the OpenAI API does. If llama.cpp claims OAI compatibility, it shouldn't come with fine print saying "...but only if you enable this ad-hoc CLI flag".
@ggerganov On second thought, I think most users know what they're doing (and the consequences) when they request n_probs for multiple tokens. I think we don't really need --multi-token-probs for now. Is it ok if I remove it?
Yes, --multi-token-probs is not necessary. What you described in your #10783 (comment) seems good. Ideally, we should add the min-p optimization, but we can do that in another PR if necessary.
Removed in 8734df7
Regarding logprobs before vs. after samplers: my opinion is that they should reflect the probabilities after samplers, since by applying samplers you are intentionally distorting the probability distribution that the model puts out. I don't see why the log likelihood of the original, undistorted distribution would be useful in that scenario. The only real use case I see is doing greedy sampling while still obtaining the probabilities of the original distribution, which should be possible by setting the temperature to a negative value. But I ultimately don't feel strongly about this.
What we are specifically talking about is the perplexity of a token sequence as it's being generated. The corresponding probability distribution is post-samplers. I don't see why the perplexity of such a sequence using the pre-samplers distribution would be a meaningful number.
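For reference, the quantity under discussion is

$$
\mathrm{PPL} = \exp\!\left(-\frac{1}{N} \sum_{i=1}^{N} \log p(x_i \mid x_{<i})\right),
$$

i.e. perplexity computed from the per-token probabilities of the generated sequence; the disagreement is only about whether $p(x_i \mid x_{<i})$ is taken before or after the sampler chain.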
Based on the feedback so far, I think we have to support both raw logprobs and post-sampling probs:
Does that sound OK?
I think that would be confusing; my intuitive expectation would be that a logprob is simply the log of the corresponding prob.
To make sure we are on the same page, do these definitions sound correct to you:
So 2. and 3. are essentially the same thing. The raw logits (1) would require too much data to be transported over the network and top logits don't have many applications, so I think these should not be exposed through the API in any way. The post-sampling probs (4) are what we have exposed up to now, and the suggestion is to keep this as an option. If I understand correctly, you are suggesting to call (2)
Sorry, that's what I meant, I agree that this is the correct relationship between probs and logprobs.
If we want to have an OAI-compatible API, we should return the same quantities under the same names. So for the OAI-compatible API, (3) is what should be returned. For our own API we are not bound to whatever choices OAI made. We could choose to define the fields differently. I do not mean to use both at the same time.
Yes, I agree with having (4) as an option.
Regardless of how one feels perplexity should be computed, perplexity isn't the only application of logprobs. For confidence estimation, we absolutely want the pre-sampler probabilities, because otherwise samplers become a self-fulfilling prophecy where the confidence goes up because the choices are narrowed down by the samplers. In the common case where only a single token remains, this would give 100% confidence, which is patently absurd because that "confidence" is manufactured by truncation.
Providing more information in extra fields sounds good, but I hope we can all agree that the fields that exist in the OAI API should behave exactly as they do there.
Hmm, ok, so to settle the discussion between post- and pre-sampling probs, I think we can support both, and the rule is:
- By default, the returned logprobs are pre-sampling (matching OAI behavior).
- If the request sets "post_sampling_probs": true, post-sampling probs are returned instead.
I assume you mean
@ggerganov Can you please re-review this PR? Thank you!
examples/server/server.cpp
cur_p->data[i].p
});
// break if we have all the necessary data
if (result.probs.size() == n_probs && found_sampled_tok) {
This seems incorrect - if the sampled token is from the tail of the distribution, we will never break here.
Hmm, yeah, it's quite messy to do everything in one loop. I think it's better to split this into 2 different loops, and then optimize the one that takes the prob of the sampled token later on (in another PR).
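A rough sketch of the two-loop version (the names n_probs, cur_p, result and the fields of the prob entries follow the surrounding diff and are assumptions; the final code in the PR may differ):

```cpp
// 1st loop: collect the top-n_probs candidates, assuming cur_p is sorted by
// probability (highest first)
for (size_t i = 0; i < cur_p->size && i < (size_t) n_probs; i++) {
    result.probs.push_back({
        cur_p->data[i].id,
        common_token_to_piece(ctx, cur_p->data[i].id, params_base.special),
        cur_p->data[i].p,
    });
}

// 2nd loop: find the probability of the token that was actually sampled, even
// if it lies outside the top-n_probs entries
// (this is the pass that can be optimized later, e.g. with a lookup by id)
for (size_t i = 0; i < cur_p->size; i++) {
    if (cur_p->data[i].id == result.tok) {
        result.prob = cur_p->data[i].p;
        break;
    }
}
```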
* server : fix logprobs, make it openai-compatible
* update docs
* add std::log
* return pre-sampling p
* sort before apply softmax
* add comment
* fix test
* set p for sampled token
* update docs
* add --multi-token-probs
* update docs
* add `post_sampling_probs` option
* update docs [no ci]
* remove --multi-token-probs
* "top_probs" with "post_sampling_probs"
* resolve review comments
* rename struct token_prob to prob_info
* correct comment placement
* fix setting prob for sampled token
Fix #10733
The /v1/chat/completions endpoint now has full support for logprobs. See the OAI docs here. Please note that these are pre-sampling probs (i.e. no sampling method is applied).

Example using the official openai library:
For the non-OAI-compatible endpoint /completion, the completion_probabilities field is also changed.

Request:
Response:
For post-sampling probs, add "post_sampling_probs": true to the request. Example response: