
server : output embeddings for all tokens when pooling = none #10861

Merged
merged 12 commits into master from gg/server-embeddings-all on Dec 18, 2024

Conversation

@ggerganov ggerganov commented Dec 17, 2024

When using an embedding model with pooling type none, the /embeddings endpoint will now output a vector of unnormalized embedding vectors - one for each input token. The /v1/embeddings OAI-compatible endpoint does not support this and will return an error when trying to use a model with pooling type none.

API Changes

  • The /embeddings endpoint now returns a vector of embedding vectors. See the server readme for more information.

TODO: implement "encoding_format": "base64" to reduce the output size of this endpoint.

@ggerganov ggerganov requested a review from ngxson as a code owner December 17, 2024 09:10
@github-actions github-actions bot added examples python python script changes server labels Dec 17, 2024

virtual int get_index() override {
    return index;
}

virtual json to_json() override {
    if (embedding.size() == 1){
Collaborator

Suggested change
if (embedding.size() == 1){
if (embedding.size() == 1) {

Not sure if I understand correctly, but the case where the embd vector has one element is for reranking, right?

Owner Author

embedding is now a vector of vectors. Here we check whether there is only one vector in the results, not whether the embedding is one element long. So far, all paths returned a single embedding vector. With this change, if --pooling none is used and there is more than one token in the input batch, the result will be a vector of embedding vectors.

For reranking, we output a single float which is the score. It is enabled only for --pooling rank. It should not be affected by this change.
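The distinction can be illustrated with a small sketch (the values are invented, not actual server output):

```python
# Invented values illustrating the result shapes described above.

# --pooling none with 3 input tokens: `embedding` holds one vector per token
res_none = {"index": 0, "embedding": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]}

# pooled (e.g. --pooling mean): still a vector of vectors, with a single entry
res_pooled = {"index": 0, "embedding": [[0.7, 0.8]]}

# --pooling rank: a single float score; not affected by this change
res_rank = {"index": 0, "score": 3.25}

# the `embedding.size() == 1` check asks "is there exactly one vector?",
# not "is the vector one element long?"
assert len(res_pooled["embedding"]) == 1
assert len(res_none["embedding"]) == 3
```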

@@ -45,6 +45,18 @@ def test_embedding_multiple():
assert len(d['embedding']) > 1


def test_embedding_pooling_none():
server = ServerPreset.bert_bge_small(pooling = 'none')
Collaborator

You can also do:

global server
server.pooling = 'none'
server.start()

@slaren (Collaborator) commented Dec 17, 2024

Just pointing the obvious here, but using the server to obtain the embeddings of every token for something like TTS is extremely inefficient and should not be done if there is a better way.

@ggerganov (Owner Author)

Yes, the overhead from the JSON protocol is quite substantial for this use case. Adding support for base64 encoding of the embeddings would improve this to some extent.
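A minimal sketch of the base64 idea, assuming the same convention as the OAI encoding_format option (raw little-endian F32 bytes, base64-encoded); the values are invented:

```python
import base64
import json
import struct

# Invented 4-element embedding; real vectors have n_embd floats per token.
embedding = [0.125, -0.5, 3.0, 0.0625]

# Plain JSON form: each float serialized as decimal text.
as_json = json.dumps(embedding)

# base64 form: pack as raw little-endian F32 (4 bytes/float), then base64.
raw = struct.pack(f"<{len(embedding)}f", *embedding)
as_b64 = base64.b64encode(raw).decode("ascii")

# The binary form is a fixed 4 bytes per float (~5.3 base64 chars each),
# regardless of how many digits the decimal representation needs.
decoded = list(struct.unpack(f"<{len(embedding)}f", base64.b64decode(as_b64)))
assert decoded == embedding  # these values are exactly representable in F32
```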

@ngxson (Collaborator) commented Dec 17, 2024

Btw, if you plan to implement this functionality for TTS, I would suggest another method: we can make an API endpoint that accepts input text and outputs a stream of WAV or Ogg audio.

Not sure why this option is not available in the OAI API

The reason it's not there is that OAI itself (the official closed-source API) doesn't have this capability, mostly due to the network bandwidth overhead. Most TTS APIs out there just take text as input and output an audio stream anyway, so I think this will pretty much be the norm.

@slaren (Collaborator) commented Dec 17, 2024

When using an embedding model and the pooling is set to none, the server will now output a vector of embedding vectors - one for each input token. This does not seem to be supported by the OpenAI API, so to remain compatible, the result is squeezed when there is only one embedding vector in the result.

We should also consider that this will not make the API OpenAI-compatible; it will just allow applications that are (unwittingly) using pooling type none to not crash outright, while still producing useless results. So I am not sure that this is a good solution. Other options would be:

  • Just return a vector of vectors, at least the application will know that it is doing something wrong
  • Return an error from the OAI endpoint when using pooling type none, add a new endpoint for this purpose

@ggerganov (Owner Author)

Will change the /embeddings endpoint to return a vector of vectors, and the /v1/embeddings endpoint to return a single vector and to throw an error with --pooling none.

Btw if you plan to implement this functionality for TTS, I would suggest another method: we can make an API endpoint that accepts input text and output a stream of wav of ogg audio.

We can do that as well, as long as the endpoint can send raw data that is not JSON. Otherwise there is no significant benefit compared to streaming the embeddings, since the raw spectrograms (i.e. the embeddings) are just 4 times larger than the raw audio (x2 from F32 -> I16 and x2 from the IRFFT).

A potential problem with this approach might be that the logic for converting the spectrogram to audio would have to be on the server-side and I don't know yet how many implementations of that will be in the future. I'm currently playing with OuteTTS, but other models do other post-processing, so these would have to be implemented, maybe in libcommon? Returning the raw embeddings seems to have an advantage in this case since the client can do any kind of post-processing and chunking depending on the needs.
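The x4 estimate above is just arithmetic on the two stated factors; restated as a sanity check (no measured data, only the factors from the comment):

```python
# Restating the size estimate from the comment above.
f32_to_i16 = 2   # F32 spectrogram values (4 B) vs I16 audio samples (2 B)
irfft      = 2   # the IRFFT roughly doubles the data on the way to audio

spectrogram_vs_audio = f32_to_i16 * irfft
assert spectrogram_vs_audio == 4  # embeddings ~4x the size of the raw audio
```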

@ngxson (Collaborator) commented Dec 17, 2024

As long as the endpoint can send raw data that is not JSON.

Hmm, yeah, I think we can make a simple POC API that returns the whole audio file (either WAV, MP3, or Ogg). Given that in the other PR you are already able to export the audio file, I think it should be simple to bring this into the server (?)

When sending the response from the server, we just need to set the MIME type to audio/...

I'm currently playing with OuteTTS, but other models do other post-processing, so these would have to be implemented

Yes, I think it makes sense to put it into common for now. If we want to be a bit more future-proof (in terms of development), we could even make an examples/post-processing and link llama-server against it.

@ggerganov (Owner Author)

The /v1/embeddings endpoint is now compatible with the OAI spec and returns an error when attempting to use --pooling none. The /embeddings endpoint always returns a vector of vectors and supports --pooling none. Let me know what you think about this version.
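A hypothetical restatement of these dispatch rules (this is not the server's actual code; the function and error names are invented):

```python
# Sketch of the two endpoints' behavior as described above.
def handle_embeddings(path: str, pooling: str, vectors: list):
    if path == "/v1/embeddings":
        # OAI-compatible endpoint: exactly one pooled vector per prompt
        if pooling == "none":
            raise ValueError("pooling type 'none' is not supported by /v1/embeddings")
        return vectors[0]
    # /embeddings always returns a vector of vectors
    return vectors

tok_vecs = [[0.1, 0.2], [0.3, 0.4]]  # one vector per token (invented values)
assert handle_embeddings("/embeddings", "none", tok_vecs) == tok_vecs
assert handle_embeddings("/v1/embeddings", "mean", [[0.5, 0.6]]) == [0.5, 0.6]
```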

server.start()
client = OpenAI(api_key="dummy", base_url=f"http://{server.server_host}:{server.server_port}")
client = OpenAI(api_key="dummy", base_url=f"http://{server.server_host}:{server.server_port}/v1")
Owner Author

I had to add the /v1 suffix for these calls, otherwise the client will hit the /embeddings endpoint which is not OAI compatible. Is this the correct fix?

@ngxson ngxson (Collaborator) commented Dec 17, 2024

Yes, that's correct; I hadn't noticed that before. The original OAI code has /v1 in its default value: https://github.com/openai/openai-python/blob/e94d98e9bf97a5d2d02d79d58f2abdbab26ff2bd/src/openai/_client.py#L117

We should change all other occurrences too (in another PR)

Collaborator

btw, for the embeddings endpoint we also have the switch between non-OAI and OAI based on the presence of the content and input fields in the request. Is it still relevant now?

Owner Author

I've modified this so that both endpoints can be used with both inputs. And I'm also thinking of making content an alias for input - maybe we can do this directly in #10866

@ggerganov ggerganov mentioned this pull request Dec 17, 2024
9 tasks
Base automatically changed from gg/server-content-tokens to master December 18, 2024 09:05
@ggerganov ggerganov force-pushed the gg/server-embeddings-all branch from 2230786 to 2a5510e Compare December 18, 2024 09:34
@ggerganov (Owner Author)

To make the /v1/embeddings endpoint strongly OAI-compatible, we should return an error when "content" is provided instead of "input":

// for the shape of input/content, see tokenize_input_prompts()
json prompt;
if (body.contains("input")) {
    prompt = body.at("input");
} else if (body.contains("content")) {
    oaicompat = false;
    prompt = body.at("content");
} else {
    res_error(res, format_error_response("\"input\" or \"content\" must be provided", ERROR_TYPE_INVALID_REQUEST));
    return;
}

For the /embeddings endpoint, we should allow both fields.

@ngxson agree?
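A Python restatement of the proposed selection logic; select_prompt and the oai_strict flag are illustrative names, not the server's actual symbols:

```python
# Mirror of the C++ snippet above; oai_strict mimics the proposed
# strongly OAI-compatible behavior for /v1/embeddings.
def select_prompt(body: dict, oai_strict: bool):
    if "input" in body:
        return body["input"], True           # oaicompat stays true
    if "content" in body:
        if oai_strict:
            # strongly OAI-compatible endpoint would reject "content"
            raise ValueError('"input" must be provided')
        return body["content"], False        # oaicompat = false
    raise ValueError('"input" or "content" must be provided')

assert select_prompt({"input": "hi"}, oai_strict=True) == ("hi", True)
assert select_prompt({"content": "hi"}, oai_strict=False) == ("hi", False)
```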

@ggerganov (Owner Author)

This should be ready to merge. We can do the strong OAI-compat of /v1/embeddings later.

@ggerganov ggerganov requested a review from ngxson December 18, 2024 10:00
@ngxson ngxson (Collaborator) left a comment

LGTM. We should tag this as a breaking change too, in case some users are using /embeddings from OAI-compatible clients.

@ngxson ngxson added the breaking change Changes that break ABIs, APIs, file formats, or other forms of backwards compatibility. label Dec 18, 2024
@ggerganov ggerganov merged commit 152610e into master Dec 18, 2024
49 checks passed
@ggerganov ggerganov deleted the gg/server-embeddings-all branch December 18, 2024 11:01
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Dec 20, 2024
…nov#10861)

* server : add "tokens" output

ggml-ci

* server : output embeddings for all tokens when pooling = none

ggml-ci

* server : update readme [no ci]

* server : fix spacing [no ci]

Co-authored-by: Xuan Son Nguyen <[email protected]>

* server : be explicit about the pooling type in the tests

ggml-ci

* server : update /embeddings and /v1/embeddings endpoints

ggml-ci

* server : do not normalize embeddings when there is no pooling

ggml-ci

* server : update readme

ggml-ci

* server : fixes

* tests : update server tests

ggml-ci

* server : update readme [no ci]

* server : remove rebase artifact

---------

Co-authored-by: Xuan Son Nguyen <[email protected]>