
server : output embeddings for all tokens when pooling = none #10861

Merged
merged 12 commits into master from gg/server-embeddings-all on Dec 18, 2024

Conversation

@ggerganov ggerganov commented Dec 17, 2024

When using an embedding model with pooling type none, the /embeddings endpoint will now output a vector of unnormalized embedding vectors - one for each input token. The /v1/embeddings OAI-compatible endpoint does not support this and will return an error when trying to use a model with pooling type none.

API Changes

  • The /embeddings endpoint now returns a vector of embedding vectors. See the server readme for more information.

TODO: implement "encoding_format": "base64" to reduce the output size of this endpoint.

@ggerganov ggerganov requested a review from ngxson as a code owner December 17, 2024 09:10
@github-actions github-actions bot added examples python python script changes server labels Dec 17, 2024

virtual int get_index() override {
    return index;
}

virtual json to_json() override {
    if (embedding.size() == 1){
Collaborator

Suggested change
if (embedding.size() == 1){
if (embedding.size() == 1) {

Not sure if I understand correctly, but the case where the embd vector has one element is for reranking, right?

Owner Author

embedding is now a vector of vectors. Here we check whether there is only one vector in the results, not whether the embedding is one element long. So far, all paths returned a single embedding vector. With this change, if --pooling none is used and there is more than one token in the input batch, the result will be a vector of embedding vectors.

For reranking, we output a single float which is the score. It is enabled only for --pooling rank. It should not be affected by this change.
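The distinction can be illustrated with a small sketch (the values are invented, not actual server output):

```python
# Invented values illustrating the result shapes described above.

# --pooling none with 3 input tokens: `embedding` holds one vector per token
res_none = {"index": 0, "embedding": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]}

# pooled (e.g. --pooling mean): still a vector of vectors, with a single entry
res_pooled = {"index": 0, "embedding": [[0.7, 0.8]]}

# --pooling rank: a single float score; not affected by this change
res_rank = {"index": 0, "score": 3.25}

# the `embedding.size() == 1` check asks "is there exactly one vector?",
# not "is the vector one element long?"
assert len(res_pooled["embedding"]) == 1
assert len(res_none["embedding"]) == 3
```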

@@ -45,6 +45,18 @@ def test_embedding_multiple():
assert len(d['embedding']) > 1


def test_embedding_pooling_none():
server = ServerPreset.bert_bge_small(pooling = 'none')
Collaborator

You can also do:

global server
server.pooling = 'none'
server.start()

@slaren (Collaborator) commented Dec 17, 2024

Just pointing the obvious here, but using the server to obtain the embeddings of every token for something like TTS is extremely inefficient and should not be done if there is a better way.

@ggerganov (Owner Author)

Yes, the overhead from the JSON protocol is quite substantial for this use case. Adding support for base64 encoding of the embeddings would improve this to some extent.
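A minimal sketch of the base64 idea, assuming the same convention as the OAI encoding_format option (raw little-endian F32 bytes, base64-encoded); the values are invented:

```python
import base64
import json
import struct

# Invented 4-element embedding; real vectors have n_embd floats per token.
embedding = [0.125, -0.5, 3.0, 0.0625]

# Plain JSON form: each float serialized as decimal text.
as_json = json.dumps(embedding)

# base64 form: pack as raw little-endian F32 (4 bytes/float), then base64.
raw = struct.pack(f"<{len(embedding)}f", *embedding)
as_b64 = base64.b64encode(raw).decode("ascii")

# The binary form is a fixed 4 bytes per float (~5.3 base64 chars each),
# regardless of how many digits the decimal representation needs.
decoded = list(struct.unpack(f"<{len(embedding)}f", base64.b64decode(as_b64)))
assert decoded == embedding  # these values are exactly representable in F32
```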

@ngxson (Collaborator) commented Dec 17, 2024

Btw, if you plan to implement this functionality for TTS, I would suggest another method: we can make an API endpoint that accepts input text and outputs a stream of WAV or Ogg audio.

Not sure why this option is not available in the OAI API

The reason it's not there is that OAI itself (the official closed-source API) doesn't have this capability, mostly due to the network bandwidth overhead. Most TTS APIs out there just take text as input and output an audio stream anyway, so I think this will pretty much be the norm.

@slaren (Collaborator) commented Dec 17, 2024

When using an embedding model and the pooling is set to none, the server will now output a vector of embedding vectors - one for each input token. This does not seem to be supported by the OpenAI API, so to remain compatible, the result is squeezed when there is only one embedding vector in the result.

We should also consider that this will not make the API OpenAI-compatible; it will just allow applications that are (unwittingly) using pooling type none to not crash outright, while still producing useless results. So I am not sure that this is a good solution. Other options would be:

  • Just return a vector of vectors, at least the application will know that it is doing something wrong
  • Return an error from the OAI endpoint when using pooling type none, add a new endpoint for this purpose

@ggerganov (Owner Author)

Will change the /embeddings endpoint to return a vector of vectors, and the /v1/embeddings endpoint to return a single vector and to throw an error with --pooling none.

Btw if you plan to implement this functionality for TTS, I would suggest another method: we can make an API endpoint that accepts input text and output a stream of wav of ogg audio.

We can do that as well, as long as the endpoint can send raw data that is not JSON. Otherwise there is no significant benefit compared to streaming the embeddings, since the raw spectrograms (i.e. the embeddings) are just 4 times larger than the raw audio (x2 from F32 -> I16 and x2 from the IRFFT).

A potential problem with this approach might be that the logic for converting the spectrogram to audio would have to be on the server-side and I don't know yet how many implementations of that will be in the future. I'm currently playing with OuteTTS, but other models do other post-processing, so these would have to be implemented, maybe in libcommon? Returning the raw embeddings seems to have an advantage in this case since the client can do any kind of post-processing and chunking depending on the needs.
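The x4 estimate above is just arithmetic on the two stated factors; restated as a sanity check (no measured data, only the factors from the comment):

```python
# Restating the size estimate from the comment above.
f32_to_i16 = 2   # F32 spectrogram values (4 B) vs I16 audio samples (2 B)
irfft      = 2   # the IRFFT roughly doubles the data on the way to audio

spectrogram_vs_audio = f32_to_i16 * irfft
assert spectrogram_vs_audio == 4  # embeddings ~4x the size of the raw audio
```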

@ngxson (Collaborator) commented Dec 17, 2024

As long as the endpoint can send raw data that is not JSON.

Hmm, yeah, I think we can make a simple POC API that returns the whole audio file (either WAV, MP3, or Ogg). Given that in the other PR you are already able to export the audio file, I think it should be simple to bring this into the server (?)

When sending the response from the server, we just need to set the MIME type to audio/...

I'm currently playing with OuteTTS, but other models do other post-processing, so these would have to be implemented

Yes, I think it makes sense to put it into common for now. If we want to be a bit more future-proof (in terms of development), we could even make an examples/post-processing and link llama-server against it.

@ggerganov (Owner Author)

The /v1/embeddings endpoint is now compatible with the OAI spec and returns an error when attempting to use --pooling none. The /embeddings endpoint always returns a vector of vectors and supports --pooling none. Let me know what you think about this version.
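A hypothetical restatement of these dispatch rules (this is not the server's actual code; the function and error names are invented):

```python
# Sketch of the two endpoints' behavior as described above.
def handle_embeddings(path: str, pooling: str, vectors: list):
    if path == "/v1/embeddings":
        # OAI-compatible endpoint: exactly one pooled vector per prompt
        if pooling == "none":
            raise ValueError("pooling type 'none' is not supported by /v1/embeddings")
        return vectors[0]
    # /embeddings always returns a vector of vectors
    return vectors

tok_vecs = [[0.1, 0.2], [0.3, 0.4]]  # one vector per token (invented values)
assert handle_embeddings("/embeddings", "none", tok_vecs) == tok_vecs
assert handle_embeddings("/v1/embeddings", "mean", [[0.5, 0.6]]) == [0.5, 0.6]
```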

server.start()
client = OpenAI(api_key="dummy", base_url=f"http://{server.server_host}:{server.server_port}")
client = OpenAI(api_key="dummy", base_url=f"http://{server.server_host}:{server.server_port}/v1")
Owner Author

I had to add the /v1 suffix for these calls, otherwise the client will hit the /embeddings endpoint which is not OAI compatible. Is this the correct fix?

@ngxson ngxson (Collaborator) commented Dec 17, 2024

Yes, that's correct; I hadn't noticed that before. The original OAI code has /v1 in its default value: https://github.com/openai/openai-python/blob/e94d98e9bf97a5d2d02d79d58f2abdbab26ff2bd/src/openai/_client.py#L117

We should change all other occurrences too (in another PR)

Collaborator

btw, for the embeddings endpoint we also have the switch between non-OAI and OAI based on the presence of the content and input fields in the request. Is it still relevant now?

Owner Author

I've modified this so that both endpoints can be used with both inputs. And I'm also thinking of making content an alias for input - maybe we can do this directly in #10866

@ggerganov ggerganov mentioned this pull request Dec 17, 2024
9 tasks
Base automatically changed from gg/server-content-tokens to master December 18, 2024 09:05
@ggerganov ggerganov force-pushed the gg/server-embeddings-all branch from 2230786 to 2a5510e Compare December 18, 2024 09:34
@ggerganov (Owner Author)

To make the /v1/embeddings endpoint strongly OAI-compatible, we should return an error when "content" is provided instead of "input":

// for the shape of input/content, see tokenize_input_prompts()
json prompt;
if (body.contains("input")) {
    prompt = body.at("input");
} else if (body.contains("content")) {
    oaicompat = false;
    prompt = body.at("content");
} else {
    res_error(res, format_error_response("\"input\" or \"content\" must be provided", ERROR_TYPE_INVALID_REQUEST));
    return;
}

For the /embeddings endpoint, we should allow both fields.

@ngxson agree?
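A Python restatement of the proposed selection logic; select_prompt and the oai_strict flag are illustrative names, not the server's actual symbols:

```python
# Mirror of the C++ snippet above; oai_strict mimics the proposed
# strongly OAI-compatible behavior for /v1/embeddings.
def select_prompt(body: dict, oai_strict: bool):
    if "input" in body:
        return body["input"], True           # oaicompat stays true
    if "content" in body:
        if oai_strict:
            # strongly OAI-compatible endpoint would reject "content"
            raise ValueError('"input" must be provided')
        return body["content"], False        # oaicompat = false
    raise ValueError('"input" or "content" must be provided')

assert select_prompt({"input": "hi"}, oai_strict=True) == ("hi", True)
assert select_prompt({"content": "hi"}, oai_strict=False) == ("hi", False)
```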

@ggerganov (Owner Author)

This should be ready to merge. We can do the strong OAI-compat of /v1/embeddings later.

@ggerganov ggerganov requested a review from ngxson December 18, 2024 10:00
@ngxson ngxson (Collaborator) left a comment

LGTM. We should tag this as a breaking change too, in case some users are using /embeddings from OAI-compatible clients.

@ngxson ngxson added the breaking change Changes that break ABIs, APIs, file formats, or other forms of backwards compatibility. label Dec 18, 2024
@ggerganov ggerganov merged commit 152610e into master Dec 18, 2024
49 checks passed
@ggerganov ggerganov deleted the gg/server-embeddings-all branch December 18, 2024 11:01
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Dec 20, 2024
…nov#10861)

* server : add "tokens" output

ggml-ci

* server : output embeddings for all tokens when pooling = none

ggml-ci

* server : update readme [no ci]

* server : fix spacing [no ci]

Co-authored-by: Xuan Son Nguyen <[email protected]>

* server : be explicit about the pooling type in the tests

ggml-ci

* server : update /embeddings and /v1/embeddings endpoints

ggml-ci

* server : do not normalize embeddings when there is no pooling

ggml-ci

* server : update readme

ggml-ci

* server : fixes

* tests : update server tests

ggml-ci

* server : update readme [no ci]

* server : remove rebase artifact

---------

Co-authored-by: Xuan Son Nguyen <[email protected]>