
Roberta embeddings fixes #10856

Merged: 5 commits merged into ggerganov:master on Dec 19, 2024

Conversation

@Ssukriti (Contributor) commented on Dec 16, 2024:

Since the last PR, we identified that the embedding values being produced were not of good quality. On further debugging, we found that the tokenizer should be set to gpt-2 and that the position embeddings needed some modifications.
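For background, a minimal sketch of the position-id convention involved (not code from this PR): HF's Roberta offsets position ids by pad_token_id + 1, which is why a 512-token Roberta ships a 514-row position-embedding table, and why -c 514 appears in the command below.

import torch

# Sketch of HF Roberta's position-id convention: positions start at
# pad_token_id + 1, so a model with 512 usable positions carries a
# 514-row position-embedding table.
pad_token_id = 1  # Roberta's <pad> id in HF checkpoints
seq_len = 6
position_ids = torch.arange(pad_token_id + 1, seq_len + pad_token_id + 1)
print(position_ids)  # tensor([2, 3, 4, 5, 6, 7])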

This PR makes the changes needed to get correct embeddings from Roberta models.

This PR has been tested by comparing embedding values against the sentence-transformers library for the Roberta architecture:

from sentence_transformers import SentenceTransformer

# path: the Roberta checkpoint directory; prompt: the text to embed
embeddings_model = SentenceTransformer(path)
embedding_vector = embeddings_model.encode([prompt])[0]

now matches the output of

python3 convert_hf_to_gguf.py model_path --outfile model_path.gguf
llama-embedding -m model_path.gguf -p [prompt] -c 514
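As a quick numerical check, something like the following sketch can compare the two (not part of the PR; it assumes llama-embedding's stdout contains the embedding as the only decimal floats, so the parsing may need adjusting for your build):

import re
import subprocess

import numpy as np
from sentence_transformers import SentenceTransformer

model_path = "model_path"  # hypothetical: the HF checkpoint used above
prompt = "example prompt"

# Reference embedding from sentence-transformers
st_vec = SentenceTransformer(model_path).encode([prompt])[0]

# Embedding from llama.cpp; parsing assumes the vector is printed to
# stdout as decimal floats (adjust the regex for your build's output)
out = subprocess.run(
    ["llama-embedding", "-m", f"{model_path}.gguf", "-p", prompt, "-c", "514"],
    capture_output=True, text=True, check=True,
).stdout
gguf_vec = np.array([float(x) for x in re.findall(r"-?\d+\.\d+(?:[eE][+-]?\d+)?", out)])

# Cosine similarity is insensitive to any normalization difference
# between the two pipelines; a value close to 1.0 means they match
cos = float(st_vec @ gguf_vec) / (np.linalg.norm(st_vec) * np.linalg.norm(gguf_vec))
print(f"cosine similarity: {cos:.6f}")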

Hence, the embeddings are now the correct values for Roberta models.

We will add documentation examples in a follow-up PR.

gabe-l-hart and others added 2 commits December 13, 2024 16:41
Branch: RobertaTokenizer

Signed-off-by: Gabe Goodhart <[email protected]>
Signed-off-by: Sukriti-Sharma4 <[email protected]>
The github-actions bot added the python (python script changes) label on Dec 16, 2024.
# adds the cls/sep tokens as bos/eos. This is handled as a
# post-processor in tokenizers, so the chkhsh is different, but
# it still maps to gpt-2 internally.
res = "gpt-2"
@Ssukriti (Contributor, Author) commented:

As per the guidelines, we shouldn't be modifying this value in convert_hf_to_gguf.py, so that it can be autogenerated from convert_hf_to_gguf_update.py.

We want it to map to the gpt-2 tokenization type.

Would the correct way to do this be to keep it as roberta-bpe as generated, and then add it to the mapping here (https://github.com/ggerganov/llama.cpp/blob/master/src/llama.cpp#L6483) so that it maps to gpt-2?

Any input would be appreciated.
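For concreteness, the autogenerated entry would look roughly like the sketch below (the hash shown is a placeholder, not the real chkhsh); llama.cpp would then map the "roberta-bpe" string onto its gpt-2 pre-tokenizer handling.

# Sketch only: the real entry is autogenerated by
# convert_hf_to_gguf_update.py inside get_vocab_base_pre(), and the
# hash below is a placeholder, not the actual chkhsh value.
def get_vocab_base_pre_sketch(chkhsh: str) -> str:
    if chkhsh == "<placeholder-roberta-chkhsh>":
        # ref: the Roberta checkpoint used to generate the hash
        return "roberta-bpe"
    raise NotImplementedError(f"unrecognized pre-tokenizer hash: {chkhsh}")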

@ggerganov (Owner) replied:

Would the correct way to do this be to keep it as roberta-bpe as generated, and then add it to the mapping (master/src/llama.cpp#L6483) so that it maps to gpt-2?

Yes, this is the correct way.

@Ssukriti (Contributor, Author) replied:

@ggerganov Thank you! I have pushed the change and tested it.

I don't think the failing test in this PR is because of my changes; I see other PRs failing in the same place.

@ggerganov (Owner) left a review:

Thanks for looking into this.

ggerganov merged commit 2fffc52 into ggerganov:master on Dec 19, 2024 (50 of 51 checks passed).
arthw pushed a commit to arthw/llama.cpp that referenced this pull request on Dec 20, 2024:
* fix: Use gpt2 tokenizer for roberta and add eos/bos tokens

Branch: RobertaTokenizer

Signed-off-by: Gabe Goodhart <[email protected]>

* fixes to position embeddings

Signed-off-by: Sukriti-Sharma4 <[email protected]>

* map roberta-bpe to gpt-2

Signed-off-by: Sukriti-Sharma4 <[email protected]>

* fix linting

Signed-off-by: Sukriti-Sharma4 <[email protected]>

---------

Signed-off-by: Gabe Goodhart <[email protected]>
Signed-off-by: Sukriti-Sharma4 <[email protected]>
Co-authored-by: Gabe Goodhart <[email protected]>