
Roberta embeddings fixes #10856

Merged: 5 commits merged into ggerganov:master on Dec 19, 2024

Conversation

@Ssukriti (Contributor) commented on Dec 16, 2024:

Since the last PR, we identified that the embedding values being produced were not of good quality. On further debugging, we found that the tokenizer should be set to gpt-2 and that the position embeddings needed some modifications.
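For background, a minimal sketch of the position-id convention involved (not code from this PR): HF's Roberta offsets position ids by pad_token_id + 1, which is why a 512-token Roberta ships a 514-row position-embedding table, and why -c 514 appears in the command below.

import torch

# Sketch of HF Roberta's position-id convention: positions start at
# pad_token_id + 1, so a model with 512 usable positions carries a
# 514-row position-embedding table.
pad_token_id = 1  # Roberta's <pad> id in HF checkpoints
seq_len = 6
position_ids = torch.arange(pad_token_id + 1, seq_len + pad_token_id + 1)
print(position_ids)  # tensor([2, 3, 4, 5, 6, 7])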

This PR makes the changes needed to get correct embeddings from Roberta models.

This PR has been tested by comparing embedding values against the sentence-transformers library for the Roberta architecture:

from sentence_transformers import SentenceTransformer

# path: the Roberta checkpoint directory; prompt: the text to embed
embeddings_model = SentenceTransformer(path)
embedding_vector = embeddings_model.encode([prompt])[0]

now matches the output of

python3 convert_hf_to_gguf.py model_path --outfile model_path.gguf
llama-embedding -m model_path.gguf -p [prompt] -c 514
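As a quick numerical check, something like the following sketch can compare the two (not part of the PR; it assumes llama-embedding's stdout contains the embedding as the only decimal floats, so the parsing may need adjusting for your build):

import re
import subprocess

import numpy as np
from sentence_transformers import SentenceTransformer

model_path = "model_path"  # hypothetical: the HF checkpoint used above
prompt = "example prompt"

# Reference embedding from sentence-transformers
st_vec = SentenceTransformer(model_path).encode([prompt])[0]

# Embedding from llama.cpp; parsing assumes the vector is printed to
# stdout as decimal floats (adjust the regex for your build's output)
out = subprocess.run(
    ["llama-embedding", "-m", f"{model_path}.gguf", "-p", prompt, "-c", "514"],
    capture_output=True, text=True, check=True,
).stdout
gguf_vec = np.array([float(x) for x in re.findall(r"-?\d+\.\d+(?:[eE][+-]?\d+)?", out)])

# Cosine similarity is insensitive to any normalization difference
# between the two pipelines; a value close to 1.0 means they match
cos = float(st_vec @ gguf_vec) / (np.linalg.norm(st_vec) * np.linalg.norm(gguf_vec))
print(f"cosine similarity: {cos:.6f}")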

Hence, the embeddings are now the correct values for Roberta models.

We will add documentation examples in a follow-up PR.

gabe-l-hart and others added 2 commits December 13, 2024 16:41
Branch: RobertaTokenizer

Signed-off-by: Gabe Goodhart <[email protected]>
Signed-off-by: Sukriti-Sharma4 <[email protected]>
The github-actions bot added the python (python script changes) label on Dec 16, 2024.
# adds the cls/sep tokens as bos/eos. This is handled as a
# post-processor in tokenizers, so the chkhsh is different, but
# it still maps to gpt-2 internally.
res = "gpt-2"
@Ssukriti (Contributor, Author) commented:

As per the guidelines, we shouldn't be modifying this value in convert_hf_to_gguf.py, so that it can be autogenerated from convert_hf_to_gguf_update.py.

We want it to map to the gpt-2 tokenization type.

Would the correct way to do this be to keep it as roberta-bpe as generated, and then add it to the mapping here (https://github.com/ggerganov/llama.cpp/blob/master/src/llama.cpp#L6483) so that it maps to gpt-2?

Any input would be appreciated.
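For concreteness, the autogenerated entry would look roughly like the sketch below (the hash shown is a placeholder, not the real chkhsh); llama.cpp would then map the "roberta-bpe" string onto its gpt-2 pre-tokenizer handling.

# Sketch only: the real entry is autogenerated by
# convert_hf_to_gguf_update.py inside get_vocab_base_pre(), and the
# hash below is a placeholder, not the actual chkhsh value.
def get_vocab_base_pre_sketch(chkhsh: str) -> str:
    if chkhsh == "<placeholder-roberta-chkhsh>":
        # ref: the Roberta checkpoint used to generate the hash
        return "roberta-bpe"
    raise NotImplementedError(f"unrecognized pre-tokenizer hash: {chkhsh}")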

@ggerganov (Owner) replied:

Would the correct way to do this be to keep it as roberta-bpe as generated, and then add it to the mapping (master/src/llama.cpp#L6483) so that it maps to gpt-2?

Yes, this is the correct way.

@Ssukriti (Contributor, Author) replied:

@ggerganov Thank you! I have pushed the change and tested it.

I don't think the failing test in this PR is because of my changes; I see other PRs failing in the same place.

@ggerganov (Owner) left a review:

Thanks for looking into this.

ggerganov merged commit 2fffc52 into ggerganov:master on Dec 19, 2024 (50 of 51 checks passed).
arthw pushed a commit to arthw/llama.cpp that referenced this pull request on Dec 20, 2024:
* fix: Use gpt2 tokenizer for roberta and add eos/bos tokens

Branch: RobertaTokenizer

Signed-off-by: Gabe Goodhart <[email protected]>

* fixes to position embeddings

Signed-off-by: Sukriti-Sharma4 <[email protected]>

* map roberta-bpe to gpt-2

Signed-off-by: Sukriti-Sharma4 <[email protected]>

* fix linting

Signed-off-by: Sukriti-Sharma4 <[email protected]>

---------

Signed-off-by: Gabe Goodhart <[email protected]>
Signed-off-by: Sukriti-Sharma4 <[email protected]>
Co-authored-by: Gabe Goodhart <[email protected]>