Roberta embeddings fixes #10856
Conversation
Branch: RobertaTokenizer Signed-off-by: Gabe Goodhart <[email protected]>
Signed-off-by: Sukriti-Sharma4 <[email protected]>
convert_hf_to_gguf.py
Outdated
# adds the cls/sep tokens as bos/eos. This is handled as a
# post-processor in tokenizers, so the chkhsh is different, but
# it still maps to gpt-2 internally.
res = "gpt-2"
As per the guidelines, we shouldn't be modifying this value in convert_hf_to_gguf.py, since it is meant to be autogenerated by convert_hf_to_gguf_update.py. We still want it to map to the gpt-2 tokenization type.
Would the correct way to do this be to keep it as "roberta-bpe" as generated, and then add it to the mapping at https://github.com/ggerganov/llama.cpp/blob/master/src/llama.cpp#L6483 so that it maps to gpt-2?
Any input would be appreciated.
> Would the correct way to do this be to keep it as "roberta-bpe" as generated, and then add it to the mapping at master/src/llama.cpp#L6483 so that it maps to gpt-2?

Yes, this is the correct way.
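For illustration, a minimal sketch of what that resolution looks like on the conversion side (the function name, placeholder hash, and control flow here are assumptions modeled on the usual convert_hf_to_gguf.py pattern, not the exact code merged in this PR):

```python
def get_vocab_base_pre_sketch(chkhsh: str) -> str:
    # Hypothetical sketch: the autogenerated name is kept as "roberta-bpe"
    # instead of being overridden to "gpt-2"; mapping roberta-bpe to the
    # gpt-2 pre-tokenizer then happens on the llama.cpp side (src/llama.cpp).
    if chkhsh == "<autogenerated-roberta-hash>":  # placeholder produced by convert_hf_to_gguf_update.py
        return "roberta-bpe"
    raise NotImplementedError(f"unknown pre-tokenizer hash: {chkhsh}")
```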
@ggerganov Thank you! I have pushed the change and tested it.
I don't think the failing test in this PR is caused by my changes; I see other PRs failing at the same place.
Thanks for looking into this.
Signed-off-by: Sukriti-Sharma4 <[email protected]>
Signed-off-by: Sukriti-Sharma4 <[email protected]>
* fix: Use gpt2 tokenizer for roberta and add eos/bos tokens

  Branch: RobertaTokenizer

  Signed-off-by: Gabe Goodhart <[email protected]>

* fixes to position embeddings

  Signed-off-by: Sukriti-Sharma4 <[email protected]>

* map roberta-bpe to gpt-2

  Signed-off-by: Sukriti-Sharma4 <[email protected]>

* fix linting

  Signed-off-by: Sukriti-Sharma4 <[email protected]>

---------

Signed-off-by: Gabe Goodhart <[email protected]>
Signed-off-by: Sukriti-Sharma4 <[email protected]>
Co-authored-by: Gabe Goodhart <[email protected]>
Since the last PR, we identified that the embedding values being produced were not of good quality. On further debugging, we found that the tokenizer should be set to gpt-2 and that the position embeddings needed some modifications.
This PR makes the changes required to get correct embeddings from the Roberta model.
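For illustration, here is a small sketch of the position-id convention involved (based on the Hugging Face RoBERTa behaviour; this is an assumption about the detail being fixed, not the exact llama.cpp change):

```python
# RoBERTa-style position ids (Hugging Face convention): positions are
# offset by pad_token_id + 1 rather than starting at 0.
pad_token_id = 1   # RoBERTa's default padding index
num_tokens = 5
position_ids = [pad_token_id + 1 + i for i in range(num_tokens)]
print(position_ids)  # [2, 3, 4, 5, 6]
```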
This PR has been tested by comparing the embedding values against the sentence-transformers library for the Roberta architecture; the output now matches, so the embeddings produced for Roberta models are now correct.
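For reference, a rough sketch of that kind of comparison, assuming the llama.cpp embedding for the same text has been dumped to a file (the model name and file path below are illustrative, not the exact test setup used):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Reference embedding from sentence-transformers (illustrative model name).
ref_model = SentenceTransformer("sentence-transformers/all-roberta-large-v1")
ref = ref_model.encode("Hello world", normalize_embeddings=True)

# Embedding produced by llama.cpp for the same text (e.g. via the
# llama-embedding example), loaded from a file here for illustration.
gguf = np.loadtxt("llama_cpp_embedding.txt")
gguf = gguf / np.linalg.norm(gguf)

# If the embeddings match, the cosine similarity should be close to 1.0.
print("cosine similarity:", float(np.dot(ref, gguf)))
```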
We will add documentation examples in a follow-up PR.