Add Falcon3 support and Fix issue #10875 #10883
base: master
Conversation
@@ -525,6 +525,11 @@ def get_vocab_base(self) -> tuple[list[str], list[int], str]:
            else:
                token: str = reverse_vocab[i]
                if token in added_vocab:
                    # We need to manually encode and decode the added tokens in case special characters
                    # used for `\n` / `\t` have been manually added in the added tokens
                    if len(token) == 1:
Suggested change:
                    if len(token) == 1:
becomes:
                    # To avoid unexpected issues - we make sure to encode single-char tokens
                    if len(token) == 1:
I'm looking at the Falcon tokenizer and I don't see any added tokens that have `\n` or `\t`: https://huggingface.co/tiiuae/Falcon3-7B-Instruct/raw/main/tokenizer.json
For which tokens does this change make a difference?
Maybe also add some logs to know when this path is being triggered so we can spot any potential problems with other models.
Chiming in here! The added token is:
{
"id": 12,
"content": "Ċ",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
"special": false
}
(`\t` is the id 13). The only way to convert it properly to `\n` is to encode / decode using the tokenizer.
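For illustration, here's a minimal sketch of that round trip, assuming the `transformers` package and access to the `tiiuae/Falcon3-7B-Instruct` tokenizer (this snippet is not part of the PR):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tiiuae/Falcon3-7B-Instruct")

# "Ċ" is the byte-level BPE spelling of "\n" (added token id 12 in tokenizer.json)
token = "Ċ"
decoded = tokenizer.decode(tokenizer.encode(token, add_special_tokens=False))
print(repr(token), "->", repr(decoded))  # expected: 'Ċ' -> '\n'
```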
I just added a log message inside the if statement.
convert_hf_to_gguf.py (outdated)
                    # used for `\n` / `\t` have been manually added in the added tokens
                    # To avoid unexpected issues - we make sure to encode single-char tokens
                    if len(token) == 1:
                        logger.info("Ecode-Decode special characters using AutoTokenizer")
I was thinking about comparing the token before and after the encoding and printing the log only if there is a difference.
That's a good idea. Done!
INFO:hf-to-gguf:'Ċ' is encoded and decoded back to '\n' using AutoTokenizer
INFO:hf-to-gguf:'ĉ' is encoded and decoded back to '\t' using AutoTokenizer
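For reference, a minimal sketch of that check (not a verbatim copy of the PR diff; `tokenizer`, `added_vocab` and `logger` are assumed to be the AutoTokenizer instance, the added-token set and the module logger already used in `convert_hf_to_gguf.py`):

```python
if token in added_vocab:
    # Only single-character added tokens (e.g. 'Ċ', 'ĉ') need the round trip
    if len(token) == 1:
        decoded = tokenizer.decode(tokenizer.encode(token, add_special_tokens=False))
        if decoded != token:
            # Log only when the round trip actually changes the token
            logger.info(f"{token!r} is encoded and decoded back to {decoded!r} using AutoTokenizer")
        token = decoded
```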
Seems OK to me, but I am not sure about the full implications of this change for all other models. Want to wait for some feedback from the community.
The alternative is to find a way to apply this logic only inside the class `FalconModel`.
Actually there's no `FalconModel` class and our model type is `llama`, so we can't use that to check. The only solution I see is that we wait for some feedback from the community, and if there's any error related to this, I will be happy to address it and fix it quickly.
This can be tested by converting all tokenizers fetched by …
I think what would solve variations of this problem for other models in the future (for another PR) would be to either normalize all added tokens which are marked …
Thanks @compilade!
That could also work, as long as it's done correctly. The added tokens are in both …
This is otherwise a nice edge case that I think the convert scripts should have handled correctly, so part of me wants to keep the tokenizers the same.
Perfect, thanks, will test that out and update here!
@compilade I just did some tests and I think we can't go with the solution I suggested above, mainly for backward-compatibility reasons.
Before the manual changes:
After the fix I suggested:
--> For the same token we now get different encodings. As all Falcon3-series models have been trained with that tokenizer, even if the probability that this token appears in a text is low, I am afraid it's too risky a breaking change to introduce. Perhaps we can test that existing tokenizers are not affected by this PR, what do you think? Happy to help you on this as well.
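One hypothetical way to compare two tokenizer variants like this (the paths and sample strings below are placeholders, not from this PR): load both tokenizers and assert they produce identical token ids.

```python
from transformers import AutoTokenizer

# Placeholder paths for the unmodified and the manually edited tokenizer
old_tok = AutoTokenizer.from_pretrained("path/to/original-tokenizer")
new_tok = AutoTokenizer.from_pretrained("path/to/modified-tokenizer")

samples = ["Hello\nworld", "a\tb", "def f():\n\treturn 1"]
for text in samples:
    old_ids = old_tok.encode(text, add_special_tokens=False)
    new_ids = new_tok.encode(text, add_special_tokens=False)
    assert old_ids == new_ids, f"encoding changed for {text!r}: {old_ids} != {new_ids}"
print("encodings match for all samples")
```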
This PR adds Falcon3 support and fixes issue #10875 caused by previous PR #10864 (see #10864 for details).
Details of fixing issue #10875:
The issue is that when using meta-llama/Llama-3.1-8B-Instruct, the `<|begin_of_text|>` token is added to every special token when doing `token = tokenizer.decode(tokenizer.encode(token))`. The screenshot shows before and after `token = tokenizer.decode(tokenizer.encode(token))`.
I'm fixing this by adding `add_special_tokens=False` to `tokenizer.encode()`. Here is the result after the fix. To be extra safe, we will use `token = tokenizer.decode(tokenizer.encode(token))` only if `len(token) == 1`, so that it still fixes this issue when `\n` is encoded as `Ċ`.
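To make the `add_special_tokens=False` part concrete, here's a small sketch of the round trip (assuming the `meta-llama/Llama-3.1-8B-Instruct` tokenizer is available; `<|eot_id|>` is just an example special token, not taken from the PR):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

token = "<|eot_id|>"
# Default encode() prepends <|begin_of_text|>, so the round trip changes the token
broken = tokenizer.decode(tokenizer.encode(token))
# With add_special_tokens=False the token survives the round trip unchanged
fixed = tokenizer.decode(tokenizer.encode(token, add_special_tokens=False))
print(repr(broken))  # '<|begin_of_text|><|eot_id|>'
print(repr(fixed))   # '<|eot_id|>'
```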
Generation before the fix:
Generation after the fix:
@ggerganov @compilade @slaren