[Bug] Modifying normalizer for pretrained tokenizers doesn't consistently work
System Info
transformers==4.41.2
Who can help?
@ArthurZucker
Reproduction
From huggingface/tokenizers#1552 (comment):

>>> print(' '.join(old_tok.batch_decode(old_tok("I foo you<br>hello world")['input_ids'])))
<s> I foo you < br > hello world
>>> print(' '.join(new_tok.batch_decode(new_tok("I foo you<br>hello world")['input_ids'])))
<s> I bar you
hello world
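The construction of old_tok and new_tok lives in the linked tokenizers issue. As a rough sketch of the kind of change being tested (the model name and the exact Replace rules below are assumptions inferred from the output above, not the original reproduction code):

from transformers import AutoTokenizer
from tokenizers import normalizers

# Hypothetical setup, inferred from the decoded output above: new_tok gets
# extra Replace rules ("foo" -> "bar", literal "<br>" -> newline) prepended
# to whatever normalizer the checkpoint already ships; old_tok is untouched.
old_tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
new_tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

extra = [normalizers.Replace("foo", "bar"), normalizers.Replace("<br>", "\n")]
current = new_tok.backend_tokenizer.normalizer
new_tok.backend_tokenizer.normalizer = normalizers.Sequence(
    extra + ([current] if current is not None else [])
)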
The same process above won't work for "mistralai/Mistral-7B-v0.3". But if we reinitialize the tokenizer with __class__ after .from_pretrained, it loads the tokenizer config correctly with the extended normalizer (a sketch of this workaround follows below). See https://stackoverflow.com/questions/78612251/how-do-we-add-modify-the-normalizer-in-a-pretrained-huggingface-tokenizer/78624238#78624238
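For reference, a minimal sketch of what "reinitialize with __class__" could look like; re-wrapping the patched backend tokenizer via the tokenizer_object argument is an assumption here, and the authoritative version is in the Stack Overflow answer linked above:

from transformers import AutoTokenizer
from tokenizers import normalizers

# Hypothetical illustration of the workaround: wrap the patched backend
# tokenizer in a fresh instance of the same tokenizer class rather than
# relying on from_pretrained to keep the modified normalizer.
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.3")
tok.backend_tokenizer.normalizer = normalizers.Sequence(
    [normalizers.Replace("foo", "bar"), normalizers.Replace("<br>", "\n")]
)
tok = tok.__class__(tokenizer_object=tok.backend_tokenizer)  # reinitialize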
Expected behavior

The same .from_pretrained should work for other models' tokenizers after changes to the normalizer.

A follow-up comment on the issue:

Hey! This is not because you can't change it, but because the v3 tokenizer does not have a normalizer at all. This is the legacy=False version of the tokenizer. This should be fixed soon, by the way; the Mistral v0.1 tokenizer should end up without a normalizer.
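As a quick way to see the difference that comment describes, one can inspect the backend normalizer of each checkpoint (the expected output is an assumption based on the comment, not something verified here):

from transformers import AutoTokenizer

# Per the comment above, the v0.3 (legacy=False) tokenizer should report
# None here, while v0.1 still ships a normalizer.
for name in ("mistralai/Mistral-7B-v0.1", "mistralai/Mistral-7B-v0.3"):
    tok = AutoTokenizer.from_pretrained(name)
    print(name, "->", tok.backend_tokenizer.normalizer)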