
[Bug] Modifying normalizer for pretrained tokenizers don't consistently work #31653

Open
alvations opened this issue Jun 27, 2024 · 1 comment
Labels
Core: Tokenization Internals of the library; Tokenization.

Comments

@alvations

alvations commented Jun 27, 2024

System Info

transformers==4.41.2

Who can help?

@ArthurZucker

Reproduction

From huggingface/tokenizers#1552 (comment)

from transformers import AutoTokenizer
from tokenizers.normalizers import Sequence, Replace, Prepend

tokenizer_name = "mistralai/Mistral-7B-v0.1"
old_tok = AutoTokenizer.from_pretrained(tokenizer_name)

assert old_tok.backend_tokenizer.normalizer is not None

new_normalizer = Sequence(
    [Prepend('▁'), Replace('▁', ' '), Replace("foo", "bar"), Replace('<br>', '\n')]
)

old_tok.backend_tokenizer.normalizer = new_normalizer
new_tokenizer_name = f"new_tokenizer-{tokenizer_name}"
old_tok.save_pretrained(new_tokenizer_name)


old_tok = AutoTokenizer.from_pretrained(tokenizer_name)
new_tok = AutoTokenizer.from_pretrained(new_tokenizer_name)

[out]:

>>> print(' '.join(old_tok.batch_decode(old_tok("I foo you<br>hello world")['input_ids'])))
<s> I foo you < br > hello world

>>> print(' '.join(new_tok.batch_decode(new_tok("I foo you<br>hello world")['input_ids'])))
<s>  I  bar  you 
 hello  world

The same process above won't work for "mistralai/Mistral-7B-v0.3".

But if we reinitialize the tokenizer via its __class__ after .from_pretrained, it picks up the extended normalizer correctly (sketched below). https://stackoverflow.com/questions/78612251/how-do-we-add-modify-the-normalizer-in-a-pretrained-huggingface-tokenizer/78624238#78624238
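For reference, here is a minimal sketch of that kind of __class__ reinitialization. It is not necessarily identical to the linked answer; it assumes the fast tokenizer class accepts the tokenizer_object keyword (which is handled by PreTrainedTokenizerFast) and that the class's default special tokens are acceptable.

from transformers import AutoTokenizer
from tokenizers.normalizers import Sequence, Replace, Prepend

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.3")

# Attach the extended normalizer to the backend (Rust) tokenizer in memory.
tok.backend_tokenizer.normalizer = Sequence(
    [Prepend('▁'), Replace('▁', ' '), Replace("foo", "bar"), Replace('<br>', '\n')]
)

# Re-wrap the modified backend tokenizer by calling the tokenizer class directly;
# tokenizer_object is consumed by PreTrainedTokenizerFast, other init kwargs fall
# back to the class defaults here.
tok = tok.__class__(tokenizer_object=tok.backend_tokenizer)

print(tok.backend_tokenizer.normalizer)  # the Sequence normalizer set above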

Expected behavior

The same .from_pretrained workflow should work for other models' tokenizers after changes to the normalizer.

@alvations alvations changed the title Modifying normalizer for pretrained tokenizers don't consistently work [Bug] Modifying normalizer for pretrained tokenizers don't consistently work Jun 27, 2024
@amyeroberts amyeroberts added the Core: Tokenization Internals of the library; Tokenization. label Jun 27, 2024
@ArthurZucker
Collaborator

ArthurZucker commented Jun 28, 2024

Hey! This is not because you can't change it, but because v0.3 does not have a normalizer at all.
This is the "legacy=False" version of the tokenizer. This should be fixed soon, by the way; Mistral-7B-v0.1 should also end up without a normalizer.
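For completeness, a quick check (a sketch, assuming both checkpoints can be downloaded) that shows the difference described above, i.e. v0.1 ships a normalizer while the legacy=False v0.3 does not:

from transformers import AutoTokenizer

for name in ("mistralai/Mistral-7B-v0.1", "mistralai/Mistral-7B-v0.3"):
    tok = AutoTokenizer.from_pretrained(name)
    # v0.1 prints a normalizer object; v0.3 (the legacy=False conversion)
    # prints None, per the comment above.
    print(name, tok.backend_tokenizer.normalizer)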
