
[Bug] Modifying normalizer for pretrained tokenizers don't consistently work #31653

Open
alvations opened this issue Jun 27, 2024 · 1 comment
Labels
Core: Tokenization Internals of the library; Tokenization.

Comments

@alvations

alvations commented Jun 27, 2024

System Info

transformers==4.41.2

Who can help?

@ArthurZucker

Reproduction

From huggingface/tokenizers#1552 (comment)

from transformers import AutoTokenizer
from tokenizers.normalizers import Sequence, Replace, Prepend

tokenizer_name = "mistralai/Mistral-7B-v0.1"
old_tok = AutoTokenizer.from_pretrained(tokenizer_name)

assert old_tok.backend_tokenizer.normalizer is not None

new_normalizer = Sequence(
    [Prepend('▁'), Replace('▁', ' '), Replace("foo", "bar"), Replace('<br>', '\n')]
)

old_tok.backend_tokenizer.normalizer = new_normalizer
new_tokenizer_name = f"new_tokenizer-{tokenizer_name}"
old_tok.save_pretrained(new_tokenizer_name)


old_tok = AutoTokenizer.from_pretrained(tokenizer_name)
new_tok = AutoTokenizer.from_pretrained(new_tokenizer_name)

[out]:

>>> print(' '.join(old_tok.batch_decode(old_tok("I foo you<br>hello world")['input_ids'])))
<s> I foo you < br > hello world

>>> print(' '.join(new_tok.batch_decode(new_tok("I foo you<br>hello world")['input_ids'])))
<s>  I  bar  you 
 hello  world

The same process above won't work for "mistralai/Mistral-7B-v0.3".

But if we reinitialize the tokenizer via its __class__ after .from_pretrained, it picks up the extended normalizer correctly (sketched below). https://stackoverflow.com/questions/78612251/how-do-we-add-modify-the-normalizer-in-a-pretrained-huggingface-tokenizer/78624238#78624238
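For reference, here is a minimal sketch of that kind of __class__ reinitialization. It is not necessarily identical to the linked answer; it assumes the fast tokenizer class accepts the tokenizer_object keyword (which is handled by PreTrainedTokenizerFast) and that the class's default special tokens are acceptable.

from transformers import AutoTokenizer
from tokenizers.normalizers import Sequence, Replace, Prepend

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.3")

# Attach the extended normalizer to the backend (Rust) tokenizer in memory.
tok.backend_tokenizer.normalizer = Sequence(
    [Prepend('▁'), Replace('▁', ' '), Replace("foo", "bar"), Replace('<br>', '\n')]
)

# Re-wrap the modified backend tokenizer by calling the tokenizer class directly;
# tokenizer_object is consumed by PreTrainedTokenizerFast, other init kwargs fall
# back to the class defaults here.
tok = tok.__class__(tokenizer_object=tok.backend_tokenizer)

print(tok.backend_tokenizer.normalizer)  # the Sequence normalizer set above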

Expected behavior

The same .from_pretrained workflow should work for other models' tokenizers after changes to the normalizer.

@alvations alvations changed the title Modifying normalizer for pretrained tokenizers don't consistently work [Bug] Modifying normalizer for pretrained tokenizers don't consistently work Jun 27, 2024
@amyeroberts amyeroberts added the Core: Tokenization Internals of the library; Tokenization. label Jun 27, 2024
@ArthurZucker
Collaborator

ArthurZucker commented Jun 28, 2024

Hey! This is not because you can't change it, but because v0.3 does not have a normalizer at all.
This is the "legacy=False" version of the tokenizer. This should be fixed soon, by the way; Mistral-7B-v0.1 should also end up without a normalizer.
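For completeness, a quick check (a sketch, assuming both checkpoints can be downloaded) that shows the difference described above, i.e. v0.1 ships a normalizer while the legacy=False v0.3 does not:

from transformers import AutoTokenizer

for name in ("mistralai/Mistral-7B-v0.1", "mistralai/Mistral-7B-v0.3"):
    tok = AutoTokenizer.from_pretrained(name)
    # v0.1 prints a normalizer object; v0.3 (the legacy=False conversion)
    # prints None, per the comment above.
    print(name, tok.backend_tokenizer.normalizer)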
