
Mangled tokenization with Llama 3.1 for string sequences containing <space>'m #35938

Open · 2 of 4 tasks

tomjorquera opened this issue Jan 28, 2025 · 3 comments


tomjorquera commented Jan 28, 2025

We observed that tokenizing and then detokenizing strings containing the sequence <space>'m does not give back the initial string: the leading whitespace gets "eaten".

For example, the string "for 'manual'" is transformed into "for'manual'".

Investigating further, we also observed issues with strings containing <space>'s, which makes us think the problem may be related to the handling of sequences such as "I'm".

System Info

transformers==4.46.2

Who can help?

I guess it's for @ArthurZucker and @itazap

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Running:

from transformers import AutoTokenizer

prompt = """for 'manual'"""

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
print(tokenizer.batch_decode(tokenizer([prompt])["input_ids"], skip_special_tokens=True)[0])

prints

"for'manual'"

(missing whitespace before the leading ')
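
As a quick check, inspecting the token strings suggests the space survives encoding and is only dropped at decode time. A minimal sketch (note that this BPE tokenizer renders a leading space as "Ġ" in the token strings):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")

ids = tokenizer("for 'manual'")["input_ids"]

# The token strings still carry the leading space (shown as "Ġ"),
# so the space is not lost during encoding; it must be getting
# removed somewhere in the decoding step:
print(tokenizer.convert_ids_to_tokens(ids))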

Expected behavior

It should output the following

"for'manual'"
@tomjorquera (Author)

Note: we first encountered this issue in TGI, as reported in huggingface/text-generation-inference#2927

@ArthurZucker (Collaborator)

Hey! I think this is related to clean_up_tokenization_spaces, which defaults to True in the version of transformers you are using.
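
For context, that cleanup is a plain string substitution applied after decoding, which rewrites sequences like " 'm" and " 's" into their space-less forms. A minimal sketch, assuming the clean_up_tokenization static helper on the base tokenizer class (present in the versions discussed here):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")

# clean_up_tokenization replaces " 'm", " 's", " n't", " ,", " ." etc.
# with their space-less forms; the " 'm" rule is exactly what eats the
# space in "for 'manual'":
print(tokenizer.clean_up_tokenization("for 'manual'"))
# -> "for'manual'"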


tom-jorquera-pfx commented Jan 29, 2025

Thanks for the reply @ArthurZucker !

I can confirm that if I add clean_up_tokenization_spaces=False to the call to batch_decode, the issue disappears.
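
For reference, the flag can also be set once at load time rather than on every call. A sketch, assuming clean_up_tokenization_spaces is accepted as an init kwarg and used as the decode default (which matches the docs for these versions):

from transformers import AutoTokenizer

# Setting the flag at load time makes it the default for every decode:
tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    clean_up_tokenization_spaces=False,
)

prompt = "for 'manual'"
ids = tokenizer([prompt])["input_ids"]
print(tokenizer.batch_decode(ids, skip_special_tokens=True)[0])
# -> "for 'manual'" (the space is preserved)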

I'm not sure whether this means the behavior is "normal" (or at least expected), or what the consequences of setting this option to False would be.

I tested updating to v4.48.1 (actually, I forgot to report that I had already tested with v4.48.0 before, sorry), and the behavior is still the same. From the docs it seems True is indeed still the default in the latest version.

Is there something to improve here, or should it be left at that?
