Tokenizer v5: Fix apply chat template + typo #3238
Conversation
Pull request overview
This PR fixes a common tokenization mistake and corrects a typo in the tokenizers documentation. The main issue addressed is the double application of special tokens: once by apply_chat_template() and again by the tokenizer itself. This is resolved by adding the add_special_tokens=False parameter when tokenizing chat-formatted text.
Key changes:
- Added `add_special_tokens=False` parameter to prevent double insertion of special tokens
- Fixed typo in special token notation (added missing pipe character: `|>`)
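The double-insertion pitfall can be illustrated with a minimal sketch. Note this uses a hypothetical toy tokenizer, not the real `transformers` API: the chat template already emits special tokens such as `<|im_start|>`, so tokenizing its output with the default `add_special_tokens=True` inserts them a second time.

```python
class ToyTokenizer:
    """Toy stand-in for a chat tokenizer (illustrative only).

    apply_chat_template() already writes the special tokens into the
    text, and encode() prepends a BOS special token on top of that
    unless add_special_tokens=False is passed.
    """
    bos = "<|im_start|>"

    def apply_chat_template(self, messages):
        # The template itself inserts the special tokens.
        return "".join(
            f"{self.bos}{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
        )

    def encode(self, text, add_special_tokens=True):
        tokens = text.split()
        if add_special_tokens:
            # Second insertion: the text already contains the BOS token.
            tokens = [self.bos] + tokens
        return tokens


tok = ToyTokenizer()
text = tok.apply_chat_template([{"role": "user", "content": "hi"}])

doubled = tok.encode(text)                            # BOS appears twice
correct = tok.encode(text, add_special_tokens=False)  # template output kept as-is
```

With the default, `"".join(doubled)` contains `<|im_start|>` twice; with `add_special_tokens=False` it appears exactly once, which is the behavior this PR documents.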
Cc @itazap
pcuenca left a comment
Thank you!
```diff
 # <|im_start|>assistant
-model_inputs = tokenizer([text], return_tensors="pt")
+model_inputs = tokenizer([text], add_special_tokens=False, return_tensors="pt")
```
Yes, I agree this is dangerous in general (although it does not affect this particular case).
In fact, it'd be useful if we could recommend apply_chat_template with tokenize=True as the preferred way to prepare conversational inputs cc @ariG23498. (No need to change this example, which is illustrative, but we could add a comment at an appropriate point.)
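The pattern suggested here can be sketched with the same kind of toy tokenizer (hypothetical, not the real `transformers` implementation; only the `tokenize` parameter name mirrors `apply_chat_template`): when the template method tokenizes directly, there is no second `encode()` call, so the `add_special_tokens` pitfall cannot occur.

```python
class ToyChatTokenizer:
    """Toy illustration of tokenizing inside apply_chat_template."""
    bos, eos = "<|im_start|>", "<|im_end|>"

    def apply_chat_template(self, messages, tokenize=False):
        # The template is the single source of truth for special tokens.
        text = "".join(
            f"{self.bos}{m['role']}\n{m['content']}{self.eos}\n" for m in messages
        )
        if tokenize:
            # Tokenize here, without re-adding special tokens.
            return text.split()
        return text


tok = ToyChatTokenizer()
tokens = tok.apply_chat_template(
    [{"role": "user", "content": "hi"}], tokenize=True
)
```

Because tokenization happens inside the template call, `<|im_start|>` ends up in the output exactly once, with no parameter for the caller to forget.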
```diff
-Notice how the special tokens like `<|im_start>` and `<|im_end>` are applied to the prompt before tokenizing. This is useful for the model to learn where a new sequence starts and ends.
+Notice how the special tokens like `<|im_start|>` and `<|im_end|>` are applied to the prompt before tokenizing. This is useful for the model to learn where a new sequence starts and ends.
```
oh yes
Avoid a common mistake, see https://huggingface.co/blog/qgallouedec/gotchas-in-tokenizer-behavior#7-chat-template-and-tokenization-dont-compose-due-to-special-tokens