Tokenizer v5: Fix apply chat template + typo #3238
Conversation
Pull request overview
This PR fixes a common tokenization mistake and corrects a typo in the tokenizers documentation. The main issue addressed is the double application of special tokens: once by apply_chat_template() and again by the tokenizer itself. This is resolved by adding the add_special_tokens=False parameter when tokenizing chat-formatted text.
Key changes:
- Added `add_special_tokens=False` parameter to prevent double insertion of special tokens
- Fixed typo in special token notation (added missing pipe character: `|>`)
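The double-insertion pitfall can be illustrated with a minimal sketch. Note this uses a hypothetical toy tokenizer, not the real `transformers` API: the chat template already emits special tokens such as `<|im_start|>`, so tokenizing its output with the default `add_special_tokens=True` inserts them a second time.

```python
class ToyTokenizer:
    """Toy stand-in for a chat tokenizer (illustrative only).

    apply_chat_template() already writes the special tokens into the
    text, and encode() prepends a BOS special token on top of that
    unless add_special_tokens=False is passed.
    """
    bos = "<|im_start|>"

    def apply_chat_template(self, messages):
        # The template itself inserts the special tokens.
        return "".join(
            f"{self.bos}{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
        )

    def encode(self, text, add_special_tokens=True):
        tokens = text.split()
        if add_special_tokens:
            # Second insertion: the text already contains the BOS token.
            tokens = [self.bos] + tokens
        return tokens


tok = ToyTokenizer()
text = tok.apply_chat_template([{"role": "user", "content": "hi"}])

doubled = tok.encode(text)                            # BOS appears twice
correct = tok.encode(text, add_special_tokens=False)  # template output kept as-is
```

With the default, `"".join(doubled)` contains `<|im_start|>` twice; with `add_special_tokens=False` it appears exactly once, which is the behavior this PR documents.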
Cc @itazap
pcuenca left a comment
Thank you!
```diff
 # <|im_start|>assistant
-model_inputs = tokenizer([text], return_tensors="pt")
+model_inputs = tokenizer([text], add_special_tokens=False, return_tensors="pt")
```
Yes, I agree this is dangerous in general (although it does not affect this particular case).
In fact, it'd be useful if we could recommend apply_chat_template with tokenize=True as the preferred way to prepare conversational inputs cc @ariG23498. (No need to change this example, which is illustrative, but we could add a comment at an appropriate point.)
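The pattern suggested here can be sketched with the same kind of toy tokenizer (hypothetical, not the real `transformers` implementation; only the `tokenize` parameter name mirrors `apply_chat_template`): when the template method tokenizes directly, there is no second `encode()` call, so the `add_special_tokens` pitfall cannot occur.

```python
class ToyChatTokenizer:
    """Toy illustration of tokenizing inside apply_chat_template."""
    bos, eos = "<|im_start|>", "<|im_end|>"

    def apply_chat_template(self, messages, tokenize=False):
        # The template is the single source of truth for special tokens.
        text = "".join(
            f"{self.bos}{m['role']}\n{m['content']}{self.eos}\n" for m in messages
        )
        if tokenize:
            # Tokenize here, without re-adding special tokens.
            return text.split()
        return text


tok = ToyChatTokenizer()
tokens = tok.apply_chat_template(
    [{"role": "user", "content": "hi"}], tokenize=True
)
```

Because tokenization happens inside the template call, `<|im_start|>` ends up in the output exactly once, with no parameter for the caller to forget.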
```diff
-Notice how the special tokens like `<|im_start>` and `<|im_end>` are applied to the prompt before tokenizing. This is useful for the model to learn where a new sequence starts and ends.
+Notice how the special tokens like `<|im_start|>` and `<|im_end|>` are applied to the prompt before tokenizing. This is useful for the model to learn where a new sequence starts and ends.
```
oh yes
Avoid a common mistake, see https://huggingface.co/blog/qgallouedec/gotchas-in-tokenizer-behavior#7-chat-template-and-tokenization-dont-compose-due-to-special-tokens