Conversation

@qgallouedec
Member

@qgallouedec qgallouedec changed the title Update tokenizers.md Tokenizer v5: Fix apply chat template + typo Dec 20, 2025
@qgallouedec qgallouedec requested review from ArthurZucker and Copilot and removed request for ArthurZucker December 20, 2025 17:56
Contributor

Copilot AI left a comment

Pull request overview

This PR fixes a common tokenization mistake and corrects a typo in the tokenizers documentation. The main issue addressed is the double application of special tokens: once by apply_chat_template() and again by the tokenizer itself. This is resolved by adding the add_special_tokens=False parameter when tokenizing chat-formatted text.

Key changes:

  • Added add_special_tokens=False parameter to prevent double insertion of special tokens
  • Fixed typo in special token notation (added missing pipe character |>)
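The mistake the PR fixes can be sketched with a toy example. The names below (`BOS`, `apply_chat_template`, `tokenize`) are illustrative stand-ins, not the real transformers API: the chat template already writes the special tokens into the text, and a tokenizer that inserts its own special tokens by default then duplicates them.

```python
# Toy sketch of the double-special-tokens mistake (hypothetical names,
# not the real transformers API).
BOS = "<|im_start|>"

def apply_chat_template(messages):
    # The chat template already inserts the special tokens itself.
    return "".join(f"{BOS}{m['role']}\n{m['content']}<|im_end|>\n" for m in messages)

def tokenize(text, add_special_tokens=True):
    # Like a real tokenizer, this prepends a special token by default.
    return (BOS + text) if add_special_tokens else text

messages = [{"role": "user", "content": "Hi"}]
text = apply_chat_template(messages)

print(tokenize(text).count(BOS))                            # 2: token duplicated
print(tokenize(text, add_special_tokens=False).count(BOS))  # 1: correct
```

Passing `add_special_tokens=False` at the second step, as the PR does, is what keeps the count at one.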


@qgallouedec
Member Author

Cc @itazap

Member

@pcuenca pcuenca left a comment

Thank you!

```diff
  # <|im_start|>assistant

- model_inputs = tokenizer([text], return_tensors="pt")
+ model_inputs = tokenizer([text], add_special_tokens=False, return_tensors="pt")
```
Member

@pcuenca pcuenca Dec 20, 2025

Yes, I agree this is dangerous in general (although it does not affect this particular case).

In fact, it would be useful to recommend apply_chat_template with tokenize=True as the preferred way to prepare conversational inputs cc @ariG23498. (No need to change this example, which is illustrative, but we could add a comment at an appropriate point.)
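A minimal sketch of why the one-step call is safer (hypothetical helper names, not the real transformers signatures): when the template call tokenizes internally, it can disable special-token insertion itself, so callers never get the chance to double-apply it.

```python
# Hypothetical sketch, not the real transformers implementation.
BOS = "<|im_start|>"

def _render(messages):
    # The template writes the special tokens into the text.
    return "".join(f"{BOS}{m['role']}\n{m['content']}<|im_end|>\n" for m in messages)

def _encode(text, add_special_tokens=True):
    # Stand-in tokenizer that prepends a special token by default.
    return (BOS + text) if add_special_tokens else text

def apply_chat_template(messages, tokenize=False):
    text = _render(messages)
    if not tokenize:
        return text
    # Tokenizing here lets the template pass the right flag itself, so the
    # caller cannot accidentally double-insert the special tokens.
    return _encode(text, add_special_tokens=False)

inputs = apply_chat_template([{"role": "user", "content": "Hi"}], tokenize=True)
print(inputs.count(BOS))  # 1: no duplication possible
```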

```diff
- Notice how the special tokens like `<|im_start>` and `<|im_end|>` are applied to the prompt before tokenizing. This is useful for the model to learn where a new sequence starts and ends.
+ Notice how the special tokens like `<|im_start|>` and `<|im_end|>` are applied to the prompt before tokenizing. This is useful for the model to learn where a new sequence starts and ends.
```
Member

oh yes

@pcuenca pcuenca merged commit 08c40d7 into main Dec 20, 2025
7 checks passed
@pcuenca pcuenca deleted the qgallouedec-patch-1 branch December 20, 2025 19:45