
Conversation

@ariG23498
Contributor

This PR adds the new tokenization blog post.

CC: https://github.com/itazap @ArthurZucker @sergiopaniego

Co-authored-by: Sergio Paniego Blanco <[email protected]>
Contributor

@merveenoyan left a comment


very nice! 🙌🏻 I learnt a lot about v5 changes thanks to you @ariG23498

tokenizers.md Outdated
* **Vocabulary size of the tokenizer**: how many unique tokens exist in the vocabulary of the tokenizer
* **Context length of the model**: how many tokens the model was trained to attend to at once, and can process in a single forward pass.

A good tokenizer *compresses* text into the smallest number of tokens. Fewer tokens mean more usable context without increasing model size. Training a tokenizer boils down to finding the best compression rules for your datasets. For example, if you work on a Chinese corpus you can sometimes find [very nice surprises 😉](https://x.com/suchenzang/status/1697862650053660721).
Contributor

idk if you'll talk about BPE later, but I'd also mention something about representing text with the fewest tokens possible, based on frequency!
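
A quick way to make the compression point concrete is to count tokens for the same text with different tokenizers (a small sketch using the stable `AutoTokenizer` API; the checkpoints are just examples, not ones named in the post):

```python
from transformers import AutoTokenizer

text = "Tokenization splits text into sub-word units."

# Fewer tokens for the same text means better compression and more usable context.
for name in ("gpt2", "bert-base-uncased"):  # example checkpoints
    tok = AutoTokenizer.from_pretrained(name)
    ids = tok(text, add_special_tokens=False)["input_ids"]
    print(f"{name}: vocab size {tok.vocab_size}, {len(ids)} tokens")
```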


Behind the scenes, `AutoTokenizer` performs these steps:

1. **Download the tokenizer configuration.** The `from_pretrained` method fetches `tokenizer_config.json` from the Hub (or from a local directory).
Contributor

if we go very deep I'd also explain tokenizer_config.json and the params there. honestly I had to debug tokenizers many times and these files often have culprits 😄


we plan to deprecate tokenizer_config.json; in v5 we don't really rely on the params there much, apart from reading the tokenizer class to load. So we can probably skip over this!


(we try to get the params from the tokenizer file itself)
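
For readers, the user-facing picture stays simple either way: `from_pretrained` resolves the tokenizer files from the Hub or from a local directory (a minimal sketch with the stable API; `gpt2` and the local path are only example inputs):

```python
from transformers import AutoTokenizer

# From the Hub: from_pretrained downloads the tokenizer files for the repo.
tok = AutoTokenizer.from_pretrained("gpt2")

# Save locally, then reload from the directory instead of the Hub.
tok.save_pretrained("./my-tokenizer")  # writes tokenizer.json and related files
tok = AutoTokenizer.from_pretrained("./my-tokenizer")

print(tok("tokenizers in v5")["input_ids"])
```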


## v5 Separates Tokenizer Architecture from Trained Vocab

The most significant change in Transformers v5 is a philosophical shift in how tokenizers are defined. **Tokenizers now work like PyTorch's `nn.Module`**: you define the architecture first, then fill it with learned parameters.
Contributor

here it sounds a bit like you have learnt the params of a tokenizer in a different framework and you load them into an HF tokenizer architecture. but when I read a bit further down I noticed it's the opposite: you train your own tokenizer. if the latter is not the case I'd clarify a bit


yes, it's as it sounds! we read the vocab and merges as parameters from tokenizers or sentencepiece files (tokenizer.json vs sentencepiece.model, respectively) and pass the params to create a tokenizer.

We also support training your own tokenizer, but that is unrelated to loading one with .from_pretrained! It's similar to how in PyTorch you can define a model with nn.Embedding and fill it with pretrained weights, or you can initialize it and train those weights.
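
A rough sketch of that analogy using the standalone `tokenizers` library (illustrative only; the corpus and sizes are made up): define the model first, then either train its vocab and merges or load already-learned ones from a file.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# "Architecture" first: a BPE tokenizer with no vocab or merges yet.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Fill in the "parameters" by training vocab + merges on your own corpus...
trainer = BpeTrainer(vocab_size=1000, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(["some text", "more text to learn merges from"], trainer=trainer)

# ...or load already-learned parameters from an existing tokenizer.json instead:
# tokenizer = Tokenizer.from_file("tokenizer.json")
```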

tokenizers.md Outdated

### One file, one backend, one recommended path

v5 consolidates the two-file system *into a single file per model*. `LlamaTokenizer` now inherits from `TokenizersBackend`, which wraps the fast Rust tokenizer by default. The slow Python backend (`PythonBackend`) and SentencePiece backend (`SentencePieceBackend`) still exist for models that need them, but **Rust-backed tokenization is the preferred default**.
Contributor

do new models only have fast? I'd add that

Contributor Author

The idea with v5 is that we will never have a separate "fast" variant from now on. The TokenizersBackend is fast and we want to keep it that way.

Member

I'm still confused about SentencePiece going through the legacy path. I would clarify it's not necessary.


Okay, we shouldn't refer to the Python backend as slow in v5, but we should explicitly state that the TokenizersBackend classes were the fast classes from v4 and the PythonBackend classes were the slow ones.

If it's not too wordy:

v5 consolidates the two-file system into a single file per model. LlamaTokenizer now inherits from TokenizersBackend, which wraps the Rust-based tokenizer that was previously exposed as the “fast” implementation and is now the default. The former “slow” Python implementation lives explicitly behind PythonBackend, and SentencePieceBackend remains for models that require it, but Rust-backed tokenization is the preferred default.
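
One way to show readers what "Rust-backed by default" means in practice: the Rust tokenizer consumes the single tokenizer.json directly, and the transformers class wraps that same object (a sketch using currently available APIs; the repo name is just an example):

```python
from huggingface_hub import hf_hub_download
from tokenizers import Tokenizer
from transformers import AutoTokenizer

# The Rust tokenizer reads the single tokenizer.json file directly...
path = hf_hub_download(repo_id="gpt2", filename="tokenizer.json")  # example repo
rust_tok = Tokenizer.from_file(path)

# ...and the transformers tokenizer wraps that same Rust object by default.
hf_tok = AutoTokenizer.from_pretrained("gpt2")

text = "one file, one backend"
print(rust_tok.encode(text).ids)
print(hf_tok(text, add_special_tokens=False)["input_ids"])  # should line up
```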

Member

@pcuenca left a comment


Great work, covering a lot of ground! 🔥


@ariG23498 ariG23498 merged commit 8418743 into main Dec 18, 2025
1 check passed
@ariG23498 ariG23498 deleted the aritra/tokenizers branch December 18, 2025 15:53