[Add] Tokenization Blog Post #3233
Conversation
Co-authored-by: Sergio Paniego Blanco <[email protected]>
merveenoyan left a comment:
very nice! 🙌🏻 I learnt a lot about v5 changes thanks to you @ariG23498
tokenizers.md (Outdated)
> * **Vocabulary size of the tokenizer**: how many unique tokens exist in the vocabulary of the tokenizer.
> * **Context length of the model**: how many tokens the model was trained to attend to at once and can process in a single forward pass.
>
> A good tokenizer *compresses* text into the smallest number of tokens. Fewer tokens means more usable context without increasing model size. Training a tokenizer boils down to finding the best compression rules for your datasets. For example, if you work on a Chinese corpus you can sometimes find [very nice surprises 😉](https://x.com/suchenzang/status/1697862650053660721).
idk if you'll talk about BPE later, but I'd also mention the frequency aspect of representing text with the fewest tokens possible!
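To make the vocab-size, context-length, and compression points above concrete, here is a small sketch (assuming the `transformers` library; the GPT-2 checkpoint is an arbitrary example, not one from the post):

```python
# Illustrative sketch only; the checkpoint is an arbitrary example.
from transformers import AutoConfig, AutoTokenizer

repo = "openai-community/gpt2"
tokenizer = AutoTokenizer.from_pretrained(repo)
config = AutoConfig.from_pretrained(repo)

print("vocab size:", tokenizer.vocab_size)                 # unique tokens in the tokenizer's vocabulary
print("context length:", config.max_position_embeddings)   # tokens the model attends to in one forward pass

# "Compression" in practice: fewer tokens for the same text means more usable context.
text = "Tokenizers compress text: fewer tokens, more usable context."
print("token count:", len(tokenizer(text)["input_ids"]))
```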
> Behind the scenes, `AutoTokenizer` performs these steps:
>
> 1. **Download the tokenizer configuration.** The `from_pretrained` method fetches `tokenizer_config.json` from the Hub (or from a local directory).
if we go very deep I'd also explain tokenizer_config.json and the params there. honestly I had to debug tokenizers many times and these files often have culprits 😄
We plan to deprecate tokenizer_config.json; in v5 we don't really rely on the params there, apart from reading the tokenizer class to load. So we can probably skip over this!
(we try to get the params from the tokenizer file itself)
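As a quick illustration of the loading step quoted above, a minimal sketch (the checkpoint id is only an example):

```python
# Minimal sketch of the loading path described in the quoted step above.
from transformers import AutoTokenizer

# from_pretrained resolves the tokenizer files from the Hub (or a local directory)
# and instantiates the matching tokenizer class for the checkpoint.
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")

ids = tokenizer("Tokenization turns text into integer ids.")["input_ids"]
print(ids)
print(tokenizer.decode(ids))
```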
> ## v5 Separates Tokenizer Architecture from Trained Vocab
>
> The most significant change in Transformers v5 is a philosophical shift in how tokenizers are defined. **Tokenizers now work like PyTorch's `nn.Module`**: you define the architecture first, then fill it with learned parameters.
Here it sounds a bit like you have learnt the params of a tokenizer in a different framework and you load them into HF tokenizers with an architecture. But when I read a bit further down I noticed it's the opposite: you train your own tokenizer. If the latter is not the case, I'd clarify a bit.
Yes, it's as it sounds! We read the vocab and merges as parameters from tokenizers or sentencepiece files (tokenizer.json vs sentencepiece.model respectively) and pass the params to create a tokenizer.
We also support training your own tokenizer, but that is unrelated to loading one with .from_pretrained! It's similar to how in PyTorch you can define a model with nn.Embedding and fill it with weights, or you can initialize it and train those weights.
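A rough sketch of that analogy (plain PyTorch, not the tokenizer API itself):

```python
# Rough sketch of the nn.Embedding analogy above; this is not the tokenizer API.
import torch
import torch.nn as nn

vocab_size, hidden_dim = 32_000, 64

# 1) Define the "architecture" first (like defining the tokenizer's structure).
embedding = nn.Embedding(vocab_size, hidden_dim)

# 2a) Fill it with existing parameters (like reading vocab/merges via from_pretrained)...
pretrained_weights = torch.randn(vocab_size, hidden_dim)  # stand-in for real weights
embedding.load_state_dict({"weight": pretrained_weights})

# 2b) ...or keep the fresh initialization and train the parameters yourself
# (the "train your own tokenizer" path, which is independent of from_pretrained).
```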
tokenizers.md (Outdated)
> ### One file, one backend, one recommended path
>
> v5 consolidates the two-file system *into a single file per model*. `LlamaTokenizer` now inherits from `TokenizersBackend`, which wraps the fast Rust tokenizer by default. The slow Python backend (`PythonBackend`) and SentencePiece backend (`SentencePieceBackend`) still exist for models that need them, but **Rust-backed tokenization is the preferred default**.
do new models only have fast? I'd add that
The idea with v5 is that we will never have separate "fast" tokenizers from now on. The TokenizersBackend is fast and we want to keep it that way.
I'm still confused about SentencePiece going through the legacy path. I would clarify it's not necessary.
Okay, we shouldn't refer to the Python backend as slow in v5, but we should explicitly state that the `TokenizersBackend` classes were the fast classes from v4, and the `PythonBackend` classes were the slow ones.
If it's not too wordy:
> v5 consolidates the two-file system into a single file per model. `LlamaTokenizer` now inherits from `TokenizersBackend`, which wraps the Rust-based tokenizer that was previously exposed as the "fast" implementation and is now the default. The former "slow" Python implementation lives explicitly behind `PythonBackend`, and `SentencePieceBackend` remains for models that require it, but Rust-backed tokenization is the preferred default.
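If the post wants to show the backend split concretely, one hedged sketch (the backend class names come from the discussion above; exact v5 import paths and MRO may differ, and the checkpoint is only an example):

```python
# Hedged sketch: inspect which backend a loaded tokenizer actually sits on.
# The backend names (TokenizersBackend / PythonBackend / SentencePieceBackend)
# come from the review discussion; the exact v5 class hierarchy may differ.
from transformers import AutoTokenizer

# Any non-gated checkpoint that ships a tokenizer.json; the repo id is an example.
tok = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")

# Walking the MRO shows the concrete class and the backend it inherits from.
for cls in type(tok).__mro__:
    print(cls.__name__)
```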
Co-authored-by: Merve Noyan <[email protected]>
pcuenca left a comment:
Great work, covering a lot of ground! 🔥
Co-authored-by: Pedro Cuenca <[email protected]>
This PR adds the new tokenization blog post.
CC: https://github.com/itazap @ArthurZucker @sergiopaniego