[Add] Tokenization Blog Post #3233
Conversation
Co-authored-by: Sergio Paniego Blanco <[email protected]>
merveenoyan left a comment:
very nice! 🙌🏻 I learnt a lot about v5 changes thanks to you @ariG23498
tokenizers.md (Outdated)
> * **Vocabulary size of the tokenizer**: how many unique tokens exist in the vocabulary of the tokenizer.
> * **Context length of the model**: how many tokens the model was trained to attend to at once and can process in a single forward pass.
>
> A good tokenizer *compresses* text into the smallest number of tokens. Fewer tokens means more usable context without increasing model size. Training a tokenizer boils down to finding the best compression rules for your datasets. For example, if you work on a Chinese corpus you can sometimes find [very nice surprises 😉](https://x.com/suchenzang/status/1697862650053660721).
idk if you'll talk about BPE later, but I'd also mention the frequency aspect of representing text with the fewest tokens possible!
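To make the vocab-size, context-length, and compression points above concrete, here is a small sketch (assuming the `transformers` library; the GPT-2 checkpoint is an arbitrary example, not one from the post):

```python
# Illustrative sketch only; the checkpoint is an arbitrary example.
from transformers import AutoConfig, AutoTokenizer

repo = "openai-community/gpt2"
tokenizer = AutoTokenizer.from_pretrained(repo)
config = AutoConfig.from_pretrained(repo)

print("vocab size:", tokenizer.vocab_size)                 # unique tokens in the tokenizer's vocabulary
print("context length:", config.max_position_embeddings)   # tokens the model attends to in one forward pass

# "Compression" in practice: fewer tokens for the same text means more usable context.
text = "Tokenizers compress text: fewer tokens, more usable context."
print("token count:", len(tokenizer(text)["input_ids"]))
```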
> Behind the scenes, `AutoTokenizer` performs these steps:
>
> 1. **Download the tokenizer configuration.** The `from_pretrained` method fetches `tokenizer_config.json` from the Hub (or from a local directory).
if we go very deep I'd also explain tokenizer_config.json and the params there. honestly I had to debug tokenizers many times and these files often have culprits 😄
We plan to deprecate tokenizer_config.json; in v5 we don't really rely on the params there, apart from reading the tokenizer class to load. So we can probably skip over this!
(we try to get the params from the tokenizer file itself)
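As a quick illustration of the loading step quoted above, a minimal sketch (the checkpoint id is only an example):

```python
# Minimal sketch of the loading path described in the quoted step above.
from transformers import AutoTokenizer

# from_pretrained resolves the tokenizer files from the Hub (or a local directory)
# and instantiates the matching tokenizer class for the checkpoint.
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")

ids = tokenizer("Tokenization turns text into integer ids.")["input_ids"]
print(ids)
print(tokenizer.decode(ids))
```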
> ## v5 Separates Tokenizer Architecture from Trained Vocab
>
> The most significant change in Transformers v5 is a philosophical shift in how tokenizers are defined. **Tokenizers now work like PyTorch's `nn.Module`**: you define the architecture first, then fill it with learned parameters.
Here it sounds a bit like you have learnt the params of a tokenizer in a different framework and you load them into HF tokenizers with an architecture. But when I read a bit further down I noticed it's the opposite: you train your own tokenizer. If the latter is not the case, I'd clarify a bit.
Yes, it's as it sounds! We read the vocab and merges as parameters from tokenizers or sentencepiece files (tokenizer.json vs sentencepiece.model respectively) and pass the params to create a tokenizer.
We also support training your own tokenizer, but that is unrelated to loading one with .from_pretrained! It's similar to how in PyTorch you can define a model with nn.Embedding and fill it with weights, or you can initialize it and train those weights.
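A rough sketch of that analogy (plain PyTorch, not the tokenizer API itself):

```python
# Rough sketch of the nn.Embedding analogy above; this is not the tokenizer API.
import torch
import torch.nn as nn

vocab_size, hidden_dim = 32_000, 64

# 1) Define the "architecture" first (like defining the tokenizer's structure).
embedding = nn.Embedding(vocab_size, hidden_dim)

# 2a) Fill it with existing parameters (like reading vocab/merges via from_pretrained)...
pretrained_weights = torch.randn(vocab_size, hidden_dim)  # stand-in for real weights
embedding.load_state_dict({"weight": pretrained_weights})

# 2b) ...or keep the fresh initialization and train the parameters yourself
# (the "train your own tokenizer" path, which is independent of from_pretrained).
```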
tokenizers.md (Outdated)
> ### One file, one backend, one recommended path
>
> v5 consolidates the two-file system *into a single file per model*. `LlamaTokenizer` now inherits from `TokenizersBackend`, which wraps the fast Rust tokenizer by default. The slow Python backend (`PythonBackend`) and SentencePiece backend (`SentencePieceBackend`) still exist for models that need them, but **Rust-backed tokenization is the preferred default**.
do new models only have fast? I'd add that
The idea with v5 is that we will never have separate "fast" tokenizers from now on. The TokenizersBackend is fast and we want to keep it that way.
I'm still confused about SentencePiece going through the legacy path. I would clarify it's not necessary.
Okay, we shouldn't refer to the Python backend as slow in v5, but we should explicitly state that the `TokenizersBackend` classes were the fast classes from v4, and the `PythonBackend` classes were the slow ones.
If it's not too wordy:
> v5 consolidates the two-file system into a single file per model. `LlamaTokenizer` now inherits from `TokenizersBackend`, which wraps the Rust-based tokenizer that was previously exposed as the "fast" implementation and is now the default. The former "slow" Python implementation lives explicitly behind `PythonBackend`, and `SentencePieceBackend` remains for models that require it, but Rust-backed tokenization is the preferred default.
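If the post wants to show the backend split concretely, one hedged sketch (the backend class names come from the discussion above; exact v5 import paths and MRO may differ, and the checkpoint is only an example):

```python
# Hedged sketch: inspect which backend a loaded tokenizer actually sits on.
# The backend names (TokenizersBackend / PythonBackend / SentencePieceBackend)
# come from the review discussion; the exact v5 class hierarchy may differ.
from transformers import AutoTokenizer

# Any non-gated checkpoint that ships a tokenizer.json; the repo id is an example.
tok = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")

# Walking the MRO shows the concrete class and the backend it inherits from.
for cls in type(tok).__mro__:
    print(cls.__name__)
```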
Co-authored-by: Merve Noyan <[email protected]>
pcuenca left a comment:
Great work, covering a lot of ground! 🔥
Co-authored-by: Pedro Cuenca <[email protected]>
This PR adds the new tokenization blog post.
CC: https://github.com/itazap @ArthurZucker @sergiopaniego