diff --git a/_blog.yml b/_blog.yml index 6840c611a6..76d3554838 100644 --- a/_blog.yml +++ b/_blog.yml @@ -4992,4 +4992,12 @@ - Claude - Codex - Gemini - - agents \ No newline at end of file + - agents + +- local: tokenizers + date: Dec 18, 2025 + tags: + - tokenizers + - transformers + - open-source + - tokenization diff --git a/assets/tokenizers/thumbnail.png b/assets/tokenizers/thumbnail.png new file mode 100644 index 0000000000..871fd37aee Binary files /dev/null and b/assets/tokenizers/thumbnail.png differ diff --git a/tokenizers.md b/tokenizers.md new file mode 100644 index 0000000000..7439c7657a --- /dev/null +++ b/tokenizers.md @@ -0,0 +1,501 @@ +--- +title: "Tokenization in Transformers v5: Simpler, Clearer, and More Modular" +thumbnail: /blog/assets/tokenizers/thumbnail.png +authors: +- user: itazap +- user: ariG23498 +- user: ArthurZ +- user: sergiopaniego +- user: merve +- user: pcuenq +--- + +# Tokenization in Transformers v5: Simpler, Clearer, and More Modular + +![thumbnail](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/tokenizers/thumbnail.png) + +[Transformers v5](https://huggingface.co/blog/transformers-v5) redesigns how tokenizers work. The [big tokenizers reformat](https://github.com/huggingface/transformers/pull/40936/files) separates tokenizer design from trained vocabulary (much like how PyTorch separates neural network architecture from learned weights). The result is tokenizers you can *inspect*, *customize*, and *train* from scratch with far less friction. + +> [!NOTE] +> TL;DR: This blog explains how tokenization works in Transformers and why v5 is a major redesign, with clearer internals, a clean class hierarchy, and a single fast backend. It’s a practical guide for anyone who wants to understand, customize, or train model-specific tokenizers instead of treating them as black boxes. + +## Table of Contents + +- [What is Tokenization?](#what-is-tokenization) +- [The Tokenization Pipeline](#the-tokenization-pipeline) +- [Tokenization Algorithms](#tokenization-algorithms) +- [Accessing `tokenizers` through `transformers`](#accessing-tokenizers-through-transformers) +- [The Tokenizer Class Hierarchy in `transformers`](#the-tokenizer-class-hierarchy-in-transformers) +- [`AutoTokenizer` Automatically Selects the Correct Tokenizer Class](#autotokenizer-automatically-selects-the-correct-tokenizer-class) +- [v5 Separates Tokenizer Architecture from Trained Vocab](#v5-separates-tokenizer-architecture-from-trained-vocab) +- [Summary](#summary) + +> [!TIP] +> For experts: If you're already familiar with the concepts and want to understand the changes in v5, go to [v5 Separates Tokenizer Architecture from Trained Vocab](#v5-separates-tokenizer-architecture-from-trained-vocab) + +Before diving into the changes, let's quickly cover what tokenization does and how the pieces fit together. + +## What is tokenization? + + + +Language models don't read raw text. They consume sequences of integers usually called **token IDs or input IDs**. Tokenization is the process of converting raw text into these token IDs. + +Tokenization is a broad concept used across natural language processing and text processing generally. This post focuses specifically on tokenization for Large Language Models (LLMs) using the [`transformers`](https://github.com/huggingface/transformers) and [`tokenizers`](https://github.com/huggingface/tokenizers) libraries. 
```py
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")

text = "Hello world"
tokens = tokenizer(text)

print(tokens["input_ids"])
# [9906, 1917]

print(tokenizer.convert_ids_to_tokens(tokens["input_ids"]))
# ['Hello', 'Ġworld']
```

> [!NOTE]
> `Ġworld` (above) is a single token that represents the character sequence " world" (with the space).

A **token** is the smallest string unit the model sees. It can be a character, word, or subword chunk like "play" or "##ing" (the "##" prefix marks a piece that continues a word in WordPiece tokenizers, don't worry if you don't completely understand it now 🤗). The **vocabulary** maps each unique token to its token ID.

```py
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")
print(tokenizer.vocab)

# {'ÎĹÎľ': 106502, 'ĠPeel': 89694, '.languages': 91078, ...}
```

A good tokenizer *compresses* text into the smallest number of tokens. Fewer tokens mean more usable context without increasing model size. Training a tokenizer boils down to finding the best compression rules for your datasets. For example, if you work on a Chinese corpus you can sometimes find [very nice surprises 😉](https://x.com/suchenzang/status/1697862650053660721).

## The tokenization pipeline

Tokenization happens in stages. Each stage transforms text before passing it to the next:

| Stage | Purpose | Example |
| :---: | :---: | :---: |
| **Normalizer** | Standardizes text (lowercasing, unicode normalization, whitespace cleanup) | `"HELLO World"` → `"hello world"` |
| **Pre-tokenizer** | Splits text into preliminary chunks | `"hello world"` → `["hello", " world"]` |
| **Model** | Applies the tokenization algorithm (BPE, Unigram, etc.) | `["hello", " world"]` → `[9906, 1917]` |
| **Post-processor** | Adds special tokens (BOS, EOS, padding) | `[9906, 1917]` → `[1, 9906, 1917, 2]` |
| **Decoder** | Converts token IDs back to text | `[9906, 1917]` → `"hello world"` |

Each component is *independent*. You can swap [normalizers](https://huggingface.co/docs/tokenizers/en/api/normalizers) or change the [algorithm](https://huggingface.co/docs/tokenizers/en/api/models) without rewriting everything else.

> [!NOTE]
> You can access the Rust-based tokenizer through `_tokenizer`. We go into more depth about it in [this section](#tokenizersbackend-wraps-the-tokenizers-library).

```py
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-270m-it")

print(f"{tokenizer._tokenizer.normalizer=}")
# Replace(...)

print(f"{tokenizer._tokenizer.pre_tokenizer=}")
# Split(...)

print(f"{tokenizer._tokenizer.model=}")
# BPE(...)

print(f"{tokenizer._tokenizer.post_processor=}")
# TemplateProcessing(...)

print(f"{tokenizer._tokenizer.decoder=}")
# Sequence(decoders=[Replace(...), ByteFallback(), Fuse()])
```

## Tokenization algorithms

The following algorithms dominate modern language model tokenizers:

1. **Byte Pair Encoding (BPE)** iteratively merges the most frequent character pairs. This algorithm is deterministic and widely used. (Read more about [BPE](https://huggingface.co/learn/llm-course/en/chapter6/5))

```py
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")
print(tokenizer._tokenizer.model)

# BPE(...)
```

2. **Unigram** takes a probabilistic approach, selecting the most likely segmentation from a large initial vocabulary.
This makes it more flexible than the strictly deterministic BPE. (Read more about [Unigram](https://huggingface.co/learn/llm-course/en/chapter6/7))

```py
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-base")
print(tokenizer._tokenizer.model)

# Unigram(...)
```

3. **WordPiece** resembles BPE but uses a different, likelihood-based merge criterion. (Read more about [WordPiece](https://huggingface.co/learn/llm-course/en/chapter6/6))

```py
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer._tokenizer.model)

# WordPiece(...)
```

## Accessing tokenizers through transformers

The [`tokenizers`](https://github.com/huggingface/tokenizers) library is a Rust-based tokenization engine. It is fast, efficient, and completely language model agnostic. The library handles the mechanics of converting text into token IDs and back. The `tokenizers` library is a general-purpose tool that implements the tokenization algorithms, but does not implement the conventions that connect those algorithms to specific language models.

Consider what happens when you use `tokenizers` directly with the [`SmolLM3-3B`](http://hf.co/HuggingFaceTB/SmolLM3-3B) model:

```py
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")
text = "Hello world"
encodings = tokenizer.encode(text)

print(encodings.ids)
# [9906, 1917]
print(encodings.tokens)
# ['Hello', 'Ġworld']
```

The output is raw tokenization. You get token IDs and the string pieces they correspond to. Nothing more.

Now consider what's missing. `SmolLM3-3B` is a *conversational model*. When you interact with it, you typically structure your input as a conversation with roles like "user" and "assistant". The language model expects special formatting tokens to indicate these roles. The raw `tokenizers` library has no concept of any of this.

### How do you bridge the gap between raw tokenization and model requirements?

The `transformers` library bridges this gap. The library is primarily known as a model definition library, but it also provides a tokenizer abstraction layer that wraps the raw `tokenizers` backend and adds model-aware functionality.

Here's the same tokenization with the `transformers` wrapper:

```py
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")

# Format a conversation using the model's chat template
prompt = "Give me a brief explanation of gravity in simple terms."
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

print(text)

# <|im_start|>system
# ...
# <|im_start|>user
# Give me a brief explanation of gravity in simple terms.<|im_end|>
# <|im_start|>assistant

model_inputs = tokenizer([text], return_tensors="pt")
```

Notice how special tokens like `<|im_start|>` and `<|im_end|>` are applied to the prompt before tokenizing. These markers tell the model where each turn starts and ends.

The `transformers` tokenizer adds everything the raw library lacks:

* **Chat template application.** The `apply_chat_template` method formats conversations according to the model's expected format, inserting the correct special tokens and delimiters.
* **Automatic special token insertion.** Beginning-of-sequence and end-of-sequence tokens are added where the model expects them.
* **Truncation to context length.** You can specify `truncation=True` and the tokenizer will respect the model's maximum sequence length.
* **Batch encoding with padding.** Multiple inputs can be padded to the same length with the correct padding token and direction.
* **Return format options.** You can request PyTorch tensors (`return_tensors="pt"`), NumPy arrays, and others.

> [!NOTE]
> `transformers` implements the tokenization API that is most commonly used in the entire ML community (`encode`, `decode`, `convert_tokens_to_ids`, etc.).

## The tokenizer class hierarchy in transformers

The `transformers` library organizes tokenizers into a class hierarchy. At the top sits a base class that defines the common interface. Below it, backend classes handle the actual tokenization using different engines. At the bottom, model-specific classes configure the backends for particular models.

| ![class hierarchy](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/tokenizers/hierarchy.png) |
| :--: |
| The class hierarchy for tokenizers inside transformers |

### `PreTrainedTokenizerBase` defines the common interface for all tokenizers

[`PreTrainedTokenizerBase`](https://github.com/huggingface/transformers/blob/7f52a2a4ea8ab49b7f069df7fac58a5b280d4919/src/transformers/tokenization_utils_base.py#L964C7) is the abstract base class for all tokenizers in `transformers`. It defines the interface that every tokenizer must implement.

The base class handles functionality that doesn't depend on the tokenization backend:

* **Special token properties.** Properties like `bos_token`, `eos_token`, `pad_token`, and `unk_token` are defined here. These properties provide access to the special tokens that models use to mark sequence boundaries and handle unknown inputs.
* **Encoding interface.** The `__call__`, `encode`, and `encode_plus` methods are defined here. These methods accept text input and return token IDs along with attention masks and other metadata.
* **Decoding interface.** The `decode` and `batch_decode` methods convert token IDs back to text.
* **Serialization.** The `save_pretrained` and `from_pretrained` methods handle downloading the correct files, reading their contents, and saving tokenizers to disk.
* **Chat template support.** The `apply_chat_template` method lives here, formatting conversations according to Jinja templates stored in the tokenizer configuration.

Every tokenizer in `transformers` ultimately inherits from `PreTrainedTokenizerBase`. The base class ensures consistent behavior across all tokenizers, regardless of which backend they use for the actual tokenization.

### `TokenizersBackend` wraps the `tokenizers` library

[`TokenizersBackend`](https://github.com/huggingface/transformers/blob/7f52a2a4ea8ab49b7f069df7fac58a5b280d4919/src/transformers/tokenization_utils_tokenizers.py#L80C7) is the primary backend class for most modern tokenizers. It inherits from `PreTrainedTokenizerBase` and wraps the Rust-based `tokenizers` library.

The class stores the Rust tokenizer object internally:

```py
class TokenizersBackend(PreTrainedTokenizerBase):
    def __init__(self, tokenizer_object, ...):
        self._tokenizer = tokenizer_object  # The Rust tokenizer
        ...
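        # (signature abridged) `self._tokenizer` is the same Rust object you
        # inspected earlier as `tokenizer._tokenizer`: the normalizer,
        # pre-tokenizer, model, post-processor, and decoder all live on it.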
+``` + +When you call encoding methods on a `TokenizersBackend` tokenizer, the class delegates the actual tokenization to the Rust backend: + +```py +def _batch_encode_plus(self, batch_text_or_text_pairs, ...): + encodings = self._tokenizer.encode_batch(batch_text_or_text_pairs, ...) + ... +``` + +The Rust backend performs computationally intensive work, while the Python wrapper adds the model-aware features on top. + +Many model-specific tokenizers inherit from `TokenizersBackend`, examples include: + +* `LlamaTokenizer` +* `GemmaTokenizer` + +These model-specific classes configure the backend with the correct vocabulary, merge rules, special tokens, and normalization settings for their respective models. + +### `PythonBackend` provides a pure-Python mixin + +[`PythonBackend`](https://github.com/huggingface/transformers/blob/7f52a2a4ea8ab49b7f069df7fac58a5b280d4919/src/transformers/tokenization_python.py#L400) inherits from `PreTrainedTokenizerBase` and implements tokenization in pure Python. The class is aliased as [`PreTrainedTokenizer`](https://github.com/huggingface/transformers/blob/7f52a2a4ea8ab49b7f069df7fac58a5b280d4919/src/transformers/tokenization_python.py#L1400C1). + +The pure-Python backend exists for several reasons: + +* **Custom tokenization logic.** Some models require tokenization behavior that doesn't fit the standard `tokenizers` pipeline. +* **Legacy compatibility.** Older model implementations may rely on Python-specific behavior. + +> [!NOTE] +> The Python backend is slower than the Rust backend. For most use cases, the Rust-backed `TokenizersBackend` is preferred. + +Model-specific tokenizers that inherit from `PythonBackend` (or its alias `PreTrainedTokenizer`) include some older or specialized models, like: + +* `CTRLTokenizer` +* `CanineTokenizer` + +### `SentencePieceBackend` handles SentencePiece models + +[`SentencePieceBackend`](https://github.com/huggingface/transformers/blob/7f52a2a4ea8ab49b7f069df7fac58a5b280d4919/src/transformers/tokenization_utils_sentencepiece.py#L46) inherits from `PythonBackend` and provides integration with Google's [SentencePiece](https://github.com/google/sentencepiece) library. SentencePiece is a standalone tokenization library that many models use, particularly those trained by Google. + +The backend wraps a SentencePiece processor: + +```py +class SentencePieceBackend(PythonBackend): + def __init__(self, vocab_file, ...): + self.sp_model = spm.SentencePieceProcessor() + self.sp_model.Load(vocab_file) + ... +``` + +Models that use SentencePiece tokenization inherit from this backend. Examples include: + +* `SiglipTokenizer` +* `BartphoTokenizer` + +The SentencePiece backend inherits from `PythonBackend` rather than directly from `PreTrainedTokenizerBase` because it shares much of the same interface and padding/truncation logic. + +## AutoTokenizer automatically selects the correct tokenizer class + +[`AutoTokenizer`](https://github.com/huggingface/transformers/blob/7f52a2a4ea8ab49b7f069df7fac58a5b280d4919/src/transformers/models/auto/tokenization_auto.py#L531) is the recommended entry point for loading tokenizers. It automatically determines which tokenizer class to use for a given model and returns an instance of that class. + +```py +from transformers import AutoTokenizer + +tokenizer = AutoTokenizer.from_pretrained("gpt2") +``` + +Behind the scenes, `AutoTokenizer` performs these steps: + +1. 
**Download the tokenizer configuration.** The `from_pretrained` method fetches `tokenizer_config.json` from the Hub (or from a local directory).
2. **Identify the model type.** The configuration contains metadata that [identifies the model type](https://huggingface.co/openai-community/gpt2/blob/main/config.json#L12) (e.g., "gpt2", "llama", "bert").
3. **Look up the tokenizer class.** `AutoTokenizer` maintains a mapping called [`TOKENIZER_MAPPING_NAMES`](https://github.com/huggingface/transformers/blob/7f52a2a4ea8ab49b7f069df7fac58a5b280d4919/src/transformers/models/auto/tokenization_auto.py#L64) that maps model types to tokenizer class names:

```py
TOKENIZER_MAPPING_NAMES = {
    "gpt2": "GPT2Tokenizer",
    "llama": "LlamaTokenizer",
    "bert": "BertTokenizer",
    ...
}
```

4. **Instantiate the correct class.** `AutoTokenizer` imports the appropriate tokenizer class and calls its `from_pretrained` method.
5. **Return the configured tokenizer.** You receive a fully configured, model-specific tokenizer ready for use.

> [!NOTE]
> The benefit of `AutoTokenizer` is that you don't need to know which tokenizer class a model uses. Whether a model uses `LlamaTokenizer`, `GPT2Tokenizer`, or `BertTokenizer`, the same `AutoTokenizer.from_pretrained("model-name")` call works.

The tokenizer system in `transformers` forms a layered architecture:

| Layer | Component | Responsibility |
| :---: | :---: | :---: |
| Entry Point | `AutoTokenizer` | Automatically selects and instantiates the correct tokenizer class |
| Model-Specific | `LlamaTokenizer`, `GPT2Tokenizer`, etc. | Configures the backend with the model-specific normalizer, pre-tokenizer, special tokens, and other settings |
| Backend | `TokenizersBackend`, `PythonBackend`, `SentencePieceBackend` | Implements the actual tokenization using a specific engine |
| Base | `PreTrainedTokenizerBase` | Defines the common interface and shared functionality |
| Engine | `tokenizers` (Rust), SentencePiece, Pure Python | Performs raw tokenization |

## v5 Separates Tokenizer Architecture from Trained Vocab

The most significant change in Transformers v5 is a philosophical shift in how tokenizers are defined. **Tokenizers now work like PyTorch's `nn.Module`**: you define the architecture first, then fill it with learned parameters.

### The problem with v4: tokenizers were opaque and tightly coupled

In v4, tokenizers were black boxes tied to pretrained checkpoint files. If you loaded `LlamaTokenizerFast`, you couldn't easily answer basic questions about it:

* Is it BPE or Unigram?
* How does it normalize text?
* What pre-tokenization strategy does it use?
* What are the special tokens and their positions?

The `__init__` method gave no clues. You had to dig through serialized files or external documentation to understand what the tokenizer actually did.

| ![v4 llama](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/tokenizers/v4-llama.png) |
| :--: |
| `LlamaTokenizerFast` as seen in v4 `transformers` |

v4 also maintained two parallel implementations for every model, as the sketch after this list shows:

1. a "slow" Python tokenizer (`LlamaTokenizer` inheriting from `PreTrainedTokenizer`) and
2. a "fast" Rust-backed tokenizer (`LlamaTokenizerFast` inheriting from `PreTrainedTokenizerFast`).
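To make the duplication concrete, here is a minimal sketch of v4-era usage (the checkpoint name is a placeholder; in v4, the `use_fast` flag switched between the two implementations):

```py
# v4 (illustrative): one checkpoint, two parallel tokenizer classes
from transformers import AutoTokenizer

slow = AutoTokenizer.from_pretrained("org/llama-checkpoint", use_fast=False)  # pure-Python class
fast = AutoTokenizer.from_pretrained("org/llama-checkpoint", use_fast=True)   # Rust-backed "Fast" class
```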
This meant:

* **Two files per model** (e.g., `tokenization_llama.py` and `tokenization_llama_fast.py`)
* **Code duplication** across hundreds of models
* **Behavioral discrepancies** between slow and fast versions, leading to subtle bugs
* **A growing test suite** dedicated to verifying that slow and fast tokenizers produced identical outputs
* **User confusion** about which tokenizer to use and when

Worst of all, you couldn't create an empty tokenizer architecture. If you wanted to train a LLaMA-style tokenizer on your own data, there was no clean way to instantiate a "blank" LLaMA tokenizer and fill it with your vocabulary and merges. Tokenizers existed only as loaded checkpoints, not as configurable templates.

### The v5 solution: architecture and parameters are now separate

v5 treats tokenizer architecture (normalizer, pre-tokenizer, model type, post-processor, decoder) as distinct from trained parameters (vocabulary, merges). This mirrors how PyTorch separates model architecture from learned weights.

**With `nn.Module`, you define layers first:**

```py
from torch import nn

# Example sizes; the point is that the architecture exists before training
vocab_size, embed_dim, hidden_dim = 32000, 256, 512

model = nn.Sequential(
    nn.Embedding(vocab_size, embed_dim),
    nn.Linear(embed_dim, hidden_dim),
)
# Architecture defined; weights initialized randomly or loaded later
```

**v5 tokenizers follow the same pattern:**

```py
from transformers import LlamaTokenizer

# Instantiate the architecture
tokenizer = LlamaTokenizer()

# Train on your own data to fill in vocab and merges
tokenizer.train(files=["my_corpus.txt"])
```

The tokenizer class now explicitly declares its structure. Looking at `LlamaTokenizer` in v5, you can immediately see:

* [It uses **BPE**](https://github.com/huggingface/transformers/blob/0a8465420eecbac1c6d7dd9f45c08dd96b8c5027/src/transformers/models/llama/tokenization_llama.py#L92) as its tokenization model
* It may add a **prefix space** before text
* Its special tokens (`unk`, `bos`, `eos`) sit at specific vocabulary positions
* [It does **not normalize**](https://github.com/huggingface/transformers/blob/0a8465420eecbac1c6d7dd9f45c08dd96b8c5027/src/transformers/models/llama/tokenization_llama.py#L121) input text
* [Its decoder](https://github.com/huggingface/transformers/blob/0a8465420eecbac1c6d7dd9f45c08dd96b8c5027/src/transformers/models/llama/tokenization_llama.py#L122) replaces the metaspace character `▁` with spaces

| ![v5 llama](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/tokenizers/v5-llama.png) |
| :--: |
| `LlamaTokenizer` as seen in v5 `transformers` |

This transparency was impossible in v4, where the same information was buried in serialized files.

### One file, one backend, one recommended path

v5 consolidates the two-file system *into a single file per model*. `LlamaTokenizer` now inherits from `TokenizersBackend`, which wraps the Rust-based tokenizer that was previously exposed as the “fast” implementation and is now the default.

The former “slow” Python implementation lives explicitly behind `PythonBackend`, and `SentencePieceBackend` remains for models that require it, but **Rust-backed tokenization is the preferred default**.

This change eliminates:

* Duplicate code across slow/fast implementations
* The confusing `Tokenizer` vs `TokenizerFast` naming convention
* Test suites dedicated to checking slow-fast parity

Users now have one clear entry point, sketched below.
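A minimal sketch of that single path (reusing the SmolLM3 checkpoint from earlier; the exact class name printed depends on the model):

```py
# v5 (illustrative): one entry point, one tokenizer class per model
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")
print(type(tokenizer).__name__)  # a single model-specific, Rust-backed class
```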
Advanced users who need to customize can still access lower-level components, but the library no longer forces everyone to navigate two parallel implementations.

### You can now train model-specific tokenizers from scratch

Suppose you want a tokenizer that behaves exactly like LLaMA's (same normalization, same pre-tokenization, same BPE model type) but trained on a domain-specific corpus: medical text, legal documents, a new language. In v4, this required manually reconstructing the tokenizer pipeline from low-level `tokenizers` library primitives. In v5, you can instantiate the architecture directly and train it:

```py
from transformers import LlamaTokenizer
from datasets import load_dataset

# Initialize blank tokenizer
tokenizer = LlamaTokenizer()

dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

# Stream the corpus in chunks of 1,000 examples
def get_training_corpus():
    batch = 1000
    for i in range(0, len(dataset), batch):
        yield dataset[i : i + batch]["text"]

trained_tokenizer = tokenizer.train_new_from_iterator(
    text_iterator=get_training_corpus(),
    vocab_size=32000,
    length=len(dataset),
    show_progress=True,
)

trained_tokenizer.push_to_hub("my_custom_tokenizer")

tokenizer = LlamaTokenizer.from_pretrained("my_custom_tokenizer")
```

The resulting tokenizer will have your custom vocabulary and merge rules, but will process text identically to how a standard LLaMA tokenizer would: the same whitespace handling, the same special token conventions, the same decoding behavior.

| Aspect | V4 | V5 |
| :---: | :---: | :---: |
| Files per model | Two (`tokenization_X.py`, `tokenization_X_fast.py`) | One (`tokenization_X.py`) |
| Default backend | Split between Python and Rust | Rust (`TokenizersBackend`) preferred |
| Architecture visibility | Hidden in serialized files | Explicit in class definition |
| Training from scratch | Required manual pipeline construction | `tokenizer.train(files=[...])` |
| Component inspection | Difficult, undocumented | Direct properties (`tokenizer.normalizer`, etc.) |
| Parent classes | `PreTrainedTokenizer`, `PreTrainedTokenizerFast` | `TokenizersBackend` (or `SentencePieceBackend`, `PythonBackend`) |

The shift from "tokenizers as loaded checkpoints" to "tokenizers as configurable architectures" makes the library more modular, more transparent, and more aligned with how practitioners think about building ML systems.

## Summary

Transformers v5 brings three improvements to tokenization:

1. **One file per model** instead of separate slow/fast implementations
2. **Visible architecture** so you can inspect normalizers, pre-tokenizers, and decoders
3. **Trainable templates** that let you create custom tokenizers matching any model's design

The wrapper layer between `tokenizers` and Transformers remains essential. It adds the model awareness (context lengths, chat templates, special tokens) that raw tokenization doesn't provide. v5 just makes that layer clearer and more customizable.

If you are looking to learn more about tokenization, here are some resources:
- [Let's build the GPT Tokenizer](https://youtu.be/zduSFxRajkE?si=ZAfCjZjpyPHsnyfF)
- [Gotchas in Tokenizer Behavior Every Developer Should Know](https://huggingface.co/blog/qgallouedec/gotchas-in-tokenizer-behavior)
- [Chat Templates](https://huggingface.co/blog/chat-templates)
- [A list of resources we have gathered from the community!](https://x.com/ariG23498/status/1999058214906888237) \ No newline at end of file