This repository provides tools for SCRIPT encoding-based pre-tokenization and BPE, as well as regular byte-based BPE. For details of the methods, see our paper: *BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization*.

It includes the following components:
- `script_bpe`: Core modules for SCRIPT encoding and tokenization.
  - `pretokenize/`: Pre-tokenizers. These handle both chunking and encoding to 'base tokens' (i.e. bytes or script/index); a sketch follows this list.
    - `bytes_gpt4` / `bytes_gpt4o`: Classic regex + UTF-8 based tokenizers, the most obvious point of reference.
    - `bytes_gpt4o_cb`: Variant with character-boundary merge limitations, which prevent partial+full character merges and enforce left-to-right merging within characters.
    - `bytes_nosplit_cb`: Variant with no pre-tokenization chunking. Very slow, mainly for limited ablations.
    - `scriptenc`: SCRIPT encoding-based encoding and pre-tokenization.
    - `scriptenc_cb`: Variant with character-boundary merge limitations (no partial+full merges; merging into a full character is enforced first). This is the proposed algorithm.
    - `scriptenc_gpt4o_cb`: Variant that does pre-tokenization chunking with regex but then uses SCRIPT encoding (regex chunked, script encoded). For ablation testing.
    - `scriptenc_nosplit_cb`: Variant with no pre-tokenization chunking. Very slow, mainly for limited ablations.
  - `encoding/`: SCRIPT encoding utilities.
  - `bpe/`: Byte Pair Encoding (BPE) implementation.
  - `stats`: Tokenizer performance metrics.
  - `corpus/`: `PretokenizedCorpus` represents a pretokenized, sharded training dataset as a `base token encoded chunk -> count` mapping.
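To make the base-token idea concrete, here is a minimal illustrative sketch of regex chunking followed by the two kinds of base-token encodings named above, plus the `chunk -> count` aggregation. The regex, the script table, and all names here are simplified placeholders, not the repository's actual implementation.

```python
import re
from collections import Counter

# Simplified stand-in for a GPT-4o-style pre-tokenization pattern.
CHUNK_RE = re.compile(r" ?\w+| ?[^\w\s]+|\s+")

# Toy script table; a real SCRIPT encoding derives scripts from Unicode data.
SCRIPT_RANGES = {"Hangul": (0xAC00, 0xD7A3), "Latin": (0x0041, 0x007A)}

def to_bytes(chunk: str) -> tuple[int, ...]:
    """Byte base tokens: the chunk's UTF-8 bytes."""
    return tuple(chunk.encode("utf-8"))

def to_script_index(chunk: str) -> tuple:
    """SCRIPT-style base tokens: one (script, index) pair per character."""
    pairs = []
    for ch in chunk:
        cp = ord(ch)
        for script, (lo, hi) in SCRIPT_RANGES.items():
            if lo <= cp <= hi:
                pairs.append((script, cp - lo))
                break
        else:
            pairs.append(("Other", cp))
    return tuple(pairs)

def build_corpus(texts, encode):
    """Aggregate a base-token-encoded `chunk -> count` table (cf. PretokenizedCorpus)."""
    counts = Counter()
    for text in texts:
        counts.update(encode(chunk) for chunk in CHUNK_RE.findall(text))
    return counts

print(build_corpus(["the cat sat on the mat"], to_bytes).most_common(2))
# e.g. (32, 116, 104, 101) -> 1 for the chunk " the"
```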
Ensure you have `uv` installed; it should take care of the rest.
To explore the available options for training, run:
uv run train --help
To train a tokenizer using a specific corpus, use:
uv run train --corpus <corpus_name> -n <number of merge rules> --pretokenizer <pretokenizer_name>
For example, `--corpus kor_hang_300mb --pretokenizer scriptenc_cb`.
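For intuition about what training computes, below is a minimal, self-contained sketch of the core BPE loop over a `chunk -> count` table like the one above. It omits the character-boundary merge constraints and the efficiency tricks a real implementation needs, and all names are illustrative, not the repository's API.

```python
from collections import Counter

def train_bpe(corpus: dict[tuple[int, ...], int], num_merges: int) -> list[tuple[int, int]]:
    """Learn merge rules by repeatedly fusing the most frequent adjacent pair."""
    chunks = dict(corpus)
    merges: list[tuple[int, int]] = []
    next_id = 256  # first id after the 256 byte base tokens (SCRIPT base tokens would differ)
    for _ in range(num_merges):
        # Count adjacent base-token pairs, weighted by chunk frequency.
        pairs: Counter = Counter()
        for chunk, count in chunks.items():
            for pair in zip(chunk, chunk[1:]):
                pairs[pair] += count
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        merges.append(best)
        # Apply the new merge rule to every chunk in the corpus.
        new_chunks: dict[tuple[int, ...], int] = {}
        for chunk, count in chunks.items():
            out, i = [], 0
            while i < len(chunk):
                if i + 1 < len(chunk) and (chunk[i], chunk[i + 1]) == best:
                    out.append(next_id)
                    i += 2
                else:
                    out.append(chunk[i])
                    i += 1
            key = tuple(out)
            new_chunks[key] = new_chunks.get(key, 0) + count
        chunks = new_chunks
        next_id += 1
    return merges
```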
The directory `paper_utils` contains scripts to reproduce the paper's results from scratch.
To remove the checked-in results and reproduce everything, you can run:
# rm -r results/ # for reproduction from scratch.
bash paper_utils/train_monolingual.sh # Uses GNU parallel, make sure it is installed
bash paper_utils/train_multilingual.sh
The notebooks in the same directory can then be used to reproduce the tables and figures.
- An interesting explanation of UTF-8 is given by Computerphile
- For more information on Unicode character properties, refer to the Wikipedia article.
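As a quick complement to that article, Python's standard `unicodedata` module exposes some of these properties directly (general category and character name; note that it does not expose the Script property, which SCRIPT-style encodings must obtain from Unicode data elsewhere):

```python
# Inspecting Unicode character properties with the standard library.
import unicodedata

for ch in ["A", "가", "٣"]:
    print(ch, unicodedata.category(ch), unicodedata.name(ch))
# A Lu LATIN CAPITAL LETTER A
# 가 Lo HANGUL SYLLABLE GA
# ٣ Nd ARABIC-INDIC DIGIT THREE
```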