BabyLM: Training BPE Tokenizer

This repository contains code for training a BPE tokenizer on the BabyLM 10M corpus. To train a tokenizer, clone this repository, install the requirements, and run the following command:

python scripts/train_bbpe.py

The code is based on BabyBERTa.
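
For readers unfamiliar with the technique, here is a minimal pure-Python sketch of how byte-level BPE merges can be learned. It is not the code in scripts/train_bbpe.py, and it assumes that "bbpe" in the script name stands for byte-level BPE. The function names train_byte_bpe, merge_pair, and decode are hypothetical, chosen for illustration only. Real tokenizer trainers typically add pre-tokenization, special tokens, and a target vocabulary size, none of which are shown here.

```python
from collections import Counter


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_byte_bpe(text, num_merges):
    """Learn up to `num_merges` merges from `text` (toy example, not the repo's script)."""
    ids = list(text.encode("utf-8"))  # start from raw bytes, ids 0..255
    merges = {}  # (left_id, right_id) -> new token id, in merge order
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so another merge would not shorten anything
        new_id = 256 + len(merges)
        merges[pair] = new_id
        ids = merge_pair(ids, pair, new_id)
    return merges, ids


def decode(ids, merges):
    """Map token ids back to text by expanding merges into their byte strings."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():  # dicts keep insertion order
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8")


if __name__ == "__main__":
    sample = "low lower lowest low low"
    merges, ids = train_byte_bpe(sample, num_merges=5)
    print(len(sample.encode("utf-8")), "bytes ->", len(ids), "tokens")
    print(decode(ids, merges) == sample)
```

Starting from bytes rather than characters means every input string can be encoded without an unknown token, which is the usual motivation for the byte-level variant.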
