BabyLM: Training BPE Tokenizer

This repository contains code for training a tokenizer on the BabyLM 10M corpus. To train a tokenizer, clone this repository, install the requirements, and run the following command:

python scripts/train_bbpe.py

The code is based on BabyBERTa.
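
For readers new to BPE, the following is a minimal, standard-library-only sketch of how merge rules can be learned from text. It is not the contents of `scripts/train_bbpe.py`. The function name `train_bpe`, its parameters, and the toy corpus are hypothetical, and the choice of byte-level symbols is an assumption drawn only from the script's name.

```python
from collections import Counter


def train_bpe(texts, num_merges):
    """Learn up to `num_merges` byte-level BPE merges from an iterable of strings.

    Hypothetical helper for illustration; not the repository's implementation.
    """
    # Represent each whitespace-separated word as a tuple of single-byte tokens.
    word_counts = Counter()
    for text in texts:
        for word in text.split():
            word_counts[tuple(bytes([b]) for b in word.encode("utf-8"))] += 1

    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, count in word_counts.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # every word is already a single token

        # Take the most frequent pair; ties are broken by byte order
        # so that the result is deterministic.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)

        # Rewrite every word with the chosen pair fused into one token.
        merged_counts = Counter()
        for symbols, count in word_counts.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_counts[tuple(out)] += count
        word_counts = merged_counts
    return merges


if __name__ == "__main__":
    # Toy corpus for illustration only; the real input would be the BabyLM 10M text.
    toy_corpus = ["low low low lower", "lowest low"]
    print(train_bpe(toy_corpus, num_merges=3))
```

Each returned merge is a pair of byte strings in the order it was learned. A production tokenizer would also handle pre-tokenization, special tokens, and a vocabulary file, which this sketch leaves out.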