We've provided several scripts for pretraining BERT, GPT, CPM, T5 and Turing-NLG in examples directory.
The training data requires preprocessing. First, place your training data in a loose json format, with one json containing a text sample per line. For example:
{"src": "The Internet", "text": "jumps over the lazy dog", "type": "Eng", "id": "42", "title": "Second Part"}
The name of the text field of the json can be changed by using the --json-key flag in preprocess_data.py The other metadata are optional and are not used in training.
The loose json is then processed into a binary format for training. To convert the json into mmap, cached index file, or the lazy loader format use preprocess_data.py. Set the --dataset-impl flag to mmap, cached, or lazy, respectively (default is mmap). An example script to prepare data for BERT training is:
python tools/preprocess_data.py \
--input my-corpus.json \
--output-prefix my-bert \
--vocab bert-vocab.txt \
--dataset-impl mmap \
--tokenizer-type BertWordPieceLowerCase \
--split-sentences
The output will be two files named, in this case, my-bert_text_sentence.bin and my-bert_text_sentence.idx. The --data-path specified in later BERT training is the full path and new filename, but without the file extension.
For T5 use the same preprocessing as BERT, perhaps renaming it to:
--output-prefix my-t5 \
Some minor modifications are required for GPT data preprocessing, namely, the addition of a merge table, an end-of-document token, removal of sentence splitting, and a change to the tokenizer type:
python tools/preprocess_data.py \
--input my-corpus.json \
--output-prefix my-gpt2 \
--vocab gpt2-vocab.json \
--dataset-impl mmap \
--tokenizer-type GPT2BPETokenizer \
--merge-file gpt2-merges.txt \
--append-eod
Here the output files are named my-gpt2_text_document.bin and my-gpt2_text_document.idx. As before, in GPT training, use the longer name without the extension as --data-path.
Further command line arguments are described in the source file preprocess_data.py.