The following steps prepare the Wikipedia corpus for pretraining. However, they can be used with little or no modification to preprocess other datasets as well:
- Download the wiki dump file from https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2. This is a compressed `.bz2` archive and needs to be decompressed first.
- Clone WikiExtractor, and run it:

  ```
  git clone https://github.com/attardi/wikiextractor
  python3 wikiextractor/WikiExtractor.py -o out -b 1000M enwiki-latest-pages-articles.xml
  ```

  Running time can be 5-10 minutes/GB. Output: `out` directory.
- Run:

  ```
  ln -s out out2
  python3 AzureML-BERT/pretrain/PyTorch/dataprep/single_line_doc_file_creation.py
  ```

  This script removes HTML tags and empty lines and outputs to one file where each line is a paragraph (a minimal sketch of this step is shown after this list). (`pip install tqdm` if needed.) Output: `wikipedia.txt`.
- Run:

  ```
  python3 AzureML-BERT/pretrain/PyTorch/dataprep/sentence_segmentation.py wikipedia.txt wikipedia.segmented.nltk.txt
  ```

  This script converts `wikipedia.txt` to one file where each line is a sentence (see the sentence-segmentation sketch after this list). (`pip install nltk` if needed.) Output: `wikipedia.segmented.nltk.txt`.
- Split the above output file into ~100 files by line with (a sharding sketch is shown after this list):

  ```
  mkdir data_shards
  python3 AzureML-BERT/pretrain/PyTorch/dataprep/split_data_into_files.py
  ```

  Output: `data_shards` directory.
- Run:

  ```
  python3 AzureML-BERT/pretrain/PyTorch/dataprep/create_pretraining.py --input_dir=data_shards --output_dir=pickled_pretrain_data --do_lower_case=true
  ```

  This script will convert each file into a pickled `.bin` file (a simplified sketch is shown after this list). Output: `pickled_pretrain_data` directory.
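
Below is a minimal sketch of what the paragraph-flattening step does, assuming the usual WikiExtractor output layout (`out/AA/wiki_00`, `out/AB/wiki_01`, ...). The real `single_line_doc_file_creation.py` may differ in details such as its use of `tqdm` progress bars and how it merges document text.

```python
import glob
import re

# Matches the <doc ...> / </doc> wrapper lines that WikiExtractor emits.
DOC_TAG = re.compile(r"^</?doc\b.*>$")

with open("wikipedia.txt", "w", encoding="utf-8") as out_file:
    for path in sorted(glob.glob("out/*/wiki_*")):
        with open(path, encoding="utf-8") as in_file:
            for line in in_file:
                line = line.strip()
                # Skip the HTML-like document tags and empty lines;
                # everything left is a paragraph, written out as one line.
                if not line or DOC_TAG.match(line):
                    continue
                out_file.write(line + "\n")
```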
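
The sentence-segmentation step can be approximated with NLTK's `punkt` tokenizer, as in the sketch below. Like the real `sentence_segmentation.py`, it takes the input and output paths as command-line arguments, but its exact behavior is an assumption here.

```python
import sys
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")  # one-time download of the punkt sentence tokenizer model

input_path, output_path = sys.argv[1], sys.argv[2]

with open(input_path, encoding="utf-8") as in_file, \
        open(output_path, "w", encoding="utf-8") as out_file:
    for paragraph in in_file:
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # Split each paragraph into sentences; write one sentence per line.
        for sentence in sent_tokenize(paragraph):
            out_file.write(sentence + "\n")
```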
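
Sharding the segmented corpus into ~100 files amounts to slicing it by line, for example as below. The shard file names and the contiguous-slice strategy are assumptions, not necessarily what `split_data_into_files.py` does; contiguous slices are used here so that adjacent sentences stay in the same shard.

```python
import os

INPUT_PATH = "wikipedia.segmented.nltk.txt"
NUM_SHARDS = 100
os.makedirs("data_shards", exist_ok=True)

# First pass: count lines so each shard gets a roughly equal contiguous slice.
with open(INPUT_PATH, encoding="utf-8") as in_file:
    total_lines = sum(1 for _ in in_file)
lines_per_shard = total_lines // NUM_SHARDS + 1

# Second pass: write the slices as data_shards/shard_000.txt, shard_001.txt, ...
with open(INPUT_PATH, encoding="utf-8") as in_file:
    for shard_index in range(NUM_SHARDS):
        shard_path = os.path.join("data_shards", f"shard_{shard_index:03d}.txt")
        with open(shard_path, "w", encoding="utf-8") as shard_file:
            for _ in range(lines_per_shard):
                line = in_file.readline()
                if not line:
                    break
                shard_file.write(line)
```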
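
Finally, a heavily simplified illustration of the pickling step: each shard is tokenized and dumped to a `.bin` file. The real `create_pretraining.py` builds complete BERT pretraining examples (masked tokens, next-sentence pairs, etc.); the Hugging Face `BertTokenizer` and the output format used below are assumptions for illustration only.

```python
import glob
import os
import pickle

from transformers import BertTokenizer

# Lower-casing mirrors the --do_lower_case=true flag in the step above.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True)
os.makedirs("pickled_pretrain_data", exist_ok=True)

for shard_path in sorted(glob.glob("data_shards/*.txt")):
    examples = []
    with open(shard_path, encoding="utf-8") as shard_file:
        for sentence in shard_file:
            tokens = tokenizer.tokenize(sentence.strip())
            if tokens:
                examples.append(tokenizer.convert_tokens_to_ids(tokens))
    # One pickled .bin file per input shard.
    out_name = os.path.splitext(os.path.basename(shard_path))[0] + ".bin"
    with open(os.path.join("pickled_pretrain_data", out_name), "wb") as out_file:
        pickle.dump(examples, out_file)
```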