To pretrain BERT language representation models on AzureML, the following artifacts are required:
- Azure Machine Learning Workspace with an AzureML Compute cluster with 64 V100 GPUs (either 16 x `NC24s_v3` or 8 x `ND40_v2` VMs). Note that by default your subscription might not have enough quota, and you will likely need to submit a support ticket to request additional quota by following the guide here (a minimal provisioning sketch is shown after this list).
- Preprocessed data: the BERT paper references the Wikipedia and BookCorpus datasets for pretraining. The notebook in this pretraining recipe is configured to use the Wikipedia dataset only, but it can be used with other datasets as well, including custom datasets. The preprocessed data should be available in a `Datastore` registered to the AzureML `Workspace` that will be used for BERT pretraining. A preprocessed Wikipedia corpus is made available for use with the pretraining recipe in this repo; refer to the instructions to access the preprocessed Wikipedia corpus for pretraining. You can copy the Wikipedia dataset from that location to another Azure blob container and register it as a datastore in the workspace before using it in the pretraining job (a minimal registration sketch is shown after this list). Alternatively, you can preprocess the data from scratch (refer to the instructions on this), upload it to an Azure blob container, and use that as the datastore for the pretraining job. Note that other datasets can also be used with little or no modification to this pretraining recipe.
- Job configuration to define the parameters for the pretraining job. Refer to the configs directory for the different configuration settings (`BERT-base` vs. `BERT-large`, single-node configurations for debugging vs. multi-node configurations for production-ready pretraining).
- Code to pretrain the BERT model in AzureML. The notebook to submit a pretraining job to AzureML is available at BERT_Pretrain.ipynb.
The BERT_Pretrain.ipynb notebook contains the recipe to submit a BERT-large pretraining job to the AzureML service and monitor its metrics in TensorBoard. Illustrative SDK sketches of the compute, datastore, and job-submission steps are shown below.
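
As a reference for the compute requirement above, here is a minimal sketch of provisioning an AzureML Compute cluster of 16 `NC24s_v3` nodes (64 V100 GPUs in total) with the AzureML Python SDK. The cluster name is a placeholder and the node counts are illustrative, not taken from this recipe's configuration.

```python
from azureml.core import Workspace
from azureml.core.compute import AmlCompute, ComputeTarget

# Assumes a config.json for the workspace is available in the working directory.
ws = Workspace.from_config()

cluster_name = "bert-pretrain-cluster"  # hypothetical cluster name

try:
    # Reuse the cluster if it already exists in the workspace.
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print("Found existing compute target.")
except Exception:
    # 16 x NC24s_v3 nodes = 64 V100 GPUs (4 GPUs per node).
    compute_config = AmlCompute.provisioning_configuration(
        vm_size="Standard_NC24s_v3",
        min_nodes=0,
        max_nodes=16,
    )
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)
    compute_target.wait_for_completion(show_output=True)
```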
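
Similarly, the blob container holding the preprocessed corpus can be registered as a datastore in the workspace with the SDK. The datastore, container, and storage-account names below are placeholders; use the container to which you copied the preprocessed Wikipedia data.

```python
from azureml.core import Datastore, Workspace

ws = Workspace.from_config()

# Register the blob container with the preprocessed data as a workspace datastore.
ds = Datastore.register_azure_blob_container(
    workspace=ws,
    datastore_name="bert_pretrain_data",     # hypothetical datastore name
    container_name="bert-wikipedia",         # your blob container
    account_name="<storage-account-name>",
    account_key="<storage-account-key>",     # or pass sas_token=... instead
)
print(ds.name, ds.datastore_type)
```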
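
Finally, a rough sketch of what the notebook's job submission amounts to, assuming an MPI-based distributed run across the cluster above. The entry script, arguments, environment name, and experiment name are assumptions for illustration only; the actual recipe and parameters live in BERT_Pretrain.ipynb and the configs directory.

```python
from azureml.core import Environment, Experiment, ScriptRunConfig, Workspace
from azureml.core.runconfig import MpiConfiguration

ws = Workspace.from_config()

# Curated environment name is illustrative; the notebook defines its own environment.
env = Environment.get(workspace=ws, name="AzureML-PyTorch-1.6-GPU")

run_config = ScriptRunConfig(
    source_directory=".",
    script="train.py",                                         # hypothetical entry script
    arguments=["--config_file", "configs/bert-large.json"],    # hypothetical config path
    compute_target="bert-pretrain-cluster",                    # the AmlCompute cluster created above
    environment=env,
    # 16 nodes x 4 GPUs per NC24s_v3 node = 64 worker processes.
    distributed_job_config=MpiConfiguration(process_count_per_node=4, node_count=16),
)

run = Experiment(ws, "bert-large-pretraining").submit(run_config)
run.wait_for_completion(show_output=True)
```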