# Bert

The benchmark reference for Bert can be found at this [link](https://github.com/mlcommons/training/tree/master/language_model/tensorflow/bert), and here is the PR for the minified benchmark implementation: [link](https://github.com/mlcommons/training/pull/632).

## Project setup

An important requirement is that you must have Docker installed.

```bash
# Create Python environment and install MLCube Docker runner
virtualenv -p python3 ./env && source ./env/bin/activate && pip install pip==24.0 && pip install mlcube-docker
# Fetch the implementation from GitHub
git clone https://github.com/mlcommons/training && cd ./training/language_model/tensorflow/bert
```
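Before moving on, it's worth checking that the prerequisites are in place; a quick sanity check, assuming Docker's standard CLI and the `mlcube` entry point installed above:

```bash
# Confirm Docker is installed and the daemon is reachable
docker --version && docker info > /dev/null && echo "Docker OK"
# Confirm the MLCube runner is available inside the active virtualenv
mlcube --help
```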
Go to the `mlcube` directory and study what tasks this MLCube implements.

```shell
cd ./mlcube
mlcube describe
```
| 22 | + |
| 23 | +### Demo execution |
| 24 | + |
| 25 | +These tasks will use a demo dataset to execute a faster training workload for a quick demo (~8 min): |
| 26 | + |
| 27 | +```bash |
| 28 | +mlcube run --task=download_demo -Pdocker.build_strategy=always |
| 29 | + |
| 30 | +mlcube run --task=demo -Pdocker.build_strategy=always |
| 31 | +``` |
| 32 | + |
| 33 | +It's also possible to execute the two tasks in one single instruction: |
| 34 | + |
| 35 | +```bash |
| 36 | +mlcube run --task=download_demo,demo -Pdocker.build_strategy=always |
| 37 | +``` |
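Note that `-Pdocker.build_strategy=always` forces a rebuild of the Docker image on every run. Once the image has been built, rebuilding can be skipped; a sketch, assuming the `auto` build strategy of the MLCube Docker runner:

```bash
# Reuse the image built by a previous run instead of rebuilding it
mlcube run --task=demo -Pdocker.build_strategy=auto
```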
| 38 | + |
| 39 | +### MLCube tasks |
| 40 | + |
| 41 | +Download dataset. |
| 42 | + |
| 43 | +```shell |
| 44 | +mlcube run --task=download_data -Pdocker.build_strategy=always |
| 45 | +``` |
| 46 | + |
| 47 | +Process dataset. |
| 48 | + |
| 49 | +```shell |
| 50 | +mlcube run --task=process_data -Pdocker.build_strategy=always |
| 51 | +``` |
| 52 | + |
| 53 | +Train SSD. |
| 54 | + |
| 55 | +```shell |
| 56 | +mlcube run --task=train -Pdocker.build_strategy=always |
| 57 | +``` |
| 58 | + |
| 59 | +Run compliance checker. |
| 60 | + |
| 61 | +```shell |
| 62 | +mlcube run --task=check_logs -Pdocker.build_strategy=always |
| 63 | +``` |
| 64 | + |
| 65 | +### Execute the complete pipeline |
| 66 | + |
| 67 | +You can execute the complete pipeline with one single command. |
| 68 | + |
| 69 | +```shell |
| 70 | +mlcube run --task=download_data,process_data,train,check_logs -Pdocker.build_strategy=always |
| 71 | +``` |
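By default, task inputs and outputs live under the `workspace` directory next to the MLCube configuration. If the dataset should be stored elsewhere (for example, on a larger disk), the workspace can be redirected; a sketch, assuming the runner's standard `--workspace` option and a hypothetical `/data/bert` directory:

```bash
mlcube run --task=download_data,process_data,train,check_logs \
    --workspace=/data/bert -Pdocker.build_strategy=always
```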
| 72 | + |
| 73 | +## TPU Training |
| 74 | + |
| 75 | +For executing this benchmark using TPU you will need access to [Google Cloud Platform](https://cloud.google.com/), then you can create a project (Note: all the resources should be created in the same project) and after that, you will need to follow the next steps: |
| 76 | + |
| 77 | +1. Create a TPU node |
| 78 | + |
| 79 | +In the Google Cloud console, search for the Cloud TPU API page, then click Enable. |
| 80 | + |
| 81 | +Then go to the virtual machine sections and select [TPUs](https://console.cloud.google.com/compute/tpus) |
| 82 | + |
| 83 | +Select create TPU node, fill in all the needed parameters, the recommended TPU type in the [readme](../README.md#on-tpu-v3-128) is v3-128 and the recommended TPU software version is 2.4.0. |
| 84 | + |
| 85 | +The 3 most important parameters you need to remember are: `project name`, `TPU name`, and `TPU Zone`. |
| 86 | + |
| 87 | +After creating, click on the TPU name to see the TPU details, and copy the Service account (should int the format: <[email protected]>) |
| 88 | + |
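Alternatively, the node can be created from the command line; a sketch with `gcloud`, where the TPU name and zone are hypothetical:

```bash
# Create a v3-128 TPU node with TPU software version 2.4.0
# (name and zone are placeholders; match them to your project).
gcloud compute tpus create bert-tpu \
    --zone=europe-west4-a \
    --accelerator-type=v3-128 \
    --version=2.4.0
```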
2. Create a Google Storage Bucket

Go to [Google Storage](https://console.cloud.google.com/storage/browser) and create a new bucket, defining the needed parameters.

In the bucket list, select the checkbox for the bucket you just created, click on Permissions, and then click on Add principal.

In the "New principals" field, paste the service account from step 1, and for the roles select Storage Legacy Bucket Owner, Storage Legacy Bucket Reader, and Storage Legacy Bucket Writer. Then click Save; this will allow the TPU to save the checkpoints during training.

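The same bucket setup can be scripted; a sketch with `gsutil`, assuming a hypothetical bucket name `bert-bucket` and the service account copied in step 1:

```bash
# Create the bucket (name is a placeholder)
gsutil mb gs://bert-bucket

# Grant the TPU service account the three legacy bucket roles
SA="service-<project-number>@cloud-tpu.iam.gserviceaccount.com"
gsutil iam ch "serviceAccount:${SA}:legacyBucketOwner" gs://bert-bucket
gsutil iam ch "serviceAccount:${SA}:legacyBucketReader" gs://bert-bucket
gsutil iam ch "serviceAccount:${SA}:legacyBucketWriter" gs://bert-bucket
```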
3. Create a VM instance

The idea is to create a virtual machine instance containing all the code we will execute using MLCube.

Go to [VM instances](https://console.cloud.google.com/compute/instances), then click on Create instance and define all the needed parameters (no GPU needed).

**IMPORTANT:** In the "Identity and API access" section, check the option `Allow full access to all Cloud APIs`. This will allow the connection between this VM, the Cloud Storage bucket, and the TPU.

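The console steps above can also be scripted; a sketch with `gcloud`, where the instance name, zone, and machine type are hypothetical, and `--scopes=cloud-platform` grants the full Cloud API access the warning above asks for:

```bash
# Create a CPU-only VM with full Cloud API access
# (name, zone, and machine type are placeholders).
gcloud compute instances create bert-vm \
    --zone=europe-west4-a \
    --machine-type=n1-standard-8 \
    --scopes=cloud-platform
```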
Start the VM, connect to it via SSH, then use this [tutorial](https://docs.docker.com/engine/install/debian/) to install Docker.

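On a Debian-based VM, Docker's documented convenience script condenses that tutorial into a few commands (a sketch; see the linked tutorial for the full procedure):

```bash
# Install Docker using Docker's convenience script
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
# Optional: allow running Docker without sudo (takes effect on next login)
sudo usermod -aG docker "$USER"
```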
After installing Docker, clone the repo and follow the project setup section above to install MLCube, then go to the path `training/language_model/tensorflow/bert/mlcube`.

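Condensed, the steps on the VM mirror the project setup at the top of this document:

```bash
virtualenv -p python3 ./env && source ./env/bin/activate
pip install pip==24.0 && pip install mlcube-docker
git clone https://github.com/mlcommons/training
cd training/language_model/tensorflow/bert/mlcube
```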
There, modify the file `workspace/parameters.yaml`, replacing the values below with your own:

```yaml
output_gs: your_gs_bucket_name
tpu_name: your_tpu_instance_name
tpu_zone: your_tpu_zone
gcp_project: your_gcp_project
```

After that, run the command:

```shell
mlcube run --task=train_tpu --mlcube=mlcube_tpu.yaml -Pdocker.build_strategy=always
```
This will start the MLCube task: from the host VM it sends a gRPC request with all the required information to the TPU, which then fetches the code to execute and the location of the data in the Cloud Storage bucket, and runs the training workload.

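While training runs, the checkpoints landing in the bucket can be inspected from the VM; a sketch, assuming the hypothetical bucket name used earlier:

```bash
# List what the training job has written to the bucket so far
gsutil ls -r gs://bert-bucket
```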