Azure CLI 2.0 allows you to create and manage Batch AI resources: create and delete Batch AI file servers and clusters, and submit and monitor training jobs.
This recipe shows how to create a GPU cluster, then run and monitor a Keras training job using either the TensorFlow or the CNTK backend.
The training script mnist_cnn.py is available on the official Keras GitHub page. This script trains a convolutional neural network on the MNIST database of handwritten digits.
To train a model, you typically need to perform the following steps:
- Create a GPU or CPU Batch AI cluster to run the job;
- Make the training data and training scripts available on the cluster nodes;
- Submit the training job and obtain its logs and/or generated models;
- Delete the cluster or resize it to zero nodes to avoid paying for compute resources when you are not using them.
In this recipe, we will:
- Create a single node GPU cluster (with Standard_NC6 VM size) with name nc6;
- Create a new storage account and two Azure File Shares, logs and scripts, to store job output and training scripts;
- Deploy the training script to the storage account before job submission;
- During job submission, instruct Batch AI to mount the Azure File Shares on the cluster's node and make them available as a regular file system at $AZ_BATCHAI_JOB_MOUNT_ROOT/logs and $AZ_BATCHAI_JOB_MOUNT_ROOT/scripts, where AZ_BATCHAI_JOB_MOUNT_ROOT is an environment variable set by Batch AI for the job;
- Monitor the job execution by streaming its standard output;
- After the job completes, inspect its output;
- At the end, clean up all allocated resources.
- Azure subscription - If you don't have an Azure subscription, create a free account before you begin.
- Access to Azure CLI 2.0. You can either use Azure CLI 2.0 available in Cloud Shell or install and configure it locally using the following instructions.
If you are using Cloud Shell, please change the working directory to /usr/$USER/clouddrive, because your home directory has very little free space:
cd /usr/$USER/clouddrive
An Azure resource group is a logical container for deploying and managing Azure resources. The following command will create a new resource group batchai.recipes in the East US location:
az group create -n batchai.recipes -l eastus
The following command will create a new workspace recipe_workspace in the East US location:
az batchai workspace create -g batchai.recipes -n recipe_workspace -l eastus
The following command will create a single node GPU cluster (VM size is Standard_NC6) using Ubuntu as the operating system image.
az batchai cluster create -n nc6 -g batchai.recipes -w recipe_workspace -s Standard_NC6 -t 1 --generate-ssh-keys
The --generate-ssh-keys option tells Azure CLI to generate private and public ssh keys if you do not have them already, so you can ssh to cluster nodes using the ssh key and your current user name. Note: back up the ~/.ssh folder to some permanent storage if you are using Cloud Shell.
Example output:
{
"allocationState": "steady",
"allocationStateTransitionTime": "2018-06-12T21:25:07.039000+00:00",
"creationTime": "2018-06-12T21:25:07.039000+00:00",
"currentNodeCount": 1,
"errors": null,
"id": "/subscriptions/1cba1da6-5a83-45e1-a88e-8b397eb84356/resourceGroups/batchai.recipes/providers/Microsoft.BatchAI/workspaces/recipe_workspace/clusters/nc6",
"name": "nc6",
"nodeSetup": null,
"nodeStateCounts": {
"idleNodeCount": 1,
"leavingNodeCount": 0,
"preparingNodeCount": 0,
"runningNodeCount": 0,
"unusableNodeCount": 0
},
"provisioningState": "succeeded",
"provisioningStateTransitionTime": "2018-06-12T21:25:23.591000+00:00",
"resourceGroup": "batchai.recipes",
"scaleSettings": {
"autoScale": null,
"manual": {
"nodeDeallocationOption": "requeue",
"targetNodeCount": 1
}
},
"subnet": null,
"type": "Microsoft.BatchAI/workspaces/clusters",
"userAccountSettings": {
"adminUserName": "recipeuser",
"adminUserPassword": null,
"adminUserSshPublicKey": "<YOUR SSH PUBLIC KEY HERE>"
},
"virtualMachineConfiguration": {
"imageReference": {
"offer": "UbuntuServer",
"publisher": "Canonical",
"sku": "16.04-LTS",
"version": "latest",
"virtualMachineImageId": null
}
},
"vmPriority": "dedicated",
"vmSize": "STANDARD_NC6"
}
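If you need to look at a node directly (for example, to debug a failing job preparation), you can list the cluster nodes' IP addresses and SSH ports and connect with the generated key. A minimal sketch, assuming the default CLI output fields:
az batchai cluster node list -c nc6 -g batchai.recipes -w recipe_workspace -o table
ssh -p <node SSH port> <node IP address>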
Create a new storage account with a unique name in the same region where you are going to create the Batch AI cluster and run the job. Note: each storage account must have a unique name.
az storage account create -n <storage account name> --sku Standard_LRS -g batchai.recipes
If the selected storage account name is not available, the above command will report a corresponding error. In this case, choose another name and retry.
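You can also check name availability up front before picking a new name:
az storage account check-name -n <storage account name>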
- Download the mnist_cnn.py example script into the current folder:
For GNU/Linux or Cloud Shell:
wget https://raw.githubusercontent.com/fchollet/keras/master/examples/mnist_cnn.py
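If wget is not available, curl is an equivalent alternative:
curl -O https://raw.githubusercontent.com/fchollet/keras/master/examples/mnist_cnn.py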
The following commands will create Azure File Shares scripts and logs, and will copy the training script into the keras folder inside the scripts share:
az storage share create -n scripts --account-name <storage account name>
az storage share create -n logs --account-name <storage account name>
az storage directory create -n keras -s scripts --account-name <storage account name>
az storage file upload -s scripts --source mnist_cnn.py --path keras --account-name <storage account name>
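Optionally, verify the upload by listing the files in the keras folder:
az storage file list -s scripts --path keras --account-name <storage account name> -o table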
Keras allows users to select different backends, such as TensorFlow and CNTK. To use the TensorFlow backend, create a training job configuration file with the following content:
{
"$schema": "https://raw.githubusercontent.com/Azure/BatchAI/master/schemas/2018-05-01/job.json",
"properties": {
"nodeCount": 1,
"tensorFlowSettings": {
"pythonScriptFilePath": "$AZ_BATCHAI_JOB_MOUNT_ROOT/scripts/keras/mnist_cnn.py"
},
"stdOutErrPathPrefix": "$AZ_BATCHAI_JOB_MOUNT_ROOT/logs",
"mountVolumes": {
"azureFileShares": [
{
"azureFileUrl": "https://<AZURE_BATCHAI_STORAGE_ACCOUNT>.file.core.windows.net/logs",
"relativeMountPath": "logs"
},
{
"azureFileUrl": "https://<AZURE_BATCHAI_STORAGE_ACCOUNT>.file.core.windows.net/scripts",
"relativeMountPath": "scripts"
}
]
},
"jobPreparation": {
"commandLine": "pip install keras"
},
"containerSettings": {
"imageSourceRegistry": {
"image": "tensorflow/tensorflow:1.8.0-gpu"
}
}
}
}
The configuration file specifies:
- nodeCount - the number of nodes required by the job;
- tensorFlowSettings - tells that the current Keras job needs the TensorFlow backend and specifies the path to the training script;
- stdOutErrPathPrefix - the path where Batch AI will create directories containing the job's logs;
- mountVolumes - the list of filesystems to be mounted during job execution. In this case, we are mounting two Azure File Shares, logs and scripts. The filesystems are mounted under AZ_BATCHAI_JOB_MOUNT_ROOT/<relativeMountPath>;
- <AZURE_BATCHAI_STORAGE_ACCOUNT> indicates that the storage account name will be specified during job submission via the --storage-account-name parameter or the AZURE_BATCHAI_STORAGE_ACCOUNT environment variable on your computer;
- the Keras framework will be installed by the job preparation command line;
- the job will use the official TensorFlow docker image.
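Keras picks its backend from the KERAS_BACKEND environment variable (or from ~/.keras/keras.json) and defaults to TensorFlow, so no extra configuration is needed here. If you want to make the choice explicit, a hedged sketch of a fragment you could add to the job configuration's properties (the environmentVariables section is part of the job schema):
"environmentVariables": [
    {
        "name": "KERAS_BACKEND",
        "value": "tensorflow"
    }
]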
Alternatively, to use the CNTK backend, create a training job configuration file with the following content:
{
"$schema": "https://raw.githubusercontent.com/Azure/BatchAI/master/schemas/2018-05-01/job.json",
"properties": {
"nodeCount": 1,
"cntkSettings": {
"pythonScriptFilePath": "$AZ_BATCHAI_JOB_MOUNT_ROOT/scripts/keras/mnist_cnn.py"
},
"stdOutErrPathPrefix": "$AZ_BATCHAI_JOB_MOUNT_ROOT/logs",
"mountVolumes": {
"azureFileShares": [
{
"azureFileUrl": "https://<AZURE_BATCHAI_STORAGE_ACCOUNT>.file.core.windows.net/logs",
"relativeMountPath": "logs"
},
{
"azureFileUrl": "https://<AZURE_BATCHAI_STORAGE_ACCOUNT>.file.core.windows.net/scripts",
"relativeMountPath": "scripts"
}
]
},
"containerSettings": {
"imageSourceRegistry": {
"image": "microsoft/cntk:2.5.1-gpu-python2.7-cuda9.0-cudnn7.0"
}
}
}
}
The differences are:
- cntkSettings - tells that the current Keras job needs the CNTK backend and specifies the path to the training script;
- the Keras framework is preinstalled in the docker image.
We will submit a Batch AI job for Keras with the TensorFlow backend. Use the following command to create a new experiment called keras_experiment in the workspace:
az batchai experiment create -g batchai.recipes -w recipe_workspace -n keras_experiment
Use the following command to submit the job on the cluster:
wget -O job.json https://raw.githubusercontent.com/Azure/BatchAI/master/recipes/Keras/Keras-GPU/job_tensorflow.json
az batchai job create -c nc6 -n keras_tensorflow -g batchai.recipes -w recipe_workspace -e keras_experiment -f job.json --storage-account-name <storage account name>
Example output:
{
"caffe2Settings": null,
"caffeSettings": null,
"chainerSettings": null,
"cluster": {
"id": "/subscriptions/1cba1da6-5a83-45e1-a88e-8b397eb84356/resourceGroups/batchai.recipes/providers/Microsoft.BatchAI/workspaces/recipe_workspace/clusters/nc6",
"resourceGroup": "batchai.recipes"
},
"cntkSettings": null,
"constraints": {
"maxWallClockTime": "7 days, 0:00:00"
},
"containerSettings": {
"imageSourceRegistry": {
"credentials": null,
"image": "tensorflow/tensorflow:1.8.0-gpu",
"serverUrl": null
},
"shmSize": null
},
"creationTime": "2018-06-16T00:33:38.845000+00:00",
"customMpiSettings": null,
"customToolkitSettings": null,
"environmentVariables": null,
"executionInfo": {
"endTime": null,
"errors": null,
"exitCode": null,
"startTime": "2018-06-16T00:33:39.913000+00:00"
},
"executionState": "running",
"executionStateTransitionTime": "2018-06-16T00:33:39.913000+00:00",
"horovodSettings": null,
"id": "/subscriptions/1cba1da6-5a83-45e1-a88e-8b397eb84356/resourceGroups/batchai.recipes/providers/Microsoft.BatchAI/workspaces/recipe_workspace/experiments/keras_experiment/jobs/keras_tensorflow",
"inputDirectories": null,
"jobOutputDirectoryPathSegment": "1cba1da6-5a83-45e1-a88e-8b397eb84356/batchai.recipes/workspaces/recipe_workspace/experiments/keras_experiment/jobs/keras_tensorflow/1261f499-e7f5-4b07-80a4-60259044ddc7",
"jobPreparation": {
"commandLine": "pip install keras"
},
"mountVolumes": {
"azureBlobFileSystems": null,
"azureFileShares": [
{
"accountName": "batchairecipestorage",
"azureFileUrl": "https://batchairecipestorage.file.core.windows.net/logs",
"credentials": {
"accountKey": null,
"accountKeySecretReference": null
},
"directoryMode": "0777",
"fileMode": "0777",
"relativeMountPath": "logs"
},
{
"accountName": "batchairecipestorage",
"azureFileUrl": "https://batchairecipestorage.file.core.windows.net/scripts",
"credentials": {
"accountKey": null,
"accountKeySecretReference": null
},
"directoryMode": "0777",
"fileMode": "0777",
"relativeMountPath": "scripts"
}
],
"fileServers": null,
"unmanagedFileSystems": null
},
"name": "keras_tensorflow",
"nodeCount": 1,
"outputDirectories": null,
"provisioningState": "succeeded",
"provisioningStateTransitionTime": "2018-06-16T00:33:39.490000+00:00",
"pyTorchSettings": null,
"resourceGroup": "batchai.recipes",
"schedulingPriority": "normal",
"secrets": null,
"stdOutErrPathPrefix": "$AZ_BATCHAI_JOB_MOUNT_ROOT/logs",
"tensorFlowSettings": {
"masterCommandLineArgs": null,
"parameterServerCommandLineArgs": null,
"parameterServerCount": null,
"pythonInterpreterPath": null,
"pythonScriptFilePath": "$AZ_BATCHAI_JOB_MOUNT_ROOT/scripts/keras/mnist_cnn.py",
"workerCommandLineArgs": null,
"workerCount": null
},
"toolType": "tensorflow",
"type": "Microsoft.BatchAI/workspaces/experiments/jobs"
}
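You can re-check the job's state (the executionState attribute shown above) at any time with:
az batchai job show -n keras_tensorflow -g batchai.recipes -w recipe_workspace -e keras_experiment --query executionState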
The training script reports the training progress in the stdout-wk-0.txt file (for the CNTK backend, the log file name is stdout.txt) inside the standard output directory. You can monitor the progress using the following command:
az batchai job file stream -j keras_tensorflow -g batchai.recipes -w recipe_workspace -e keras_experiment -f stdout-wk-0.txt
Example output:
Downloading data from https://s3.amazonaws.com/img-datasets/mnist.npz
16384/11490434 [..............................] - ETA: 0s
24576/11490434 [..............................] - ETA: 38s
40960/11490434 [..............................] - ETA: 46s
73728/11490434 [..............................] - ETA: 38s
106496/11490434 [..............................] - ETA: 35s
139264/11490434 [..............................] - ETA: 33s
196608/11490434 [..............................] - ETA: 28s
245760/11490434 [..............................] - ETA: 26s
...
56064/60000 [===========================>..] - ETA: 0s - loss: 0.0253 - acc: 0.9920
56576/60000 [===========================>..] - ETA: 0s - loss: 0.0255 - acc: 0.9920
57088/60000 [===========================>..] - ETA: 0s - loss: 0.0256 - acc: 0.9920
57600/60000 [===========================>..] - ETA: 0s - loss: 0.0255 - acc: 0.9920
58112/60000 [============================>.] - ETA: 0s - loss: 0.0256 - acc: 0.9920
58624/60000 [============================>.] - ETA: 0s - loss: 0.0256 - acc: 0.9920
59136/60000 [============================>.] - ETA: 0s - loss: 0.0256 - acc: 0.9919
59648/60000 [============================>.] - ETA: 0s - loss: 0.0256 - acc: 0.9920
60000/60000 [==============================] - 8s 132us/step - loss: 0.0256 - acc: 0.9919 - val_loss: 0.0248 - val_acc: 0.9925
The streaming is stopped when the job is completed.
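You can also list all files in the job's standard output directory, for example to find the exact log file name:
az batchai job file list -j keras_tensorflow -g batchai.recipes -w recipe_workspace -e keras_experiment -o table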
Alternatively, you can use the Azure Portal or Azure Storage Explorer to inspect the generated files. To distinguish output from different jobs, Batch AI creates a unique folder structure for each of them. You can find the path to the folder containing the output using the jobOutputDirectoryPathSegment attribute of the submitted job:
az batchai job show -n keras_tensorflow -g batchai.recipes -w recipe_workspace -e keras_experiment --query jobOutputDirectoryPathSegment
Example output:
"00000000-0000-0000-0000-000000000000/batchai.recipes/workspaces/recipe_workspace/experiments/keras_experiment/jobs/keras_tensorflow/1261f499-e7f5-4b07-80a4-60259044ddc7"
Delete the resource group and all allocated resources with the following command:
az group delete -n batchai.recipes -y
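Alternatively, if you plan to run more jobs, keep the workspace and resize the cluster to zero nodes so you are not charged for idle compute:
az batchai cluster resize -n nc6 -g batchai.recipes -w recipe_workspace -t 0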