This recipe shows how to reproduce the Horovod distributed training benchmarks using Azure Batch AI.
Batch AI currently has no native support for the Horovod framework, but Horovod is easy to run using the Batch AI custom toolkit.
- The official Horovod benchmark scripts are used.
- The job runs in the standard TensorFlow container `tensorflow/tensorflow:1.8.0-gpu`.
- Horovod and Intel MPI are installed in the container by the job preparation command line. Alternatively, you can build your own Docker image that already contains TensorFlow and Horovod.
- The benchmark scripts are also downloaded to the GPU nodes by the job preparation command line and stored in `$AZ_BATCHAI_JOB_TEMP` on each node.
- This sample requires at least two `STANDARD_NC24r` nodes; make sure you have enough quota.
- The standard output of the job is stored on an Azure File Share.
- This recipe reproduces ONLY the training results with synthetic data on NVIDIA K80 GPUs.
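
The setup described in the list above can be sketched as two command strings: one for job preparation and one for launching the benchmark. This is a minimal illustration, not the recipe's actual job definition. The helper names, the `resnet101` model, the batch size of 64, the count of 4 GPUs per `STANDARD_NC24r` node, and the use of `$AZ_BATCHAI_MPI_HOST_FILE` with Intel MPI's `mpirun` are all assumptions made for this example. The Intel MPI installation step is left out because the text does not say where the installer comes from.

```python
# Assumption: each STANDARD_NC24r node exposes 4 K80 GPUs, one Horovod process per GPU.
GPUS_PER_NODE = 4


def build_job_preparation_command(scripts_dir="$AZ_BATCHAI_JOB_TEMP"):
    """Return a shell command that prepares one node (hypothetical helper).

    Intel MPI must be installed before Horovod is built; that step is
    omitted here because its download location is not given in the recipe.
    """
    steps = [
        "apt-get update",
        "apt-get install -y git",
        "pip install horovod",
        # Assumption: the benchmark scripts come from the tensorflow/benchmarks repository.
        "git clone https://github.com/tensorflow/benchmarks {}/benchmarks".format(scripts_dir),
    ]
    return " && ".join(steps)


def build_benchmark_command(node_count, model="resnet101", batch_size=64,
                            scripts_dir="$AZ_BATCHAI_JOB_TEMP"):
    """Return an mpirun command line for the benchmark (hypothetical helper)."""
    if node_count < 2:
        raise ValueError("this recipe needs at least two nodes")
    total_processes = node_count * GPUS_PER_NODE
    script = scripts_dir + "/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py"
    parts = [
        "mpirun",
        "-n", str(total_processes),      # total number of MPI processes
        "-ppn", str(GPUS_PER_NODE),      # processes per node
        "-hostfile", "$AZ_BATCHAI_MPI_HOST_FILE",
        "python", script,
        "--model", model,
        "--batch_size", str(batch_size),
        "--variable_update", "horovod",
        # No --data_dir is passed, so the script falls back to synthetic data.
    ]
    return " ".join(parts)
```

For example, `build_benchmark_command(2)` yields a command that starts 8 processes, 4 on each of the two nodes. The resulting strings would go into the job preparation and command line fields of a custom toolkit job.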
You can find a Jupyter Notebook for this recipe in Horovod-Infiniband-Benchmark.ipynb.
You can find the Azure CLI 2.0 instructions for this recipe in cli-instructions.md.
Under construction...
If you have any problems or questions, you can reach the Batch AI team at [email protected] or you can create an issue on GitHub.
We also welcome your contributions of additional sample notebooks, scripts, or other examples of working with Batch AI.