This repository has been archived by the owner on Nov 16, 2023. It is now read-only.

Horovod-Infiniband-Benchmark

This recipe shows how to reproduce Horovod distributed training benchmarks using Azure Batch AI.

Batch AI currently has no native support for the Horovod framework, but Horovod jobs can be run using the Batch AI custom toolkit.

Details

  • The official Horovod benchmark scripts are used.
  • The job runs in the standard TensorFlow container tensorflow/tensorflow:1.8.0-gpu.
  • The Horovod framework and Intel MPI are installed in the container by the job preparation command line. Note that you can instead build your own Docker image containing TensorFlow and Horovod.
  • The benchmark scripts are also downloaded to the GPU nodes by the job preparation command line and stored in $AZ_BATCHAI_JOB_TEMP on each node.
  • This sample requires at least two STANDARD_NC24r nodes, so make sure you have enough quota.
  • Standard output of the job is stored on an Azure File Share.
  • This recipe reproduces ONLY the training results with synthetic data on NVIDIA K80 GPUs.
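
As a rough illustration of how the pieces above fit together, the following Python sketch assembles an Intel MPI style launch command that starts one Horovod process per GPU. It is not the command line used by this recipe. The helper name build_benchmark_command, the host addresses, the script path, the model, and the batch size are all hypothetical, and the default of four GPUs per node assumes each STANDARD_NC24r node exposes four K80 GPUs. The notebook and CLI instructions remain the reference for the actual job definition.

```python
import shlex


def build_benchmark_command(hosts, gpus_per_node=4, model="resnet50",
                            batch_size=64, script="tf_cnn_benchmarks.py"):
    """Return an mpirun argument list that starts one process per GPU.

    Assumptions (not taken from the recipe): the script path, the model,
    the batch size, and four GPUs per STANDARD_NC24r node.
    """
    if not hosts:
        raise ValueError("at least one host is required")
    total_processes = len(hosts) * gpus_per_node  # one Horovod rank per GPU
    return [
        "mpirun",
        "-n", str(total_processes),      # total number of processes
        "-ppn", str(gpus_per_node),      # processes per node
        "-hosts", ",".join(hosts),       # comma-separated node list
        "python", script,
        "--model", model,
        "--batch_size", str(batch_size),
        "--variable_update", "horovod",  # let Horovod aggregate gradients
    ]


if __name__ == "__main__":
    # Placeholder addresses for a two-node cluster.
    command = build_benchmark_command(["10.0.0.4", "10.0.0.5"])
    print(" ".join(shlex.quote(part) for part in command))
```

No dataset path is passed to the benchmark script here, which matches the recipe's use of synthetic data.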

Instructions to Run Recipe

Python Jupyter Notebook

You can find the Jupyter Notebook for this recipe in Horovod-Infiniband-Benchmark.ipynb.

Azure CLI 2.0

You can find the Azure CLI 2.0 instructions for this recipe in cli-instructions.md.

License Notice

Under construction...

Help or Feedback


If you have any problems or questions, you can reach the Batch AI team at [email protected] or you can create an issue on GitHub.

We also welcome your contributions of additional sample notebooks, scripts, or other examples of working with Batch AI.