Horovod is a distributed training framework for TensorFlow, Keras, and PyTorch. Its goal is to make distributed deep learning fast and easy to use.
See the official Horovod GitHub page for details.
This Horovod recipe explains how to run a Horovod distributed training job for TensorFlow on a GPU cluster with Batch AI.
This Horovod recipe explains how to run a Horovod distributed training job for PyTorch on a GPU cluster with Batch AI.
This Horovod-Infiniband-Benchmark recipe explains how to reproduce Horovod distributed training benchmarks with InfiniBand support using Batch AI.
If you have any problems or questions, you can reach the Batch AI team at [email protected] or create an issue on GitHub.
We also welcome your contributions of additional sample notebooks, scripts, or other examples of working with Batch AI.