This repository has been archived by the owner on Nov 16, 2023. It is now read-only.


Chainer GPU Distributed Infiniband

This example demonstrates how to run the standard ChainerMN train_mnist.py script as a distributed training job using Batch AI with InfiniBand enabled.

Details

  • The standard Chainer sample script train_mnist.py is used;
  • Chainer downloads the standard MNIST database on its own and distributes it across workers;
  • The standard output of the job and the trained model will be stored on an Azure File Share;
  • IntelMPI (non-CUDA-aware) will be used to launch ChainerMN jobs across nodes.
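
To make "distributes it across workers" concrete, here is a minimal sketch of how a dataset of a given size can be partitioned into near-equal shards, one per worker. It is an illustration only, not ChainerMN's implementation: in ChainerMN this step is normally handled by `chainermn.scatter_dataset`, and the helper name `shard_indices` and the worker count of 4 are assumptions made for this example.

```python
def shard_indices(n_samples, n_workers):
    """Split range(n_samples) into n_workers contiguous, near-equal shards.

    Hypothetical helper for illustration; not part of Chainer or ChainerMN.
    """
    if n_workers <= 0:
        raise ValueError("n_workers must be positive")
    base, extra = divmod(n_samples, n_workers)
    shards = []
    start = 0
    for rank in range(n_workers):
        # The first `extra` workers each take one additional sample.
        stop = start + base + (1 if rank < extra else 0)
        shards.append(range(start, stop))
        start = stop
    return shards


if __name__ == "__main__":
    # The MNIST training set has 60,000 images; 4 workers is an arbitrary choice.
    for rank, shard in enumerate(shard_indices(60000, 4)):
        print(rank, shard.start, shard.stop, len(shard))
```

Each worker then trains only on its own shard, and gradients are combined across workers over MPI after every iteration.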

Instructions to Run Recipe

Python Jupyter Notebook

You can find the Jupyter Notebook for this recipe in Chainer-GPU-Distributed-Infiniband.ipynb.

Azure CLI 2.0

You can find the Azure CLI 2.0 instructions for this recipe in cli-instructions.md.

Dockerfile

The Dockerfile for the Docker image used in this recipe can be found here. It is a modified version of the ChainerMN example Dockerfile, built on top of the IntelMPI library.

License Notice

Under construction...

Help or Feedback


If you have any problems or questions, you can reach the Batch AI team at [email protected] or you can create an issue on GitHub.

We also welcome your contributions of additional sample notebooks, scripts, or other examples of working with Batch AI.