This document explains how to perform distributed training on Amazon EKS using TensorFlow and Horovod with a synthetic ImageNet dataset.
- Install the MPI package:

  ```
  cd ${KUBEFLOW_SRC}/${KFAPP}/ks_app
  ks pkg install kubeflow/mpi-job
  ```
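
  Optionally, confirm that the `kubeflow/mpi-job` package now shows up as installed:

  ```
  # Optional: list registry packages and their installed status
  ks pkg list
  ```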
- Install the MPI operator:

  ```
  export MPI_OPERATOR=mpi-operator
  ks generate mpi-operator ${MPI_OPERATOR}
  ks param set ${MPI_OPERATOR} image mpioperator/mpi-operator:0.1.0
  ks apply default -c ${MPI_OPERATOR}
  ```
- Verify the installation:

  ```
  $ kubectl get crd
  NAME                   CREATED AT
  ...
  mpijobs.kubeflow.org   2019-02-12T22:12:32Z
  ...

  $ kubectl -n kubeflow get deployment ${MPI_OPERATOR}
  NAME           DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
  mpi-operator   1         1         1            1           1m
  ```
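
  If the CRD or the deployment is missing, the operator logs are a good first place to look. This is an optional troubleshooting step:

  ```
  # Optional: inspect the MPI operator logs when the CRD or deployment does not appear
  kubectl -n kubeflow logs deployment/${MPI_OPERATOR}
  ```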
- Generate the job prototype:

  ```
  export JOB_NAME=tf-resnet50-horovod-job
  ks generate mpi-job-custom ${JOB_NAME}
  ```
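
  Optionally, list the parameters the prototype exposes before overriding them in the following steps:

  ```
  # Optional: show the component's parameters and their current values
  ks param list ${JOB_NAME}
  ```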
- You can leverage AWS Deep Learning Containers to build a Docker image for Horovod using the Dockerfile from `training/distributed_training/Dockerfile`. You will need to log in to the AWS Deep Learning Containers repository and then run the build command:

  ```
  $(aws ecr get-login --no-include-email --region us-east-1 --registry-ids 763104351884)
  docker image build -t ${YOUR_DOCKER_HUB_ID}/eks-kubeflow-horovod:latest .
  ```

  Alternatively, you can use the image `mpioperator/tensorflow-benchmarks:latest` that already exists on Docker Hub:

  ```
  ks param set ${JOB_NAME} image "mpioperator/tensorflow-benchmarks:latest"
  ```
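
  If you build your own image instead, it has to be pushed to a registry the cluster nodes can pull from before pointing the job at it. A minimal sketch, assuming you push to Docker Hub under `${YOUR_DOCKER_HUB_ID}`:

  ```
  # Push the locally built Horovod image (assumes you are already logged in via `docker login`)
  docker push ${YOUR_DOCKER_HUB_ID}/eks-kubeflow-horovod:latest

  # Point the job at your image instead of the public benchmark image
  ks param set ${JOB_NAME} image "${YOUR_DOCKER_HUB_ID}/eks-kubeflow-horovod:latest"
  ```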
- Define the number of workers (containers) and the number of GPUs available per container:

  ```
  ks param set ${JOB_NAME} replicas 2
  ks param set ${JOB_NAME} gpusPerReplica 4
  ```
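
  With these values the job asks for replicas x gpusPerReplica = 2 x 4 = 8 GPUs in total, so the cluster must expose at least that many schedulable `nvidia.com/gpu` resources. Optionally, you can check what the nodes advertise:

  ```
  # Optional: list the nvidia.com/gpu capacity/allocatable entries reported by each node
  kubectl describe nodes | grep -E "Hostname:|nvidia.com/gpu"
  ```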
- Formulate the MPI command based on the official Horovod documentation:

  ```
  EXEC="mpirun,-mca,btl_tcp_if_exclude,lo,-mca,pml,ob1,-mca,btl,^openib,--bind-to,none,-map-by,slot,-x,LD_LIBRARY_PATH,-x,PATH,-x,NCCL_DEBUG=INFO,python,scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py,--data_format=NCHW,--batch_size=256,--model=resnet50,--optimizer=sgd,--variable_update=horovod,--data_name=imagenet,--use_fp16"
  ks param set ${JOB_NAME} command ${EXEC}
  ```
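
  For readability, the comma-separated `EXEC` value above corresponds to the following `mpirun` invocation (line breaks added here only for presentation; the ksonnet parameter itself must stay comma-separated):

  ```
  # The EXEC string above, expanded for readability only
  mpirun -mca btl_tcp_if_exclude lo -mca pml ob1 -mca btl ^openib \
    --bind-to none -map-by slot \
    -x LD_LIBRARY_PATH -x PATH -x NCCL_DEBUG=INFO \
    python scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
      --data_format=NCHW --batch_size=256 --model=resnet50 --optimizer=sgd \
      --variable_update=horovod --data_name=imagenet --use_fp16
  ```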
  If more than one GPU is used, then you may have to replace `NCCL_SOCKET_IFNAME=eth0` with `NCCL_SOCKET_IFNAME=^docker0`.
- Verify your job configuration; it will look like mpi-job-template.yaml:

  ```
  ks show default -c ${JOB_NAME}
  ```
- Deploy the config to your cluster:

  ```
  ks apply default -c ${JOB_NAME}
  ```
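
  Optionally, confirm that the `MPIJob` resource was created and watch the launcher and worker pods come up (the `mpi_job_name` label is the same one used in the next step):

  ```
  # Optional: confirm the MPIJob resource exists
  kubectl -n kubeflow get mpijobs

  # Optional: watch the launcher and worker pods for this job
  kubectl -n kubeflow get pods -l mpi_job_name=${JOB_NAME}
  ```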
- Check the pod status and dump the logs:

  ```
  POD_NAME=$(kubectl -n kubeflow get pods -l mpi_job_name=${JOB_NAME},mpi_role_type=launcher -o name)
  kubectl -n kubeflow logs -f ${POD_NAME}
  ```

  Here is a sample output.
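
  If you only care about throughput, you can filter the launcher log for the benchmark's throughput lines (this assumes the standard `tf_cnn_benchmarks` output, which reports `images/sec`):

  ```
  # Optional: show only the throughput lines reported by tf_cnn_benchmarks
  kubectl -n kubeflow logs ${POD_NAME} | grep "images/sec"
  ```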
- To delete the job in Kubernetes and remove the job configuration from your application:

  ```
  ks delete default -c ${JOB_NAME}   # delete the job in Kubernetes
  ks component rm ${JOB_NAME}        # remove the manifest from your ksonnet application
  ```