This document explains how to perform distributed training on Amazon EKS using TensorFlow and Horovod with the ImageNet dataset. The following steps can be used for any dataset, though.
- Download and prepare the ImageNet dataset in your S3 bucket like this:

  ```
  ➜ aws s3 ls s3://eks-dl-benchmark/imagenet/train/
  2019-02-28 12:03:46   56755552 train-00001-of-01024
  2019-02-28 12:03:45   56365180 train-00002-of-01024
  ......
  2019-02-28 12:03:45   56365180 train-01024-of-01024

  ➜ aws s3 ls s3://eks-dl-benchmark/imagenet/validation/
  2019-02-28 12:14:10   19504012 validation-00001-of-00128
  2019-02-28 12:14:10   19624967 validation-00002-of-00128
  ....
  2019-02-28 12:14:10   20063161 validation-00128-of-00128
  ```

  The bucket name can be different, but all data needs to be in the `imagenet` folder. The training data needs to be in the `train` subfolder and the validation data in the `validation` subfolder.
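  If your TFRecord files are on a local disk or an EC2 instance, one way to get them into this layout is `aws s3 sync`; the bucket name and local paths below are placeholders, so substitute your own.

  ```
  # Upload locally prepared TFRecords into the expected bucket layout.
  # Replace the bucket name and local paths with your own.
  aws s3 sync /path/to/tfrecords/train/ s3://eks-dl-benchmark/imagenet/train/
  aws s3 sync /path/to/tfrecords/validation/ s3://eks-dl-benchmark/imagenet/validation/
  ```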
- Create an FSx for Lustre file system and enable data integration with S3. Use the VPC info of the GPU-powered EKS cluster created in the first step to create FSx. Note down the file system id after the FSx for Lustre file system is created.

  Note: FSx can only be mounted in one availability zone. Make sure to create a single-AZ EKS cluster. This is specified in the `aws_config/cluster_config.yaml` file during cluster creation.
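  For reference, such a file system can also be created from the AWS CLI. The sketch below links the file system to the S3 bucket above via `ImportPath`; the capacity, subnet, and security group values are placeholders and must come from your cluster's VPC and AZ.

  ```
  # Sketch: create an FSx for Lustre file system backed by the S3 bucket.
  # Capacity, subnet, and security group values below are placeholders.
  aws fsx create-file-system \
    --file-system-type LUSTRE \
    --storage-capacity 3600 \
    --subnet-ids subnet-0123456789abcdef0 \
    --security-group-ids sg-0123456789abcdef0 \
    --lustre-configuration ImportPath=s3://eks-dl-benchmark/imagenet
  # Note the FileSystemId and DNSName in the output for the later steps.
  ```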
- Follow the steps to install the mpi-operator; a ksonnet-based sketch is shown below.
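  If your ksonnet app already has the Kubeflow registry configured, the install is typically along these lines. The package and prototype names below are assumptions and can differ between Kubeflow releases, so prefer the official steps if they disagree.

  ```
  cd ${KUBEFLOW_SRC}/${KFAPP}/ks_app
  # Package/prototype names may vary across Kubeflow releases.
  ks pkg install kubeflow/mpi-job
  ks generate mpi-operator mpi-operator
  ks apply default -c mpi-operator
  ```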
- Deploy the Amazon FSx CSI Plugin.

  ```
  cd ${KUBEFLOW_SRC}/${KFAPP}/ks_app
  export COMPONENT=aws-fsx-csi-driver
  ks generate aws-fsx-csi-driver ${COMPONENT}
  ks apply default -c ${COMPONENT}
  ```
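  Before moving on, it is worth confirming that the driver pods are running; the namespace they land in can vary, so a broad filter is enough here.

  ```
  # The FSx CSI driver pods should all be in the Running state.
  kubectl get pods --all-namespaces | grep fsx
  ```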
- Prepare the Persistent Volume (PV), Persistent Volume Claim (PVC), and Storage Class. Go to the FSx console and replace `fsxId` and `dnsName` with your FSx info.

  ```
  cd ${KUBEFLOW_SRC}/${KFAPP}/ks_app
  export COMPONENT=fsx-static-storage
  ks generate aws-fsx-pv-static ${COMPONENT} --fsxId=fs-048xxxx7c25 --dnsName=fs-048xxxx7c25.fsx.us-west-2.amazonaws.com
  ks apply default -c ${COMPONENT}
  ```
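  Once applied, the PVC should bind to the static PV. Assuming the PVC is created in the `kubeflow` namespace, like the rest of the components in this guide, you can verify it like this.

  ```
  # The PV is cluster-scoped; the PVC should report STATUS "Bound".
  kubectl get pv
  kubectl -n kubeflow get pvc
  ```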
- Prepare the training job. Check here for more details.

  ```
  export JOB_NAME=tf-resnet50-horovod-job
  ks generate mpi-job-custom ${JOB_NAME}

  ks param set ${JOB_NAME} image "seedjeffwan/eks-dl-benchmark:cuda10-tf1.13.1-hvd0.16.0-py3.5"
  ks param set ${JOB_NAME} replicas 2
  ks param set ${JOB_NAME} gpusPerReplica 4

  EXEC="mpirun,-mca,btl_tcp_if_exclude,lo,-mca,pml,ob1,-mca,btl,^openib,--bind-to,none,-map-by,slot,-x,LD_LIBRARY_PATH,-x,PATH,-x,NCCL_DEBUG=INFO,python,models/resnet/tensorflow/train_imagenet_resnet_hvd.py,--batch_size=256,--model=resnet50,--num_batches=300,--fp16,--display_every=50,--lr_decay_mode=poly,--data_dir=/data/imagenet/train"
  ```

  NOTE: Instead of using synthetic data, the job will read real data from `--data_dir`.
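  With `replicas` set to 2 and `gpusPerReplica` set to 4, the job asks for 8 GPUs in total, so it helps to review the generated parameters and confirm that the worker nodes actually expose that many GPUs before deploying.

  ```
  # Review the parameters set on the component.
  ks param list ${JOB_NAME}
  # The summed nvidia.com/gpu capacity across nodes should cover
  # replicas x gpusPerReplica (2 x 4 = 8 here).
  kubectl describe nodes | grep "nvidia.com/gpu"
  ```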
- Deploy the training job.

  ```
  ks apply default -c ${JOB_NAME}
  ```
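  Once applied, the mpi-operator turns the component into an MPIJob plus launcher and worker pods; assuming the `kubeflow` namespace, they can be listed like this.

  ```
  # List the MPIJob and the pods it created.
  kubectl -n kubeflow get mpijobs
  kubectl -n kubeflow get pods -l mpi_job_name=${JOB_NAME}
  ```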
- Check pod status and logs.

  ```
  POD_NAME=$(kubectl -n kubeflow get pods -l mpi_job_name=${JOB_NAME},mpi_role_type=launcher -o name)
  kubectl -n kubeflow logs -f ${POD_NAME}
  ```

  Here is a sample output.
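  The launcher aggregates output from all workers. If you need per-worker logs, for example to inspect the NCCL ring setup reported by `NCCL_DEBUG=INFO`, you can target a worker pod directly; the `mpi_role_type=worker` label value is an assumption here, mirroring the launcher label above.

  ```
  # Tail the logs of one worker pod (label value assumed, mirroring the launcher's).
  WORKER=$(kubectl -n kubeflow get pods -l mpi_job_name=${JOB_NAME},mpi_role_type=worker -o name | head -n 1)
  kubectl -n kubeflow logs -f ${WORKER}
  ```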
If you work for Amazon, reach out to the authors of this document for access to the data. Otherwise, follow the instructions below.
- Download the ImageNet dataset and upload it to your S3 bucket. Use the `Download Original Images (for non-commercial research/educational use only)` option.
- TensorFlow consumes the ImageNet data in a specific format. You can preprocess the images by downloading and modifying this script:

  ```
  curl -O https://raw.githubusercontent.com/aws-samples/deep-learning-models/master/utils/tensorflow/preprocess_imagenet.sh
  chmod +x preprocess_imagenet.sh
  ```

  The following values need to be changed:

  - `[your imagenet account]`
  - `[your imagenet access key]`
  - `[PATH TO TFRECORD TRAINING DATASET]`
  - `[PATH TO RESIZED TFRECORD TRAINING DATASET]`
  - `[PATH TO TFRECORD VALIDATION DATASET]`
  - `[PATH TO RESIZED TFRECORD VALIDATION DATASET]`

  Execute the script:

  ```
  ./preprocess_imagenet.sh
  ```
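  If you prefer not to edit the script by hand, the placeholders can also be substituted with `sed` before running it. The replacement values below are purely illustrative, and the placeholder strings must match what your copy of `preprocess_imagenet.sh` actually contains.

  ```
  # Illustrative only: substitute placeholder values in the script.
  # Verify the exact placeholder strings in your copy of the script first.
  sed -i \
    -e 's|\[your imagenet account\]|my-imagenet-account|' \
    -e 's|\[your imagenet access key\]|my-imagenet-access-key|' \
    -e 's|\[PATH TO TFRECORD TRAINING DATASET\]|/data/tfrecord/train|' \
    -e 's|\[PATH TO RESIZED TFRECORD TRAINING DATASET\]|/data/tfrecord/train-resized|' \
    -e 's|\[PATH TO TFRECORD VALIDATION DATASET\]|/data/tfrecord/validation|' \
    -e 's|\[PATH TO RESIZED TFRECORD VALIDATION DATASET\]|/data/tfrecord/validation-resized|' \
    preprocess_imagenet.sh
  ```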