This document explains how to perform distributed training on Amazon EKS using TensorFlow and Horovod with the ImageNet dataset. The following steps can be used for any dataset, though.
- Download and prepare the ImageNet dataset in your S3 bucket like this:

  ```
  ➜ aws s3 ls s3://eks-dl-benchmark/imagenet/train/
  2019-02-28 12:03:46   56755552 train-00001-of-01024
  2019-02-28 12:03:45   56365180 train-00002-of-01024
  ......
  2019-02-28 12:03:45   56365180 train-01024-of-01024

  ➜ aws s3 ls s3://eks-dl-benchmark/imagenet/validation/
  2019-02-28 12:14:10   19504012 validation-00001-of-00128
  2019-02-28 12:14:10   19624967 validation-00002-of-00128
  ....
  2019-02-28 12:14:10   20063161 validation-00128-of-00128
  ```

  The bucket name can be different, but all data needs to be in the `imagenet` folder. The training data needs to be in the `train` subfolder and the validation data in the `validation` subfolder.
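  If your TFRecord files are on a local disk or an EC2 instance, one way to get them into this layout is `aws s3 sync`; the bucket name and local paths below are placeholders, so substitute your own.

  ```
  # Upload locally prepared TFRecords into the expected bucket layout.
  # Replace the bucket name and local paths with your own.
  aws s3 sync /path/to/tfrecords/train/ s3://eks-dl-benchmark/imagenet/train/
  aws s3 sync /path/to/tfrecords/validation/ s3://eks-dl-benchmark/imagenet/validation/
  ```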
- Create an FSx for Lustre file system and enable data integration with S3. Use the VPC info of the GPU-powered EKS cluster created in the first step to create FSx. Note down the file system id after the FSx for Lustre file system is created.

  Note: FSx can only be mounted in one availability zone. Make sure to create a single-AZ EKS cluster. This is specified in the `aws_config/cluster_config.yaml` file during cluster creation.
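  For reference, such a file system can also be created from the AWS CLI. The sketch below links the file system to the S3 bucket above via `ImportPath`; the capacity, subnet, and security group values are placeholders and must come from your cluster's VPC and AZ.

  ```
  # Sketch: create an FSx for Lustre file system backed by the S3 bucket.
  # Capacity, subnet, and security group values below are placeholders.
  aws fsx create-file-system \
    --file-system-type LUSTRE \
    --storage-capacity 3600 \
    --subnet-ids subnet-0123456789abcdef0 \
    --security-group-ids sg-0123456789abcdef0 \
    --lustre-configuration ImportPath=s3://eks-dl-benchmark/imagenet
  # Note the FileSystemId and DNSName in the output for the later steps.
  ```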
- Follow the steps to install the mpi-operator; a ksonnet-based sketch is shown below.
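  If your ksonnet app already has the Kubeflow registry configured, the install is typically along these lines. The package and prototype names below are assumptions and can differ between Kubeflow releases, so prefer the official steps if they disagree.

  ```
  cd ${KUBEFLOW_SRC}/${KFAPP}/ks_app
  # Package/prototype names may vary across Kubeflow releases.
  ks pkg install kubeflow/mpi-job
  ks generate mpi-operator mpi-operator
  ks apply default -c mpi-operator
  ```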
- Deploy the Amazon FSx CSI Plugin.

  ```
  cd ${KUBEFLOW_SRC}/${KFAPP}/ks_app
  export COMPONENT=aws-fsx-csi-driver
  ks generate aws-fsx-csi-driver ${COMPONENT}
  ks apply default -c ${COMPONENT}
  ```
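  Before moving on, it is worth confirming that the driver pods are running; the namespace they land in can vary, so a broad filter is enough here.

  ```
  # The FSx CSI driver pods should all be in the Running state.
  kubectl get pods --all-namespaces | grep fsx
  ```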
- Prepare the Persistent Volume (PV), Persistent Volume Claim (PVC), and Storage Class. Go to the FSx console and replace `fsxId` and `dnsName` with your FSx info.

  ```
  cd ${KUBEFLOW_SRC}/${KFAPP}/ks_app
  export COMPONENT=fsx-static-storage
  ks generate aws-fsx-pv-static ${COMPONENT} --fsxId=fs-048xxxx7c25 --dnsName=fs-048xxxx7c25.fsx.us-west-2.amazonaws.com
  ks apply default -c ${COMPONENT}
  ```
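  Once applied, the PVC should bind to the static PV. Assuming the PVC is created in the `kubeflow` namespace, like the rest of the components in this guide, you can verify it like this.

  ```
  # The PV is cluster-scoped; the PVC should report STATUS "Bound".
  kubectl get pv
  kubectl -n kubeflow get pvc
  ```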
- Prepare the training job. Check here for more details.

  ```
  export JOB_NAME=tf-resnet50-horovod-job
  ks generate mpi-job-custom ${JOB_NAME}

  ks param set ${JOB_NAME} image "seedjeffwan/eks-dl-benchmark:cuda10-tf1.13.1-hvd0.16.0-py3.5"
  ks param set ${JOB_NAME} replicas 2
  ks param set ${JOB_NAME} gpusPerReplica 4

  EXEC="mpirun,-mca,btl_tcp_if_exclude,lo,-mca,pml,ob1,-mca,btl,^openib,--bind-to,none,-map-by,slot,-x,LD_LIBRARY_PATH,-x,PATH,-x,NCCL_DEBUG=INFO,python,models/resnet/tensorflow/train_imagenet_resnet_hvd.py,--batch_size=256,--model=resnet50,--num_batches=300,--fp16,--display_every=50,--lr_decay_mode=poly,--data_dir=/data/imagenet/train"
  ```

  NOTE: Instead of using synthetic data, the job will read real data from `--data_dir`.
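  With `replicas` set to 2 and `gpusPerReplica` set to 4, the job asks for 8 GPUs in total, so it helps to review the generated parameters and confirm that the worker nodes actually expose that many GPUs before deploying.

  ```
  # Review the parameters set on the component.
  ks param list ${JOB_NAME}
  # The summed nvidia.com/gpu capacity across nodes should cover
  # replicas x gpusPerReplica (2 x 4 = 8 here).
  kubectl describe nodes | grep "nvidia.com/gpu"
  ```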
- Deploy the training job.

  ```
  ks apply default -c ${JOB_NAME}
  ```
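  Once applied, the mpi-operator turns the component into an MPIJob plus launcher and worker pods; assuming the `kubeflow` namespace, they can be listed like this.

  ```
  # List the MPIJob and the pods it created.
  kubectl -n kubeflow get mpijobs
  kubectl -n kubeflow get pods -l mpi_job_name=${JOB_NAME}
  ```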
- Check pod status and logs.

  ```
  POD_NAME=$(kubectl -n kubeflow get pods -l mpi_job_name=${JOB_NAME},mpi_role_type=launcher -o name)
  kubectl -n kubeflow logs -f ${POD_NAME}
  ```

  Here is a sample output.
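  The launcher aggregates output from all workers. If you need per-worker logs, for example to inspect the NCCL ring setup reported by `NCCL_DEBUG=INFO`, you can target a worker pod directly; the `mpi_role_type=worker` label value is an assumption here, mirroring the launcher label above.

  ```
  # Tail the logs of one worker pod (label value assumed, mirroring the launcher's).
  WORKER=$(kubectl -n kubeflow get pods -l mpi_job_name=${JOB_NAME},mpi_role_type=worker -o name | head -n 1)
  kubectl -n kubeflow logs -f ${WORKER}
  ```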
If you work for Amazon, reach out to the authors of this document for access to the data. Otherwise, follow the instructions below.
- Download the ImageNet dataset and upload it to your S3 bucket. Use the `Download Original Images (for non-commercial research/educational use only)` option.
- TensorFlow consumes the ImageNet data in a specific format. You can preprocess the images by downloading and modifying this script:

  ```
  curl -O https://raw.githubusercontent.com/aws-samples/deep-learning-models/master/utils/tensorflow/preprocess_imagenet.sh
  chmod +x preprocess_imagenet.sh
  ```

  The following values need to be changed:

  - `[your imagenet account]`
  - `[your imagenet access key]`
  - `[PATH TO TFRECORD TRAINING DATASET]`
  - `[PATH TO RESIZED TFRECORD TRAINING DATASET]`
  - `[PATH TO TFRECORD VALIDATION DATASET]`
  - `[PATH TO RESIZED TFRECORD VALIDATION DATASET]`

  Execute the script:

  ```
  ./preprocess_imagenet.sh
  ```
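  If you prefer not to edit the script by hand, the placeholders can also be substituted with `sed` before running it. The replacement values below are purely illustrative, and the placeholder strings must match what your copy of `preprocess_imagenet.sh` actually contains.

  ```
  # Illustrative only: substitute placeholder values in the script.
  # Verify the exact placeholder strings in your copy of the script first.
  sed -i \
    -e 's|\[your imagenet account\]|my-imagenet-account|' \
    -e 's|\[your imagenet access key\]|my-imagenet-access-key|' \
    -e 's|\[PATH TO TFRECORD TRAINING DATASET\]|/data/tfrecord/train|' \
    -e 's|\[PATH TO RESIZED TFRECORD TRAINING DATASET\]|/data/tfrecord/train-resized|' \
    -e 's|\[PATH TO TFRECORD VALIDATION DATASET\]|/data/tfrecord/validation|' \
    -e 's|\[PATH TO RESIZED TFRECORD VALIDATION DATASET\]|/data/tfrecord/validation-resized|' \
    preprocess_imagenet.sh
  ```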