This document explains how to build an MNIST model using PyTorch on Amazon EKS.
This document assumes that you have a GPU-enabled Amazon EKS cluster available and running.
This guide uses the MNIST dataset, which contains a training set of 60,000 examples and a test set of 10,000 examples.
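MNIST is distributed as gzip-compressed IDX files; the training log later in this guide shows names such as `train-images-idx3-ubyte.gz`. As a rough sketch of what those files contain, the following Python snippet parses the header of an image file and a label file. The helper names `read_idx_images`, `read_idx_labels`, and `load_gz` are hypothetical and introduced here only for illustration; they are not part of the sample's `mnist.py`.

```python
import gzip
import struct

IMAGE_MAGIC = 2051  # 0x00000803: unsigned bytes, 3 dimensions
LABEL_MAGIC = 2049  # 0x00000801: unsigned bytes, 1 dimension


def read_idx_images(raw):
    """Parse decompressed IDX3 bytes into (count, rows, cols, pixels)."""
    # The header is four big-endian unsigned 32-bit integers.
    magic, count, rows, cols = struct.unpack(">IIII", raw[:16])
    if magic != IMAGE_MAGIC:
        raise ValueError(f"unexpected magic number {magic} for an image file")
    pixels = raw[16:]
    if len(pixels) != count * rows * cols:
        raise ValueError("pixel data does not match the header dimensions")
    return count, rows, cols, pixels


def read_idx_labels(raw):
    """Parse decompressed IDX1 bytes into a list of integer labels."""
    # The header is two big-endian unsigned 32-bit integers.
    magic, count = struct.unpack(">II", raw[:8])
    if magic != LABEL_MAGIC:
        raise ValueError(f"unexpected magic number {magic} for a label file")
    labels = list(raw[8:])
    if len(labels) != count:
        raise ValueError("label data does not match the header count")
    return labels


def load_gz(path):
    """Read and decompress a .gz file from disk (hypothetical helper)."""
    with gzip.open(path, "rb") as f:
        return f.read()


if __name__ == "__main__":
    # Tiny synthetic stand-ins: two 2x2 images and their two labels.
    images_raw = struct.pack(">IIII", IMAGE_MAGIC, 2, 2, 2) + bytes(range(8))
    labels_raw = struct.pack(">II", LABEL_MAGIC, 2) + bytes([7, 3])
    count, rows, cols, _ = read_idx_images(images_raw)
    print(count, rows, cols, read_idx_labels(labels_raw))
```

For the real training files, the image header should report 60,000 images, matching the training set size quoted above.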
- You can use a pre-built Docker image `seedjeffwan/pytorch-dist-mnist-test:1.10`. This image uses `pytorch/pytorch:1.0-cuda10.0-cudnn7-runtime` as the base image. It comes bundled with PyTorch. It also has the training code and downloads the training and test data sets. It also stores the model using a volume mount `/mount`. This maps to the `/tmp` directory on the worker node.

  Alternatively, you can build a Docker image using the Dockerfile in `samples/mnist/training/pytorch/Dockerfile`: `docker build -t <dockerhub_username>/<repo_name>:<tag_name> .`
- Create a pod that will use this Docker image and run the MNIST training. First, make any required changes to the manifest at `samples/mnist/training/pytorch/pytorch_mnist_example.yaml`, then create it:

  ```
  kubectl create -f samples/mnist/training/pytorch/pytorch_mnist_example.yaml
  ```

  This starts the pods and begins the training. Check the status:

  ```
  kubectl get pods
  NAME                               READY   STATUS    RESTARTS   AGE
  pytorch-dist-mnist-gloo-master-0   1/1     Running   0          5s
  pytorch-dist-mnist-gloo-worker-0   1/1     Running   0          3s
  ```

- Check the progress of the training:

  ```
  kubectl logs -f pytorch-dist-mnist-gloo-master-0
  Using CUDA
  Using distributed PyTorch with gloo backend
  Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
  Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
  Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
  Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
  Processing...
  Done!
  Train Epoch: 1 [0/60000 (0%)] loss=2.3000
  Train Epoch: 1 [640/60000 (1%)] loss=2.2135
  Train Epoch: 1 [1280/60000 (2%)] loss=2.1705
  Train Epoch: 1 [1920/60000 (3%)] loss=2.0767
  Train Epoch: 1 [2560/60000 (4%)] loss=1.8682
  Train Epoch: 1 [3200/60000 (5%)] loss=1.4141
  ......
  ......
  Train Epoch: 1 [56960/60000 (95%)] loss=0.0755
  Train Epoch: 1 [57600/60000 (96%)] loss=0.1176
  Train Epoch: 1 [58240/60000 (97%)] loss=0.1918
  Train Epoch: 1 [58880/60000 (98%)] loss=0.2067
  Train Epoch: 1 [59520/60000 (99%)] loss=0.0639
  accuracy=0.9659
  ```
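To follow progress programmatically rather than by eye, the log lines shown above can be parsed. The following is a minimal sketch that assumes the log format stays exactly as in the sample output; `parse_training_log` is a hypothetical helper and is not provided by the sample.

```python
import re

# Matches lines such as: Train Epoch: 1 [640/60000 (1%)] loss=2.2135
TRAIN_LINE = re.compile(
    r"Train Epoch: (\d+) \[(\d+)/(\d+) \(\d+%\)\]\s+loss=(\d+(?:\.\d+)?)"
)
# Matches lines such as: accuracy=0.9659
ACCURACY_LINE = re.compile(r"accuracy=(\d+(?:\.\d+)?)")


def parse_training_log(text):
    """Extract training steps and accuracy values from log text.

    Returns (steps, accuracies), where steps is a list of
    (epoch, samples_seen, total_samples, loss) tuples.
    """
    steps = [
        (int(epoch), int(seen), int(total), float(loss))
        for epoch, seen, total, loss in TRAIN_LINE.findall(text)
    ]
    accuracies = [float(value) for value in ACCURACY_LINE.findall(text)]
    return steps, accuracies


if __name__ == "__main__":
    sample = (
        "Train Epoch: 1 [0/60000 (0%)] loss=2.3000\n"
        "Train Epoch: 1 [59520/60000 (99%)] loss=0.0639\n"
        "accuracy=0.9659\n"
    )
    steps, accuracies = parse_training_log(sample)
    print("last step:", steps[-1])
    print("accuracies:", accuracies)
```

One way to use it is to save the output of `kubectl logs pytorch-dist-mnist-gloo-master-0` to a file and pass the file contents to `parse_training_log`.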
When the pod starts, the following happens:

- Runs the `mnist.py` command (specified in the `ENTRYPOINT` of the Dockerfile, available at https://github.com/aws-samples/machine-learning-using-k8s/blob/master/samples/mnist/training/pytorch/Dockerfile)
  - Downloads the MNIST training and test data sets
    - Each set has images and labels that identify the image
  - Performs supervised learning
    - Runs 10 epochs using the training data with the specified parameters
    - For each epoch
      - Reads the training data
      - Builds the training model using the specified algorithm
      - Feeds the test data and matches with the expected output
      - Reports the accuracy, expected to improve with each run
  - Generated model is persisted to the worker host at `/tmp/mnist_cnn.pt`
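The steps above follow the usual train-then-evaluate loop. The sketch below mirrors that control flow in plain Python so that it runs without a GPU or PyTorch; it is not the sample's `mnist.py`. A tiny logistic-regression classifier on synthetic 2-D points stands in for the real model and the MNIST images, and the function names, the learning rate, the data sizes, and the output file name `toy_model.pkl` are all assumptions made for illustration.

```python
import math
import os
import pickle
import random
import tempfile


def make_dataset(rng, n):
    """Synthetic stand-in for MNIST: 2-D points labeled 0 or 1."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 2.0 if label == 1 else -2.0
        point = (rng.gauss(center, 0.5), rng.gauss(center, 0.5))
        data.append((point, label))
    return data


def predict(model, point):
    """Probability that the point belongs to class 1."""
    z = model["w"][0] * point[0] + model["w"][1] * point[1] + model["b"]
    return 1.0 / (1.0 + math.exp(-z))


def train_one_epoch(model, train_set, lr):
    """One pass of stochastic gradient descent over the training data."""
    for point, label in train_set:
        error = predict(model, point) - label
        model["w"][0] -= lr * error * point[0]
        model["w"][1] -= lr * error * point[1]
        model["b"] -= lr * error


def evaluate(model, test_set):
    """Fraction of test examples whose prediction matches the expected label."""
    correct = sum(
        1 for point, label in test_set
        if (predict(model, point) >= 0.5) == (label == 1)
    )
    return correct / len(test_set)


def run_training(epochs=10, lr=0.1, seed=0):
    rng = random.Random(seed)
    train_set = make_dataset(rng, 200)  # toy training set
    test_set = make_dataset(rng, 50)    # toy test set
    model = {"w": [0.0, 0.0], "b": 0.0}
    history = []
    for epoch in range(1, epochs + 1):
        train_one_epoch(model, train_set, lr)
        accuracy = evaluate(model, test_set)
        history.append(accuracy)
        print(f"epoch {epoch}: accuracy={accuracy:.4f}")
    # Persist the trained parameters; the real job writes /tmp/mnist_cnn.pt.
    path = os.path.join(tempfile.gettempdir(), "toy_model.pkl")
    with open(path, "wb") as f:
        pickle.dump(model, f)
    return history, path


if __name__ == "__main__":
    run_training()
```

In the real job, the per-epoch accuracy corresponds to the `accuracy=...` entry in the pod log, and the saved file corresponds to `/tmp/mnist_cnn.pt` on the worker host.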