This repository was archived by the owner on Nov 28, 2025. It is now read-only.

Commit 137ea6a (1 parent: 7269d7b)

Moving MultiWorkerTraining Examples and Documentation to the distribution_strategy folder

File tree

11 files changed: +193 −171 lines

README.md

Lines changed: 4 additions & 29 deletions

```diff
@@ -12,11 +12,11 @@ request.
 - [docker](docker) - Docker configuration for running TensorFlow on
   cluster managers.
 - [kubeflow](https://github.com/kubeflow/kubeflow) - A Kubernetes native platform for ML
-  * A K8s custom resource for running distributed [TensorFlow jobs](https://github.com/kubeflow/kubeflow/blob/master/user_guide.md#submitting-a-tensorflow-training-job)
+  * A K8s custom resource for running distributed [TensorFlow jobs](https://github.com/kubeflow/kubeflow/blob/master/user_guide.md#submitting-a-tensorflow-training-job)
   * Jupyter images for different versions of TensorFlow
   * [TFServing](https://github.com/kubeflow/kubeflow/blob/master/user_guide.md#serve-a-model-using-tensorflow-serving) Docker images and K8s templates
 - [kubernetes](kubernetes) - Templates for running distributed TensorFlow on
-  Kubernetes.
+  Kubernetes. For the most up-to-date examples, please also refer to the [distribution strategy](distribution_strategy) folder.
 - [marathon](marathon) - Templates for running distributed TensorFlow using
   Marathon, deployed on top of Mesos.
 - [hadoop](hadoop) - TFRecord file InputFormat/OutputFormat for Hadoop MapReduce
@@ -26,36 +26,12 @@ request.
 
 ## Distributed TensorFlow
 
-### Tensorflow 2
-
-For distributed training, the tensorflow server is implicitly started.
-The main configuration required by the tensorflow libraries is the cluster and local process configuration
-that can be passed as an environment variable.
-Refer to [Distributed TensorFlow Concepts](https://www.tensorflow.org/guide/distributed_training) for concepts.
-Refer to [Distributed TensorFlow Examples](https://www.tensorflow.org/tutorials/distribute/keras) for examples.
-
-#### Sample TF_CONFIG cluster configuration for distributed training
-
-```python
-os.environ["TF_CONFIG"] = json.dumps({
-    "cluster": {
-        "worker": ["host1:port", "host2:port", "host3:port"], # Worker IP/Port locations
-        "ps": ["host4:port", "host5:port"], # Parameter Server IP/Port Locations
-        "chief": ["host6:port"] # Chief worker location
-    },
-    "task": {"type": "worker", "index": 1} # Current Process configuration
-})
-```
-
-
-### Tensorflow 1
-
 See the [Distributed TensorFlow](https://www.tensorflow.org/deploy/distributed)
 documentation for a description of how it works. The examples in this
 repository focus on the most common form of distributed training: between-graph
 replication with asynchronous updates.
 
-#### Common Setup for distributed training
+### Common Setup for distributed training
 
 Every distributed training program has some common setup. First, define flags so
 that the worker knows about other workers and knows what role it plays in
@@ -97,8 +73,7 @@ if FLAGS.job_name == "ps":
 Afterwards, your code varies depending on the form of distributed training you
 intend on doing. The most common form is between-graph replication.
 
-#### Between-graph Replication
-
+### Between-graph Replication
 
 In this mode, each worker separately constructs the exact same graph. Each
 worker then runs the graph in isolation, only sharing gradients with the
```
Lines changed: 117 additions & 0 deletions

# MultiWorkerMirrored Training Strategy with examples

The steps below are meant to train models using the [MultiWorkerMirrored Strategy](https://www.tensorflow.org/api_docs/python/tf/distribute/experimental/MultiWorkerMirroredStrategy) with the TensorFlow 2.0 API on the Kubernetes platform.

Reference programs such as [keras_mnist.py](examples/keras_mnist.py) and
[custom_training_mnist.py](examples/custom_training_mnist.py) are available in the examples directory.

The Kubernetes manifest templates and other cluster-specific configuration are available in the [kubernetes](kubernetes) directory.

## Prerequisites

1. (Optional) It is recommended that you have a Google Cloud project. Either create a new project or use an existing one. Install the [gcloud command-line tools](https://cloud.google.com/functions/docs/quickstart) on your system, log in, set the project and zone, etc.

2. [Jinja templates](http://jinja.pocoo.org/) must be installed.

3. A Kubernetes cluster running Kubernetes 1.15 or above must be available. To create a test cluster on the local machine, [follow the steps here](https://kubernetes.io/docs/tutorials/kubernetes-basics/create-cluster/). Kubernetes clusters can also be created on all major cloud providers. For instance, here are instructions to [create GKE clusters](https://cloud.google.com/kubernetes-engine/docs/how-to/creating-a-regional-cluster). Make sure that you have at least 12 GB of RAM across all nodes in the cluster. This should also install the `kubectl` tool on your system.

4. Set the context for `kubectl` so that `kubectl` knows which cluster to use:

   ```bash
   kubectl config use-context <cluster_name>
   ```

5. Install [Docker](https://docs.docker.com/get-docker/) for your system, and create an account that you can associate with your container images.

6. For model storage and checkpointing, a [persistent-volume-claim](https://kubernetes.io/docs/concepts/storage/persistent-volumes/) needs to be available to mount onto the chief worker pod. The steps below include the YAML to create a persistent-volume-claim for GKE backed by GCEPersistentDisk.

### Steps to run the job

1. Follow the instructions for building and pushing the Docker image to a Docker registry in the [Docker README](examples/README.md).

2. Copy the template file:

   ```sh
   cp kubernetes/MultiWorkerMirroredTemplate.yaml.jinja myjob.template.jinja
   ```

3. Edit the `myjob.template.jinja` file to set the job parameters:
   1. `script` - the training program to run. This should be `keras_mnist.py`, `custom_training_mnist.py`, or `your_own_training_example.py`.

   2. `name` - the prefix attached to all the Kubernetes jobs created.

   3. `worker_replicas` - the number of parallel worker processes that train the example.

   4. `port` - the port used by TensorFlow worker processes to communicate with each other.

   5. `model_checkpoint_dir` - the directory where the model is checkpointed and saved from the chief worker process.

   6. `checkpoint_pvc_name` - the name of the persistent-volume-claim to be mounted at `model_checkpoint_dir`. This volume will contain the checkpointed model.

   7. `image` - the name of the Docker image created in step 1 that needs to be loaded onto the cluster.

   8. `deploy` - set to `True` when the manifest is actually expected to be deployed.

   9. `create_pvc_checkpoint` - creates a ReadWriteOnce persistent volume claim to checkpoint the model if needed. The name of the claim, `checkpoint_pvc_name`, should also be specified.

   10. `create_volume_inspector` - creates a pod to inspect the contents of the volume after the training job is complete. If this is `True`, `deploy` cannot be `True`, since the checkpoint volume can only be mounted as read-write by a single node; inspection cannot happen while training is happening.

4. Run the job:
   1. Create a namespace to run your training jobs:

      ```sh
      kubectl create namespace <namespace>
      ```

   2. [Optional] First set `deploy` to `False`, `create_pvc_checkpoint` to `True`, and set the name of `checkpoint_pvc_name` appropriately. Then run

      ```sh
      python ../../render_template.py myjob.template.jinja | kubectl create -n <namespace> -f -
      ```

      This will create a persistent volume claim where you can checkpoint your model.

   3. Set `deploy` to `True` with all parameters specified in step 3, and then run

      ```sh
      python ../../render_template.py myjob.template.jinja | kubectl create -n <namespace> -f -
      ```

      This will create the Kubernetes jobs on the cluster. Each job has a single service endpoint and a single pod that runs the training image. You can track the running jobs in the cluster by running

      ```sh
      kubectl get jobs -n <namespace>
      kubectl describe jobs -n <namespace>
      ```

      To inspect the training logs of the running jobs, run

      ```sh
      # Shows all the running pods
      kubectl get pods -n <namespace>
      kubectl logs -n <namespace> <pod-name>
      ```

   4. Once the jobs are finished (based on the logs/output of `kubectl get jobs`), the trained model can be inspected by a volume inspector pod. Set `deploy` to `False` and `create_volume_inspector` to `True`. Then run

      ```sh
      python ../../render_template.py myjob.template.jinja | kubectl create -n <namespace> -f -
      ```

      Then access the pod through an interactive shell:

      ```sh
      kubectl get pods -n <namespace>
      kubectl -n <namespace> exec --stdin --tty <volume-inspector-pod> -- /bin/bash
      ```

      The contents of the trained model are available for inspection at `model_checkpoint_dir`.
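The manifest template injects a per-worker `TF_CONFIG` value into each pod so that the workers can find one another. As a rough illustration of what each worker ends up seeing, the sketch below builds that value from the `name`, `worker_replicas`, and `port` parameters described above; the `<name>-worker-<i>` host naming is a hypothetical convention for this illustration, not necessarily what the template emits.

```python
import json
import os


def make_tf_config(job_name, worker_replicas, port, index):
    """Builds a TF_CONFIG value for one worker pod.

    job_name, worker_replicas, and port mirror the template parameters
    above; the host names are hypothetical cluster DNS names.
    """
    workers = [f"{job_name}-worker-{i}:{port}" for i in range(worker_replicas)]
    return json.dumps({
        "cluster": {"worker": workers},          # all peers, same list on every pod
        "task": {"type": "worker", "index": index},  # this pod's own role
    })


# Example: the second of three workers communicating on port 5000.
os.environ["TF_CONFIG"] = make_tf_config("myjob", 3, 5000, 1)
config = json.loads(os.environ["TF_CONFIG"])
print(config["task"]["index"])  # → 1
```

Every pod receives the same `cluster` section; only `task.index` differs, which is how each process knows which slot in the worker list it occupies.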
Lines changed: 13 additions & 0 deletions

```dockerfile
FROM tensorflow/tensorflow:nightly

# Keeps Python from generating .pyc files in the container
ENV PYTHONDONTWRITEBYTECODE=1

# Turns off buffering for easier container logging
ENV PYTHONUNBUFFERED=1

WORKDIR /app

COPY . /app/

ENTRYPOINT ["python", "/app/keras_mnist.py"]
```
Lines changed: 53 additions & 0 deletions

# TensorFlow Docker Images

This directory contains examples of MultiWorkerMirrored training along with the Dockerfile to build them.

- [Dockerfile](Dockerfile) contains all the dependencies required to build a container image for the training examples using Docker.
- [keras_mnist.py](keras_mnist.py) demonstrates how to train an MNIST classifier using [tf.distribute.MultiWorkerMirroredStrategy and the Keras TensorFlow 2.0 API](https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras).
- [custom_training_mnist.py](custom_training_mnist.py) demonstrates how to train a Fashion-MNIST classifier using [tf.distribute.MultiWorkerMirroredStrategy and the TensorFlow 2.0 custom training loop APIs](https://www.tensorflow.org/tutorials/distribute/custom_training).

## Best Practices

- Always pin the TensorFlow version with the Docker image tag. This ensures that TensorFlow updates don't adversely impact your training program in future runs.
- When creating an image, specify version tags (see below). If you make code changes, increment the version. Cluster managers will not pull an updated Docker image if they have it cached. Versions also ensure that you have a single copy of the code running for each job.

## Building the Docker Files

Ensure that Docker is installed on your system.

First, pick an image name for the job. When running on a cluster manager, you will want to push your images to a container registry. Note that both the [Google Container Registry](https://cloud.google.com/container-registry/) and the [Amazon EC2 Container Registry](https://aws.amazon.com/ecr/) require special paths. We append `:v1` to version our images. Versioning images is strongly recommended for the reasons described in the Best Practices section.

```sh
docker build -t <image_name>:v1 -f Dockerfile .
# Use gcloud docker push instead if on Google Container Registry.
docker push <image_name>:v1
```

If you make any updates to the code, increment the version and rerun the above commands with the new version.

## Running the keras_mnist.py example

The [keras_mnist.py](keras_mnist.py) example demonstrates how to train an MNIST classifier using [tf.distribute.MultiWorkerMirroredStrategy and the Keras TensorFlow 2.0 API](https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras). The final model is saved to disk by the chief worker process; the disk is assumed to be mounted onto the running container by the cluster manager. The example assumes that the cluster configuration is passed in through the `TF_CONFIG` environment variable when deployed in the cluster.

## Running the custom_training_mnist.py example

The [custom_training_mnist.py](custom_training_mnist.py) example demonstrates how to train a Fashion-MNIST classifier using [tf.distribute.MultiWorkerMirroredStrategy and the TensorFlow 2.0 custom training loop APIs](https://www.tensorflow.org/tutorials/distribute/custom_training). The final model is saved to disk by the chief worker process; the disk is assumed to be mounted onto the running container by the cluster manager. The example assumes that the cluster configuration is passed in through the `TF_CONFIG` environment variable when deployed in the cluster.
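Both examples save the final model only from the chief process. With MultiWorkerMirroredStrategy, the convention used in the TensorFlow multi-worker tutorial is to treat worker 0 as the chief when no dedicated `chief` task is configured. A minimal sketch of that check, assuming the `TF_CONFIG` layout used by these examples (the helper name `is_chief` is illustrative, not necessarily what the scripts define):

```python
import json
import os


def is_chief():
    """Returns True when this process should write the final model.

    Assumes the TF_CONFIG layout described above; worker 0 acts as
    the chief when no dedicated "chief" task type is configured.
    """
    tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
    task = tf_config.get("task", {})
    if not task:
        return True  # single-process run: always the "chief"
    return task.get("type") == "chief" or (
        task.get("type") == "worker" and task.get("index") == 0
    )


os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["host1:5000", "host2:5000"]},
    "task": {"type": "worker", "index": 0},
})
print(is_chief())  # → True
```

Non-chief workers typically still save to a temporary path so that collective save operations stay in sync, but only the chief's copy lands in `model_checkpoint_dir`.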

docker/Dockerfile

Lines changed: 1 addition & 10 deletions

```diff
@@ -1,13 +1,4 @@
 FROM tensorflow/tensorflow:nightly
 
-# Keeps Python from generating .pyc files in the container
-ENV PYTHONDONTWRITEBYTECODE=1
-
-# Turns off buffering for easier container logging
-ENV PYTHONUNBUFFERED=1
-
-WORKDIR /app
-
-COPY . /app/
-
+COPY mnist.py /
 ENTRYPOINT ["python", "/mnist.py"]
```
