This repository was archived by the owner on Nov 28, 2025. It is now read-only.

Commit 137ea6a (1 parent: 7269d7b)

Moving MultiWorkerTraining Examples and Documentation to the distribution_strategy folder

File tree

11 files changed: +193 −171 lines

README.md

Lines changed: 4 additions & 29 deletions

```diff
@@ -12,11 +12,11 @@ request.
 - [docker](docker) - Docker configuration for running TensorFlow on
   cluster managers.
 - [kubeflow](https://github.com/kubeflow/kubeflow) - A Kubernetes native platform for ML
-  * A K8s custom resource for running distributed [TensorFlow jobs](https://github.com/kubeflow/kubeflow/blob/master/user_guide.md#submitting-a-tensorflow-training-job)
+  * A K8s custom resource for running distributed [TensorFlow jobs](https://github.com/kubeflow/kubeflow/blob/master/user_guide.md#submitting-a-tensorflow-training-job)
   * Jupyter images for different versions of TensorFlow
   * [TFServing](https://github.com/kubeflow/kubeflow/blob/master/user_guide.md#serve-a-model-using-tensorflow-serving) Docker images and K8s templates
 - [kubernetes](kubernetes) - Templates for running distributed TensorFlow on
-  Kubernetes.
+  Kubernetes. For the most up-to-date examples, please also refer to the [distribution strategy](distribution_strategy) folder.
 - [marathon](marathon) - Templates for running distributed TensorFlow using
   Marathon, deployed on top of Mesos.
 - [hadoop](hadoop) - TFRecord file InputFormat/OutputFormat for Hadoop MapReduce
@@ -26,36 +26,12 @@ request.
 
 ## Distributed TensorFlow
 
-### Tensorflow 2
-
-For distributed training, the tensorflow server is implicitly started.
-The main configuration required by the tensorflow libraries is the cluster and local process configuration
-that can be passed as an environment variable.
-Refer to [Distributed TensorFlow Concepts](https://www.tensorflow.org/guide/distributed_training) for concepts.
-Refer to [Distributed TensorFlow Examples](https://www.tensorflow.org/tutorials/distribute/keras) for examples.
-
-#### Sample TF_CONFIG cluster configuration for distributed training
-
-```python
-os.environ["TF_CONFIG"] = json.dumps({
-    "cluster": {
-        "worker": ["host1:port", "host2:port", "host3:port"], # Worker IP/Port locations
-        "ps": ["host4:port", "host5:port"], # Parameter Server IP/Port Locations
-        "chief": ["host6:port"] # Chief worker location
-    },
-    "task": {"type": "worker", "index": 1} # Current Process configuration
-})
-```
-
-
-### Tensorflow 1
-
 See the [Distributed TensorFlow](https://www.tensorflow.org/deploy/distributed)
 documentation for a description of how it works. The examples in this
 repository focus on the most common form of distributed training: between-graph
 replication with asynchronous updates.
 
-#### Common Setup for distributed training
+### Common Setup for distributed training
 
 Every distributed training program has some common setup. First, define flags so
 that the worker knows about other workers and knows what role it plays in
@@ -97,8 +73,7 @@ if FLAGS.job_name == "ps":
 Afterwards, your code varies depending on the form of distributed training you
 intend on doing. The most common form is between-graph replication.
 
-#### Between-graph Replication
-
+### Between-graph Replication
 
 In this mode, each worker separately constructs the exact same graph. Each
 worker then runs the graph in isolation, only sharing gradients with the
```
Lines changed: 117 additions & 0 deletions

# MultiWorkerMirrored Training Strategy with examples

The steps below are meant to train models using the [MultiWorkerMirrored Strategy](https://www.tensorflow.org/api_docs/python/tf/distribute/experimental/MultiWorkerMirroredStrategy) with the TensorFlow 2.0 API on the Kubernetes platform.

Reference programs such as [keras_mnist.py](examples/keras_mnist.py) and
[custom_training_mnist.py](examples/custom_training_mnist.py) are available in the examples directory.

The Kubernetes manifest templates and other cluster-specific configuration are available in the [kubernetes](kubernetes) directory.

## Prerequisites

1. (Optional) It is recommended that you have a Google Cloud project. Either create a new project or use an existing one. Install the [gcloud command-line tools](https://cloud.google.com/functions/docs/quickstart) on your system, log in, set the project and zone, etc.

2. [Jinja templates](http://jinja.pocoo.org/) must be installed.

3. A Kubernetes cluster running Kubernetes 1.15 or above must be available. To create a test cluster on the local machine, [follow the steps here](https://kubernetes.io/docs/tutorials/kubernetes-basics/create-cluster/). Kubernetes clusters can also be created on all major cloud providers. For instance, here are instructions to [create GKE clusters](https://cloud.google.com/kubernetes-engine/docs/how-to/creating-a-regional-cluster). Make sure that you have at least 12 GB of RAM across all nodes in the cluster. This should also install the `kubectl` tool on your system.

4. Set the context for `kubectl` so that `kubectl` knows which cluster to use:

   ```bash
   kubectl config use-context <cluster_name>
   ```

5. Install [Docker](https://docs.docker.com/get-docker/) for your system, and create an account that you can associate with your container images.

6. For model storage and checkpointing, a [persistent-volume-claim](https://kubernetes.io/docs/concepts/storage/persistent-volumes/) needs to be available to mount onto the chief worker pod. The steps below include the YAML to create a persistent-volume-claim for GKE backed by GCEPersistentDisk.

### Steps to run the job

1. Follow the instructions for building and pushing the Docker image to a Docker registry in the [Docker README](examples/README.md).

2. Copy the template file:

   ```sh
   cp kubernetes/MultiWorkerMirroredTemplate.yaml.jinja myjob.template.jinja
   ```

3. Edit the `myjob.template.jinja` file to set the job parameters:
   1. `script` - the training program to run. This should be `keras_mnist.py`, `custom_training_mnist.py`, or `your_own_training_example.py`.

   2. `name` - the prefix attached to all the Kubernetes jobs created.

   3. `worker_replicas` - the number of parallel worker processes that train the example.

   4. `port` - the port used by TensorFlow worker processes to communicate with each other.

   5. `model_checkpoint_dir` - the directory where the model is checkpointed and saved from the chief worker process.

   6. `checkpoint_pvc_name` - the name of the persistent-volume-claim to be mounted at `model_checkpoint_dir`. This volume will contain the checkpointed model.

   7. `image` - the name of the Docker image created in step 1 that needs to be loaded onto the cluster.

   8. `deploy` - set to `True` when the manifest is actually expected to be deployed.

   9. `create_pvc_checkpoint` - creates a ReadWriteOnce persistent volume claim to checkpoint the model if needed. The name of the claim, `checkpoint_pvc_name`, should also be specified.

   10. `create_volume_inspector` - creates a pod to inspect the contents of the volume after the training job is complete. If this is `True`, `deploy` cannot be `True`, since the checkpoint volume can only be mounted as read-write by a single node; inspection cannot happen while training is happening.

4. Run the job:
   1. Create a namespace to run your training jobs:

      ```sh
      kubectl create namespace <namespace>
      ```

   2. [Optional] First set `deploy` to `False`, `create_pvc_checkpoint` to `True`, and set the name of `checkpoint_pvc_name` appropriately. Then run

      ```sh
      python ../../render_template.py myjob.template.jinja | kubectl create -n <namespace> -f -
      ```

      This will create a persistent volume claim where you can checkpoint your model.

   3. Set `deploy` to `True` with all parameters specified in step 3, and then run

      ```sh
      python ../../render_template.py myjob.template.jinja | kubectl create -n <namespace> -f -
      ```

      This will create the Kubernetes jobs on the cluster. Each job has a single service endpoint and a single pod that runs the training image. You can track the running jobs in the cluster by running

      ```sh
      kubectl get jobs -n <namespace>
      kubectl describe jobs -n <namespace>
      ```

      To inspect the training logs of the running jobs, run

      ```sh
      # Shows all the running pods
      kubectl get pods -n <namespace>
      kubectl logs -n <namespace> <pod-name>
      ```

   4. Once the jobs are finished (based on the logs/output of `kubectl get jobs`), the trained model can be inspected by a volume inspector pod. Set `deploy` to `False` and `create_volume_inspector` to `True`. Then run

      ```sh
      python ../../render_template.py myjob.template.jinja | kubectl create -n <namespace> -f -
      ```

      Then access the pod through an interactive shell:

      ```sh
      kubectl get pods -n <namespace>
      kubectl -n <namespace> exec --stdin --tty <volume-inspector-pod> -- /bin/bash
      ```

      The contents of the trained model are available for inspection at `model_checkpoint_dir`.
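The manifest template injects a per-worker `TF_CONFIG` value into each pod so that the workers can find one another. As a rough illustration of what each worker ends up seeing, the sketch below builds that value from the `name`, `worker_replicas`, and `port` parameters described above; the `<name>-worker-<i>` host naming is a hypothetical convention for this illustration, not necessarily what the template emits.

```python
import json
import os


def make_tf_config(job_name, worker_replicas, port, index):
    """Builds a TF_CONFIG value for one worker pod.

    job_name, worker_replicas, and port mirror the template parameters
    above; the host names are hypothetical cluster DNS names.
    """
    workers = [f"{job_name}-worker-{i}:{port}" for i in range(worker_replicas)]
    return json.dumps({
        "cluster": {"worker": workers},          # all peers, same list on every pod
        "task": {"type": "worker", "index": index},  # this pod's own role
    })


# Example: the second of three workers communicating on port 5000.
os.environ["TF_CONFIG"] = make_tf_config("myjob", 3, 5000, 1)
config = json.loads(os.environ["TF_CONFIG"])
print(config["task"]["index"])  # → 1
```

Every pod receives the same `cluster` section; only `task.index` differs, which is how each process knows which slot in the worker list it occupies.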
Lines changed: 13 additions & 0 deletions

```dockerfile
FROM tensorflow/tensorflow:nightly

# Keeps Python from generating .pyc files in the container
ENV PYTHONDONTWRITEBYTECODE=1

# Turns off buffering for easier container logging
ENV PYTHONUNBUFFERED=1

WORKDIR /app

COPY . /app/

ENTRYPOINT ["python", "/app/keras_mnist.py"]
```
Lines changed: 53 additions & 0 deletions

# TensorFlow Docker Images

This directory contains examples of MultiWorkerMirrored training along with the Dockerfile to build them.

- [Dockerfile](Dockerfile) contains all the dependencies required to build a container image for the training examples using Docker.
- [keras_mnist.py](keras_mnist.py) demonstrates how to train an MNIST classifier using [tf.distribute.MultiWorkerMirroredStrategy and the Keras TensorFlow 2.0 API](https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras).
- [custom_training_mnist.py](custom_training_mnist.py) demonstrates how to train a Fashion-MNIST classifier using [tf.distribute.MultiWorkerMirroredStrategy and the TensorFlow 2.0 custom training loop APIs](https://www.tensorflow.org/tutorials/distribute/custom_training).

## Best Practices

- Always pin the TensorFlow version with the Docker image tag. This ensures that TensorFlow updates don't adversely impact your training program in future runs.
- When creating an image, specify version tags (see below). If you make code changes, increment the version. Cluster managers will not pull an updated Docker image if they have it cached. Versions also ensure that you have a single copy of the code running for each job.

## Building the Docker Files

Ensure that Docker is installed on your system.

First, pick an image name for the job. When running on a cluster manager, you will want to push your images to a container registry. Note that both the [Google Container Registry](https://cloud.google.com/container-registry/) and the [Amazon EC2 Container Registry](https://aws.amazon.com/ecr/) require special paths. We append `:v1` to version our images. Versioning images is strongly recommended for the reasons described in the Best Practices section.

```sh
docker build -t <image_name>:v1 -f Dockerfile .
# Use gcloud docker push instead if on Google Container Registry.
docker push <image_name>:v1
```

If you make any updates to the code, increment the version and rerun the above commands with the new version.

## Running the keras_mnist.py example

The [keras_mnist.py](keras_mnist.py) example demonstrates how to train an MNIST classifier using [tf.distribute.MultiWorkerMirroredStrategy and the Keras TensorFlow 2.0 API](https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras). The final model is saved to disk by the chief worker process; the disk is assumed to be mounted onto the running container by the cluster manager. The example assumes that the cluster configuration is passed in through the `TF_CONFIG` environment variable when deployed in the cluster.

## Running the custom_training_mnist.py example

The [custom_training_mnist.py](custom_training_mnist.py) example demonstrates how to train a Fashion-MNIST classifier using [tf.distribute.MultiWorkerMirroredStrategy and the TensorFlow 2.0 custom training loop APIs](https://www.tensorflow.org/tutorials/distribute/custom_training). The final model is saved to disk by the chief worker process; the disk is assumed to be mounted onto the running container by the cluster manager. The example assumes that the cluster configuration is passed in through the `TF_CONFIG` environment variable when deployed in the cluster.
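Both examples save the final model only from the chief process. With MultiWorkerMirroredStrategy, the convention used in the TensorFlow multi-worker tutorial is to treat worker 0 as the chief when no dedicated `chief` task is configured. A minimal sketch of that check, assuming the `TF_CONFIG` layout used by these examples (the helper name `is_chief` is illustrative, not necessarily what the scripts define):

```python
import json
import os


def is_chief():
    """Returns True when this process should write the final model.

    Assumes the TF_CONFIG layout described above; worker 0 acts as
    the chief when no dedicated "chief" task type is configured.
    """
    tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
    task = tf_config.get("task", {})
    if not task:
        return True  # single-process run: always the "chief"
    return task.get("type") == "chief" or (
        task.get("type") == "worker" and task.get("index") == 0
    )


os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["host1:5000", "host2:5000"]},
    "task": {"type": "worker", "index": 0},
})
print(is_chief())  # → True
```

Non-chief workers typically still save to a temporary path so that collective save operations stay in sync, but only the chief's copy lands in `model_checkpoint_dir`.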

docker/Dockerfile

Lines changed: 1 addition & 10 deletions

```diff
@@ -1,13 +1,4 @@
 FROM tensorflow/tensorflow:nightly
 
-# Keeps Python from generating .pyc files in the container
-ENV PYTHONDONTWRITEBYTECODE=1
-
-# Turns off buffering for easier container logging
-ENV PYTHONUNBUFFERED=1
-
-WORKDIR /app
-
-COPY . /app/
-
+COPY mnist.py /
 ENTRYPOINT ["python", "/mnist.py"]
```
