This repository was archived by the owner on Nov 28, 2025. It is now read-only.
distribution_strategy/multi_worker_mirrored_strategy/README.md
+115 -5 (115 additions & 5 deletions)
@@ -4,7 +4,7 @@
The steps below train models with [MultiWorkerMirrored Strategy](https://www.tensorflow.org/api_docs/python/tf/distribute/experimental/MultiWorkerMirroredStrategy) using the TensorFlow 2.x API on the Kubernetes platform.

Reference programs such as [keras_mnist.py](examples/keras_mnist.py),
[custom_training_mnist.py](examples/custom_training_mnist.py) and [keras_resnet_cifar.py](examples/keras_resnet_cifar.py) are available in the examples directory.

The Kubernetes manifest templates and other cluster-specific configuration are available in the [kubernetes](kubernetes) directory.
@@ -28,14 +28,39 @@ here are instructions to [create GKE clusters](https://cloud.google.com/kubernet
5. Install [Docker](https://docs.docker.com/get-docker/) for your system, while also creating an account that you can associate with your container images.

6. For the MNIST examples, a [persistent-volume-claim](https://kubernetes.io/docs/concepts/storage/persistent-volumes/) needs to be available to mount onto the chief worker pod for model storage and checkpointing. The steps below include the YAML to create a persistent-volume-claim for GKE backed by GCEPersistentDisk.
3. Create three buckets for model data, checkpoints and training logs using either the GCP web UI or the gsutil tool (included with the gcloud tool you installed above):

```bash
gsutil mb gs://<bucket_name>
```

You will use these bucket names to modify `data_dir`, `log_dir` and `model_dir` in step #4.

4. Download the CIFAR-10 data and place it in your `data_dir` bucket. Head to the [ResNet in TensorFlow](https://github.com/tensorflow/models/tree/r1.13.0/official/resnet#cifar-10) directory to obtain the CIFAR-10 data. Alternatively, you can use this [direct link](https://www.cs.toronto.edu/~kriz/cifar-10-binary.tar.gz) to download and extract the data yourself.

```bash
python cifar10_download_and_extract.py
```

Upload the contents of the cifar-10-batches-bin directory to your `data_dir` bucket.
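Before uploading, it can help to sanity-check the extracted files. The sketch below assumes the standard CIFAR-10 binary layout, in which each record is 1 label byte followed by 3072 pixel bytes (a 32x32 RGB image). The helper names `count_records` and `read_labels` are illustrative and are not part of this repository.

```python
import os
import tempfile

# Standard CIFAR-10 binary layout: 1 label byte + 32*32*3 pixel bytes per record.
LABEL_BYTES = 1
IMAGE_BYTES = 32 * 32 * 3
RECORD_BYTES = LABEL_BYTES + IMAGE_BYTES


def count_records(path):
    """Return the number of records in a CIFAR-10 .bin file.

    Raises ValueError if the file size is not a whole number of records,
    which usually indicates a truncated download or extraction.
    """
    size = os.path.getsize(path)
    if size % RECORD_BYTES != 0:
        raise ValueError(
            f"{path}: size {size} is not a multiple of {RECORD_BYTES}"
        )
    return size // RECORD_BYTES


def read_labels(path):
    """Return the class label (first byte) of every record in the file."""
    labels = []
    with open(path, "rb") as f:
        while True:
            record = f.read(RECORD_BYTES)
            if not record:
                break
            labels.append(record[0])
    return labels


if __name__ == "__main__":
    # Demonstrate on a tiny synthetic file with two all-zero images.
    fd, demo_path = tempfile.mkstemp(suffix=".bin")
    with os.fdopen(fd, "wb") as f:
        f.write(bytes([3]) + bytes(IMAGE_BYTES))
        f.write(bytes([7]) + bytes(IMAGE_BYTES))
    print(count_records(demo_path), read_labels(demo_path))
    os.remove(demo_path)
```

For a real batch file such as `cifar-10-batches-bin/data_batch_1.bin`, `count_records` should report 10000, since each CIFAR-10 batch holds 10,000 images.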
This will create the Kubernetes jobs on the cluster. Each Job has a single service-endpoint and a single pod that runs the training image. You can track the running jobs in the cluster by running

```sh
kubectl get jobs -n <namespace>
kubectl describe jobs -n <namespace>
```

By default, this also deploys TensorBoard on the cluster.

```sh
kubectl get services -n <namespace> | grep tensorboard
```

Note the external-ip corresponding to the service and the previously configured `port` in the yaml.
The TensorBoard service should be accessible through the web at `http://tensorboard-external-ip:port`.

3. The final model should be available in the GCP bucket corresponding to `model_dir` configured in the yaml.
distribution_strategy/multi_worker_mirrored_strategy/examples/README.md
+10 -1 (10 additions & 1 deletion)
@@ -3,11 +3,13 @@
This directory contains examples of MultiWorkerMirrored training along with the Dockerfiles to build them.

- [Dockerfile](Dockerfile) contains all dependencies required to build a container image using Docker with the training examples
- [Dockerfile.gpu](Dockerfile.gpu) contains all dependencies required to build a container image using Docker with GPU support and the TensorFlow Model Garden
- [keras_mnist.py](mnist.py) demonstrates how to train an MNIST classifier using
[tf.distribute.MultiWorkerMirroredStrategy and Keras TensorFlow 2.0 API](https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras).
- [custom_training_mnist.py](mnist.py) demonstrates how to train a Fashion MNIST classifier using
[tf.distribute.MultiWorkerMirroredStrategy and TensorFlow 2.0 Custom Training Loop APIs](https://www.tensorflow.org/tutorials/distribute/custom_training).
- [keras_resnet_cifar.py](keras_resnet_cifar.py) demonstrates how to train the ResNet56 model on the CIFAR-10 dataset using
[tf.distribute.MultiWorkerMirroredStrategy and Keras TensorFlow 2.0 API](https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras).

## Best Practices

- Always pin the TensorFlow version with the Docker image tag. This ensures that
@@ -51,3 +53,10 @@ The [custom_training_mnist.py](mnist.py) example demonstrates how to train a fas
[tf.distribute.MultiWorkerMirroredStrategy and TensorFlow 2.0 Custom Training Loop APIs](https://www.tensorflow.org/tutorials/distribute/custom_training).
The final model is saved to disk by the chief worker process. The disk is assumed to be mounted onto the running container by the cluster manager.
The example assumes that the cluster configuration is passed in through the `TF_CONFIG` environment variable when deployed in the cluster.
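As a rough illustration of how a training script can decide whether it is the chief (and therefore the process that writes the final model), the sketch below parses `TF_CONFIG` with the standard library only. It assumes the documented `TF_CONFIG` shape (a `cluster` map plus a `task` with `type` and `index`) and the convention that worker 0 acts as chief when no dedicated `chief` job is listed. The function names and host names are hypothetical and are not taken from the examples in this directory.

```python
import json
import os


def parse_tf_config(env=None):
    """Return (cluster, task_type, task_index) parsed from TF_CONFIG.

    If TF_CONFIG is unset or empty, return ({}, None, None), which
    corresponds to ordinary single-process training.
    """
    env = os.environ if env is None else env
    raw = env.get("TF_CONFIG")
    if not raw:
        return {}, None, None
    config = json.loads(raw)
    task = config.get("task", {})
    return config.get("cluster", {}), task.get("type"), task.get("index")


def is_chief(cluster, task_type, task_index):
    """Return True if this process should act as the chief."""
    if task_type is None:
        # No cluster configuration: the only process is the chief.
        return True
    if task_type == "chief":
        return True
    # Without a dedicated chief job, worker 0 takes the chief role.
    return task_type == "worker" and task_index == 0 and "chief" not in cluster


if __name__ == "__main__":
    # Hypothetical two-worker cluster; host names are placeholders.
    example_env = {
        "TF_CONFIG": json.dumps({
            "cluster": {"worker": ["worker-0:5000", "worker-1:5000"]},
            "task": {"type": "worker", "index": 0},
        })
    }
    cluster, task_type, task_index = parse_tf_config(example_env)
    print("chief" if is_chief(cluster, task_type, task_index) else "non-chief")
```

Passing the environment as a plain dictionary keeps the helper easy to test; in a deployed pod you would call `parse_tf_config()` with no arguments so that it reads `os.environ`.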
## Running the keras_resnet_cifar.py example

The [keras_resnet_cifar.py](keras_resnet_cifar.py) example demonstrates how to train a ResNet56 model on the CIFAR-10 dataset using
[tf.distribute.MultiWorkerMirroredStrategy and Keras TensorFlow 2.0 API](https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras).
The final model is saved to the GCP storage bucket.
The example assumes that the cluster configuration is passed in through the `TF_CONFIG` environment variable when deployed in the cluster.
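For reference, the value that a cluster manager places in `TF_CONFIG` for each pod generally looks like the output of the sketch below. This is a minimal illustration of the documented format, not the way the manifests in this repository generate it; `build_tf_config` and the host names are hypothetical.

```python
import json


def build_tf_config(worker_hosts, index):
    """Return the TF_CONFIG JSON string for the worker at position `index`.

    `worker_hosts` is the ordered list of "host:port" addresses shared by
    every worker; only the task index differs from pod to pod.
    """
    if not 0 <= index < len(worker_hosts):
        raise IndexError(f"index {index} is outside 0..{len(worker_hosts) - 1}")
    return json.dumps({
        "cluster": {"worker": list(worker_hosts)},
        "task": {"type": "worker", "index": index},
    })


if __name__ == "__main__":
    # Placeholder addresses; real values come from the cluster's services.
    hosts = ["worker-0:5000", "worker-1:5000"]
    for i in range(len(hosts)):
        print(build_tf_config(hosts, i))
```

Every worker receives the same `cluster` section and a different `task.index`, which is what lets the strategy discover its peers.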