diff --git a/README.md b/README.md index 099e5185..e2698296 100644 --- a/README.md +++ b/README.md @@ -16,7 +16,7 @@ request. * Jupyter images for different versions of TensorFlow * [TFServing](https://github.com/kubeflow/kubeflow/blob/master/user_guide.md#serve-a-model-using-tensorflow-serving) Docker images and K8s templates - [kubernetes](kubernetes) - Templates for running distributed TensorFlow on - Kubernetes. + Kubernetes. For the most upto-date examples, please also refer to the [distribution strategy](distribution_strategy) folder. - [marathon](marathon) - Templates for running distributed TensorFlow using Marathon, deployed on top of Mesos. - [hadoop](hadoop) - TFRecord file InputFormat/OutputFormat for Hadoop MapReduce diff --git a/distribution_strategy/multi_worker_mirrored_strategy/README.md b/distribution_strategy/multi_worker_mirrored_strategy/README.md new file mode 100644 index 00000000..1c68d154 --- /dev/null +++ b/distribution_strategy/multi_worker_mirrored_strategy/README.md @@ -0,0 +1,227 @@ + +# MultiWorkerMirrored Training Strategy with examples + +The steps below are meant to train models using [MultiWorkerMirrored Strategy](https://www.tensorflow.org/api_docs/python/tf/distribute/experimental/MultiWorkerMirroredStrategy) using the tensorflow 2.x API on the Kubernetes platform. + +Reference programs such as [keras_mnist.py](examples/keras_mnist.py) and +[custom_training_mnist.py](examples/custom_training_mnist.py) and [keras_resnet_cifar.py](examples/keras_resnet_cifar.py) are available in the examples directory. + +The Kubernetes manifest templates and other cluster specific configuration is available in the [kubernetes](kubernetes) directory + +## Prerequisites + +1. (Optional) It is recommended that you have a Google Cloud project. Either create a new project or use an existing one. Install + [gcloud commandline tools](https://cloud.google.com/functions/docs/quickstart) + on your system, login, set project and zone, etc. + +2. [Jinja templates](http://jinja.pocoo.org/) must be installed. + +3. A Kubernetes cluster running Kubernetes 1.15 or above must be available. To create a test +cluster on the local machine, [follow steps here](https://kubernetes.io/docs/tutorials/kubernetes-basics/create-cluster/). Kubernetes clusters can also be created on all major cloud providers. For instance, +here are instructions to [create GKE clusters](https://cloud.google.com/kubernetes-engine/docs/how-to/creating-a-regional-cluster). Make sure that you have atleast 12 G of RAM between all nodes in the clusters. This should also install the `kubectl` tool on your system + +4. Set context for `kubectl` so that `kubectl` knows which cluster to use: + + ```bash + kubectl config use-context + ``` + +5. Install [Docker](https://docs.docker.com/get-docker/) for your system, while also creating an account that you can associate with your container images. + +6. For the mnist examples, for model storage and checkpointing, a [persistent-volume-claim](https://kubernetes.io/docs/concepts/storage/persistent-volumes/) needs to be available to mount onto the chief worker pod. The steps below include the yaml to create a persistent-volume-claim for GKE backed by GCEPersistentDisk. + +### Additional prerequisites for resnet56 example + +1. Create a + [service account](https://cloud.google.com/compute/docs/access/service-accounts) + and download its key file in JSON format. 
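+   The JSON key itself can be downloaded with `gcloud` once the account has been
+   created and granted the role below (the account name and project ID in this
+   sketch are placeholders for your own values):
+
+   ```bash
+   gcloud iam service-accounts keys create key.json \
+     --iam-account=<service-account-name>@<project-id>.iam.gserviceaccount.com
+   ```
+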
Assign Storage Admin role for + [Google Cloud Storage](https://cloud.google.com/storage/) to this service account: + + ```bash + gcloud iam service-accounts create --display-name="" + ``` + + ```bash + gcloud projects add-iam-policy-binding \ + --member="serviceAccount:@.iam.gserviceaccount.com" \ + --role="roles/storage.admin" + ``` +2. Create a Kubernetes secret from the JSON key file of your service account: + + ```bash + kubectl create secret generic credential --from-file=key.json= + ``` + +3. For GPU based training, ensure your kubernetes cluster has a node-pool with gpu enabled. + The steps to achieve this on GKE are available [here](https://cloud.google.com/kubernetes-engine/docs/how-to/gpus) + +## Steps to train mnist examples + +1. Follow the instructions for building and pushing the Docker image to a docker registry + in the [Docker README](examples/README.md). + +2. Copy the template file `MultiWorkerMirroredTemplate.yaml.jinja`: + + ```sh + cp kubernetes/MultiWorkerMirroredTemplate.yaml.jinja myjob.template.jinja + ``` + +3. Edit the `myjob.template.jinja` file to edit job parameters. + 1. `script` - which training program needs to be run. This should be either + `keras_mnist.py` or `custom_training_mnist.py` or `your_own_training_example.py` + + 2. `name` - the prefix attached to all the Kubernetes jobs created + + 3. `worker_replicas` - number of parallel worker processes that train the example + + 4. `port` - the port used by tensorflow worker processes to communicate with each other + + 5. `checkpoint_pvc_name` - name of the persistent-volume-claim that will contain the checkpointed model. + + 6. `model_checkpoint_dir` - mount location for inspecting the trained model in the volume inspector pod. Meant to be set if Volume inspector pod is mounted. + + 7. `image` - name of the docker image created in step 2 that needs to be loaded onto the cluster + + 8. `deploy` - set to True when the manifest is actually expected to be deployed + + 9. `create_pvc_checkpoint` - Creates a ReadWriteOnce persistent volume claim to checkpoint the model if needed. The name of the claim `checkpoint_pvc_name` should also be specified. + + 10. `create_volume_inspector` - Create a pod to inspect the contents of the volume after the training job is complete. If this is `True`, `deploy` cannot be `True` since the checkpoint volume can be mounted as read-write by a single node. Inspection cannot happen when training is happenning. + +4. Run the job: + 1. Create a namespace to run your training jobs + + ```sh + kubectl create namespace + ``` + + 2. [Optional: If Persistent volume does not already exist on cluster] First set `deploy` to `False`, `create_pvc_checkpoint` to `True` and set the name of `checkpoint_pvc_name` appropriately in the .jinja file. Then run + + ```sh + python ../../render_template.py myjob.template.jinja | kubectl apply -n -f - + ``` + + This will create a persistent volume claim where you can checkpoint your image. In GKE, this claim will auto-create a GCE persistent disk resource to back up the claim. + + 3. Set `deploy` to `True`, `create_pvc_checkpoint` to `False`, with all parameters specified in step 4 and then run + + ```sh + python ../../render_template.py myjob.template.jinja | kubectl apply -n -f - + ``` + + This will create the Kubernetes jobs on the clusters. Each Job has a single service-endpoint and a single pod that runs the training image. 
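+      Each worker pod receives its view of the cluster through the `TF_CONFIG`
+      environment variable injected by the rendered manifest. As an illustration,
+      with the template's default values (`name` set to `tf-learning`, `port` 5000
+      and two workers), worker 0 would see roughly:
+
+      ```sh
+      TF_CONFIG='{"cluster": {"worker": ["tf-learning-worker-0:5000", "tf-learning-worker-1:5000"]}, "task": {"type": "worker", "index": 0}}'
+      ```
+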
You can track the running jobs in the cluster by running + + ```sh + kubectl get jobs -n + kubectl describe jobs -n + ``` + + In order to inspect the trainining logs that are running in the jobs, run + + ```sh + # Shows all the running pods + kubectl get pods -n + kubectl logs -n -p + ``` + + 4. Once the jobs are finished (based on the logs/output of kubectl get jobs), + the trained model can be inspected by a volume inspector pod. Set `deploy` to `False` + and `create_volume_inspector` to True. Also set `model_checkpoint_dir` to indicate location where trained model will be mounted. Then run + + ```sh + python ../../render_template.py myjob.template.jinja | kubectl apply -n -f - + ``` + + This will create the volume inspector pod. Then, access the pod through ssh + + ```sh + kubectl get pods -n + kubectl -n exec --stdin --tty -- /bin/sh + ``` + + The contents of the trained model are available for inspection at `model_checkpoint_dir`. + +## Steps to train resnet examples + +1. Follow the instructions for building and pushing the Docker image using `Dockerfile.gpu` to a docker registry + in the [Docker README](examples/README.md). + +2. Copy the template file `EnhancedMultiWorkerMirroredTemplate.yaml.jinja` + + ```sh + cp kubernetes/EnhancedMultiWorkerMirroredTemplate.yaml.jinja myjob.template.jinja + ``` +3. Create three buckets for model data, checkpoints and training logs using either GCP web UI or gsutil tool (included with the gcloud tool you have installed above): + + ```bash + gsutil mb gs:// + ``` + You will use these bucket names to modify `data_dir`, `log_dir` and `model_dir` in step #4. + + +4. Download CIFAR-10 data and place them in your data_dir bucket. Head to the [ResNet in TensorFlow](https://github.com/tensorflow/models/tree/r1.13.0/official/resnet#cifar-10) directory to obtain CIFAR-10 data. Alternatively, you can use this [direct link](https://www.cs.toronto.edu/~kriz/cifar-10-binary.tar.gz) to download and extract the data yourself as well. + + ```bash + python cifar10_download_and_extract.py + ``` + + Upload the contents of cifar-10-batches-bin directory to your `data_dir` bucket. + + ```bash + gsutil -m cp cifar-10-batches-bin/* gs:/// + ``` + +5. Edit the `myjob.template.jinja` file to edit job parameters. + 1. `script` - which training program needs to be run. This should be either + `keras_resnet_cifar.py` or `your_own_training_example.py` + + 2. `name` - the prefix attached to all the Kubernetes jobs created + + 3. `worker_replicas` - number of parallel worker processes that train the example + + 4. `port` - the port used by tensorflow worker processes to communicate with each other. + + 5. `model_dir` - the GCP bucket path that stores the model checkoints `gs://model_dir/` + + 6. `image` - name of the docker image created in step 2 that needs to be loaded onto the cluster + + 7. `log_dir` - the GCP bucket path that where the logs are stored `gs://log_dir/` + + 8. `data_dir` - the GCP bucket path for the Cifar-10 dataset `gs://data_dir/` + + 9. `gcp_credential_secret` - the name of secret created in the kubernetes cluster that contains the service Account credentials + + 10. `batch_size` - the global batch size used for training + + 11. `num_train_epoch` - the number of training epochs + +4. Run the job: + 1. Create a namespace to run your training jobs + + ```sh + kubectl create namespace + ``` + + 2. 
Deploy the training workloads in the cluster + + ```sh + python ../../render_template.py myjob.template.jinja | kubectl apply -n -f - + ``` + + This will create the Kubernetes jobs on the clusters. Each Job has a single service-endpoint and a single pod that runs the training image. You can track the running jobs in the cluster by running + + ```sh + kubectl get jobs -n + kubectl describe jobs -n + ``` + + By default, this also deploys tensorboard on the cluster. + + ```sh + kubectl get services -n | grep tensorboard + ``` + + Note the external-ip corresponding to the service and the previously configured `port` in the yaml + The tensorboard service should be accessible through the web at `http://tensorboard-external-ip:port` + + 3. The final model should be available in the GCP bucket corresponding to `model_dir` configured in the yaml diff --git a/distribution_strategy/multi_worker_mirrored_strategy/examples/Dockerfile b/distribution_strategy/multi_worker_mirrored_strategy/examples/Dockerfile new file mode 100644 index 00000000..36aa8034 --- /dev/null +++ b/distribution_strategy/multi_worker_mirrored_strategy/examples/Dockerfile @@ -0,0 +1,13 @@ +FROM tensorflow/tensorflow:nightly + +# Keeps Python from generating .pyc files in the container +ENV PYTHONDONTWRITEBYTECODE=1 + +# Turns off buffering for easier container logging +ENV PYTHONUNBUFFERED=1 + +WORKDIR /app + +COPY . /app/ + +ENTRYPOINT ["python", "keras_mnist.py"] diff --git a/distribution_strategy/multi_worker_mirrored_strategy/examples/Dockerfile.gpu b/distribution_strategy/multi_worker_mirrored_strategy/examples/Dockerfile.gpu new file mode 100644 index 00000000..0ebb5928 --- /dev/null +++ b/distribution_strategy/multi_worker_mirrored_strategy/examples/Dockerfile.gpu @@ -0,0 +1,30 @@ +FROM tensorflow/tensorflow:2.3.1-gpu-jupyter + +RUN apt-get install -y python3 && \ + apt install python3-pip + +RUN pip3 install absl-py && \ + pip3 install portpicker + +# Install git +RUN apt-get update && \ + apt-get install -y git && \ + apt-get install -y vim + +WORKDIR /app + +RUN git clone --single-branch --branch benchmark https://github.com/tensorflow/models.git && \ + mv models tensorflow_models && \ + git clone https://github.com/tensorflow/model-optimization.git && \ + mv model-optimization tensorflow_model_optimization + +# Keeps Python from generating .pyc files in the container +ENV PYTHONDONTWRITEBYTECODE=1 +# Turns off buffering for easier container logging +ENV PYTHONUNBUFFERED=1 + +COPY . 
/app/
+
+ENV PYTHONPATH "${PYTHONPATH}:/:/app/tensorflow_models"
+
+CMD ["python", "resnet_cifar_multiworker_strategy_keras.py"]
\ No newline at end of file
diff --git a/distribution_strategy/multi_worker_mirrored_strategy/examples/README.md b/distribution_strategy/multi_worker_mirrored_strategy/examples/README.md
new file mode 100644
index 00000000..4b5f5682
--- /dev/null
+++ b/distribution_strategy/multi_worker_mirrored_strategy/examples/README.md
@@ -0,0 +1,62 @@
+# TensorFlow Docker Images
+
+This directory contains examples of MultiWorkerMirrored training along with the Dockerfiles used to build them.
+
+- [Dockerfile](Dockerfile) contains all dependencies required to build a container image using docker with the training examples
+- [Dockerfile.gpu](Dockerfile.gpu) contains all dependencies required to build a container image using docker with gpu and the tensorflow model garden
+- [keras_mnist.py](keras_mnist.py) demonstrates how to train an MNIST classifier using
+  [tf.distribute.MultiWorkerMirroredStrategy and Keras Tensorflow 2.0 API](https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras).
+- [custom_training_mnist.py](custom_training_mnist.py) demonstrates how to train a fashion MNIST classifier using
+  [tf.distribute.MultiWorkerMirroredStrategy and Tensorflow 2.0 Custom Training Loop APIs](https://www.tensorflow.org/tutorials/distribute/custom_training).
+- [keras_resnet_cifar.py](keras_resnet_cifar.py) demonstrates how to train the resnet56 model on the Cifar-10 dataset using
+  [tf.distribute.MultiWorkerMirroredStrategy and Keras Tensorflow 2.0 API](https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras).
+## Best Practices
+
+- Always pin the TensorFlow version with the Docker image tag. This ensures that
+  TensorFlow updates don't adversely impact your training program for future
+  runs.
+- When creating an image, specify version tags (see below). If you make code
+  changes, increment the version. Cluster managers will not pull an updated
+  Docker image if they have it cached. Also, versions ensure that you have
+  a single copy of the code running for each job.
+
+## Building the Docker Files
+
+Ensure that Docker is installed on your system.
+
+First, pick an image name for the job. When running on a cluster manager, you
+will want to push your images to a container registry. Note that both the
+[Google Container Registry](https://cloud.google.com/container-registry/)
+and the [Amazon EC2 Container Registry](https://aws.amazon.com/ecr/) require
+special paths. We append `:v1` to version our images. Versioning images is
+strongly recommended for reasons described in the best practices section.
+
+```sh
+docker build -t :v1 -f Dockerfile .
+# Use gcloud docker push instead if on Google Container Registry.
+docker push :v1
+```
+
+If you make any updates to the code, increment the version and rerun the above
+commands with the new version.
+
+## Running the keras_mnist.py example
+
+The [keras_mnist.py](keras_mnist.py) example demonstrates how to train an MNIST classifier using
+[tf.distribute.MultiWorkerMirroredStrategy and Keras Tensorflow 2.0 API](https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras).
+The final model is saved to disk by the chief worker process. The disk is assumed to be mounted onto the running container by the cluster manager.
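+
+For a quick local sanity check of the image outside the cluster (illustrative only; the
+image tag, mount path and single-worker `TF_CONFIG` below are placeholder values, and the
+container needs network access to download the MNIST data), you can run something like:
+
+```sh
+docker run \
+  -e TF_CONFIG='{"cluster": {"worker": ["localhost:5000"]}, "task": {"type": "worker", "index": 0}}' \
+  -v /tmp/pvcmnt:/pvcmnt \
+  <your-image-name>:v1
+```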
+It assumes that the cluster configuration is passed in through the `TF_CONFIG` environment variable when deployed in the cluster + +## Running the custom_training_mnist.py example + +The [custom_training_mnist.py](mnist.py) example demonstrates how to train a fashion MNIST classifier using +[tf.distribute.MultiWorkerMirroredStrategy and Tensorflow 2.0 Custom Training Loop APIs](https://www.tensorflow.org/tutorials/distribute/custom_training). +The final model is saved to disk by the chief worker process. The disk is assumed to be mounted onto the running container by the cluster manager. +It assumes that the cluster configuration is passed in through the `TF_CONFIG` environment variable when deployed in the cluster. + +## Running the keras_resnet_cifar.py example + +The [keras_resnet_cifar.py](keras_resnet_cifar.py) example demonstrates how to train a Resnet56 model on the cifar-10 dataset using +[tf.distribute.MultiWorkerMirroredStrategy and Keras Tensorflow 2.0 API](https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras). +The final model is saved to the GCP storage bucket. +It assumes that the cluster configuration is passed in through the `TF_CONFIG` environment variable when deployed in the cluster. diff --git a/distribution_strategy/multi_worker_mirrored_strategy/examples/custom_training_mnist.py b/distribution_strategy/multi_worker_mirrored_strategy/examples/custom_training_mnist.py new file mode 100644 index 00000000..7a0c2e05 --- /dev/null +++ b/distribution_strategy/multi_worker_mirrored_strategy/examples/custom_training_mnist.py @@ -0,0 +1,168 @@ +# ============================================================================== +# Copyright 2021 The TensorFlow Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================== + +# This code serves as an example of using Tensorflow 2.x to build and train a CNN model on the +# Fashion MNIST dataset using the tf.distribute.MultiWorkerMirroredStrategy described here +# https://www.tensorflow.org/api_docs/python/tf/distribute/experimental/MultiWorkerMirroredStrategy +# using a custom training loop. This code is very similar to the example provided here +# https://www.tensorflow.org/tutorials/distribute/custom_training +# Assumptions: +# 1) The code assumes that the cluster configuration needed for the TF distribute strategy is available through the +# TF_CONFIG environment variable. See the link provided above for details +# 2) The model is checkpointed and saved in /pvcmnt by the chief worker process. + +import tensorflow as tf +import numpy as np +import os + +# Used to run example using CPU only. Untested on GPU +os.environ["CUDA_VISIBLE_DEVICES"] = "-1" +MAIN_MODEL_PATH = '/pvcmnt' + +EPOCHS = 10 +GLOBAL_BATCH_SIZE = 128 + +def _is_chief(task_type, task_id): + # If `task_type` is None, this may be operating as single worker, which works + # effectively as chief. 
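+  # In this example the chief is the replica that writes checkpoints to the shared
+  # volume (MAIN_MODEL_PATH); the other workers write to a throwaway local temp
+  # directory instead (see write_filepath below).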
+ return task_type is None or task_type == 'chief' or ( + task_type == 'worker' and task_id == 0) + +def _get_temp_dir(task_id): + base_dirpath = 'workertemp_' + str(task_id) + temp_dir = os.path.join("/tmp", base_dirpath) + os.makedirs(temp_dir) + return temp_dir + +def write_filepath(strategy): + task_type, task_id = strategy.cluster_resolver.task_type, strategy.cluster_resolver.task_id + if not _is_chief(task_type, task_id): + checkpoint_dir = _get_temp_dir(task_id) + else: + base_dirpath = 'workertemp_' + str(task_id) + checkpoint_dir = os.path.join(MAIN_MODEL_PATH, base_dirpath) + if not os.path.exists(checkpoint_dir): + os.makedirs(checkpoint_dir) + return checkpoint_dir + +def create_model(): + model = tf.keras.Sequential([ + tf.keras.layers.Conv2D(32, 3, activation='relu'), + tf.keras.layers.MaxPooling2D(), + tf.keras.layers.Conv2D(64, 3, activation='relu'), + tf.keras.layers.MaxPooling2D(), + tf.keras.layers.Flatten(), + tf.keras.layers.Dense(64, activation='relu'), + tf.keras.layers.Dense(10) + ]) + return model + +def get_dist_data_set(strategy, batch_size): + fashion_mnist = tf.keras.datasets.fashion_mnist + (train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data() + # Adding a dimension to the array -> new shape == (28, 28, 1) + # We are doing this because the first layer in our model is a convolutional + # layer and it requires a 4D input (batch_size, height, width, channels). + # batch_size dimension will be added later on. + train_images = train_images[..., None] + test_images = test_images[..., None] + # Getting the images in [0, 1] range. + train_images = train_images / np.float32(255) + test_images = test_images / np.float32(255) + train_dataset = tf.data.Dataset.from_tensor_slices((train_images, train_labels)).shuffle(60000).batch(batch_size) + test_dataset = tf.data.Dataset.from_tensor_slices((test_images, test_labels)).batch(batch_size) + train_dist_dataset = strategy.experimental_distribute_dataset(train_dataset) + test_dist_dataset = strategy.experimental_distribute_dataset(test_dataset) + return train_dist_dataset, test_dist_dataset + +def main(): + global GLOBAL_BATCH_SIZE + strategy = tf.distribute.MultiWorkerMirroredStrategy() + train_dist_dataset, test_dist_dataset = get_dist_data_set(strategy, GLOBAL_BATCH_SIZE) + checkpoint_pfx = write_filepath(strategy) + with strategy.scope(): + model = create_model() + optimizer = tf.keras.optimizers.Adam() + checkpoint = tf.train.Checkpoint(optimizer=optimizer, model=model) + loss_object = tf.keras.losses.SparseCategoricalCrossentropy( + from_logits=True, + reduction=tf.keras.losses.Reduction.NONE) + test_loss = tf.keras.metrics.Mean(name='test_loss') + train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy( + name='train_accuracy') + test_accuracy = tf.keras.metrics.SparseCategoricalAccuracy( + name='test_accuracy') + + def compute_loss(labels, predictions): + per_example_loss = loss_object(labels, predictions) + return tf.nn.compute_average_loss(per_example_loss, global_batch_size=GLOBAL_BATCH_SIZE) + + def test_step(inputs): + images, labels = inputs + predictions = model(images, training=False) + t_loss = loss_object(labels, predictions) + test_loss.update_state(t_loss) + test_accuracy.update_state(labels, predictions) + + def train_step(inputs): + images, labels = inputs + with tf.GradientTape() as tape: + predictions = model(images, training=True) + loss = compute_loss(labels, predictions) + gradients = tape.gradient(loss, model.trainable_variables) + 
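+    # Each replica computes gradients on its own shard of the global batch; because the
+    # variables and optimizer were created under strategy.scope(), apply_gradients
+    # aggregates (sums) the per-replica gradients across all workers before applying
+    # them, so together with tf.nn.compute_average_loss above every replica ends up
+    # applying the same update computed over the global batch.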
optimizer.apply_gradients(zip(gradients, model.trainable_variables)) + train_accuracy.update_state(labels, predictions) + return loss + + # `run` replicates the provided computation and runs it + # with the distributed input. + @tf.function + def distributed_train_step(dataset_inputs): + per_replica_losses = strategy.run(train_step, args=(dataset_inputs,)) + return strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_losses, + axis=None) + + @tf.function + def distributed_test_step(dataset_inputs): + return strategy.run(test_step, args=(dataset_inputs,)) + + for epoch in range(EPOCHS): + # TRAIN LOOP + total_loss = 0.0 + num_batches = 0 + for x in train_dist_dataset: + total_loss += distributed_train_step(x) + num_batches += 1 + train_loss = total_loss / num_batches + + # TEST LOOP + for x in test_dist_dataset: + distributed_test_step(x) + if epoch % 2 == 0: + checkpoint.save(checkpoint_pfx) + + template = ("Epoch {}, Loss: {}, Accuracy: {}, Test Loss: {}, " + "Test Accuracy: {}") + print (template.format(epoch+1, train_loss, + train_accuracy.result()*100, test_loss.result(), + test_accuracy.result()*100)) + + test_loss.reset_states() + train_accuracy.reset_states() + test_accuracy.reset_states() + +if __name__=="__main__": + main() diff --git a/distribution_strategy/multi_worker_mirrored_strategy/examples/keras_mnist.py b/distribution_strategy/multi_worker_mirrored_strategy/examples/keras_mnist.py new file mode 100644 index 00000000..41882c74 --- /dev/null +++ b/distribution_strategy/multi_worker_mirrored_strategy/examples/keras_mnist.py @@ -0,0 +1,110 @@ +# ============================================================================== +# Copyright 2021 The TensorFlow Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================== + +# This code serves as an example of using Tensorflow 2.x Keras API to build and train a CNN model on the +# MNIST dataset using the tf.distribute.MultiWorkerMirroredStrategy described here +# https://www.tensorflow.org/api_docs/python/tf/distribute/experimental/MultiWorkerMirroredStrategy. +# This code is very similar to the example provided here +# https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras +# Assumptions: +# 1) The code assumes that the cluster configuration needed for the TF distribute strategy is available through the +# TF_CONFIG environment variable. See the link provided above for details +# 2) The model is checkpointed and saved in /pvcmnt by the chief worker process. + + +from __future__ import print_function + +import math +import os +import tensorflow as tf +import numpy as np +import json + +# Used to run example using CPU only. Untested on GPU +os.environ["CUDA_VISIBLE_DEVICES"] = "-1" + +# Model save directory +MAIN_MODEL_PATH = '/pvcmnt' + +GLOBAL_BATCH_SIZE = 128 + +def _is_chief(task_type, task_id): + # If `task_type` is None, this may be operating as single worker, which works + # effectively as chief. 
+ return task_type is None or task_type == 'chief' or ( + task_type == 'worker' and task_id == 0) + +def _get_temp_dir(task_id): + base_dirpath = 'workertemp_' + str(task_id) + temp_dir = os.path.join("/tmp", base_dirpath) + os.makedirs(temp_dir) + return temp_dir + +def write_filepath(strategy): + task_type, task_id = strategy.cluster_resolver.task_type, strategy.cluster_resolver.task_id + if not _is_chief(task_type, task_id): + checkpoint_dir = _get_temp_dir(task_id) + else: + base_dirpath = 'workertemp_' + str(task_id) + checkpoint_dir = os.path.join(MAIN_MODEL_PATH, base_dirpath) + if not os.path.exists(checkpoint_dir): + os.makedirs(checkpoint_dir) + return checkpoint_dir + +def mnist_dataset(batch_size): + (x_train, y_train), _ = tf.keras.datasets.mnist.load_data() + # The `x` arrays are in uint8 and have values in the range [0, 255]. + # You need to convert them to float32 with values in the range [0, 1] + x_train = x_train / np.float32(255) + y_train = y_train.astype(np.int64) + train_dataset = tf.data.Dataset.from_tensor_slices( + (x_train, y_train)).shuffle(60000).repeat().batch(batch_size) + return train_dataset + +def build_and_compile_cnn_model(): + model = tf.keras.Sequential([ + tf.keras.Input(shape=(28, 28)), + tf.keras.layers.Reshape(target_shape=(28, 28, 1)), + tf.keras.layers.Conv2D(32, 3, activation='relu'), + tf.keras.layers.Flatten(), + tf.keras.layers.Dense(128, activation='relu'), + tf.keras.layers.Dense(10) + ]) + model.compile( + loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), + optimizer=tf.keras.optimizers.SGD(learning_rate=0.001), + metrics=['accuracy']) + return model + +def main(): + tf_config = json.loads(os.environ['TF_CONFIG']) + num_workers = len(tf_config['cluster']['worker']) + strategy = tf.distribute.MultiWorkerMirroredStrategy() + + multi_worker_dataset = mnist_dataset(GLOBAL_BATCH_SIZE) + + # missing needs to be fixed + # multi_worker_dataset = strategy.distribute_datasets_from_function(mnist_dataset(global_batch_size)) + + callbacks = [tf.keras.callbacks.experimental.BackupAndRestore(backup_dir=write_filepath(strategy))] + with strategy.scope(): + multi_worker_model = build_and_compile_cnn_model() + multi_worker_model.fit(multi_worker_dataset, epochs=10, steps_per_epoch=70, + callbacks=callbacks) + multi_worker_model.save(filepath=write_filepath(strategy)) + +if __name__=="__main__": + main() diff --git a/distribution_strategy/multi_worker_mirrored_strategy/examples/keras_resnet_cifar.py b/distribution_strategy/multi_worker_mirrored_strategy/examples/keras_resnet_cifar.py new file mode 100644 index 00000000..ab0f0318 --- /dev/null +++ b/distribution_strategy/multi_worker_mirrored_strategy/examples/keras_resnet_cifar.py @@ -0,0 +1,373 @@ +# Copyright 2018 The TensorFlow Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# ============================================================================== +"""Runs a ResNet model on the Cifar-10 dataset.""" + +# This code serves as an example of using Tensorflow 2.0 Keras API to build and train a Resnet50 model on +# the Cifar 10 dataset using the tf.distribute.MultiWorkerMirroredStrategy described here +# https://www.tensorflow.org/api_docs/python/tf/distribute/experimental/MultiWorkerMirroredStrategy. +# This code is largely borrowed from +# https://github.com/tensorflow/models/blob/benchmark/official/benchmark/models/resnet_cifar_model.py +# with some minor tweaks to allow for training using GPU +# Assumptions: +# 1) The code assumes that the cluster configuration needed for the TF distribute strategy is available through the +# TF_CONFIG environment variable. See the link provided above for details +# 2) The libraries required to test this model are packaged into ./Dockerfile.gpu. Please refer to it + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +# Import libraries +from absl import app +from absl import flags +from absl import logging +import numpy as np +import tensorflow as tf +from tensorflow_models.official.benchmark.models import cifar_preprocessing +from tensorflow_models.official.benchmark.models import resnet_cifar_model +from tensorflow_models.official.benchmark.models import synthetic_util +from tensorflow_models.official.common import distribute_utils +from tensorflow_models.official.utils.flags import core as flags_core +#from tensorflow_models.official.utils.misc import keras_utils +from tensorflow_models.official.vision.image_classification.resnet import common +import multiprocessing +import os + +MAIN_MODEL_PATH = '/pvcmnt' + +# remove: duplicate function from keras_utils +def set_session_config(enable_xla=False): + """Sets the session config.""" + if enable_xla: + tf.config.optimizer.set_jit(True) + +# remove: duplicate function from keras_utils +def set_gpu_thread_mode_and_count(gpu_thread_mode, datasets_num_private_threads, + num_gpus, per_gpu_thread_count): + """Set GPU thread mode and count, and adjust dataset threads count.""" + cpu_count = multiprocessing.cpu_count() + logging.info('Logical CPU cores: %s', cpu_count) + + # Allocate private thread pool for each GPU to schedule and launch kernels + per_gpu_thread_count = per_gpu_thread_count or 2 + os.environ['TF_GPU_THREAD_MODE'] = gpu_thread_mode + os.environ['TF_GPU_THREAD_COUNT'] = str(per_gpu_thread_count) + logging.info('TF_GPU_THREAD_COUNT: %s', os.environ['TF_GPU_THREAD_COUNT']) + logging.info('TF_GPU_THREAD_MODE: %s', os.environ['TF_GPU_THREAD_MODE']) + + # Limit data preprocessing threadpool to CPU cores minus number of total GPU + # private threads and memory copy threads. + total_gpu_thread_count = per_gpu_thread_count * num_gpus + num_runtime_threads = num_gpus + if not datasets_num_private_threads: + datasets_num_private_threads = min( + cpu_count - total_gpu_thread_count - num_runtime_threads, num_gpus * 8) + logging.info('Set datasets_num_private_threads to %s', + datasets_num_private_threads) + +def _is_chief(task_type, task_id): + # If `task_type` is None, this may be operating as single worker, which works + # effectively as chief. 
+ return task_type is None or task_type == 'chief' or ( + task_type == 'worker' and task_id == 0) + +def _get_temp_dir(task_id): + base_dirpath = 'workertemp_' + str(task_id) + temp_dir = os.path.join("/tmp", base_dirpath) + os.makedirs(temp_dir) + return temp_dir + +def write_filepath(strategy): + task_type, task_id = strategy.cluster_resolver.task_type, strategy.cluster_resolver.task_id + if not _is_chief(task_type, task_id): + checkpoint_dir = _get_temp_dir(task_id) + else: + base_dirpath = 'workertemp_' + str(task_id) + checkpoint_dir = os.path.join(MAIN_MODEL_PATH, base_dirpath) + if not os.path.exists(checkpoint_dir): + os.makedirs(checkpoint_dir) + return checkpoint_dir + + + +LR_SCHEDULE = [ # (multiplier, epoch to start) tuples + (0.1, 91), (0.01, 136), (0.001, 182) +] + + +def learning_rate_schedule(current_epoch, + current_batch, + batches_per_epoch, + batch_size): + """Handles linear scaling rule and LR decay. + Scale learning rate at epoch boundaries provided in LR_SCHEDULE by the + provided scaling factor. + Args: + current_epoch: integer, current epoch indexed from 0. + current_batch: integer, current batch in the current epoch, indexed from 0. + batches_per_epoch: integer, number of steps in an epoch. + batch_size: integer, total batch sized. + Returns: + Adjusted learning rate. + """ + del current_batch, batches_per_epoch # not used + initial_learning_rate = common.BASE_LEARNING_RATE * batch_size / 128 + learning_rate = initial_learning_rate + for mult, start_epoch in LR_SCHEDULE: + if current_epoch >= start_epoch: + learning_rate = initial_learning_rate * mult + else: + break + return learning_rate + + +class LearningRateBatchScheduler(tf.keras.callbacks.Callback): + """Callback to update learning rate on every batch (not epoch boundaries). + N.B. Only support Keras optimizers, not TF optimizers. + Attributes: + schedule: a function that takes an epoch index and a batch index as input + (both integer, indexed from 0) and returns a new learning rate as + output (float). + """ + + def __init__(self, schedule, batch_size, steps_per_epoch): + super(LearningRateBatchScheduler, self).__init__() + self.schedule = schedule + self.steps_per_epoch = steps_per_epoch + self.batch_size = batch_size + self.epochs = -1 + self.prev_lr = -1 + + def on_epoch_begin(self, epoch, logs=None): + if not hasattr(self.model.optimizer, 'learning_rate'): + raise ValueError('Optimizer must have a "learning_rate" attribute.') + self.epochs += 1 + + def on_batch_begin(self, batch, logs=None): + """Executes before step begins.""" + lr = self.schedule(self.epochs, + batch, + self.steps_per_epoch, + self.batch_size) + if not isinstance(lr, (float, np.float32, np.float64)): + raise ValueError('The output of the "schedule" function should be float.') + if lr != self.prev_lr: + self.model.optimizer.learning_rate = lr # lr should be a float here + self.prev_lr = lr + logging.debug( + 'Epoch %05d Batch %05d: LearningRateBatchScheduler ' + 'change learning rate to %s.', self.epochs, batch, lr) + + +def run(flags_obj): + """Run ResNet Cifar-10 training and eval loop using native Keras APIs. + Args: + flags_obj: An object containing parsed flag values. + Raises: + ValueError: If fp16 is passed as it is not currently supported. + Returns: + Dictionary of training and eval stats. 
+ """ + #keras_utils.set_session_config( + # enable_xla=flags_obj.enable_xla) + set_session_config(enable_xla=True) + + # Execute flag override logic for better model performance + """ + if flags_obj.tf_gpu_thread_mode: + keras_utils.set_gpu_thread_mode_and_count( + per_gpu_thread_count=flags_obj.per_gpu_thread_count, + gpu_thread_mode=flags_obj.tf_gpu_thread_mode, + num_gpus=flags_obj.num_gpus, + datasets_num_private_threads=flags_obj.datasets_num_private_threads) + """ + if flags_obj.tf_gpu_thread_mode: + set_gpu_thread_mode_and_count( + per_gpu_thread_count=flags_obj.per_gpu_thread_count, + gpu_thread_mode=flags_obj.tf_gpu_thread_mode, + num_gpus=flags_obj.num_gpus, + datasets_num_private_threads=flags_obj.datasets_num_private_threads) + + common.set_cudnn_batchnorm_mode() + + dtype = flags_core.get_tf_dtype(flags_obj) + if dtype == 'fp16': + raise ValueError('dtype fp16 is not supported in Keras. Use the default ' + 'value(fp32).') + + data_format = flags_obj.data_format + if data_format is None: + data_format = ('channels_first' if tf.config.list_physical_devices('GPU') + else 'channels_last') + tf.keras.backend.set_image_data_format(data_format) + + """ + strategy = distribute_utils.get_distribution_strategy( + distribution_strategy=flags_obj.distribution_strategy, + num_gpus=flags_obj.num_gpus, + all_reduce_alg=flags_obj.all_reduce_alg, + num_packs=flags_obj.num_packs) + """ + strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy() + + if strategy: + # flags_obj.enable_get_next_as_optional controls whether enabling + # get_next_as_optional behavior in DistributedIterator. If true, last + # partial batch can be supported. + strategy.extended.experimental_enable_get_next_as_optional = ( + flags_obj.enable_get_next_as_optional + ) + + strategy_scope = distribute_utils.get_strategy_scope(strategy) + + if flags_obj.use_synthetic_data: + synthetic_util.set_up_synthetic_data() + input_fn = common.get_synth_input_fn( + height=cifar_preprocessing.HEIGHT, + width=cifar_preprocessing.WIDTH, + num_channels=cifar_preprocessing.NUM_CHANNELS, + num_classes=cifar_preprocessing.NUM_CLASSES, + dtype=flags_core.get_tf_dtype(flags_obj), + drop_remainder=True) + else: + synthetic_util.undo_set_up_synthetic_data() + input_fn = cifar_preprocessing.input_fn + + train_input_dataset = input_fn( + is_training=True, + data_dir=flags_obj.data_dir, + batch_size=flags_obj.batch_size, + parse_record_fn=cifar_preprocessing.parse_record, + datasets_num_private_threads=flags_obj.datasets_num_private_threads, + dtype=dtype, + # Setting drop_remainder to avoid the partial batch logic in normalization + # layer, which triggers tf.where and leads to extra memory copy of input + # sizes between host and GPU. 
+ drop_remainder=(not flags_obj.enable_get_next_as_optional)) + + eval_input_dataset = None + if not flags_obj.skip_eval: + eval_input_dataset = input_fn( + is_training=False, + data_dir=flags_obj.data_dir, + batch_size=flags_obj.batch_size, + parse_record_fn=cifar_preprocessing.parse_record) + + steps_per_epoch = ( + cifar_preprocessing.NUM_IMAGES['train'] // flags_obj.batch_size) + lr_schedule = 0.1 + if flags_obj.use_tensor_lr: + initial_learning_rate = common.BASE_LEARNING_RATE * flags_obj.batch_size / 128 + lr_schedule = tf.keras.optimizers.schedules.PiecewiseConstantDecay( + boundaries=list(p[1] * steps_per_epoch for p in LR_SCHEDULE), + values=[initial_learning_rate] + + list(p[0] * initial_learning_rate for p in LR_SCHEDULE)) + + with strategy_scope: + optimizer = common.get_optimizer(lr_schedule) + model = resnet_cifar_model.resnet56(classes=cifar_preprocessing.NUM_CLASSES) + model.compile( + loss='sparse_categorical_crossentropy', + optimizer=optimizer, + metrics=(['sparse_categorical_accuracy'] + if flags_obj.report_accuracy_metrics else None), + run_eagerly=flags_obj.run_eagerly) + + train_epochs = flags_obj.train_epochs + + callbacks = common.get_callbacks() + + if not flags_obj.use_tensor_lr: + lr_callback = LearningRateBatchScheduler( + schedule=learning_rate_schedule, + batch_size=flags_obj.batch_size, + steps_per_epoch=steps_per_epoch) + callbacks.append(lr_callback) + + tensorboard_callback = tf.keras.callbacks.TensorBoard( + log_dir="gs://shankgan-tf-exp-train-log-dir/") + callbacks.append(tensorboard_callback) + + # if mutliple epochs, ignore the train_steps flag. + if train_epochs <= 1 and flags_obj.train_steps: + steps_per_epoch = min(flags_obj.train_steps, steps_per_epoch) + train_epochs = 1 + + num_eval_steps = (cifar_preprocessing.NUM_IMAGES['validation'] // + flags_obj.batch_size) + + validation_data = eval_input_dataset + if flags_obj.skip_eval: + if flags_obj.set_learning_phase_to_train: + # TODO(haoyuzhang): Understand slowdown of setting learning phase when + # not using distribution strategy. + tf.keras.backend.set_learning_phase(1) + num_eval_steps = None + validation_data = None + + if not strategy and flags_obj.explicit_gpu_placement: + # TODO(b/135607227): Add device scope automatically in Keras training loop + # when not using distribition strategy. 
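+    # Entering the device context manually pins the Keras fit/evaluate calls below to
+    # GPU:0; the matching __exit__ after evaluation closes the scope again.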
+ no_dist_strat_device = tf.device('/device:GPU:0') + no_dist_strat_device.__enter__() + + logging.info("Beginning to fit the model.....") + history = model.fit(train_input_dataset, + epochs=train_epochs, + steps_per_epoch=steps_per_epoch, + callbacks=callbacks, + validation_steps=num_eval_steps, + validation_data=validation_data, + validation_freq=flags_obj.epochs_between_evals, + verbose=2) + eval_output = None + if not flags_obj.skip_eval: + eval_output = model.evaluate(eval_input_dataset, + steps=num_eval_steps, + verbose=2) + + if not strategy and flags_obj.explicit_gpu_placement: + no_dist_strat_device.__exit__() + + stats = common.build_stats(history, eval_output, callbacks) + return stats + + +def define_cifar_flags(): + + common.define_keras_flags() + data_dir = os.getenv("DATA_DIR") + model_dir = os.getenv("MODEL_DIR") + batch_size = int(os.getenv("BATCH_SIZE", default=512)) + num_train_epoch = int(os.getenv("NUM_TRAIN_EPOCH", default=100)) + + if not data_dir or not model_dir: + raise Exception("Data directory and Model Directory need to be specified!") + + flags_core.set_defaults(data_dir=data_dir, + model_dir=model_dir, + train_epochs=num_train_epoch, + epochs_between_evals=20, + batch_size=batch_size, + use_synthetic_data=False) # Changed the batch size + +def main(_): + return run(flags.FLAGS) + + +if __name__ == '__main__': + logging.set_verbosity(logging.INFO) + define_cifar_flags() + app.run(main) \ No newline at end of file diff --git a/distribution_strategy/multi_worker_mirrored_strategy/kubernetes/EnhancedMultiWorkerMirroredTemplate.j2 b/distribution_strategy/multi_worker_mirrored_strategy/kubernetes/EnhancedMultiWorkerMirroredTemplate.j2 new file mode 100644 index 00000000..8ea5e5ab --- /dev/null +++ b/distribution_strategy/multi_worker_mirrored_strategy/kubernetes/EnhancedMultiWorkerMirroredTemplate.j2 @@ -0,0 +1,142 @@ +{%- set name = "" -%} +{%- set image = "" -%} +{%- set worker_replicas = 2 -%} +{%- set script = "" -%} +{%- set gcp_credential_secret = "" %} +{%- set log_dir = "" %} +{%- set data_dir = "" %} +{%- set model_dir = "" %} +{%- set batch_size = 256 %} +{%- set num_train_epoch = 100 %} +{%- set port = 5000 -%} +{%- set run_tensorboard = true %} + + +{%- macro worker_hosts() -%} + {%- for i in range(worker_replicas) -%} + {%- if not loop.first -%},{%- endif -%} + "{{ name }}-worker-{{ i }}:{{ port }}" + {%- endfor -%} +{%- endmacro -%} + +{%- for i in range(worker_replicas) -%} +kind: Service +apiVersion: v1 +metadata: + name: {{ name }}-worker-{{ i }} +spec: + selector: + name: {{ name }} + job: worker + task: "{{ i }}" + ports: + - port: {{ port }} +--- +kind: Job +apiVersion: batch/v1 +metadata: + name: {{ name }}-worker-{{ i }} +spec: + ttlSecondsAfterFinished: 600 + template: + metadata: + labels: + name: {{ name }} + job: worker + task: "{{ i }}" + spec: + restartPolicy: Never + containers: + - name: tensorflow + image: {{ image }} + ports: + - containerPort: {{ port }} + command: + - "python" + - "{{ script }}" + env: + - name: TF_CONFIG + value: '{"cluster": {"worker": [{{ worker_hosts() }}]}, "task": {"type": "worker", "index": {{ i }}}}' + - name: GOOGLE_APPLICATION_CREDENTIALS + value: "/var/secrets/google/key.json" + - name: DATA_DIR + value: "{{ data_dir }}" + - name: MODEL_DIR + value: "{{ model_dir }}" + - name: NUM_TRAIN_EPOCH + value: "{{ num_train_epoch }}" + - name: BATCH_SIZE + value: "{{ batch_size }}" + ports: + - containerPort: {{ port }} + resources: + limits: + nvidia.com/gpu: 1 + volumeMounts: + - name: credential + mountPath: 
/var/secrets/google + volumes: + - name: credential + secret: + secretName: {{ gcp_credential_secret }} +--- +{% endfor %} + +{% if run_tensorboard %} +kind: Service +apiVersion: v1 +metadata: + name: resnet-tensorboard-0 +spec: + type: LoadBalancer + selector: + name: resnet + job: tensorboard + task: "0" + ports: + - port: {{ port }} +--- +kind: Deployment +apiVersion: apps/v1 +metadata: + name: resnet-tensorboard-0 +spec: + replicas: 1 + selector: + matchLabels: + name: resnet + job: tensorboard + task: "0" + template: + metadata: + labels: + name: resnet + job: tensorboard + task: "0" + spec: + containers: + - name: tensorflow + image: tensorflow/tensorflow + env: + - name: GOOGLE_APPLICATION_CREDENTIALS + value: "/var/secrets/google/key.json" + ports: + - containerPort: {{ port }} + command: + - "tensorboard" + args: + - '--logdir= {{ log_dir }}' + - "--port={{ port }}" + - "--host=0.0.0.0" + volumeMounts: + - name: credential + mountPath: /var/secrets/google + volumes: + - name: credential + secret: + secretName: {{ gcp_credential_secret }} +--- +{% endif %} + + + diff --git a/distribution_strategy/multi_worker_mirrored_strategy/kubernetes/MultiWorkerMirroredTemplate.jinja b/distribution_strategy/multi_worker_mirrored_strategy/kubernetes/MultiWorkerMirroredTemplate.jinja new file mode 100644 index 00000000..e4a7799c --- /dev/null +++ b/distribution_strategy/multi_worker_mirrored_strategy/kubernetes/MultiWorkerMirroredTemplate.jinja @@ -0,0 +1,111 @@ +{%- set name = "tf-learning" -%} +{%- set image = "image-name" -%} +{%- set worker_replicas = 2 -%} +{%- set script = "keras_mnist.py" -%} +{%- set model_checkpoint_dir = "/pvcmnt" -%} +{%- set checkpoint_pvc_name = "pvc-demo" -%} +{%- set port = 5000 -%} +{%- set create_pvc_checkpoint = True -%} +{%- set create_volume_inspector = True -%} +{%- set deploy = False -%} + + +{% if deploy %} + +{%- macro worker_hosts() -%} + {%- for i in range(worker_replicas) -%} + {%- if not loop.first -%},{%- endif -%} + "{{ name }}-worker-{{ i }}:{{ port }}" + {%- endfor -%} +{%- endmacro -%} + +{%- for i in range(worker_replicas) -%} +kind: Service +apiVersion: v1 +metadata: + name: {{ name }}-worker-{{ i }} +spec: + selector: + name: {{ name }} + job: worker + task: "{{ i }}" + ports: + - port: {{ port }} +--- +kind: Job +apiVersion: batch/v1 +metadata: + name: {{ name }}-worker-{{ i }} +spec: + ttlSecondsAfterFinished: 600 + template: + metadata: + labels: + name: {{ name }} + job: worker + task: "{{ i }}" + spec: + restartPolicy: Never + containers: + - name: tensorflow + image: {{ image }} + ports: + - containerPort: {{ port }} + command: + - "python" + - "{{ script }}" + env: + - name: TF_CONFIG + value: '{"cluster": {"worker": [{{ worker_hosts() }}]}, "task": {"type": "worker", "index": {{ i }}}}' +{% if i == 0 %} + volumeMounts: + - mountPath: /pvcmnt + name: pvc-mount + volumes: + - name: pvc-mount + persistentVolumeClaim: + claimName: {{ checkpoint_pvc_name }} +{% endif %}--- +{% endfor %} + +{% endif %} +{% if create_pvc_checkpoint %} +apiVersion: v1 +kind: PersistentVolumeClaim +metadata: + name: {{ checkpoint_pvc_name }} +spec: + accessModes: + - ReadWriteOnce + resources: + requests: + storage: 10G +--- +{% endif %} +{% if create_volume_inspector %} +kind: Pod +apiVersion: v1 +metadata: + name: volume-inspector +spec: + volumes: + - name: volume-to-inspect + persistentVolumeClaim: + claimName: {{ checkpoint_pvc_name }} + containers: + - name: debugger + image: busybox + command: ['sleep', '3600'] + volumeMounts: + - mountPath: {{ 
model_checkpoint_dir }} + name: volume-to-inspect + resources: + limits: + memory: 512Mi +--- +{% endif %} + + + + + diff --git a/kubernetes/README.md b/kubernetes/README.md index 7c5af8d5..1aefd277 100644 --- a/kubernetes/README.md +++ b/kubernetes/README.md @@ -1,9 +1,11 @@ # Running Distributed TensorFlow on Kubernetes This directory contains a template for running distributed TensorFlow on -Kubernetes. +Kubernetes. For newer examples, refer to the [distribution strategy](../distribution_strategy) -## Prerequisites +## Steps to train [mnist.py](../docker/mnist.py) + +### Prerequisites 1. You must be running Kubernetes 1.3 or above. If you are running an earlier version, the DNS addon must be enabled. See the @@ -12,7 +14,7 @@ Kubernetes. 2. [Jinja templates](http://jinja.pocoo.org/) must be installed. -## Steps to Run the job +### Steps to Run the job 1. Follow the instructions for creating the training program in the parent [README](../README.md). @@ -43,7 +45,7 @@ write to Google Cloud Storage. See the Google Cloud Storage section below. python render_template.py myjob.template.jinja | kubectl delete -f - ``` -## Google Cloud Storage +### Google Cloud Storage To support reading and writing to Google Cloud Storage, you need to set up a [Kubernetes secret](http://kubernetes.io/docs/user-guide/secrets/) with the @@ -62,4 +64,4 @@ credentials. 3. In your template, set `credential_secret_name` to `"credential"` (as specified above) and `credential_secret_key` to the `"[json_filename]"` in - the template. + the template. \ No newline at end of file