# TorchElastic Operator on Kubernetes

## Background

PyTorch continues to be used for the latest state-of-the-art research, making up nearly 70% of [papers](https://chillee.github.io/pytorch-vs-tensorflow/) that cite a framework.

The current PyTorch Distributed Data Parallel (DDP) module enables data-parallel training, where each process trains the same model on a different shard of the data. It supports bulk-synchronous, multi-host, multi-GPU/CPU execution of ML training. However, DDP has several shortcomings: jobs cannot start without acquiring all the requested nodes; jobs cannot continue after a node fails due to an error or a transient issue; jobs cannot incorporate a node that joins later; and no progress can be made in the presence of a slow or stuck node.

The focus of [PyTorch Elastic](https://github.com/pytorch/elastic), which uses Elastic Distributed Data Parallelism, is to address these issues and build a generic framework and APIs for PyTorch that enable reliable and elastic execution of these data-parallel training workloads. It provides better programmability, higher resilience to failures of all kinds, higher efficiency, and larger-scale training compared with pure DDP.

## Motivation

Job fault tolerance and elastic training unlock a number of features.

Users can enable job priority and preemption in the cluster. Losing a task becomes acceptable, and the user won't lose the entire job's progress. More importantly, it helps guarantee the SLAs of critical jobs, even in a cluster under resource pressure.

Cost and GPU utilization can be further optimized, since users can launch jobs with partial resources, and spot GPU instances can be used without worrying about interruptions.

## User Experience

* Users define `minReplicas` and `maxReplicas` bounds on the number of tasks in a job instead of a fixed number. The TorchElastic Operator launches jobs on Kubernetes, sets up the needed network topology, and manages the job lifecycle.
* Users need to specify the etcd endpoint used as the rendezvous (RDZV) service for task coordination.
* The desired `spec.replicaSpecs[Worker].replicas`, i.e. the number of tasks, has to be within the range from `minReplicas` to `maxReplicas`.
* Users can easily create/delete a torch elastic job with `kubectl` using a job manifest, as sketched below.
* Users can describe the custom resource to monitor the job status.
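
For illustration, assuming the `ElasticJob` manifest from the design section below is saved as `elasticjob.yaml` (a hypothetical file name), the whole lifecycle can be driven with standard `kubectl` commands:

```
# Create the job; the operator creates the worker pods and headless services.
kubectl apply -f elasticjob.yaml

# Monitor the job through the custom resource's status and events.
kubectl describe elasticjob classy-vision-job

# Delete the job; all pods and services owned by it are cleaned up.
kubectl delete -f elasticjob.yaml
```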

## High Level Design

Workers in a torch elastic job are equivalent, and their communication is peer-to-peer. Every pod should be able to talk to every other pod, so we need to create a `headless` service for every pod. Once the job is done, the controller won't terminate any pods, so the user can check the logs of any worker. Manually deleting the job deletes all pods belonging to it.

A config with kind `ElasticJob` defines the job spec, and the controller reconciles against this definition. It creates/updates/deletes pods and services whenever the job definition changes or the Kubernetes resources (pods, services) owned by the `ElasticJob` change.

```
apiVersion: "elastic.pytorch.org/v1alpha1"
kind: "ElasticJob"
metadata:
  name: "classy-vision-job"
spec:
  rdzvEndpoint: "etcd-service:2379"
  minReplicas: 2
  maxReplicas: 5
  replicaSpecs:
    Worker:
      replicas: 3
      restartPolicy: ExitCode
      template:
        apiVersion: v1
        kind: Pod
        spec:
          containers:
            - name: torchelasticworker
              image: torchelastic/examples:0.1.0rc1
              imagePullPolicy: Always
              args:
                - "s3://code_path/petctl/user/my_job/main.py"
                - "--config_file"
                - "/data/classy_vision/resnet50_synthetic_image_classy_config.json"
                - "--checkpoint_folder"
                - "/data/classy_vision/checkpoint"
```

*Network Communication*

Every pod should be able to talk to every other pod, so we need to create a headless service for every pod: workers find their peers via the hostnames registered in the rdzv endpoint.

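For instance, with the per-pod headless `Service` shown later (namespace assumed to be `default`), each worker gets a stable DNS name that resolves straight to its pod IP:

```
# From inside any pod in the cluster, a peer's service name resolves
# to that worker pod's IP (no load-balanced virtual IP is involved).
nslookup classy-vision-job-worker-0.default.svc.cluster.local
```
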
*Failure condition*

The Torch Elastic controller only fails a job if the number of active workers drops below the user-specified `minReplicas`. Otherwise, it tries to reschedule failed pods and maintain the desired task size.

*rdzvEndpoint*

`rdzvEndpoint` needs to be specified by the user. It can be a highly available etcd quorum or a single etcd pod on the Kubernetes cluster.
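
As a minimal sketch (not a production setup), a single-node etcd that satisfies `rdzvEndpoint: "etcd-service:2379"` could look like the following; the image tag and flags here are assumptions to adapt as needed:

```
apiVersion: v1
kind: Pod
metadata:
  name: etcd
  labels:
    app: etcd
spec:
  containers:
    - name: etcd
      # Image tag is an assumption; pin the etcd v3 release you actually use.
      image: quay.io/coreos/etcd:v3.4.3
      command:
        - /usr/local/bin/etcd
        - --listen-client-urls=http://0.0.0.0:2379
        - --advertise-client-urls=http://0.0.0.0:2379
      ports:
        - name: client
          containerPort: 2379
---
apiVersion: v1
kind: Service
metadata:
  # Name matches the rdzvEndpoint host in the ElasticJob spec above.
  name: etcd-service
spec:
  selector:
    app: etcd
  ports:
    - name: client
      port: 2379
      targetPort: 2379
```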

*Replicas*

`replicas` represents the desired task size. A torch elastic job doesn't need all of its workers to be ready to start training, so this field can be set to `job.spec.maxReplicas` to try to allocate more resources; if the cluster doesn't have enough, some tasks may stay pending and the job can still start, as in the fragment below.
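
For example, a spec fragment (values taken from the manifest above) that opportunistically asks for the maximum:

```
spec:
  rdzvEndpoint: "etcd-service:2379"
  minReplicas: 2
  maxReplicas: 5
  replicaSpecs:
    Worker:
      # Ask for the maximum; training starts once minReplicas workers are up,
      # and Pending pods simply join later if resources free up.
      replicas: 5
```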

These are the resources the controller creates from an `ElasticJob`:

**Pod**

```
apiVersion: v1
kind: Pod
metadata:
  name: classy-vision-job-worker-${index}
  labels:
    job-name: classy-vision-job
    group-name: elastic.pytorch.org
    replica-index: "0"
    replica-type: worker
spec:
  containers:
    - name: torchelasticworker
      image: torchelastic/examples:0.1.0rc1
      imagePullPolicy: Always
      env:
        - name: RDZV_ENDPOINT
          value: "etcd-service:2379"
        - name: JOB_ID
          value: "classy-vision-job"
        - name: SIZE
          value: "3"
        - name: MIN_SIZE
          value: "2"
        - name: MAX_SIZE
          value: "5"
  restartPolicy: OnFailure
```

**Service**

```
apiVersion: v1
kind: Service
metadata:
  name: classy-vision-job-worker-${index}
spec:
  selector:
    job-name: classy-vision-job
    group-name: elastic.pytorch.org
    replica-index: "0"
    replica-type: worker
  clusterIP: None
```

**Job Status**

```
kubectl describe elasticjob classy-vision-job
Name:         classy-vision-job
Namespace:    default
API Version:  elastic.pytorch.org/v1alpha1
Kind:         ElasticJob
Spec:
  ...
Status:
  Conditions:
    Last Transition Time:  2020-01-22T23:10:44Z
    Last Update Time:      2020-01-22T23:10:44Z
    Message:               job classy-vision-job is created.
    Reason:                ElasticJobCreated
    Status:                True
    Type:                  Created
    Last Transition Time:  2020-01-22T23:10:49Z
    Last Update Time:      2020-01-22T23:10:49Z
    Message:               ElasticJob classy-vision-job is running.
    Reason:                ElasticJobRunning
    Status:                False
    Type:                  Running
    Last Transition Time:  2020-01-22T23:10:49Z
    Last Update Time:      2020-01-22T23:10:49Z
    Message:               ElasticJob classy-vision-job is failed because 2 Worker replica(s) failed.
    Reason:                ElasticJobFailed
    Status:                True
    Type:                  Failed
  Replica Statuses:
    Worker:
      Active:  1
      Failed:  2
Events:
  Type     Reason                    Age                From                    Message
  ----     ------                    ----               ----                    -------
  Normal   SuccessfulCreatePod       39m                elastic-job-controller  Created pod: classy-vision-job-worker-0
  Normal   SuccessfulCreatePod       39m                elastic-job-controller  Created pod: classy-vision-job-worker-1
  Normal   SuccessfulCreateService   39m                elastic-job-controller  Created service: classy-vision-job-worker-0
  Normal   SuccessfulCreateService   39m                elastic-job-controller  Created service: classy-vision-job-worker-1
  Normal   ExitedWithCode            39m (x3 over 39m)  elastic-job-controller  Pod: default.classy-vision-job-worker-0 exited with code 1
  Warning  ElasticJobRestarting      39m (x3 over 39m)  elastic-job-controller  ElasticJob classy-vision-job is restarting because 1 Worker replica(s) failed.
  Normal   ElasticJobFailed          39m                elastic-job-controller  ElasticJob classy-vision-job is failed because 2 Worker replica(s) failed.
```

## Not in scope

The TorchElastic Operator simplifies the setup needed to run torch elastic jobs and manages the entire job lifecycle. However, it is hard for the controller to monitor cluster resources and dynamically adjust the task size; a separate component such as a batch scheduler making that decision is a better option at this stage and limits the scope of this project.

Currently, the operator requires an etcd service as `rdzvEndpoint`. We may consider making this field optional and having the controller provision an etcd service when it is not set.