
Commit a79d3ce

Jeffwan authored and facebook-github-bot committed

Add torchelastic operator design doc (#36)

Summary: This is the second phase of Kubernetes support for TorchElastic jobs. The goal is to provide a controller that lets users write a Kubernetes `TorchElasticJob` manifest and then easily launch jobs in Kubernetes. All the complexity of constructing pods and services and managing jobs should be hidden from users; users only need to define a job spec instead of operating the entire stack. I did a POC to test a fixed number of workers (min=max=desired), and it takes some time to cover all the different cases and polish the project. Dynamic allocation is not in the controller's scope; I will write a separate proposal to bring in batch scheduling capability in the next phase.

Pull Request resolved: #36
Differential Revision: D20643340
Pulled By: kiukchung
fbshipit-source-id: 00ee1ac1c55e9262ac4d9c728b2659a8ff6b27e1
1 parent 652e0b8 commit a79d3ce

# TorchElastic Operator on Kubernetes

## Background

PyTorch continues to be used for the latest state-of-the-art research, making up nearly 70% of [papers](https://chillee.github.io/pytorch-vs-tensorflow/) that cite a framework.

The current PyTorch Distributed Data Parallel (DDP) module enables data-parallel training, where each process trains the same model on a different shard of the data. It enables bulk-synchronous, multi-host, multi-GPU/CPU execution of ML training. However, DDP has several shortcomings: jobs cannot start without acquiring all the requested nodes; jobs cannot continue after a node fails due to an error or transient issue; jobs cannot incorporate a node that joins later; and, lastly, progress cannot be made in the presence of a slow or stuck node.

The focus of [PyTorch Elastic](https://github.com/pytorch/elastic), which uses Elastic Distributed Data Parallelism, is to address these issues and build a generic framework/APIs for PyTorch that enables reliable and elastic execution of these data-parallel training workloads. It provides better programmability, higher resilience to failures of all kinds, higher efficiency, and larger-scale training compared with pure DDP.

## Motivation

With job fault tolerance and elastic training, we can unlock a lot of features.

Users can enable job priority and preemption in the cluster. Losing a task becomes acceptable, and the user won't lose the entire job's progress. More importantly, it helps guarantee SLAs of critical jobs, even in a cluster under resource pressure.

Cost and GPU utilization can be further optimized with this feature, since users can launch jobs with partial resources, and spot GPU instances can be used as well without worrying about losing job progress.

## User Experience

* Users should define the `minReplicas` and `maxReplicas` number of tasks for a job instead of a fixed number. The TorchElastic Operator will launch jobs in Kubernetes, set up the needed network topology, and manage the job lifecycle.
* Users need to specify the etcd endpoint used as the rendezvous (RDZV) service for task coordination.
* The desired `spec.replicaSpecs[Worker].replicas`, i.e. the number of tasks, has to be within the range from `minReplicas` to `maxReplicas`.
* Users can easily create/delete a torch elastic job with `kubectl` using a job manifest (see the example commands below).
* Users are able to describe the custom resource to monitor the job status.

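As a rough sketch of the intended workflow, assuming the `ElasticJob` manifest shown in the High Level Design section below is saved as `elastic-job.yaml` (a file name used here only for illustration):

```
# Create the job; assumes the ElasticJob CRD and the TorchElastic controller are already installed.
kubectl apply -f elastic-job.yaml

# Monitor the job through the custom resource.
kubectl get elasticjob classy-vision-job
kubectl describe elasticjob classy-vision-job

# Delete the job; the pods and services it owns are deleted with it.
kubectl delete elasticjob classy-vision-job
```
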
## High Level Design

Workers in a torch elastic job are equivalent and their communication is peer-to-peer. Every pod must be able to talk to every other pod, so we need to create a `headless` service for every pod. Once the job is done, the controller won't terminate any pods, so the user can check the logs of any worker. Manually deleting the job will delete all pods belonging to it.

A config with kind `ElasticJob` defines the job spec, and the controller will reconcile against this definition. It will create/update/delete pods and services if there are any changes in the job or in the Kubernetes resources (pods, services) owned by the `ElasticJob`.

```
apiVersion: "elastic.pytorch.org/v1alpha1"
kind: "ElasticJob"
metadata:
  name: "classy-vision-job"
spec:
  rdzvEndpoint: "etcd-service:2379"
  minReplicas: 2
  maxReplicas: 5
  replicaSpecs:
    Worker:
      replicas: 3
      restartPolicy: ExitCode
      template:
        apiVersion: v1
        kind: Pod
        spec:
          containers:
            - name: torchelasticworker
              image: torchelastic/examples:0.1.0rc1
              imagePullPolicy: Always
              args:
                - "s3://code_path/petctl/user/my_job/main.py"
                - --config_file
                - "/data/classy_vision/resnet50_synthetic_image_classy_config.json"
                - "--checkpoint_folder"
                - "/data/classy_vision/checkpoint"
```

*Network Communication*

Every pod must be able to talk to every other pod, so we need to create a headless service for every pod, since workers use the hostnames registered in the rendezvous endpoint to find their peers.

*Failure condition*

The TorchElastic controller will only fail a job if the number of active workers falls below the `minReplicas` the user specified. Otherwise, it will try to reschedule failed pods and maintain the desired task size.

*rdzvEndpoint*

The `rdzvEndpoint` needs to be specified by the user. It can be a highly available etcd quorum or a single etcd pod on the Kubernetes cluster.

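For illustration only, a throwaway single-node etcd (no high availability) could be deployed as follows; the image, flags, and the `etcd-service` name matching the `rdzvEndpoint` in the example spec are assumptions for this sketch, not part of the design:

```
kubectl apply -f - <<'EOF'
# Single-node etcd pod used only as a rendezvous backend (illustrative, not HA).
apiVersion: v1
kind: Pod
metadata:
  name: etcd
  labels:
    app: etcd
spec:
  containers:
    - name: etcd
      image: quay.io/coreos/etcd:v3.4.3
      command:
        - /usr/local/bin/etcd
        - --listen-client-urls=http://0.0.0.0:2379
        - --advertise-client-urls=http://0.0.0.0:2379
      ports:
        - containerPort: 2379
---
# Service name chosen to match the rdzvEndpoint "etcd-service:2379" in the example job spec.
apiVersion: v1
kind: Service
metadata:
  name: etcd-service
spec:
  selector:
    app: etcd
  ports:
    - port: 2379
      targetPort: 2379
EOF
```
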
*Replicas*
76+
77+
`replicas` represents the desired task size. Torch elastic job doesn't need all the workers to be ready to start training. We can set this field to job.spec.maxReplicas and try to allocate more resources. If cluster doesn't have enough resources, some tasks maybe pending and job can still start.
78+
79+
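For instance, with `replicas: 3` but cluster capacity for only two workers, output along these lines (purely illustrative) would still correspond to a running job:

```
kubectl get pods -l job-name=classy-vision-job

NAME                         READY   STATUS    RESTARTS   AGE
classy-vision-job-worker-0   1/1     Running   0          2m
classy-vision-job-worker-1   1/1     Running   0          2m
classy-vision-job-worker-2   0/1     Pending   0          2m
```
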

These are the resources the controller creates from a `TorchElasticJob`:

**Pod**

```
apiVersion: v1
kind: Pod
metadata:
  name: classy-vision-job-worker-${index}
  labels:
    job-name: classy-vision-job
    group-name: elastic.pytorch.org
    replica-index: "0"
    replica-type: worker
spec:
  containers:
    - name: torchelasticworker
      image: torchelastic/examples:0.1.0rc1
      imagePullPolicy: Always
      env:
        - name: RDZV_ENDPOINT
          value: "etcd-service:2379"
        - name: JOB_ID
          value: "classy-vision-job"
        - name: SIZE
          value: "3"
        - name: MIN_SIZE
          value: "2"
        - name: MAX_SIZE
          value: "5"
  restartPolicy: OnFailure
```
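
The container entrypoint is expected to consume these environment variables. A hypothetical sketch of that translation, assuming an entrypoint built around torchelastic's elastic launcher (the real entrypoint lives in the `torchelastic/examples` image and may differ):

```
# Hypothetical entrypoint: map the injected env vars onto the elastic launcher's arguments.
python -m torchelastic.distributed.launch \
    --nnodes="${MIN_SIZE}:${MAX_SIZE}" \
    --nproc_per_node=1 \
    --rdzv_backend=etcd \
    --rdzv_endpoint="${RDZV_ENDPOINT}" \
    --rdzv_id="${JOB_ID}" \
    "$@"   # the script and flags passed as container args, e.g. main.py --config_file ...
```
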

**Service**

```
apiVersion: v1
kind: Service
metadata:
  name: classy-vision-job-worker-${index}
spec:
  selector:
    job-name: classy-vision-job
    group-name: elastic.pytorch.org
    replica-index: "0"
    replica-type: worker
  clusterIP: None
```
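
Because `clusterIP: None` makes this a headless service selecting a single pod, each worker gets a stable DNS name that resolves directly to its pod IP, which is what peers discover through the rendezvous endpoint. For example, from any pod in the cluster (namespace `default` assumed):

```
# The per-worker service name resolves to that worker's pod IP.
nslookup classy-vision-job-worker-0.default.svc.cluster.local
```
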

**Job Status**

```
kubectl describe elasticjob classy-vision-job

Name:         classy-vision-job
Namespace:    default
API Version:  elastic.pytorch.org/v1alpha1
Kind:         ElasticJob
Spec:
  ...
Status:
  Conditions:
    Last Transition Time:  2020-01-22T23:10:44Z
    Last Update Time:      2020-01-22T23:10:44Z
    Message:               job classy-vision-job is created.
    Reason:                ElasticJobCreated
    Status:                True
    Type:                  Created
    Last Transition Time:  2020-01-22T23:10:49Z
    Last Update Time:      2020-01-22T23:10:49Z
    Message:               ElasticJob classy-vision-job is running.
    Reason:                ElasticJobRunning
    Status:                False
    Type:                  Running
    Last Transition Time:  2020-01-22T23:10:49Z
    Last Update Time:      2020-01-22T23:10:49Z
    Message:               ElasticJob classy-vision-job is failed because 2 Worker replica(s) failed.
    Reason:                ElasticJobFailed
    Status:                True
    Type:                  Failed
  Replica Statuses:
    Worker:
      Active:  1
      Failed:  2
Events:
  Type     Reason                   Age                From                    Message
  ----     ------                   ---                ----                    -------
  Normal   SuccessfulCreatePod      39m                elastic-job-controller  Created pod: classy-vision-job-worker-0
  Normal   SuccessfulCreatePod      39m                elastic-job-controller  Created pod: classy-vision-job-worker-1
  Normal   SuccessfulCreateService  39m                elastic-job-controller  Created service: classy-vision-job-worker-0
  Normal   SuccessfulCreateService  39m                elastic-job-controller  Created service: classy-vision-job-worker-1
  Normal   ExitedWithCode           39m (x3 over 39m)  elastic-job-controller  Pod: default.classy-vision-job-worker-0 exited with code 1
  Warning  ElasticJobRestarting     39m (x3 over 39m)  elastic-job-controller  ElasticJob classy-vision-job is restarting because 1 Worker replica(s) failed.
  Normal   ElasticJobFailed         39m                elastic-job-controller  ElasticJob classy-vision-job is failed because 2 Worker replica(s) failed.
```

## Not in scope

The TorchElastic Operator simplifies the setup required to run torch elastic jobs and manages the entire job lifecycle. However, it is hard for the controller to monitor cluster resources and dynamically adjust the task size. Instead, having a separate component such as a batch scheduler make that decision is a better option at this stage, to limit the scope of this project.

Currently, the operator requires an etcd service as the `rdzvEndpoint`. We may consider making this field optional and having the controller provision an etcd service when it is not set.
