
Allow 0 worker in pytorch plugins & Add objectMeta to PyTorchJob #348

Status: Open · wants to merge 1 commit into master

Conversation

@ByronHsu (Contributor) commented May 9, 2023

TL;DR

  1. Allow the Kubeflow PyTorch plugin to be configured with 0 workers when running distributed PyTorch jobs. In that case, the training job runs on a single machine (the master node) without any additional worker nodes (see the sketch after this list).
  2. Pass objectMeta to PyTorchJob
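
As a minimal, self-contained Go sketch of item 1 (not the actual plugin code; buildReplicaCounts and its plain string map are hypothetical stand-ins for the kubeflow replica-spec types), the intended behaviour is: reject only negative worker counts, and when the count is 0 create a master-only, single-node job.

    package main

    import "fmt"

    // buildReplicaCounts mirrors the intended validation: negative counts are
    // rejected, 0 workers yields a master-only (single-node) job, and a
    // positive count adds that many worker replicas.
    func buildReplicaCounts(workers int32) (map[string]int32, error) {
        if workers < 0 {
            return nil, fmt.Errorf("invalid worker count %d: must be >= 0", workers)
        }
        replicas := map[string]int32{"Master": 1}
        if workers > 0 {
            replicas["Worker"] = workers
        }
        return replicas, nil
    }

    func main() {
        for _, w := range []int32{0, 2} {
            r, err := buildReplicaCounts(w)
            fmt.Printf("workers=%d replicas=%v err=%v\n", w, r, err)
        }
    }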

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

@ByronHsu ByronHsu requested review from fg91 and igorvalko May 9, 2023 05:14
@codecov bot commented May 9, 2023

Codecov Report

Merging #348 (dce4e46) into master (76a80ec) will increase coverage by 1.30%.
The diff coverage is 100.00%.

❗ Current head dce4e46 differs from pull request most recent head 2065a96. Consider uploading reports for the commit 2065a96 to get more accurate results

@@            Coverage Diff             @@
##           master     #348      +/-   ##
==========================================
+ Coverage   62.76%   64.06%   +1.30%     
==========================================
  Files         148      148              
  Lines       12444    10080    -2364     
==========================================
- Hits         7810     6458    -1352     
+ Misses       4038     3026    -1012     
  Partials      596      596              
Flag        Coverage Δ
unittests   ?

Flags with carried forward coverage won't be shown.

Impacted Files                                            Coverage Δ
...o/tasks/plugins/k8s/kfoperators/pytorch/pytorch.go     79.80% <100.00%> (+0.28%) ⬆️

... and 130 files with indirect coverage changes

@pingsutw (Member) left a comment

Is there any benefit to using only one master node in the PyTorch CRD? People can just run PyTorch in a regular python task for single-node training, right?

@fg91 (Member) commented May 9, 2023

Is there any benefit to using only one master node in the PyTorch CRD? People can just run PyTorch in a regular python task for single-node training, right?

I can only imagine the env vars set by the operator, like world size, rank, ...? In the torch elastic task we opted to run single-worker trainings in a normal python task/single pod so that users don't need the training operator.
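
For reference, the variables referred to here are the torch.distributed rendezvous settings that the Kubeflow training operator injects into every PyTorchJob pod. A tiny illustrative Go snippet (not plugin code; the exact variable set may differ by operator version) that prints them:

    package main

    import (
        "fmt"
        "os"
    )

    func main() {
        // Rendezvous settings injected by the PyTorch training operator; a
        // master-only job would still see WORLD_SIZE=1 and RANK=0.
        for _, name := range []string{"MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK"} {
            fmt.Printf("%s=%q\n", name, os.Getenv(name))
        }
    }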

@ByronHsu (Contributor, author) commented May 9, 2023

In our case, we start with 0 workers since it's easier to debug, and then adjust to multiple workers.

Although a python task can achieve the same thing, we shouldn't error out on 0 workers in the PyTorch plugin, because that is what the PyTorch operator allows.

Diff under review (pytorch.go):

    job := &kubeflowv1.PyTorchJob{
        TypeMeta: metav1.TypeMeta{
            Kind:       kubeflowv1.PytorchJobKind,
            APIVersion: kubeflowv1.SchemeGroupVersion.String(),
        },
        Spec:       jobSpec,
        ObjectMeta: *objectMeta,
    }
A contributor left a review comment on the diff above:

Is this just to set labels / annotations on the CR the same as the replicas? If this is the route we want to go, this should probably be done for all kf operator plugins.
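
To make the question concrete, here is a hedged sketch of the kind of metadata the new objectMeta argument would place on the PyTorchJob CR itself (assuming the standard k8s.io/apimachinery metav1.ObjectMeta type; the name, labels, and annotations below are made up for illustration):

    package main

    import (
        "fmt"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    func main() {
        // In the plugin, objectMeta is built from the task execution context and
        // then assigned to the CR as shown in the diff above (ObjectMeta: *objectMeta).
        objectMeta := &metav1.ObjectMeta{
            Name:        "example-pytorch-job",
            Namespace:   "flytesnacks-development",
            Labels:      map[string]string{"workflow-name": "example-wf"},
            Annotations: map[string]string{"owner": "example-team"},
        }
        fmt.Printf("%+v\n", *objectMeta)
    }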

@fg91 (Member) commented May 9, 2023

In our case, we start with 0 workers since it's easier to debug, and then adjust to multiple workers.

Although a python task can achieve the same thing, we shouldn't error out on 0 workers in the PyTorch plugin, because that is what the PyTorch operator allows.

Fair point. Users can still switch to a python task if they prefer by removing the task_config=Pytorch(....

Only thing I wonder: in the new pytorch elastic task we decided that with nnodes=1 (meaning only a single worker/pod) we use a normal python task/pod so that users can run it without the operator (see here). Should we then change this one as well, so that elastic and non-elastic pytorch distributed training don't behave differently? @kumare3 argued that it would be nice to allow users to use torchrun without the operator.

@kumare3 (Contributor) commented May 10, 2023

I actually want to get rid of the required dependency on the PyTorch operator for simple single-node training, which suffices in many cases. This actually makes scaling really nice: you start with one node and simply scale to more nodes; to scale you may need to deploy the operator?

This is why when nnodes=1, we just change the task type itself. WDYT? @fg91 and @ByronHsu

@ByronHsu (Contributor, author) commented

Could you elaborate on the drawbacks of using the PyTorch Operator to train in the single-node case?

@kumare3 (Contributor) commented May 10, 2023

@ByronHsu - FlytePropeller is way more optimal at allocating resources, retrying, and completing sooner. Also, for a single node this is faster, runs without needing an operator, and does not need a CRD to be created.

@ByronHsu (Contributor, author) commented

@kumare3 skipping the CRD part can definitely be faster. Thanks. I will raise a corresponding PR in flytekit.

@ByronHsu (Contributor, author) commented

Will merge with #345 and do an integration test.

@fg91 (Member) commented Aug 16, 2023

Is this still supposed to be pushed over the finish line, or shall we close it?
