How can I implement pytorchjob's workers to use different images or configurations? #2365

Open
certainly-cyber opened this issue Dec 30, 2024 · 3 comments

Comments

@certainly-cyber

What happened?

Hello everyone! As the title says, I would like the workers of a PyTorchJob to be able to use different images or configurations.

Normally, my yaml is similar to this:

apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
  name: "resnet-1"
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      ...
    Worker:
      replicas: 3
      ...

However, if I want to use different configurations (such as images) for different workers, how should I do it? I tried to configure multiple workers, like this:

spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      ...
    Worker:
      replicas: 1
      ...
    Worker:
      replicas: 1
      ...

But unfortunately, this doesn't work.
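A likely reason, sketched here under assumptions: pytorchReplicaSpecs is a mapping keyed by replica type, so two "Worker" keys cannot represent two separate entries. Depending on the parser, the duplicate is either rejected or one entry silently replaces the other. The Python sketch below uses a JSON equivalent of the manifest to show both outcomes. The image names and the helper functions are hypothetical, and this is not how the training-operator itself parses manifests.

```python
import json

# JSON equivalent of the replica specs above.
# The image names are hypothetical placeholders, not taken from the issue.
MANIFEST = """
{
  "pytorchReplicaSpecs": {
    "Master": {"replicas": 1, "image": "example/master:latest"},
    "Worker": {"replicas": 1, "image": "example/worker-a:latest"},
    "Worker": {"replicas": 1, "image": "example/worker-b:latest"}
  }
}
"""


def reject_duplicates(pairs):
    """object_pairs_hook that raises on a repeated key instead of overwriting."""
    seen = {}
    for key, value in pairs:
        if key in seen:
            raise ValueError(f"duplicate key: {key!r}")
        seen[key] = value
    return seen


def load_lenient(text):
    # Default behaviour of json.loads: the last duplicate key wins.
    return json.loads(text)


def load_strict(text):
    # Strict variant: any duplicate key is reported as an error.
    return json.loads(text, object_pairs_hook=reject_duplicates)


if __name__ == "__main__":
    specs = load_lenient(MANIFEST)["pytorchReplicaSpecs"]
    print(sorted(specs))             # only one "Worker" entry survives
    print(specs["Worker"]["image"])  # the later definition replaced the earlier one

    try:
        load_strict(MANIFEST)
    except ValueError as err:
        print(err)
```

Either way, only one Worker spec can reach the job, so this manifest cannot give two workers different images.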

What did you expect to happen?

How can I implement pytorchjob's workers to use different images or configurations?

Environment

Kubernetes version:
Client Version: v1.30.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.2

Training Operator version:
kubeflow/training-operator:v1-9e52eb7#

Impacted by this bug?

Give it a 👍 We prioritize the issues with most 👍

@tenzen-y
Member

tenzen-y commented Jan 2, 2025

Thank you for creating this issue!
Currently, the training-operator does not support the requested behavior. The PyTorchJob does not allow us to specify duplicate roles in a single job.

/remove-kind bug
/kind support

@google-oss-prow google-oss-prow bot removed the kind/bug label Jan 2, 2025

@tenzen-y: The label(s) kind/support cannot be applied, because the repository doesn't have them.


@certainly-cyber
Author

OK, I got it, thank you for your reply. By the way, if I still want to achieve this behavior, is there any way to do it? I am open to any other job type (TFJob or anything else) or any other proposal. Thanks again, and have a nice day.
