
Set restartPolicy and backoff limit, it seems like not effect. #2342

Open
shaoqingyang opened this issue Dec 2, 2024 · 7 comments

Comments

@shaoqingyang

What happened?

nprocPerNode: auto
pytorchReplicaSpecs:
  Master:
    replicas: 1
    restartPolicy: OnFailure
  Worker:
    replicas: 1
    restartPolicy: OnFailure
runPolicy:
  backoffLimit: 1
  cleanPodPolicy: All
  schedulingPolicy:
    priorityClass: job-third
  suspend: false
  ttlSecondsAfterFinished: 60
[Three attached images]

What did you expect to happen?

I expected restartPolicy and backoffLimit to take effect. Maybe my YAML is incorrect.

Environment

Kubernetes version:

$ kubectl version

Training Operator version:

$ kubectl get pods -n kubeflow -l control-plane=kubeflow-training-operator -o jsonpath="{.items[*].spec.containers[*].image}"

Training Operator Python SDK version:

$ pip show kubeflow-training

Impacted by this bug?

Give it a 👍. We prioritize the issues with the most 👍.

@shaoqingyang
Author

image

@andreyvelich
Member

Thanks for creating this @shaoqingyang!
Can you try to run this simple PyTorchJob example to see if you are getting the same warning:

restartPolicy: OnFailure

cc @kubeflow/wg-training-leads @Electronic-Waste

@shaoqingyang
Author

> Thanks for creating this @shaoqingyang! Can you try to run this simple PyTorchJob example to see if you are getting the same warning:
>
> restartPolicy: OnFailure
>
> cc @kubeflow/wg-training-leads @Electronic-Waste

I think it runs and restarts fine. My problem is that it retries indefinitely instead of stopping once the number of attempts reaches backoffLimit.

@andreyvelich
Member

@shaoqingyang Do you see a similar warning in the controller, saying that restartPolicy is not OnFailure, when you run this example?

@shaoqingyang
Author

This example has the same issue as the job I created myself. I set restartPolicy: OnFailure on both the Master and the Worker, then set:

runPolicy:
  backoffLimit: 3
  cleanPodPolicy: All
  schedulingPolicy:
    priorityClass: job-third
  suspend: false
  ttlSecondsAfterFinished: 60

The job can still restart an unlimited number of times.

@shaoqingyang
Author

Maybe this counter is not taking effect.
[WeCom screenshot e9414c41-269f-4670-844a-de8351a9b0c5]

@shaoqingyang
Author

[WeCom screenshot 8e4d4998-a0c0-477f-a8d0-f72136ee9fb1]
Sorry, the problem was with how I triggered the retries: I deleted the pod and let it be recreated. I looked at the code and found that the retry counter requires a container failure.
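
For anyone reproducing this, the finding above suggests that backoffLimit should be tested with a container that fails on its own, not by deleting pods. A minimal sketch of such a test job follows. The job name, image, and command are placeholders that do not come from this thread, and the behavior has not been verified against a specific Training Operator version.

```yaml
# Hypothetical test job: name, image, and command are placeholders.
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: backoff-limit-test
spec:
  runPolicy:
    backoffLimit: 3
    cleanPodPolicy: All
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch        # container name that PyTorchJob expects
              image: busybox:1.36  # placeholder image, any image with a shell works
              # Exit non-zero so that the container itself fails,
              # instead of the pod being deleted by hand.
              command: ["sh", "-c", "echo 'simulated failure'; exit 1"]
```

Watching the pod with `kubectl get pods -w` should show the RESTARTS column increasing as the container fails. A pod that is deleted and recreated starts again with a restart count of 0, which would explain why manual deletion never appeared to reach the limit.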
