Setting restartPolicy and backoffLimit seems to have no effect #2342
Comments
Thanks for creating this @shaoqingyang!
cc @kubeflow/wg-training-leads @Electronic-Waste
I think it can run and restart. My problem is that it retries infinitely instead of stopping after the number of attempts set by backoffLimit.
@shaoqingyang Do you see a similar warning in the controller that restartPolicy is not OnFailure when you run this example?
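One way to look for such a warning (assuming the operator is deployed in the kubeflow namespace with the label used elsewhere in this issue; both are assumptions based on a default install) is to grep the controller logs:

```shell
# Tail the training-operator controller logs and search for restartPolicy messages.
# Namespace and label selector are assumptions based on a default Kubeflow install.
kubectl logs -n kubeflow -l control-plane=kubeflow-training-operator --tail=500 \
  | grep -i "restartpolicy"
```

This requires a live cluster, so treat it as a starting point rather than an exact recipe.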
This example has the same issue as the job I created myself. I set both the Master and Worker to restartPolicy: OnFailure, then set
What happened?
```yaml
nprocPerNode: auto
pytorchReplicaSpecs:
  Master:
    replicas: 1
    restartPolicy: OnFailure
  Worker:
    replicas: 1
    restartPolicy: OnFailure
runPolicy:
  backoffLimit: 1
  cleanPodPolicy: All
  schedulingPolicy:
    priorityClass: job-third
  suspend: false
  ttlSecondsAfterFinished: 60
```
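For comparison, a minimal complete PyTorchJob manifest with backoffLimit placed under runPolicy might look roughly like the sketch below. The metadata names and container image are illustrative assumptions, not taken from the original report:

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-example        # illustrative name
  namespace: default           # illustrative namespace
spec:
  nprocPerNode: "auto"
  runPolicy:
    backoffLimit: 1            # retries before the job is marked Failed
    cleanPodPolicy: All
    ttlSecondsAfterFinished: 60
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: my-training-image:latest   # illustrative image
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: my-training-image:latest   # illustrative image
```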
What did you expect to happen?
I expected backoffLimit to take effect; perhaps my YAML is incorrect.
Environment
Kubernetes version:
Training Operator version:
$ kubectl get pods -n kubeflow -l control-plane=kubeflow-training-operator -o jsonpath="{.items[*].spec.containers[*].image}"
Training Operator Python SDK version: