Setting restartPolicy and backoffLimit seems to have no effect. #2342

shaoqingyang · 2024-12-02T10:40:39Z

What happened?

nprocPerNode: auto
pytorchReplicaSpecs:
  Master:
    replicas: 1
    restartPolicy: OnFailure
  Worker:
    replicas: 1
    restartPolicy: OnFailure
runPolicy:
  backoffLimit: 1
  cleanPodPolicy: All
  schedulingPolicy:
    priorityClass: job-third
  suspend: false
  ttlSecondsAfterFinished: 60

What did you expect to happen?

I expected these settings to take effect. Maybe my YAML is not correct.
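
One way to rule out a malformed spec is to check its nesting programmatically. The sketch below is illustrative only: it hard-codes the spec above as a Python dict and uses a hypothetical helper, `check_spec`, to confirm that `backoffLimit` sits under `runPolicy` and that every replica has `restartPolicy: OnFailure`. Neither the helper nor its messages come from the Training Operator or its SDK.

```python
# Hypothetical sanity check; not part of the Training Operator or its SDK.
spec = {
    "nprocPerNode": "auto",
    "pytorchReplicaSpecs": {
        "Master": {"replicas": 1, "restartPolicy": "OnFailure"},
        "Worker": {"replicas": 1, "restartPolicy": "OnFailure"},
    },
    "runPolicy": {
        "backoffLimit": 1,
        "cleanPodPolicy": "All",
        "schedulingPolicy": {"priorityClass": "job-third"},
        "suspend": False,
        "ttlSecondsAfterFinished": 60,
    },
}


def check_spec(spec):
    """Return a list of human-readable problems found in the spec."""
    problems = []
    # backoffLimit must be nested under runPolicy, not at the top level.
    if "backoffLimit" not in spec.get("runPolicy", {}):
        problems.append("runPolicy.backoffLimit is not set")
    # Every replica type should declare restartPolicy: OnFailure.
    for name, replica in spec.get("pytorchReplicaSpecs", {}).items():
        if replica.get("restartPolicy") != "OnFailure":
            problems.append(f"{name}: restartPolicy is not OnFailure")
    return problems


problems = check_spec(spec)
print(problems or "spec looks well-formed")
```

An empty list here only means the fields are nested as expected; it says nothing about how the controller counts retries.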

Environment

Kubernetes version:

$ kubectl version

Training Operator version:

$ kubectl get pods -n kubeflow -l control-plane=kubeflow-training-operator -o jsonpath="{.items[*].spec.containers[*].image}"

Training Operator Python SDK version:

$ pip show kubeflow-training

shaoqingyang · 2024-12-03T11:09:18Z

andreyvelich · 2024-12-04T20:36:26Z

Thanks for creating this @shaoqingyang!
Can you try to run this simple PyTorchJob example to see if you are getting the same warning:

training-operator/examples/pytorch/simple.yaml

Line 10 in 265d4c7

restartPolicy: OnFailure

cc @kubeflow/wg-training-leads @Electronic-Waste

shaoqingyang · 2024-12-06T09:54:30Z

I think it runs and restarts. My problem is that it retries indefinitely instead of stopping once the number of attempts reaches backoffLimit.

andreyvelich · 2024-12-06T17:14:40Z

@shaoqingyang Do you see a similar warning in the controller that restartPolicy is not OnFailure when you run this example?

shaoqingyang · 2024-12-16T07:40:12Z

This example has the same issue as the job I created myself. I set both the Master and Worker to restartPolicy: OnFailure, then set
runPolicy:
  backoffLimit: 3
  cleanPodPolicy: All
  schedulingPolicy:
    priorityClass: job-third
  suspend: false
  ttlSecondsAfterFinished: 60
With this configuration, the job can still restart an unlimited number of times.

shaoqingyang · 2024-12-16T09:24:02Z

Maybe this counter has not taken effect.

shaoqingyang · 2024-12-16T09:46:52Z

Sorry, the problem was with how I triggered the retries: I deleted the pod so that it would be recreated. I looked at the code and found that a container failure is required for a restart to count.
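
To illustrate why deleting a pod might not count toward backoffLimit, here is a simplified model. It assumes, as the comment above suggests, that the limit is compared against the sum of container restart counts on the job's current pods. The function and field names are made up for this sketch, and it is not the operator's actual code.

```python
def past_backoff_limit(pods, backoff_limit):
    """Hypothetical check: sum container restart counts over current pods."""
    restarts = sum(
        count for pod in pods for count in pod["container_restart_counts"]
    )
    return restarts >= backoff_limit


# Case 1: the container crashes and is restarted in place.
# The pod object survives, so its restart count keeps growing.
crashing = [{"name": "worker-0", "container_restart_counts": [3]}]

# Case 2: the pod is deleted and recreated. The replacement pod is a new
# object whose containers start again with a restart count of 0.
recreated = [{"name": "worker-0", "container_restart_counts": [0]}]

print(past_backoff_limit(crashing, backoff_limit=3))   # True
print(past_backoff_limit(recreated, backoff_limit=3))  # False
```

Under this assumption, testing the limit requires a container that actually exits with an error, for example a training command that returns a non-zero exit code.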

shaoqingyang added kind/bug lifecycle/needs-triage labels Dec 2, 2024
