Do not create the launcher job if the job starts suspended #670
base: master
Conversation
// If the job is suspended, the list of worker pods will be incorrect. We also do
// not want to start the launcher job if the MPIJob starts suspended.
if launcher == nil && !isMPIJobSuspended(mpiJob) {
This raises the question of what should be done when unsuspending the launcher job in case kueue has decided to change the NodeSelector. Should we instead recreate the job, since NodeSelector is immutable?
Yeah, I think this question does not have a straightforward answer. I can see at least two possible approaches:
- as in JobSet: update the NodeSelector field in the pod template when resuming the Job
- recreate the launcher/worker Jobs; this can probably be achieved easily by deleting the Jobs when the MPIJob is suspended
I'm ok with whichever of those is simpler to implement. Any opinion @tenzen-y ?
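For illustration only, a minimal Go sketch of the first option (updating the NodeSelector in the pod templates when resuming, as JobSet does). The helper name and the place it would be called from are assumptions, not code from this PR:
package sketch

import corev1 "k8s.io/api/core/v1"

// applyNodeSelectors merges the node selectors injected by Kueue on admission
// into a pod template before the MPIJob is resumed. Existing keys are
// overwritten so the template matches the assigned flavor. (Hypothetical helper.)
func applyNodeSelectors(tpl *corev1.PodTemplateSpec, injected map[string]string) {
	if tpl.Spec.NodeSelector == nil {
		tpl.Spec.NodeSelector = make(map[string]string, len(injected))
	}
	for k, v := range injected {
		tpl.Spec.NodeSelector[k] = v
	}
}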
In any case I think it is safe to decouple the fixes.
Sorry for the delayed response. IMO, I would go with the JobSet approach (update the NodeSelector field in the pod template when resuming the Job) instead of the current solution, because I want to align with JobSet's behavior: JobSet creates the Job even while it is suspended.
That's already being done in theory by kueue with RunWithPodSetsInfo, right? I think the issue here is that we create the job in non-suspended mode even if the MPIJob is suspended. This results in pods being scheduled on nodes and then removed because the controller suspends the job later on.
@tenzen-y
I think it reproduces when GPUs are used in areas not managed by kueue. The following procedure attaches two GPUs to a node and sets the ClusterQueue quota to twice the number of GPUs actually allocatable in the cluster. Two jobs that each use two GPUs are then executed simultaneously. In this case, the launcher runs even though the workers are pending.
-
Create Kind Cluster and set label for fake-gpu-operator
kind create cluster
kubectl label node kind-control-plane run.ai/simulated-gpu-node-pool=default
-
Install fake gpu operator
helm repo add fake-gpu-operator https://fake-gpu-operator.storage.googleapis.com
helm repo update
helm upgrade -i gpu-operator fake-gpu-operator/fake-gpu-operator --namespace gpu-operator --create-namespace
- Check that the fake GPU is shown by running kubectl get all -n gpu-operator and kubectl describe node | grep "nvidia.com/gpu:".
-
Install mpi-operator
kubectl apply --server-side -f https://raw.githubusercontent.com/kubeflow/mpi-operator/master/deploy/v2beta1/mpi-operator.yaml
- Check that mpi-operator is deployed properly by running kubectl get all -n mpi-operator.
-
Install kueue with waitForPodsReady enabled
Get kueue manifest and enable waitForPodsReady.
wget https://github.com/kubernetes-sigs/kueue/releases/download/v0.10.2/manifests.yaml
sed -i '/#waitForPodsReady:/a \ waitForPodsReady:\n enable: true\n timeout: 5m\n blockAdmission: true\n requeuingStrategy:\n timestamp: Eviction\n backoffLimitCount: 5\n backoffBaseSeconds: 60\n backoffMaxSeconds: 3600' manifests.yaml
Then, apply the configuration by:
kubectl apply --server-side -f manifests.yaml
- Check that kueue is deployed properly by running kubectl get all -n kueue-system.
-
Prepare minimal kueue setup with gpu enabled
First, check the amount of allocatable GPU in your cluster.
TOTAL_ALLOCATABLE=$(kubectl get node --selector='run.ai/simulated-gpu-node-pool=default,node-role.kubernetes.io/control-plane' -o jsonpath='{range .items[*]}{.status.allocatable.nvidia\.com\/gpu}{"\n"}{end}' | numfmt --from=auto | awk '{s+=$1} END {print s}')
echo $TOTAL_ALLOCATABLE
In our case this outputs 2.
-
Configure ClusterQueue quota
We configure the GPU flavor by doubling the total GPU allocatable in our cluster, in order to simulate issues with provisioning.
Execute the following command to create the cluster queue configuration as single-clusterqueue-setup-gpu4.yaml:
cat <<EOF >> single-clusterqueue-setup-gpu4.yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "default-flavor"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "cluster-queue"
spec:
  namespaceSelector: {} # match all.
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: "default-flavor"
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 4 # double the value of allocatable GPU in the cluster
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: "default"
  name: "user-queue"
spec:
  clusterQueue: "cluster-queue"
EOF
Then, apply the configuration by:
kubectl apply -f single-clusterqueue-setup-gpu4.yaml
-
Verify that an MPIJob with 2 replicas works (2 GPUs are available on the node)
Get tensorflow-benchmarks.yaml
wget https://raw.githubusercontent.com/kubeflow/mpi-operator/refs/heads/master/examples/v2beta1/tensorflow-benchmarks/tensorflow-benchmarks.yaml
-
Run with waitForPodsReady enabled
- Create start.sh script
cat <<EOF >> start.sh
sed '/^metadata:/a \ labels:\n kueue.x-k8s.io/queue-name: user-queue' tensorflow-benchmarks.yaml | sed 's/name: tensorflow-benchmarks/name: tensorflow-benchmarks-job1/g' > /tmp/tensorflow-benchmarks-job1.yaml
sed '/^metadata:/a \ labels:\n kueue.x-k8s.io/queue-name: user-queue' tensorflow-benchmarks.yaml | sed 's/name: tensorflow-benchmarks/name: tensorflow-benchmarks-job2/g' > /tmp/tensorflow-benchmarks-job2.yaml
kubectl create -f /tmp/tensorflow-benchmarks-job1.yaml
kubectl create -f /tmp/tensorflow-benchmarks-job2.yaml
EOF
chmod +x start.sh
- Run the start.sh script
./start.sh
-
Monitor the progress
# kubectl get mpijob,workload,pod
NAME AGE
mpijob.kubeflow.org/tensorflow-benchmarks-job1 8s
mpijob.kubeflow.org/tensorflow-benchmarks-job2 8s
NAME QUEUE RESERVED IN ADMITTED FINISHED AGE
workload.kueue.x-k8s.io/mpijob-tensorflow-benchmarks-job1-73f4b user-queue cluster-queue True 8s
workload.kueue.x-k8s.io/mpijob-tensorflow-benchmarks-job2-8caef user-queue cluster-queue True 8s
NAME READY STATUS RESTARTS AGE
pod/tensorflow-benchmarks-job1-launcher-zx4ck 1/1 Running 0 7s
pod/tensorflow-benchmarks-job1-worker-0 1/1 Running 0 8s
pod/tensorflow-benchmarks-job1-worker-1 1/1 Running 0 8s
pod/tensorflow-benchmarks-job2-launcher-zch7j 1/1 Running 0 4s #launcher runs even though the worker is pending
pod/tensorflow-benchmarks-job2-worker-0 0/1 Pending 0 5s
pod/tensorflow-benchmarks-job2-worker-1 0/1 Pending 0 4s
# kubectl describe workload.kueue.x-k8s.io/mpijob-tensorflow-benchmarks-job2-8caef
Name: mpijob-tensorflow-benchmarks-job2-8caef
Namespace: default
Labels: kueue.x-k8s.io/job-uid=a2ad4da7-eb57-477d-a9bb-1142d8d847dc
Annotations: <none>
API Version: kueue.x-k8s.io/v1beta1
Kind: Workload
Metadata:
Creation Timestamp: 2025-03-07T06:59:46Z
Finalizers:
kueue.x-k8s.io/resource-in-use
Generation: 1
Owner References:
API Version: kubeflow.org/v2beta1
Block Owner Deletion: true
Controller: true
Kind: MPIJob
Name: tensorflow-benchmarks-job2
UID: a2ad4da7-eb57-477d-a9bb-1142d8d847dc
Resource Version: 4618
UID: 34379ab2-1539-4f61-b69a-fcdad7768478
Spec:
Active: true
Pod Sets:
Count: 1
Name: launcher
Template:
Metadata:
Spec:
Containers:
Command:
mpirun
--allow-run-as-root
-np
2
-bind-to
none
-map-by
slot
-x
NCCL_DEBUG=INFO
-x
LD_LIBRARY_PATH
-x
PATH
-mca
pml
ob1
-mca
btl
^openib
python
scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
--model=resnet101
--batch_size=64
--variable_update=horovod
Image: mpioperator/tensorflow-benchmarks:latest
Name: tensorflow-benchmarks-job2
Resources:
Count: 2
Name: worker
Template:
Metadata:
Spec:
Containers:
Image: mpioperator/tensorflow-benchmarks:latest
Name: tensorflow-benchmarks-job2
Resources:
Limits:
nvidia.com/gpu: 1
Priority: 0
Priority Class Source:
Queue Name: user-queue
Status:
Admission:
Cluster Queue: cluster-queue
Pod Set Assignments:
Count: 1
Name: launcher
Count: 2
Flavors:
nvidia.com/gpu: default-flavor
Name: worker
Resource Usage:
nvidia.com/gpu: 2
Conditions:
Last Transition Time: 2025-03-07T06:59:49Z
Message: Quota reserved in ClusterQueue cluster-queue
Observed Generation: 1
Reason: QuotaReserved
Status: True
Type: QuotaReserved
Last Transition Time: 2025-03-07T06:59:47Z
Message: Not all pods are ready or succeeded
Observed Generation: 1
Reason: PodsReady
Status: False
Type: PodsReady
Last Transition Time: 2025-03-07T06:59:49Z
Message: The workload is admitted
Observed Generation: 1
Reason: Admitted
Status: True
Type: Admitted
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal QuotaReserved 28s kueue-admission Quota reserved in ClusterQueue cluster-queue, wait time since queued was 4s
Normal Admitted 28s kueue-admission Admitted by ClusterQueue cluster-queue, wait time since reservation was 0s
# kubectl describe mpijob.kubeflow.org/tensorflow-benchmarks-job2
Name: tensorflow-benchmarks-job2
Namespace: default
Labels: kueue.x-k8s.io/queue-name=user-queue
Annotations: <none>
API Version: kubeflow.org/v2beta1
Kind: MPIJob
Metadata:
Creation Timestamp: 2025-03-07T06:59:46Z
Generation: 2
Resource Version: 4657
UID: a2ad4da7-eb57-477d-a9bb-1142d8d847dc
Spec:
Launcher Creation Policy: AtStartup
Mpi Implementation: OpenMPI
Mpi Replica Specs:
Launcher:
Replicas: 1
Template:
Metadata:
Spec:
Containers:
Command:
mpirun
--allow-run-as-root
-np
2
-bind-to
none
-map-by
slot
-x
NCCL_DEBUG=INFO
-x
LD_LIBRARY_PATH
-x
PATH
-mca
pml
ob1
-mca
btl
^openib
python
scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
--model=resnet101
--batch_size=64
--variable_update=horovod
Image: mpioperator/tensorflow-benchmarks:latest
Name: tensorflow-benchmarks-job2
Resources:
Worker:
Replicas: 2
Template:
Metadata:
Spec:
Containers:
Image: mpioperator/tensorflow-benchmarks:latest
Name: tensorflow-benchmarks-job2
Resources:
Limits:
nvidia.com/gpu: 1
Run Launcher As Worker: false
Run Policy:
Clean Pod Policy: Running
Suspend: false
Slots Per Worker: 1
Ssh Auth Mount Path: /root/.ssh
Status:
Conditions:
Last Transition Time: 2025-03-07T06:59:47Z
Last Update Time: 2025-03-07T06:59:47Z
Message: MPIJob default/tensorflow-benchmarks-job2 is created.
Reason: MPIJobCreated
Status: True
Type: Created
Last Transition Time: 2025-03-07T06:59:48Z
Last Update Time: 2025-03-07T06:59:48Z
Message: MPIJob default/tensorflow-benchmarks-job2 is suspended.
Reason: MPIJobSuspended
Status: False
Type: Running
Last Transition Time: 2025-03-07T06:59:50Z
Last Update Time: 2025-03-07T06:59:50Z
Message: MPIJob resumed
Reason: MPIJobResumed
Status: False
Type: Suspended
Replica Statuses:
Launcher:
Active: 1
Worker:
Start Time: 2025-03-07T06:59:50Z
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal CreatedWorkload 99s kubeflow.org/mpijob-kueue-controller Created Workload: default/mpijob-tensorflow-benchmarks-job2-8caef
Normal MPIJobCreated 98s (x2 over 99s) mpi-job-controller MPIJob default/tensorflow-benchmarks-job2 is created.
Normal MPIJobSuspended 98s (x2 over 98s) mpi-job-controller MPIJob suspended
Normal Started 97s kubeflow.org/mpijob-kueue-controller Admitted by clusterQueue cluster-queue
Normal MPIJobResumed 96s mpi-job-controller MPIJob resumed
# kubectl describe pod/tensorflow-benchmarks-job2-worker-0
Name: tensorflow-benchmarks-job2-worker-0
Namespace: default
Priority: 0
Service Account: default
Node: <none>
Labels: training.kubeflow.org/job-name=tensorflow-benchmarks-job2
training.kubeflow.org/job-role=worker
training.kubeflow.org/operator-name=mpi-operator
training.kubeflow.org/replica-index=0
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Controlled By: MPIJob/tensorflow-benchmarks-job2
Containers:
tensorflow-benchmarks-job2:
Image: mpioperator/tensorflow-benchmarks:latest
Port: <none>
Host Port: <none>
Command:
/usr/sbin/sshd
-De
Limits:
nvidia.com/gpu: 1
Requests:
nvidia.com/gpu: 1
Environment:
K_MPI_JOB_ROLE: worker
Mounts:
/root/.ssh from ssh-auth (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4l5z8 (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
ssh-auth:
Type: Secret (a volume populated by a Secret)
SecretName: tensorflow-benchmarks-job2-ssh
Optional: false
kube-api-access-4l5z8:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 2m42s default-scheduler 0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
# kubectl get mpijob,workload,pod
NAME AGE
mpijob.kubeflow.org/tensorflow-benchmarks-job1 5m6s
mpijob.kubeflow.org/tensorflow-benchmarks-job2 5m6s
NAME QUEUE RESERVED IN ADMITTED FINISHED AGE
workload.kueue.x-k8s.io/mpijob-tensorflow-benchmarks-job1-73f4b user-queue cluster-queue True 5m6s
workload.kueue.x-k8s.io/mpijob-tensorflow-benchmarks-job2-8caef user-queue False 5m6s
NAME READY STATUS RESTARTS AGE
pod/tensorflow-benchmarks-job1-launcher-zx4ck 1/1 Running 1 (2m27s ago) 5m5s
pod/tensorflow-benchmarks-job1-worker-0 1/1 Running 0 5m6s
pod/tensorflow-benchmarks-job1-worker-1 1/1 Running 0 5m6s
- Cleanup
Clean up the jobs by:
kubectl delete -f /tmp/tensorflow-benchmarks-job1.yaml
kubectl delete -f /tmp/tensorflow-benchmarks-job2.yaml
-
Change ClusterQueue quota
Set the value of GPUs that can be allocated in the cluster to 2.
cat <<EOF >> single-clusterqueue-setup-gpu2.yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "default-flavor"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "cluster-queue"
spec:
  namespaceSelector: {} # match all.
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: "default-flavor"
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 2
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: "default"
  name: "user-queue"
spec:
  clusterQueue: "cluster-queue"
EOF
Then, apply the configuration by:
kubectl apply -f single-clusterqueue-setup-gpu2.yaml
- Run the start.sh script
./start.sh
- Monitor the progress
# kubectl get mpijob,workload,pod
NAME AGE
mpijob.kubeflow.org/tensorflow-benchmarks-job1 19s
mpijob.kubeflow.org/tensorflow-benchmarks-job2 19s
NAME QUEUE RESERVED IN ADMITTED FINISHED AGE
workload.kueue.x-k8s.io/mpijob-tensorflow-benchmarks-job1-7d852 user-queue cluster-queue True 19s
workload.kueue.x-k8s.io/mpijob-tensorflow-benchmarks-job2-c171c user-queue 19s
NAME READY STATUS RESTARTS AGE
pod/tensorflow-benchmarks-job1-launcher-qg2xj 1/1 Running 0 18s
pod/tensorflow-benchmarks-job1-worker-0 1/1 Running 0 19s
pod/tensorflow-benchmarks-job1-worker-1 1/1 Running 0 19s
# kubectl describe workload.kueue.x-k8s.io/mpijob-tensorflow-benchmarks-job2-c171c
Name: mpijob-tensorflow-benchmarks-job2-c171c
Namespace: default
Labels: kueue.x-k8s.io/job-uid=211c9316-b9ad-400a-b108-7549d40517b5
Annotations: <none>
API Version: kueue.x-k8s.io/v1beta1
Kind: Workload
Metadata:
Creation Timestamp: 2025-03-07T07:11:20Z
Finalizers:
kueue.x-k8s.io/resource-in-use
Generation: 1
Owner References:
API Version: kubeflow.org/v2beta1
Block Owner Deletion: true
Controller: true
Kind: MPIJob
Name: tensorflow-benchmarks-job2
UID: 211c9316-b9ad-400a-b108-7549d40517b5
Resource Version: 6372
UID: 7c481e50-9be5-4950-aff3-773d8e475989
Spec:
Active: true
Pod Sets:
Count: 1
Name: launcher
Template:
Metadata:
Spec:
Containers:
Command:
mpirun
--allow-run-as-root
-np
2
-bind-to
none
-map-by
slot
-x
NCCL_DEBUG=INFO
-x
LD_LIBRARY_PATH
-x
PATH
-mca
pml
ob1
-mca
btl
^openib
python
scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
--model=resnet101
--batch_size=64
--variable_update=horovod
Image: mpioperator/tensorflow-benchmarks:latest
Name: tensorflow-benchmarks-job2
Resources:
Count: 2
Name: worker
Template:
Metadata:
Spec:
Containers:
Image: mpioperator/tensorflow-benchmarks:latest
Name: tensorflow-benchmarks-job2
Resources:
Limits:
nvidia.com/gpu: 1
Priority: 0
Priority Class Source:
Queue Name: user-queue
Status:
Conditions:
Last Transition Time: 2025-03-07T07:11:20Z
Message: Not all pods are ready or succeeded
Observed Generation: 1
Reason: PodsReady
Status: False
Type: PodsReady
Last Transition Time: 2025-03-07T07:11:20Z
Message: couldn't assign flavors to pod set worker: insufficient unused quota for nvidia.com/gpu in flavor default-flavor, 2 more needed
Observed Generation: 1
Reason: Pending
Status: False
Type: QuotaReserved
Resource Requests:
Name: launcher
Name: worker
Resources:
nvidia.com/gpu: 2
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Pending 33s (x3 over 33s) kueue-admission couldn't assign flavors to pod set worker: insufficient unused quota for nvidia.com/gpu in flavor default-flavor, 2 more needed
# kubectl describe mpijob.kubeflow.org/tensorflow-benchmarks-job2
Name: tensorflow-benchmarks-job2
Namespace: default
Labels: kueue.x-k8s.io/queue-name=user-queue
Annotations: <none>
API Version: kubeflow.org/v2beta1
Kind: MPIJob
Metadata:
Creation Timestamp: 2025-03-07T07:11:20Z
Generation: 1
Resource Version: 6406
UID: 211c9316-b9ad-400a-b108-7549d40517b5
Spec:
Launcher Creation Policy: AtStartup
Mpi Implementation: OpenMPI
Mpi Replica Specs:
Launcher:
Replicas: 1
Template:
Metadata:
Spec:
Containers:
Command:
mpirun
--allow-run-as-root
-np
2
-bind-to
none
-map-by
slot
-x
NCCL_DEBUG=INFO
-x
LD_LIBRARY_PATH
-x
PATH
-mca
pml
ob1
-mca
btl
^openib
python
scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
--model=resnet101
--batch_size=64
--variable_update=horovod
Image: mpioperator/tensorflow-benchmarks:latest
Name: tensorflow-benchmarks-job2
Resources:
Worker:
Replicas: 2
Template:
Metadata:
Spec:
Containers:
Image: mpioperator/tensorflow-benchmarks:latest
Name: tensorflow-benchmarks-job2
Resources:
Limits:
nvidia.com/gpu: 1
Run Launcher As Worker: false
Run Policy:
Clean Pod Policy: Running
Suspend: true
Slots Per Worker: 1
Ssh Auth Mount Path: /root/.ssh
Status:
Conditions:
Last Transition Time: 2025-03-07T07:11:20Z
Last Update Time: 2025-03-07T07:11:20Z
Message: MPIJob default/tensorflow-benchmarks-job2 is created.
Reason: MPIJobCreated
Status: True
Type: Created
Last Transition Time: 2025-03-07T07:11:21Z
Last Update Time: 2025-03-07T07:11:21Z
Message: MPIJob suspended
Reason: MPIJobSuspended
Status: True
Type: Suspended
Last Transition Time: 2025-03-07T07:11:21Z
Last Update Time: 2025-03-07T07:11:21Z
Message: MPIJob default/tensorflow-benchmarks-job2 is suspended.
Reason: MPIJobSuspended
Status: False
Type: Running
Replica Statuses:
Launcher:
Worker:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal MPIJobCreated 75s mpi-job-controller MPIJob default/tensorflow-benchmarks-job2 is created.
Normal CreatedWorkload 75s kubeflow.org/mpijob-kueue-controller Created Workload: default/mpijob-tensorflow-benchmarks-job2-c171c
Normal MPIJobSuspended 74s mpi-job-controller MPIJob suspended
-
Cleanup
Clean up the jobs by:
kubectl delete -f /tmp/tensorflow-benchmarks-job1.yaml
kubectl delete -f /tmp/tensorflow-benchmarks-job2.yaml
@ttakahashi21 Thank you for describing your situation. As far as I can see, this seems to be expected behavior. waitForPodsReady is a feature that observes not-ready or unschedulable pods and then requeues the workload in Kueue.
If you want to guarantee that the workers are scheduled first, you can use launcherCreationPolicy:
mpi-operator/pkg/apis/kubeflow/v2beta1/types.go
Lines 194 to 197 in 7f94988
// launcherCreationPolicy if WaitForWorkersReady, the launcher is created only after all workers are in Ready state. Defaults to AtStartup.
// +kubebuilder:validation:Enum:AtStartup;WaitForWorkersReady
// +kubebuilder:default:=AtStartup
LauncherCreationPolicy LauncherCreationPolicy `json:"launcherCreationPolicy,omitempty"`
If I am missing anything, please let me know.
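To make the semantics of that field concrete, here is a hedged Go sketch of the check it implies. The import path and the helper shape are assumptions based on the snippet above and on the condition echoed in the debug logs later in this thread; it is not the actual controller source:
package sketch

import kubeflow "github.com/kubeflow/mpi-operator/pkg/apis/kubeflow/v2beta1"

// shouldCreateLauncher mirrors the documented launcherCreationPolicy semantics:
// AtStartup creates the launcher right away, WaitForWorkersReady only once all
// worker pods report Ready. (Illustrative only; import path is assumed.)
func shouldCreateLauncher(mpiJob *kubeflow.MPIJob, readyWorkers, totalWorkers int) bool {
	if mpiJob.Spec.LauncherCreationPolicy == kubeflow.LauncherCreationPolicyAtStartup {
		return true
	}
	return readyWorkers == totalWorkers
}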
If you want to guarantee that the workers are scheduled first, you can use launcherCreationPolicy:
Thank you for your comment. I've tested with launcherCreationPolicy set to WaitForWorkersReady, but it doesn't work as expected when kueue is enabled and the nominalQuota is greater than the actual allocatable GPUs.
| # | kueue | Condition of GPUs | launcherCreationPolicy (MPIJob) | Behavior |
|---|---|---|---|---|
| 1 | disabled | - | WaitForWorkersReady | works as expected |
| 2 | enabled | NominalQuota == Allocatable | WaitForWorkersReady | works as expected |
| 3 | enabled | NominalQuota > Allocatable | WaitForWorkersReady | does not work as expected |
Note that the result I previously shared was tested under the conditions below, as you pointed out.
| # | kueue | Condition of GPUs | launcherCreationPolicy (MPIJob) | Behavior |
|---|---|---|---|---|
| 0 | enabled | NominalQuota > Allocatable | AtStartup (default) | works as expected |
case1
When MPIJobs are executed that exceed the available GPU resources, the GPU resources are insufficient, so the worker pods stay pending. Therefore, if WaitForWorkersReady is working, the launcher is expected not to run. In this case, the launcher does not run.
- WaitForWorkersReady feature works fine when Kueue is not used.
# kubectl get mpijob,workload,pod
NAME AGE
mpijob.kubeflow.org/tensorflow-benchmarks-job1 12s
mpijob.kubeflow.org/tensorflow-benchmarks-job2 12s
NAME READY STATUS RESTARTS AGE
pod/tensorflow-benchmarks-job1-launcher-k9r6m 1/1 Running 0 10s
pod/tensorflow-benchmarks-job1-worker-0 1/1 Running 0 12s
pod/tensorflow-benchmarks-job1-worker-1 1/1 Running 0 12s
pod/tensorflow-benchmarks-job2-worker-0 0/1 Pending 0 12s
pod/tensorflow-benchmarks-job2-worker-1 0/1 Pending 0 12s
# kubectl describe mpijob.kubeflow.org/tensorflow-benchmarks-job1
Name: tensorflow-benchmarks-job1
Namespace: default
Labels: <none>
Annotations: <none>
API Version: kubeflow.org/v2beta1
Kind: MPIJob
Metadata:
Creation Timestamp: 2025-03-12T20:54:48Z
Generation: 1
Resource Version: 2126
UID: 54b7c0a2-2bb7-40b4-9cb3-92991bf15c5a
Spec:
Launcher Creation Policy: WaitForWorkersReady
Mpi Implementation: OpenMPI
Mpi Replica Specs:
Launcher:
Replicas: 1
Template:
Metadata:
Spec:
Containers:
Command:
mpirun
--allow-run-as-root
-np
2
-bind-to
none
-map-by
slot
-x
NCCL_DEBUG=INFO
-x
LD_LIBRARY_PATH
-x
PATH
-mca
pml
ob1
-mca
btl
^openib
python
scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
--model=resnet101
--batch_size=64
--variable_update=horovod
Image: mpioperator/tensorflow-benchmarks:latest
Name: tensorflow-benchmarks-job1
Resources:
Worker:
Replicas: 2
Template:
Metadata:
Spec:
Containers:
Image: mpioperator/tensorflow-benchmarks:latest
Name: tensorflow-benchmarks-job1
Resources:
Limits:
nvidia.com/gpu: 1
Run Launcher As Worker: false
Run Policy:
Clean Pod Policy: Running
Suspend: false
Slots Per Worker: 1
Ssh Auth Mount Path: /root/.ssh
Status:
Conditions:
Last Transition Time: 2025-03-12T20:54:48Z
Last Update Time: 2025-03-12T20:54:48Z
Message: MPIJob default/tensorflow-benchmarks-job1 is created.
Reason: MPIJobCreated
Status: True
Type: Created
Last Transition Time: 2025-03-12T20:54:52Z
Last Update Time: 2025-03-12T20:54:52Z
Message: MPIJob default/tensorflow-benchmarks-job1 is running.
Reason: MPIJobRunning
Status: True
Type: Running
Replica Statuses:
Launcher:
Active: 1
Worker:
Active: 2
Start Time: 2025-03-12T20:54:48Z
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal MPIJobCreated 41s mpi-job-controller MPIJob default/tensorflow-benchmarks-job1 is created.
Normal MPIJobRunning 36s (x3 over 37s) mpi-job-controller MPIJob default/tensorflow-benchmarks-job1 is running
# kubectl describe mpijob.kubeflow.org/tensorflow-benchmarks-job2
Name: tensorflow-benchmarks-job2
Namespace: default
Labels: <none>
Annotations: <none>
API Version: kubeflow.org/v2beta1
Kind: MPIJob
Metadata:
Creation Timestamp: 2025-03-12T20:54:48Z
Generation: 1
Resource Version: 2079
UID: 0e11058e-32db-42c3-bdc5-60743a9087a0
Spec:
Launcher Creation Policy: WaitForWorkersReady
Mpi Implementation: OpenMPI
Mpi Replica Specs:
Launcher:
Replicas: 1
Template:
Metadata:
Spec:
Containers:
Command:
mpirun
--allow-run-as-root
-np
2
-bind-to
none
-map-by
slot
-x
NCCL_DEBUG=INFO
-x
LD_LIBRARY_PATH
-x
PATH
-mca
pml
ob1
-mca
btl
^openib
python
scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
--model=resnet101
--batch_size=64
--variable_update=horovod
Image: mpioperator/tensorflow-benchmarks:latest
Name: tensorflow-benchmarks-job2
Resources:
Worker:
Replicas: 2
Template:
Metadata:
Spec:
Containers:
Image: mpioperator/tensorflow-benchmarks:latest
Name: tensorflow-benchmarks-job2
Resources:
Limits:
nvidia.com/gpu: 1
Run Launcher As Worker: false
Run Policy:
Clean Pod Policy: Running
Suspend: false
Slots Per Worker: 1
Ssh Auth Mount Path: /root/.ssh
Status:
Conditions:
Last Transition Time: 2025-03-12T20:54:48Z
Last Update Time: 2025-03-12T20:54:48Z
Message: MPIJob default/tensorflow-benchmarks-job2 is created.
Reason: MPIJobCreated
Status: True
Type: Created
Replica Statuses:
Worker:
Start Time: 2025-03-12T20:54:48Z
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal MPIJobCreated 109s mpi-job-controller MPIJob default/tensorflow-benchmarks-job2 is created.
case2
When an MPIJob is executed that exceeds the GPU nominalQuota, the spec.runPolicy.suspend parameter of the MPIJob changes from false to true. In this case, neither the launcher nor the workers of the MPIJob run.
case3
When an MPIJob is executed that exceeds the GPU nominalQuota, the spec.runPolicy.suspend parameter of the MPIJob does not change from false to true.
In this case, the MPIJob tries to run, but GPU resources are insufficient, so the workers stay pending. Therefore, if WaitForWorkersReady is working, the launcher is expected not to run. However, the launcher is running.
# kubectl get mpijob,workload,pod
NAME AGE
mpijob.kubeflow.org/tensorflow-benchmarks-job1 14s
mpijob.kubeflow.org/tensorflow-benchmarks-job2 14s
NAME QUEUE RESERVED IN ADMITTED FINISHED AGE
workload.kueue.x-k8s.io/mpijob-tensorflow-benchmarks-job1-11dc3 user-queue cluster-queue True 14s
workload.kueue.x-k8s.io/mpijob-tensorflow-benchmarks-job2-64330 user-queue cluster-queue True 14s
NAME READY STATUS RESTARTS AGE
pod/tensorflow-benchmarks-job1-launcher-mfj4b 1/1 Running 0 13s
pod/tensorflow-benchmarks-job1-worker-0 1/1 Running 0 14s
pod/tensorflow-benchmarks-job1-worker-1 1/1 Running 0 14s
pod/tensorflow-benchmarks-job2-launcher-6b5p6 1/1 Running 0 11s #launcher is running,
pod/tensorflow-benchmarks-job2-worker-0 0/1 Pending 0 11s
pod/tensorflow-benchmarks-job2-worker-1 0/1 Pending 0 11s
# kubectl describe mpijob.kubeflow.org/tensorflow-benchmarks-job1
Name: tensorflow-benchmarks-job1
Namespace: default
Labels: kueue.x-k8s.io/queue-name=user-queue
Annotations: <none>
API Version: kubeflow.org/v2beta1
Kind: MPIJob
Metadata:
Creation Timestamp: 2025-03-12T20:58:36Z
Generation: 2
Resource Version: 2768
UID: a6621976-3f98-40d4-8cff-98478fbe603b
Spec:
Launcher Creation Policy: WaitForWorkersReady
Mpi Implementation: OpenMPI
Mpi Replica Specs:
Launcher:
Replicas: 1
Template:
Metadata:
Spec:
Containers:
Command:
mpirun
--allow-run-as-root
-np
2
-bind-to
none
-map-by
slot
-x
NCCL_DEBUG=INFO
-x
LD_LIBRARY_PATH
-x
PATH
-mca
pml
ob1
-mca
btl
^openib
python
scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
--model=resnet101
--batch_size=64
--variable_update=horovod
Image: mpioperator/tensorflow-benchmarks:latest
Name: tensorflow-benchmarks-job1
Resources:
Worker:
Replicas: 2
Template:
Metadata:
Spec:
Containers:
Image: mpioperator/tensorflow-benchmarks:latest
Name: tensorflow-benchmarks-job1
Resources:
Limits:
nvidia.com/gpu: 1
Run Launcher As Worker: false
Run Policy:
Clean Pod Policy: Running
Suspend: false
Slots Per Worker: 1
Ssh Auth Mount Path: /root/.ssh
Status:
Conditions:
Last Transition Time: 2025-03-12T20:58:36Z
Last Update Time: 2025-03-12T20:58:36Z
Message: MPIJob default/tensorflow-benchmarks-job1 is created.
Reason: MPIJobCreated
Status: True
Type: Created
Last Transition Time: 2025-03-12T20:58:37Z
Last Update Time: 2025-03-12T20:58:37Z
Message: MPIJob resumed
Reason: MPIJobResumed
Status: False
Type: Suspended
Last Transition Time: 2025-03-12T20:58:39Z
Last Update Time: 2025-03-12T20:58:39Z
Message: MPIJob default/tensorflow-benchmarks-job1 is running.
Reason: MPIJobRunning
Status: True
Type: Running
Replica Statuses:
Launcher:
Active: 1
Worker:
Active: 2
Start Time: 2025-03-12T20:58:37Z
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal MPIJobCreated 40s (x2 over 40s) mpi-job-controller MPIJob default/tensorflow-benchmarks-job1 is created.
Normal MPIJobSuspended 40s (x2 over 40s) mpi-job-controller MPIJob suspended
Normal CreatedWorkload 40s kubeflow.org/mpijob-kueue-controller Created Workload: default/mpijob-tensorflow-benchmarks-job1-11dc3
Normal Started 40s kubeflow.org/mpijob-kueue-controller Admitted by clusterQueue cluster-queue
Normal MPIJobResumed 39s (x2 over 39s) mpi-job-controller MPIJob resumed
Normal MPIJobRunning 36s (x3 over 37s) mpi-job-controller MPIJob default/tensorflow-benchmarks-job1 is running
# kubectl describe mpijob.kubeflow.org/tensorflow-benchmarks-job2
Name: tensorflow-benchmarks-job2
Namespace: default
Labels: kueue.x-k8s.io/queue-name=user-queue
Annotations: <none>
API Version: kubeflow.org/v2beta1
Kind: MPIJob
Metadata:
Creation Timestamp: 2025-03-12T20:58:36Z
Generation: 2
Resource Version: 2810
UID: 58b33273-19b3-4e86-9152-fac3415526f7
Spec:
Launcher Creation Policy: WaitForWorkersReady
Mpi Implementation: OpenMPI
Mpi Replica Specs:
Launcher:
Replicas: 1
Template:
Metadata:
Spec:
Containers:
Command:
mpirun
--allow-run-as-root
-np
2
-bind-to
none
-map-by
slot
-x
NCCL_DEBUG=INFO
-x
LD_LIBRARY_PATH
-x
PATH
-mca
pml
ob1
-mca
btl
^openib
python
scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
--model=resnet101
--batch_size=64
--variable_update=horovod
Image: mpioperator/tensorflow-benchmarks:latest
Name: tensorflow-benchmarks-job2
Resources:
Worker:
Replicas: 2
Template:
Metadata:
Spec:
Containers:
Image: mpioperator/tensorflow-benchmarks:latest
Name: tensorflow-benchmarks-job2
Resources:
Limits:
nvidia.com/gpu: 1
Run Launcher As Worker: false
Run Policy:
Clean Pod Policy: Running
Suspend: false
Slots Per Worker: 1
Ssh Auth Mount Path: /root/.ssh
Status:
Conditions:
Last Transition Time: 2025-03-12T20:58:36Z
Last Update Time: 2025-03-12T20:58:36Z
Message: MPIJob default/tensorflow-benchmarks-job2 is created.
Reason: MPIJobCreated
Status: True
Type: Created
Last Transition Time: 2025-03-12T20:58:37Z
Last Update Time: 2025-03-12T20:58:37Z
Message: MPIJob default/tensorflow-benchmarks-job2 is suspended.
Reason: MPIJobSuspended
Status: False
Type: Running
Last Transition Time: 2025-03-12T20:58:39Z
Last Update Time: 2025-03-12T20:58:39Z
Message: MPIJob resumed
Reason: MPIJobResumed
Status: False
Type: Suspended
Replica Statuses:
Launcher:
Active: 1
Worker:
Start Time: 2025-03-12T20:58:39Z
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal CreatedWorkload 75s kubeflow.org/mpijob-kueue-controller Created Workload: default/mpijob-tensorflow-benchmarks-job2-64330
Normal MPIJobCreated 74s (x2 over 75s) mpi-job-controller MPIJob default/tensorflow-benchmarks-job2 is created.
Normal MPIJobSuspended 74s (x2 over 74s) mpi-job-controller MPIJob suspended
Normal Started 72s kubeflow.org/mpijob-kueue-controller Admitted by clusterQueue cluster-queue
Normal MPIJobResumed 72s (x2 over 72s) mpi-job-controller MPIJob resumed
# kubectl describe workload.kueue.x-k8s.io/mpijob-tensorflow-benchmarks-job1-11dc3
Name: mpijob-tensorflow-benchmarks-job1-11dc3
Namespace: default
Labels: kueue.x-k8s.io/job-uid=a6621976-3f98-40d4-8cff-98478fbe603b
Annotations: <none>
API Version: kueue.x-k8s.io/v1beta1
Kind: Workload
Metadata:
Creation Timestamp: 2025-03-12T20:58:36Z
Finalizers:
kueue.x-k8s.io/resource-in-use
Generation: 1
Owner References:
API Version: kubeflow.org/v2beta1
Block Owner Deletion: true
Controller: true
Kind: MPIJob
Name: tensorflow-benchmarks-job1
UID: a6621976-3f98-40d4-8cff-98478fbe603b
Resource Version: 2770
UID: cdb1c18a-144d-456f-be18-3073b6bd59b4
Spec:
Active: true
Pod Sets:
Count: 1
Name: launcher
Template:
Metadata:
Spec:
Containers:
Command:
mpirun
--allow-run-as-root
-np
2
-bind-to
none
-map-by
slot
-x
NCCL_DEBUG=INFO
-x
LD_LIBRARY_PATH
-x
PATH
-mca
pml
ob1
-mca
btl
^openib
python
scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
--model=resnet101
--batch_size=64
--variable_update=horovod
Image: mpioperator/tensorflow-benchmarks:latest
Name: tensorflow-benchmarks-job1
Resources:
Count: 2
Name: worker
Template:
Metadata:
Spec:
Containers:
Image: mpioperator/tensorflow-benchmarks:latest
Name: tensorflow-benchmarks-job1
Resources:
Limits:
nvidia.com/gpu: 1
Priority: 0
Priority Class Source:
Queue Name: user-queue
Status:
Admission:
Cluster Queue: cluster-queue
Pod Set Assignments:
Count: 1
Name: launcher
Count: 2
Flavors:
nvidia.com/gpu: default-flavor
Name: worker
Resource Usage:
nvidia.com/gpu: 2
Conditions:
Last Transition Time: 2025-03-12T20:58:36Z
Message: Quota reserved in ClusterQueue cluster-queue
Observed Generation: 1
Reason: QuotaReserved
Status: True
Type: QuotaReserved
Last Transition Time: 2025-03-12T20:58:36Z
Message: The workload is admitted
Observed Generation: 1
Reason: Admitted
Status: True
Type: Admitted
Last Transition Time: 2025-03-12T20:58:39Z
Message: All pods were ready or succeeded since the workload admission
Observed Generation: 1
Reason: PodsReady
Status: True
Type: PodsReady
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal QuotaReserved 2m7s kueue-admission Quota reserved in ClusterQueue cluster-queue, wait time since queued was 0s
Normal Admitted 2m7s kueue-admission Admitted by ClusterQueue cluster-queue, wait time since reservation was 0s
# kubectl describe workload.kueue.x-k8s.io/mpijob-tensorflow-benchmarks-job2-64330
Name: mpijob-tensorflow-benchmarks-job2-64330
Namespace: default
Labels: kueue.x-k8s.io/job-uid=58b33273-19b3-4e86-9152-fac3415526f7
Annotations: <none>
API Version: kueue.x-k8s.io/v1beta1
Kind: Workload
Metadata:
Creation Timestamp: 2025-03-12T20:58:36Z
Finalizers:
kueue.x-k8s.io/resource-in-use
Generation: 1
Owner References:
API Version: kubeflow.org/v2beta1
Block Owner Deletion: true
Controller: true
Kind: MPIJob
Name: tensorflow-benchmarks-job2
UID: 58b33273-19b3-4e86-9152-fac3415526f7
Resource Version: 2772
UID: 8fb74084-927a-4abc-90c8-16efb5c1515d
Spec:
Active: true
Pod Sets:
Count: 1
Name: launcher
Template:
Metadata:
Spec:
Containers:
Command:
mpirun
--allow-run-as-root
-np
2
-bind-to
none
-map-by
slot
-x
NCCL_DEBUG=INFO
-x
LD_LIBRARY_PATH
-x
PATH
-mca
pml
ob1
-mca
btl
^openib
python
scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
--model=resnet101
--batch_size=64
--variable_update=horovod
Image: mpioperator/tensorflow-benchmarks:latest
Name: tensorflow-benchmarks-job2
Resources:
Count: 2
Name: worker
Template:
Metadata:
Spec:
Containers:
Image: mpioperator/tensorflow-benchmarks:latest
Name: tensorflow-benchmarks-job2
Resources:
Limits:
nvidia.com/gpu: 1
Priority: 0
Priority Class Source:
Queue Name: user-queue
Status:
Admission:
Cluster Queue: cluster-queue
Pod Set Assignments:
Count: 1
Name: launcher
Count: 2
Flavors:
nvidia.com/gpu: default-flavor
Name: worker
Resource Usage:
nvidia.com/gpu: 2
Conditions:
Last Transition Time: 2025-03-12T20:58:39Z
Message: Quota reserved in ClusterQueue cluster-queue
Observed Generation: 1
Reason: QuotaReserved
Status: True
Type: QuotaReserved
Last Transition Time: 2025-03-12T20:58:36Z
Message: Not all pods are ready or succeeded
Observed Generation: 1
Reason: PodsReady
Status: False
Type: PodsReady
Last Transition Time: 2025-03-12T20:58:39Z
Message: The workload is admitted
Observed Generation: 1
Reason: Admitted
Status: True
Type: Admitted
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal QuotaReserved 2m47s kueue-admission Quota reserved in ClusterQueue cluster-queue, wait time since queued was 4s
Normal Admitted 2m47s kueue-admission Admitted by ClusterQueue cluster-queue, wait time since reservation was 0s
@tenzen-y
Additional information
https://github.com/ttakahashi21/mpi-operator/blob/dev-takahashi/pkg/controller/mpi_job_controller.go#L652-L686
I added the debugging code linked above and confirmed the behavior.
When using Kueue, is it expected that the initial value of mpiJob.Spec.RunPolicy.Suspend is true?
case1
- The default value of mpiJob.Spec.RunPolicy.Suspend for job1 and job2 is false, so "!isMPIJobSuspended(mpiJob)" is true for both jobs and the workers are created or fetched. Since the launcher has not been created yet, the condition "launcher == nil" is also true for both jobs. For job1, "c.countReadyWorkerPods(worker) == len(worker)" eventually becomes true and the launcher for job1 is created. For job2, neither "LauncherCreationPolicy == kubeflow.LauncherCreationPolicyAtStartup" nor "c.countReadyWorkerPods(worker) == len(worker)" is true, so the launcher for job2 is not created.
The details of the debug log are as follows
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - c.countReadyWorkerPods(worker) : 0 - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - len(worker) : 2 - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - c.countReadyWorkerPods(worker) : 0 - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - len(worker) : 2 - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - c.countReadyWorkerPods(worker) : 0 - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - len(worker) : 2 - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - c.countReadyWorkerPods(worker) : 0 - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - len(worker) : 2 - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - c.countReadyWorkerPods(worker) : 0 - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - len(worker) : 2 - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - c.countReadyWorkerPods(worker) : 0 - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - len(worker) : 2 - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - c.countReadyWorkerPods(worker) : 0 - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - len(worker) : 2 - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - c.countReadyWorkerPods(worker) : 0 - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - len(worker) : 2 - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - c.countReadyWorkerPods(worker) : 1 - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - len(worker) : 2 - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - c.countReadyWorkerPods(worker) : 1 - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - len(worker) : 2 - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - c.countReadyWorkerPods(worker) : 2 - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - len(worker) : 2 - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.LauncherCreationPolicy == kubeflow.LauncherCreationPolicyAtStartup || c.countReadyWorkerPods(checkworker) == len(checkworker): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - deploy launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
case3
The value of mpiJob.Spec.RunPolicy.Suspend for job1 and job2 is set to true when the MPIJob is first started, so "!isMPIJobSuspended(mpiJob)" is false for both jobs and the workers are not created or fetched. Since the launcher has not been created yet, the condition "launcher == nil" is true for both jobs. However, the condition "c.countReadyWorkerPods(worker) == len(worker)" becomes true for job1 and job2, and the launcher for both jobs is created. The reason is that "c.countReadyWorkerPods(worker)" and "len(worker)" are both 0, because the worker information was never fetched.
The details of the debug log are as follows
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - c.countReadyWorkerPods(worker) : 0 - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - len(worker) : 0 - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.LauncherCreationPolicy == kubeflow.LauncherCreationPolicyAtStartup || c.countReadyWorkerPods(checkworker) == len(checkworker): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - deploy launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - c.countReadyWorkerPods(worker) : 0 - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - len(worker) : 0 - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.LauncherCreationPolicy == kubeflow.LauncherCreationPolicyAtStartup || c.countReadyWorkerPods(checkworker) == len(checkworker): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - deploy launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi
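Based on the debug output above, a hedged sketch of the gating that would close the case3 gap: check suspension before the WaitForWorkersReady count, and do not treat an empty worker list as "all workers ready". The field and constant names follow the logs and type snippet above, the import path and the pointer dereference on Suspend are assumptions, and this is an illustration rather than the actual patch:
package sketch

import (
	kubeflow "github.com/kubeflow/mpi-operator/pkg/apis/kubeflow/v2beta1"
	corev1 "k8s.io/api/core/v1"
)

// shouldCreateLauncherNow returns true only when the MPIJob is not suspended and
// its launcherCreationPolicy is satisfied. The len(workers) > 0 guard prevents an
// empty worker list (as seen while the job is suspended) from vacuously satisfying
// "all workers ready", which is what the case3 log shows.
func shouldCreateLauncherNow(mpiJob *kubeflow.MPIJob, workers []*corev1.Pod, readyWorkers int) bool {
	if mpiJob.Spec.RunPolicy.Suspend != nil && *mpiJob.Spec.RunPolicy.Suspend {
		return false
	}
	if mpiJob.Spec.LauncherCreationPolicy == kubeflow.LauncherCreationPolicyAtStartup {
		return true
	}
	return len(workers) > 0 && readyWorkers == len(workers)
}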
In my opinion, the modification from #617 is necessary.
@GonzaloSaez please fix the DCO
For context, linking it back to the related Kueue issue: kubernetes-sigs/kueue#3400 and the slack discussion https://kubernetes.slack.com/archives/C032ZE66A2X/p1730369507818399
Signed-off-by: GonzaloSaez <[email protected]>
When the MPIJob starts suspended, we were creating the launcher job regardless of the initial suspended state. This causes issues with kueue, since it will suspend the MPIJob but the launcher job will already have been created with the wrong NodeSelector with respect to the kueue flavour. I think avoiding creating the launcher in this scenario is the right thing to do, but I'm not sure if others have different thoughts.