
Do not create the launcher job if the job starts suspended #670

Open · wants to merge 1 commit into master

Conversation

@GonzaloSaez (Contributor) commented Oct 31, 2024

When the MPIJob starts suspended, we were creating the launcher job regardless of the initial suspended state. This causes issues with Kueue, since it will suspend the MPIJob but we will create a job with the wrong NodeSelector coming from the Kueue flavor. I think avoiding creating the launcher in this scenario is the right thing to do, but I'm not sure if others have different thoughts.
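
A minimal sketch of the suspension check involved, assuming the kubeflow.org/v2beta1 API types and the k8s.io/utils/ptr helpers (the controller's actual helper may differ slightly):

package main

import (
	"fmt"

	kubeflow "github.com/kubeflow/mpi-operator/pkg/apis/kubeflow/v2beta1"
	"k8s.io/utils/ptr"
)

// isMPIJobSuspended reports whether .spec.runPolicy.suspend is set to true.
// A nil pointer is treated as "not suspended".
func isMPIJobSuspended(mpiJob *kubeflow.MPIJob) bool {
	return ptr.Deref(mpiJob.Spec.RunPolicy.Suspend, false)
}

func main() {
	job := &kubeflow.MPIJob{}
	job.Spec.RunPolicy.Suspend = ptr.To(true) // e.g. as set by Kueue on creation
	fmt.Println(isMPIJobSuspended(job))       // true
}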


[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign alculquicondor for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Comment on lines +661 to +663
// If the job is suspended, the list of worker pods will be incorrect. We also do
// not want to start the launcher job if the MPIJob starts suspended.
if launcher == nil && !isMPIJobSuspended(mpiJob) {
@GonzaloSaez (Contributor, author):

This opens the question of what should be done when unsuspending the launcher job in case Kueue has decided to change the NodeSelector. Should we instead recreate the job, since NodeSelector is immutable?

Contributor:

Yeah, I think this question does not have a straightforward answer. I can see at least two possible approaches:

  1. As in JobSet: update the NodeSelector field in the pod template when resuming the Job.
  2. Recreate the launcher/worker Jobs; this can probably be achieved easily by deleting the Jobs when suspending the MPIJob (see the sketch below).

I'm ok with whichever of those is simpler to implement. Any opinion @tenzen-y ?
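
For concreteness, option 2 could look roughly like this; a hypothetical sketch, not code from this PR, assuming a typed clientset (deleteLauncherOnSuspend is a made-up helper name):

package sketch

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deleteLauncherOnSuspend deletes the launcher Job while the MPIJob is
// suspended so that it is recreated, with an up-to-date pod template
// (including any NodeSelector injected by Kueue), on resume.
func deleteLauncherOnSuspend(ctx context.Context, client kubernetes.Interface, namespace, launcherName string) error {
	policy := metav1.DeletePropagationBackground
	err := client.BatchV1().Jobs(namespace).Delete(ctx, launcherName,
		metav1.DeleteOptions{PropagationPolicy: &policy})
	if err != nil && !apierrors.IsNotFound(err) {
		return err
	}
	return nil
}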

Contributor:

In any case I think it is safe to decouple the fixes.

Member:

Sorry for the delayed response. IMO, I would select option 1 (as in JobSet: update the NodeSelector field in the pod template when resuming the Job) instead of the current solution, because I want to align the behavior with JobSet, since JobSet creates the Job even while it is suspended.

@GonzaloSaez (Contributor, author) commented Feb 5, 2025:

That's already being done in theory by Kueue with RunWithPodSetsInfo, right? I think the issue here is that we create the job in non-suspended mode even if the MPIJob is suspended. This results in pods being scheduled on nodes and then removed because the controller suspends the job later on.

@ttakahashi21 commented Mar 7, 2025:

@tenzen-y
I think it reproduces when GPUs are used in areas not managed by Kueue. This procedure attaches two GPUs to a node and sets the ClusterQueue quota to twice the number of GPUs allocatable in the cluster.
Two jobs, each using two GPUs, are executed simultaneously. In this case, the launcher runs even though the workers are Pending.

  1. Create Kind Cluster and set label for fake-gpu-operator

    kind create cluster
    kubectl label node kind-control-plane run.ai/simulated-gpu-node-pool=default
  2. Install fake gpu operator

    helm repo add fake-gpu-operator https://fake-gpu-operator.storage.googleapis.com
    helm repo update
    helm upgrade -i gpu-operator fake-gpu-operator/fake-gpu-operator --namespace gpu-operator --create-namespace
  • Check that the fake GPU is shown by running kubectl get all -n gpu-operator and kubectl describe node | grep "nvidia.com/gpu:".
  3. Install mpi-operator

    kubectl apply --server-side -f https://raw.githubusercontent.com/kubeflow/mpi-operator/master/deploy/v2beta1/mpi-operator.yaml
  • Check that mpi-operator is deployed properly by running kubectl get all -n mpi-operator.
  4. Install kueue with waitForPodsReady enabled

    Get kueue manifest and enable waitForPodsReady.

    wget https://github.com/kubernetes-sigs/kueue/releases/download/v0.10.2/manifests.yaml
    sed -i '/#waitForPodsReady:/a \    waitForPodsReady:\n      enable: true\n      timeout: 5m\n      blockAdmission: true\n      requeuingStrategy:\n        timestamp: Eviction\n        backoffLimitCount: 5\n        backoffBaseSeconds: 60\n        backoffMaxSeconds: 3600' manifests.yaml

    Then, apply the configuration by:

    kubectl apply --server-side -f manifests.yaml
  • Check that kueue is deployed properly by running kubectl get all -n kueue-system.
  5. Prepare a minimal kueue setup with GPU enabled

    First, check the amount of allocatable GPU in your cluster.

    TOTAL_ALLOCATABLE=$(kubectl get node --selector='run.ai/simulated-gpu-node-pool=default,node-role.kubernetes.io/control-plane' -o jsonpath='{range .items[*]}{.status.allocatable.nvidia\.com\/gpu}{"\n"}{end}' | numfmt --from=auto | awk '{s+=$1} END {print s}')
    echo $TOTAL_ALLOCATABLE

    In our case this outputs 2.

  • Configure ClusterQueue quota

    We configure the GPU flavor by doubling the total GPU allocatable in our cluster, in order to simulate issues with provisioning.

    Execute the following command to create the cluster queue configuration as single-clusterqueue-setup-gpu4.yaml:

    cat <<EOF >> single-clusterqueue-setup-gpu4.yaml
    apiVersion: kueue.x-k8s.io/v1beta1
    kind: ResourceFlavor
    metadata:
      name: "default-flavor"
    ---
    apiVersion: kueue.x-k8s.io/v1beta1
    kind: ClusterQueue
    metadata:
      name: "cluster-queue"
    spec:
      namespaceSelector: {} # match all.
      resourceGroups:
      - coveredResources: ["nvidia.com/gpu"]
        flavors:
        - name: "default-flavor"
          resources:
          - name: "nvidia.com/gpu"
            nominalQuota: 4 # double the value of allocatable GPU in the cluster
    ---
    apiVersion: kueue.x-k8s.io/v1beta1
    kind: LocalQueue
    metadata:
      namespace: "default"
      name: "user-queue"
    spec:
      clusterQueue: "cluster-queue"
    EOF

    Then, apply the configuration by:

    kubectl apply -f single-clusterqueue-setup-gpu4.yaml
  6. Verify that an MPIJob works with 2 replicas (2 GPUs are available on the node)

    Get tensorflow-benchmarks.yaml

    wget https://raw.githubusercontent.com/kubeflow/mpi-operator/refs/heads/master/examples/v2beta1/tensorflow-benchmarks/tensorflow-benchmarks.yaml
  7. Run with waitForPodsReady enabled

    • Create start.sh script
    cat <<EOF >> start.sh
    sed '/^metadata:/a \  labels:\n    kueue.x-k8s.io/queue-name: user-queue' tensorflow-benchmarks.yaml | sed  's/name: tensorflow-benchmarks/name: tensorflow-benchmarks-job1/g'  > /tmp/tensorflow-benchmarks-job1.yaml
    sed '/^metadata:/a \  labels:\n    kueue.x-k8s.io/queue-name: user-queue' tensorflow-benchmarks.yaml | sed  's/name: tensorflow-benchmarks/name: tensorflow-benchmarks-job2/g'  > /tmp/tensorflow-benchmarks-job2.yaml
    kubectl create -f /tmp/tensorflow-benchmarks-job1.yaml
    kubectl create -f /tmp/tensorflow-benchmarks-job2.yaml
    EOF
    
    chmod +x start.sh
    • Run the start.sh script
    ./start.sh
  8. Monitor the progress

# kubectl get mpijob,workload,pod
NAME                                             AGE
mpijob.kubeflow.org/tensorflow-benchmarks-job1   8s
mpijob.kubeflow.org/tensorflow-benchmarks-job2   8s

NAME                                                              QUEUE        RESERVED IN     ADMITTED   FINISHED   AGE
workload.kueue.x-k8s.io/mpijob-tensorflow-benchmarks-job1-73f4b   user-queue   cluster-queue   True                  8s
workload.kueue.x-k8s.io/mpijob-tensorflow-benchmarks-job2-8caef   user-queue   cluster-queue   True                  8s

NAME                                            READY   STATUS    RESTARTS   AGE
pod/tensorflow-benchmarks-job1-launcher-zx4ck   1/1     Running   0          7s
pod/tensorflow-benchmarks-job1-worker-0         1/1     Running   0          8s
pod/tensorflow-benchmarks-job1-worker-1         1/1     Running   0          8s
pod/tensorflow-benchmarks-job2-launcher-zch7j   1/1     Running   0          4s # launcher runs even though the workers are pending
pod/tensorflow-benchmarks-job2-worker-0         0/1     Pending   0          5s
pod/tensorflow-benchmarks-job2-worker-1         0/1     Pending   0          4s
# kubectl describe workload.kueue.x-k8s.io/mpijob-tensorflow-benchmarks-job2-8caef
Name:         mpijob-tensorflow-benchmarks-job2-8caef
Namespace:    default
Labels:       kueue.x-k8s.io/job-uid=a2ad4da7-eb57-477d-a9bb-1142d8d847dc
Annotations:  <none>
API Version:  kueue.x-k8s.io/v1beta1
Kind:         Workload
Metadata:
  Creation Timestamp:  2025-03-07T06:59:46Z
  Finalizers:
    kueue.x-k8s.io/resource-in-use
  Generation:  1
  Owner References:
    API Version:           kubeflow.org/v2beta1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  MPIJob
    Name:                  tensorflow-benchmarks-job2
    UID:                   a2ad4da7-eb57-477d-a9bb-1142d8d847dc
  Resource Version:        4618
  UID:                     34379ab2-1539-4f61-b69a-fcdad7768478
Spec:
  Active:  true
  Pod Sets:
    Count:  1
    Name:   launcher
    Template:
      Metadata:
      Spec:
        Containers:
          Command:
            mpirun
            --allow-run-as-root
            -np
            2
            -bind-to
            none
            -map-by
            slot
            -x
            NCCL_DEBUG=INFO
            -x
            LD_LIBRARY_PATH
            -x
            PATH
            -mca
            pml
            ob1
            -mca
            btl
            ^openib
            python
            scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
            --model=resnet101
            --batch_size=64
            --variable_update=horovod
          Image:  mpioperator/tensorflow-benchmarks:latest
          Name:   tensorflow-benchmarks-job2
          Resources:
    Count:  2
    Name:   worker
    Template:
      Metadata:
      Spec:
        Containers:
          Image:  mpioperator/tensorflow-benchmarks:latest
          Name:   tensorflow-benchmarks-job2
          Resources:
            Limits:
              nvidia.com/gpu:  1
  Priority:                    0
  Priority Class Source:       
  Queue Name:                  user-queue
Status:
  Admission:
    Cluster Queue:  cluster-queue
    Pod Set Assignments:
      Count:  1
      Name:   launcher
      Count:  2
      Flavors:
        nvidia.com/gpu:  default-flavor
      Name:              worker
      Resource Usage:
        nvidia.com/gpu:  2
  Conditions:
    Last Transition Time:  2025-03-07T06:59:49Z
    Message:               Quota reserved in ClusterQueue cluster-queue
    Observed Generation:   1
    Reason:                QuotaReserved
    Status:                True
    Type:                  QuotaReserved
    Last Transition Time:  2025-03-07T06:59:47Z
    Message:               Not all pods are ready or succeeded
    Observed Generation:   1
    Reason:                PodsReady
    Status:                False
    Type:                  PodsReady
    Last Transition Time:  2025-03-07T06:59:49Z
    Message:               The workload is admitted
    Observed Generation:   1
    Reason:                Admitted
    Status:                True
    Type:                  Admitted
Events:
  Type    Reason         Age   From             Message
  ----    ------         ----  ----             -------
  Normal  QuotaReserved  28s   kueue-admission  Quota reserved in ClusterQueue cluster-queue, wait time since queued was 4s
  Normal  Admitted       28s   kueue-admission  Admitted by ClusterQueue cluster-queue, wait time since reservation was 0s
# kubectl describe mpijob.kubeflow.org/tensorflow-benchmarks-job2
Name:         tensorflow-benchmarks-job2
Namespace:    default
Labels:       kueue.x-k8s.io/queue-name=user-queue
Annotations:  <none>
API Version:  kubeflow.org/v2beta1
Kind:         MPIJob
Metadata:
  Creation Timestamp:  2025-03-07T06:59:46Z
  Generation:          2
  Resource Version:    4657
  UID:                 a2ad4da7-eb57-477d-a9bb-1142d8d847dc
Spec:
  Launcher Creation Policy:  AtStartup
  Mpi Implementation:        OpenMPI
  Mpi Replica Specs:
    Launcher:
      Replicas:  1
      Template:
        Metadata:
        Spec:
          Containers:
            Command:
              mpirun
              --allow-run-as-root
              -np
              2
              -bind-to
              none
              -map-by
              slot
              -x
              NCCL_DEBUG=INFO
              -x
              LD_LIBRARY_PATH
              -x
              PATH
              -mca
              pml
              ob1
              -mca
              btl
              ^openib
              python
              scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
              --model=resnet101
              --batch_size=64
              --variable_update=horovod
            Image:  mpioperator/tensorflow-benchmarks:latest
            Name:   tensorflow-benchmarks-job2
            Resources:
    Worker:
      Replicas:  2
      Template:
        Metadata:
        Spec:
          Containers:
            Image:  mpioperator/tensorflow-benchmarks:latest
            Name:   tensorflow-benchmarks-job2
            Resources:
              Limits:
                nvidia.com/gpu:  1
  Run Launcher As Worker:        false
  Run Policy:
    Clean Pod Policy:   Running
    Suspend:            false
  Slots Per Worker:     1
  Ssh Auth Mount Path:  /root/.ssh
Status:
  Conditions:
    Last Transition Time:  2025-03-07T06:59:47Z
    Last Update Time:      2025-03-07T06:59:47Z
    Message:               MPIJob default/tensorflow-benchmarks-job2 is created.
    Reason:                MPIJobCreated
    Status:                True
    Type:                  Created
    Last Transition Time:  2025-03-07T06:59:48Z
    Last Update Time:      2025-03-07T06:59:48Z
    Message:               MPIJob default/tensorflow-benchmarks-job2 is suspended.
    Reason:                MPIJobSuspended
    Status:                False
    Type:                  Running
    Last Transition Time:  2025-03-07T06:59:50Z
    Last Update Time:      2025-03-07T06:59:50Z
    Message:               MPIJob resumed
    Reason:                MPIJobResumed
    Status:                False
    Type:                  Suspended
  Replica Statuses:
    Launcher:
      Active:  1
    Worker:
  Start Time:  2025-03-07T06:59:50Z
Events:
  Type    Reason           Age                From                                  Message
  ----    ------           ----               ----                                  -------
  Normal  CreatedWorkload  99s                kubeflow.org/mpijob-kueue-controller  Created Workload: default/mpijob-tensorflow-benchmarks-job2-8caef
  Normal  MPIJobCreated    98s (x2 over 99s)  mpi-job-controller                    MPIJob default/tensorflow-benchmarks-job2 is created.
  Normal  MPIJobSuspended  98s (x2 over 98s)  mpi-job-controller                    MPIJob suspended
  Normal  Started          97s                kubeflow.org/mpijob-kueue-controller  Admitted by clusterQueue cluster-queue
  Normal  MPIJobResumed    96s                mpi-job-controller                    MPIJob resumed
# kubectl describe pod/tensorflow-benchmarks-job2-worker-0
Name:             tensorflow-benchmarks-job2-worker-0
Namespace:        default
Priority:         0
Service Account:  default
Node:             <none>
Labels:           training.kubeflow.org/job-name=tensorflow-benchmarks-job2
                  training.kubeflow.org/job-role=worker
                  training.kubeflow.org/operator-name=mpi-operator
                  training.kubeflow.org/replica-index=0
Annotations:      <none>
Status:           Pending
IP:               
IPs:              <none>
Controlled By:    MPIJob/tensorflow-benchmarks-job2
Containers:
  tensorflow-benchmarks-job2:
    Image:      mpioperator/tensorflow-benchmarks:latest
    Port:       <none>
    Host Port:  <none>
    Command:
      /usr/sbin/sshd
      -De
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:
      K_MPI_JOB_ROLE:  worker
    Mounts:
      /root/.ssh from ssh-auth (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4l5z8 (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  ssh-auth:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  tensorflow-benchmarks-job2-ssh
    Optional:    false
  kube-api-access-4l5z8:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age    From               Message
  ----     ------            ----   ----               -------
  Warning  FailedScheduling  2m42s  default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
# kubectl get mpijob,workload,pod
NAME                                             AGE
mpijob.kubeflow.org/tensorflow-benchmarks-job1   5m6s
mpijob.kubeflow.org/tensorflow-benchmarks-job2   5m6s

NAME                                                              QUEUE        RESERVED IN     ADMITTED   FINISHED   AGE
workload.kueue.x-k8s.io/mpijob-tensorflow-benchmarks-job1-73f4b   user-queue   cluster-queue   True                  5m6s
workload.kueue.x-k8s.io/mpijob-tensorflow-benchmarks-job2-8caef   user-queue                   False                 5m6s

NAME                                            READY   STATUS    RESTARTS        AGE
pod/tensorflow-benchmarks-job1-launcher-zx4ck   1/1     Running   1 (2m27s ago)   5m5s
pod/tensorflow-benchmarks-job1-worker-0         1/1     Running   0               5m6s
pod/tensorflow-benchmarks-job1-worker-1         1/1     Running   0               5m6s
  9. Cleanup

Clean up the jobs by:

kubectl delete -f /tmp/tensorflow-benchmarks-job1.yaml
kubectl delete -f /tmp/tensorflow-benchmarks-job2.yaml
  10. Change ClusterQueue quota

    Set the nominalQuota to 2, the number of GPUs actually allocatable in the cluster.

cat <<EOF >> single-clusterqueue-setup-gpu2.yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "default-flavor"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "cluster-queue"
spec:
  namespaceSelector: {} # match all.
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: "default-flavor"
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 2 
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: "default"
  name: "user-queue"
spec:
  clusterQueue: "cluster-queue"
EOF

Then, apply the configuration by:

kubectl apply -f single-clusterqueue-setup-gpu2.yaml
  • Run the start.sh script
./start.sh
  • Monitor the progress
# kubectl get mpijob,workload,pod
NAME                                             AGE
mpijob.kubeflow.org/tensorflow-benchmarks-job1   19s
mpijob.kubeflow.org/tensorflow-benchmarks-job2   19s

NAME                                                              QUEUE        RESERVED IN     ADMITTED   FINISHED   AGE
workload.kueue.x-k8s.io/mpijob-tensorflow-benchmarks-job1-7d852   user-queue   cluster-queue   True                  19s
workload.kueue.x-k8s.io/mpijob-tensorflow-benchmarks-job2-c171c   user-queue                                         19s

NAME                                            READY   STATUS    RESTARTS   AGE
pod/tensorflow-benchmarks-job1-launcher-qg2xj   1/1     Running   0          18s
pod/tensorflow-benchmarks-job1-worker-0         1/1     Running   0          19s
pod/tensorflow-benchmarks-job1-worker-1         1/1     Running   0          19s
# kubectl describe workload.kueue.x-k8s.io/mpijob-tensorflow-benchmarks-job2-c171c
Name:         mpijob-tensorflow-benchmarks-job2-c171c
Namespace:    default
Labels:       kueue.x-k8s.io/job-uid=211c9316-b9ad-400a-b108-7549d40517b5
Annotations:  <none>
API Version:  kueue.x-k8s.io/v1beta1
Kind:         Workload
Metadata:
  Creation Timestamp:  2025-03-07T07:11:20Z
  Finalizers:
    kueue.x-k8s.io/resource-in-use
  Generation:  1
  Owner References:
    API Version:           kubeflow.org/v2beta1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  MPIJob
    Name:                  tensorflow-benchmarks-job2
    UID:                   211c9316-b9ad-400a-b108-7549d40517b5
  Resource Version:        6372
  UID:                     7c481e50-9be5-4950-aff3-773d8e475989
Spec:
  Active:  true
  Pod Sets:
    Count:  1
    Name:   launcher
    Template:
      Metadata:
      Spec:
        Containers:
          Command:
            mpirun
            --allow-run-as-root
            -np
            2
            -bind-to
            none
            -map-by
            slot
            -x
            NCCL_DEBUG=INFO
            -x
            LD_LIBRARY_PATH
            -x
            PATH
            -mca
            pml
            ob1
            -mca
            btl
            ^openib
            python
            scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
            --model=resnet101
            --batch_size=64
            --variable_update=horovod
          Image:  mpioperator/tensorflow-benchmarks:latest
          Name:   tensorflow-benchmarks-job2
          Resources:
    Count:  2
    Name:   worker
    Template:
      Metadata:
      Spec:
        Containers:
          Image:  mpioperator/tensorflow-benchmarks:latest
          Name:   tensorflow-benchmarks-job2
          Resources:
            Limits:
              nvidia.com/gpu:  1
  Priority:                    0
  Priority Class Source:       
  Queue Name:                  user-queue
Status:
  Conditions:
    Last Transition Time:  2025-03-07T07:11:20Z
    Message:               Not all pods are ready or succeeded
    Observed Generation:   1
    Reason:                PodsReady
    Status:                False
    Type:                  PodsReady
    Last Transition Time:  2025-03-07T07:11:20Z
    Message:               couldn't assign flavors to pod set worker: insufficient unused quota for nvidia.com/gpu in flavor default-flavor, 2 more needed
    Observed Generation:   1
    Reason:                Pending
    Status:                False
    Type:                  QuotaReserved
  Resource Requests:
    Name:  launcher
    Name:  worker
    Resources:
      nvidia.com/gpu:  2
Events:
  Type     Reason   Age                From             Message
  ----     ------   ----               ----             -------
  Warning  Pending  33s (x3 over 33s)  kueue-admission  couldn't assign flavors to pod set worker: insufficient unused quota for nvidia.com/gpu in flavor default-flavor, 2 more needed
# kubectl describe mpijob.kubeflow.org/tensorflow-benchmarks-job2
Name:         tensorflow-benchmarks-job2
Namespace:    default
Labels:       kueue.x-k8s.io/queue-name=user-queue
Annotations:  <none>
API Version:  kubeflow.org/v2beta1
Kind:         MPIJob
Metadata:
  Creation Timestamp:  2025-03-07T07:11:20Z
  Generation:          1
  Resource Version:    6406
  UID:                 211c9316-b9ad-400a-b108-7549d40517b5
Spec:
  Launcher Creation Policy:  AtStartup
  Mpi Implementation:        OpenMPI
  Mpi Replica Specs:
    Launcher:
      Replicas:  1
      Template:
        Metadata:
        Spec:
          Containers:
            Command:
              mpirun
              --allow-run-as-root
              -np
              2
              -bind-to
              none
              -map-by
              slot
              -x
              NCCL_DEBUG=INFO
              -x
              LD_LIBRARY_PATH
              -x
              PATH
              -mca
              pml
              ob1
              -mca
              btl
              ^openib
              python
              scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
              --model=resnet101
              --batch_size=64
              --variable_update=horovod
            Image:  mpioperator/tensorflow-benchmarks:latest
            Name:   tensorflow-benchmarks-job2
            Resources:
    Worker:
      Replicas:  2
      Template:
        Metadata:
        Spec:
          Containers:
            Image:  mpioperator/tensorflow-benchmarks:latest
            Name:   tensorflow-benchmarks-job2
            Resources:
              Limits:
                nvidia.com/gpu:  1
  Run Launcher As Worker:        false
  Run Policy:
    Clean Pod Policy:   Running
    Suspend:            true
  Slots Per Worker:     1
  Ssh Auth Mount Path:  /root/.ssh
Status:
  Conditions:
    Last Transition Time:  2025-03-07T07:11:20Z
    Last Update Time:      2025-03-07T07:11:20Z
    Message:               MPIJob default/tensorflow-benchmarks-job2 is created.
    Reason:                MPIJobCreated
    Status:                True
    Type:                  Created
    Last Transition Time:  2025-03-07T07:11:21Z
    Last Update Time:      2025-03-07T07:11:21Z
    Message:               MPIJob suspended
    Reason:                MPIJobSuspended
    Status:                True
    Type:                  Suspended
    Last Transition Time:  2025-03-07T07:11:21Z
    Last Update Time:      2025-03-07T07:11:21Z
    Message:               MPIJob default/tensorflow-benchmarks-job2 is suspended.
    Reason:                MPIJobSuspended
    Status:                False
    Type:                  Running
  Replica Statuses:
    Launcher:
    Worker:
Events:
  Type    Reason           Age   From                                  Message
  ----    ------           ----  ----                                  -------
  Normal  MPIJobCreated    75s   mpi-job-controller                    MPIJob default/tensorflow-benchmarks-job2 is created.
  Normal  CreatedWorkload  75s   kubeflow.org/mpijob-kueue-controller  Created Workload: default/mpijob-tensorflow-benchmarks-job2-c171c
  Normal  MPIJobSuspended  74s   mpi-job-controller                    MPIJob suspended
  11. Cleanup

    Clean up the jobs by:

    kubectl delete -f /tmp/tensorflow-benchmarks-job1.yaml
    kubectl delete -f /tmp/tensorflow-benchmarks-job2.yaml

Member:

@ttakahashi21 Thank you for describing your situation. As far as I can see, this seems to be expected behavior. waitForPodsReady is a feature that observes not-ready or unschedulable pods and then requeues the workload in Kueue.

If you want to guarantee that the workers are scheduled first, we can use launcherCreationPolicy:

// launcherCreationPolicy if WaitForWorkersReady, the launcher is created only after all workers are in Ready state. Defaults to AtStartup.
// +kubebuilder:validation:Enum:AtStartup;WaitForWorkersReady
// +kubebuilder:default:=AtStartup
LauncherCreationPolicy LauncherCreationPolicy `json:"launcherCreationPolicy,omitempty"`
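
For reference, this would mean setting launcherCreationPolicy: WaitForWorkersReady in the MPIJob spec. In Go, against the v2beta1 API, a sketch (newWaitForWorkersReadyJob is a made-up helper; it assumes the WaitForWorkersReady constant follows the AtStartup naming seen in the debug log later in this thread):

package sketch

import (
	kubeflow "github.com/kubeflow/mpi-operator/pkg/apis/kubeflow/v2beta1"
)

// newWaitForWorkersReadyJob builds an MPIJob whose launcher is created only
// once all worker pods are Ready.
func newWaitForWorkersReadyJob(name string) *kubeflow.MPIJob {
	job := &kubeflow.MPIJob{}
	job.Name = name
	job.Spec.LauncherCreationPolicy = kubeflow.LauncherCreationPolicyWaitForWorkersReady
	return job
}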

If I am missing anything, please let me know.

@ttakahashi21 commented Mar 14, 2025:

@tenzen-y

If you want to guarantee to schedule Worker, first, we can use launcherCreationPolicy:

Thank you for your comment. I've tested with launcherCreationPolicy set to WaitForWorkersReady, but it doesn't work as expected when Kueue is enabled and the nominalQuota is greater than the actually allocatable GPUs.

#  kueue     Condition of GPUs            launcherCreationPolicy (MPIJob)  Behavior
1  disabled  -                            WaitForWorkersReady              Works as expected
2  enabled   NominalQuota == Allocatable  WaitForWorkersReady              Works as expected
3  enabled   NominalQuota > Allocatable   WaitForWorkersReady              Does not work as expected

Note that the result I previously shared was tested under the conditions below, as you pointed out.

#  kueue    Condition of GPUs            launcherCreationPolicy (MPIJob)  Behavior
0  enabled  NominalQuota > Allocatable   AtStartup (default)              Works as expected

case1

When an MPIJob is executed that exceeds the nominalQuota of the GPU resource, the GPU resource is insufficient, so the workers stay Pending. Therefore, if WaitForWorkersReady is working for the MPIJob, the launcher is expected not to run. In this case, the launcher does not run.

  • The WaitForWorkersReady feature works fine when Kueue is not used.
# kubectl get mpijob,workload,pod
NAME                                             AGE
mpijob.kubeflow.org/tensorflow-benchmarks-job1   12s
mpijob.kubeflow.org/tensorflow-benchmarks-job2   12s

NAME                                            READY   STATUS    RESTARTS   AGE
pod/tensorflow-benchmarks-job1-launcher-k9r6m   1/1     Running   0          10s
pod/tensorflow-benchmarks-job1-worker-0         1/1     Running   0          12s
pod/tensorflow-benchmarks-job1-worker-1         1/1     Running   0          12s
pod/tensorflow-benchmarks-job2-worker-0         0/1     Pending   0          12s
pod/tensorflow-benchmarks-job2-worker-1         0/1     Pending   0          12s
# kubectl describe mpijob.kubeflow.org/tensorflow-benchmarks-job1
Name:         tensorflow-benchmarks-job1
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  kubeflow.org/v2beta1
Kind:         MPIJob
Metadata:
  Creation Timestamp:  2025-03-12T20:54:48Z
  Generation:          1
  Resource Version:    2126
  UID:                 54b7c0a2-2bb7-40b4-9cb3-92991bf15c5a
Spec:
  Launcher Creation Policy:  WaitForWorkersReady
  Mpi Implementation:        OpenMPI
  Mpi Replica Specs:
    Launcher:
      Replicas:  1
      Template:
        Metadata:
        Spec:
          Containers:
            Command:
              mpirun
              --allow-run-as-root
              -np
              2
              -bind-to
              none
              -map-by
              slot
              -x
              NCCL_DEBUG=INFO
              -x
              LD_LIBRARY_PATH
              -x
              PATH
              -mca
              pml
              ob1
              -mca
              btl
              ^openib
              python
              scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
              --model=resnet101
              --batch_size=64
              --variable_update=horovod
            Image:  mpioperator/tensorflow-benchmarks:latest
            Name:   tensorflow-benchmarks-job1
            Resources:
    Worker:
      Replicas:  2
      Template:
        Metadata:
        Spec:
          Containers:
            Image:  mpioperator/tensorflow-benchmarks:latest
            Name:   tensorflow-benchmarks-job1
            Resources:
              Limits:
                nvidia.com/gpu:  1
  Run Launcher As Worker:        false
  Run Policy:
    Clean Pod Policy:   Running
    Suspend:            false
  Slots Per Worker:     1
  Ssh Auth Mount Path:  /root/.ssh
Status:
  Conditions:
    Last Transition Time:  2025-03-12T20:54:48Z
    Last Update Time:      2025-03-12T20:54:48Z
    Message:               MPIJob default/tensorflow-benchmarks-job1 is created.
    Reason:                MPIJobCreated
    Status:                True
    Type:                  Created
    Last Transition Time:  2025-03-12T20:54:52Z
    Last Update Time:      2025-03-12T20:54:52Z
    Message:               MPIJob default/tensorflow-benchmarks-job1 is running.
    Reason:                MPIJobRunning
    Status:                True
    Type:                  Running
  Replica Statuses:
    Launcher:
      Active:  1
    Worker:
      Active:  2
  Start Time:  2025-03-12T20:54:48Z
Events:
  Type    Reason         Age                From                Message
  ----    ------         ----               ----                -------
  Normal  MPIJobCreated  41s                mpi-job-controller  MPIJob default/tensorflow-benchmarks-job1 is created.
  Normal  MPIJobRunning  36s (x3 over 37s)  mpi-job-controller  MPIJob default/tensorflow-benchmarks-job1 is running
# kubectl describe mpijob.kubeflow.org/tensorflow-benchmarks-job2
Name:         tensorflow-benchmarks-job2
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  kubeflow.org/v2beta1
Kind:         MPIJob
Metadata:
  Creation Timestamp:  2025-03-12T20:54:48Z
  Generation:          1
  Resource Version:    2079
  UID:                 0e11058e-32db-42c3-bdc5-60743a9087a0
Spec:
  Launcher Creation Policy:  WaitForWorkersReady
  Mpi Implementation:        OpenMPI
  Mpi Replica Specs:
    Launcher:
      Replicas:  1
      Template:
        Metadata:
        Spec:
          Containers:
            Command:
              mpirun
              --allow-run-as-root
              -np
              2
              -bind-to
              none
              -map-by
              slot
              -x
              NCCL_DEBUG=INFO
              -x
              LD_LIBRARY_PATH
              -x
              PATH
              -mca
              pml
              ob1
              -mca
              btl
              ^openib
              python
              scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
              --model=resnet101
              --batch_size=64
              --variable_update=horovod
            Image:  mpioperator/tensorflow-benchmarks:latest
            Name:   tensorflow-benchmarks-job2
            Resources:
    Worker:
      Replicas:  2
      Template:
        Metadata:
        Spec:
          Containers:
            Image:  mpioperator/tensorflow-benchmarks:latest
            Name:   tensorflow-benchmarks-job2
            Resources:
              Limits:
                nvidia.com/gpu:  1
  Run Launcher As Worker:        false
  Run Policy:
    Clean Pod Policy:   Running
    Suspend:            false
  Slots Per Worker:     1
  Ssh Auth Mount Path:  /root/.ssh
Status:
  Conditions:
    Last Transition Time:  2025-03-12T20:54:48Z
    Last Update Time:      2025-03-12T20:54:48Z
    Message:               MPIJob default/tensorflow-benchmarks-job2 is created.
    Reason:                MPIJobCreated
    Status:                True
    Type:                  Created
  Replica Statuses:
    Worker:
  Start Time:  2025-03-12T20:54:48Z
Events:
  Type    Reason         Age   From                Message
  ----    ------         ----  ----                -------
  Normal  MPIJobCreated  109s  mpi-job-controller  MPIJob default/tensorflow-benchmarks-job2 is created.

case2

When an MPIJob is executed that exceeds the nominalQuota of the GPU resource, the spec.runPolicy.suspend parameter of the MPIJob changes from false to true. In this case, neither the launcher nor the workers of the MPIJob run.


case3

When an MPIJob is executed that exceeds the nominalQuota of the GPU resource, the spec.runPolicy.suspend parameter of the MPIJob does not change from false to true.
In this case, the MPIJob tries to run, but the GPU resource is insufficient, so the workers stay Pending. Therefore, if WaitForWorkersReady is working for the MPIJob, the launcher is expected not to run. However, the launcher is running.

# kubectl get mpijob,workload,pod
NAME                                             AGE
mpijob.kubeflow.org/tensorflow-benchmarks-job1   14s
mpijob.kubeflow.org/tensorflow-benchmarks-job2   14s

NAME                                                              QUEUE        RESERVED IN     ADMITTED   FINISHED   AGE
workload.kueue.x-k8s.io/mpijob-tensorflow-benchmarks-job1-11dc3   user-queue   cluster-queue   True                  14s
workload.kueue.x-k8s.io/mpijob-tensorflow-benchmarks-job2-64330   user-queue   cluster-queue   True                  14s

NAME                                            READY   STATUS    RESTARTS   AGE
pod/tensorflow-benchmarks-job1-launcher-mfj4b   1/1     Running   0          13s
pod/tensorflow-benchmarks-job1-worker-0         1/1     Running   0          14s
pod/tensorflow-benchmarks-job1-worker-1         1/1     Running   0          14s
pod/tensorflow-benchmarks-job2-launcher-6b5p6   1/1     Running   0          11s # launcher is running even though the workers are pending
pod/tensorflow-benchmarks-job2-worker-0         0/1     Pending   0          11s
pod/tensorflow-benchmarks-job2-worker-1         0/1     Pending   0          11s
# kubectl describe mpijob.kubeflow.org/tensorflow-benchmarks-job1
Name:         tensorflow-benchmarks-job1
Namespace:    default
Labels:       kueue.x-k8s.io/queue-name=user-queue
Annotations:  <none>
API Version:  kubeflow.org/v2beta1
Kind:         MPIJob
Metadata:
  Creation Timestamp:  2025-03-12T20:58:36Z
  Generation:          2
  Resource Version:    2768
  UID:                 a6621976-3f98-40d4-8cff-98478fbe603b
Spec:
  Launcher Creation Policy:  WaitForWorkersReady
  Mpi Implementation:        OpenMPI
  Mpi Replica Specs:
    Launcher:
      Replicas:  1
      Template:
        Metadata:
        Spec:
          Containers:
            Command:
              mpirun
              --allow-run-as-root
              -np
              2
              -bind-to
              none
              -map-by
              slot
              -x
              NCCL_DEBUG=INFO
              -x
              LD_LIBRARY_PATH
              -x
              PATH
              -mca
              pml
              ob1
              -mca
              btl
              ^openib
              python
              scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
              --model=resnet101
              --batch_size=64
              --variable_update=horovod
            Image:  mpioperator/tensorflow-benchmarks:latest
            Name:   tensorflow-benchmarks-job1
            Resources:
    Worker:
      Replicas:  2
      Template:
        Metadata:
        Spec:
          Containers:
            Image:  mpioperator/tensorflow-benchmarks:latest
            Name:   tensorflow-benchmarks-job1
            Resources:
              Limits:
                nvidia.com/gpu:  1
  Run Launcher As Worker:        false
  Run Policy:
    Clean Pod Policy:   Running
    Suspend:            false
  Slots Per Worker:     1
  Ssh Auth Mount Path:  /root/.ssh
Status:
  Conditions:
    Last Transition Time:  2025-03-12T20:58:36Z
    Last Update Time:      2025-03-12T20:58:36Z
    Message:               MPIJob default/tensorflow-benchmarks-job1 is created.
    Reason:                MPIJobCreated
    Status:                True
    Type:                  Created
    Last Transition Time:  2025-03-12T20:58:37Z
    Last Update Time:      2025-03-12T20:58:37Z
    Message:               MPIJob resumed
    Reason:                MPIJobResumed
    Status:                False
    Type:                  Suspended
    Last Transition Time:  2025-03-12T20:58:39Z
    Last Update Time:      2025-03-12T20:58:39Z
    Message:               MPIJob default/tensorflow-benchmarks-job1 is running.
    Reason:                MPIJobRunning
    Status:                True
    Type:                  Running
  Replica Statuses:
    Launcher:
      Active:  1
    Worker:
      Active:  2
  Start Time:  2025-03-12T20:58:37Z
Events:
  Type    Reason           Age                From                                  Message
  ----    ------           ----               ----                                  -------
  Normal  MPIJobCreated    40s (x2 over 40s)  mpi-job-controller                    MPIJob default/tensorflow-benchmarks-job1 is created.
  Normal  MPIJobSuspended  40s (x2 over 40s)  mpi-job-controller                    MPIJob suspended
  Normal  CreatedWorkload  40s                kubeflow.org/mpijob-kueue-controller  Created Workload: default/mpijob-tensorflow-benchmarks-job1-11dc3
  Normal  Started          40s                kubeflow.org/mpijob-kueue-controller  Admitted by clusterQueue cluster-queue
  Normal  MPIJobResumed    39s (x2 over 39s)  mpi-job-controller                    MPIJob resumed
  Normal  MPIJobRunning    36s (x3 over 37s)  mpi-job-controller                    MPIJob default/tensorflow-benchmarks-job1 is running
# kubectl describe mpijob.kubeflow.org/tensorflow-benchmarks-job2
Name:         tensorflow-benchmarks-job2
Namespace:    default
Labels:       kueue.x-k8s.io/queue-name=user-queue
Annotations:  <none>
API Version:  kubeflow.org/v2beta1
Kind:         MPIJob
Metadata:
  Creation Timestamp:  2025-03-12T20:58:36Z
  Generation:          2
  Resource Version:    2810
  UID:                 58b33273-19b3-4e86-9152-fac3415526f7
Spec:
  Launcher Creation Policy:  WaitForWorkersReady
  Mpi Implementation:        OpenMPI
  Mpi Replica Specs:
    Launcher:
      Replicas:  1
      Template:
        Metadata:
        Spec:
          Containers:
            Command:
              mpirun
              --allow-run-as-root
              -np
              2
              -bind-to
              none
              -map-by
              slot
              -x
              NCCL_DEBUG=INFO
              -x
              LD_LIBRARY_PATH
              -x
              PATH
              -mca
              pml
              ob1
              -mca
              btl
              ^openib
              python
              scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
              --model=resnet101
              --batch_size=64
              --variable_update=horovod
            Image:  mpioperator/tensorflow-benchmarks:latest
            Name:   tensorflow-benchmarks-job2
            Resources:
    Worker:
      Replicas:  2
      Template:
        Metadata:
        Spec:
          Containers:
            Image:  mpioperator/tensorflow-benchmarks:latest
            Name:   tensorflow-benchmarks-job2
            Resources:
              Limits:
                nvidia.com/gpu:  1
  Run Launcher As Worker:        false
  Run Policy:
    Clean Pod Policy:   Running
    Suspend:            false
  Slots Per Worker:     1
  Ssh Auth Mount Path:  /root/.ssh
Status:
  Conditions:
    Last Transition Time:  2025-03-12T20:58:36Z
    Last Update Time:      2025-03-12T20:58:36Z
    Message:               MPIJob default/tensorflow-benchmarks-job2 is created.
    Reason:                MPIJobCreated
    Status:                True
    Type:                  Created
    Last Transition Time:  2025-03-12T20:58:37Z
    Last Update Time:      2025-03-12T20:58:37Z
    Message:               MPIJob default/tensorflow-benchmarks-job2 is suspended.
    Reason:                MPIJobSuspended
    Status:                False
    Type:                  Running
    Last Transition Time:  2025-03-12T20:58:39Z
    Last Update Time:      2025-03-12T20:58:39Z
    Message:               MPIJob resumed
    Reason:                MPIJobResumed
    Status:                False
    Type:                  Suspended
  Replica Statuses:
    Launcher:
      Active:  1
    Worker:
  Start Time:  2025-03-12T20:58:39Z
Events:
  Type    Reason           Age                From                                  Message
  ----    ------           ----               ----                                  -------
  Normal  CreatedWorkload  75s                kubeflow.org/mpijob-kueue-controller  Created Workload: default/mpijob-tensorflow-benchmarks-job2-64330
  Normal  MPIJobCreated    74s (x2 over 75s)  mpi-job-controller                    MPIJob default/tensorflow-benchmarks-job2 is created.
  Normal  MPIJobSuspended  74s (x2 over 74s)  mpi-job-controller                    MPIJob suspended
  Normal  Started          72s                kubeflow.org/mpijob-kueue-controller  Admitted by clusterQueue cluster-queue
  Normal  MPIJobResumed    72s (x2 over 72s)  mpi-job-controller                    MPIJob resumed
# kubectl describe workload.kueue.x-k8s.io/mpijob-tensorflow-benchmarks-job1-11dc3
Name:         mpijob-tensorflow-benchmarks-job1-11dc3
Namespace:    default
Labels:       kueue.x-k8s.io/job-uid=a6621976-3f98-40d4-8cff-98478fbe603b
Annotations:  <none>
API Version:  kueue.x-k8s.io/v1beta1
Kind:         Workload
Metadata:
  Creation Timestamp:  2025-03-12T20:58:36Z
  Finalizers:
    kueue.x-k8s.io/resource-in-use
  Generation:  1
  Owner References:
    API Version:           kubeflow.org/v2beta1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  MPIJob
    Name:                  tensorflow-benchmarks-job1
    UID:                   a6621976-3f98-40d4-8cff-98478fbe603b
  Resource Version:        2770
  UID:                     cdb1c18a-144d-456f-be18-3073b6bd59b4
Spec:
  Active:  true
  Pod Sets:
    Count:  1
    Name:   launcher
    Template:
      Metadata:
      Spec:
        Containers:
          Command:
            mpirun
            --allow-run-as-root
            -np
            2
            -bind-to
            none
            -map-by
            slot
            -x
            NCCL_DEBUG=INFO
            -x
            LD_LIBRARY_PATH
            -x
            PATH
            -mca
            pml
            ob1
            -mca
            btl
            ^openib
            python
            scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
            --model=resnet101
            --batch_size=64
            --variable_update=horovod
          Image:  mpioperator/tensorflow-benchmarks:latest
          Name:   tensorflow-benchmarks-job1
          Resources:
    Count:  2
    Name:   worker
    Template:
      Metadata:
      Spec:
        Containers:
          Image:  mpioperator/tensorflow-benchmarks:latest
          Name:   tensorflow-benchmarks-job1
          Resources:
            Limits:
              nvidia.com/gpu:  1
  Priority:                    0
  Priority Class Source:       
  Queue Name:                  user-queue
Status:
  Admission:
    Cluster Queue:  cluster-queue
    Pod Set Assignments:
      Count:  1
      Name:   launcher
      Count:  2
      Flavors:
        nvidia.com/gpu:  default-flavor
      Name:              worker
      Resource Usage:
        nvidia.com/gpu:  2
  Conditions:
    Last Transition Time:  2025-03-12T20:58:36Z
    Message:               Quota reserved in ClusterQueue cluster-queue
    Observed Generation:   1
    Reason:                QuotaReserved
    Status:                True
    Type:                  QuotaReserved
    Last Transition Time:  2025-03-12T20:58:36Z
    Message:               The workload is admitted
    Observed Generation:   1
    Reason:                Admitted
    Status:                True
    Type:                  Admitted
    Last Transition Time:  2025-03-12T20:58:39Z
    Message:               All pods were ready or succeeded since the workload admission
    Observed Generation:   1
    Reason:                PodsReady
    Status:                True
    Type:                  PodsReady
Events:
  Type    Reason         Age   From             Message
  ----    ------         ----  ----             -------
  Normal  QuotaReserved  2m7s  kueue-admission  Quota reserved in ClusterQueue cluster-queue, wait time since queued was 0s
  Normal  Admitted       2m7s  kueue-admission  Admitted by ClusterQueue cluster-queue, wait time since reservation was 0s
# kubectl describe workload.kueue.x-k8s.io/mpijob-tensorflow-benchmarks-job2-64330
Name:         mpijob-tensorflow-benchmarks-job2-64330
Namespace:    default
Labels:       kueue.x-k8s.io/job-uid=58b33273-19b3-4e86-9152-fac3415526f7
Annotations:  <none>
API Version:  kueue.x-k8s.io/v1beta1
Kind:         Workload
Metadata:
  Creation Timestamp:  2025-03-12T20:58:36Z
  Finalizers:
    kueue.x-k8s.io/resource-in-use
  Generation:  1
  Owner References:
    API Version:           kubeflow.org/v2beta1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  MPIJob
    Name:                  tensorflow-benchmarks-job2
    UID:                   58b33273-19b3-4e86-9152-fac3415526f7
  Resource Version:        2772
  UID:                     8fb74084-927a-4abc-90c8-16efb5c1515d
Spec:
  Active:  true
  Pod Sets:
    Count:  1
    Name:   launcher
    Template:
      Metadata:
      Spec:
        Containers:
          Command:
            mpirun
            --allow-run-as-root
            -np
            2
            -bind-to
            none
            -map-by
            slot
            -x
            NCCL_DEBUG=INFO
            -x
            LD_LIBRARY_PATH
            -x
            PATH
            -mca
            pml
            ob1
            -mca
            btl
            ^openib
            python
            scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
            --model=resnet101
            --batch_size=64
            --variable_update=horovod
          Image:  mpioperator/tensorflow-benchmarks:latest
          Name:   tensorflow-benchmarks-job2
          Resources:
    Count:  2
    Name:   worker
    Template:
      Metadata:
      Spec:
        Containers:
          Image:  mpioperator/tensorflow-benchmarks:latest
          Name:   tensorflow-benchmarks-job2
          Resources:
            Limits:
              nvidia.com/gpu:  1
  Priority:                    0
  Priority Class Source:       
  Queue Name:                  user-queue
Status:
  Admission:
    Cluster Queue:  cluster-queue
    Pod Set Assignments:
      Count:  1
      Name:   launcher
      Count:  2
      Flavors:
        nvidia.com/gpu:  default-flavor
      Name:              worker
      Resource Usage:
        nvidia.com/gpu:  2
  Conditions:
    Last Transition Time:  2025-03-12T20:58:39Z
    Message:               Quota reserved in ClusterQueue cluster-queue
    Observed Generation:   1
    Reason:                QuotaReserved
    Status:                True
    Type:                  QuotaReserved
    Last Transition Time:  2025-03-12T20:58:36Z
    Message:               Not all pods are ready or succeeded
    Observed Generation:   1
    Reason:                PodsReady
    Status:                False
    Type:                  PodsReady
    Last Transition Time:  2025-03-12T20:58:39Z
    Message:               The workload is admitted
    Observed Generation:   1
    Reason:                Admitted
    Status:                True
    Type:                  Admitted
Events:
  Type    Reason         Age    From             Message
  ----    ------         ----   ----             -------
  Normal  QuotaReserved  2m47s  kueue-admission  Quota reserved in ClusterQueue cluster-queue, wait time since queued was 4s
  Normal  Admitted       2m47s  kueue-admission  Admitted by ClusterQueue cluster-queue, wait time since reservation was 0s

@ttakahashi21 commented:

@tenzen-y
Additional information

https://github.com/ttakahashi21/mpi-operator/blob/dev-takahashi/pkg/controller/mpi_job_controller.go#L652-L686
I wrote the debugging code above and confirmed the following.
When using Kueue, is it expected that the initial value of mpiJob.Spec.RunPolicy.Suspend will be true?

case1

  • The default value of mpiJob.Spec.RunPolicy.Suspend for Job1 and Job2 is false, so !isMPIJobSuspended(mpiJob) is true for both jobs. This creates or gets the workers. Since the launcher has not yet been created, the condition launcher == nil is true for Job1 and Job2. For Job1, the condition c.countReadyWorkerPods(worker) == len(worker) becomes true and the launcher for Job1 is created. For Job2, neither LauncherCreationPolicy == kubeflow.LauncherCreationPolicyAtStartup nor c.countReadyWorkerPods(worker) == len(worker) is true, so the launcher for Job2 is not created.
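
Distilled, the condition chain being traced looks roughly like this (a hypothetical reconstruction based on the snippet quoted at lines +661 to +663, not verbatim controller code):

package sketch

import (
	kubeflow "github.com/kubeflow/mpi-operator/pkg/apis/kubeflow/v2beta1"
)

// shouldCreateLauncher mirrors the checks traced in the debug log: the
// launcher is created only if it does not exist yet, the MPIJob is not
// suspended, and either the creation policy is AtStartup or every worker
// pod is Ready.
func shouldCreateLauncher(launcherExists, suspended bool,
	policy kubeflow.LauncherCreationPolicy, readyWorkers, totalWorkers int) bool {
	if launcherExists || suspended {
		return false
	}
	return policy == kubeflow.LauncherCreationPolicyAtStartup ||
		readyWorkers == totalWorkers
}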

The details of the debug log are as follows:

mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - c.countReadyWorkerPods(worker) : 0 - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - len(worker) : 2 - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - c.countReadyWorkerPods(worker) : 0 - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - len(worker) : 2 - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - c.countReadyWorkerPods(worker) : 0 - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - len(worker) : 2 - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - c.countReadyWorkerPods(worker) : 0 - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - len(worker) : 2 - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - c.countReadyWorkerPods(worker) : 0 - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - len(worker) : 2 - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - c.countReadyWorkerPods(worker) : 0 - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - len(worker) : 2 - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - c.countReadyWorkerPods(worker) : 0 - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - len(worker) : 2 - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - c.countReadyWorkerPods(worker) : 0 - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - len(worker) : 2 - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - c.countReadyWorkerPods(worker) : 1 - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - len(worker) : 2 - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - c.countReadyWorkerPods(worker) : 1 - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - len(worker) : 2 - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - c.countReadyWorkerPods(worker) : 2 - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - len(worker) : 2 - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.LauncherCreationPolicy == kubeflow.LauncherCreationPolicyAtStartup || c.countReadyWorkerPods(checkworker) == len(checkworker): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - deploy launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi 

case3

When the MPIJobs are first created, mpiJob.Spec.RunPolicy.Suspend is true for both Job1 and Job2, so "!isMPIJobSuspended(mpiJob)" evaluates to false and the controller neither creates nor fetches any workers. Since no launcher has been created yet, the condition "launcher == nil" is true for both jobs. Even so, the condition "c.countReadyWorkerPods(worker) == len(worker)" evaluates to true for Job1 and Job2, and their launchers are created anyway: because the worker information was never fetched, the worker list is empty, so "c.countReadyWorkerPods(worker)" and "len(worker)" are both 0 and the comparison holds trivially.
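
To make the failure mode concrete, here is a minimal, self-contained Go sketch of that trivially-true comparison. The countReadyWorkerPods name is taken from the log lines above; the []bool worker representation and the rest of the scaffolding are assumptions for illustration, not the controller's actual code.

package main

import "fmt"

// countReadyWorkerPods mirrors the controller helper of the same name
// seen in the log; here a worker is just a bool marking readiness.
func countReadyWorkerPods(workers []bool) int {
	ready := 0
	for _, r := range workers {
		if r {
			ready++
		}
	}
	return ready
}

func main() {
	// The MPIJob starts suspended, so the controller never creates the
	// worker pods and the fetched worker list is empty.
	var workers []bool

	// The WaitForWorkersReady check from the log: with zero workers,
	// 0 == 0 holds, the condition passes trivially, and the launcher
	// is deployed even though the job is suspended.
	if countReadyWorkerPods(workers) == len(workers) {
		fmt.Println("deploy launcher (unintended while suspended)")
	}
}

Guarding launcher creation on the suspension state (or requiring a non-empty worker list) breaks this trivial pass, which is what this change proposes.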

The details of the debug log are as follows:

mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - c.countReadyWorkerPods(worker) : 0 - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - len(worker) : 0 - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.LauncherCreationPolicy == kubeflow.LauncherCreationPolicyAtStartup || c.countReadyWorkerPods(checkworker) == len(checkworker): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - deploy launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - c.countReadyWorkerPods(worker) : 0 - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - len(worker) : 0 - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.LauncherCreationPolicy == kubeflow.LauncherCreationPolicyAtStartup || c.countReadyWorkerPods(checkworker) == len(checkworker): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - deploy launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi 


In my opinion, the modification from #617 is necessary.

@mimowo
Contributor

mimowo commented Nov 4, 2024

@GonzaloSaez please fix the DCO

@GonzaloSaez force-pushed the g/fix_job_launch_wait_for_pods_ready branch from 802bd49 to 804cfd8 on November 5, 2024 at 06:52
@mimowo
Contributor

mimowo commented Nov 8, 2024

For context, linking it back to the related Kueue issue: kubernetes-sigs/kueue#3400 and the slack discussion https://kubernetes.slack.com/archives/C032ZE66A2X/p1730369507818399

@GonzaloSaez force-pushed the g/fix_job_launch_wait_for_pods_ready branch from 804cfd8 to c1ea13d on January 10, 2025 at 22:58
@GonzaloSaez force-pushed the g/fix_job_launch_wait_for_pods_ready branch from c1ea13d to 50abbdf on January 10, 2025 at 22:59