Do not create the launcher job if the job starts suspended #670
base: master
Conversation
// If the job is suspended, the list of worker pods will be incorrect. We also do
// not want to start the launcher job if the MPIJob starts suspended.
if launcher == nil && !isMPIJobSuspended(mpiJob) {
This raises the question of what should be done when unsuspending the launcher job in case kueue has decided to change the NodeSelector. Should we instead recreate the job, since NodeSelector is immutable?
Yeah, I think this question does not have a straightforward answer. I can see at least two possible approaches:
- as in JobSet: update the NodeSelector field in the pod template when resuming the Job
- recreate the launcher/worker Jobs; this can probably be achieved easily by deleting the Jobs when the MPIJob is suspended
I'm ok with whichever of those is simpler to implement. Any opinion @tenzen-y ?
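For illustration only, a minimal Go sketch of the first option (updating the NodeSelector in the pod templates when resuming, as JobSet does). The helper name and the place it would be called from are assumptions, not code from this PR:
package sketch

import corev1 "k8s.io/api/core/v1"

// applyNodeSelectors merges the node selectors injected by Kueue on admission
// into a pod template before the MPIJob is resumed. Existing keys are
// overwritten so the template matches the assigned flavor. (Hypothetical helper.)
func applyNodeSelectors(tpl *corev1.PodTemplateSpec, injected map[string]string) {
	if tpl.Spec.NodeSelector == nil {
		tpl.Spec.NodeSelector = make(map[string]string, len(injected))
	}
	for k, v := range injected {
		tpl.Spec.NodeSelector[k] = v
	}
}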
In any case I think it is safe to decouple the fixes.
Sorry for the delayed response. IMO, I would go with the JobSet approach (update the NodeSelector field in the pod template when resuming the Job) instead of the current solution, because I want to align with JobSet's behavior: JobSet creates the Job even while it is suspended.
That's already being done in theory by kueue with RunWithPodSetsInfo, right? I think the issue here is that we create the job in non-suspended mode even if the MPIJob is suspended. This results in pods being scheduled on nodes and then removed because the controller suspends the job later on.
@tenzen-y
I think it reproduces when GPUs are used in areas not managed by kueue. The following procedure attaches two GPUs to a node and sets the ClusterQueue quota to twice the number of GPUs actually allocatable in the cluster. Two jobs that each use two GPUs are then executed simultaneously. In this case, the launcher runs even though the workers are pending.
-
Create Kind Cluster and set label for fake-gpu-operator
kind create cluster
kubectl label node kind-control-plane run.ai/simulated-gpu-node-pool=default
-
Install fake gpu operator
helm repo add fake-gpu-operator https://fake-gpu-operator.storage.googleapis.com
helm repo update
helm upgrade -i gpu-operator fake-gpu-operator/fake-gpu-operator --namespace gpu-operator --create-namespace
- Check that the fake GPU is shown by running kubectl get all -n gpu-operator and kubectl describe node | grep "nvidia.com/gpu:".
-
Install mpi-operator
kubectl apply --server-side -f https://raw.githubusercontent.com/kubeflow/mpi-operator/master/deploy/v2beta1/mpi-operator.yaml
- Check that mpi-operator is deployed properly by running kubectl get all -n mpi-operator.
-
Install kueue with waitForPodsReady enabled
Get kueue manifest and enable waitForPodsReady.
wget https://github.com/kubernetes-sigs/kueue/releases/download/v0.10.2/manifests.yaml
sed -i '/#waitForPodsReady:/a \ waitForPodsReady:\n enable: true\n timeout: 5m\n blockAdmission: true\n requeuingStrategy:\n timestamp: Eviction\n backoffLimitCount: 5\n backoffBaseSeconds: 60\n backoffMaxSeconds: 3600' manifests.yaml
Then, apply the configuration by:
kubectl apply --server-side -f manifests.yaml
- Check that kueue is deployed properly by running kubectl get all -n kueue-system.
-
Prepare minimal kueue setup with gpu enabled
First, check the amount of allocatable GPU in your cluster.
TOTAL_ALLOCATABLE=$(kubectl get node --selector='run.ai/simulated-gpu-node-pool=default,node-role.kubernetes.io/control-plane' -o jsonpath='{range .items[*]}{.status.allocatable.nvidia\.com\/gpu}{"\n"}{end}' | numfmt --from=auto | awk '{s+=$1} END {print s}')
echo $TOTAL_ALLOCATABLE
In our case this outputs 2.
-
Configure ClusterQueue quota
We configure the GPU flavor by doubling the total GPU allocatable in our cluster, in order to simulate issues with provisioning.
Execute the following command to create the cluster queue configuration as single-clusterqueue-setup-gpu4.yaml:
cat <<EOF >> single-clusterqueue-setup-gpu4.yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "default-flavor"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "cluster-queue"
spec:
  namespaceSelector: {} # match all.
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: "default-flavor"
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 4 # double the value of allocatable GPU in the cluster
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: "default"
  name: "user-queue"
spec:
  clusterQueue: "cluster-queue"
EOF
Then, apply the configuration by:
kubectl apply -f single-clusterqueue-setup-gpu4.yaml
-
Verify that an MPIJob with 2 replicas works (2 GPUs are available on the node)
Get tensorflow-benchmarks.yaml
wget https://raw.githubusercontent.com/kubeflow/mpi-operator/refs/heads/master/examples/v2beta1/tensorflow-benchmarks/tensorflow-benchmarks.yaml
-
Run with waitForPodsReady enabled
- Create start.sh script
cat <<EOF >> start.sh
sed '/^metadata:/a \ labels:\n kueue.x-k8s.io/queue-name: user-queue' tensorflow-benchmarks.yaml | sed 's/name: tensorflow-benchmarks/name: tensorflow-benchmarks-job1/g' > /tmp/tensorflow-benchmarks-job1.yaml
sed '/^metadata:/a \ labels:\n kueue.x-k8s.io/queue-name: user-queue' tensorflow-benchmarks.yaml | sed 's/name: tensorflow-benchmarks/name: tensorflow-benchmarks-job2/g' > /tmp/tensorflow-benchmarks-job2.yaml
kubectl create -f /tmp/tensorflow-benchmarks-job1.yaml
kubectl create -f /tmp/tensorflow-benchmarks-job2.yaml
EOF
chmod +x start.sh
- Run the start.sh script
./start.sh
-
Monitor the progress
# kubectl get mpijob,workload,pod
NAME AGE
mpijob.kubeflow.org/tensorflow-benchmarks-job1 8s
mpijob.kubeflow.org/tensorflow-benchmarks-job2 8s
NAME QUEUE RESERVED IN ADMITTED FINISHED AGE
workload.kueue.x-k8s.io/mpijob-tensorflow-benchmarks-job1-73f4b user-queue cluster-queue True 8s
workload.kueue.x-k8s.io/mpijob-tensorflow-benchmarks-job2-8caef user-queue cluster-queue True 8s
NAME READY STATUS RESTARTS AGE
pod/tensorflow-benchmarks-job1-launcher-zx4ck 1/1 Running 0 7s
pod/tensorflow-benchmarks-job1-worker-0 1/1 Running 0 8s
pod/tensorflow-benchmarks-job1-worker-1 1/1 Running 0 8s
pod/tensorflow-benchmarks-job2-launcher-zch7j 1/1 Running 0 4s #launcher runs even though the worker is pending
pod/tensorflow-benchmarks-job2-worker-0 0/1 Pending 0 5s
pod/tensorflow-benchmarks-job2-worker-1 0/1 Pending 0 4s
# kubectl describe workload.kueue.x-k8s.io/mpijob-tensorflow-benchmarks-job2-8caef
Name: mpijob-tensorflow-benchmarks-job2-8caef
Namespace: default
Labels: kueue.x-k8s.io/job-uid=a2ad4da7-eb57-477d-a9bb-1142d8d847dc
Annotations: <none>
API Version: kueue.x-k8s.io/v1beta1
Kind: Workload
Metadata:
Creation Timestamp: 2025-03-07T06:59:46Z
Finalizers:
kueue.x-k8s.io/resource-in-use
Generation: 1
Owner References:
API Version: kubeflow.org/v2beta1
Block Owner Deletion: true
Controller: true
Kind: MPIJob
Name: tensorflow-benchmarks-job2
UID: a2ad4da7-eb57-477d-a9bb-1142d8d847dc
Resource Version: 4618
UID: 34379ab2-1539-4f61-b69a-fcdad7768478
Spec:
Active: true
Pod Sets:
Count: 1
Name: launcher
Template:
Metadata:
Spec:
Containers:
Command:
mpirun
--allow-run-as-root
-np
2
-bind-to
none
-map-by
slot
-x
NCCL_DEBUG=INFO
-x
LD_LIBRARY_PATH
-x
PATH
-mca
pml
ob1
-mca
btl
^openib
python
scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
--model=resnet101
--batch_size=64
--variable_update=horovod
Image: mpioperator/tensorflow-benchmarks:latest
Name: tensorflow-benchmarks-job2
Resources:
Count: 2
Name: worker
Template:
Metadata:
Spec:
Containers:
Image: mpioperator/tensorflow-benchmarks:latest
Name: tensorflow-benchmarks-job2
Resources:
Limits:
nvidia.com/gpu: 1
Priority: 0
Priority Class Source:
Queue Name: user-queue
Status:
Admission:
Cluster Queue: cluster-queue
Pod Set Assignments:
Count: 1
Name: launcher
Count: 2
Flavors:
nvidia.com/gpu: default-flavor
Name: worker
Resource Usage:
nvidia.com/gpu: 2
Conditions:
Last Transition Time: 2025-03-07T06:59:49Z
Message: Quota reserved in ClusterQueue cluster-queue
Observed Generation: 1
Reason: QuotaReserved
Status: True
Type: QuotaReserved
Last Transition Time: 2025-03-07T06:59:47Z
Message: Not all pods are ready or succeeded
Observed Generation: 1
Reason: PodsReady
Status: False
Type: PodsReady
Last Transition Time: 2025-03-07T06:59:49Z
Message: The workload is admitted
Observed Generation: 1
Reason: Admitted
Status: True
Type: Admitted
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal QuotaReserved 28s kueue-admission Quota reserved in ClusterQueue cluster-queue, wait time since queued was 4s
Normal Admitted 28s kueue-admission Admitted by ClusterQueue cluster-queue, wait time since reservation was 0s
# kubectl describe mpijob.kubeflow.org/tensorflow-benchmarks-job2
Name: tensorflow-benchmarks-job2
Namespace: default
Labels: kueue.x-k8s.io/queue-name=user-queue
Annotations: <none>
API Version: kubeflow.org/v2beta1
Kind: MPIJob
Metadata:
Creation Timestamp: 2025-03-07T06:59:46Z
Generation: 2
Resource Version: 4657
UID: a2ad4da7-eb57-477d-a9bb-1142d8d847dc
Spec:
Launcher Creation Policy: AtStartup
Mpi Implementation: OpenMPI
Mpi Replica Specs:
Launcher:
Replicas: 1
Template:
Metadata:
Spec:
Containers:
Command:
mpirun
--allow-run-as-root
-np
2
-bind-to
none
-map-by
slot
-x
NCCL_DEBUG=INFO
-x
LD_LIBRARY_PATH
-x
PATH
-mca
pml
ob1
-mca
btl
^openib
python
scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
--model=resnet101
--batch_size=64
--variable_update=horovod
Image: mpioperator/tensorflow-benchmarks:latest
Name: tensorflow-benchmarks-job2
Resources:
Worker:
Replicas: 2
Template:
Metadata:
Spec:
Containers:
Image: mpioperator/tensorflow-benchmarks:latest
Name: tensorflow-benchmarks-job2
Resources:
Limits:
nvidia.com/gpu: 1
Run Launcher As Worker: false
Run Policy:
Clean Pod Policy: Running
Suspend: false
Slots Per Worker: 1
Ssh Auth Mount Path: /root/.ssh
Status:
Conditions:
Last Transition Time: 2025-03-07T06:59:47Z
Last Update Time: 2025-03-07T06:59:47Z
Message: MPIJob default/tensorflow-benchmarks-job2 is created.
Reason: MPIJobCreated
Status: True
Type: Created
Last Transition Time: 2025-03-07T06:59:48Z
Last Update Time: 2025-03-07T06:59:48Z
Message: MPIJob default/tensorflow-benchmarks-job2 is suspended.
Reason: MPIJobSuspended
Status: False
Type: Running
Last Transition Time: 2025-03-07T06:59:50Z
Last Update Time: 2025-03-07T06:59:50Z
Message: MPIJob resumed
Reason: MPIJobResumed
Status: False
Type: Suspended
Replica Statuses:
Launcher:
Active: 1
Worker:
Start Time: 2025-03-07T06:59:50Z
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal CreatedWorkload 99s kubeflow.org/mpijob-kueue-controller Created Workload: default/mpijob-tensorflow-benchmarks-job2-8caef
Normal MPIJobCreated 98s (x2 over 99s) mpi-job-controller MPIJob default/tensorflow-benchmarks-job2 is created.
Normal MPIJobSuspended 98s (x2 over 98s) mpi-job-controller MPIJob suspended
Normal Started 97s kubeflow.org/mpijob-kueue-controller Admitted by clusterQueue cluster-queue
Normal MPIJobResumed 96s mpi-job-controller MPIJob resumed
# kubectl describe pod/tensorflow-benchmarks-job2-worker-0
Name: tensorflow-benchmarks-job2-worker-0
Namespace: default
Priority: 0
Service Account: default
Node: <none>
Labels: training.kubeflow.org/job-name=tensorflow-benchmarks-job2
training.kubeflow.org/job-role=worker
training.kubeflow.org/operator-name=mpi-operator
training.kubeflow.org/replica-index=0
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Controlled By: MPIJob/tensorflow-benchmarks-job2
Containers:
tensorflow-benchmarks-job2:
Image: mpioperator/tensorflow-benchmarks:latest
Port: <none>
Host Port: <none>
Command:
/usr/sbin/sshd
-De
Limits:
nvidia.com/gpu: 1
Requests:
nvidia.com/gpu: 1
Environment:
K_MPI_JOB_ROLE: worker
Mounts:
/root/.ssh from ssh-auth (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4l5z8 (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
ssh-auth:
Type: Secret (a volume populated by a Secret)
SecretName: tensorflow-benchmarks-job2-ssh
Optional: false
kube-api-access-4l5z8:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 2m42s default-scheduler 0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
# kubectl get mpijob,workload,pod
NAME AGE
mpijob.kubeflow.org/tensorflow-benchmarks-job1 5m6s
mpijob.kubeflow.org/tensorflow-benchmarks-job2 5m6s
NAME QUEUE RESERVED IN ADMITTED FINISHED AGE
workload.kueue.x-k8s.io/mpijob-tensorflow-benchmarks-job1-73f4b user-queue cluster-queue True 5m6s
workload.kueue.x-k8s.io/mpijob-tensorflow-benchmarks-job2-8caef user-queue False 5m6s
NAME READY STATUS RESTARTS AGE
pod/tensorflow-benchmarks-job1-launcher-zx4ck 1/1 Running 1 (2m27s ago) 5m5s
pod/tensorflow-benchmarks-job1-worker-0 1/1 Running 0 5m6s
pod/tensorflow-benchmarks-job1-worker-1 1/1 Running 0 5m6s
- Cleanup
Clean up the jobs by:
kubectl delete -f /tmp/tensorflow-benchmarks-job1.yaml
kubectl delete -f /tmp/tensorflow-benchmarks-job2.yaml
-
Change ClusterQueue quota
Set the value of GPUs that can be allocated in the cluster to 2.
cat <<EOF >> single-clusterqueue-setup-gpu2.yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "default-flavor"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "cluster-queue"
spec:
  namespaceSelector: {} # match all.
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: "default-flavor"
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 2
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: "default"
  name: "user-queue"
spec:
  clusterQueue: "cluster-queue"
EOF
Then, apply the configuration by:
kubectl apply -f single-clusterqueue-setup-gpu2.yaml
- Run the start.sh script
./start.sh
- Monitor the progress
# kubectl get mpijob,workload,pod
NAME AGE
mpijob.kubeflow.org/tensorflow-benchmarks-job1 19s
mpijob.kubeflow.org/tensorflow-benchmarks-job2 19s
NAME QUEUE RESERVED IN ADMITTED FINISHED AGE
workload.kueue.x-k8s.io/mpijob-tensorflow-benchmarks-job1-7d852 user-queue cluster-queue True 19s
workload.kueue.x-k8s.io/mpijob-tensorflow-benchmarks-job2-c171c user-queue 19s
NAME READY STATUS RESTARTS AGE
pod/tensorflow-benchmarks-job1-launcher-qg2xj 1/1 Running 0 18s
pod/tensorflow-benchmarks-job1-worker-0 1/1 Running 0 19s
pod/tensorflow-benchmarks-job1-worker-1 1/1 Running 0 19s
# kubectl describe workload.kueue.x-k8s.io/mpijob-tensorflow-benchmarks-job2-c171c
Name: mpijob-tensorflow-benchmarks-job2-c171c
Namespace: default
Labels: kueue.x-k8s.io/job-uid=211c9316-b9ad-400a-b108-7549d40517b5
Annotations: <none>
API Version: kueue.x-k8s.io/v1beta1
Kind: Workload
Metadata:
Creation Timestamp: 2025-03-07T07:11:20Z
Finalizers:
kueue.x-k8s.io/resource-in-use
Generation: 1
Owner References:
API Version: kubeflow.org/v2beta1
Block Owner Deletion: true
Controller: true
Kind: MPIJob
Name: tensorflow-benchmarks-job2
UID: 211c9316-b9ad-400a-b108-7549d40517b5
Resource Version: 6372
UID: 7c481e50-9be5-4950-aff3-773d8e475989
Spec:
Active: true
Pod Sets:
Count: 1
Name: launcher
Template:
Metadata:
Spec:
Containers:
Command:
mpirun
--allow-run-as-root
-np
2
-bind-to
none
-map-by
slot
-x
NCCL_DEBUG=INFO
-x
LD_LIBRARY_PATH
-x
PATH
-mca
pml
ob1
-mca
btl
^openib
python
scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
--model=resnet101
--batch_size=64
--variable_update=horovod
Image: mpioperator/tensorflow-benchmarks:latest
Name: tensorflow-benchmarks-job2
Resources:
Count: 2
Name: worker
Template:
Metadata:
Spec:
Containers:
Image: mpioperator/tensorflow-benchmarks:latest
Name: tensorflow-benchmarks-job2
Resources:
Limits:
nvidia.com/gpu: 1
Priority: 0
Priority Class Source:
Queue Name: user-queue
Status:
Conditions:
Last Transition Time: 2025-03-07T07:11:20Z
Message: Not all pods are ready or succeeded
Observed Generation: 1
Reason: PodsReady
Status: False
Type: PodsReady
Last Transition Time: 2025-03-07T07:11:20Z
Message: couldn't assign flavors to pod set worker: insufficient unused quota for nvidia.com/gpu in flavor default-flavor, 2 more needed
Observed Generation: 1
Reason: Pending
Status: False
Type: QuotaReserved
Resource Requests:
Name: launcher
Name: worker
Resources:
nvidia.com/gpu: 2
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Pending 33s (x3 over 33s) kueue-admission couldn't assign flavors to pod set worker: insufficient unused quota for nvidia.com/gpu in flavor default-flavor, 2 more needed
# kubectl describe mpijob.kubeflow.org/tensorflow-benchmarks-job2
Name: tensorflow-benchmarks-job2
Namespace: default
Labels: kueue.x-k8s.io/queue-name=user-queue
Annotations: <none>
API Version: kubeflow.org/v2beta1
Kind: MPIJob
Metadata:
Creation Timestamp: 2025-03-07T07:11:20Z
Generation: 1
Resource Version: 6406
UID: 211c9316-b9ad-400a-b108-7549d40517b5
Spec:
Launcher Creation Policy: AtStartup
Mpi Implementation: OpenMPI
Mpi Replica Specs:
Launcher:
Replicas: 1
Template:
Metadata:
Spec:
Containers:
Command:
mpirun
--allow-run-as-root
-np
2
-bind-to
none
-map-by
slot
-x
NCCL_DEBUG=INFO
-x
LD_LIBRARY_PATH
-x
PATH
-mca
pml
ob1
-mca
btl
^openib
python
scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
--model=resnet101
--batch_size=64
--variable_update=horovod
Image: mpioperator/tensorflow-benchmarks:latest
Name: tensorflow-benchmarks-job2
Resources:
Worker:
Replicas: 2
Template:
Metadata:
Spec:
Containers:
Image: mpioperator/tensorflow-benchmarks:latest
Name: tensorflow-benchmarks-job2
Resources:
Limits:
nvidia.com/gpu: 1
Run Launcher As Worker: false
Run Policy:
Clean Pod Policy: Running
Suspend: true
Slots Per Worker: 1
Ssh Auth Mount Path: /root/.ssh
Status:
Conditions:
Last Transition Time: 2025-03-07T07:11:20Z
Last Update Time: 2025-03-07T07:11:20Z
Message: MPIJob default/tensorflow-benchmarks-job2 is created.
Reason: MPIJobCreated
Status: True
Type: Created
Last Transition Time: 2025-03-07T07:11:21Z
Last Update Time: 2025-03-07T07:11:21Z
Message: MPIJob suspended
Reason: MPIJobSuspended
Status: True
Type: Suspended
Last Transition Time: 2025-03-07T07:11:21Z
Last Update Time: 2025-03-07T07:11:21Z
Message: MPIJob default/tensorflow-benchmarks-job2 is suspended.
Reason: MPIJobSuspended
Status: False
Type: Running
Replica Statuses:
Launcher:
Worker:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal MPIJobCreated 75s mpi-job-controller MPIJob default/tensorflow-benchmarks-job2 is created.
Normal CreatedWorkload 75s kubeflow.org/mpijob-kueue-controller Created Workload: default/mpijob-tensorflow-benchmarks-job2-c171c
Normal MPIJobSuspended 74s mpi-job-controller MPIJob suspended
-
Cleanup
Clean up the jobs by:
kubectl delete -f /tmp/tensorflow-benchmarks-job1.yaml
kubectl delete -f /tmp/tensorflow-benchmarks-job2.yaml
@ttakahashi21 Thank you for describing your situation. As far as I can see, this seems to be expected behavior. waitForPodsReady is a feature that observes not-ready or unschedulable pods and then requeues the workload in Kueue.
If you want to guarantee that the workers are scheduled first, you can use launcherCreationPolicy:
mpi-operator/pkg/apis/kubeflow/v2beta1/types.go
Lines 194 to 197 in 7f94988
// launcherCreationPolicy if WaitForWorkersReady, the launcher is created only after all workers are in Ready state. Defaults to AtStartup.
// +kubebuilder:validation:Enum:AtStartup;WaitForWorkersReady
// +kubebuilder:default:=AtStartup
LauncherCreationPolicy LauncherCreationPolicy `json:"launcherCreationPolicy,omitempty"`
If I am missing anything, please let me know.
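To make the semantics of that field concrete, here is a hedged Go sketch of the check it implies. The import path and the helper shape are assumptions based on the snippet above and on the condition echoed in the debug logs later in this thread; it is not the actual controller source:
package sketch

import kubeflow "github.com/kubeflow/mpi-operator/pkg/apis/kubeflow/v2beta1"

// shouldCreateLauncher mirrors the documented launcherCreationPolicy semantics:
// AtStartup creates the launcher right away, WaitForWorkersReady only once all
// worker pods report Ready. (Illustrative only; import path is assumed.)
func shouldCreateLauncher(mpiJob *kubeflow.MPIJob, readyWorkers, totalWorkers int) bool {
	if mpiJob.Spec.LauncherCreationPolicy == kubeflow.LauncherCreationPolicyAtStartup {
		return true
	}
	return readyWorkers == totalWorkers
}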
If you want to guarantee that the workers are scheduled first, you can use launcherCreationPolicy:
Thank you for your comment. I've tested with launcherCreationPolicy set to WaitForWorkersReady, but it doesn't work as expected when kueue is enabled and the nominalQuota is greater than the actual allocatable GPUs.
| # | kueue | Condition of GPUs | launcherCreationPolicy (MPIJob) | Behavior |
|---|---|---|---|---|
| 1 | disabled | - | WaitForWorkersReady | works as expected |
| 2 | enabled | NominalQuota == Allocatable | WaitForWorkersReady | works as expected |
| 3 | enabled | NominalQuota > Allocatable | WaitForWorkersReady | does not work as expected |
Note that the result I previously shared was tested under the conditions below, as you pointed out.
| # | kueue | Condition of GPUs | launcherCreationPolicy (MPIJob) | Behavior |
|---|---|---|---|---|
| 0 | enabled | NominalQuota > Allocatable | AtStartup (default) | works as expected |
case1
When MPIJobs are executed that exceed the available GPU resources, the GPU resources are insufficient, so the worker pods stay pending. Therefore, if WaitForWorkersReady is working, the launcher is expected not to run. In this case, the launcher does not run.
- WaitForWorkersReady feature works fine when Kueue is not used.
# kubectl get mpijob,workload,pod
NAME AGE
mpijob.kubeflow.org/tensorflow-benchmarks-job1 12s
mpijob.kubeflow.org/tensorflow-benchmarks-job2 12s
NAME READY STATUS RESTARTS AGE
pod/tensorflow-benchmarks-job1-launcher-k9r6m 1/1 Running 0 10s
pod/tensorflow-benchmarks-job1-worker-0 1/1 Running 0 12s
pod/tensorflow-benchmarks-job1-worker-1 1/1 Running 0 12s
pod/tensorflow-benchmarks-job2-worker-0 0/1 Pending 0 12s
pod/tensorflow-benchmarks-job2-worker-1 0/1 Pending 0 12s
# kubectl describe mpijob.kubeflow.org/tensorflow-benchmarks-job1
Name: tensorflow-benchmarks-job1
Namespace: default
Labels: <none>
Annotations: <none>
API Version: kubeflow.org/v2beta1
Kind: MPIJob
Metadata:
Creation Timestamp: 2025-03-12T20:54:48Z
Generation: 1
Resource Version: 2126
UID: 54b7c0a2-2bb7-40b4-9cb3-92991bf15c5a
Spec:
Launcher Creation Policy: WaitForWorkersReady
Mpi Implementation: OpenMPI
Mpi Replica Specs:
Launcher:
Replicas: 1
Template:
Metadata:
Spec:
Containers:
Command:
mpirun
--allow-run-as-root
-np
2
-bind-to
none
-map-by
slot
-x
NCCL_DEBUG=INFO
-x
LD_LIBRARY_PATH
-x
PATH
-mca
pml
ob1
-mca
btl
^openib
python
scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
--model=resnet101
--batch_size=64
--variable_update=horovod
Image: mpioperator/tensorflow-benchmarks:latest
Name: tensorflow-benchmarks-job1
Resources:
Worker:
Replicas: 2
Template:
Metadata:
Spec:
Containers:
Image: mpioperator/tensorflow-benchmarks:latest
Name: tensorflow-benchmarks-job1
Resources:
Limits:
nvidia.com/gpu: 1
Run Launcher As Worker: false
Run Policy:
Clean Pod Policy: Running
Suspend: false
Slots Per Worker: 1
Ssh Auth Mount Path: /root/.ssh
Status:
Conditions:
Last Transition Time: 2025-03-12T20:54:48Z
Last Update Time: 2025-03-12T20:54:48Z
Message: MPIJob default/tensorflow-benchmarks-job1 is created.
Reason: MPIJobCreated
Status: True
Type: Created
Last Transition Time: 2025-03-12T20:54:52Z
Last Update Time: 2025-03-12T20:54:52Z
Message: MPIJob default/tensorflow-benchmarks-job1 is running.
Reason: MPIJobRunning
Status: True
Type: Running
Replica Statuses:
Launcher:
Active: 1
Worker:
Active: 2
Start Time: 2025-03-12T20:54:48Z
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal MPIJobCreated 41s mpi-job-controller MPIJob default/tensorflow-benchmarks-job1 is created.
Normal MPIJobRunning 36s (x3 over 37s) mpi-job-controller MPIJob default/tensorflow-benchmarks-job1 is running
# kubectl describe mpijob.kubeflow.org/tensorflow-benchmarks-job2
Name: tensorflow-benchmarks-job2
Namespace: default
Labels: <none>
Annotations: <none>
API Version: kubeflow.org/v2beta1
Kind: MPIJob
Metadata:
Creation Timestamp: 2025-03-12T20:54:48Z
Generation: 1
Resource Version: 2079
UID: 0e11058e-32db-42c3-bdc5-60743a9087a0
Spec:
Launcher Creation Policy: WaitForWorkersReady
Mpi Implementation: OpenMPI
Mpi Replica Specs:
Launcher:
Replicas: 1
Template:
Metadata:
Spec:
Containers:
Command:
mpirun
--allow-run-as-root
-np
2
-bind-to
none
-map-by
slot
-x
NCCL_DEBUG=INFO
-x
LD_LIBRARY_PATH
-x
PATH
-mca
pml
ob1
-mca
btl
^openib
python
scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
--model=resnet101
--batch_size=64
--variable_update=horovod
Image: mpioperator/tensorflow-benchmarks:latest
Name: tensorflow-benchmarks-job2
Resources:
Worker:
Replicas: 2
Template:
Metadata:
Spec:
Containers:
Image: mpioperator/tensorflow-benchmarks:latest
Name: tensorflow-benchmarks-job2
Resources:
Limits:
nvidia.com/gpu: 1
Run Launcher As Worker: false
Run Policy:
Clean Pod Policy: Running
Suspend: false
Slots Per Worker: 1
Ssh Auth Mount Path: /root/.ssh
Status:
Conditions:
Last Transition Time: 2025-03-12T20:54:48Z
Last Update Time: 2025-03-12T20:54:48Z
Message: MPIJob default/tensorflow-benchmarks-job2 is created.
Reason: MPIJobCreated
Status: True
Type: Created
Replica Statuses:
Worker:
Start Time: 2025-03-12T20:54:48Z
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal MPIJobCreated 109s mpi-job-controller MPIJob default/tensorflow-benchmarks-job2 is created.
case2
When an MPIJob is executed that exceeds the GPU nominalQuota, the spec.runPolicy.suspend parameter of the MPIJob changes from false to true. In this case, neither the launcher nor the workers of the MPIJob run.
case3
When an MPIJob is executed that exceeds the GPU nominalQuota, the spec.runPolicy.suspend parameter of the MPIJob does not change from false to true.
In this case, the MPIJob tries to run, but GPU resources are insufficient, so the workers stay pending. Therefore, if WaitForWorkersReady is working, the launcher is expected not to run. However, the launcher is running.
# kubectl get mpijob,workload,pod
NAME AGE
mpijob.kubeflow.org/tensorflow-benchmarks-job1 14s
mpijob.kubeflow.org/tensorflow-benchmarks-job2 14s
NAME QUEUE RESERVED IN ADMITTED FINISHED AGE
workload.kueue.x-k8s.io/mpijob-tensorflow-benchmarks-job1-11dc3 user-queue cluster-queue True 14s
workload.kueue.x-k8s.io/mpijob-tensorflow-benchmarks-job2-64330 user-queue cluster-queue True 14s
NAME READY STATUS RESTARTS AGE
pod/tensorflow-benchmarks-job1-launcher-mfj4b 1/1 Running 0 13s
pod/tensorflow-benchmarks-job1-worker-0 1/1 Running 0 14s
pod/tensorflow-benchmarks-job1-worker-1 1/1 Running 0 14s
pod/tensorflow-benchmarks-job2-launcher-6b5p6 1/1 Running 0 11s #launcher is running,
pod/tensorflow-benchmarks-job2-worker-0 0/1 Pending 0 11s
pod/tensorflow-benchmarks-job2-worker-1 0/1 Pending 0 11s
# kubectl describe mpijob.kubeflow.org/tensorflow-benchmarks-job1
Name: tensorflow-benchmarks-job1
Namespace: default
Labels: kueue.x-k8s.io/queue-name=user-queue
Annotations: <none>
API Version: kubeflow.org/v2beta1
Kind: MPIJob
Metadata:
Creation Timestamp: 2025-03-12T20:58:36Z
Generation: 2
Resource Version: 2768
UID: a6621976-3f98-40d4-8cff-98478fbe603b
Spec:
Launcher Creation Policy: WaitForWorkersReady
Mpi Implementation: OpenMPI
Mpi Replica Specs:
Launcher:
Replicas: 1
Template:
Metadata:
Spec:
Containers:
Command:
mpirun
--allow-run-as-root
-np
2
-bind-to
none
-map-by
slot
-x
NCCL_DEBUG=INFO
-x
LD_LIBRARY_PATH
-x
PATH
-mca
pml
ob1
-mca
btl
^openib
python
scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
--model=resnet101
--batch_size=64
--variable_update=horovod
Image: mpioperator/tensorflow-benchmarks:latest
Name: tensorflow-benchmarks-job1
Resources:
Worker:
Replicas: 2
Template:
Metadata:
Spec:
Containers:
Image: mpioperator/tensorflow-benchmarks:latest
Name: tensorflow-benchmarks-job1
Resources:
Limits:
nvidia.com/gpu: 1
Run Launcher As Worker: false
Run Policy:
Clean Pod Policy: Running
Suspend: false
Slots Per Worker: 1
Ssh Auth Mount Path: /root/.ssh
Status:
Conditions:
Last Transition Time: 2025-03-12T20:58:36Z
Last Update Time: 2025-03-12T20:58:36Z
Message: MPIJob default/tensorflow-benchmarks-job1 is created.
Reason: MPIJobCreated
Status: True
Type: Created
Last Transition Time: 2025-03-12T20:58:37Z
Last Update Time: 2025-03-12T20:58:37Z
Message: MPIJob resumed
Reason: MPIJobResumed
Status: False
Type: Suspended
Last Transition Time: 2025-03-12T20:58:39Z
Last Update Time: 2025-03-12T20:58:39Z
Message: MPIJob default/tensorflow-benchmarks-job1 is running.
Reason: MPIJobRunning
Status: True
Type: Running
Replica Statuses:
Launcher:
Active: 1
Worker:
Active: 2
Start Time: 2025-03-12T20:58:37Z
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal MPIJobCreated 40s (x2 over 40s) mpi-job-controller MPIJob default/tensorflow-benchmarks-job1 is created.
Normal MPIJobSuspended 40s (x2 over 40s) mpi-job-controller MPIJob suspended
Normal CreatedWorkload 40s kubeflow.org/mpijob-kueue-controller Created Workload: default/mpijob-tensorflow-benchmarks-job1-11dc3
Normal Started 40s kubeflow.org/mpijob-kueue-controller Admitted by clusterQueue cluster-queue
Normal MPIJobResumed 39s (x2 over 39s) mpi-job-controller MPIJob resumed
Normal MPIJobRunning 36s (x3 over 37s) mpi-job-controller MPIJob default/tensorflow-benchmarks-job1 is running
# kubectl describe mpijob.kubeflow.org/tensorflow-benchmarks-job2
Name: tensorflow-benchmarks-job2
Namespace: default
Labels: kueue.x-k8s.io/queue-name=user-queue
Annotations: <none>
API Version: kubeflow.org/v2beta1
Kind: MPIJob
Metadata:
Creation Timestamp: 2025-03-12T20:58:36Z
Generation: 2
Resource Version: 2810
UID: 58b33273-19b3-4e86-9152-fac3415526f7
Spec:
Launcher Creation Policy: WaitForWorkersReady
Mpi Implementation: OpenMPI
Mpi Replica Specs:
Launcher:
Replicas: 1
Template:
Metadata:
Spec:
Containers:
Command:
mpirun
--allow-run-as-root
-np
2
-bind-to
none
-map-by
slot
-x
NCCL_DEBUG=INFO
-x
LD_LIBRARY_PATH
-x
PATH
-mca
pml
ob1
-mca
btl
^openib
python
scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
--model=resnet101
--batch_size=64
--variable_update=horovod
Image: mpioperator/tensorflow-benchmarks:latest
Name: tensorflow-benchmarks-job2
Resources:
Worker:
Replicas: 2
Template:
Metadata:
Spec:
Containers:
Image: mpioperator/tensorflow-benchmarks:latest
Name: tensorflow-benchmarks-job2
Resources:
Limits:
nvidia.com/gpu: 1
Run Launcher As Worker: false
Run Policy:
Clean Pod Policy: Running
Suspend: false
Slots Per Worker: 1
Ssh Auth Mount Path: /root/.ssh
Status:
Conditions:
Last Transition Time: 2025-03-12T20:58:36Z
Last Update Time: 2025-03-12T20:58:36Z
Message: MPIJob default/tensorflow-benchmarks-job2 is created.
Reason: MPIJobCreated
Status: True
Type: Created
Last Transition Time: 2025-03-12T20:58:37Z
Last Update Time: 2025-03-12T20:58:37Z
Message: MPIJob default/tensorflow-benchmarks-job2 is suspended.
Reason: MPIJobSuspended
Status: False
Type: Running
Last Transition Time: 2025-03-12T20:58:39Z
Last Update Time: 2025-03-12T20:58:39Z
Message: MPIJob resumed
Reason: MPIJobResumed
Status: False
Type: Suspended
Replica Statuses:
Launcher:
Active: 1
Worker:
Start Time: 2025-03-12T20:58:39Z
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal CreatedWorkload 75s kubeflow.org/mpijob-kueue-controller Created Workload: default/mpijob-tensorflow-benchmarks-job2-64330
Normal MPIJobCreated 74s (x2 over 75s) mpi-job-controller MPIJob default/tensorflow-benchmarks-job2 is created.
Normal MPIJobSuspended 74s (x2 over 74s) mpi-job-controller MPIJob suspended
Normal Started 72s kubeflow.org/mpijob-kueue-controller Admitted by clusterQueue cluster-queue
Normal MPIJobResumed 72s (x2 over 72s) mpi-job-controller MPIJob resumed
# kubectl describe workload.kueue.x-k8s.io/mpijob-tensorflow-benchmarks-job1-11dc3
Name: mpijob-tensorflow-benchmarks-job1-11dc3
Namespace: default
Labels: kueue.x-k8s.io/job-uid=a6621976-3f98-40d4-8cff-98478fbe603b
Annotations: <none>
API Version: kueue.x-k8s.io/v1beta1
Kind: Workload
Metadata:
Creation Timestamp: 2025-03-12T20:58:36Z
Finalizers:
kueue.x-k8s.io/resource-in-use
Generation: 1
Owner References:
API Version: kubeflow.org/v2beta1
Block Owner Deletion: true
Controller: true
Kind: MPIJob
Name: tensorflow-benchmarks-job1
UID: a6621976-3f98-40d4-8cff-98478fbe603b
Resource Version: 2770
UID: cdb1c18a-144d-456f-be18-3073b6bd59b4
Spec:
Active: true
Pod Sets:
Count: 1
Name: launcher
Template:
Metadata:
Spec:
Containers:
Command:
mpirun
--allow-run-as-root
-np
2
-bind-to
none
-map-by
slot
-x
NCCL_DEBUG=INFO
-x
LD_LIBRARY_PATH
-x
PATH
-mca
pml
ob1
-mca
btl
^openib
python
scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
--model=resnet101
--batch_size=64
--variable_update=horovod
Image: mpioperator/tensorflow-benchmarks:latest
Name: tensorflow-benchmarks-job1
Resources:
Count: 2
Name: worker
Template:
Metadata:
Spec:
Containers:
Image: mpioperator/tensorflow-benchmarks:latest
Name: tensorflow-benchmarks-job1
Resources:
Limits:
nvidia.com/gpu: 1
Priority: 0
Priority Class Source:
Queue Name: user-queue
Status:
Admission:
Cluster Queue: cluster-queue
Pod Set Assignments:
Count: 1
Name: launcher
Count: 2
Flavors:
nvidia.com/gpu: default-flavor
Name: worker
Resource Usage:
nvidia.com/gpu: 2
Conditions:
Last Transition Time: 2025-03-12T20:58:36Z
Message: Quota reserved in ClusterQueue cluster-queue
Observed Generation: 1
Reason: QuotaReserved
Status: True
Type: QuotaReserved
Last Transition Time: 2025-03-12T20:58:36Z
Message: The workload is admitted
Observed Generation: 1
Reason: Admitted
Status: True
Type: Admitted
Last Transition Time: 2025-03-12T20:58:39Z
Message: All pods were ready or succeeded since the workload admission
Observed Generation: 1
Reason: PodsReady
Status: True
Type: PodsReady
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal QuotaReserved 2m7s kueue-admission Quota reserved in ClusterQueue cluster-queue, wait time since queued was 0s
Normal Admitted 2m7s kueue-admission Admitted by ClusterQueue cluster-queue, wait time since reservation was 0s
# kubectl describe workload.kueue.x-k8s.io/mpijob-tensorflow-benchmarks-job2-64330
Name: mpijob-tensorflow-benchmarks-job2-64330
Namespace: default
Labels: kueue.x-k8s.io/job-uid=58b33273-19b3-4e86-9152-fac3415526f7
Annotations: <none>
API Version: kueue.x-k8s.io/v1beta1
Kind: Workload
Metadata:
Creation Timestamp: 2025-03-12T20:58:36Z
Finalizers:
kueue.x-k8s.io/resource-in-use
Generation: 1
Owner References:
API Version: kubeflow.org/v2beta1
Block Owner Deletion: true
Controller: true
Kind: MPIJob
Name: tensorflow-benchmarks-job2
UID: 58b33273-19b3-4e86-9152-fac3415526f7
Resource Version: 2772
UID: 8fb74084-927a-4abc-90c8-16efb5c1515d
Spec:
Active: true
Pod Sets:
Count: 1
Name: launcher
Template:
Metadata:
Spec:
Containers:
Command:
mpirun
--allow-run-as-root
-np
2
-bind-to
none
-map-by
slot
-x
NCCL_DEBUG=INFO
-x
LD_LIBRARY_PATH
-x
PATH
-mca
pml
ob1
-mca
btl
^openib
python
scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
--model=resnet101
--batch_size=64
--variable_update=horovod
Image: mpioperator/tensorflow-benchmarks:latest
Name: tensorflow-benchmarks-job2
Resources:
Count: 2
Name: worker
Template:
Metadata:
Spec:
Containers:
Image: mpioperator/tensorflow-benchmarks:latest
Name: tensorflow-benchmarks-job2
Resources:
Limits:
nvidia.com/gpu: 1
Priority: 0
Priority Class Source:
Queue Name: user-queue
Status:
Admission:
Cluster Queue: cluster-queue
Pod Set Assignments:
Count: 1
Name: launcher
Count: 2
Flavors:
nvidia.com/gpu: default-flavor
Name: worker
Resource Usage:
nvidia.com/gpu: 2
Conditions:
Last Transition Time: 2025-03-12T20:58:39Z
Message: Quota reserved in ClusterQueue cluster-queue
Observed Generation: 1
Reason: QuotaReserved
Status: True
Type: QuotaReserved
Last Transition Time: 2025-03-12T20:58:36Z
Message: Not all pods are ready or succeeded
Observed Generation: 1
Reason: PodsReady
Status: False
Type: PodsReady
Last Transition Time: 2025-03-12T20:58:39Z
Message: The workload is admitted
Observed Generation: 1
Reason: Admitted
Status: True
Type: Admitted
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal QuotaReserved 2m47s kueue-admission Quota reserved in ClusterQueue cluster-queue, wait time since queued was 4s
Normal Admitted 2m47s kueue-admission Admitted by ClusterQueue cluster-queue, wait time since reservation was 0s
@tenzen-y
Additional information
https://github.com/ttakahashi21/mpi-operator/blob/dev-takahashi/pkg/controller/mpi_job_controller.go#L652-L686
I added the debugging code linked above and confirmed the behavior.
When using Kueue, is it expected that the initial value of mpiJob.Spec.RunPolicy.Suspend is true?
case1
- The default value of mpiJob.Spec.RunPolicy.Suspend for job1 and job2 is false, so "!isMPIJobSuspended(mpiJob)" is true for both jobs and the workers are created or fetched. Since the launcher has not been created yet, the condition "launcher == nil" is also true for both jobs. For job1, "c.countReadyWorkerPods(worker) == len(worker)" eventually becomes true and the launcher for job1 is created. For job2, neither "LauncherCreationPolicy == kubeflow.LauncherCreationPolicyAtStartup" nor "c.countReadyWorkerPods(worker) == len(worker)" is true, so the launcher for job2 is not created.
The details of the debug log are as follows
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - c.countReadyWorkerPods(worker) : 0 - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - len(worker) : 2 - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - c.countReadyWorkerPods(worker) : 0 - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - len(worker) : 2 - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - c.countReadyWorkerPods(worker) : 0 - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - len(worker) : 2 - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - c.countReadyWorkerPods(worker) : 0 - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - len(worker) : 2 - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - c.countReadyWorkerPods(worker) : 0 - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - len(worker) : 2 - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - c.countReadyWorkerPods(worker) : 0 - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - len(worker) : 2 - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - c.countReadyWorkerPods(worker) : 0 - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - len(worker) : 2 - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - c.countReadyWorkerPods(worker) : 0 - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - len(worker) : 2 - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - c.countReadyWorkerPods(worker) : 1 - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - len(worker) : 2 - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - c.countReadyWorkerPods(worker) : 1 - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - len(worker) : 2 - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - c.countReadyWorkerPods(worker) : 2 - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - len(worker) : 2 - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.LauncherCreationPolicy == kubeflow.LauncherCreationPolicyAtStartup || c.countReadyWorkerPods(checkworker) == len(checkworker): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - deploy launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
case3
The value of mpiJob.Spec.RunPolicy.Suspend for job1 and job2 is set to true when the MPIJob is first started, so "!isMPIJobSuspended(mpiJob)" is false for both jobs and the workers are not created or fetched. Since the launcher has not been created yet, the condition "launcher == nil" is true for both jobs. However, the condition "c.countReadyWorkerPods(worker) == len(worker)" becomes true for job1 and job2, and the launcher for both jobs is created. The reason is that "c.countReadyWorkerPods(worker)" and "len(worker)" are both 0, because the worker information was never fetched.
The details of the debug log are as follows
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - c.countReadyWorkerPods(worker) : 0 - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - len(worker) : 0 - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.LauncherCreationPolicy == kubeflow.LauncherCreationPolicyAtStartup || c.countReadyWorkerPods(checkworker) == len(checkworker): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - deploy launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - c.countReadyWorkerPods(worker) : 0 - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - len(worker) : 0 - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.LauncherCreationPolicy == kubeflow.LauncherCreationPolicyAtStartup || c.countReadyWorkerPods(checkworker) == len(checkworker): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - deploy launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi
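Based on the debug output above, a hedged sketch of the gating that would close the case3 gap: check suspension before the WaitForWorkersReady count, and do not treat an empty worker list as "all workers ready". The field and constant names follow the logs and type snippet above, the import path and the pointer dereference on Suspend are assumptions, and this is an illustration rather than the actual patch:
package sketch

import (
	kubeflow "github.com/kubeflow/mpi-operator/pkg/apis/kubeflow/v2beta1"
	corev1 "k8s.io/api/core/v1"
)

// shouldCreateLauncherNow returns true only when the MPIJob is not suspended and
// its launcherCreationPolicy is satisfied. The len(workers) > 0 guard prevents an
// empty worker list (as seen while the job is suspended) from vacuously satisfying
// "all workers ready", which is what the case3 log shows.
func shouldCreateLauncherNow(mpiJob *kubeflow.MPIJob, workers []*corev1.Pod, readyWorkers int) bool {
	if mpiJob.Spec.RunPolicy.Suspend != nil && *mpiJob.Spec.RunPolicy.Suspend {
		return false
	}
	if mpiJob.Spec.LauncherCreationPolicy == kubeflow.LauncherCreationPolicyAtStartup {
		return true
	}
	return len(workers) > 0 && readyWorkers == len(workers)
}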
In my opinion, the modification from #617 is necessary.
@GonzaloSaez please fix the DCO
For context, linking it back to the related Kueue issue: kubernetes-sigs/kueue#3400 and the slack discussion https://kubernetes.slack.com/archives/C032ZE66A2X/p1730369507818399
Signed-off-by: GonzaloSaez <[email protected]>
When the MPIJob starts suspended, we were creating the launcher job regardless of the initial suspended state. This causes issues with kueue, since it will suspend the MPIJob but the launcher job will already have been created with the wrong NodeSelector with respect to the kueue flavour. I think avoiding creating the launcher in this scenario is the right thing to do, but I'm not sure if others have different thoughts.