bug(MPI Training) : Scheduling Policy doc bug for MPIJob #4032

Open
ttakahashi21 opened this issue Mar 5, 2025 · 2 comments
Assignees
Labels
area/trainer AREA: Kubeflow Trainer / Kubeflow Training Operator area/website AREA: Website Styles/Hosting/Serving kind/bug KIND: Bugs


ttakahashi21 commented Mar 5, 2025

This is a Bug Report

Problem:

Applying the manifest below fails with a strict decoding error: unknown field "spec.runPolicy.schedulingPolicy.minResources.reuests.nvidia.com/gpu"

  • Error case
    • a requests field (spelled reuests in this manifest) is nested under minResources
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  labels:
    kueue.x-k8s.io/queue-name: user-queue
  name: tensorflow-benchmarks
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
    schedulingPolicy:
      minAvailable: 4
      queue: user-queue
      minResources:
        reuests:
          nvidia.com/gpu: 3
      scheduleTimeoutSeconds: 0
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
         spec:
           containers:
           - image: mpioperator/tensorflow-benchmarks:latest
             name: tensorflow-benchmarks
             command:
             - mpirun
             - --allow-run-as-root
             - -np
             - "2"
             - -bind-to
             - none
             - -map-by
             - slot
             - -x
             - NCCL_DEBUG=INFO
             - -x
             - LD_LIBRARY_PATH
             - -x
             - PATH
             - -mca
             - pml
             - ob1
             - -mca
             - btl
             - ^openib
             - python
             - scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
             - --model=resnet101
             - --batch_size=64
             - --variable_update=horovod
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - image: mpioperator/tensorflow-benchmarks:latest
            name: tensorflow-benchmarks
            resources:
              limits:
                nvidia.com/gpu: 1
kubectl apply -f tensorflow-benchmarks.yaml
  • Output
Error from server (BadRequest): error when creating "tensorflow-benchmarks.yaml": MPIJob in version "v2beta1" cannot be handled as a MPIJob: strict decoding error: unknown field "spec.runPolicy.schedulingPolicy.minResources.reuests.nvidia.com/gpu"
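
The error is consistent with minResources being a flat map from resource name to quantity, so that an extra level of nesting under it is reported as an unknown field. A minimal Python sketch of that check on an already-parsed manifest follows. The helper name `nested_min_resources` and the dict literals are illustrative assumptions, not part of the MPI Operator; the actual validation is done by the API server.

```python
from typing import Any


def nested_min_resources(manifest: dict[str, Any]) -> list[str]:
    """Return dotted paths of minResources entries whose value is a mapping.

    Assumption: minResources should map resource names directly to
    quantities, as in the manifest that was accepted in this issue.
    """
    policy = (
        manifest.get("spec", {})
        .get("runPolicy", {})
        .get("schedulingPolicy", {})
    )
    min_resources = policy.get("minResources") or {}
    prefix = "spec.runPolicy.schedulingPolicy.minResources"
    return [
        f"{prefix}.{name}"
        for name, value in min_resources.items()
        if isinstance(value, dict)  # a nested mapping, not a quantity
    ]


# Shape rejected in this issue: an extra level under minResources
# (the nested key is spelled "reuests" in the reported manifest).
rejected = {"spec": {"runPolicy": {"schedulingPolicy": {
    "minResources": {"reuests": {"nvidia.com/gpu": 3}}}}}}

# Shape accepted in this issue: resource name mapped directly to a quantity.
accepted = {"spec": {"runPolicy": {"schedulingPolicy": {
    "minResources": {"nvidia.com/gpu": 3}}}}}

print(nested_min_resources(rejected))
print(nested_min_resources(accepted))
```

The sketch only inspects one level of nesting, which is enough to distinguish the two manifests in this report.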

Proposed Solution:

  • Correct case
    • list the resource directly under minResources, without a requests field
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  labels:
    kueue.x-k8s.io/queue-name: user-queue
  name: tensorflow-benchmarks
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
    schedulingPolicy:
      minAvailable: 4
      queue: user-queue
      minResources:
        nvidia.com/gpu: 3
      scheduleTimeoutSeconds: 0
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
         spec:
           containers:
           - image: mpioperator/tensorflow-benchmarks:latest
             name: tensorflow-benchmarks
             command:
             - mpirun
             - --allow-run-as-root
             - -np
             - "2"
             - -bind-to
             - none
             - -map-by
             - slot
             - -x
             - NCCL_DEBUG=INFO
             - -x
             - LD_LIBRARY_PATH
             - -x
             - PATH
             - -mca
             - pml
             - ob1
             - -mca
             - btl
             - ^openib
             - python
             - scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
             - --model=resnet101
             - --batch_size=64
             - --variable_update=horovod
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - image: mpioperator/tensorflow-benchmarks:latest
            name: tensorflow-benchmarks
            resources:
              limits:
                nvidia.com/gpu: 1
kubectl apply -f tensorflow-benchmarks.yaml

  • Output

mpijob.kubeflow.org/tensorflow-benchmarks created
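
The only difference between the rejected and the accepted manifest is the shape of minResources. As a rough sketch of that transformation, the hypothetical helper below (`flatten_min_resources` is an assumed name, not an existing Kubeflow function) lifts resources out of one wrapper level so they sit directly under minResources.

```python
from typing import Any


def flatten_min_resources(min_resources: dict[str, Any]) -> dict[str, Any]:
    """Lift quantities out of one level of nesting under minResources.

    Hypothetical helper: entries that already map a resource name to a
    quantity are kept, and entries whose value is a mapping are replaced
    by that mapping's own items.
    """
    flat: dict[str, Any] = {}
    for name, value in min_resources.items():
        if isinstance(value, dict):
            flat.update(value)  # drop the wrapper key, keep its resources
        else:
            flat[name] = value
    return flat


# Wrapped form from the rejected manifest, then the already-flat form.
print(flatten_min_resources({"reuests": {"nvidia.com/gpu": 3}}))
print(flatten_min_resources({"nvidia.com/gpu": 3}))
```

Both calls yield the same flat mapping, which matches the minResources block in the manifest that was created successfully.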

Page to Update (provide the full path):

https://www.kubeflow.org/docs/components/trainer/legacy-v1/user-guides/mpi/#scheduling-policy

Component/Kubeflow Version:

Labels

/area website
/area trainer


Impacted by this bug? Give it a 👍.

@ttakahashi21 ttakahashi21 added the kind/bug KIND: Bugs label Mar 5, 2025
@google-oss-prow google-oss-prow bot added the area/website AREA: Website Styles/Hosting/Serving label Mar 5, 2025
@ttakahashi21
Author

/assign ttakahashi21

@ttakahashi21
Author

/area trainer

@google-oss-prow google-oss-prow bot added the area/trainer AREA: Kubeflow Trainer / Kubeflow Training Operator label Mar 5, 2025