
Add resource.sharing-strategy labels #503

Merged 1 commit into NVIDIA:main, Feb 20, 2024

Conversation

@elezar (Member) commented Feb 6, 2024

This change adds sharing-strategy labels per resource.

This label can have the values none, mps, or time-slicing, depending on the sharing configuration. For invalid configurations, the label is empty.
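For illustration, this label can drive scheduling directly. A hypothetical pod spec that only lands on nodes whose GPUs are shared via MPS might look like the following (the pod name and image tag are illustrative; only the label comes from this change):

```yaml
# Hypothetical pod spec: schedule only onto nodes where GPUs are
# shared using MPS, as reported by the new per-resource label.
apiVersion: v1
kind: Pod
metadata:
  name: mps-consumer
spec:
  nodeSelector:
    nvidia.com/gpu.sharing-strategy: "mps"
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.3.1-base-ubuntu22.04
    resources:
      limits:
        nvidia.com/gpu: 1
```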

Running against the in-tree kind cluster:

1. Deploy GFD only and get the labels:
helm upgrade nvidia -i deployments/helm/nvidia-device-plugin \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --set runtimeClassName=nvidia \
    --set nvidiaDriverRoot=/ \
    --set devicePlugin.enabled=false \
    --set gfd.enabled=true
kubectl get node k8s-device-plugin-cluster-worker  --output=json | jq '.metadata.labels' | grep "nvidia.com" | sort
  "nvidia.com/cuda.driver.major": "515",
  "nvidia.com/cuda.driver.minor": "105",
  "nvidia.com/cuda.driver.rev": "01",
  "nvidia.com/cuda.runtime.major": "11",
  "nvidia.com/cuda.runtime.minor": "7",
  "nvidia.com/gfd.timestamp": "1707221624",
  "nvidia.com/gpu.compute.major": "7",
  "nvidia.com/gpu.compute.minor": "0",
  "nvidia.com/gpu.count": "8",
  "nvidia.com/gpu.family": "volta",
  "nvidia.com/gpu.machine": "kind",
  "nvidia.com/gpu.memory": "16384",
  "nvidia.com/gpu.product": "Tesla-V100-SXM2-16GB-N",
  "nvidia.com/gpu.replicas": "1",
  "nvidia.com/gpu.sharing-strategy": "none",
  "nvidia.com/mig.capable": "false",
  "nvidia.com/sharing.mps.enabled": "false"
2. Configure MPS sharing:
cat << EOF > dp-mps-config.yaml
version: v1
flags:
  migStrategy: "none"
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy:
    - cdi-annotations
    deviceIDStrategy: uuid
sharing:
  mps:
    renameByDefault: false
    resources:
    - name: nvidia.com/gpu
      replicas: 4
EOF
kubectl create cm -n nvidia-device-plugin nvidia-plugin-mps \
    --from-file=config=dp-mps-config.yaml

Deploy using the config:

helm upgrade nvidia -i deployments/helm/nvidia-device-plugin \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --set runtimeClassName=nvidia \
    --set nvidiaDriverRoot=/ \
    --set devicePlugin.enabled=false \
    --set gfd.enabled=true \
    --set config.name=nvidia-plugin-mps

Get the labels:

kubectl get node k8s-device-plugin-cluster-worker --output=json | jq '.metadata.labels' | grep "nvidia.com" | sort
  "nvidia.com/cuda.driver.major": "515",
  "nvidia.com/cuda.driver.minor": "105",
  "nvidia.com/cuda.driver.rev": "01",
  "nvidia.com/cuda.runtime.major": "11",
  "nvidia.com/cuda.runtime.minor": "7",
  "nvidia.com/gfd.timestamp": "1707222576",
  "nvidia.com/gpu.compute.major": "7",
  "nvidia.com/gpu.compute.minor": "0",
  "nvidia.com/gpu.count": "8",
  "nvidia.com/gpu.family": "volta",
  "nvidia.com/gpu.machine": "kind",
  "nvidia.com/gpu.memory": "16384",
  "nvidia.com/gpu.product": "Tesla-V100-SXM2-16GB-N-SHARED",
  "nvidia.com/gpu.replicas": "4",
  "nvidia.com/gpu.sharing-strategy": "mps",
  "nvidia.com/mig.capable": "false",
  "nvidia.com/sharing.mps.enabled": "true"
3. Configure MPS sharing with renaming:
cat << EOF > dp-mps-rename-config.yaml
version: v1
flags:
  migStrategy: "none"
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy:
    - cdi-annotations
    deviceIDStrategy: uuid
sharing:
  mps:
    renameByDefault: true
    resources:
    - name: nvidia.com/gpu
      replicas: 4
EOF
kubectl create cm -n nvidia-device-plugin nvidia-plugin-mps-rename \
    --from-file=config=dp-mps-rename-config.yaml

Deploy:

helm upgrade nvidia -i deployments/helm/nvidia-device-plugin \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --set runtimeClassName=nvidia \
    --set nvidiaDriverRoot=/ \
    --set devicePlugin.enabled=false \
    --set gfd.enabled=true \
    --set config.name=nvidia-plugin-mps-rename

Get the labels:

kubectl get node k8s-device-plugin-cluster-worker  --output=json | jq '.metadata.labels' | grep "nvidia.com" | sort
  "nvidia.com/cuda.driver.major": "515",
  "nvidia.com/cuda.driver.minor": "105",
  "nvidia.com/cuda.driver.rev": "01",
  "nvidia.com/cuda.runtime.major": "11",
  "nvidia.com/cuda.runtime.minor": "7",
  "nvidia.com/gfd.timestamp": "1707222920",
  "nvidia.com/gpu.compute.major": "7",
  "nvidia.com/gpu.compute.minor": "0",
  "nvidia.com/gpu.count": "8",
  "nvidia.com/gpu.family": "volta",
  "nvidia.com/gpu.machine": "kind",
  "nvidia.com/gpu.memory": "16384",
  "nvidia.com/gpu.product": "Tesla-V100-SXM2-16GB-N",
  "nvidia.com/gpu.replicas": "4",
  "nvidia.com/gpu.sharing-strategy": "mps",
  "nvidia.com/mig.capable": "false",
  "nvidia.com/sharing.mps.enabled": "true"

"nvidia.com/gpu.count": "1",
"nvidia.com/gpu.replicas": "2",
"nvidia.com/gpu.sharing-strategy": "mps",
"nvidia.com/gpu.memory": "300",
Member Author:
One question I just had was whether sharing using MPS should affect the memory label?

Contributor:

Or maybe have two separate labels, one representing total memory and another representing memory per replica?

Member Author:

I don't know about a separate label. What would this label signify when time-slicing is selected, for example?

Contributor:

Let's do this as a follow-up.


@elezar modified the milestone: k8s-device-plugin/v0.15.0, Feb 6, 2024
"nvidia.com/mig.capable": "[true|false]",
"nvidia.com/gpu.compute.major": "[0-9]+",
"nvidia.com/gpu.compute.minor": "[0-9]+",
"nvidia.com/sharing.mps.enabled": "[true|false]",
Contributor:

Question -- what component is adding this label to the node?

Contributor:

Okay I see this was added to GFD in a previous commit: https://github.com/NVIDIA/k8s-device-plugin/blob/main/internal/lm/nvml.go#L149

Question -- do we still need nvidia.com/sharing.mps.enabled? AFAIK we use this as the nodeSelector for the MPS control daemon. Is it sufficient to just use nvidia.com/gpu.sharing-strategy=mps?
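For context on the question above, the control daemon is scheduled onto nodes via this label. A minimal sketch of the relevant daemonset fields (everything except the nodeSelector label is illustrative, not the actual chart output):

```yaml
# Sketch: gate the MPS control daemon daemonset on the node label
# discussed here. Only the nodeSelector entry reflects the real label;
# names and image are placeholders.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-mps-control-daemon
spec:
  selector:
    matchLabels:
      app: nvidia-mps-control-daemon
  template:
    metadata:
      labels:
        app: nvidia-mps-control-daemon
    spec:
      nodeSelector:
        nvidia.com/sharing.mps.enabled: "true"
      containers:
      - name: mps-control-daemon
        image: example.com/mps-control-daemon:latest
```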

Member Author:

The issue is that the sharing-strategy label is per-resource. This means that if we enable resource renaming or start allowing MPS with mixed MIG mode, the label to trigger off of is not known ahead of time.

I would say that nvidia.com/sharing.mps.enabled is akin to nvidia.com/mig.capable in this respect.

@klueska I know that you expressed some concerns over the exact label. Do you have any other suggestions for the label? Should we just make it nvidia.com/mps.capable to mirror mig and to indicate that the final decision depends on whether a number of replicas greater than 1 is actually selected?

Contributor:

mig.capable is different from nvidia.com/sharing.mps.enabled in that it says the GPUs on a node are capable of being split into MIG devices -- it doesn't imply that they are in MIG mode or that any MIG devices are currently configured.

However it is similar in that it is primarily used to signal that the MIG manager should be deployed on the node (whereas we want something to signal that an MPS daemon should be started for this resource type).

I think dropping nvidia.com/sharing.mps.enabled and adding a nvidia.com/mps.capable seems reasonable.

Just to make sure we are on the same page -- this would get set if there is a sharing.mps section in the config, and that would trigger the MPS daemonset to be deployed (similar to MIG manager) and then start one MPS control daemon per resource type that has replicas > 1. Is that correct?

Member Author:

To start: It could also be nvidia.com/mps.enabled instead of nvidia.com/mps.capable.

We currently set the nvidia.com/sharing.mps.enabled label if the sharing.mps section is present and at least one resource pattern has replicas > 1. This controls the deployment of the MPS daemonset.

In this daemonset the pattern is evaluated against the actual resources, and if any of these have replicas > 1 a nvidia-cuda-mps-control daemon is started for that resource. Note that we currently support only a single resource name and as such start only a single daemon. The plan is to extend this once we have more user feedback and experience.
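The two-stage trigger described above can be sketched as follows. This is a minimal illustration only; replicatedResource and mpsEnabled are hypothetical names, not the plugin's actual types:

```go
package main

import "fmt"

// replicatedResource is a hypothetical, simplified stand-in for a
// sharing.mps resource entry from the plugin config.
type replicatedResource struct {
	Name     string
	Replicas int
}

// mpsEnabled sketches the rule: the label is set only when a
// sharing.mps section exists (resources is non-empty) and at least
// one resource pattern requests more than one replica.
func mpsEnabled(resources []replicatedResource) bool {
	for _, r := range resources {
		if r.Replicas > 1 {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(mpsEnabled(nil))                                                   // false: no sharing.mps section
	fmt.Println(mpsEnabled([]replicatedResource{{Name: "nvidia.com/gpu", Replicas: 4}})) // true
	fmt.Println(mpsEnabled([]replicatedResource{{Name: "nvidia.com/gpu", Replicas: 1}})) // false: no replication
}
```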

Contributor:

I think we should either make it nvidia.com/mps.capable to mimic what we have for MIG, or just stick with nvidia.com/sharing.mps.enabled. I think that having nvidia.com/mig.capable, but nvidia.com/mps.enabled would be confusing, especially since MPS can be layered on top of MIG.

Speaking of, are there any plans for an equivalent label for timeSlicing? I know there is no daemon that needs to be started in this respect, but does it make sense to add for symmetry?

Member Author:

Fair point on using nvidia.com/mps.capable.

There are no current plans for timeSlicing, but we can add it. There are, for example, some devices that don't support time slicing and adding this label may make sense there. (see NVIDIA/k8s-dra-driver#58).

I can add an issue to create this label as a follow-up.

Member Author:

Created #533 to do the rename.

rawLabels := map[string]interface{}{
	"product":  rl.getProductName(parts...),
	"count":    count,
	"replicas": rl.getReplicas(),
Contributor:

Suggested change:
-	"replicas": rl.getReplicas(),
+	"replicas": replicas,

This change adds sharing-strategy labels per resource.

This label can have the value: none, mps, time-slicing depending
on the sharing configuration. For invalid configurations, this label
is empty.

Signed-off-by: Evan Lezar <[email protected]>
@elezar merged commit 35c1393 into NVIDIA:main Feb 20, 2024
6 checks passed
@elezar deleted the add-per-strategy-labels branch February 20, 2024 13:41