feat: add nvidia MIG #258

Draft: 3 commits to merge into base: develop

piyush-jena
Contributor

Issue number:

Related:

Description of changes:
Adds the nvidia-migmanager service and binary, which configure the instance for NVIDIA MIG (Multi-Instance GPU).

Testing done:

  1. Instance joined the cluster
NAME                                           STATUS   ROLES    AGE   VERSION
ip-XXXX.us-west-2.compute.internal   Ready    <none>   15h   v1.29.5-eks-1109419
  1. Model Default:
bash-5.1# apiclient get settings.kubelet-device-plugin
{
  "settings": {
    "kubelet-device-plugins": {
      "nvidia": {
        "device-id-strategy": "index",
        "device-list-strategy": "volume-mounts",
        "device-partitioning-strategy": "none",
        "device-sharing-strategy": "none",
        "pass-device-specs": true
      }
    }
  }
}
  1. Model Updates:
bash-5.1# apiclient set settings.kubelet-device-plugins.nvidia.device-partitioning-strategy="mig"
bash-5.1# apiclient set settings.kubelet-device-plugins.nvidia.mig.profile-a100="1g.5gb"
bash-5.1# apiclient get settings.kubelet-device-plugin
{
  "settings": {
    "kubelet-device-plugins": {
      "nvidia": {
        "device-id-strategy": "index",
        "device-list-strategy": "volume-mounts",
        "device-partitioning-strategy": "mig",
        "device-sharing-strategy": "none",
        "mig": {
          "profile-a100": "1g.5gb"
        },
        "pass-device-specs": true
      }
    }
  }
}

After rebooting the instance, kubectl describe node reports 56 GPUs.
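
The count of 56 is consistent with every GPU being fully partitioned into `1g.5gb` instances, provided the node has eight A100 GPUs. The instance type is not stated in this PR, so the GPU count below is an assumption; the per-GPU limits are the ones NVIDIA documents for the A100 40GB. A quick arithmetic check:

```python
# Maximum MIG instances per A100 40GB GPU for a few profiles,
# as listed in NVIDIA's MIG documentation (not taken from this PR).
MAX_INSTANCES_PER_A100 = {
    "1g.5gb": 7,
    "2g.10gb": 3,
    "3g.20gb": 2,
    "7g.40gb": 1,
}


def expected_mig_devices(gpu_count: int, profile: str) -> int:
    """Return the number of MIG devices when every GPU uses the same profile."""
    return gpu_count * MAX_INSTANCES_PER_A100[profile]


# Assumption: the node has 8 A100 GPUs (not stated in the PR description).
print(expected_mig_devices(8, "1g.5gb"))  # 56
```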

  1. Validation check (an unsupported profile is rejected):
bash-5.1# apiclient set settings.kubelet-device-plugins.nvidia.mig.profile-a100="1g.10gb"
Failed to change settings: Failed PATCH request to '/settings/keypair?tx=apiclient-set-pRfJIisTgvDaRWMm': Status 400 when PATCHing /settings/keypair?tx=apiclient-set-pRfJIisTgvDaRWMm: Unable to match your input to the data model.  We may not have enough type information.  Please try the --json input form.  Cause: Error during deserialization: Unable to deserialize into MIGA100Profile: Invalid MIG Profile value '1g.10gb' at line 1 column 69
bash-5.1#
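
The rejection above happens when the settings model tries to deserialize the value into `MIGA100Profile`. The Python sketch below illustrates the same allow-list idea. The set of accepted profiles and the name `parse_a100_profile` are hypothetical and are not taken from this PR's implementation.

```python
# Hypothetical allow-list; the set actually accepted by MIGA100Profile may differ.
ALLOWED_A100_PROFILES = frozenset({"1g.5gb", "2g.10gb", "3g.20gb", "7g.40gb"})


def parse_a100_profile(value: str) -> str:
    """Return the profile unchanged if it is allowed, otherwise raise ValueError."""
    if value not in ALLOWED_A100_PROFILES:
        raise ValueError(f"Invalid MIG Profile value '{value}'")
    return value


print(parse_a100_profile("1g.5gb"))
try:
    parse_a100_profile("1g.10gb")
except ValueError as err:
    print(err)
```

Rejecting the value at deserialization time means an unsupported profile never reaches the generated configuration files.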
  1. Files generated:
bash-5.1# cat /etc/nvidia-k8s-device-plugin/settings.yaml
version: v1
flags:
  migStrategy: "single"
  failOnInitError: true
  plugin:
    passDeviceSpecs: true
    deviceListStrategy: volume-mounts
    deviceIDStrategy: index

bash-5.1# cat /etc/nvidia-migmanager/nvidia-migmanager.toml
device-partitioning-strategy = "mig"
profile-a100 = "1g.5gb"
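
As a rough illustration of how the settings shown under "Model Updates" map onto the generated `nvidia-migmanager.toml`, the sketch below renders the two keys from the settings JSON. The function name `render_migmanager_toml` is hypothetical, and the PR's actual templating mechanism is not shown in this description.

```python
import json


def render_migmanager_toml(settings_json: str) -> str:
    """Render the two migmanager keys from `apiclient get` style JSON output."""
    nvidia = json.loads(settings_json)["settings"]["kubelet-device-plugins"]["nvidia"]
    strategy = nvidia["device-partitioning-strategy"]
    lines = [f'device-partitioning-strategy = "{strategy}"']
    # The "mig" table is only present once a profile has been set.
    profile = nvidia.get("mig", {}).get("profile-a100")
    if profile is not None:
        lines.append(f'profile-a100 = "{profile}"')
    return "\n".join(lines) + "\n"


example = json.dumps({
    "settings": {
        "kubelet-device-plugins": {
            "nvidia": {
                "device-partitioning-strategy": "mig",
                "mig": {"profile-a100": "1g.5gb"},
            }
        }
    }
})
print(render_migmanager_toml(example), end="")
```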

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

piyush-jena changed the title from "feat: add nvidia mig" to "feat: add nvidia MIG" on Nov 13, 2024