I have a single NVidia GPU on a target Talos v1.8.3 node that I'd like to always move to EXCLUSIVE_PROCESS compute mode on boot, with the MPS control daemon running:

```bash
nvidia-smi -i 0 -c EXCLUSIVE_PROCESS
nvidia-cuda-mps-control -d
```

I'd typically add the commands above as a shell script that runs at boot. To do that on Talos, I built a custom system extension and tried to install it with the following machine config patch:

```yaml
# my-patch.yaml
machine:
  install:
    extensions:
      - image: asymingt/nvidia-compute-mode-service:v1.0.1
```

Unfortunately, when I applied the patch it seemed to work, but the extension never showed up (also, as you can see below, this mechanism is deprecated, so I don't know how wise it is to depend on it):

```
$ talosctl -n 192.168.4.154 -e 192.168.4.154 --talosconfig=./talosconfig patch mc --patch @my-patch.yaml -m reboot
patched MachineConfigs.config.talos.dev/v1alpha1 at the node 192.168.4.154
WARNING: .machine.install.extensions is deprecated, please see https://www.talos.dev/latest/talos-guides/install/boot-assets/
Applied configuration with a reboot.
```
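As an aside for anyone hitting the same warning: the non-deprecated path it points to is the boot assets flow, where extensions are baked into the installer image rather than listed under `.machine.install.extensions`. A rough sketch using the Talos imager follows; the registry and installer tag are placeholders, and the exact flags should be checked against the linked boot-assets guide:

```bash
# Build a custom Talos installer image that includes the extension (boot assets flow).
docker run --rm -t -v $PWD/_out:/out ghcr.io/siderolabs/imager:v1.8.3 installer \
  --arch amd64 \
  --system-extension-image asymingt/nvidia-compute-mode-service:v1.0.1

# Push the resulting installer image to a registry you control, then point the node at it:
talosctl -n 192.168.4.154 -e 192.168.4.154 --talosconfig=./talosconfig \
  upgrade --image registry.example.com/talos-installer:v1.8.3-nvidia-mps
```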
```
$ talosctl -n 192.168.4.154 -e 192.168.4.154 --talosconfig=./talosconfig get extensions
NODE            NAMESPACE   TYPE              ID            VERSION   NAME                                        VERSION
192.168.4.154   runtime     ExtensionStatus   0             1         amd-ucode                                   20241110
192.168.4.154   runtime     ExtensionStatus   1             1         amdgpu-firmware                             20241110
192.168.4.154   runtime     ExtensionStatus   2             1         i915-ucode                                  20241110
192.168.4.154   runtime     ExtensionStatus   3             1         nvidia-container-toolkit-production         550.90.07-v1.16.1
192.168.4.154   runtime     ExtensionStatus   4             1         nvidia-open-gpu-kernel-modules-production   550.90.07-v1.8.3
192.168.4.154   runtime     ExtensionStatus   5             1         schematic                                   3efeb200f226e383f39b24073904fb1f776649189a791df2d54b9c321c3343c9
192.168.4.154   runtime     ExtensionStatus   modules.dep   1         modules.dep                                 6.6.60-talos
```

My gut says that something as trivial as a simple command on boot should not require this level of complexity. Perhaps I am wrong, though. Any suggestions would be very helpful. My major concern at this point is that I'm going to have to write a different extension for each node in my cluster depending on how many GPUs are attached and how I want them configured. It would be much easier to just have a sequence of init commands in a machine configuration or patch.

**Instructions for repeatability**

After following the Talos NVidia install instructions, it is possible to get MPS working by enabling privileged execution of a container, and then running the two required commands above inside the container as follows:

```
kubectl label --overwrite ns default pod-security.kubernetes.io/enforce=privileged
kubectl run nvidia-test --restart=Never -ti --rm \
--image nvcr.io/nvidia/cuda:12.4.1-runtime-ubuntu20.04 \
--overrides '{"spec": {"runtimeClassName": "nvidia"}}' --privileged bash
> nvidia-smi -i 0 -c EXCLUSIVE_PROCESS
> nvidia-cuda-mps-control -d
```

Then, assuming that you have installed the NVIDIA device plugin with the following YAML configurations:

```yaml
# rtx4060.yaml
version: v1
sharing:
  mps:
    resources:
    - name: nvidia.com/gpu
      replicas: 4
```

```yaml
# default.yaml
version: v1
```

And you label your target node to mark it as having a single GPU that we want to split with MPS, and make sure, of course, that the device plugin actually picks up that config (one way to wire this up is sketched below).
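For reference, here is a rough sketch of that wiring using the NVIDIA device plugin Helm chart; the release name, namespace, and node name below are placeholders, and the chart values (`config.default`, `config.map.*`) and the `nvidia.com/device-plugin.config` node label should be double-checked against the device plugin documentation for your version:

```bash
# Register the device plugin chart repo and install it with both configs,
# using "default" for unlabelled nodes and "rtx4060" for the MPS node.
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin --create-namespace \
  --set config.default=default \
  --set-file config.map.default=default.yaml \
  --set-file config.map.rtx4060=rtx4060.yaml

# Select the MPS config on the target node; the plugin pod there picks it up
# and starts advertising 4 replicas of nvidia.com/gpu.
kubectl label node target-node nvidia.com/device-plugin.config=rtx4060
```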
After a bit of time you will see the single 8GB GPU split into four 2GB GPUs:

```
$ kubectl describe node target-node
...
Capacity:
  cpu:                192
  ephemeral-storage:  1951051424Ki
  hugepages-2Mi:      0
  memory:             528144832Ki
  nvidia.com/gpu:     4            <---- Yay!
  pods:               110
Allocatable:
  cpu:                191950m
  ephemeral-storage:  1797820553926
  hugepages-2Mi:      0
  memory:             527518144Ki
  nvidia.com/gpu:     4            <---- Yay!
  pods:               110
```
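To sanity-check the split, a minimal test pod along these lines (the pod name is arbitrary; the image and runtime class are the ones used above) should land on one of the four advertised replicas:

```yaml
# mps-test.yaml -- requests one of the four MPS-backed GPU replicas
apiVersion: v1
kind: Pod
metadata:
  name: mps-test
spec:
  runtimeClassName: nvidia
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.4.1-runtime-ubuntu20.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
```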
Replies: 1 comment 1 reply
You can use a Kubernetes DaemonSet as a way to run something on boot. Just run the command, and sleep forever in the shell script. This way, on node reboot, the container/pod will be recreated and it will run the command once again.
Longer term, it might be nice to make it part of the nvidia extension.
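A minimal sketch of that DaemonSet approach, assuming the `nvidia` runtime class and CUDA image from the question above (the namespace, labels, and node selector are illustrative):

```yaml
# gpu-mps-init.yaml -- runs the compute-mode/MPS setup once per node boot,
# then sleeps forever so the pod is not restarted.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gpu-mps-init
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: gpu-mps-init
  template:
    metadata:
      labels:
        app: gpu-mps-init
    spec:
      runtimeClassName: nvidia
      nodeSelector:
        example.com/gpu-mps: "true"   # hypothetical label for MPS-enabled nodes
      containers:
      - name: init
        image: nvcr.io/nvidia/cuda:12.4.1-runtime-ubuntu20.04
        securityContext:
          privileged: true
        command:
        - /bin/bash
        - -c
        - |
          nvidia-smi -i 0 -c EXCLUSIVE_PROCESS
          nvidia-cuda-mps-control -d
          sleep infinity
```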