This is a Kubernetes device plugin implementation that enables the registration of AMD GPUs in a container cluster for compute workloads. With the appropriate hardware and this plugin deployed in your Kubernetes cluster, you will be able to run jobs that require an AMD GPU.
More information about ROCm.
- ROCm-capable machines
- kubeadm-capable machines (if you are using kubeadm to deploy your k8s cluster)
- ROCm kernel (Installation guide) or latest AMD GPU Linux driver (Installation guide)
- A Kubernetes deployment
- If device health checks are enabled, the pods must be allowed to run in privileged mode (for example, with the --allow-privileged=true flag for kube-apiserver) in order to access /dev/kfd
- This plugin targets Kubernetes v1.18+.
The device plugin needs to run on every node that is equipped with an AMD GPU. The simplest way of doing so is to create a Kubernetes DaemonSet, which runs a copy of a pod on all (or some) nodes in the cluster. We have a pre-built Docker image on DockerHub that you can use for your DaemonSet. This repository also has a pre-defined yaml file named k8s-ds-amdgpu-dp.yaml. You can create a DaemonSet in your Kubernetes cluster by running this command:
kubectl create -f k8s-ds-amdgpu-dp.yaml
or directly pull from the web using
kubectl create -f https://raw.githubusercontent.com/ROCm/k8s-device-plugin/master/k8s-ds-amdgpu-dp.yaml
If you want to enable the experimental device health check, please use k8s-ds-amdgpu-dp-health.yaml after --allow-privileged=true is set for kube-apiserver.
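As a rough illustration of what running in privileged mode means for the health-check variant, the fragment below sketches the relevant part of a DaemonSet pod template. It is not copied from k8s-ds-amdgpu-dp-health.yaml: the container name and the image placeholder are hypothetical, and only the securityContext field is the point of the example.

```yaml
# Sketch only: fragment of a DaemonSet spec, not the repository's actual manifest.
spec:
  template:
    spec:
      containers:
        - name: amdgpu-dp-cntr            # hypothetical container name
          image: <device-plugin-image>    # placeholder; use the image from the repository's yaml
          securityContext:
            privileged: true              # allows the container to access /dev/kfd
```

The cluster must permit privileged containers for a pod like this to be admitted, which is why the --allow-privileged=true flag is mentioned above.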
If you want to deploy this device plugin using Helm, a Helm Chart is available via Artifact Hub.
You can restrict workloads to a node with a GPU by adding resources.limits to the pod definition. An example pod definition is provided in example/pod/alexnet-gpu.yaml. This pod runs the timing benchmark for AlexNet on AMD GPU and then goes to sleep. You can create the pod by running:
kubectl create -f example/pod/alexnet-gpu.yaml
or
kubectl create -f https://raw.githubusercontent.com/ROCm/k8s-device-plugin/master/example/pod/alexnet-gpu.yaml
and then check the pod status by running
kubectl describe pods
After the pod is created and running, you can see the benchmark result by running:
kubectl logs alexnet-tf-gpu-pod alexnet-tf-gpu-container
For comparison, an example pod definition of running the same benchmark with CPU is provided in example/pod/alexnet-cpu.yaml.
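To make the resources.limits approach concrete, here is a minimal sketch of a pod that requests one GPU. It reuses the pod and container names from the kubectl logs command above, but it is not a copy of the repository's example file: the image is an assumption, and the sketch assumes the plugin advertises GPUs under the amd.com/gpu resource name.

```yaml
# Minimal sketch of a GPU-requesting pod; see example/pod/alexnet-gpu.yaml for the real definition.
apiVersion: v1
kind: Pod
metadata:
  name: alexnet-tf-gpu-pod
spec:
  containers:
    - name: alexnet-tf-gpu-container
      image: rocm/tensorflow:latest   # assumed image; substitute your own ROCm-enabled image
      resources:
        limits:
          amd.com/gpu: 1              # assumed resource name; schedules the pod only on a node with a free GPU
```

Because the limit names a resource that only GPU nodes advertise, the scheduler will not place the pod on a node without one.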
Please see AMD GPU Kubernetes Node Labeller for details. An example configuration is in k8s-ds-amdgpu-labeller.yaml:
kubectl create -f k8s-ds-amdgpu-labeller.yaml
or
kubectl create -f https://raw.githubusercontent.com/ROCm/k8s-device-plugin/master/k8s-ds-amdgpu-labeller.yaml
- This plugin uses Go modules for dependency management
- Please consult the Dockerfile on how to build and use this plugin independent of a Docker image
- Add proper GPU health check (health check without /dev/kfd access)