AMD GPU Device Plugin for Kubernetes

Introduction

This is a Kubernetes device plugin implementation that registers AMD GPUs in a container cluster for compute workloads. With the appropriate hardware and this plugin deployed in your Kubernetes cluster, you can run jobs that require AMD GPUs.

This plugin is required by tools such as the AMD GPU Operator to expose AMD GPUs as schedulable resources.

More information about ROCm.

Prerequisites

ROCm capable machines
kubeadm capable machines (if you are using kubeadm to deploy your k8s cluster)
ROCm kernel (Installation guide) or latest AMD GPU Linux driver (Installation guide)
A Kubernetes deployment
If device health checks are enabled, the pods must be allowed to run in privileged mode (for example, by setting the --allow-privileged=true flag for kube-apiserver) so that they can access /dev/kfd
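
Before deploying, it can help to confirm that a node actually exposes the GPU device nodes. The following is a minimal sketch, not part of this repository: check_node is a hypothetical helper, /dev/kfd is the path named in the list above, and /dev/dri is an assumption about where the render nodes appear on a typical ROCm host.

```shell
#!/bin/sh
# Report whether a device node (or any other path) exists on this host.
# Prints "ok <path>" and returns 0, or prints "missing <path>" and returns 1.
check_node() {
    if [ -e "$1" ]; then
        echo "ok $1"
    else
        echo "missing $1"
        return 1
    fi
}

# /dev/kfd is the compute interface mentioned in the prerequisites;
# /dev/dri is an assumed location for the GPU render nodes.
for path in /dev/kfd /dev/dri; do
    check_node "$path" || true
done
```

Run this on each GPU node. A "missing" line suggests that the driver is not loaded or that the device is not visible from where the script runs.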

Limitations

This plugin targets Kubernetes v1.18+.

Deployment

The device plugin must run on every node that is equipped with AMD GPUs. The simplest way to do this is to create a Kubernetes DaemonSet, which runs a copy of a pod on all (or some) nodes in the cluster. A pre-built Docker image is available on DockerHub for use in your DaemonSet. This repository also contains a pre-defined YAML file named k8s-ds-amdgpu-dp.yaml. You can create a DaemonSet in your Kubernetes cluster by running this command:

kubectl create -f k8s-ds-amdgpu-dp.yaml

or pull the file directly from the web:

kubectl create -f https://raw.githubusercontent.com/ROCm/k8s-device-plugin/master/k8s-ds-amdgpu-dp.yaml

If you want to enable the experimental device health check, please use k8s-ds-amdgpu-dp-health.yaml after --allow-privileged=true is set for kube-apiserver.
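
Once the DaemonSet is running, one way to check that the plugin has registered is to list how many GPUs each node reports as allocatable. This is a sketch under assumptions: it assumes the plugin advertises GPUs under the resource name amd.com/gpu, and gpu_columns is a hypothetical helper that builds the kubectl custom-columns expression (dots inside the resource name have to be escaped).

```shell
#!/bin/sh
# Resource name the plugin is assumed to advertise.
RESOURCE="amd.com/gpu"

# Build a custom-columns spec for kubectl.
# Dots in the resource name are escaped so they are not read as path separators.
gpu_columns() {
    escaped=$(printf '%s' "$1" | sed 's/\./\\./g')
    printf 'NODE:.metadata.name,GPUS:.status.allocatable.%s\n' "$escaped"
}

if command -v kubectl >/dev/null 2>&1; then
    kubectl get nodes -o "custom-columns=$(gpu_columns "$RESOURCE")" \
        || echo "could not query the cluster"
else
    echo "kubectl not found; skipping"
fi
```

Nodes without a registered plugin would be expected to show `<none>` in the GPUS column.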

Helm Chart

If you want to deploy this device plugin using Helm, a Helm Chart is available via Artifact Hub.

Example workload

You can restrict workloads to a node with a GPU by adding resources.limits to the pod definition. An example pod definition is provided in example/pod/alexnet-gpu.yaml. This pod runs the timing benchmark for AlexNet on an AMD GPU and then goes to sleep. You can create the pod by running:

kubectl create -f alexnet-gpu.yaml

or

kubectl create -f https://raw.githubusercontent.com/ROCm/k8s-device-plugin/master/example/pod/alexnet-gpu.yaml

and then check the pod status by running:

kubectl describe pods

After the pod is created and running, you can see the benchmark result by running:

kubectl logs alexnet-tf-gpu-pod alexnet-tf-gpu-container

For comparison, an example pod definition that runs the same benchmark on the CPU is provided in example/pod/alexnet-cpu.yaml.
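
To illustrate the resources.limits mechanism on its own, the sketch below prints a minimal pod manifest that requests a given number of GPUs. It is not the AlexNet example from this repository: the pod name, container name, image tag, and command are hypothetical placeholders, and the resource name amd.com/gpu is an assumption about what the plugin registers.

```shell
#!/bin/sh
# Print a minimal pod manifest that requests GPUs through resources.limits.
# Pod name, container name, image and command are hypothetical placeholders.
gpu_pod_manifest() {
    count="${1:-1}"   # number of GPUs to request, defaults to 1
    cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-demo-pod
spec:
  restartPolicy: OnFailure
  containers:
    - name: gpu-demo-container
      image: rocm/tensorflow:latest
      command: ["sleep", "infinity"]
      resources:
        limits:
          amd.com/gpu: ${count}
EOF
}

# Show the manifest for a single-GPU request.
gpu_pod_manifest 1
```

To submit it, the output can be piped to kubectl, for example `gpu_pod_manifest 1 | kubectl create -f -`. The scheduler should then place the pod only on a node that has that many GPUs free.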

Labelling node with additional GPU properties

Please see AMD GPU Kubernetes Node Labeller for details. An example configuration is in k8s-ds-amdgpu-labeller.yaml:

kubectl create -f k8s-ds-amdgpu-labeller.yaml

or

kubectl create -f https://raw.githubusercontent.com/ROCm/k8s-device-plugin/master/k8s-ds-amdgpu-labeller.yaml
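
After the labeller is deployed, the labels it added can be inspected per node. The sketch below assumes that the labeller's label keys contain the string amd.com; filter_amd_labels is a hypothetical helper, and the exact label names depend on the labeller configuration.

```shell
#!/bin/sh
# Read a comma-separated label list on stdin and keep only the
# entries whose text contains "amd.com" (assumed labeller prefix).
filter_amd_labels() {
    tr ',' '\n' | grep 'amd\.com' || true
}

if command -v kubectl >/dev/null 2>&1; then
    # Columns without headers: NAME STATUS ROLES AGE VERSION LABELS
    kubectl get nodes --no-headers --show-labels |
        while read -r name status roles age version labels; do
            echo "== $name"
            printf '%s\n' "$labels" | filter_amd_labels
        done
else
    echo "kubectl not found; skipping"
fi
```

These labels can then be used in a nodeSelector to steer workloads to nodes with particular GPU properties.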

Health per GPU

The plugin can perform more granular, per-GPU health detection by using the exporter health service over a gRPC socket mounted at /var/lib/amd-metrics-exporter/.

Notes

This plugin uses Go modules for dependency management
Please consult the Dockerfile to see how to build and use this plugin independently of a Docker image

TODOs

~~Add proper GPU health check (health check without /dev/kfd access.)~~
