This tool automatically labels nodes with GPU properties if a node has one or more AMD GPUs installed. It leverages controller-runtime in the spirit of a Custom Resource Definition (CRD) controller, even though no Custom Resource is defined. To run, the Node Labeller requires the following:
- The Node Labeller needs to run inside a Kubernetes Pod.
- The node's hostname needs to be made available inside the container in a text file at the path /labeller/hostname.
- The Pod containing the Labeller needs to be deployed by a service account with sufficient API access. This can be achieved through a ClusterRole and ClusterRoleBinding granting the following access (a sketch follows the list):
  - apiGroups: core ("")
  - resources: nodes
  - verbs: watch, get, list, update
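A minimal sketch of such a ClusterRole and ClusterRoleBinding, assuming the Labeller Pod runs under a service account named node-labeller in the kube-system namespace (both names are illustrative, not mandated by the tool):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: node-labeller
rules:
  - apiGroups: [""]                        # core API group
    resources: ["nodes"]
    verbs: ["watch", "get", "list", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: node-labeller
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: node-labeller
subjects:
  - kind: ServiceAccount
    name: node-labeller                    # illustrative service account name
    namespace: kube-system                 # illustrative namespace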
The Labeller needs to run on all the nodes that are equipped with an AMD GPU. The simplest way of doing so is to create a Kubernetes DaemonSet, which runs a copy of a Pod on all (or some) nodes in the cluster. An example configuration is available here.
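For illustration only (the linked example configuration is the reference), a stripped-down DaemonSet could look like the sketch below. The image name, namespace, and binary path are placeholders, and the hostname requirement is met here by writing the Downward-API-provided node name into /labeller/hostname before starting the Labeller, which is one possible approach rather than the one the example configuration necessarily uses:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-labeller
  namespace: kube-system                   # illustrative namespace
spec:
  selector:
    matchLabels:
      name: node-labeller
  template:
    metadata:
      labels:
        name: node-labeller
    spec:
      serviceAccountName: node-labeller    # bound to the ClusterRole above
      containers:
        - name: node-labeller
          image: <node-labeller-image>     # placeholder; use the published image
          command: ["sh", "-c"]
          # Write the node name into /labeller/hostname, then start the Labeller.
          args:
            - echo "$NODE_NAME" > /labeller/hostname && exec /root/k8s-node-labeller
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName # the name of the node this Pod runs on
          volumeMounts:
            - name: labeller
              mountPath: /labeller
      volumes:
        - name: labeller
          emptyDir: {}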
The Labeller currently creates node labels for the following AMD GPU properties:
- Device ID (-device-id)
- Product Name (-product-name)
- Driver Version (-driver-version)
- Driver Source Version (-driver-src-version)
- VRAM Size (-vram)
- Number of SIMDs (-simd-count)
- Number of Compute Units (-cu-count)
- Firmware and Feature Versions (-firmware)
- GPU Family, as a two-letter acronym (-family)
- SI - Southern Islands
- CI - Sea Islands
- KV - Kaveri
- VI - Volcanic Islands
- CZ - Carrizo
- AI - Arctic Islands
- RV - Raven
- NV - Navi
- VGH - Van Gogh
- GC_11_0_0 - GC 11.0.0
- YC - Yellow Carp
- GC_11_0_1 - GC 11.0.1
- GC_10_3_6 - GC 10.3.6
- GC_10_3_7 - GC 10.3.7
- GC_11_5_0 - GC 11.5.0
Example result
$ kubectl describe node cluster-node-23
Name:               cluster-node-23
Roles:              <none>
Labels:             beta.amd.com/gpu.cu-count.64=1
                    beta.amd.com/gpu.device-id.6860=1
                    beta.amd.com/gpu.family.AI=1
                    beta.amd.com/gpu.simd-count.256=1
                    beta.amd.com/gpu.vram.16G=1
                    beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/hostname=cluster-node-23
Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
                    node.alpha.kubernetes.io/ttl: 0
......
You can selectively expose GPU properties by passing the corresponding flags. For example, to only expose VRAM and Device ID as node labels, run the Node Labeller like this:
$ ./k8s-node-labeller -vram -device-id
Once the Node Labeller is deployed and functional, you can select specific nodes via Kubernetes' label selector. For example, to select nodes with only 8GB of VRAM:
$ kubectl get nodes -l beta.amd.com/gpu.vram.8G
While container scheduling via label selectors works for a heterogeneous cluster, it requires each node to be homogeneous. For example, you can have a node with only Fiji cards and another node with only Vega10 cards in the same cluster, but not a node that has both Fiji and Vega10 cards in it.
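Concretely, a workload can target labelled nodes through a standard nodeSelector (or node affinity). A minimal sketch, reusing the gpu.vram.16G label from the example output above; the Pod name and image are placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload                       # illustrative name
spec:
  nodeSelector:
    beta.amd.com/gpu.vram.16G: "1"         # only schedule onto nodes carrying this label
  containers:
    - name: main
      image: <your-gpu-image>              # placeholder image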