- Config `skuTypes`

  **Description:**

  A `skuType` defines a resource unit in all resource dimensions.

  **Notes:**
  - It is like the Azure VM Series or GCP Machine Types.
  - Currently, `skuTypes` is not directly used by HivedScheduler, but it is used by OpenPAI RestServer to set up proportional Pod resource requests and limits. So, if you are not using OpenPAI RestServer, you can skip configuring it.
  **Example:**

  Assume you have some `K80` nodes of the same SKU in your cluster, and you want to schedule Pods on them:
  - Use `kubectl describe nodes` to check whether these `K80` nodes have nearly the same (Allocatable Resources - All Daemon Pods Requests, such as Pods for Device Plugin, Network Plugin, etc.), especially for gpu, cpu and memory. If not, please fix it. Assume the aligned minimal resources are: 4 gpus, 23 cpus, and 219GB memory.
  - Then proportionally, each gpu request should also carry floor(23/4)=5 cpus and floor(219/4)=54GB memory along with it, so config the `K80` `skuType` as below:

    ```yaml
    physicalCluster:
      skuTypes:
        K80:
          gpu: 1
          cpu: 5
          memory: 54Gi
    ```
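  For instance, if a job asks for 2 `K80` SKUs, OpenPAI RestServer would scale these per-gpu proportions accordingly. A minimal sketch of the resulting Pod resources, assuming the usual K8S resource fields (the exact fields RestServer sets are an assumption here):

  ```yaml
  # Hypothetical Pod resources for a 2-SKU K80 request, scaled from the
  # per-gpu proportions above: 2 gpus, 2*5=10 cpus, 2*54Gi=108Gi memory.
  resources:
    limits:
      nvidia.com/gpu: 2
      cpu: 10
      memory: 108Gi
  ```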
- Config `cellTypes`

  **Description:**

  A `cellType` defines a resource topology of a `skuType`.

  **Notes:**
  - `skuTypes` are also `cellTypes`, but they are all leaf `cellTypes` which have no further internal topology.
  **Example:**
  - Use `nvidia-smi topo --matrix` to figure out the gpu topology on one of the above `K80` nodes:

    ```
            GPU0    GPU1    GPU2    GPU3    CPU Affinity
    GPU0     X      NODE    NODE    NODE    0-11
    GPU1    NODE     X      NODE    NODE    0-11
    GPU2    NODE    NODE     X      NODE    0-11
    GPU3    NODE    NODE    NODE     X      0-11
    ```
  - These 4 gpus are equivalent under the node, so config the `K80-NODE` `cellType` as below:

    ```yaml
    physicalCluster:
      cellTypes:
        K80-NODE:
          childCellType: K80
          childCellNumber: 4
          isNodeLevel: true
    ```
  - Assume you have 3 of the above `K80` nodes under the same network switch or in the same pool, so config the `K80-NODE-POOL` `cellType` as below:

    ```yaml
    physicalCluster:
      cellTypes:
        K80-NODE-POOL:
          childCellType: K80-NODE
          childCellNumber: 3
    ```
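  `cellTypes` can also nest more than one level deep. A hypothetical sketch (this is not the flat `K80` topology above): if the 4 gpus were instead paired under 2 CPU sockets, an intermediate, non-node-level `cellType` could model that:

  ```yaml
  # Hypothetical deeper topology (assumed layout): 2 sockets per node,
  # 2 gpus per socket; only K80-NODE is marked as node level.
  physicalCluster:
    cellTypes:
      K80-SOCKET:
        childCellType: K80
        childCellNumber: 2
      K80-NODE:
        childCellType: K80-SOCKET
        childCellNumber: 2
        isNodeLevel: true
  ```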
- Config `physicalCells`

  **Description:**

  A `physicalCell` defines a resource instance, i.e. a `cellType` instantiated by a specific set of physical devices.

  **Example:**
  - Assume the above 3 `K80` nodes have K8S node names `node1`, `node2` and `node3`, so config a `K80-NODE-POOL` `physicalCell` as below:

    ```yaml
    physicalCluster:
      physicalCells:
      - cellType: K80-NODE-POOL
        cellChildren:
        - cellAddress: node1
        - cellAddress: node2
        - cellAddress: node3
    ```
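  If the cluster grows, more instances can be listed side by side. A sketch with a hypothetical second pool (node names `node4`, `node5` and `node6` are assumptions):

  ```yaml
  # Hypothetical extension: two K80-NODE-POOL instances; node4-node6 are
  # assumed names for a second set of K80 nodes.
  physicalCluster:
    physicalCells:
    - cellType: K80-NODE-POOL
      cellChildren:
      - cellAddress: node1
      - cellAddress: node2
      - cellAddress: node3
    - cellType: K80-NODE-POOL
      cellChildren:
      - cellAddress: node4
      - cellAddress: node5
      - cellAddress: node6
  ```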
- Config `virtualClusters`

  **Description:**

  A `virtualCluster` defines a resource guaranteed quota in terms of `cellTypes`.

  **Example:**
  - Assume you want to partition the above 3 `K80` nodes into 2 virtual clusters: vc1 with 1 node and vc2 with 2 nodes, so config the `vc1` and `vc2` `virtualCluster` as below:

    ```yaml
    virtualClusters:
      vc1:
        virtualCells:
        - cellType: K80-NODE-POOL.K80-NODE
          cellNumber: 1
      vc2:
        virtualCells:
        - cellType: K80-NODE-POOL.K80-NODE
          cellNumber: 2
    ```

    **Notes:**
    - The name of a `virtualCluster` should be constrained by the K8S naming convention.
    - The `virtualCells.cellType` should be fully qualified and should start with a `cellType` which is explicitly referred to in `physicalCells`.
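    Quota can also be guaranteed at a finer granularity than whole nodes. A hypothetical sketch, assuming the fully qualified path can be extended down to the leaf `cellType` (this split is an assumption, not part of the partition above):

    ```yaml
    # Hypothetical alternative: guarantee vc1 two single K80 gpus instead
    # of a whole node, by qualifying the path down to the leaf cellType.
    virtualClusters:
      vc1:
        virtualCells:
        - cellType: K80-NODE-POOL.K80-NODE.K80
          cellNumber: 2
    ```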
- Put it together

  **Example:**

  Finally, after the above steps, your config would be:

  ```yaml
  physicalCluster:
    skuTypes:
      K80:
        gpu: 1
        cpu: 5
        memory: 54Gi
    cellTypes:
      K80-NODE:
        childCellType: K80
        childCellNumber: 4
        isNodeLevel: true
      K80-NODE-POOL:
        childCellType: K80-NODE
        childCellNumber: 3
    physicalCells:
    - cellType: K80-NODE-POOL
      cellChildren:
      - cellAddress: node1
      - cellAddress: node2
      - cellAddress: node3
  virtualClusters:
    vc1:
      virtualCells:
      - cellType: K80-NODE-POOL.K80-NODE
        cellNumber: 1
    vc2:
      virtualCells:
      - cellType: K80-NODE-POOL.K80-NODE
        cellNumber: 2
  ```
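With this config deployed, a Pod can then request cells from a virtual cluster through the HivedScheduler pod-scheduling-spec annotation. A minimal sketch follows; the annotation field names shown (`leafCellType`, `leafCellNumber`, `affinityGroup`) are assumptions about the scheduler's request format and may differ across versions:

```yaml
# Hypothetical Pod requesting 1 K80 leaf cell from vc1; the annotation
# fields below are assumed and should be checked against your version.
apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
  annotations:
    hivedscheduler.microsoft.com/pod-scheduling-spec: |-
      virtualCluster: vc1
      priority: 1000
      leafCellType: K80
      leafCellNumber: 1
      affinityGroup: null
spec:
  schedulerName: hivedscheduler
  containers:
  - name: main
    image: ubuntu:18.04
    command: ["sleep", "infinity"]
```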
To leverage this scheduler to schedule GPUs, if one container in the Pod wants to use the GPUs allocated for the whole Pod, it could contain the below environment variables:
- NVIDIA GPUs

  ```yaml
  env:
  - name: NVIDIA_VISIBLE_DEVICES
    valueFrom:
      fieldRef:
        fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-leaf-cell-isolation']
  ```

  The scheduler directly delivers the GPU isolation decision to nvidia-container-runtime through the Pod Env `NVIDIA_VISIBLE_DEVICES`.

- AMD GPUs

  ```yaml
  env:
  - name: AMD_VISIBLE_DEVICES
    valueFrom:
      fieldRef:
        fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-leaf-cell-isolation']
  ```

  The scheduler directly delivers the GPU isolation decision to rocm-container-runtime through the Pod Env `AMD_VISIBLE_DEVICES`.
The annotation referred to by the env will be populated by the scheduler when it binds the Pod.
If multiple containers in the Pod contain the env, the allocated GPUs are all visible to them, so it is up to these containers to control how to share these GPUs.
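For example, a minimal sketch of two containers that both declare the env (the container names `trainer` and `monitor` are hypothetical); each sees all leaf cells allocated to the Pod:

```yaml
# Both containers read the same isolation annotation, so both see every
# gpu allocated to the Pod; how they share the gpus is up to them.
containers:
- name: trainer
  env:
  - name: NVIDIA_VISIBLE_DEVICES
    valueFrom:
      fieldRef:
        fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-leaf-cell-isolation']
- name: monitor
  env:
  - name: NVIDIA_VISIBLE_DEVICES
    valueFrom:
      fieldRef:
        fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-leaf-cell-isolation']
```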