Commit 2e1fc1f

farshadghodsian authored and sajmera-pensando committed
Doc changes for v1.3.0 release
1 parent be6b754 commit 2e1fc1f

33 files changed: +1249 −345 lines

docs/.markdownlint-cli2.yaml
Lines changed: 1 addition & 2 deletions

@@ -1,3 +1,2 @@
 ignores:
-  - CHANGELOG.md
-  - "vendor/**"
+  - CHANGELOG.md

docs/.markdownlint.yaml
Lines changed: 0 additions & 3 deletions

@@ -1,6 +1,3 @@
-ignores:
-  - CHANGELOG.md
-  - "vendor/**"
 default: true
 MD013: false
 MD024:

docs/conf.py
Lines changed: 1 addition & 1 deletion

@@ -29,4 +29,4 @@
 # Table of contents
 external_toc_path = "./sphinx/_toc.yml"

-exclude_patterns = ['.venv']
+exclude_patterns = ['.venv']

docs/contributing/developer-guide.md
Lines changed: 4 additions & 0 deletions

@@ -8,6 +8,7 @@ This project is not ready yet to accept the external developers commits.

 ## Prerequisites

+- Go v1.20 (due to [open issues](https://github.com/golang/go/issues/65637) with Go v1.21 or v1.22)
 - Docker
 - Kubernetes cluster (v1.29.0+) or OpenShift (4.16+)
 - `kubectl` or `oc` CLI tool configured to access your cluster

@@ -24,6 +25,9 @@ chmod 700 get_helm.sh

 For alternative installation methods, refer to the [Helm Official Website](https://helm.sh/docs/intro/install/).

+- Install Helmify:
+  - Download the released binary from the [Helmify GitHub release page](https://github.com/arttor/helmify/releases/tag/v0.4.13), unpack it, and move it to your `PATH`.
+
 - Clone the repository:

 ```bash

docs/contributing/documentation-build-guide.md
Lines changed: 28 additions & 28 deletions

@@ -4,51 +4,51 @@ This guide provides information for developers who want to contribute to the AMD

 ## Building and Serving the Docs

-- Create a Python Virtual Environment (optional, but recommended)
+1. Create a Python Virtual Environment (optional, but recommended)

-```bash
-python3 -m venv .venv/docs
-source .venv/docs/bin/activate (or source .venv/docs/Scripts/activate on Windows)
-```
+   ```bash
+   python3 -m venv .venv/docs
+   source .venv/docs/bin/activate (or source .venv/docs/Scripts/activate on Windows)
+   ```

-- Install required packages for docs
+2. Install required packages for docs

-```bash
-pip install -r docs/sphinx/requirements.txt
-```
+   ```bash
+   pip install -r docs/sphinx/requirements.txt
+   ```

-- Build the docs
+3. Build the docs

-```bash
-python3 -m sphinx -b html -d _build/doctrees -D language=en ./docs/ docs/_build/html
-```
+   ```bash
+   python3 -m sphinx -b html -d _build/doctrees -D language=en ./docs/ docs/_build/html
+   ```

-- Serve docs locally on port 8000
+4. Serve docs locally on port 8000

-```bash
-python3 -m http.server -d ./docs/_build/html/
-```
+   ```bash
+   python3 -m http.server -d ./docs/_build/html/
+   ```

-- You can now view the docs site by going to http://localhost:8000
+5. You can now view the docs site by going to http://localhost:8000

 ## Auto-building the docs

 The below will allow you to watch the docs directory and rebuild the documentation each time you make a change to the documentation files:

-- Install Sphinx Autobuild package
+1. Install Sphinx Autobuild package

-```bash
-pip install sphinx-autobuild
-```
+   ```bash
+   pip install sphinx-autobuild
+   ```

-- Run the autobuild (will also serve the docs on port 8000 automatically)
+2. Run the autobuild (will also serve the docs on port 8000 automatically)

-```bash
-sphinx-autobuild -b html -d _build/doctrees -D language=en ./docs docs/_build/html --ignore "docs/_build/*" --ignore "docs/sphinx/_toc.yml"
-```
+   ```bash
+   sphinx-autobuild -b html -d _build/doctrees -D language=en ./docs docs/_build/html --ignore "docs/_build/*" --ignore "docs/sphinx/_toc.yml"
+   ```

 ## Troubleshooting

-- **Navigation Menu not displaying new links**
+1. **Navigation Menu not displaying new links**

-  Note that if you've recently added a new link to the navigation menu, previously unchanged pages may not correctly display the new link. To fix this, delete the existing `_build/` directory and rebuild the docs so that the navigation menu will be rebuilt for all pages.
+   Note that if you've recently added a new link to the navigation menu, previously unchanged pages may not correctly display the new link. To fix this, delete the existing `_build/` directory and rebuild the docs so that the navigation menu will be rebuilt for all pages.
Lines changed: 257 additions & 0 deletions

@@ -0,0 +1,257 @@
GPU Partitioning via DCM
========================

- GPUs on a node cannot be partitioned on the fly; all daemonsets using the GPU resource must be brought down before partitioning. Hence we need to taint the node before partitioning.
- The DCM pod ships with a matching toleration:

  - ``key: amd-dcm, value: up, operator: Equal, effect: NoExecute``

- Users can specify additional tolerations if required.
GPU Partitioning Workflow
-------------------------

1. Add tolerations to the required system pods to prevent them from being evicted during the partitioning process.
2. Deploy the DCM pod by applying/updating the DeviceConfig.
3. Taint the node to evict all workloads and prevent new workloads from being scheduled on the node.
4. Label the node to indicate which partitioning profile should be used.
5. DCM will partition the node accordingly.
6. Once partitioning is done, un-taint the node so workloads can be scheduled on it again.
Setting GPU Partitioning
-------------------------

1. Add tolerations to all deployments in the kube-system namespace
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Since tainting a node will bring down all pods/daemonsets, we need to add a toleration to the Kubernetes system pods to prevent them from getting evicted. Pods in the system namespace are responsible for things like DNS, networking, proxying, and the overall proper functioning of your node.

Here we patch all the deployments in the ``kube-system`` namespace with the key ``amd-dcm``, which is used during the tainting process to evict all non-essential pods:

.. tab-set::

   .. tab-item:: Kubernetes

      .. code-block:: bash

         kubectl get deployments -n kube-system -o json | jq -r '.items[] | .metadata.name' | xargs -I {} kubectl patch deployment {} -n kube-system --type='json' -p='[{"op": "add", "path": "/spec/template/spec/tolerations", "value": [{"key": "amd-dcm", "operator": "Equal", "value": "up", "effect": "NoExecute"}]}]'

.. .. tab-item:: OpenShift

.. .. code-block:: bash

.. oc get deployments -n kube-system -o json | jq -r '.items[] | .metadata.name' | xargs -I {} oc patch deployment {} -n kube-system --type='json' -p='[{"op": "add", "path": "/spec/template/spec/tolerations", "value": [{"key": "amd-dcm", "operator": "Equal", "value": "up", "effect": "NoExecute"}]}]'

The above command is convenient as it adds the required toleration everywhere with a single command. However, you can also manually edit any required deployments or pods yourself and add this toleration to any other required pods in your cluster as follows:

.. code-block:: yaml

   # Add this under the spec.template.spec.tolerations object
   tolerations:
   - key: "amd-dcm"
     operator: "Equal"
     value: "up"
     effect: "NoExecute"
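The JSON patch above can be sanity-checked locally before touching a cluster. This sketch applies the same JSON-patch ``add`` operation to a minimal, hypothetical deployment object (the object here is illustrative sample data, not pulled from a real cluster):

```python
import json

# Minimal, hypothetical deployment object (sample data for illustration)
deployment = {
    "metadata": {"name": "coredns", "namespace": "kube-system"},
    "spec": {"template": {"spec": {"containers": [{"name": "coredns"}]}}},
}

# The toleration that the kubectl patch in step 1 adds under
# /spec/template/spec/tolerations
dcm_toleration = {
    "key": "amd-dcm",
    "operator": "Equal",
    "value": "up",
    "effect": "NoExecute",
}

# Equivalent of the JSON-patch "add" op: create (or replace) the
# tolerations list on the pod template
deployment["spec"]["template"]["spec"]["tolerations"] = [dcm_toleration]

print(json.dumps(deployment["spec"]["template"]["spec"]["tolerations"], indent=2))
```

Pods created from this template will then tolerate the ``amd-dcm=up:NoExecute`` taint applied in step 3.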
2. Create DCM Profile ConfigMap
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Next you will need to create the Device Config Manager ConfigMap that specifies the different partitioning profiles you would like to set. Refer to the `Device Config Manager ConfigMap <../dcm/device-config-manager-configmap.html#configmap>`_ page for more details on how to create the DCM ConfigMap.

Below is an example of how to create the ``config-manager-config.yaml`` file with the following two profiles:

- **cpx-profile**: CPX+NPS4 (64 GPU partitions)
- **spx-profile**: SPX+NPS1 (no GPU partitions)

.. code-block:: bash

   cat <<EOF > config-manager-config.yaml
   apiVersion: v1
   kind: ConfigMap
   metadata:
     name: config-manager-config
     namespace: kube-amd-gpu
   data:
     config.json: |
       {
         "gpu-config-profiles":
         {
           "cpx-profile":
           {
             "skippedGPUs": {
               "ids": []
             },
             "profiles": [
               {
                 "computePartition": "CPX",
                 "memoryPartition": "NPS4",
                 "numGPUsAssigned": 8
               }
             ]
           },
           "spx-profile":
           {
             "skippedGPUs": {
               "ids": []
             },
             "profiles": [
               {
                 "computePartition": "SPX",
                 "memoryPartition": "NPS1",
                 "numGPUsAssigned": 8
               }
             ]
           }
         }
       }
   EOF
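Before applying the ConfigMap, it can be useful to validate the embedded ``config.json`` locally. The sketch below parses the example above and estimates the number of GPU partitions each profile exposes, under the assumption that CPX splits each GPU into 8 compute partitions while SPX leaves it whole (consistent with the 64-partition figure quoted for ``cpx-profile`` on an 8-GPU node):

```python
import json

# The example DCM profile config from above, embedded as a string
config_json = """
{
  "gpu-config-profiles": {
    "cpx-profile": {
      "skippedGPUs": {"ids": []},
      "profiles": [
        {"computePartition": "CPX", "memoryPartition": "NPS4", "numGPUsAssigned": 8}
      ]
    },
    "spx-profile": {
      "skippedGPUs": {"ids": []},
      "profiles": [
        {"computePartition": "SPX", "memoryPartition": "NPS1", "numGPUsAssigned": 8}
      ]
    }
  }
}
"""

# Assumed partitions per physical GPU for each compute mode (illustrative)
PARTITIONS_PER_GPU = {"SPX": 1, "CPX": 8}

profiles = json.loads(config_json)["gpu-config-profiles"]

def total_partitions(name: str) -> int:
    """Total GPU partitions a profile would expose on one node."""
    return sum(
        PARTITIONS_PER_GPU[p["computePartition"]] * p["numGPUsAssigned"]
        for p in profiles[name]["profiles"]
    )
```

A ``json.loads`` failure here would also catch syntax errors (for example a trailing comma) before the ConfigMap ever reaches the cluster.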
Now apply the DCM ConfigMap to your cluster:

.. tab-set::

   .. tab-item:: Kubernetes

      .. code-block:: bash

         kubectl apply -f config-manager-config.yaml

.. .. tab-item:: OpenShift

.. .. code-block:: bash

.. oc apply -f config-manager-config.yaml
3. Add Taint to node
~~~~~~~~~~~~~~~~~~~~

To ensure there are no workloads on the node that are using the GPUs, we taint the node to evict any non-essential workloads. To do this, taint the node with the ``amd-dcm=up:NoExecute`` taint. This ensures that only workloads and daemonsets with this specific toleration will remain on the node; all others will terminate. This can be done as follows:

.. tab-set::

   .. tab-item:: Kubernetes

      .. code-block:: bash

         kubectl taint nodes [nodename] amd-dcm=up:NoExecute

.. .. tab-item:: OpenShift

.. .. code-block:: bash

.. oc taint nodes [nodename] amd-dcm=up:NoExecute
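The eviction behaviour of the ``NoExecute`` taint can be modelled in a few lines. This is a simplified sketch of Kubernetes taint matching (``Equal`` operator only, not the full scheduler semantics); it shows why the DCM pod survives the taint while an ordinary GPU workload is evicted:

```python
# The taint applied in step 3, as key/value/effect
taint = {"key": "amd-dcm", "value": "up", "effect": "NoExecute"}

def tolerates(pod_tolerations: list, taint: dict) -> bool:
    """True if any toleration on the pod matches the taint.

    Simplified: only models operator "Equal"; an empty effect
    tolerates all effects for the matching key/value.
    """
    return any(
        t.get("key") == taint["key"]
        and t.get("operator", "Equal") == "Equal"
        and t.get("value") == taint["value"]
        and t.get("effect") in ("", taint["effect"])
        for t in pod_tolerations
    )

# The DCM pod carries the matching toleration; a typical GPU workload does not
dcm_pod = [{"key": "amd-dcm", "operator": "Equal",
            "value": "up", "effect": "NoExecute"}]
gpu_workload = []  # no tolerations -> evicted by NoExecute
```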
4. Label the node with the CPX profile
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Monitor the pods on the node to ensure that all non-essential workloads are being terminated, and wait a short amount of time to ensure the pods have terminated. Once done, label the node with the partition profile you want DCM to apply. In this case we apply the ``cpx-profile`` label as follows; ensure you also pass the ``--overwrite`` flag to account for any existing ``gpu-config-profile`` label:

.. tab-set::

   .. tab-item:: Kubernetes

      .. code-block:: bash

         kubectl label node [nodename] dcm.amd.com/gpu-config-profile=cpx-profile --overwrite

.. .. tab-item:: OpenShift

.. .. code-block:: bash

.. oc label node [nodename] dcm.amd.com/gpu-config-profile=cpx-profile --overwrite

You can also confirm that the label got applied by checking the node:

.. tab-set::

   .. tab-item:: Kubernetes

      .. code-block:: bash

         kubectl describe node [nodename] | grep gpu-config-profile

.. .. tab-item:: OpenShift

.. .. code-block:: bash

.. oc describe node [nodename] | grep gpu-config-profile
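A typo in the profile name would leave the node labelled with a profile DCM cannot find in the ConfigMap. As a hypothetical guard (not part of DCM itself), you can check the label value against the profile names defined in ``config.json`` before labelling the node:

```python
import json

# Profile names as defined in the example config.json above
config = json.loads('{"gpu-config-profiles": {"cpx-profile": {}, "spx-profile": {}}}')
known_profiles = set(config["gpu-config-profiles"])

def is_valid_profile(label_value: str) -> bool:
    """True if the label value names a profile present in the DCM ConfigMap."""
    return label_value in known_profiles
```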
5. Verify GPU partitioning
~~~~~~~~~~~~~~~~~~~~~~~~~~

Connect to the node in your cluster via SSH and run ``amd-smi`` to confirm you now see the new partitions:

.. code-block:: bash

   amd-smi list
6. Remove Taint from the node
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Remove the taint from the node to restart all previous workloads and allow the node to be used again for scheduling workloads:

.. tab-set::

   .. tab-item:: Kubernetes

      .. code-block:: bash

         kubectl taint nodes [nodename] amd-dcm=up:NoExecute-

.. .. tab-item:: OpenShift

.. .. code-block:: bash

.. oc taint nodes [nodename] amd-dcm=up:NoExecute-
Reverting back to SPX (no partitions)
-------------------------------------

.. tab-set::

   .. tab-item:: Kubernetes

      .. code-block:: bash

         kubectl label node [nodename] dcm.amd.com/gpu-config-profile=spx-profile --overwrite

.. .. tab-item:: OpenShift

.. .. code-block:: bash

.. oc label node [nodename] dcm.amd.com/gpu-config-profile=spx-profile --overwrite
Removing Partition Profile label
--------------------------------

.. tab-set::

   .. tab-item:: Kubernetes

      .. code-block:: bash

         kubectl label node [nodename] dcm.amd.com/gpu-config-profile-

.. .. tab-item:: OpenShift

.. .. code-block:: bash

.. oc label node [nodename] dcm.amd.com/gpu-config-profile-
Removing DCM tolerations from all daemonsets in the kube-system namespace
--------------------------------------------------------------------------

.. tab-set::

   .. tab-item:: Kubernetes

      .. code-block:: bash

         kubectl get daemonsets -n kube-system -o json | jq -r '.items[] | .metadata.name' | xargs -I {} kubectl patch daemonset {} -n kube-system --type='json' -p='[{"op": "remove", "path": "/spec/template/spec/tolerations/0"}]'

.. .. tab-item:: OpenShift

.. .. code-block:: bash

.. oc get daemonsets -n kube-system -o json | jq -r '.items[] | .metadata.name' | xargs -I {} oc patch daemonset {} -n kube-system --type='json' -p='[{"op": "remove", "path": "/spec/template/spec/tolerations/0"}]'
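Note that this patch removes the toleration at index ``0``, so it only removes the DCM toleration if that toleration is the first entry in the list. A quick local model of the JSON-patch ``remove`` operation makes the caveat visible (the tolerations below are illustrative sample data):

```python
# Tolerations on a hypothetical daemonset: the DCM toleration sits at
# index 0, followed by a pre-existing toleration that must survive.
tolerations = [
    {"key": "amd-dcm", "operator": "Equal", "value": "up", "effect": "NoExecute"},
    {"key": "node-role.kubernetes.io/control-plane", "operator": "Exists"},
]

# JSON-patch {"op": "remove", "path": ".../tolerations/0"} drops index 0
removed = tolerations.pop(0)
```

If the DCM toleration is not guaranteed to be first, filtering the list by ``key`` rather than removing by index is the safer approach.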
