Commit 2e1fc1f

farshadghodsian authored and sajmera-pensando committed
Doc changes for v1.3.0 release
1 parent be6b754 commit 2e1fc1f

33 files changed: +1249 −345 lines

docs/.markdownlint-cli2.yaml
Lines changed: 1 addition & 2 deletions

@@ -1,3 +1,2 @@
 ignores:
-  - CHANGELOG.md
-  - "vendor/**"
+  - CHANGELOG.md

docs/.markdownlint.yaml
Lines changed: 0 additions & 3 deletions

@@ -1,6 +1,3 @@
-ignores:
-  - CHANGELOG.md
-  - "vendor/**"
 default: true
 MD013: false
 MD024:

docs/conf.py
Lines changed: 1 addition & 1 deletion

@@ -29,4 +29,4 @@
 # Table of contents
 external_toc_path = "./sphinx/_toc.yml"

-exclude_patterns = ['.venv']
+exclude_patterns = ['.venv']

docs/contributing/developer-guide.md
Lines changed: 4 additions & 0 deletions

@@ -8,6 +8,7 @@ This project is not ready yet to accept the external developers commits.

 ## Prerequisites

+- Go v1.20 (due to [open issues](https://github.com/golang/go/issues/65637) with Go v1.21 or v1.22)
 - Docker
 - Kubernetes cluster (v1.29.0+) or OpenShift (4.16+)
 - `kubectl` or `oc` CLI tool configured to access your cluster

@@ -24,6 +25,9 @@ chmod 700 get_helm.sh

 For alternative installation methods, refer to the [Helm Official Website](https://helm.sh/docs/intro/install/).

+- Install Helmify:
+  - Download the released binary from the [Helmify GitHub release page](https://github.com/arttor/helmify/releases/tag/v0.4.13), unpack it, and move it to your `PATH`.
+
 - Clone the repository:

 ```bash

docs/contributing/documentation-build-guide.md
Lines changed: 28 additions & 28 deletions

@@ -4,51 +4,51 @@ This guide provides information for developers who want to contribute to the AMD

 ## Building and Serving the Docs

-- Create a Python Virtual Environment (optional, but recommended)
+1. Create a Python Virtual Environment (optional, but recommended)

-```bash
-python3 -m venv .venv/docs
-source .venv/docs/bin/activate (or source .venv/docs/Scripts/activate on Windows)
-```
+   ```bash
+   python3 -m venv .venv/docs
+   source .venv/docs/bin/activate (or source .venv/docs/Scripts/activate on Windows)
+   ```

-- Install required packages for docs
+2. Install required packages for docs

-```bash
-pip install -r docs/sphinx/requirements.txt
-```
+   ```bash
+   pip install -r docs/sphinx/requirements.txt
+   ```

-- Build the docs
+3. Build the docs

-```bash
-python3 -m sphinx -b html -d _build/doctrees -D language=en ./docs/ docs/_build/html
-```
+   ```bash
+   python3 -m sphinx -b html -d _build/doctrees -D language=en ./docs/ docs/_build/html
+   ```

-- Serve docs locally on port 8000
+4. Serve docs locally on port 8000

-```bash
-python3 -m http.server -d ./docs/_build/html/
-```
+   ```bash
+   python3 -m http.server -d ./docs/_build/html/
+   ```

-- You can now view the docs site by going to http://localhost:8000
+5. You can now view the docs site by going to http://localhost:8000

 ## Auto-building the docs

 The below will allow you to watch the docs directory and rebuild the documentation each time you make a change to the documentation files:

-- Install Sphinx Autobuild package
+1. Install Sphinx Autobuild package

-```bash
-pip install sphinx-autobuild
-```
+   ```bash
+   pip install sphinx-autobuild
+   ```

-- Run the autobuild (will also serve the docs on port 8000 automatically)
+2. Run the autobuild (will also serve the docs on port 8000 automatically)

-```bash
-sphinx-autobuild -b html -d _build/doctrees -D language=en ./docs docs/_build/html --ignore "docs/_build/*" --ignore "docs/sphinx/_toc.yml"
-```
+   ```bash
+   sphinx-autobuild -b html -d _build/doctrees -D language=en ./docs docs/_build/html --ignore "docs/_build/*" --ignore "docs/sphinx/_toc.yml"
+   ```

 ## Troubleshooting

-- **Navigation Menu not displaying new links**
+1. **Navigation Menu not displaying new links**

-  Note that if you've recently added a new link to the navigation menu, previously unchanged pages may not correctly display the new link. To fix this, delete the existing `_build/` directory and rebuild the docs so that the navigation menu will be rebuilt for all pages.
+   Note that if you've recently added a new link to the navigation menu, previously unchanged pages may not correctly display the new link. To fix this, delete the existing `_build/` directory and rebuild the docs so that the navigation menu will be rebuilt for all pages.
Lines changed: 257 additions & 0 deletions

@@ -0,0 +1,257 @@
GPU Partitioning via DCM
========================

- GPUs on a node cannot be partitioned on the fly; all daemonsets using the GPU resource must be brought down before partitioning. Hence we need to taint the node before partitioning.
- The DCM pod ships with a matching toleration:

  - ``key: amd-dcm, value: up, operator: Equal, effect: NoExecute``

- Users can specify additional tolerations if required.
GPU Partitioning Workflow
-------------------------

1. Add tolerations to the required system pods to prevent them from being evicted during the partitioning process.
2. Deploy the DCM pod by applying/updating the DeviceConfig.
3. Taint the node to evict all workloads and prevent new workloads from being scheduled on the node.
4. Label the node to indicate which partitioning profile should be used.
5. DCM will partition the node accordingly.
6. Once partitioning is done, un-taint the node so workloads can be scheduled on it again.
Setting GPU Partitioning
-------------------------

1. Add tolerations to all deployments in the kube-system namespace
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Since tainting a node will bring down all pods/daemonsets, we need to add a toleration to the Kubernetes system pods to prevent them from getting evicted. Pods in the system namespace are responsible for things like DNS, networking, proxying, and the overall proper functioning of your node.

Here we patch all the deployments in the ``kube-system`` namespace with the key ``amd-dcm``, which is used during the tainting process to evict all non-essential pods:

.. tab-set::

   .. tab-item:: Kubernetes

      .. code-block:: bash

         kubectl get deployments -n kube-system -o json | jq -r '.items[] | .metadata.name' | xargs -I {} kubectl patch deployment {} -n kube-system --type='json' -p='[{"op": "add", "path": "/spec/template/spec/tolerations", "value": [{"key": "amd-dcm", "operator": "Equal", "value": "up", "effect": "NoExecute"}]}]'

.. .. tab-item:: OpenShift

.. .. code-block:: bash

.. oc get deployments -n kube-system -o json | jq -r '.items[] | .metadata.name' | xargs -I {} oc patch deployment {} -n kube-system --type='json' -p='[{"op": "add", "path": "/spec/template/spec/tolerations", "value": [{"key": "amd-dcm", "operator": "Equal", "value": "up", "effect": "NoExecute"}]}]'

The above command is convenient as it adds the required toleration everywhere with a single command. However, you can also manually edit any required deployments or pods yourself and add this toleration to any other required pods in your cluster as follows:

.. code-block:: yaml

   # Add this under the spec.template.spec.tolerations object
   tolerations:
   - key: "amd-dcm"
     operator: "Equal"
     value: "up"
     effect: "NoExecute"
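The JSON patch above can be sanity-checked locally before touching a cluster. This sketch applies the same JSON-patch ``add`` operation to a minimal, hypothetical deployment object (the object here is illustrative sample data, not pulled from a real cluster):

```python
import json

# Minimal, hypothetical deployment object (sample data for illustration)
deployment = {
    "metadata": {"name": "coredns", "namespace": "kube-system"},
    "spec": {"template": {"spec": {"containers": [{"name": "coredns"}]}}},
}

# The toleration that the kubectl patch in step 1 adds under
# /spec/template/spec/tolerations
dcm_toleration = {
    "key": "amd-dcm",
    "operator": "Equal",
    "value": "up",
    "effect": "NoExecute",
}

# Equivalent of the JSON-patch "add" op: create (or replace) the
# tolerations list on the pod template
deployment["spec"]["template"]["spec"]["tolerations"] = [dcm_toleration]

print(json.dumps(deployment["spec"]["template"]["spec"]["tolerations"], indent=2))
```

Pods created from this template will then tolerate the ``amd-dcm=up:NoExecute`` taint applied in step 3.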
2. Create DCM Profile ConfigMap
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Next you will need to create the Device Config Manager ConfigMap that specifies the different partitioning profiles you would like to set. Refer to the `Device Config Manager ConfigMap <../dcm/device-config-manager-configmap.html#configmap>`_ page for more details on how to create the DCM ConfigMap.

Below is an example of how to create the ``config-manager-config.yaml`` file with the following two profiles:

- **cpx-profile**: CPX+NPS4 (64 GPU partitions)
- **spx-profile**: SPX+NPS1 (no GPU partitions)

.. code-block:: bash

   cat <<EOF > config-manager-config.yaml
   apiVersion: v1
   kind: ConfigMap
   metadata:
     name: config-manager-config
     namespace: kube-amd-gpu
   data:
     config.json: |
       {
         "gpu-config-profiles":
         {
           "cpx-profile":
           {
             "skippedGPUs": {
               "ids": []
             },
             "profiles": [
               {
                 "computePartition": "CPX",
                 "memoryPartition": "NPS4",
                 "numGPUsAssigned": 8
               }
             ]
           },
           "spx-profile":
           {
             "skippedGPUs": {
               "ids": []
             },
             "profiles": [
               {
                 "computePartition": "SPX",
                 "memoryPartition": "NPS1",
                 "numGPUsAssigned": 8
               }
             ]
           }
         }
       }
   EOF
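Before applying the ConfigMap, it can be useful to validate the embedded ``config.json`` locally. The sketch below parses the example above and estimates the number of GPU partitions each profile exposes, under the assumption that CPX splits each GPU into 8 compute partitions while SPX leaves it whole (consistent with the 64-partition figure quoted for ``cpx-profile`` on an 8-GPU node):

```python
import json

# The example DCM profile config from above, embedded as a string
config_json = """
{
  "gpu-config-profiles": {
    "cpx-profile": {
      "skippedGPUs": {"ids": []},
      "profiles": [
        {"computePartition": "CPX", "memoryPartition": "NPS4", "numGPUsAssigned": 8}
      ]
    },
    "spx-profile": {
      "skippedGPUs": {"ids": []},
      "profiles": [
        {"computePartition": "SPX", "memoryPartition": "NPS1", "numGPUsAssigned": 8}
      ]
    }
  }
}
"""

# Assumed partitions per physical GPU for each compute mode (illustrative)
PARTITIONS_PER_GPU = {"SPX": 1, "CPX": 8}

profiles = json.loads(config_json)["gpu-config-profiles"]

def total_partitions(name: str) -> int:
    """Total GPU partitions a profile would expose on one node."""
    return sum(
        PARTITIONS_PER_GPU[p["computePartition"]] * p["numGPUsAssigned"]
        for p in profiles[name]["profiles"]
    )
```

A ``json.loads`` failure here would also catch syntax errors (for example a trailing comma) before the ConfigMap ever reaches the cluster.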
Now apply the DCM ConfigMap to your cluster:

.. tab-set::

   .. tab-item:: Kubernetes

      .. code-block:: bash

         kubectl apply -f config-manager-config.yaml

.. .. tab-item:: OpenShift

.. .. code-block:: bash

.. oc apply -f config-manager-config.yaml
3. Add Taint to node
~~~~~~~~~~~~~~~~~~~~

To ensure there are no workloads on the node that are using the GPUs, we taint the node to evict any non-essential workloads. To do this, taint the node with the ``amd-dcm=up:NoExecute`` taint. This ensures that only workloads and daemonsets with this specific toleration will remain on the node; all others will terminate. This can be done as follows:

.. tab-set::

   .. tab-item:: Kubernetes

      .. code-block:: bash

         kubectl taint nodes [nodename] amd-dcm=up:NoExecute

.. .. tab-item:: OpenShift

.. .. code-block:: bash

.. oc taint nodes [nodename] amd-dcm=up:NoExecute
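The eviction behaviour of the ``NoExecute`` taint can be modelled in a few lines. This is a simplified sketch of Kubernetes taint matching (``Equal`` operator only, not the full scheduler semantics); it shows why the DCM pod survives the taint while an ordinary GPU workload is evicted:

```python
# The taint applied in step 3, as key/value/effect
taint = {"key": "amd-dcm", "value": "up", "effect": "NoExecute"}

def tolerates(pod_tolerations: list, taint: dict) -> bool:
    """True if any toleration on the pod matches the taint.

    Simplified: only models operator "Equal"; an empty effect
    tolerates all effects for the matching key/value.
    """
    return any(
        t.get("key") == taint["key"]
        and t.get("operator", "Equal") == "Equal"
        and t.get("value") == taint["value"]
        and t.get("effect") in ("", taint["effect"])
        for t in pod_tolerations
    )

# The DCM pod carries the matching toleration; a typical GPU workload does not
dcm_pod = [{"key": "amd-dcm", "operator": "Equal",
            "value": "up", "effect": "NoExecute"}]
gpu_workload = []  # no tolerations -> evicted by NoExecute
```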
4. Label the node with the CPX profile
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Monitor the pods on the node to ensure that all non-essential workloads are being terminated, and wait a short amount of time to ensure the pods have terminated. Once done, label the node with the partition profile you want DCM to apply. In this case we apply the ``cpx-profile`` label as follows; ensure you also pass the ``--overwrite`` flag to account for any existing ``gpu-config-profile`` label:

.. tab-set::

   .. tab-item:: Kubernetes

      .. code-block:: bash

         kubectl label node [nodename] dcm.amd.com/gpu-config-profile=cpx-profile --overwrite

.. .. tab-item:: OpenShift

.. .. code-block:: bash

.. oc label node [nodename] dcm.amd.com/gpu-config-profile=cpx-profile --overwrite

You can also confirm that the label got applied by checking the node:

.. tab-set::

   .. tab-item:: Kubernetes

      .. code-block:: bash

         kubectl describe node [nodename] | grep gpu-config-profile

.. .. tab-item:: OpenShift

.. .. code-block:: bash

.. oc describe node [nodename] | grep gpu-config-profile
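A typo in the profile name would leave the node labelled with a profile DCM cannot find in the ConfigMap. As a hypothetical guard (not part of DCM itself), you can check the label value against the profile names defined in ``config.json`` before labelling the node:

```python
import json

# Profile names as defined in the example config.json above
config = json.loads('{"gpu-config-profiles": {"cpx-profile": {}, "spx-profile": {}}}')
known_profiles = set(config["gpu-config-profiles"])

def is_valid_profile(label_value: str) -> bool:
    """True if the label value names a profile present in the DCM ConfigMap."""
    return label_value in known_profiles
```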
5. Verify GPU partitioning
~~~~~~~~~~~~~~~~~~~~~~~~~~

Connect to the node in your cluster via SSH and run ``amd-smi`` to confirm you now see the new partitions:

.. code-block:: bash

   amd-smi list
6. Remove Taint from the node
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Remove the taint from the node to restart all previous workloads and allow the node to be used again for scheduling workloads:

.. tab-set::

   .. tab-item:: Kubernetes

      .. code-block:: bash

         kubectl taint nodes [nodename] amd-dcm=up:NoExecute-

.. .. tab-item:: OpenShift

.. .. code-block:: bash

.. oc taint nodes [nodename] amd-dcm=up:NoExecute-
Reverting back to SPX (no partitions)
-------------------------------------

.. tab-set::

   .. tab-item:: Kubernetes

      .. code-block:: bash

         kubectl label node [nodename] dcm.amd.com/gpu-config-profile=spx-profile --overwrite

.. .. tab-item:: OpenShift

.. .. code-block:: bash

.. oc label node [nodename] dcm.amd.com/gpu-config-profile=spx-profile --overwrite
Removing Partition Profile label
--------------------------------

.. tab-set::

   .. tab-item:: Kubernetes

      .. code-block:: bash

         kubectl label node [nodename] dcm.amd.com/gpu-config-profile-

.. .. tab-item:: OpenShift

.. .. code-block:: bash

.. oc label node [nodename] dcm.amd.com/gpu-config-profile-
Removing DCM tolerations from all daemonsets in the kube-system namespace
--------------------------------------------------------------------------

.. tab-set::

   .. tab-item:: Kubernetes

      .. code-block:: bash

         kubectl get daemonsets -n kube-system -o json | jq -r '.items[] | .metadata.name' | xargs -I {} kubectl patch daemonset {} -n kube-system --type='json' -p='[{"op": "remove", "path": "/spec/template/spec/tolerations/0"}]'

.. .. tab-item:: OpenShift

.. .. code-block:: bash

.. oc get daemonsets -n kube-system -o json | jq -r '.items[] | .metadata.name' | xargs -I {} oc patch daemonset {} -n kube-system --type='json' -p='[{"op": "remove", "path": "/spec/template/spec/tolerations/0"}]'
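Note that this patch removes the toleration at index ``0``, so it only removes the DCM toleration if that toleration is the first entry in the list. A quick local model of the JSON-patch ``remove`` operation makes the caveat visible (the tolerations below are illustrative sample data):

```python
# Tolerations on a hypothetical daemonset: the DCM toleration sits at
# index 0, followed by a pre-existing toleration that must survive.
tolerations = [
    {"key": "amd-dcm", "operator": "Equal", "value": "up", "effect": "NoExecute"},
    {"key": "node-role.kubernetes.io/control-plane", "operator": "Exists"},
]

# JSON-patch {"op": "remove", "path": ".../tolerations/0"} drops index 0
removed = tolerations.pop(0)
```

If the DCM toleration is not guaranteed to be first, filtering the list by ``key`` rather than removing by index is the safer approach.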
