fix: enhance cluster setup script for model training
This commit improves the script used to set up the cluster for model training.

Key changes:
- Improved the readability through proper formatting
- Fixed `cluster_up` failure due to incorrect `KUBECONFIG` setting
- Applied shellcheck for script linting
- Exported only necessary variables to avoid environment pollution
- Removed `kubectl` command installation as it is already a prerequisite
- Updated to `v0.0.5` for local-dev-cluster and `latest` for
  model-server image
- Updated the README to reflect these changes

Signed-off-by: vprashar2929 <[email protected]>
vprashar2929 committed Jul 29, 2024
1 parent 6c2eff4 commit 51358b2
Showing 2 changed files with 313 additions and 247 deletions.
85 changes: 53 additions & 32 deletions model_training/README.md
@@ -1,62 +1,83 @@
# Contribute to power profiling and model training

<!--toc:start-->
- [Contribute to power profiling and model training](#contribute-to-power-profiling-and-model-training)
- [Requirements](#requirements)
- [Pre-step](#pre-step)
- [Setup](#setup)
- [Prepare cluster](#prepare-cluster)
- [From scratch (no target kubernetes cluster)](#from-scratch-no-target-kubernetes-cluster)
- [For managed cluster](#for-managed-cluster)
- [Run benchmark and collect metrics](#run-benchmark-and-collect-metrics)
- [With manual execution](#with-manual-execution)
- [[Manual Metric Collection and Training with Entrypoint](./cmd_instruction.md)](#manual-metric-collection-and-training-with-entrypointcmdinstructionmd)
- [Clean up](#clean-up)
<!--toc:end-->

## Requirements

- git > 2.22
- kubectl
- yq, jq
- power meter is available
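The tool requirements above can be verified with a quick shell check. Only `git` has a version floor; the rest just need to be on `PATH`, and the power-meter requirement has to be confirmed by hand:

```bash
#!/usr/bin/env bash
# Report which of the required CLI tools are missing from PATH.
# The power-meter requirement cannot be checked this way.
missing=""
for tool in git kubectl yq jq; do
  command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
done
if [ -n "$missing" ]; then
  echo "missing:$missing"
else
  echo "all required tools found"
fi
```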

## Pre-step
1. Fork and clone this repository and move to the `model_training` folder
```bash
git clone
cd model_training
chmod +x script.sh
```

## Setup

### Prepare cluster

### From scratch (no target kubernetes cluster)

> Note: ports 9090 and 5101 should not be in use; they will be used to port-forward Prometheus and the kind registry respectively.
```bash
./script.sh prepare_cluster
```

The script will:

- create a kind cluster `kind-for-training` with registry at port `5101`.
- deploy Prometheus.
- deploy Prometheus RBAC and a node port at `30090` on the kind node, which will be forwarded to port `9090` on the host.
- deploy a service monitor for kepler and reload the Prometheus server.
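The steps above can be pictured as the commands below. This is only an illustrative sketch of the flow, not the script itself: the manifest paths and the registry container name are assumptions, and the `run` helper prints each command instead of executing it when `DRY_RUN` is set.

```bash
#!/usr/bin/env bash
# Illustrative sketch of the prepare_cluster flow; NOT the actual script.
# Manifest paths and the registry container name are assumptions.
run() {
  echo "+ $*"
  [ -n "${DRY_RUN:-}" ] || "$@"   # with DRY_RUN set, only print
}

prepare_cluster_sketch() {
  # 1. kind cluster `kind-for-training` with a local registry on port 5101
  run kind create cluster --name kind-for-training
  run docker run -d -p 5101:5000 --name kind-registry registry:2
  # 2./3. Prometheus plus RBAC, exposed as NodePort 30090 on the kind
  #       node and forwarded to port 9090 on the host
  run kubectl apply -f prometheus/
  run kubectl port-forward svc/prometheus 9090:30090
  # 4. service monitor for kepler, then reload Prometheus (SIGHUP)
  run kubectl apply -f kepler-service-monitor.yaml
  run kubectl exec deploy/prometheus -- kill -HUP 1
}

DRY_RUN=1 prepare_cluster_sketch   # print the flow without touching anything
```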

### For managed cluster

Please confirm the following requirements:
- Kepler installation
- Prometheus installation
- Kepler metrics are exported to the Prometheus server
- Prometheus server is available at `http://localhost:9090`. Otherwise, set the environment variable `PROM_SERVER`.
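The last two points can be checked against the server directly. A sketch, assuming the standard Prometheus HTTP API and that kepler metric names start with `kepler_`:

```bash
#!/usr/bin/env bash
# Default to the local endpoint unless PROM_SERVER is already set.
PROM_SERVER="${PROM_SERVER:-http://localhost:9090}"

check_prometheus() {
  # Prometheus readiness endpoint
  curl -sf "${PROM_SERVER}/-/ready" >/dev/null || {
    echo "Prometheus not reachable at ${PROM_SERVER}" >&2
    return 1
  }
  # Kepler exports metrics whose names start with kepler_
  curl -sf "${PROM_SERVER}/api/v1/label/__name__/values" |
    jq -e '[.data[] | select(startswith("kepler_"))] | length > 0' >/dev/null || {
      echo "no kepler_* metrics found at ${PROM_SERVER}" >&2
      return 1
    }
  echo "ok: kepler metrics available at ${PROM_SERVER}"
}
```

Run `check_prometheus` after pointing `PROM_SERVER` at the managed cluster's Prometheus server.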

### Run benchmark and collect metrics

There are two options to run the benchmark and collect the metrics: the [CPE-operator](https://github.com/IBM/cpe-operator) with a manual script, and the [Tekton Pipeline](https://github.com/tektoncd/pipeline).

> The CPE operator is slated for deprecation. We are transitioning to automating the collection and training processes through the Tekton pipeline. Nevertheless, the CPE operator may still be considered for customized benchmarks that require performance values per sub-workload within the benchmark suite.

- [Tekton Pipeline Instruction](./tekton/README.md)

- [CPE Operator Instruction](./cpe_script_instruction.md)


### With manual execution
In addition to the two automation approaches above, you can manually run your own benchmarks, then collect, train, and export the models via the entrypoint `cmd/main.py`.
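For the collection half of a manual run, the kepler time series for a benchmark window can be pulled straight from Prometheus. A sketch: the metric name `kepler_container_joules_total` and the 15-minute window are only examples, and the flags `cmd/main.py` actually takes are documented in the linked instruction.

```bash
#!/usr/bin/env bash
# Pull a kepler time series for a recent window from Prometheus.
# The metric name and the 15-minute window are examples.
PROM_SERVER="${PROM_SERVER:-http://localhost:9090}"
END=$(date -u +%s)
START=$((END - 900))   # 15-minute window ending now

collect_metric() {
  curl -sG "${PROM_SERVER}/api/v1/query_range" \
    --data-urlencode "query=kepler_container_joules_total" \
    --data-urlencode "start=${START}" \
    --data-urlencode "end=${END}" \
    --data-urlencode "step=15s" |
    jq '.data.result | length'   # number of series returned
}
```

Call `collect_metric` once the benchmark has finished and Prometheus is reachable.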

### [Manual Metric Collection and Training with Entrypoint](./cmd_instruction.md)

## Clean up

For kind-for-training cluster:

Run

```bash
./script.sh cleanup
```
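Afterwards, the training cluster should no longer appear in `kind get clusters`. A small check, assuming cleanup deletes the `kind-for-training` cluster:

```bash
#!/usr/bin/env bash
# Succeeds when the training cluster is no longer listed by kind
# (or when kind reports no clusters at all).
cluster_gone() {
  ! kind get clusters 2>/dev/null | grep -qx "kind-for-training"
}

cluster_gone && echo "cleanup verified"
```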