fix: enhance cluster setup script for model training
This commit improves the script used to set up the cluster for model training.

Key changes:
- Improved the readability through proper formatting
- Fixed `cluster_up` failure due to incorrect `KUBECONFIG` setting
- Applied shellcheck for script linting
- Exported only necessary variables to avoid environment pollution
- Removed `kubectl` command installation as it is already a prerequisite
- Updated to `v0.0.5` for local-dev-cluster and `latest` for
  model-server image
- Updated the README to reflect these changes

Signed-off-by: vprashar2929 <[email protected]>
vprashar2929 committed Jul 29, 2024
1 parent 6c2eff4 commit 51358b2
Showing 2 changed files with 313 additions and 247 deletions.
85 changes: 53 additions & 32 deletions model_training/README.md
@@ -1,62 +1,83 @@
# Contribute to power profiling and model training

<!--toc:start-->
- [Contribute to power profiling and model training](#contribute-to-power-profiling-and-model-training)
- [Requirements](#requirements)
- [Pre-step](#pre-step)
- [Setup](#setup)
- [Prepare cluster](#prepare-cluster)
- [From scratch (no target kubernetes cluster)](#from-scratch-no-target-kubernetes-cluster)
- [For managed cluster](#for-managed-cluster)
- [Run benchmark and collect metrics](#run-benchmark-and-collect-metrics)
- [With manual execution](#with-manual-execution)
- [[Manual Metric Collection and Training with Entrypoint](./cmd_instruction.md)](#manual-metric-collection-and-training-with-entrypointcmdinstructionmd)
- [Clean up](#clean-up)
<!--toc:end-->

## Requirements

- git > 2.22
- kubectl
- yq, jq
- power meter is available
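The tool requirements above can be verified with a quick shell check. Only `git` has a version floor; the rest just need to be on `PATH`, and the power-meter requirement has to be confirmed by hand:

```bash
#!/usr/bin/env bash
# Report which of the required CLI tools are missing from PATH.
# The power-meter requirement cannot be checked this way.
missing=""
for tool in git kubectl yq jq; do
  command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
done
if [ -n "$missing" ]; then
  echo "missing:$missing"
else
  echo "all required tools found"
fi
```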

## Pre-step
1. Fork and clone this repository and move to the `model_training` folder
```bash
git clone
cd model_training
chmod +x script.sh
```

## Setup

### Prepare cluster

### From scratch (no target kubernetes cluster)

> Note: ports 9090 and 5101 should not be in use; they will be used to port-forward Prometheus and the kind registry respectively.
```bash
./script.sh prepare_cluster
```

The script will:

- create a kind cluster `kind-for-training` with registry at port `5101`.
- deploy Prometheus.
- deploy Prometheus RBAC and a node port at `30090` on the kind node, which will be forwarded to port `9090` on the host.
- deploy a service monitor for kepler and reload the Prometheus server.
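The steps above can be pictured as the commands below. This is only an illustrative sketch of the flow, not the script itself: the manifest paths and the registry container name are assumptions, and the `run` helper prints each command instead of executing it when `DRY_RUN` is set.

```bash
#!/usr/bin/env bash
# Illustrative sketch of the prepare_cluster flow; NOT the actual script.
# Manifest paths and the registry container name are assumptions.
run() {
  echo "+ $*"
  [ -n "${DRY_RUN:-}" ] || "$@"   # with DRY_RUN set, only print
}

prepare_cluster_sketch() {
  # 1. kind cluster `kind-for-training` with a local registry on port 5101
  run kind create cluster --name kind-for-training
  run docker run -d -p 5101:5000 --name kind-registry registry:2
  # 2./3. Prometheus plus RBAC, exposed as NodePort 30090 on the kind
  #       node and forwarded to port 9090 on the host
  run kubectl apply -f prometheus/
  run kubectl port-forward svc/prometheus 9090:30090
  # 4. service monitor for kepler, then reload Prometheus (SIGHUP)
  run kubectl apply -f kepler-service-monitor.yaml
  run kubectl exec deploy/prometheus -- kill -HUP 1
}

DRY_RUN=1 prepare_cluster_sketch   # print the flow without touching anything
```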

### For managed cluster

Please confirm the following requirements:
- Kepler installation
- Prometheus installation
- Kepler metrics are exported to the Prometheus server
- Prometheus server is available at `http://localhost:9090`. Otherwise, set the environment variable `PROM_SERVER`.
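The last two points can be checked against the server directly. A sketch, assuming the standard Prometheus HTTP API and that kepler metric names start with `kepler_`:

```bash
#!/usr/bin/env bash
# Default to the local endpoint unless PROM_SERVER is already set.
PROM_SERVER="${PROM_SERVER:-http://localhost:9090}"

check_prometheus() {
  # Prometheus readiness endpoint
  curl -sf "${PROM_SERVER}/-/ready" >/dev/null || {
    echo "Prometheus not reachable at ${PROM_SERVER}" >&2
    return 1
  }
  # Kepler exports metrics whose names start with kepler_
  curl -sf "${PROM_SERVER}/api/v1/label/__name__/values" |
    jq -e '[.data[] | select(startswith("kepler_"))] | length > 0' >/dev/null || {
      echo "no kepler_* metrics found at ${PROM_SERVER}" >&2
      return 1
    }
  echo "ok: kepler metrics available at ${PROM_SERVER}"
}
```

Run `check_prometheus` after pointing `PROM_SERVER` at the managed cluster's Prometheus server.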

### Run benchmark and collect metrics

There are two options to run the benchmark and collect the metrics: the [CPE-operator](https://github.com/IBM/cpe-operator) with a manual script, and the [Tekton Pipeline](https://github.com/tektoncd/pipeline).

> The CPE operator is slated for deprecation. We are transitioning to automating the collection and training processes through the Tekton pipeline. Nevertheless, the CPE operator may still be considered for customized benchmarks that require performance values per sub-workload within the benchmark suite.

- [Tekton Pipeline Instruction](./tekton/README.md)

- [CPE Operator Instruction](./cpe_script_instruction.md)


### With manual execution
In addition to the two automation approaches above, you can manually run your own benchmarks, then collect, train, and export the models via the entrypoint `cmd/main.py`.
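For the collection half of a manual run, the kepler time series for a benchmark window can be pulled straight from Prometheus. A sketch: the metric name `kepler_container_joules_total` and the 15-minute window are only examples, and the flags `cmd/main.py` actually takes are documented in the linked instruction.

```bash
#!/usr/bin/env bash
# Pull a kepler time series for a recent window from Prometheus.
# The metric name and the 15-minute window are examples.
PROM_SERVER="${PROM_SERVER:-http://localhost:9090}"
END=$(date -u +%s)
START=$((END - 900))   # 15-minute window ending now

collect_metric() {
  curl -sG "${PROM_SERVER}/api/v1/query_range" \
    --data-urlencode "query=kepler_container_joules_total" \
    --data-urlencode "start=${START}" \
    --data-urlencode "end=${END}" \
    --data-urlencode "step=15s" |
    jq '.data.result | length'   # number of series returned
}
```

Call `collect_metric` once the benchmark has finished and Prometheus is reachable.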

### [Manual Metric Collection and Training with Entrypoint](./cmd_instruction.md)

## Clean up

For kind-for-training cluster:

Run

```bash
./script.sh cleanup
```
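Afterwards, the training cluster should no longer appear in `kind get clusters`. A small check, assuming cleanup deletes the `kind-for-training` cluster:

```bash
#!/usr/bin/env bash
# Succeeds when the training cluster is no longer listed by kind
# (or when kind reports no clusters at all).
cluster_gone() {
  ! kind get clusters 2>/dev/null | grep -qx "kind-for-training"
}

cluster_gone && echo "cleanup verified"
```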