Merge pull request #320 from vprashar2929/update-script
fix: enhance cluster setup script for model training
sunya-ch authored Jul 29, 2024
2 parents 5621759 + 51358b2 commit 0b42c0e
Showing 2 changed files with 313 additions and 247 deletions.
85 changes: 53 additions & 32 deletions model_training/README.md
@@ -1,62 +1,83 @@
# Contribute to power profiling and model training

<!--toc:start-->
- [Contribute to power profiling and model training](#contribute-to-power-profiling-and-model-training)
  - [Requirements](#requirements)
  - [Pre-step](#pre-step)
  - [Setup](#setup)
    - [Prepare cluster](#prepare-cluster)
    - [From scratch (no target kubernetes cluster)](#from-scratch-no-target-kubernetes-cluster)
    - [For managed cluster](#for-managed-cluster)
    - [Run benchmark and collect metrics](#run-benchmark-and-collect-metrics)
    - [With manual execution](#with-manual-execution)
      - [Manual Metric Collection and Training with Entrypoint](#manual-metric-collection-and-training-with-entrypointcmdinstructionmd)
  - [Clean up](#clean-up)
<!--toc:end-->

## Requirements

- git > 2.22
- kubectl
- yq, jq
- a power meter is available
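
A quick, optional way to confirm the tooling is in place (a minimal sketch; the version output still has to be checked by eye, e.g. git needs to be newer than 2.22):

```bash
# Print the versions of the required CLI tools (informational only)
git --version
kubectl version --client
yq --version
jq --version
```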

## Pre-step

- Fork and clone this repository and move to the `model_training` folder

  ```bash
  git clone <your-fork-url>
  cd model_training
  ```

## Setup

### Prepare cluster

### From scratch (no target kubernetes cluster)

> Note: ports 9090 and 5101 must not be in use; they will be used to port-forward Prometheus and the kind registry, respectively.
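
A small optional check that both ports are free before running the script (assuming `lsof` is installed; it prints nothing when a port is unused):

```bash
# Ports 9090 (Prometheus port-forward) and 5101 (kind registry) must be free;
# no output means the port is not in use.
lsof -i :9090 -i :5101
```
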
```bash
./script.sh prepare_cluster
```

The script will:

- create a kind cluster `kind-for-training` with a registry at port `5101`.
- deploy Prometheus.
- deploy Prometheus RBAC and a node port on port `30090` of the kind node, which is forwarded to port `9090` on the host.
- deploy a service monitor for Kepler and reload the Prometheus server configuration.
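
After the script finishes, a rough sanity check could look like the sketch below. It assumes kind's default context naming (`kind-<cluster-name>`, i.e. `kind-kind-for-training`) and that the forward to port `9090` is active:

```bash
# Confirm the kind cluster exists and its nodes are reachable
kind get clusters                                   # should list kind-for-training
kubectl --context kind-kind-for-training get nodes  # assumes kind's default context name

# Confirm Prometheus answers on the forwarded port
curl -s http://localhost:9090/-/ready
```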

### For managed cluster

Please confirm the following requirements:

- Kepler installation
- Prometheus installation
- Kepler metrics are exported to the Prometheus server
- Prometheus server is available at `http://localhost:9090`; otherwise, set the environment variable `PROM_SERVER`.
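
One hedged way to verify the last two points is sketched below; `kepler_node_package_joules_total` is used only as an example of a Kepler-exported metric, and `PROM_SERVER` should point at your Prometheus endpoint:

```bash
# Default to the local port-forward if PROM_SERVER is not already set
export PROM_SERVER=${PROM_SERVER:-http://localhost:9090}

# Check that Prometheus is up and that Kepler metrics are being scraped
# (the metric name below is only an example).
curl -s "${PROM_SERVER}/-/ready"
curl -s "${PROM_SERVER}/api/v1/query?query=kepler_node_package_joules_total" | jq '.data.result | length'
```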

### Run benchmark and collect metrics

There are two options to run the benchmark and collect the metrics: [CPE-operator](https://github.com/IBM/cpe-operator) with a manual script, and [Tekton Pipeline](https://github.com/tektoncd/pipeline).

> The CPE operator is slated for deprecation. We are transitioning to automating the collection and training processes through the Tekton pipeline. Nevertheless, the CPE operator may still be considered for customized benchmarks that require performance values per sub-workload within the benchmark suite.

- [Tekton Pipeline Instruction](./tekton/README.md)

- [CPE Operator Instruction](./cpe_script_instruction.md)

### With manual execution

In addition to the two automation approaches above, you can manually run your own benchmarks, then collect, train, and export the models with the entrypoint `cmd/main.py`.

[Manual Metric Collection and Training with Entrypoint](./cmd_instruction.md)
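
As a rough outline only (the subcommand names below are illustrative; the actual interface of `cmd/main.py` is documented in [cmd_instruction.md](./cmd_instruction.md)), the manual flow mirrors the collect, train, and export steps described above:

```bash
# Illustrative sequence only -- consult ./cmd_instruction.md for the real
# subcommands and arguments of cmd/main.py.
python cmd/main.py query    # 1. collect Kepler metrics from the Prometheus server
python cmd/main.py train    # 2. train power models from the collected data
python cmd/main.py export   # 3. export the trained models
```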

## Clean up

For the kind-for-training cluster:

```bash
./script.sh cleanup
```
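
To confirm the cleanup (for the from-scratch kind setup), the training cluster should no longer be listed:

```bash
# kind-for-training should not appear in the output
kind get clusters
```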