fix: enhance cluster setup script for model training
This commit improves the script used to setup the cluster for model training

Key changes:
- Improved the readability through proper formatting
- Fixed `cluster_up` failure due to incorrect `KUBECONFIG` setting
- Applied shellcheck for script linting
- Exported only necessary variables to avoid environment pollution
- Removed `kubectl` command installation as it is already a pre-requisite
- Updated to `v0.0.5` for local-dev-cluster and `latest` for model-server image
- Updated the README to reflect these changes

Signed-off-by: vprashar2929 <[email protected]>
1 parent 6c2eff4, commit 51358b2
Showing 2 changed files with 313 additions and 247 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,62 +1,83 @@ | ||
# Contribute to power profiling amd model training | ||
# Contribute to power profiling and model training | ||
|
||
<!--toc:start--> | ||
- [Contribute to power profiling and model training](#contribute-to-power-profiling-and-model-training) | ||
- [Requirements](#requirements) | ||
- [Pre-step](#pre-step) | ||
- [Setup](#setup) | ||
- [Prepare cluster](#prepare-cluster) | ||
- [From scratch (no target kubernetes cluster)](#from-scratch-no-target-kubernetes-cluster) | ||
- [For managed cluster](#for-managed-cluster) | ||
- [Run benchmark and collect metrics](#run-benchmark-and-collect-metrics) | ||
- [With manual execution](#with-manual-execution) | ||
- [[Manual Metric Collection and Training with Entrypoint](./cmd_instruction.md)](#manual-metric-collection-and-training-with-entrypointcmdinstructionmd) | ||
- [Clean up](#clean-up) | ||
<!--toc:end--> | ||
|
||
## Requirements | ||
|
||
- git > 2.22 | ||
- kubectl | ||
- yq, jq | ||
- power meter is available | ||
|
||
## Pre-step | ||
1. Fork and clone this repository and move to profile folder | ||
```bash | ||
git clone | ||
cd model_training | ||
chmod +x script.sh | ||
``` | ||
## 1. Prepare cluster | ||
|
||
### From scratch (no target kubernetes cluster) | ||
- port 9090 and 5101 not being used (will be used in port-forward for prometheus and kind registry respectively) | ||
- Fork and clone this repository and move to `model_training` folder | ||
|
||
Run | ||
```bash | ||
git clone | ||
cd model_training | ||
``` | ||
./script.sh prepare_cluster | ||
``` | ||
The script will | ||
1. create a kind cluster `kind-for-training` with registry at port `5101`. | ||
2. deploy Prometheus. | ||
3. deploy Prometheus RBAC and node port to `30090` port on kind node which will be forwarded to `9090` port on the host. | ||
4. deploy service monitor for kepler and reload to Prometheus server | ||
|
||
## Setup | ||
|
||
### Prepare cluster | ||
|
||
### From scratch (no target kubernetes cluster) | ||
|
||
> Note: port 9090 and 5101 should not being used. It will be used in port-forward for prometheus and kind registry respectively | ||
```bash | ||
./script.sh prepare_cluster | ||
``` | ||
|
||
The script will: | ||
|
||
- create a kind cluster `kind-for-training` with registry at port `5101`. | ||
- deploy Prometheus. | ||
- deploy Prometheus RBAC and node port to `30090` port on kind node which will be forwarded to `9090` port on the host. | ||
- deploy service monitor for kepler and reload to Prometheus server | ||
|
||
### For managed cluster | ||
|
||
Please confirm the following requirements: | ||
1. Kepler installation | ||
2. Prometheus installation | ||
3. Kepler metrics are exported to Promtheus server | ||
4. Prometheus server is available at `http://localhost:9090`. Otherwise, set environment `PROM_SERVER`. | ||
|
||
## 2. Run benchmark and collect metrics | ||
- Kepler installation | ||
- Prometheus installation | ||
- Kepler metrics are exported to Promtheus server | ||
- Prometheus server is available at `http://localhost:9090`. Otherwise, set environment `PROM_SERVER`. | ||
|
||
### With benchmark automation and pipeline | ||
There are two options to run the benchmark and collect the metrics, [CPE-operator](https://github.com/IBM/cpe-operator) with manual script and [Tekton Pipeline](https://github.com/tektoncd/pipeline). | ||
### Run benchmark and collect metrics | ||
|
||
There are two options to run the benchmark and collect the metrics, [CPE-operator](https://github.com/IBM/cpe-operator) with manual script and [Tekton Pipeline](https://github.com/tektoncd/pipeline). | ||
|
||
> The adoption of the CPE operator is slated for deprecation. We are on transitioning to the automation of collection and training processes through the Tekton pipeline. Nevertheless, the CPE operator might still be considered for usage in customized benchmarks requiring performance values per sub-workload within the benchmark suite. | ||
|
||
### [Tekton Pipeline Instruction](./tekton/README.md) | ||
- [Tekton Pipeline Instruction](./tekton/README.md) | ||
|
||
- [CPE Operator Instruction](./cpe_script_instruction.md) | ||
|
||
### [CPE Operator Instruction](./cpe_script_instruction.md) | ||
With manual execution | ||
|
||
### With manual execution | ||
In addition to the above two automation approach, you can manually run your own benchmarks, then collect, train, and export the models by the entrypoint `cmd/main.py` | ||
|
||
### [Manual Metric Collection and Training with Entrypoint](./cmd_instruction.md) | ||
[Manual Metric Collection and Training with Entrypoint](./cmd_instruction.md) | ||
|
||
## Clean up | ||
|
||
### For kind-for-training cluster | ||
For kind-for-training cluster: | ||
|
||
Run | ||
``` | ||
```bash | ||
./script.sh cleanup | ||
``` |
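
The requirements in the updated README (tools on `PATH`, ports 9090 and 5101 free, `PROM_SERVER` as an override) could be checked before running `./script.sh prepare_cluster` on the from-scratch path. The following is a minimal sketch and not part of this commit. The helper names `port_in_use`, `prom_server_url`, `missing_tools`, and `preflight` are hypothetical. The port check relies on bash's `/dev/tcp` redirection, so it assumes bash rather than a plain POSIX `sh`.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight helpers; none of these names exist in script.sh.

# Return 0 if something already accepts connections on the given local TCP port.
port_in_use() {
    local port="$1"
    # bash-only: opening /dev/tcp/<host>/<port> succeeds only if the connection is accepted.
    (exec 3<>"/dev/tcp/127.0.0.1/${port}") 2>/dev/null
}

# Print the Prometheus URL, falling back to the address named in the README.
prom_server_url() {
    echo "${PROM_SERVER:-http://localhost:9090}"
}

# Print each command from the argument list that is not found on PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Check the from-scratch prerequisites; return non-zero if any check fails.
preflight() {
    local port missing rc=0
    for port in 9090 5101; do
        if port_in_use "$port"; then
            echo "port ${port} is already in use" >&2
            rc=1
        fi
    done
    missing="$(missing_tools git kubectl yq jq)"
    if [ -n "$missing" ]; then
        # Join the one-per-line tool names with spaces for a single message.
        echo "missing required tools: ${missing//$'\n'/ }" >&2
        rc=1
    fi
    return "$rc"
}
```

With these functions sourced, `preflight && ./script.sh prepare_cluster` would only start the cluster setup when the checks pass. On a managed cluster the port checks do not apply, since Prometheus is expected to be listening already at the address printed by `prom_server_url`.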