Commit 31dc851 ("first commit", 0 parents)

22 files changed: +3126, -0 lines

.github/workflows/deploy.yml

Lines changed: 64 additions & 0 deletions
```yaml
name: Deploy Project
on:
  push:
    branches:
      - main
  pull_request:
    branches:
      - main

env:
  GH_HEAD_REF: ${{ github.head_ref }}
  GH_REF: ${{ github.ref_name }}

permissions:
  id-token: write
  contents: read
  pull-requests: write

jobs:
  deploy:
    name: Deploy Project
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4
        with:
          ref: ${{ github.event.pull_request.head.sha }}
          fetch-depth: 0

      - name: Set up Python
        uses: actions/setup-python@v5   # v1 predates Python 3.12 support
        with:
          python-version: "3.12"

      - name: Install dependencies
        run: |
          python3 -m pip install -U requests
          python3 -m pip install outerbounds pyyaml
          python3 -m pip install -U ob-project-utils

      - name: Configure Outerbounds
        run: |
          PROJECT_NAME=$(yq .project obproject.toml)
          DEFAULT_CICD_USER="${PROJECT_NAME//_/-}-cicd"
          PLATFORM=$(yq .platform obproject.toml)
          CICD_USER=$(yq ".cicd_user // \"$DEFAULT_CICD_USER\"" obproject.toml)
          PERIMETER="default"
          echo "🏗️ Deployment target:"
          echo "  Platform: $PLATFORM"
          echo "  CI/CD User: $CICD_USER"
          echo "  Perimeter: $PERIMETER"
          outerbounds service-principal-configure \
            --name "$CICD_USER" \
            --deployment-domain "$PLATFORM" \
            --perimeter "$PERIMETER" \
            --github-actions

      - name: Deploy Project
        env:
          COMMIT_URL: "https://github.com/${{ github.repository }}/commit/"
          CI_URL: "https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}"
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          COMMENTS_URL: ${{ github.event.pull_request.comments_url }}
          PYTHONUNBUFFERED: 1
        run: obproject-deploy
```

.python-version

Lines changed: 1 addition & 0 deletions
```
3.13
```

README.md

Lines changed: 142 additions & 0 deletions
## Hyperparameter Optimization Project

This repository shows you how to run a hyperparameter optimization (HPO) system as an Outerbounds project.
This `README.md` explains why you'd want to connect these concepts and shows you how to launch HPO jobs for:
- classical ML models
- deep learning models
- end-to-end system tuning

If you have never deployed an Outerbounds project, please read [the documentation page](/outerbounds/project-setup/) before continuing.

### Local/workstation dependencies

[Install uv](https://docs.astral.sh/uv/getting-started/installation/).

From your laptop or Outerbounds workstation, run:
```bash
uv sync
```

Then configure your Outerbounds token. Ask in Slack if you're unsure how.

### Optuna integration
This system is an integration between [Optuna](https://optuna.org/), a feature-rich, open-source hyperparameter optimization framework, and Outerbounds. It leverages functionality built into your Outerbounds deployment to run a persistent relational database that tasks and applications can communicate with. The Optuna dashboard runs as an Outerbounds app, enabling sophisticated analysis of hyperparameter tuning runs.
### How to use this repository

#### Deploy the Optuna dashboard application

The Outerbounds app that will run your Optuna dashboard is defined in [`./deployments/optuna-dashboard/config.yml`](./deployments/optuna-dashboard/config.yml).
When you push to the main branch of this repository, the `obproject-deployer` will create the application in your Outerbounds project branch.
If you'd like to manually deploy the application:

```bash
cd deployments/optuna-dashboard
uv run outerbounds app deploy --config-file config.yml
```
#### Run a workflow

There are two demos implemented within this project, in `flows/tree-model` and `flows/nn`.
Each workflow template defines:
- a `flow.py` containing a `FlowSpec`,
- a single `config.json` to set system variables and hyperparameter configurations,
- an `hpo_client.py` containing entrypoints to run and trigger the flow,
- notebooks showing how to run and analyze the results of hyperparameter tuning runs, and
- a modular, fully customizable objective function.

For the rest of this section we'll use the `flows/nn` template, as everything else is the same as for `flows/tree-model`.

```bash
cd flows/nn
```
##### Setting configs
Before running or deploying the workflows, look at the relationship between the flow and the `config.json` file.

Set the `compute_pool` variable based on the compute pools available in your Outerbounds deployment.
If you are new to compute pools, please visit the documentation or consult your Outerbounds admins/Slack for guidance.

As long as you haven't changed anything when deploying the application hosting the Optuna dashboard, you do not need to change anything besides the `compute_pool` in that file,
but it is useful to be familiar with its contents and the way the configuration files interact with Metaflow code.

##### Regular Metaflow usage
To run the flow directly (the standard Metaflow user experience):

```bash
python flow.py --environment=fast-bakery run --with kubernetes
python flow.py --environment=fast-bakery argo-workflows create
python flow.py --environment=fast-bakery argo-workflows trigger
```
##### Using the HPO client
These examples also include a convenience wrapper around the workflows in `hpo_client.py`.
Its purpose is to make the flows easier to use and the abstractions more in line with typical HPO interfaces seen in the wild.

```bash
cd flows/nn
```

There are three client modes:
1. Blocking - `python hpo_client.py -m 1`
2. Async - `python hpo_client.py -m 2`
3. Trigger - `python hpo_client.py -m 3`
   - The trigger option also accepts a `--namespace/-n` parameter, which determines the namespace within which this code path checks for already-deployed flows.
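A hypothetical sketch of how such a mode dispatch could be structured; `run_blocking`, `run_async`, and `trigger_deployed` are illustrative stand-ins, not this project's actual functions:

```python
import argparse

# Illustrative stand-ins for the real client's entrypoints.
def run_blocking():
    return "blocking"              # launch the flow and wait for it to finish

def run_async():
    return "async"                 # launch the flow and return immediately

def trigger_deployed(namespace):
    return f"trigger:{namespace}"  # trigger an already-deployed flow

def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("-m", "--mode", type=int, choices=[1, 2, 3], required=True)
    parser.add_argument("-n", "--namespace", default="default",
                        help="only used by mode 3 (trigger)")
    args = parser.parse_args(argv)
    if args.mode == 1:
        return run_blocking()
    if args.mode == 2:
        return run_async()
    return trigger_deployed(args.namespace)

print(main(["-m", "3", "-n", "prod"]))  # prints trigger:prod
```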
### Optuna 101

This implementation wraps the standard Optuna interface, aiming to balance two goals:
1. Provide full expressiveness and compatibility with open-source Optuna features.
2. Provide an opinionated and streamlined interface for launching HPO studies as Metaflow flows.

#### The objective function
Typically, Optuna programs are developed in Python scripts.
An objective function returns 1 or 2 values.
Its argument is a [`trial`](https://optuna.readthedocs.io/en/stable/reference/trial.html),
representing a single execution of the objective function; in other words, a sample drawn from the hyperparameter search space.

```python
def objective(trial):
    x = trial.suggest_float("x", -100, 100)
    y = trial.suggest_categorical("y", [-1, 0, 1])
    f1 = x**2 + y
    f2 = -((x - 2) ** 2 + y)
    return f1, f2
```
The key task for a user of the `HPORunner` abstraction this project provides (`from outerbounds.hpo import HPORunner`) is to determine:
1. How to define the objective function?
2. What data, model, and code does the objective function depend on?
3. How many trials do you want to run per study?

With answers to these questions, you'll be ready to adapt your objective functions as demonstrated in the example [`flows/`](./flows/) and [`notebooks/`](./notebooks/) and call the `HPORunner` interface to automate HPO workflows.

#### Note on search spaces
Notice that with Optuna, the user imperatively defines the hyperparameter space through how the `trial` object is used within the `objective` function.
The number of variables for which we call `trial.suggest_*` defines the dimensionality of the search space.
Be judicious about adding parameters: many algorithms, especially Bayesian optimization, suffer performance degradation when many more than 5-10 parameters are tuned simultaneously.

[Read more](https://optuna.readthedocs.io/en/stable/tutorial/10_key_features/002_configurations.html#configurations).
#### Studies, samplers, and pruners
To optimize the hyperparameters, we create a study.
Optuna implements many families of optimization algorithms, exposed as [`optuna.samplers`](https://optuna.readthedocs.io/en/stable/reference/samplers/index.html). These include grid, random, tree-structured Parzen estimators, evolutionary algorithms (CMA-ES, NSGA-II), Gaussian processes, quasi-Monte Carlo methods, and more.

For example, if you wanted to sample the hyperparameter space purely at random (no learning throughout the study) 10 times, you'd run:
```python
study = optuna.create_study(sampler=optuna.samplers.RandomSampler())
study.optimize(objective, n_trials=10)
```

Sometimes it is desirable to stop unpromising trials early. Optuna's mechanism for this is [`optuna.pruners`](https://optuna.readthedocs.io/en/stable/reference/pruners.html): a pruner uses the intermediate objective values reported by previous trials to decide whether the current trial should be pruned.
#### Resuming studies
To resume a study, simply pass in the name of the previous study.
If you leverage the Metaflow versioning scheme, which uses the Metaflow Run pathspec as the study name (in other words, you do not override the study name via configs or the CLI), then
you can set this value in the config and resume the study. You can also override it on the command line using the `hpo_client`'s `--resume-study/-r` option:

```bash
python hpo_client.py -m 1 -r TreeModelHpoFlow/argo-hposystem.prod.treemodelhpoflow-7ntvz
```
## TODO
- Benchmark gRPC vs. pure RDB scaling thresholds. When is it worth it to do gRPC? How hard is that to implement? How do costs scale in each mode?
deployments/optuna-dashboard/config.yml

Lines changed: 19 additions & 0 deletions
```yaml
name: hpo-dashboard
port: 8088

commands:
  - gunicorn --workers 2 --bind 0.0.0.0:8088 main:app

dependencies:
  pypi:
    optuna-dashboard: ""
    psycopg2-binary: ""
    gunicorn: ""
    werkzeug: ""

resources:
  cpu: "2"
  memory: "4Gi"
  ephemeralStorage: "10Gi"

persistence: postgres
```

deployments/optuna-dashboard/main.py

Lines changed: 24 additions & 0 deletions
```python
from optuna_dashboard import wsgi
from optuna.storages import RDBStorage
from werkzeug.middleware.proxy_fix import ProxyFix
import os
import json


def get_mf_token():
    with open(os.path.join(os.environ["METAFLOW_HOME"], "config.json"), "r") as f:
        conf = json.loads(f.read())
    return conf["METAFLOW_SERVICE_AUTH_KEY"]


def generate_db_url():
    # This function mirrors the metaflow.plugins.optuna.get_db_url function used in the /flows.
    # FIXME: Reuse/consolidate existing function.
    mf_token = get_mf_token()
    return f"postgresql://userspace_default:{mf_token}@localhost:5432/userspace_default?sslmode=disable"


STORAGE_URL = generate_db_url()

base_app = wsgi(RDBStorage(STORAGE_URL))
app = ProxyFix(base_app, x_for=1, x_proto=1, x_host=1, x_port=1)
```

flows/nn/config.json

Lines changed: 27 additions & 0 deletions
```json
{
  "compute_pool": "obp-main",
  "n_trials": 15,
  "trials_per_task": 1,

  "directions": [
    "minimize",
    "maximize"
  ],

  "optuna_app_name": "hpo-dashboard",
  "environment_builder": "fast-bakery",
  "flow_file": "flow.py",
  "objective_function_file": "objective_fn.py",

  "environment": {
    "python": "3.12",
    "packages": {
      "optuna": "4.5.0",
      "psycopg2-binary": "2.9.10",
      "torch": "2.5.1",
      "torchvision": "0.20.1",
      "pandas": "2.3.2",
      "scipy": "1.16.1"
    }
  }
}
```
