## Hyperparameter Optimization Project

This repository shows you how to run a hyperparameter optimization (HPO) system as an Outerbounds project.
This `README.md` will explain why you'd want to connect these concepts, and will show you how to launch HPO jobs for:
- classical ML models
- deep learning models
- end-to-end system tuning

If you have never deployed an Outerbounds project, please read [the documentation page](/outerbounds/project-setup/) before continuing.

### Local/workstation dependencies

[Install uv](https://docs.astral.sh/uv/getting-started/installation/).

From your laptop or Outerbounds workstation run:
```bash
uv sync
```

Configure your Outerbounds token. Ask in Slack if you're not sure how.
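
Configuration typically looks like the following (a sketch: the exact subcommand may vary by deployment, and the token placeholder below comes from your Outerbounds onboarding page):

```bash
# Paste the token string from your Outerbounds onboarding page.
# Check with your admins if the command differs in your deployment.
uv run outerbounds configure <YOUR_TOKEN>
```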

### Optuna integration
This system is an integration between [Optuna](https://optuna.org/), a feature-rich, open-source hyperparameter optimization framework, and Outerbounds. It leverages functionality built into your Outerbounds deployment to run a persistent relational database that tasks and applications can communicate with. The Optuna dashboard runs as an Outerbounds app, enabling sophisticated analysis of hyperparameter tuning runs.
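
Concretely, each task can point Optuna's storage API at that shared database, so trials from many parallel tasks land in one study. A minimal sketch (the connection string is illustrative; on Outerbounds the real one is supplied by your deployment rather than hard-coded):

```python
import optuna

# Hypothetical connection string for the deployment's relational database.
storage = "postgresql://user:password@db-host:5432/optuna"

# load_if_exists lets many parallel workers attach to the same study.
study = optuna.create_study(
    study_name="example-study",
    storage=storage,
    load_if_exists=True,
)
```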

### How to use this repository

#### Deploy the Optuna dashboard application

The Outerbounds app that will run your Optuna dashboard is defined in [`./deployments/optuna-dashboard/config.yml`](./deployments/optuna-dashboard/config.yml).
When you push to the main branch of this repository, the `obproject-deployer` will create the application in your Outerbounds project branch.
If you'd like to manually deploy the application:

```bash
cd deployments/optuna-dashboard
uv run outerbounds app deploy --config-file config.yml
```

#### Run a workflow

This project includes two demo templates, in `flows/tree-model` and `flows/nn`.
Each workflow template defines:
- a `flow.py` containing a `FlowSpec`,
- a single `config.json` to set system variables and hyperparameter configurations,
- an `hpo_client.py` containing entrypoints to run and trigger the flow,
- notebooks showing how to run and analyze the results of hyperparameter tuning runs, and
- a modular, fully customizable objective function.

For the rest of this section, we'll use the `flows/nn` template; everything else works the same as for `flows/tree-model`.

```bash
cd flows/nn
```

##### Setting configs
Before running or deploying the workflows, investigate the relationship between the flow and the `config.json` file.

Based on the compute pools available in your Outerbounds deployment, set the `compute_pool` variable.
If you are new to compute pools, please visit the documentation or consult your Outerbounds admins/Slack for guidance.

As long as you haven't changed anything when deploying the application hosting the Optuna dashboard, `compute_pool` is the only value you need to change in that file.
Still, it is useful to be familiar with its contents and the way the configuration files interact with Metaflow code.
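
As a rough illustration, the relevant part of `config.json` has this shape (`compute_pool` is the key you must set; the other key is illustrative only - defer to the actual file in the template):

```json
{
  "compute_pool": "your-compute-pool-name",
  "n_trials": 25
}
```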

##### Regular Metaflow usage
To run or deploy the flow directly (i.e., the standard Metaflow user experience):

```bash
# run immediately on Kubernetes
python flow.py --environment=fast-bakery run --with kubernetes

# or deploy to Argo Workflows, then trigger it
python flow.py --environment=fast-bakery argo-workflows create
python flow.py --environment=fast-bakery argo-workflows trigger
```

##### Using the HPO client
These examples also include a convenience wrapper around the workflows in `hpo_client.py`.
Its purpose is to make the flows easier to use and the abstractions more in line with typical HPO interfaces seen in the wild.

There are three client modes:
1. Blocking - `python hpo_client.py -m 1`
2. Async - `python hpo_client.py -m 2`
3. Trigger - `python hpo_client.py -m 3`
   - The trigger mode also accepts a `--namespace/-n` parameter, which determines the namespace in which this code path checks for already-deployed flows.

### Optuna 101

This implementation wraps the standard Optuna interface, aiming to balance two goals:
1. Provide full expressiveness and compatibility with open-source Optuna features.
2. Provide an opinionated and streamlined interface for launching HPO studies as Metaflow flows.

#### The objective function
Typically, Optuna programs are developed in Python scripts.
An objective function returns one value (or several, in a multi-objective study).
Its argument is a [`trial`](https://optuna.readthedocs.io/en/stable/reference/trial.html),
representing a single execution of the objective function; in other words, a sample drawn from the hyperparameter search space.

```python
def objective(trial):
    # Each suggest_* call draws one hyperparameter for this trial.
    x = trial.suggest_float("x", -100, 100)
    y = trial.suggest_categorical("y", [-1, 0, 1])
    # Returning two values makes this a multi-objective problem.
    f1 = x**2 + y
    f2 = -((x - 2) ** 2 + y)
    return f1, f2
```
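
Because this objective returns two values, the corresponding study must be created as a multi-objective study, with one optimization direction per returned value:

```python
import optuna

# Two returned values -> two directions, in the same order as the return.
study = optuna.create_study(directions=["minimize", "maximize"])
study.optimize(objective, n_trials=20)

# Multi-objective studies expose a Pareto front rather than a single best trial.
print(len(study.best_trials))
```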

To use the `HPORunner` abstraction this project provides (`from outerbounds.hpo import HPORunner`), your key task is to determine:
1. How should the objective function be defined?
2. What data, models, and code does the objective function depend on?
3. How many trials do you want to run per study?

With answers to these questions, you'll be ready to adapt your objective functions as demonstrated in the example [`flows/`](./flows/) and [`notebooks/`](./notebooks/) and call the `HPORunner` interface to automate HPO workflows.
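
As a rough sketch of how those answers map onto the interface (the constructor arguments here are hypothetical; consult the templates in [`flows/`](./flows/) for the actual signature):

```python
# Hypothetical usage sketch -- argument names are illustrative only.
from outerbounds.hpo import HPORunner

runner = HPORunner(objective=objective, n_trials=50)  # answers to questions 1 and 3
runner.run()  # data/code dependencies (question 2) are packaged as in the templates
```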

#### Note on search spaces
Notice that with Optuna, the user imperatively defines the hyperparameter space through how the `trial` object is used within the `objective` function.
The number of variables for which we call `trial.suggest_*` defines the dimensionality of the search space.
Be judicious when adding parameters: many algorithms, especially Bayesian optimization, suffer performance degradation when many more than 5-10 parameters are tuned simultaneously.
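
For instance, each `trial.suggest_*` call below adds one dimension, so this toy objective (not drawn from the templates) defines a three-dimensional search space:

```python
def objective(trial):
    # Three suggest_* calls -> a 3-dimensional search space.
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    n_layers = trial.suggest_int("n_layers", 1, 4)
    activation = trial.suggest_categorical("activation", ["relu", "tanh"])
    # Toy score standing in for a real validation metric.
    return lr * n_layers + (1.0 if activation == "relu" else 0.0)
```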

[Read more](https://optuna.readthedocs.io/en/stable/tutorial/10_key_features/002_configurations.html#configurations).

#### Studies, samplers, and pruners
To optimize the hyperparameters, we create a study.
Optuna implements many optimization algorithm families, exposed as [`optuna.samplers`](https://optuna.readthedocs.io/en/stable/reference/samplers/index.html). These include grid search, random search, tree-structured Parzen estimators, evolutionary methods (CMA-ES, NSGA-II), Gaussian processes, quasi-Monte Carlo methods, and more.

For example, if you wanted to sample the hyperparameter space purely at random - no learning throughout the study - 10 times, you'd run:
```python
import optuna

study = optuna.create_study(sampler=optuna.samplers.RandomSampler())
study.optimize(objective, n_trials=10)
```

Sometimes it is desirable to stop unpromising trials early. The mechanism for doing this in Optuna is exposed as [`optuna.pruners`](https://optuna.readthedocs.io/en/stable/reference/pruners.html): pruners use intermediate objective values reported during a trial, compared against those of previous trials, to decide whether the trial should be pruned.
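
A minimal pruning sketch (the loop is a toy stand-in for training epochs): the objective reports intermediate values, asks the pruner whether to stop, and raises `optuna.TrialPruned` if so:

```python
import optuna

def objective(trial):
    x = trial.suggest_float("x", 0.0, 10.0)
    score = 100.0
    for step in range(10):        # toy stand-in for training epochs
        score -= x                # pretend the metric improves each step
        trial.report(score, step)  # expose intermediate state to the pruner
        if trial.should_prune():   # pruner compares against earlier trials
            raise optuna.TrialPruned()
    return score

study = optuna.create_study(pruner=optuna.pruners.MedianPruner())
study.optimize(objective, n_trials=20)
```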

#### Resuming studies
To resume a study, simply pass in the name of the previous study.
If you leverage the Metaflow versioning scheme, which uses the Metaflow Run pathspec as the study name - in other words, if you don't override the study name via configs or the CLI - then
you can set this value in the config and resume the study. You can also override it on the command line using the `hpo_client`'s `--resume-study/-r` option:

```bash
python hpo_client.py -m 1 -r TreeModelHpoFlow/argo-hposystem.prod.treemodelhpoflow-7ntvz
```
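
Under the hood, resuming maps onto Optuna's own resume mechanics: loading an existing study by name from the shared storage. A sketch (the storage URL is illustrative, as above):

```python
import optuna

def objective(trial):
    # Trivial placeholder objective for illustration.
    x = trial.suggest_float("x", -10, 10)
    return x**2

# The study name matches the Metaflow Run pathspec used at creation time.
study = optuna.load_study(
    study_name="TreeModelHpoFlow/argo-hposystem.prod.treemodelhpoflow-7ntvz",
    storage="postgresql://user:password@db-host:5432/optuna",  # illustrative URL
)
study.optimize(objective, n_trials=10)  # continues where the previous run stopped
```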

## TODO
- Benchmark gRPC vs. pure RDB scaling thresholds. When is it worth it to do gRPC? How hard is that to implement? How do costs scale in each mode?