In this tutorial, you will learn how to use and configure the built-in HPO algorithms. Alternatively, you can also use most algorithms from Ray Tune.
First, make sure you have installed the gpsearchers and benchmarks dependencies:
pip install -e .[gpsearchers,benchmarks]
The decision-making algorithms driving an HPO experiment are referred to as schedulers. As in Ray Tune, some of our schedulers are internally configured by a searcher. A scheduler interacts with the back-end, making decisions on which configuration to evaluate next, and whether to stop, pause or resume existing trials. It relays "next configuration" decisions to the searcher. Some searchers maintain a surrogate model which is fitted to metric data coming from evaluations.
FIFOScheduler is the simplest kind of scheduler. It cannot stop or pause trials: each evaluation proceeds to the end. Depending on the searcher, this scheduler supports:
- Random search [searcher='random']
- Bayesian optimization with Gaussian processes [searcher='bayesopt']
Here is a launcher script using FIFOScheduler:
import logging
from syne_tune.backend import LocalBackend
from syne_tune.optimizer.schedulers import FIFOScheduler
from syne_tune import Tuner, StoppingCriterion
from benchmarking.definitions.definition_mlp_on_fashion_mnist import \
    mlp_fashionmnist_benchmark

if __name__ == '__main__':
    logging.getLogger().setLevel(logging.DEBUG)
    n_workers = 4

    # We pick the MLP on FashionMNIST benchmark
    # The 'benchmark' dict contains arguments needed by scheduler and
    # searcher (e.g., 'mode', 'metric'), along with suggested default values
    # for other arguments (which you are free to override)
    benchmark = mlp_fashionmnist_benchmark({'dataset_path': './'})
    config_space = benchmark['config_space']

    backend = LocalBackend(entry_point=benchmark['script'])

    # GP-based Bayesian optimization searcher. Many options can be specified
    # via `search_options`. We keep the defaults, except for
    # `num_init_random`
    searcher = 'bayesopt'
    search_options = {'num_init_random': n_workers + 2}

    # FIFOScheduler. Together with searcher `bayesopt`, this selects Bayesian
    # optimization without early stopping.
    scheduler = FIFOScheduler(
        config_space,
        searcher=searcher,
        search_options=search_options,
        mode=benchmark['mode'],
        metric=benchmark['metric'])

    tuner = Tuner(
        trial_backend=backend,
        scheduler=scheduler,
        stop_criterion=StoppingCriterion(max_wallclock_time=120),
        n_workers=n_workers)

    tuner.run()
What happens in this launcher script?
- We select the mlp_fashionmnist benchmark, adopting its default hyperparameter search space without modifications.
- We select the local back-end, which runs up to n_workers = 4 processes in parallel on the same instance.
- We create a FIFOScheduler with searcher = 'bayesopt'. This means that new configurations to be evaluated are selected by Bayesian optimization, and all trials are run to the end. The scheduler needs to know the config_space, the name of the metric to tune (metric), and whether to minimize or maximize this metric (mode). For mlp_fashionmnist, we have metric = 'accuracy' and mode = 'max', so we select a configuration which maximizes accuracy.
- Options for the searcher can be passed via search_options. We use the defaults, except for setting num_init_random (see below) to the number of workers plus two.
- Finally, we create the tuner, passing backend, scheduler, as well as the stopping criterion for the experiment (stop after 120 seconds) and the number of workers. The experiment is started by tuner.run().
The full range of arguments of FIFOScheduler is documented in syne_tune/optimizer/schedulers/fifo.py. Here, we list the most important ones:
- config_space: Hyperparameter search space. This argument is mandatory. Apart from hyperparameters to be searched over, the space may contain fixed parameters (such as epochs in the example above). A config passed to the training script is always extended by these fixed parameters. If you use a benchmark, you can use benchmark['config_space'] here, or you can modify this default search space.
- searcher: Selects the searcher to be used (see below).
- search_options: Options to configure the searcher (see below).
- metric, mode: Name of the metric to tune (i.e., the key used in the report call by the training script), which is either to be minimized (mode = 'min') or maximized (mode = 'max'). If you use a benchmark, just use benchmark['metric'] and benchmark['mode'] here.
- points_to_evaluate: Specifies a list of configurations which are evaluated first. If your training code corresponds to some open source ML algorithm, you may want to use the defaults provided in the code. The entry (or entries) in points_to_evaluate do not have to specify values for all hyperparameters. For any hyperparameter not listed there, the following rule is used to choose a default. For float and int value types, the mid-point of the search range is used (in linear or log scaling). For categorical value type, the first entry in the value set is used. The default is a single config with all values chosen by the default rule. Pass an empty list in order to not specify any initial configs.
- random_seed: Master random seed. Random sampling in schedulers and searchers is done by a number of numpy.random.RandomState generators, whose seeds are derived from random_seed. If not given, a random seed is sampled.
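The default rule for hyperparameters missing from points_to_evaluate can be illustrated in plain Python. This is only a sketch of the rule as described above, not Syne Tune's implementation: the dict-based description of a hyperparameter, the function names, and the rounding of integer mid-points are assumptions made for illustration.

```python
import math


def default_value(hp):
    """Default for one hyperparameter under the mid-point rule.

    `hp` is a hypothetical plain-dict description (not a Syne Tune type), e.g.
    {'type': 'float', 'lower': 1e-4, 'upper': 1.0, 'log': True} or
    {'type': 'categorical', 'values': ['relu', 'tanh']}.
    """
    if hp['type'] == 'categorical':
        # Categorical: first entry of the value set
        return hp['values'][0]
    lower, upper = hp['lower'], hp['upper']
    if hp.get('log', False):
        # Mid-point in log scaling
        mid = math.exp(0.5 * (math.log(lower) + math.log(upper)))
    else:
        # Mid-point in linear scaling
        mid = 0.5 * (lower + upper)
    if hp['type'] == 'int':
        # Assumption: integer mid-points are rounded to the nearest integer
        mid = int(round(mid))
    return mid


def fill_defaults(partial_config, space):
    """Complete a partially specified config with default values."""
    config = dict(partial_config)
    for name, hp in space.items():
        if name not in config:
            config[name] = default_value(hp)
    return config
```

For example, an entry that only fixes the learning rate would be completed with mid-point values for every other hyperparameter in the (hypothetical) space description.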
The simplest HPO baseline is random search, which you obtain with searcher='random'. Search decisions are not based on past data: a new configuration is chosen by sampling attribute values at random, from distributions specified in config_space:
- config_space.uniform(lower, upper): Real-valued uniform in [lower, upper]
- config_space.loguniform(lower, upper): Real-valued log-uniform in [lower, upper]. More precisely, the value is exp(x), where x is drawn uniformly in [log(lower), log(upper)]
- config_space.randint(lower, upper): Integer uniform in lower, ..., upper. The value range includes both lower and upper (in contrast to the Python range convention)
- config_space.lograndint(lower, upper): Integer log-uniform in lower, ..., upper. More precisely, the value is int(round(exp(x))), where x is drawn uniformly in [log(lower - 0.5), log(upper + 0.5)]
- config_space.choice(categories): Uniform from the finite list categories of str values
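The sampling formulas above can be written out with the Python standard library. This is a sketch of the stated formulas only, not the code Syne Tune runs. The function names are hypothetical, and the final clipping in sample_lograndint is an added safeguard for the boundary case.

```python
import math
import random


def sample_loguniform(lower, upper, rng):
    # exp(x), with x drawn uniformly in [log(lower), log(upper)]
    x = rng.uniform(math.log(lower), math.log(upper))
    return math.exp(x)


def sample_randint(lower, upper, rng):
    # Both end points are included, unlike Python's range(lower, upper)
    return rng.randint(lower, upper)


def sample_lograndint(lower, upper, rng):
    # int(round(exp(x))), with x uniform in [log(lower - 0.5), log(upper + 0.5)]
    x = rng.uniform(math.log(lower - 0.5), math.log(upper + 0.5))
    value = int(round(math.exp(x)))
    # Assumption: clip to the range in case rounding lands just outside it
    return min(max(value, lower), upper)


rng = random.Random(0)
learning_rate = sample_loguniform(1e-4, 1.0, rng)
num_layers = sample_lograndint(1, 8, rng)
```

Widening the interval by 0.5 on each side before rounding gives the end points lower and upper a rounding interval of full width, like the interior integers.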
If points_to_evaluate is specified, configurations are first taken from this list before any are drawn at random. Options for configuring the searcher are given in search_options. These are:
- debug_log: If True (default), a useful log output about the search progress is printed.
Bayesian optimization is obtained by searcher='bayesopt'. A good overview of Bayesian optimization for HPO is provided in Practical Bayesian Optimization of Machine Learning Algorithms:
@article{
title={Practical {Bayesian} Optimization of Machine Learning Algorithms},
author={Snoek, J. and Larochelle, H. and Adams, R.},
booktitle={Neural Information Processing Systems 25},
year={2012},
pages={2951--2959}
}
Options for configuring the searcher are given in search_options. These include options for the random searcher. The full range of arguments of GPFIFOSearcher is documented in syne_tune/optimizer/schedulers/searchers/gp_fifo_searcher.py. Here, we list the most important ones:
- num_init_random: Number of initial configurations chosen at random (or via points_to_evaluate). In fact, the number of initial configurations is the maximum of this and the length of points_to_evaluate. Afterwards, configurations are chosen by Bayesian optimization (BO). In general, BO is only used once at least one metric value from past trials is available. We recommend setting this value to the number of workers plus two.
- opt_nstarts, opt_maxiter: BO employs a Gaussian process surrogate model, whose own hyperparameters (e.g., kernel parameters, noise variance) are chosen by empirical Bayesian optimization. In general, this is done whenever new data becomes available. It is the most expensive computation in each round. opt_maxiter is the maximum number of L-BFGS iterations. We run opt_nstarts such optimizations from random starting points and pick the best.
- opt_skip_init_length, opt_skip_period: Refitting the GP hyperparameters in each round can become expensive, especially when the number of observations grows large. If so, you can choose to do it only every opt_skip_period rounds. Skipping optimizations is done only once the number of observations is above opt_skip_init_length.
- map_reward: Internally, the criterion is minimized. If mode='max' for your tuning function (so you maximize a reward), you can specify how this reward is mapped to the inner criterion. Choices are 'minus_x' (criterion = -reward) and '{a}_minus_x', where {a} is a constant (criterion = {a} - reward). For example, '1_minus_x' maps accuracy to error.
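To make the two map_reward choices concrete, here is a small stand-alone sketch of how such a string could be turned into a mapping from reward to inner criterion. The helper name make_reward_mapper and its parsing logic are assumptions for illustration and not part of the Syne Tune API.

```python
def make_reward_mapper(map_reward):
    """Return a function mapping a reward to the (minimized) criterion.

    Hypothetical helper: supports 'minus_x' and '{a}_minus_x' as
    described in the text above.
    """
    if map_reward == 'minus_x':
        return lambda reward: -reward
    suffix = '_minus_x'
    if map_reward.endswith(suffix):
        # The part before the suffix is the constant {a}
        const = float(map_reward[:-len(suffix)])
        return lambda reward: const - reward
    raise ValueError(f"Unsupported map_reward value: {map_reward!r}")


to_error = make_reward_mapper('1_minus_x')
error = to_error(0.9)  # accuracy 0.9 is mapped to an error of about 0.1
```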
HyperbandScheduler comes in two different variants: one may stop trials early, the other may pause trials and resume them later. For tuning neural network models, it tends to work much better than FIFOScheduler. You may have read about successive halving and Hyperband before. Chances are you read about synchronous scheduling of parallel evaluations, while both HyperbandScheduler and FIFOScheduler implement asynchronous scheduling, which is different. The papers cited below provide a detailed overview of asynchronous variants of successive halving, and of the algorithms discussed here. Experiments therein indicate that asynchronous scheduling can be far more efficient for HPO than synchronous scheduling. At present, Syne Tune supports synchronous random search (by passing the argument asynchronous_scheduling=False when creating the Tuner object), but does not yet support synchronous Hyperband.
Hyperband is an extension of successive halving to multiple brackets. We will discuss successive halving, mentioning Hyperband later. In our experience so far, asynchronous successive halving does not profit from multiple brackets if applied to neural network tuning.
Here is a launcher script using HyperbandScheduler:
import logging
from syne_tune.backend import LocalBackend
from syne_tune.optimizer.schedulers import HyperbandScheduler
from syne_tune import Tuner, StoppingCriterion
from benchmarking.definitions.definition_mlp_on_fashion_mnist import \
    mlp_fashionmnist_benchmark

if __name__ == '__main__':
    logging.getLogger().setLevel(logging.DEBUG)
    n_workers = 4

    # We pick the MLP on FashionMNIST benchmark
    # The 'benchmark' dict contains arguments needed by scheduler and
    # searcher (e.g., 'mode', 'metric'), along with suggested default values
    # for other arguments (which you are free to override)
    benchmark = mlp_fashionmnist_benchmark({'dataset_path': './'})
    config_space = benchmark['config_space']

    backend = LocalBackend(entry_point=benchmark['script'])

    # GP-based Bayesian optimization searcher. Many options can be specified
    # via `search_options`. We keep the defaults, except for
    # `num_init_random`
    searcher = 'bayesopt'
    search_options = {'num_init_random': n_workers + 2}

    # Hyperband (or successive halving) scheduler of the stopping type.
    # Together with 'bayesopt', this selects the MOBSTER algorithm.
    default_params = benchmark['default_params']
    scheduler = HyperbandScheduler(
        config_space,
        searcher=searcher,
        search_options=search_options,
        type='stopping',
        max_t=default_params['epochs'],
        grace_period=default_params['grace_period'],
        reduction_factor=default_params['reduction_factor'],
        resource_attr=benchmark['resource_attr'],
        mode=benchmark['mode'],
        metric=benchmark['metric'])

    tuner = Tuner(
        trial_backend=backend,
        scheduler=scheduler,
        stop_criterion=StoppingCriterion(max_wallclock_time=120),
        n_workers=n_workers,
    )

    tuner.run()
Much of this launcher script is the same as for FIFOScheduler, but HyperbandScheduler comes with a number of extra arguments, which we explain below (type, max_t, grace_period, reduction_factor, resource_attr). The mlp_fashionmnist benchmark trains a two-layer MLP on FashionMNIST (see mlp_on_fashion_mnist.py). The accuracy is computed and reported at the end of each epoch:
for epoch in range(resume_from + 1, config['epochs'] + 1):
    train_model(config, state, train_loader)
    accuracy = validate_model(config, state, valid_loader)
    report(
        epoch=epoch,
        accuracy=accuracy)
While metric = 'accuracy' is the criterion to be optimized, resource_attr = 'epoch' is the resource attribute. In the schedulers discussed here, the resource attribute must be a positive integer.
HyperbandScheduler maintains reported metrics for all trials at certain rung levels (levels of the resource attribute epoch at which scheduling decisions are made). When a trial reports (epoch, accuracy) for a rung level == epoch, the scheduler decides whether to stop (pause) the trial or let it continue. This decision is based on all accuracy values encountered before at the same rung level. Whenever a trial is stopped (or paused), the executing worker becomes available to evaluate a different configuration.
Rung level spacing and stop/go decisions are determined by the parameters max_t, grace_period, and reduction_factor. Rung levels are grace_period, grace_period * eta, grace_period * (eta ** 2), ..., max_t, where eta = reduction_factor. In the example above, max_t = 81, grace_period = 1, and reduction_factor = 3, so that rung levels are 1, 3, 9, 27, 81. The spacing is such that stop/go decisions are made less frequently for trials which have already progressed further: they have earned trust by not being stopped earlier. max_t need not be of the form grace_period * (eta ** k). If max_t = 56 in the example above, the rung levels would be 1, 3, 9, 27, 56.
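The rung level spacing can be reproduced with a few lines of Python. This is a minimal sketch that assumes integer values for all three parameters. The function name rung_levels is chosen here for illustration and is not taken from the Syne Tune code.

```python
def rung_levels(grace_period, reduction_factor, max_t):
    """Rung levels grace_period * eta**k below max_t, followed by max_t."""
    levels = []
    level = grace_period
    while level < max_t:
        levels.append(level)
        level *= reduction_factor
    # max_t is always the final level, even if it is not of the
    # form grace_period * (eta ** k)
    levels.append(max_t)
    return levels


print(rung_levels(grace_period=1, reduction_factor=3, max_t=81))
print(rung_levels(grace_period=1, reduction_factor=3, max_t=56))
```

The two calls correspond to the two examples in the text, with max_t = 81 and max_t = 56.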
If max_t is not given as argument to HyperbandScheduler, the value may be inferred from config_space. Namely, config_space['epochs'], config_space['max-t'], config_space['max-epochs'] are checked in this order. In the example above, config_space['epochs'] contains the correct value, so we could have dropped max_t.
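The lookup order can be sketched as follows. Here, config_space is treated as a plain dict, and the helper infer_max_t and its error handling are hypothetical. The sketch only illustrates the order of the checks and does not reproduce the scheduler's code.

```python
def infer_max_t(config_space, max_t=None):
    """Return max_t if given, else look it up in config_space.

    Keys are checked in the order stated in the text above.
    """
    if max_t is not None:
        return max_t
    for key in ('epochs', 'max-t', 'max-epochs'):
        if key in config_space:
            return config_space[key]
    raise ValueError("max_t not given and not found in config_space")


# 'epochs' is a fixed parameter of the search space in this example
example_space = {'epochs': 81, 'learning_rate': None}
resolved = infer_max_t(example_space)
```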
Given such a rung level spacing, stop/go decisions are made by comparing accuracy to the 1 / reduction_factor quantile of values recorded at the rung level. In the example above, our trial is stopped if accuracy is no better than the best 1/3 of previous values (the list includes the current accuracy value); otherwise it continues.
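A stop/go rule of this kind might look as follows in plain Python. This is a simplified sketch. The exact quantile computation in Syne Tune may differ, and the cutoff used here (keep the best len(recorded) // reduction_factor values, at least one) is an assumption. The early return for fewer than reduction_factor values follows the behaviour of the stopping-based variant described in this tutorial.

```python
def should_continue(value, recorded, reduction_factor, mode='max'):
    """Decide 'go' (True) or 'stop' (False) at one rung level.

    `recorded` holds all metric values seen at this rung level,
    including the current `value`.
    """
    if len(recorded) < reduction_factor:
        # Too few values to compare against: the trial continues
        return True
    # Rank values from best to worst
    ranked = sorted(recorded, reverse=(mode == 'max'))
    # Assumption: the best 1 / reduction_factor fraction may continue
    n_keep = max(1, len(ranked) // reduction_factor)
    cutoff = ranked[n_keep - 1]
    if mode == 'max':
        return value >= cutoff
    return value <= cutoff


accuracies = [0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
go_best = should_continue(0.95, accuracies, reduction_factor=3)
go_mid = should_continue(0.7, accuracies, reduction_factor=3)
```

With six recorded values and reduction_factor = 3, only the two best accuracies lead to a 'go' decision in this sketch.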
As detailed in Model-based Asynchronous Hyperparameter and Neural Architecture Search, there are two different types of asynchronous successive halving, stopping-based and promotion-based. Together with the PASHA variant, they are selected by the type argument:
- Stopping-based asynchronous successive halving [type='stopping']: This is essentially a refined variant of early stopping for HPO. If a stop/go decision comes out 'stop', the trial is terminated, otherwise it may continue. If there are fewer than reduction_factor recorded values at the rung level (including the current one), the trial continues. Moreover, whenever a worker is free, a trial is started with a newly chosen configuration. This variant is simple and does not require the back-end to pause and resume trials. It is the default for type.
- Promotion-based asynchronous successive halving [type='promotion']: This variant has been proposed as ASHA in A System for Massively Parallel Hyperparameter Tuning. If a stop/go decision comes out 'stop', the trial is paused, otherwise it may continue. If there are fewer than reduction_factor recorded values at the rung level (including the current one), the trial is paused. Moreover, whenever a worker is free, the scheduler first scans all paused trials in reverse rung level order (largest rung levels first). If for any of them, the stop/go decision comes out 'go', this trial is resumed. Otherwise, if none of the paused trials can be resumed, a trial is started with a newly chosen configuration. This variant requires the back-end to pause and resume trials (which typically includes support for checkpointing).
- Progressive ASHA (PASHA) [type='pasha']: This is the variant of ASHA presented in TODO:ADDLINK. This variant of ASHA has been developed to be resource-efficient on large datasets, and it works by progressively extending the maximum resource level at which configurations are trained. It has been shown empirically that it is often possible to identify optimal configurations early on, but it is often difficult to determine how reliable an early decision is. PASHA tries to automatically identify the minimum resource level at which it can make a reliable decision. It can be used in situations where re-training a model several times would be too expensive.
The full range of arguments of HyperbandScheduler is documented in syne_tune/optimizer/schedulers/hyperband.py. It includes all those of FIFOScheduler. Here, we list the most important ones:
- max_t, grace_period, reduction_factor: As detailed above, these determine the rung levels and the stop/go decisions. The resource attribute is a positive integer. We need reduction_factor >= 2.
- rung_levels: Alternatively, the user can specify the list of rung levels directly (positive integers, strictly increasing). The stop/go rule in the successive halving scheduler is set based on the ratio of successive rung levels.
- type: Values are 'stopping', 'promotion', 'pasha' (see above).
- brackets: Number of brackets to be used in Hyperband. The default is 1, which corresponds to successive halving. Each bracket has a different grace_period; they share max_t and reduction_factor. When starting a new trial, it is assigned a randomly sampled bracket (smaller brackets have a higher probability). The larger the bracket, the larger the grace_period before the trial has to compete with others.
- rung_system_per_bracket: Only used if brackets > 1. If True, each bracket maintains its own rung level system, so that trials only compete with those started in the same bracket. If False, all trials compete with each other in a single rung level system; they just get different head starts in terms of their grace_period.
- searcher_data: This option is relevant when searcher='bayesopt' and is discussed below.
If HyperbandScheduler is configured with a random searcher, we obtain ASHA, as proposed in A System for Massively Parallel Hyperparameter Tuning.
@article{
title={A System for Massively Parallel Hyperparameter Tuning},
author={Liam Li and Kevin Jamieson and Afshin Rostamizadeh and Ekaterina Gonina and Moritz Hardt and Benjamin Recht and Ameet Talwalkar},
journal={arXiv preprint arXiv:1810.05934}
}
Strictly speaking, their paper details the promotion-based variant (type = 'promotion'), while the stopping-based variant is based on earlier ideas like the median rule.
Nothing much can be configured via search_options in this case. The arguments are the same as for random search with FIFOScheduler.
If HyperbandScheduler is configured with a Bayesian optimization searcher, we obtain MOBSTER, as proposed in Model-based Asynchronous Hyperparameter and Neural Architecture Search.
@article{
title={Model-based Asynchronous Hyperparameter and Neural Architecture Search},
author={Aaron Klein and Louis C. Tiao and Thibaut Lienart and Cedric Archambeau and Matthias Seeger},
journal={arXiv preprint arXiv:2003.10865}
}
MOBSTER uses a multi-task Gaussian process surrogate model for metrics data observed at all resource levels. Options for configuring the searcher are given in search_options. These include options for the random searcher. The full range of arguments of GPMultiFidelitySearcher is documented in syne_tune/optimizer/schedulers/searchers/gp_multifidelity_searcher.py. Here, we list the most important ones:
- num_init_random: See FIFOSearcher, searcher='bayesopt'.
- opt_nstarts, opt_maxiter: See FIFOSearcher, searcher='bayesopt'.
- opt_skip_init_length, opt_skip_period: See FIFOSearcher, searcher='bayesopt'.
- map_reward: See FIFOSearcher, searcher='bayesopt'.
- gp_resource_kernel: Values are 'matern52', 'matern52-res-warp', 'exp-decay-sum', 'exp-decay-delta1', 'exp-decay-combined'. Selects different multi-task GP surrogate models. For details, please see the code. The default choice is 'exp-decay-sum', which is closely related to the kernel proposed in Freeze-Thaw Bayesian Optimization, but without the conditional independence assumptions made there.
- opt_skip_num_max_resource: Alternative to opt_skip_period. If True, the GP surrogate model hyperparameters are refit only when a trial reaches level max_t.
- resource_acq_bohb_threshold: MOBSTER chooses a new configuration by maximizing the expected improvement (EI) acquisition function at a certain resource level r_acq. Since we are ultimately interested in performance at r = max_t, we would like to set r_acq = max_t as early as possible. On the other hand, EI may not be reliable at a resource level if too little metric data has been observed there (i.e., too few trials reached this level). MOBSTER sets r_acq to the largest rung level r <= max_t for which at least resource_acq_bohb_threshold metric values have been recorded.
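The rule for choosing r_acq can be written down directly. The sketch below assumes that the number of recorded metric values per rung level is available as a dict. The function name and the fallback used when no rung level meets the threshold are assumptions for illustration, since the text above does not specify that case.

```python
def acquisition_resource(rung_counts, threshold):
    """Pick r_acq: the largest rung level with enough recorded values.

    `rung_counts` maps each rung level r <= max_t to the number of
    metric values recorded there.
    """
    eligible = [r for r, count in rung_counts.items() if count >= threshold]
    if eligible:
        return max(eligible)
    # Assumption: fall back to the smallest rung level if none qualifies
    return min(rung_counts)


counts = {1: 10, 3: 5, 9: 2, 27: 0, 81: 0}
r_acq = acquisition_resource(counts, threshold=3)
```

In this example, rung level 9 has only two recorded values, so with a threshold of 3 the acquisition function would be evaluated at rung level 3.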
Finally, for searcher='bayesopt', the HyperbandScheduler argument searcher_data is relevant. Values are 'rungs', 'all', 'rungs_and_last'. Recall that the searcher represents past data by a multi-task GP surrogate model conditioned on observed metric data. Inference in such a model scales cubically in the number of datapoints. searcher_data determines which metric observations are passed to the searcher to update its surrogate model:
- 'all': All observations at all resource levels are used for the surrogate model. This provides the best fit, but can be expensive, which can slow down the search.
- 'rungs': This is the default. Observations are used for the surrogate model only if their resource level is equal to a rung level. This renders the surrogate model cheaper, but may result in a worse fit.
- 'rungs_and_last': Observations are used for the surrogate model if their resource level is equal to a rung level, or if they are the most recent observation of a trial. The surrogate model is only a bit more expensive, but the most recent observation of every trial is used.
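The effect of the three searcher_data values can be illustrated by a filter over recorded observations. This is a sketch only: the tuple layout (trial_id, resource, metric) and the function name are assumptions, and the scheduler itself decides incrementally which observations to pass on rather than filtering a complete list.

```python
def filter_observations(observations, rung_levels, searcher_data='rungs'):
    """Select the observations passed to the surrogate model.

    `observations` is a list of (trial_id, resource, metric) tuples.
    """
    if searcher_data == 'all':
        return list(observations)
    if searcher_data not in ('rungs', 'rungs_and_last'):
        raise ValueError(f"Unknown searcher_data: {searcher_data!r}")
    rungs = set(rung_levels)
    last = {}
    if searcher_data == 'rungs_and_last':
        # Largest resource level reported so far by each trial
        for trial_id, resource, _ in observations:
            last[trial_id] = max(resource, last.get(trial_id, resource))
    return [
        obs for obs in observations
        if obs[1] in rungs or last.get(obs[0]) == obs[1]
    ]


data = [(0, 1, 0.50), (0, 2, 0.60), (0, 3, 0.70), (0, 4, 0.72),
        (1, 1, 0.40), (1, 2, 0.45)]
at_rungs = filter_observations(data, [1, 3, 9], 'rungs')
with_last = filter_observations(data, [1, 3, 9], 'rungs_and_last')
```

Here 'rungs' keeps the three observations at resource levels 1 and 3, while 'rungs_and_last' additionally keeps the latest report of each trial.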
Finally, we provide some general recommendations on how to use our built-in schedulers.
- If you can afford it for your problem, random search (FIFOScheduler, searcher='random') is a useful baseline. However, if even a single full evaluation takes a long time, try ASHA instead (HyperbandScheduler, searcher='random', type='stopping' or type='promotion').
- Use these baseline runs to get an idea of how long your experiment needs to run. It is recommended to use a stopping criterion of the form stop_criterion=StoppingCriterion(max_wallclock_time=X), so that the experiment is stopped after X seconds.
- If your tuning problem comes with an obvious resource parameter, make sure to implement it such that results are reported during the evaluation, not only at the end. When training a neural network model, choose the number of epochs as resource. In other situations, choosing a resource parameter may be more difficult. Our schedulers require positive (or non-negative) integers. Make sure that evaluations for the same configuration scale linearly in the resource parameter: an evaluation up to 2 * r should be roughly twice as expensive as one up to r.
- If your problem has a resource parameter, always make sure to try HyperbandScheduler, which in many cases runs much faster than FIFOScheduler.
- If you end up tuning the same ML algorithm or neural network model on different datasets, make sure to set points_to_evaluate appropriately. If the model comes from frequently used open source code, its built-in defaults will be a good choice. Any hyperparameter not covered in points_to_evaluate is set using a "midpoint" heuristic. While still better than choosing the first configuration at random, this may not be very good.
- For HyperbandScheduler, you need to choose between type='stopping' and type='promotion'. For neural network tuning, start with 'stopping', which is simpler and does not need checkpointing. However, if checkpointing is in place, try both of them. For some problems, the notion of stopping and checkpointing does not apply, and 'promotion' may be more natural. For example, you may train your model on subsamples of size (r / 10) * total_size, r=1,...,10 (assuming that training scales roughly linearly in the dataset size). Training for every r has to start from scratch in this case.
- In general, the defaults should work well if your tuning problem is expensive enough (at least a few minutes per unit of r). In such cases, MOBSTER (HyperbandScheduler, searcher='bayesopt') can outperform ASHA substantially. However, if your problem is cheap, so you can afford a lot of evaluations, the searchers based on GP surrogate models may end up expensive. With ASHA as your baseline, you can try to speed up MOBSTER by changing opt_skip_period (or using opt_skip_num_max_resource).