Skip to content

Latest commit

 

History

History
518 lines (447 loc) · 26.3 KB

schedulers.md

File metadata and controls

518 lines (447 loc) · 26.3 KB

Syne Tune: Using the Built-in Schedulers

In this tutorial, you will learn how to use and configure the built-in HPO algorithms. Alternatively, you can also use most algorithms from Ray Tune.

First, make sure you have installed the gpsearchers and benchmarks dependencies:

pip install -e .[gpsearchers,benchmarks]

Schedulers and Searchers

The decision-making algorithms driving an HPO experiments are referred to as schedulers. As in Ray Tune, some of our schedulers are internally configured by a searcher. A scheduler interacts with the back-end, making decisions on which configuration to evaluate next, and whether to stop, pause or resume existing trials. It relays "next configuration" decisions to the searcher. Some searchers maintain a surrogate model which is fitted to metric data coming from evaluations.

FIFOScheduler

This is the simplest kind of scheduler. It cannot stop or pause trials, each evaluation proceeds to the end. Depending on the searcher, this scheduler supports:

  • Random search [searcher=random]
  • Bayesian optimization with Gaussian processes [searcher=bayesopt]

Here is a launcher script using FIFOScheduler:

import logging

from syne_tune.backend import LocalBackend
from syne_tune.optimizer.schedulers import FIFOScheduler
from syne_tune import Tuner, StoppingCriterion

from benchmarking.definitions.definition_mlp_on_fashion_mnist import \
    mlp_fashionmnist_benchmark

if __name__ == '__main__':
  logging.getLogger().setLevel(logging.DEBUG)
  n_workers = 4

  # We pick the MLP on FashionMNIST benchmark
  # The 'benchmark' dict contains arguments needed by scheduler and
  # searcher (e.g., 'mode', 'metric'), along with suggested default values
  # for other arguments (which you are free to override)
  benchmark = mlp_fashionmnist_benchmark({'dataset_path': './'})
  config_space = benchmark['config_space']

  backend = LocalBackend(entry_point=benchmark['script'])

  # GP-based Bayesian optimization searcher. Many options can be specified
  # via `search_options`, but let's use the defaults
  searcher = 'bayesopt'
  search_options = {'num_init_random': n_workers + 2}
  # FIFOScheduler. Together with searcher `bayesopt`, this selects Bayesian
  # optimization without early stopping.
  scheduler = FIFOScheduler(
    config_space,
    searcher=searcher,
    search_options=search_options,
    mode=benchmark['mode'],
    metric=benchmark['metric'])

  tuner = Tuner(
    trial_backend=backend,
    scheduler=scheduler,
    stop_criterion=StoppingCriterion(max_wallclock_time=120),
    n_workers=n_workers)

  tuner.run()

What happens in this launcher script?

  • We select the mlp_fashionmnist benchmark, adopting its default hyperparameter search space without modifications.
  • We select the local back-end, which runs up to n_workers = 4 processes in parallel on the same instance.
  • We create a FIFOScheduler with searcher = 'bayesopt'. This means that new configurations to be evaluated are selected by Bayesian optimization, and all trials are run to the end. The scheduler needs to know the config_space, the name of metric to tune (metric) and whether to minimize or maximize this metric (mode). For mlp_fashionmnist, we have metric = 'accuracy' and mode = 'max', so we select a configuration which maximizes accuracy.
  • Options for the searcher can be passed via search_options. We use defaults, instead of changing num_init_random (see below) to the number of workers plus two.
  • Finally, we create the tuner, passing backend, scheduler, as well as the stopping criterion for the experiment (stop after 120 seconds) and the number of workers. The experiment is started by tuner.run().

The full range of arguments of FIFOScheduler is documented in syne_tune/optimizer/schedulers/fifo.py. Here, we list the most important ones:

  • config_space: Hyperparameter search space. This argument is mandatory. Apart from hyperparameters to be searched over, the space may contain fixed parameters (such as epochs in the example above). A config passed to the training script is always extended by these fixed parameters. If you use a benchmark, you can use benchmark['config_space'] here, or you can modify this default search space.
  • searcher: Selects searcher to be used (see below).
  • search_options: Options to configure the searcher (see below).
  • metric, mode: Name of metric to tune (i.e, key used in report call by the training script), which is either to be minimized (mode = 'min') or maximized (mode = 'max'). If you use a benchmark, just use benchmark['metric'] and benchmark['mode'] here.
  • points_to_evaluate: Allows to specify a list of configurations which are evaluated first. If your training code corresponds to some open source ML algorithm, you may want to use the defaults provided in the code. The entry (or entries) in points_to_evaluate do not have to specify values for all hyperparameters. For any hyperparameter not listed there, the following rule is used to choose a default. For float and int value type, the mid-point of the search range is used (in linear or log scaling). For categorical value type, the first entry in the value set is used. The default is a single config with all values chosen by the default rule. Pass an empty list in order to not specify any initial configs.
  • random_seed: Master random seed. Random sampling in schedulers and searchers are done by a number of numpy.random.RandomState generators, whose seeds are derived from random_seed. If not given, a random seed is sampled.

Random Search (searcher = 'random')

The simplest HPO baseline is random search, which you obtain with searcher='random'. Search decisions are not based on past data, a new configuration is chosen by sampling attribute values at random, from distributions specified in config_space:

  • config_space.uniform(lower, upper): Real-valued uniform in [lower, upper]
  • config_space.loguniform(lower, upper): Real-valued log-uniform in [lower, upper]. More precisely, the value is exp(x), where x is drawn uniformly in [log(lower), log(upper)]
  • config_space.randint(lower, upper): Integer uniform in lower, ..., upper. The value range includes both lower and upper (difference to Python range convention)
  • config_space.lograndint(lower, upper): Integer log-uniform in lower, ..., upper. More precisely, the value is int(round(exp(x))), where x is drawn uniformly in [log(lower - 0.5), log(upper + 0.5)]
  • config_space.choice(categories): Uniform from the finite list categories of str values

If points_to_evaluate is specified, configurations are first taken from this list before any are drawn at random. Options for configuring the searcher are given in search_options. These are:

  • debug_log: If True (default), a useful log output about the search progress is printed.

Bayesian Optimization (searcher = 'bayesopt')

Bayesian optimization is obtained by searcher='bayesopt'. A good overview of Bayesian optimization for HPO is provided in Practical Bayesian Optimization of Machine Learning Algorithms:

@article{
   title={Practical {Bayesian} Optimization of Machine Learning Algorithms},
   author={Snoek, J. and Larochelle, H. and Adams, R.},
   booktitle={Neural Information Processing Systems 25},
   year={2012},
   pages={2951--2959}
}

Options for configuring the searcher are given in search_options. These include options for the random searcher. The full range of arguments of GPFIFOSearcher is documented in syne_tune/optimizer/schedulers/searchers/gp_fifo_searcher.py. Here, we list the most important ones:

  • num_init_random: Number of initial configurations chosen at random (or via points_to_evaluate). In fact, the number of initial configurations is the maximum of this and the length of points_to_evaluate. Afterwards, configurations are chosen by Bayesian optimization (BO). In general, BO is only used once at least one metric value from past trials is available. We recommend to set this value to the number of workers plus two.
  • opt_nstarts, opt_maxiter: BO employs a Gaussian process surrogate model, whose own hyperparameters (e.g., kernel parameters, noise variance) are chosen by empirical Bayesian optimization. In general, this is done whenever new data becomes available. It is the most expensive computation in each round. opt_maxiter is the maximum number of L-BFGS iterations. We run opt_nstarts such optimizations from random starting points and pick the best.
  • opt_skip_init_length, opt_skip_period: Refitting the GP hyperparameters in each round can become expensive, especially when the number of observations grows large. If so, you can choose to do it only every opt_skip_period rounds. Skipping optimizations is done only once the number of observations is above opt_skip_init_length.
  • map_reward: Internally, the criterion is minimized. If mode='max' for your tuning function (so you maximize a reward), you can specify how this reward is mapped to the inner criterion. Choices are 'minus_x' (criterion = -reward) and '{a}_minus_x', where {a} is a constant (criterion = {a} - reward. For example, '1_minus_x' maps accuracy to error.

HyperbandScheduler

This scheduler comes in two different variants, one may stop trials early, the other may pause trials and resume them later. For tuning neural network models, it tends to work much better than FIFOScheduler. You may have read about successive halving and Hyperband before. Chances are you read about synchronous scheduling of parallel evaluations, while both HyperbandScheduler and FIFOScheduler implement asynchronous scheduling, which is different. The papers cited below provide a detailed overview of asynchronous variants of successive halving, and of the algorithms discussed here. Experiments therein indicate that asynchronous scheduling can be far more efficient for HPO than synchronous scheduling. At present, Syne Tune supports synchronous random search (by passing the argument asynchronous_scheduling=False when creating the Tuner object), but does not yet support synchronous Hyperband.

Hyperband is an extension of successive halving to multiple brackets. We will discuss successive halving, mentioning Hyperband later. In our experience so far, asynchronous successive halving does not profit from multiple brackets if applied to neural network tuning.

Here is a launcher script using HyperbandScheduler:

import logging

from syne_tune.backend import LocalBackend
from syne_tune.optimizer.schedulers import HyperbandScheduler
from syne_tune import Tuner, StoppingCriterion

from benchmarking.definitions.definition_mlp_on_fashion_mnist import \
    mlp_fashionmnist_benchmark

if __name__ == '__main__':
  logging.getLogger().setLevel(logging.DEBUG)
  n_workers = 4

  # We pick the MLP on FashionMNIST benchmark
  # The 'benchmark' dict contains arguments needed by scheduler and
  # searcher (e.g., 'mode', 'metric'), along with suggested default values
  # for other arguments (which you are free to override)
  benchmark = mlp_fashionmnist_benchmark({'dataset_path': './'})
  config_space = benchmark['config_space']

  backend = LocalBackend(entry_point=benchmark['script'])

  # GP-based Bayesian optimization searcher. Many options can be specified
  # via `search_options`, but let's use the defaults
  searcher = 'bayesopt'
  search_options = {'num_init_random': n_workers + 2}
  # Hyperband (or successive halving) scheduler of the stopping type.
  # Together with 'bayesopt', this selects the MOBSTER algorithm.
  default_params = benchmark['default_params']
  scheduler = HyperbandScheduler(
    config_space,
    searcher=searcher,
    search_options=search_options,
    type='stopping',
    max_t=default_params['epochs'],
    grace_period=default_params['grace_period'],
    reduction_factor=default_params['reduction_factor'],
    resource_attr=benchmark['resource_attr'],
    mode=benchmark['mode'],
    metric=benchmark['metric'])

  tuner = Tuner(
    trial_backend=backend,
    scheduler=scheduler,
    stop_criterion=StoppingCriterion(max_wallclock_time=120),
    n_workers=n_workers,
  )

  tuner.run()

Much of this launcher script is the same as for FIFOScheduler, but HyperbandScheduler comes with a number of extra arguments we will explain in the sequel (type, max_t, grace_period, reduction_factor, resource_attr). The mlp_fashionmnist benchmark trains a two-layer MLP on FashionMNIST (see mlp_on_fashion_mnist.py). The accuracy is computed and reported at the end of each epoch:

for epoch in range(resume_from + 1, config['epochs'] + 1):
    train_model(config, state, train_loader)
    accuracy = validate_model(config, state, valid_loader)
    report(
        epoch=epoch,
        accuracy=accuracy)

While metric = 'accuracy' is the criterion to be optimized, resource_attr = 'epoch' is the resource attribute. In the schedulers discussed here, the resource attribute must be a positive integer.

HyperbandScheduler maintains reported metrics for all trials at certain rung levels (levels of resource attribute epoch at which scheduling decisions are done). When a trial reports (epoch, accuracy) for a rung level == epoch, the scheduler makes a decision whether to stop (pause) or continue. This decision is done based on all accuracy values encountered before at the same rung level. Whenever a trial is stopped (or paused), the executing worker becomes available to evaluate a different configuration.

Rung level spacing and stop/go decisions are determined by the parameters max_t, grace_period, and reduction_factor. Rung levels are grace_period, grace_period * eta, grace_period * (eta ** 2), ..., max_t, where eta = reduction_factor. In the example above, max_t = 81, grace_period = 1, and reduction_factor = 3, so that rung levels are 1, 3, 9, 27, 81. The spacing is such that stop/go decisions are done less frequently for trials which already went further: they have earned trust by not being stopped earlier. max_t need not be of the form grace_period * (eta ** k). If max_t = 56 in the example above, the rung levels would be 1, 3, 9, 27, 56.

If max_t is not given as argument to HyperbandScheduler, the value may be inferred from config_space. Namely, config_space['epochs'], config_space['max-t'], config_space['max-epochs'] are checked in this order. In the example above, config_space['epochs'] contains the correct value, so we could have dropped max_t.

Given such a rung level spacing, stop/go decisions are done by comparing accuracy to the 1 / reduction_factor quantile of values recorded at the rung level. In the example above, our trial is stopped if accuracy is no better than the best 1/3 of previous values (the list includes the current accuracy value), otherwise it is stopped.

As detailed in Model-based Asynchronous Hyperparameter and Neural Architecture Search, there are two different types of asynchronous successive halving, selected by the type argument:

  • Stopping-based asynchronous successive halving [type='stopping']: This is essentially a refined variant of early stopping for HPO. If a stop/go decision comes out 'stop', the trial is terminated, otherwise it may continue. If there are less than reduction_factor recorded values at the rung level (including the current one), the trial continues. Moreover, whenever a worker is free, a trial is started with a newly chosen configuration. This variant is simple and does not require the back-end to pause and resume trials. It is the default for type.
  • Promotion-based asynchronous successive halving [type = 'promotion']: This variant has been proposed as ASHA in A System for Massively Parallel Hyperparameter Tuning. If a stop/go decision comes out 'stop', the trial is paused, otherwise it may continue. If there are less than reduction_factor recorded values at the rung level (including the current one), the trial is paused. Moreover, whenever a worker is free, the scheduler first scans all paused trials in reverse rung level order (largest rung levels first). If for any of them, the stop/go decision comes out 'go', this trial is resumed. Otherwise, if none of the paused trials can be resumed, a trial is started with a newly chosen configuration. This variant requires the back-end to pause and resume trials (which typically includes support for checkpointing).
  • Progressive ASHA (PASHA) [type='pasha']: This is the variant of ASHA presented in TODO:ADDLINK. This variant of ASHA have been developed to be resource-efficient on large datasets and it works by progressively extending the maximum resources level at which configurations are trained. It is empirically shown that it is often possible to identify optimal configurations early on, but it is often difficult to determine how reliable an early decision is. PASHA tries to automatically identify the minimum resources level at which it can perform a reliable decision. It can be used in situations where re-training a model several time would be just too expensive.

The full range of arguments of HyperbandScheduler is documented in syne_tune/optimizer/schedulers/hyperband.py. It includes all those of FIFOScheduler. Here, we list the most important ones:

  • max_t, grace_period, reduction_factor: As detailed above, these determine the rung levels and the stop/go decisions. The resource attribute is a positive integer. We need reduction_factor >= 2.
  • rung_levels: Alternatively, the user can specify the list of rung levels directly (positive integers, strictly increasing). The stop/go rule in the successive halving scheduler is set based on the ratio of successive rung levels.
  • type: Values are 'stopping', 'promotion' (see above).
  • brackets: Number of brackets to be used in Hyperband. The default is 1, which corresponds to successive halving. Each bracket has a different grace_period, they share max_t and reduction_factor. When starting a new trial, it is assigned a randomly sampled bracket (smaller brackets have a higher probability). The larger the bracket, the larger grace_period before the trial has to compete with others.
  • rung_system_per_bracket: Only used if brackets > 1. If True, each bracket maintains its own rung level system, so that trials only compete with those started in the same bracket. If False, all trials compete with each other in a single rung level system, they just get different head starts in terms of their grace_period.
  • searcher_data: This option is relevant when searcher='bayesopt' and is discussed below.

Asynchronous Hyperband ASHA (searcher = 'random')

If HyperbandScheduler is configured with a random searcher, we obtain ASHA, as proposed in A System for Massively Parallel Hyperparameter Tuning.

@article{
   title={A System for Massively Parallel Hyperparameter Tuning},
   author={Liam Li and Kevin Jamieson and Afshin Rostamizadeh and Ekaterina Gonina and Moritz Hardt and Benjamin Recht and Ameet Talwalkar},
   journal={arXiv preprint arXiv:1810.05934}
}

Strictly speaking, their paper details the promotion-based variant (type = 'promotion'), while the stopping-based variant is based on earlier ideas like the median rule.

Nothing much can be configured via search_options in this case. The arguments are the same as for random search with FIFOScheduler.

Model-based Asynchronous Hyperband MOBSTER (searcher = 'bayesopt')

If HyperbandScheduler is configured with a Bayesian optimization searcher, we obtain MOBSTER, as proposed in Model-based Asynchronous Hyperparameter and Neural Architecture Search.

@article{
   title={Model-based Asynchronous Hyperparameter and Neural Architecture Search},
   author={Aaron Klein and Louis C. Tiao and Thibaut Lienart and Cedric Archambeau and Matthias Seeger},
   journal={arXiv preprint arXiv:2003.10865}
}

MOBSTER uses a multi-task Gaussian process surrogate model for metrics data observed at all resource levels. Options for configuring the searcher are given in search_options. These include options for the random searcher. The full range of arguments of GPMultiFidelitySearcher is documented in syne_tune/optimizer/schedulers/searchers/gp_multifidelity_searcher.py. Here, we list the most important ones:

  • num_init_random: See FIFOSearcher, searcher='bayesopt'.
  • opt_nstarts, opt_maxiter: See FIFOSearcher, searcher='bayesopt'.
  • opt_skip_init_length, opt_skip_period: See FIFOSearcher, searcher='bayesopt'.
  • map_reward: See FIFOSearcher, searcher='bayesopt'.
  • gp_resource_kernel: Values are 'matern52', 'matern52-res-warp', 'exp-decay-sum', 'exp-decay-delta1', 'exp-decay-combined'. Selects different multi-task GP surrogate models. For details, please see the code. The default choice is 'exp-decay-sum', which is closely related to the kernel proposed inFreeze-Thaw Bayesian Optimization, but without the conditional independence assumptions made there.
  • opt_skip_num_max_resource: Alternative to opt_skip_period. If True, the GP surrogate model hyperparameters are refit only when a trial reaches level max_t.
  • resource_acq_bohb_threshold: MOBSTER is choosing a new configuration by maximizing the expected improvement (EI) acquisition function at a certain resource level r_acq. Since we are ultimately interested in performance at r = max_t, we would like to set r_acq = max_t as early as possible. On the other hand, EI may not be reliable at a resource level if too little metric data has been observed there (i.e., too few trials reached this level). MOBSTER is setting r_acq to the largest rung level r <= max_t for which at least resource_acq_bohb_threshold metric values have been recorded.

Finally, for searcher='bayesopt', the HyperbandScheduler argument searcher_data is relevant. Values are 'rungs', 'all', 'rungs_and_last'. Recall that the searcher represents past data by a multi-task GP surrogate model conditioned on observed metric data. Inference in such a model scales cubically in the number of datapoints. searcher_data determines which metric observations are passed to the searcher to update its surrogate model:

  • 'all': All observations at all resource levels are used for the surrogate model. This provides the best fit, but can be expensive, which can slow down the search.
  • 'rungs': This is the default. Observations are used for the surrogate model only if their resource level is equal to a rung level. This renders the surrogate model cheaper, but may result in a worse fit.
  • 'rungs_and_last': Observations are used for the surrogate model if their resource level is equal to a rung level, or if they are the last recent observation of a trial. The surrogate model is only a bit more expensive, but all most recent observations are used.

Recommendations

Finally, we provide some general recommendations on how to use our built-in schedulers.

  • If you can afford it for your problem, random search (FIFOScheduler, searcher='random') is a useful baseline. However, if even a single full evaluation takes a long time, try ASHA instead (HyperbandScheduler, searcher='random', type='stopping' or type='promotion').
  • Use these baseline runs to get an idea how long your experiment needs to run. It is recommended to use a stopping criterion of the form stop_criterion=StoppingCriterion(max_wallclock_time=X), so that the experiment is stopped after X seconds.
  • If your tuning problem comes with an obvious resource parameter, make sure to implement it such that results are reported during the evaluation, not only at the end. When training a neural network model, choose the number of epochs as resource. In other situations, choosing a resource parameter may be more difficult. Our schedulers require positive (or non-negative) integers. Make sure that evaluations for the same configuration scale linearly in the resource parameter: an evaluation up to 2 * r should be roughly twice as expensive as one up to r.
  • If your problem has a resource parameter, always make sure to try HyperbandScheduler, which in many cases runs much faster than FIFOScheduler.
  • If you end up tuning the same ML algorithm or neural network model on different datasets, make sure to set points_to_evaluate appropriately. If the model comes from frequently used open source code, its built-in defaults will be a good choice. Any hyperparameter not covered in points_to_evaluate is set using a "midpoint" heuristic. While still better than choosing the first configuration at random, this may not be very good.
  • For HyperbandScheduler, you need to choose between type='stopping' and type='promotion'. For neural network tuning, start with 'stopping', which is simpler and does not need checkpointing. However, if checkpointing is in place, try both of them. For some problems, the notion of stopping and checkpointing does not apply, and 'promotion' may be more natural. For example, you may train your model on subsamples of size (r / 10) * total_size, r=1,...,10 (assuming that training scales roughly linear in the dataset size). Training for every r has to start from scratch in this case.
  • In general, the defaults should work well if your tuning problem is expensive enough (at least a few minutes per unit of r). In such cases, MOBSTER (HyperbandScheduler, searcher='bayesopt') can outperform ASHA substantially. However, if your problem is cheap, so you can afford a lot of evaluation, the searchers based on GP surrogate models may end up expensive. With ASHA your baseline, you can try to speed up MOBSTER by changing opt_skip_period (or using opt_skip_num_max_resource).