We have seen that asynchronous decision-making tends to outperform synchronous variants in practice, and model-based extensions of the latter can outperform random sampling of new configurations. In this section, we discuss combinations of Bayesian optimization with asynchronous decision-making, leading to the currently best performing multi-fidelity methods in Syne Tune.
All examples here can be run in either the stopping or the promotion mode of ASHA. We will use the promotion mode here (i.e., pause-and-resume scheduling).
Recall that the validation error after $r$ epochs for a configuration $\mathbf{x}$ is denoted by $f(\mathbf{x}, r)$; viewed as a function of $r$, it traces out the learning curve of $\mathbf{x}$. The model-based methods in this section rely on surrogate models of such learning curves.
In the context of Gaussian process based Bayesian optimization, Syne Tune supports a number of different learning curve surrogate models. The type of model is selected upon construction of the scheduler:
from syne_tune.optimizer.schedulers import HyperbandScheduler

scheduler = HyperbandScheduler(
    config_space,
    type="promotion",  # pause-and-resume scheduling
    searcher="bayesopt",  # selects MOBSTER
    search_options=dict(
        model="gp_multitask",
        gp_resource_kernel="exp-decay-sum",
    ),
    metric=benchmark.metric,
    mode=benchmark.mode,
    resource_attr=resource_attr,
    random_seed=random_seed,
    max_resource_attr=max_resource_attr,
)
First, searcher="bayesopt" selects MOBSTER as the searcher in asynchronous Hyperband. Further options configuring the searcher are collected in search_options. The most important options are model, selecting the type of surrogate model, and gp_resource_kernel, selecting the covariance model in the case model="gp_multitask".
A simple learning curve surrogate model is obtained by search_options["model"] = "gp_independent". Here, learning curves at different rung levels are represented by independent Gaussian process models, which share a common kernel over configurations. If search_options["separate_noise_variances"] = True, different noise variances are used for observations at different rung levels; otherwise, a single noise variance is shared.
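For instance, this choice can be sketched as follows (scheduler construction and variables as in the snippet above):

scheduler = HyperbandScheduler(
    config_space,
    type="promotion",
    searcher="bayesopt",
    search_options=dict(
        model="gp_independent",
        # fit a separate noise variance for each rung level
        separate_noise_variances=True,
    ),
    metric=benchmark.metric,
    mode=benchmark.mode,
    resource_attr=resource_attr,
    max_resource_attr=max_resource_attr,
)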
A more advanced set of learning curve surrogate models is obtained by search_options["model"] = "gp_multitask" (which is the default for asynchronous MOBSTER). In this case, a single Gaussian process model represents the learning curves jointly, with a covariance kernel defined over (configuration, resource) pairs. The covariance model is selected by search_options["gp_resource_kernel"]; currently supported options are "exp-decay-sum", "exp-decay-combined", "exp-decay-delta1", "freeze-thaw", "matern52", "matern52-res-warp", and "cross-validation". The default choice is "exp-decay-sum", which is inspired by the exponential decay model proposed here. Details about these different models are given here and in the source code.
Decision-making is somewhat more expensive with "gp_multitask" than with "gp_independent", because the notorious cubic scaling of GP inference applies over observations made at all rung levels. However, the extra cost is limited by the fact that by far the most observations are made at the lowest resource level.
Two additional models are selected by search_options["model"] = "gp_expdecay" and search_options["model"] = "gp_issm". The former is the exponential decay model proposed here, the latter is a variant thereof. These additive Gaussian models represent dependencies across resource levels more cheaply than "gp_multitask", and they can be fit to all observed data, not just data at rung levels. Also, joint sampling is cheap. However, at this point, additive Gaussian models remain experimental, and they will not be further discussed here. They can be used with MOBSTER, but not with Hyper-Tune.
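As a sketch, selecting one of these experimental models only requires changing the model option (all other scheduler arguments as above):

search_options = dict(model="gp_expdecay")  # or model="gp_issm"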
MOBSTER combines ASHA and asynchronous Hyperband with GP-based Bayesian optimization. A Gaussian process learning curve surrogate model is fit to the data at all rung levels, and posterior predictive distributions are used to compute acquisition function values and decide which configuration to start next. We distinguish between MOBSTER-JOINT with a GP multi-task model ("gp_multitask") and MOBSTER-INDEP with an independent GP model ("gp_independent"), as detailed above. The acquisition function is expected improvement (EI) at a single rung level $r_{acq}$.
A launcher script for (asynchronous) MOBSTER-JOINT is given in launch_method.py, passing method="MOBSTER-JOINT". The searcher can be configured with search_options, but MOBSTER-JOINT with the "exp-decay-sum" covariance model is the default.
As shown below, MOBSTER can outperform ASHA significantly. It achieves this by starting far fewer trials that get stopped very early (after one epoch) due to poor performance. Essentially, MOBSTER rapidly learns some important properties of the NASBench-201 problem and avoids the basic mistakes which random sampling of configurations keeps running into at a constant rate. While ASHA stops such poor trials early, they still take away resources, which MOBSTER can instead spend on longer evaluations of more promising configurations. This advantage of model-based over random-sampling-based multi-fidelity methods is even more pronounced when starting and stopping jobs comes with delays, which are typically present in real-world distributed systems but absent in our simulations.
Different from BOHB, MOBSTER takes pending evaluations into account, i.e., trials which have been started but have not returned metric values yet. This is done by integrating out their metric values via Monte Carlo: we draw a certain number of joint samples over the pending targets and average the acquisition function over these. In the multi-fidelity context, if a trial is running, a pending evaluation is registered for the next rung level it will reach.
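The following self-contained NumPy/SciPy sketch illustrates the principle on synthetic numbers; it is not Syne Tune's internal code. Each joint fantasy sample of the pending targets induces its own predictive mean and standard deviation at a candidate configuration, and the acquisition value is the average of EI over these samples:

import numpy as np
from scipy.stats import norm

def expected_improvement(mean, std, best):
    # EI for minimization of the metric
    z = (best - mean) / std
    return std * (z * norm.cdf(z) + norm.pdf(z))

rng = np.random.default_rng(0)
S, best_so_far = 20, 0.25  # number of fantasy samples; current best metric
# Hypothetical numbers: predictive mean/stddev at the candidate, one per
# fantasy sample of the pending targets
fantasy_means = 0.3 + 0.05 * rng.standard_normal(S)
fantasy_stds = np.full(S, 0.1)

# Integrating out the pending targets: average EI over the fantasy samples
acq_value = expected_improvement(fantasy_means, fantasy_stds, best_so_far).mean()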
Why is the surrogate model in MOBSTER-JOINT fit to the data at rung levels only? After all, training scripts tend to report validation errors after each epoch, so why not use all this data? Syne Tune allows you to do so (for the "gp_multitask" model), by passing searcher_data="all" when creating the HyperbandScheduler (another intermediate option is searcher_data="rungs_and_last"). However, while this may lead to a more accurate model, it also becomes more expensive to fit, and it does not tend to make a difference in practice, so the default searcher_data="rungs" is recommended.
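For example (a sketch; variables as in the first snippet):

scheduler = HyperbandScheduler(
    config_space,
    type="promotion",
    searcher="bayesopt",
    search_options=dict(model="gp_multitask"),
    searcher_data="all",  # fit the surrogate to metrics from every epoch
    metric=benchmark.metric,
    mode=benchmark.mode,
    resource_attr=resource_attr,
    max_resource_attr=max_resource_attr,
)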
Finally, we can also combine ASHA with BOHB decision-making, by choosing searcher="kde" in HyperbandScheduler. This is an asynchronous version of BOHB.
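A sketch of this combination (variables as above):

scheduler = HyperbandScheduler(
    config_space,
    type="promotion",
    searcher="kde",  # TPE-style KDE model, as in BOHB
    metric=benchmark.metric,
    mode=benchmark.mode,
    resource_attr=resource_attr,
    max_resource_attr=max_resource_attr,
)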
A launcher script for (asynchronous) MOBSTER-INDEP is given in launch_method.py, passing method="MOBSTER-INDEP". The independent GPs model is selected by search_options["model"] = "gp_independent".
MOBSTER tends to perform slightly better with a joint multi-task GP model than with an independent GPs model, justifying the Syne Tune default. In our experience so far, changing the covariance model in MOBSTER-JOINT has only marginal impact.
Just like ASHA can be run with multiple brackets, so can MOBSTER, simply by selecting the brackets argument when creating HyperbandScheduler. In our experience so far, just like with ASHA, MOBSTER tends to work best with a single bracket.
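For example, MOBSTER with several brackets can be sketched as follows (the count of 4 is illustrative only):

scheduler = HyperbandScheduler(
    config_space,
    type="promotion",
    searcher="bayesopt",
    brackets=4,  # each new trial is assigned to one of 4 brackets
    metric=benchmark.metric,
    mode=benchmark.mode,
    resource_attr=resource_attr,
    max_resource_attr=max_resource_attr,
)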
Hyper-Tune is a model-based extension of ASHA with some additional features compared to MOBSTER. It can be seen as extending MOBSTER-INDEP (with the "gp_independent" surrogate model) in two ways. First, it uses an acquisition function based on an ensemble predictive distribution over rung levels, while MOBSTER relies on the predictive distribution at a single rung level $r_{acq}$. Second, when run with more than one bracket, it samples the bracket for a new trial from an adaptive distribution, as discussed below.
Before diving into details, a launcher script for Hyper-Tune (with one bracket) is given in launch_method.py, passing method="HYPERTUNE-INDEP". The searcher can be configured with search_options, but the independent GPs model "gp_independent" is the default. In this example, Hyper-Tune uses a single bracket, so the difference to MOBSTER-INDEP is due to the ensemble predictive distribution for the acquisition function.
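The same setup can be sketched directly, assuming the searcher name "hypertune" (used by Syne Tune's HyperTune baseline on top of HyperbandScheduler); other variables as in the first snippet:

scheduler = HyperbandScheduler(
    config_space,
    type="promotion",
    searcher="hypertune",  # ensemble acquisition, as in Hyper-Tune
    search_options=dict(model="gp_independent"),  # the default for this searcher
    metric=benchmark.metric,
    mode=benchmark.mode,
    resource_attr=resource_attr,
    max_resource_attr=max_resource_attr,
)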
Syne Tune also implements Hyper-Tune with the GP multi-task surrogate models used in MOBSTER. In result plots for this tutorial, the original Hyper-Tune is called HYPERTUNE-INDEP, while this latter variant is called HYPERTUNE-JOINT. A launcher script is given in launch_method.py, passing method="HYPERTUNE-JOINT".
Just like ASHA and MOBSTER, Hyper-Tune can also be run with multiple brackets, simply by using the brackets argument of HyperbandScheduler. If brackets > 1, Hyper-Tune samples the bracket for a new trial from an adaptive distribution closely related to the ensemble distribution used for acquisitions. A launcher script is given in launch_method.py, passing method="HYPERTUNE4-INDEP".
Recall that both ASHA and MOBSTER tend to work better for one than for multiple brackets. This may well be due to the fixed, non-adaptive distribution that brackets are sampled from. Ideally, a method would learn over time whether a low rung level tends to be reliable in predicting the ordering at higher ones, or whether it should rather be avoided (and brackets starting at higher rung levels be preferred). This is precisely what Hyper-Tune's adaptive bracket distribution aims to do.
In this section, we provide some details about Hyper-Tune and our implementation.
The Hyper-Tune extensions are based on a quantification of consistency of data on different rung levels. For example, assume that $r < r_{*}$ are two rung levels, with sufficiently many points at $r_{*}$. If $\mathcal{X}_{*}$ collects trials with data at $r_{*}$, all these have also been observed at $r$. Sampling $f(\mathcal{X}_{*}, r)$ from the posterior distribution of the surrogate model, we can compare the ordering of these predictions at $r$ with the ordering of the observations at $r_{*}$, by way of a pair-wise ranking loss. A large loss value casts doubt on the reliability of rung level $r$ for predicting outcomes at $r_{*}$.
At any point during the algorithm, denote by $r_{*}$ the largest rung level with a sufficient number of observations (our implementation requires 6 points). Assuming that $r_{*} > r_{min}$, we can estimate a distribution $[\theta_r]$ over the rung levels $\mathcal{R}_{*} = \{ r\in\mathcal{R} \,:\, r \le r_{*} \}$ as follows. We draw $S$ samples from the posterior distribution and compute ranking loss values $l_{r, s}$ for each $r\in\mathcal{R}_{*}$ and sample $s$, along with the argmin indicator $\mathrm{I}[l_{r, s} = m_s]$, where $m_s = \min(l_{r, s} \,:\, r\in\mathcal{R}_{*})$. The distribution $[\theta_r]$ is obtained as the normalized sum of these indicators over $s=1,\dots, S$. We also need to compute the loss values $l_{r_{*}, s}$; this is done using a cross-validation approximation, see here or our code for details.
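A minimal NumPy sketch of this estimate on synthetic data (not Syne Tune's implementation; the cross-validation losses at $r_{*}$ itself are omitted):

import numpy as np

def pairwise_ranking_loss(pred, target):
    # Fraction of pairs (i, j) whose ordering differs between pred and target
    n, miss, total = len(pred), 0, 0
    for i in range(n):
        for j in range(i + 1, n):
            total += 1
            miss += (pred[i] < pred[j]) != (target[i] < target[j])
    return miss / total

rng = np.random.default_rng(0)
S, n = 10, 8                # posterior samples; trials observed at r_*
targets = rng.random(n)     # observed metric values at r_*
rungs = [1, 3, 9]           # rung levels r < r_* in R_* (illustrative)
# Synthetic posterior samples f(X_*, r): noisier at lower rung levels
samples = {
    r: targets + rng.normal(0.0, 0.2 / (k + 1), size=(S, n))
    for k, r in enumerate(rungs)
}

# l_{r,s}: ranking loss of sample s at rung r against observations at r_*
losses = np.array(
    [[pairwise_ranking_loss(samples[r][s], targets) for r in rungs]
     for s in range(S)]
)
# Argmin indicators I[l_{r,s} = m_s], summed over s and normalized -> theta_r
indicators = losses == losses.min(axis=1, keepdims=True)
theta = indicators.sum(axis=0) / indicators.sum()
print(dict(zip(rungs, theta)))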
In the beginning, with too little data at the second rung level, this adaptive mechanism cannot be used, and a default choice is made instead. Decisions about a new configuration are based on an acquisition function over a predictive distribution indexed by the rung level: whereas MOBSTER uses EI at a single rung level $r_{acq}$, Hyper-Tune uses an ensemble predictive distribution weighted by $[\theta_r]$, so that rung levels deemed more reliable have a larger influence on the decision.
Note that our implementation generalizes Hyper-Tune in that ranking losses and the distribution $[\theta_r]$ are supported not only for the independent GPs model, but also for the multi-task GP surrogate models (as in HYPERTUNE-JOINT).
If Hyper-Tune is used with more than one bracket, the distribution $[\theta_r]$ is also used to sample the bracket for each new trial, replacing the fixed bracket distribution of ASHA and MOBSTER.
In the next section, we provide some empirical comparison of all the methods discussed so far.