From 358227eee728c257dc37f4e26963e7027b071293 Mon Sep 17 00:00:00 2001
From: "github-actions[bot]"
Date: Sat, 9 Nov 2024 03:50:56 +0000
Subject: [PATCH] Added navbar and removed insert_navbar.sh

---
 index.html                                 |   1 +
 previews/PR129/elbo/overview/index.html    | 461 ++++++++++++++++++++-
 previews/PR129/elbo/repgradelbo/index.html | 461 ++++++++++++++++++++-
 previews/PR129/examples/index.html         | 461 ++++++++++++++++++++-
 previews/PR129/families/index.html         | 461 ++++++++++++++++++++-
 previews/PR129/general/index.html          | 461 ++++++++++++++++++++-
 previews/PR129/index.html                  | 461 ++++++++++++++++++++-
 previews/PR129/optimization/index.html     | 461 ++++++++++++++++++++-
 8 files changed, 3221 insertions(+), 7 deletions(-)

diff --git a/index.html b/index.html
index 6a5afc30..3ac25969 100644
--- a/index.html
+++ b/index.html
@@ -1,2 +1,3 @@
+
diff --git a/previews/PR129/elbo/overview/index.html b/previews/PR129/elbo/overview/index.html
index e1f89c80..5f2599da 100644
--- a/previews/PR129/elbo/overview/index.html
+++ b/previews/PR129/elbo/overview/index.html
@@ -1,2 +1,461 @@
-Overview · AdvancedVI.jl

Evidence Lower Bound Maximization

Introduction

Evidence lower bound (ELBO) maximization[JGJS1999] is a general family of algorithms that minimize the exclusive (or reverse) Kullback-Leibler (KL) divergence between the target distribution $\pi$ and a variational approximation $q_{\lambda}$. More generally, they aim to solve the following problem:

\[ \mathrm{minimize}_{q \in \mathcal{Q}}\quad \mathrm{KL}\left(q, \pi\right),\]

where $\mathcal{Q}$ is some family of distributions, often called the variational family. Since the target distribution $\pi$ is intractable in general, the KL divergence is also intractable. Instead, the ELBO maximization strategy maximizes a surrogate objective, the ELBO:

\[ \mathrm{ELBO}\left(q\right) \triangleq \mathbb{E}_{\theta \sim q} \log \pi\left(\theta\right) + \mathbb{H}\left(q\right),\]

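which is a lower bound on the log evidence (the log normalizing constant of the target). Maximizing the ELBO is equivalent to minimizing the KL divergence above, and the ELBO and its gradient can be readily estimated through various strategies. Overall, ELBO maximization algorithms aim to solve the problem: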

\[ \mathrm{maximize}_{q \in \mathcal{Q}}\quad \mathrm{ELBO}\left(q\right).\]

Multiple ways to solve this problem exist, each leading to a different variational inference algorithm.

Algorithms

Currently, AdvancedVI only provides the approach known as black-box variational inference (also known as Monte Carlo VI or stochastic gradient VI), which was introduced independently by two groups in 2014[RGB2014][TL2014]. In particular, AdvancedVI focuses on the reparameterization gradient estimator[TL2014][RMW2014][KW2014], which generally outperforms alternative strategies[XQKS2019] and is discussed in the following section.

  • JGJS1999Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., & Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine learning, 37, 183-233.
  • TL2014Titsias, M., & Lázaro-Gredilla, M. (2014). Doubly stochastic variational Bayes for non-conjugate inference. In International Conference on Machine Learning.
  • RMW2014Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning.
  • KW2014Kingma, D. P., & Welling, M. (2014). Auto-encoding variational bayes. In International Conference on Learning Representations.
  • XQKS2019Xu, M., Quiroz, M., Kohn, R., & Sisson, S. A. (2019). Variance reduction properties of the reparameterization trick. In International Conference on Artificial Intelligence and Statistics.
  • RGB2014Ranganath, R., Gerrish, S., & Blei, D. (2014). Black box variational inference. In Artificial Intelligence and Statistics.
+Overview · AdvancedVI.jl + + + + + +

Evidence Lower Bound Maximization

Introduction

Evidence lower bound (ELBO) maximization[JGJS1999] is a general family of algorithms that minimize the exclusive (or reverse) Kullback-Leibler (KL) divergence between the target distribution $\pi$ and a variational approximation $q_{\lambda}$. More generally, they aim to solve the following problem:

\[ \mathrm{minimize}_{q \in \mathcal{Q}}\quad \mathrm{KL}\left(q, \pi\right),\]

where $\mathcal{Q}$ is some family of distributions, often called the variational family. Since the target distribution $\pi$ is intractable in general, the KL divergence is also intractable. Instead, the ELBO maximization strategy maximizes a surrogate objective, the ELBO:

\[ \mathrm{ELBO}\left(q\right) \triangleq \mathbb{E}_{\theta \sim q} \log \pi\left(\theta\right) + \mathbb{H}\left(q\right),\]

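which is a lower bound on the log evidence (the log normalizing constant of the target). Maximizing the ELBO is equivalent to minimizing the KL divergence above, and the ELBO and its gradient can be readily estimated through various strategies. Overall, ELBO maximization algorithms aim to solve the problem: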

\[ \mathrm{maximize}_{q \in \mathcal{Q}}\quad \mathrm{ELBO}\left(q\right).\]

Multiple ways to solve this problem exist, each leading to a different variational inference algorithm.
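The link between the two problems above can be made explicit with a short derivation. As a notational assumption that the text above does not introduce, write $\tilde{\pi} = Z \pi$ for the unnormalized target density that can actually be evaluated, with $Z$ its normalizing constant (the evidence), and read the $\log \pi$ inside the ELBO as $\log \tilde{\pi}$. Then

\[\begin{aligned} \mathrm{KL}\left(q, \pi\right) &= \mathbb{E}_{\theta \sim q}\left[ \log q\left(\theta\right) - \log \pi\left(\theta\right) \right] \\ &= -\mathbb{H}\left(q\right) - \mathbb{E}_{\theta \sim q} \log \tilde{\pi}\left(\theta\right) + \log Z \\ &= \log Z - \mathrm{ELBO}\left(q\right). \end{aligned}\]

Since $\mathrm{KL}\left(q, \pi\right) \geq 0$, this gives $\mathrm{ELBO}\left(q\right) \leq \log Z$, and since $\log Z$ does not depend on $q$, maximizing the ELBO over $\mathcal{Q}$ and minimizing the KL divergence over $\mathcal{Q}$ share the same solutions.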

Algorithms

Currently, AdvancedVI only provides the approach known as black-box variational inference (also known as Monte Carlo VI or stochastic gradient VI), which was introduced independently by two groups in 2014[RGB2014][TL2014]. In particular, AdvancedVI focuses on the reparameterization gradient estimator[TL2014][RMW2014][KW2014], which generally outperforms alternative strategies[XQKS2019] and is discussed in the following section.

  • JGJS1999Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., & Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine learning, 37, 183-233.
  • TL2014Titsias, M., & Lázaro-Gredilla, M. (2014). Doubly stochastic variational Bayes for non-conjugate inference. In International Conference on Machine Learning.
  • RMW2014Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning.
  • KW2014Kingma, D. P., & Welling, M. (2014). Auto-encoding variational bayes. In International Conference on Learning Representations.
  • XQKS2019Xu, M., Quiroz, M., Kohn, R., & Sisson, S. A. (2019). Variance reduction properties of the reparameterization trick. In International Conference on Artificial Intelligence and Statistics.
  • RGB2014Ranganath, R., Gerrish, S., & Blei, D. (2014). Black box variational inference. In Artificial Intelligence and Statistics.
+
diff --git a/previews/PR129/elbo/repgradelbo/index.html b/previews/PR129/elbo/repgradelbo/index.html
index 5f03957b..0ec612c3 100644
--- a/previews/PR129/elbo/repgradelbo/index.html
+++ b/previews/PR129/elbo/repgradelbo/index.html
@@ -1,5 +1,463 @@
-Reparameterization Gradient Estimator · AdvancedVI.jl

Reparameterization Gradient Estimator

Overview

The reparameterization gradient[TL2014][RMW2014][KW2014] is an unbiased gradient estimator of the ELBO. Consider some variational family

\[\mathcal{Q} = \{q_{\lambda} \mid \lambda \in \Lambda \},\]

where $\lambda$ denotes the variational parameters of $q_{\lambda}$. If its sampling process can be described by some differentiable reparameterization function $\mathcal{T}_{\lambda}$ and a base distribution $\varphi$ independent of $\lambda$ such that

\[z \sim q_{\lambda} \qquad\Leftrightarrow\qquad +Reparameterization Gradient Estimator · AdvancedVI.jl + + +

+ + +

Reparameterization Gradient Estimator

Overview

The reparameterization gradient[TL2014][RMW2014][KW2014] is an unbiased gradient estimator of the ELBO. Consider some variational family

\[\mathcal{Q} = \{q_{\lambda} \mid \lambda \in \Lambda \},\]

where $\lambda$ denotes the variational parameters of $q_{\lambda}$. If its sampling process can be described by some differentiable reparameterization function $\mathcal{T}_{\lambda}$ and a base distribution $\varphi$ independent of $\lambda$ such that

\[z \sim q_{\lambda} \qquad\Leftrightarrow\qquad z \stackrel{d}{=} \mathcal{T}_{\lambda}\left(\epsilon\right);\quad \epsilon \sim \varphi\]

we can effectively estimate the gradient of the ELBO by directly differentiating the following Monte Carlo estimate with respect to $\lambda$:

\[ \widehat{\mathrm{ELBO}}\left(\lambda\right) = \frac{1}{M}\sum^M_{m=1} \log \pi\left(\mathcal{T}_{\lambda}\left(\epsilon_m\right)\right) + \mathbb{H}\left(q_{\lambda}\right),\]

where $\epsilon_m \sim \varphi$ are Monte Carlo samples that are held fixed during differentiation. The resulting gradient estimator is called the reparameterization gradient estimator.
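The estimator above can be sketched in a few lines of plain Julia. This is only an illustration under stated assumptions, not AdvancedVI's implementation: the target `logtarget` is a hypothetical standard normal, the variational family is a mean-field Gaussian parameterized as $\lambda = (m, \log \sigma)$, and central finite differences stand in for an automatic differentiation backend so that the block needs nothing beyond the standard library. All function names are made up for this example.

```julia
using Random, Statistics

# Hypothetical target (an assumption for illustration): unnormalized
# log-density of a standard normal.
logtarget(z) = -0.5 * sum(abs2, z)

# Mean-field Gaussian with λ = [m; logσ]; T_λ(ϵ) = m + exp(logσ) .* ϵ.
function reparam(λ, ϵ)
    d = length(λ) ÷ 2
    return λ[1:d] .+ exp.(λ[(d + 1):end]) .* ϵ
end

# Closed-form entropy of a diagonal Gaussian.
function gauss_entropy(λ)
    d = length(λ) ÷ 2
    return sum(λ[(d + 1):end]) + 0.5 * d * (1 + log(2π))
end

# Monte Carlo ELBO estimate for fixed base samples (the columns of ϵs).
function elbo_hat(λ, ϵs)
    return mean(logtarget(reparam(λ, ϵ)) for ϵ in eachcol(ϵs)) + gauss_entropy(λ)
end

# Central finite differences, used here in place of automatic differentiation.
function fd_gradient(f, λ; h=1e-5)
    g = similar(λ)
    for i in eachindex(λ)
        e = zeros(length(λ))
        e[i] = h
        g[i] = (f(λ .+ e) - f(λ .- e)) / (2h)
    end
    return g
end

rng = Random.MersenneTwister(1)
d, M = 2, 64
ϵs = randn(rng, d, M)          # ϵ_m ~ φ = N(0, I), drawn once and held fixed
λ = [1.0, -1.0, 0.0, 0.0]      # m = (1, -1), logσ = (0, 0)
grad = fd_gradient(l -> elbo_hat(l, ϵs), λ)
```

For this particular target, the gradient with respect to $m$ has the closed form $-\frac{1}{M}\sum_m \mathcal{T}_{\lambda}(\epsilon_m)$, which gives a simple way to check the result. The key point is that the same `ϵs` are reused for every evaluation of `elbo_hat`, so the derivative passes through $\mathcal{T}_{\lambda}$ rather than through the sampling step.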

In addition to the reparameterization gradient, AdvancedVI provides the following features:

  1. Posteriors with constrained supports are handled through Bijectors, which is known as the automatic differentiation VI (ADVI; [KTRGB2017]) formulation. (See this section.)
  2. The gradient of the entropy can be estimated through various strategies depending on the capabilities of the variational family. (See this section.)

The RepGradELBO Objective

To use the reparameterization gradient, AdvancedVI provides the following variational objective:

AdvancedVI.RepGradELBOType
RepGradELBO(n_samples; kwargs...)

Evidence lower-bound objective with the reparameterization gradient formulation[TL2014][RMW2014][KW2014]. This computes the evidence lower-bound (ELBO) through the formulation:

\[\begin{aligned} \mathrm{ELBO}\left(\lambda\right) &\triangleq @@ -44,3 +502,4 @@ return scale_diag .* std_samples .+ location end nothing

(Note that this is a quick-and-dirty example, and there are more sophisticated ways to implement this.)

By plotting the ELBO, we can see the effect of quasi-Monte Carlo. We can see that quasi-Monte Carlo results in much lower variance than naive Monte Carlo. However, similarly to the STL example, just looking at the ELBO is often insufficient to really judge performance. Instead, let's look at the distance to the global optimum:

QMC yields an additional order of magnitude in accuracy. Also, unlike STL, it ever so slightly accelerates convergence. This is because quasi-Monte Carlo uniformly reduces variance, unlike STL, which reduces variance only near the optimum.

  • TL2014Titsias, M., & Lázaro-Gredilla, M. (2014). Doubly stochastic variational Bayes for non-conjugate inference. In International Conference on Machine Learning.
  • RMW2014Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning.
  • KW2014Kingma, D. P., & Welling, M. (2014). Auto-encoding variational bayes. In International Conference on Learning Representations.
  • KTRGB2017Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., & Blei, D. M. (2017). Automatic differentiation variational inference. Journal of Machine Learning Research.
  • DLTBV2017Dillon, J. V., Langmore, I., Tran, D., Brevdo, E., Vasudevan, S., Moore, D., ... & Saurous, R. A. (2017). Tensorflow distributions. arXiv.
  • FXTYG2020Fjelde, T. E., Xu, K., Tarek, M., Yalburgi, S., & Ge, H. (2020). Bijectors.jl: Flexible transformations for probability distributions. In Symposium on Advances in Approximate Bayesian Inference.
  • RWD2017Roeder, G., Wu, Y., & Duvenaud, D. K. (2017). Sticking the landing: Simple, lower-variance gradient estimators for variational inference. Advances in Neural Information Processing Systems, 30.
  • KMG2024Kim, K., Ma, Y., & Gardner, J. (2024). Linear Convergence of Black-Box Variational Inference: Should We Stick the Landing?. In International Conference on Artificial Intelligence and Statistics (pp. 235-243). PMLR.
  • BWM2018Buchholz, A., Wenzel, F., & Mandt, S. (2018). Quasi-monte carlo variational inference. In International Conference on Machine Learning.
+
diff --git a/previews/PR129/examples/index.html b/previews/PR129/examples/index.html
index 5e6f2023..a0c3d1b8 100644
--- a/previews/PR129/examples/index.html
+++ b/previews/PR129/examples/index.html
@@ -1,5 +1,463 @@
-Examples · AdvancedVI.jl

Evidence Lower Bound Maximization

In this tutorial, we will work with a normal-log-normal model.

\[\begin{aligned} +Examples · AdvancedVI.jl + + +

+ + +

Evidence Lower Bound Maximization

In this tutorial, we will work with a normal-log-normal model.

\[\begin{aligned} x &\sim \mathrm{LogNormal}\left(\mu_x, \sigma_x^2\right) \\ y &\sim \mathcal{N}\left(\mu_y, \sigma_y^2\right) \end{aligned}\]

BBVI with Bijectors.Exp bijectors is able to infer this model exactly.
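The reason exact inference is possible can be spelled out with a change of variables. As a sketch, assuming that $x$ and $y$ are independent as the model above is written and that the variational family is Gaussian on the unconstrained space (an assumption about the tutorial's setup), let $\eta = \log x$ be the unconstrained variable that the exponential bijector maps back to $x = \exp\left(\eta\right)$. Then

\[\begin{aligned} \eta &\sim \mathcal{N}\left(\mu_x, \sigma_x^2\right), \\ \left(\eta, y\right) &\sim \mathcal{N}\left( \begin{bmatrix} \mu_x \\ \mu_y \end{bmatrix}, \begin{bmatrix} \sigma_x^2 & 0 \\ 0 & \sigma_y^2 \end{bmatrix} \right). \end{aligned}\]

The target is therefore exactly Gaussian in $\left(\eta, y\right)$, so a Gaussian approximation pushed through $\exp$ on the first coordinate can match it with zero KL divergence.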

Using the LogDensityProblems interface, the model can be defined as follows:

using LogDensityProblems
@@ -67,3 +525,4 @@
 plot(t, y; label="BBVI", xlabel="Iteration", ylabel="ELBO")
 savefig("bbvi_example_elbo.svg")
 nothing

Further information can be gathered by defining your own callback!.

The final ELBO can be estimated by calling the objective directly with a different number of Monte Carlo samples as follows:

estimate_objective(objective, q_avg_trans, model; n_samples=10^4)
-0.026108045903352917
+
diff --git a/previews/PR129/families/index.html b/previews/PR129/families/index.html
index f2878b9a..92311d03 100644
--- a/previews/PR129/families/index.html
+++ b/previews/PR129/families/index.html
@@ -1,5 +1,463 @@
-Variational Families · AdvancedVI.jl

Reparameterizable Variational Families

The RepGradELBO objective assumes that the members of the variational family have a differentiable sampling path. We provide multiple pre-packaged variational families that can be readily used.

The LocationScale Family

The location-scale variational family is a family of probability distributions whose sampling process can be represented as

\[z \sim q_{\lambda} \qquad\Leftrightarrow\qquad +Variational Families · AdvancedVI.jl + + +

+ + +

Reparameterizable Variational Families

The RepGradELBO objective assumes that the members of the variational family have a differentiable sampling path. We provide multiple pre-packaged variational families that can be readily used.

The LocationScale Family

The location-scale variational family is a family of probability distributions whose sampling process can be represented as

\[z \sim q_{\lambda} \qquad\Leftrightarrow\qquad z \stackrel{d}{=} C u + m;\quad u \sim \varphi\]

where $C$ is the scale, $m$ is the location, and $\varphi$ is the base distribution. $m$ and $C$ form the variational parameters $\lambda = (m, C)$ of $q_{\lambda}$. The location-scale family encompasses many practical variational families, which can be instantiated by setting the base distribution of $u$ and the structure of $C$.

The probability density is given by

\[ q_{\lambda}(z) = {|C|}^{-1} \varphi(C^{-1}(z - m)),\]

the covariance is given as

\[ \mathrm{Var}\left(q_{\lambda}\right) = C \mathrm{Var}(\varphi) C^{\top}\]

and the entropy is given as

\[ \mathbb{H}(q_{\lambda}) = \mathbb{H}(\varphi) + \log |C|,\]

where $\mathbb{H}(\varphi)$ is the entropy of the base distribution. Notice that $\mathbb{H}(\varphi)$ does not depend on $\lambda$; only the $\log |C|$ term does. The derivative of the entropy with respect to $\lambda$ is thus independent of the base distribution.
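The covariance and entropy identities above can be checked numerically for the Gaussian case. The following is a minimal sketch that uses only the standard library and assumes a standard normal base distribution, so that $\mathrm{Var}(\varphi) = I$; the specific values of `C` are arbitrary choices for illustration.

```julia
using LinearAlgebra

d = 3
# Arbitrary lower-triangular scale with a positive diagonal.
C = [2.0 0.0 0.0; 0.5 1.0 0.0; -0.3 0.2 0.25]

# Entropy of the base distribution φ = N(0, I).
H_base = 0.5 * d * (1 + log(2π))

# Entropy of q_λ through the location-scale identity H(φ) + log|C|.
# For a triangular C, log|C| is the sum of the log absolute diagonal entries.
H_q = H_base + sum(log, abs.(diag(C)))

# Reference value: entropy of a Gaussian with covariance Σ = C Var(φ) Cᵀ = C Cᵀ.
Σ = C * C'
H_gauss = 0.5 * logdet(2π * ℯ * Σ)
```

Both routes give the same number up to floating-point error. The location $m$ does not appear anywhere in the computation, which reflects the fact that the entropy depends on $\lambda$ only through $C$.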

API

Note

For stable convergence, the initial scale needs to be sufficiently large and well-conditioned. Initializing scale to have small eigenvalues will often result in initial divergences and numerical instabilities.

AdvancedVI.MvLocationScaleType
MvLocationScale(location, scale, dist; scale_eps)

The location scale variational family broadly represents various variational families using location and scale variational parameters.

It generally represents any distribution for which the sampling path can be represented as follows:

  d = length(location)
  u = rand(dist, d)
  z = scale*u + location

scale_eps sets a constraint on the smallest value of scale to be enforced during optimization. This is necessary to guarantee stable convergence.

Keyword Arguments

  • scale_eps: Lower bound constraint for the diagonal of the scale. (default: 1e-4).
source

The following are specialized constructors for convenience:

AdvancedVI.FullRankGaussianFunction
FullRankGaussian(μ, L; scale_eps)

Construct a Gaussian variational approximation with a dense covariance matrix.

Arguments

  • μ::AbstractVector{T}: Mean of the Gaussian.
  • L::LinearAlgebra.AbstractTriangular{T}: Cholesky factor of the covariance of the Gaussian.

Keyword Arguments

  • scale_eps: Smallest value allowed for the diagonal of the scale. (default: 1e-4).
source
AdvancedVI.MeanFieldGaussianFunction
MeanFieldGaussian(μ, L; scale_eps)

Construct a Gaussian variational approximation with a diagonal covariance matrix.

Arguments

  • μ::AbstractVector{T}: Mean of the Gaussian.
  • L::Diagonal{T}: Diagonal Cholesky factor of the covariance of the Gaussian.

Keyword Arguments

  • scale_eps: Smallest value allowed for the diagonal of the scale. (default: 1e-4).
source

Gaussian Variational Families

using AdvancedVI, LinearAlgebra, Distributions;
@@ -36,3 +494,4 @@
   u_diag = rand(dist, d)
   u_factors = rand(dist, r)
   z = scale_diag.*u_diag + scale_factors*u_factors + location

scale_eps sets a constraint on the smallest value of scale_diag to be enforced during optimization. This is necessary to guarantee stable convergence.

Keyword Arguments

  • scale_eps: Lower bound constraint for the values of scale_diag. (default: sqrt(eps(T))).
source

The logpdf of MvLocationScaleLowRank has an optional argument non_differentiable::Bool (default: false). If set to true, a more efficient $O\left(r d^2\right)$ implementation is used to evaluate the density. This, however, is not differentiable under most AD frameworks due to the use of Cholesky lowrankupdate. The default value, false, uses an $O\left(d^3\right)$ implementation that is differentiable and therefore compatible with the StickingTheLandingEntropy estimator.

The following is a specialized constructor for convenience:

AdvancedVI.LowRankGaussianFunction
LowRankGaussian(μ, D, U; scale_eps)

Construct a Gaussian variational approximation with a diagonal plus low-rank covariance matrix.

Arguments

  • μ::AbstractVector{T}: Mean of the Gaussian.
  • D::Vector{T}: Diagonal of the scale.
  • U::Matrix{T}: Low-rank factors of the scale, where size(U,2) is the rank.

Keyword Arguments

  • scale_eps: Smallest value allowed for the diagonal of the scale. (default: 1e-4).
source
  • ONS2018Ong, V. M. H., Nott, D. J., & Smith, M. S. (2018). Gaussian variational approximation with a factor covariance structure. Journal of Computational and Graphical Statistics, 27(3), 465-478.
+
diff --git a/previews/PR129/general/index.html b/previews/PR129/general/index.html
index b6207a96..e5a59304 100644
--- a/previews/PR129/general/index.html
+++ b/previews/PR129/general/index.html
@@ -1,2 +1,461 @@
-General Usage · AdvancedVI.jl

General Usage

Each VI algorithm provides the following:

  1. Variational families supported by each VI algorithm.
  2. A variational objective corresponding to the VI algorithm. Note that each variational family is subject to its own constraints. Thus, please refer to the documentation of the variational inference algorithm of interest.

Optimizing a Variational Objective

After constructing a variational objective objective and initializing a variational approximation, one can optimize objective by calling optimize:

AdvancedVI.optimizeFunction
optimize(problem, objective, q_init, max_iter, objargs...; kwargs...)

Optimize the variational objective objective targeting the problem problem by estimating (stochastic) gradients.

The trainable parameters in the variational approximation are expected to be extractable through Optimisers.destructure. This requires the variational approximation to be marked as a functor through Functors.@functor.

Arguments

  • objective::AbstractVariationalObjective: Variational Objective.
  • q_init: Initial variational distribution. The variational parameters must be extractable through Optimisers.destructure.
  • max_iter::Int: Maximum number of iterations.
  • objargs...: Arguments to be passed to objective.

Keyword Arguments

  • adtype::ADTypes.AbstractADType: Automatic differentiation backend.
  • optimizer::Optimisers.AbstractRule: Optimizer used for inference. (Default: Adam.)
  • averager::AbstractAverager : Parameter averaging strategy. (Default: NoAveraging())
  • rng::AbstractRNG: Random number generator. (Default: Random.default_rng().)
  • show_progress::Bool: Whether to show the progress bar. (Default: true.)
  • callback: Callback function called after every iteration. See further information below. (Default: nothing.)
  • prog: Progress bar configuration. (Default: ProgressMeter.Progress(n_max_iter; desc="Optimizing", barlen=31, showspeed=true, enabled=prog).)
  • state::NamedTuple: Initial value for the internal state of optimization. Used to warm-start from the state of a previous run. (See the returned values below.)

Returns

  • averaged_params: Variational parameters generated by the algorithm averaged according to averager.
  • params: Last variational parameters generated by the algorithm.
  • stats: Statistics gathered during optimization.
  • state: Collection of the final internal states of optimization. This can be used later to warm-start from the last iteration of the corresponding run.

Callback

The callback function callback has a signature of

callback(; stat, state, params, averaged_params, restructure, gradient)

The arguments are as follows:

  • stat: Statistics gathered during the current iteration. The content will vary depending on objective.
  • state: Collection of the internal states used for optimization.
  • params: Variational parameters.
  • averaged_params: Variational parameters averaged according to the averaging strategy.
  • restructure: Function that restructures the variational approximation from the variational parameters. Calling restructure(param) reconstructs the variational approximation.
  • gradient: The estimated (possibly stochastic) gradient.

callback can return a NamedTuple containing additional information computed within the callback. This will be appended to the statistics of the corresponding iteration. Otherwise, just return nothing.

source

Estimating the Objective

In some cases, it is useful to directly estimate the objective value. This can be done with the following function:

AdvancedVI.estimate_objectiveFunction
estimate_objective([rng,] obj, q, prob; kwargs...)

Estimate the variational objective obj targeting prob with respect to the variational approximation q.

Arguments

  • rng::Random.AbstractRNG: Random number generator.
  • obj::AbstractVariationalObjective: Variational objective.
  • prob: The target log-joint likelihood implementing the LogDensityProblem interface.
  • q: Variational approximation.

Keyword Arguments

Depending on the objective, additional keyword arguments may apply. Please refer to the respective documentation of each variational objective for more info.

Returns

  • obj_est: Estimate of the objective value.
source
Info

Note that estimate_objective is not expected to be differentiated through, and may not result in optimal statistical performance.

Advanced Usage

Each variational objective is a subtype of the following abstract type:

AdvancedVI.AbstractVariationalObjectiveType
AbstractVariationalObjective

Abstract type for the VI algorithms supported by AdvancedVI.

Implementations

To be supported by AdvancedVI, a VI algorithm must implement AbstractVariationalObjective and estimate_objective. Also, it should provide gradients by implementing the function estimate_gradient!. If the estimator is stateful, it can implement init to initialize the state.

source

Furthermore, AdvancedVI only interacts with each variational objective by querying gradient estimates. Therefore, to create a new custom objective to be optimized through AdvancedVI, it suffices to implement the following function:

AdvancedVI.estimate_gradient!Function
estimate_gradient!(rng, obj, adtype, out, prob, params, restructure, obj_state)

Estimate (possibly stochastic) gradients of the variational objective obj targeting prob with respect to the variational parameters λ.

Arguments

  • rng::Random.AbstractRNG: Random number generator.
  • obj::AbstractVariationalObjective: Variational objective.
  • adtype::ADTypes.AbstractADType: Automatic differentiation backend.
  • out::DiffResults.MutableDiffResult: Buffer containing the objective value and gradient estimates.
  • prob: The target log-joint likelihood implementing the LogDensityProblem interface.
  • params: Variational parameters to evaluate the gradient on.
  • restructure: Function that reconstructs the variational approximation from λ.
  • obj_state: Previous state of the objective.

Returns

  • out::MutableDiffResult: Buffer containing the objective value and gradient estimates.
  • obj_state: The updated state of the objective.
  • stat::NamedTuple: Statistics and logs generated during estimation.
source

If an objective needs to be stateful, one can implement the following function to initialize the state.

AdvancedVI.initFunction
init(rng, obj, prob, params, restructure)

Initialize a state of the variational objective obj given the initial variational parameters λ. This function needs to be implemented only if obj is stateful.

Arguments

  • rng::Random.AbstractRNG: Random number generator.
  • obj::AbstractVariationalObjective: Variational objective.
  • params: Initial variational parameters.
  • restructure: Function that reconstructs the variational approximation from λ.
source
init(avg, params)

Initialize the state of the averaging strategy avg with the initial parameters params.

Arguments

  • avg::AbstractAverager: Averaging strategy.
  • params: Initial variational parameters.
source
+General Usage · AdvancedVI.jl + + + + + +

General Usage

Each VI algorithm provides the following:

  1. Variational families supported by each VI algorithm.
  2. A variational objective corresponding to the VI algorithm. Note that each variational family is subject to its own constraints. Thus, please refer to the documentation of the variational inference algorithm of interest.

Optimizing a Variational Objective

After constructing a variational objective objective and initializing a variational approximation, one can optimize objective by calling optimize:

AdvancedVI.optimizeFunction
optimize(problem, objective, q_init, max_iter, objargs...; kwargs...)

Optimize the variational objective objective targeting the problem problem by estimating (stochastic) gradients.

The trainable parameters in the variational approximation are expected to be extractable through Optimisers.destructure. This requires the variational approximation to be marked as a functor through Functors.@functor.

Arguments

  • objective::AbstractVariationalObjective: Variational Objective.
  • q_init: Initial variational distribution. The variational parameters must be extractable through Optimisers.destructure.
  • max_iter::Int: Maximum number of iterations.
  • objargs...: Arguments to be passed to objective.

Keyword Arguments

  • adtype::ADTypes.AbstractADType: Automatic differentiation backend.
  • optimizer::Optimisers.AbstractRule: Optimizer used for inference. (Default: Adam.)
  • averager::AbstractAverager : Parameter averaging strategy. (Default: NoAveraging())
  • rng::AbstractRNG: Random number generator. (Default: Random.default_rng().)
  • show_progress::Bool: Whether to show the progress bar. (Default: true.)
  • callback: Callback function called after every iteration. See further information below. (Default: nothing.)
  • prog: Progress bar configuration. (Default: ProgressMeter.Progress(n_max_iter; desc="Optimizing", barlen=31, showspeed=true, enabled=prog).)
  • state::NamedTuple: Initial value for the internal state of optimization. Used to warm-start from the state of a previous run. (See the returned values below.)

Returns

  • averaged_params: Variational parameters generated by the algorithm averaged according to averager.
  • params: Last variational parameters generated by the algorithm.
  • stats: Statistics gathered during optimization.
  • state: Collection of the final internal states of optimization. This can be used later to warm-start from the last iteration of the corresponding run.

Callback

The callback function callback has a signature of

callback(; stat, state, params, averaged_params, restructure, gradient)

The arguments are as follows:

  • stat: Statistics gathered during the current iteration. The content will vary depending on objective.
  • state: Collection of the internal states used for optimization.
  • params: Variational parameters.
  • averaged_params: Variational parameters averaged according to the averaging strategy.
  • restructure: Function that restructures the variational approximation from the variational parameters. Calling restructure(param) reconstructs the variational approximation.
  • gradient: The estimated (possibly stochastic) gradient.

callback can return a NamedTuple containing additional information computed within the callback. This will be appended to the statistics of the corresponding iteration. Otherwise, just return nothing.

source
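As an illustration of the callback interface described above, the following sketch defines a callback that records the norm of the gradient estimate at every iteration. The name `gradnorm_callback` and the field name `gradient_norm` are hypothetical; only the keyword signature is taken from the documentation above. The block exercises the function with dummy inputs so that it runs without an optimizer.

```julia
using LinearAlgebra

# Hypothetical callback following the keyword signature documented above.
# It returns a NamedTuple, which is appended to the statistics of the iteration.
function gradnorm_callback(; stat, state, params, averaged_params, restructure, gradient)
    return (gradient_norm=norm(gradient),)
end

# Stand-alone check with dummy inputs (no optimizer involved).
info = gradnorm_callback(;
    stat=(elbo=-1.0,),
    state=nothing,
    params=[0.0, 1.0],
    averaged_params=[0.0, 1.0],
    restructure=identity,
    gradient=[3.0, 4.0],
)
```

Such a function would be passed to optimize through the callback keyword argument listed above. A callback that only has side effects, such as printing, would return nothing instead.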

Estimating the Objective

In some cases, it is useful to directly estimate the objective value. This can be done with the following function:

AdvancedVI.estimate_objectiveFunction
estimate_objective([rng,] obj, q, prob; kwargs...)

Estimate the variational objective obj targeting prob with respect to the variational approximation q.

Arguments

  • rng::Random.AbstractRNG: Random number generator.
  • obj::AbstractVariationalObjective: Variational objective.
  • prob: The target log-joint likelihood implementing the LogDensityProblem interface.
  • q: Variational approximation.

Keyword Arguments

Depending on the objective, additional keyword arguments may apply. Please refer to the respective documentation of each variational objective for more info.

Returns

  • obj_est: Estimate of the objective value.
source
Info

Note that estimate_objective is not expected to be differentiated through, and may not result in optimal statistical performance.

Advanced Usage

Each variational objective is a subtype of the following abstract type:

AdvancedVI.AbstractVariationalObjectiveType
AbstractVariationalObjective

Abstract type for the VI algorithms supported by AdvancedVI.

Implementations

To be supported by AdvancedVI, a VI algorithm must implement AbstractVariationalObjective and estimate_objective. Also, it should provide gradients by implementing the function estimate_gradient!. If the estimator is stateful, it can implement init to initialize the state.

source

Furthermore, AdvancedVI only interacts with each variational objective by querying gradient estimates. Therefore, to create a new custom objective to be optimized through AdvancedVI, it suffices to implement the following function:

AdvancedVI.estimate_gradient!Function
estimate_gradient!(rng, obj, adtype, out, prob, params, restructure, obj_state)

Estimate (possibly stochastic) gradients of the variational objective obj targeting prob with respect to the variational parameters λ.

Arguments

  • rng::Random.AbstractRNG: Random number generator.
  • obj::AbstractVariationalObjective: Variational objective.
  • adtype::ADTypes.AbstractADType: Automatic differentiation backend.
  • out::DiffResults.MutableDiffResult: Buffer containing the objective value and gradient estimates.
  • prob: The target log-joint likelihood implementing the LogDensityProblem interface.
  • params: Variational parameters to evaluate the gradient on.
  • restructure: Function that reconstructs the variational approximation from λ.
  • obj_state: Previous state of the objective.

Returns

  • out::MutableDiffResult: Buffer containing the objective value and gradient estimates.
  • obj_state: The updated state of the objective.
  • stat::NamedTuple: Statistics and logs generated during estimation.
source

If an objective needs to be stateful, one can implement the following function to initialize the state.

AdvancedVI.initFunction
init(rng, obj, prob, params, restructure)

Initialize a state of the variational objective obj given the initial variational parameters λ. This function needs to be implemented only if obj is stateful.

Arguments

  • rng::Random.AbstractRNG: Random number generator.
  • obj::AbstractVariationalObjective: Variational objective.
  • params: Initial variational parameters.
  • restructure: Function that reconstructs the variational approximation from λ.
source
init(avg, params)

Initialize the state of the averaging strategy avg with the initial parameters params.

Arguments

  • avg::AbstractAverager: Averaging strategy.
  • params: Initial variational parameters.
source
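The call pattern described above (init producing a state, estimate_gradient! returning the buffer, the updated state, and a statistics tuple) can be sketched in plain Julia. This is a toy under stated assumptions, not AdvancedVI code: the abstract type is redefined locally so the sketch runs without the package, GradBuffer stands in for DiffResults.MutableDiffResult, prob is a NamedTuple with hypothetical logp and grad_logp fields rather than a LogDensityProblems object, and CountingObjective and run_demo are made-up names. Only the argument order and the shape of the return values follow the signatures documented above.

```julia
using Random

# Local stand-in for AdvancedVI.AbstractVariationalObjective (assumption: defined
# here only so the sketch runs without the package).
abstract type AbstractVariationalObjective end

# Hypothetical stateful objective; its state is the number of estimates produced so far.
struct CountingObjective <: AbstractVariationalObjective
    n_samples::Int
end

# Hypothetical buffer standing in for DiffResults.MutableDiffResult.
mutable struct GradBuffer
    value::Float64
    grad::Vector{Float64}
end

# Same argument order as the documented `init`.
init(rng, obj::CountingObjective, prob, params, restructure) = 0

# Same argument order and return triple as the documented `estimate_gradient!`.
# `adtype` is ignored because the toy target supplies its own gradient.
function estimate_gradient!(rng, obj::CountingObjective, adtype, out, prob, params, restructure, obj_state)
    q = restructure(params)
    out.value = 0.0
    fill!(out.grad, 0.0)
    for _ in 1:obj.n_samples
        z = q.mean .+ randn(rng, length(q.mean))      # draw from N(mean, I)
        out.value -= prob.logp(z) / obj.n_samples     # Monte Carlo estimate of -E[log p(z)]
        out.grad .-= prob.grad_logp(z) ./ obj.n_samples
    end
    new_state = obj_state + 1
    return out, new_state, (n_calls = new_state,)
end

# Plain gradient descent driving the objective (assumed loop, not AdvancedVI's optimizer).
function run_demo(rng, obj, prob, params, restructure; iters=200, stepsize=0.1)
    state = init(rng, obj, prob, params, restructure)
    out = GradBuffer(0.0, zeros(length(params)))
    for _ in 1:iters
        out, state, _ = estimate_gradient!(rng, obj, nothing, out, prob, params, restructure, state)
        params = params .- stepsize .* out.grad
    end
    return params, state
end

# Toy target (assumed): unit-variance Gaussian centered at `mu`.
mu = [1.0, -1.0]
prob = (logp = z -> -sum(abs2, z .- mu) / 2, grad_logp = z -> -(z .- mu))
restructure = params -> (mean = params,)
fitted, n_calls = run_demo(MersenneTwister(1), CountingObjective(64), prob, zeros(2), restructure)
```

In this toy, the fitted mean should end up close to mu and the state simply counts the calls; a real objective would typically use the state for something like control variates or cached samples.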
+ diff --git a/previews/PR129/index.html b/previews/PR129/index.html index 96e03f81..76cae8ee 100644 --- a/previews/PR129/index.html +++ b/previews/PR129/index.html @@ -1,2 +1,461 @@ -AdvancedVI · AdvancedVI.jl
+AdvancedVI · AdvancedVI.jl + + + + + +
+ diff --git a/previews/PR129/optimization/index.html b/previews/PR129/optimization/index.html index e32e0e52..6f9dbf25 100644 --- a/previews/PR129/optimization/index.html +++ b/previews/PR129/optimization/index.html @@ -1,2 +1,461 @@ -Optimization · AdvancedVI.jl

Optimization

Parameter-Free Optimization Rules

We provide custom optimization rules that are not provided out-of-the-box by Optimisers.jl. The main theme of the provided optimizers is that they are parameter-free: these optimization rules should require little to no tuning to obtain performance competitive with well-tuned alternatives.

AdvancedVI.DoG - Type
DoG(repsilon)

Distance over gradient (DoG[IHC2023]) optimizer. Its only parameter is the initial guess of the Euclidean distance to the optimum, repsilon. The original paper recommends $ 10^{-4} ( 1 + \lVert \lambda_0 \rVert ) $, but the default value is $ 10^{-6} $.

Parameters

  • repsilon: Initial guess of the Euclidean distance between the initial point and the optimum. (default value: 1e-6)
source
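To make the role of repsilon concrete, the following is a minimal sketch of the DoG step-size rule as described in the DoG paper[IHC2023], written in plain Julia on a toy quadratic. It is not AdvancedVI's implementation: the helper name dog_minimize, its keyword arguments, and the test problem are assumptions made for illustration, and the sketch uses exact rather than stochastic gradients.

```julia
using LinearAlgebra

# Sketch of the DoG rule:
#   step size = (max distance travelled from x0) / sqrt(sum of squared gradient norms).
# `dog_minimize` is a hypothetical helper, not part of AdvancedVI.
function dog_minimize(grad, x0; repsilon=1e-6, iters=1000)
    x = copy(x0)
    rbar = repsilon   # running maximum distance from x0, seeded with the initial guess
    gsum = 0.0        # running sum of squared gradient norms
    for _ in 1:iters
        g = grad(x)
        gsum += sum(abs2, g)
        gsum == 0 && break              # every gradient so far was zero; nothing to do
        x = x - (rbar / sqrt(gsum)) * g
        rbar = max(rbar, norm(x - x0))
    end
    return x
end

# Toy problem (assumed): f(x) = ||x - target||^2 / 2, whose gradient is x - target.
target = [3.0, -2.0]
x0 = zeros(2)

# Default initial guess of 1e-6.
x_default = dog_minimize(x -> x - target, x0)

# Initial guess recommended by the paper: 1e-4 * (1 + ||x0||).
repsilon_paper = 1e-4 * (1 + norm(x0))
x_paper = dog_minimize(x -> x - target, x0; repsilon=repsilon_paper)
```

On this toy problem both choices of repsilon should reach the optimum, since the running maximum rbar grows quickly during the first iterations; the initial guess mainly affects how long that warm-up takes.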
AdvancedVI.DoWG - Type
DoWG(repsilon)

Distance over weighted gradient (DoWG[KMJ2024]) optimizer. Its only parameter is the initial guess of the Euclidean distance to the optimum, repsilon.

Parameters

  • repsilon: Initial guess of the Euclidean distance between the initial point and the optimum. (default value: 1e-6)
source
AdvancedVI.COCOB - Type
COCOB(alpha)

Continuous Coin Betting (COCOB[OT2017]) optimizer. We use the "COCOB-Backprop" variant, which is closer to the Adam optimizer. Its only parameter is the maximum change per parameter α, which shouldn't need much tuning.

Parameters

  • alpha: Scaling parameter. (default value: 100)
source

Parameter Averaging Strategies

In some cases, the best optimization performance is obtained by averaging the sequence of parameters generated by the optimization algorithm. For instance, the DoG[IHC2023] and DoWG[KMJ2024] papers report their best performance through averaging. The benefits of parameter averaging have been specifically confirmed for ELBO maximization[DCAMHV2020].

AdvancedVI.PolynomialAveraging - Type
PolynomialAveraging(eta)

Polynomial averaging rule proposed by Shamir and Zhang[SZ2013]. At iteration t, the parameter average $ \bar{\lambda}_t $ according to the polynomial averaging rule is given as

\[ \bar{\lambda}_t = (1 - w_t) \bar{\lambda}_{t-1} + w_t \lambda_t \, ,\]

where the averaging weight is

\[ w_t = \frac{\eta + 1}{t + \eta} \, .\]

Higher eta ($\eta$) down-weights earlier iterations. When $\eta=0$, this is equivalent to uniformly averaging the iterates in an online fashion. The DoG paper[IHC2023] suggests $\eta=8$.

Parameters

  • eta: Regularization term. (default: 8)
source
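The averaging recursion above is short enough to write out directly. The following plain-Julia sketch applies it to a small, made-up sequence of iterates; the helper names averaging_weight and polynomial_average are assumptions for illustration and are not part of AdvancedVI.

```julia
# Averaging weight w_t = (eta + 1) / (t + eta), for t = 1, 2, ...
averaging_weight(t, eta) = (eta + 1) / (t + eta)

# Online polynomial average of a sequence of iterates (hypothetical helper).
function polynomial_average(iterates, eta)
    avg = zero(first(iterates))
    for (t, lambda_t) in enumerate(iterates)
        w = averaging_weight(t, eta)
        avg = (1 - w) * avg + w * lambda_t   # w_1 = 1, so the first iterate replaces the zero start
    end
    return avg
end

# Made-up iterates for illustration.
iterates = [[1.0, 0.0], [2.0, 2.0], [3.0, 4.0], [6.0, 2.0]]

uniform = polynomial_average(iterates, 0)    # eta = 0: plain running mean
weighted = polynomial_average(iterates, 8)   # eta = 8: later iterates dominate
```

With eta = 0 the result coincides with the arithmetic mean of the iterates, while eta = 8 pulls the average toward the most recent iterates.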
  • [OT2017] Orabona, F., & Tommasi, T. (2017). Training deep networks without learning rates through coin betting. Advances in Neural Information Processing Systems, 30.
  • [SZ2013] Shamir, O., & Zhang, T. (2013). Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In International Conference on Machine Learning (pp. 71-79). PMLR.
  • [DCAMHV2020] Dhaka, A. K., Catalina, A., Andersen, M. R., Magnusson, M., Huggins, J., & Vehtari, A. (2020). Robust, accurate stochastic optimization for variational inference. Advances in Neural Information Processing Systems, 33, 10961-10973.
  • [KMJ2024] Khaled, A., Mishchenko, K., & Jin, C. (2023). DoWG unleashed: An efficient universal parameter-free gradient descent method. Advances in Neural Information Processing Systems, 36, 6748-6769.
  • [IHC2023] Ivgi, M., Hinder, O., & Carmon, Y. (2023). DoG is SGD's best friend: A parameter-free dynamic step size schedule. In International Conference on Machine Learning (pp. 14465-14499). PMLR.
+Optimization · AdvancedVI.jl + + + + + +

Optimization

Parameter-Free Optimization Rules

We provide custom optimization rules that are not provided out-of-the-box by Optimisers.jl. The main theme of the provided optimizers is that they are parameter-free: these optimization rules should require little to no tuning to obtain performance competitive with well-tuned alternatives.

AdvancedVI.DoG - Type
DoG(repsilon)

Distance over gradient (DoG[IHC2023]) optimizer. Its only parameter is the initial guess of the Euclidean distance to the optimum, repsilon. The original paper recommends $ 10^{-4} ( 1 + \lVert \lambda_0 \rVert ) $, but the default value is $ 10^{-6} $.

Parameters

  • repsilon: Initial guess of the Euclidean distance between the initial point and the optimum. (default value: 1e-6)
source
AdvancedVI.DoWG - Type
DoWG(repsilon)

Distance over weighted gradient (DoWG[KMJ2024]) optimizer. Its only parameter is the initial guess of the Euclidean distance to the optimum, repsilon.

Parameters

  • repsilon: Initial guess of the Euclidean distance between the initial point and the optimum. (default value: 1e-6)
source
AdvancedVI.COCOB - Type
COCOB(alpha)

Continuous Coin Betting (COCOB[OT2017]) optimizer. We use the "COCOB-Backprop" variant, which is closer to the Adam optimizer. Its only parameter is the maximum change per parameter α, which shouldn't need much tuning.

Parameters

  • alpha: Scaling parameter. (default value: 100)
source

Parameter Averaging Strategies

In some cases, the best optimization performance is obtained by averaging the sequence of parameters generated by the optimization algorithm. For instance, the DoG[IHC2023] and DoWG[KMJ2024] papers report their best performance through averaging. The benefits of parameter averaging have been specifically confirmed for ELBO maximization[DCAMHV2020].

AdvancedVI.PolynomialAveraging - Type
PolynomialAveraging(eta)

Polynomial averaging rule proposed by Shamir and Zhang[SZ2013]. At iteration t, the parameter average $ \bar{\lambda}_t $ according to the polynomial averaging rule is given as

\[ \bar{\lambda}_t = (1 - w_t) \bar{\lambda}_{t-1} + w_t \lambda_t \, ,\]

where the averaging weight is

\[ w_t = \frac{\eta + 1}{t + \eta} \, .\]

Higher eta ($\eta$) down-weights earlier iterations. When $\eta=0$, this is equivalent to uniformly averaging the iterates in an online fashion. The DoG paper[IHC2023] suggests $\eta=8$.

Parameters

  • eta: Regularization term. (default: 8)
source
  • [OT2017] Orabona, F., & Tommasi, T. (2017). Training deep networks without learning rates through coin betting. Advances in Neural Information Processing Systems, 30.
  • [SZ2013] Shamir, O., & Zhang, T. (2013). Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In International Conference on Machine Learning (pp. 71-79). PMLR.
  • [DCAMHV2020] Dhaka, A. K., Catalina, A., Andersen, M. R., Magnusson, M., Huggins, J., & Vehtari, A. (2020). Robust, accurate stochastic optimization for variational inference. Advances in Neural Information Processing Systems, 33, 10961-10973.
  • [KMJ2024] Khaled, A., Mishchenko, K., & Jin, C. (2023). DoWG unleashed: An efficient universal parameter-free gradient descent method. Advances in Neural Information Processing Systems, 36, 6748-6769.
  • [IHC2023] Ivgi, M., Hinder, O., & Carmon, Y. (2023). DoG is SGD's best friend: A parameter-free dynamic step size schedule. In International Conference on Machine Learning (pp. 14465-14499). PMLR.
+