
Added ReduceLROnPlateau callback for VI. #7011

Open · wants to merge 16 commits into main

Conversation

@alvaropp commented Nov 14, 2023

This has been discussed in the forums and in this issue.

There are strong parallels between PyMC's variational inference and neural network training, where the choice of optimiser and learning rate has a huge impact on training quality and speed. A common technique when training neural networks is to use a learning rate scheduler: starting with a high learning rate gives fast initial progress, and reducing it over successive epochs allows more precision, so convergence is faster overall.

Currently, in PyMC, you need to pick a single suitable learning rate for fitting the model: too large and it won't converge, too small and it will be too slow. Training once with a largish learning rate and then using the result as the starting point for another round with a smaller learning rate is neither trivial nor elegant.

To address this, I've implemented a new callback for pymc.variational, modelled on Keras' ReduceLROnPlateau. It monitors the model loss at every iteration of VI.fit() and reduces the learning rate by a user-specified factor if the loss doesn't improve within a user-specified number of iterations.
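As a rough illustration of the mechanics, here is a minimal sketch of such a callback (this is not the PR's code; it assumes the learning rate lives in a pytensor shared variable, and that callbacks receive (approx, loss_hist, i) as pm.callbacks.Tracker does):

import numpy as np

class ReduceLROnPlateauSketch:
    """Minimal sketch: shrink a shared learning rate when the loss plateaus."""

    def __init__(self, learning_rate, factor=0.1, patience=10, min_lr=1e-8):
        self.learning_rate = learning_rate  # a pytensor shared variable
        self.factor = factor
        self.patience = patience
        self.min_lr = min_lr
        self.best = np.inf
        self.wait = 0

    def __call__(self, approx, loss_hist, i):
        current = np.atleast_1d(loss_hist)[-1]
        if current < self.best:
            self.best = current
            self.wait = 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                # Scale the learning rate down, respecting the floor
                new_lr = max(float(self.learning_rate.get_value()) * self.factor, self.min_lr)
                self.learning_rate.set_value(new_lr)
                self.wait = 0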

Major / Breaking Changes

None (I think!)

New features

Added a ReduceLROnPlateau callback.

Bugfixes

None.

Documentation

None so far, but it would be nice to add an example.

Maintenance

None.


📚 Documentation preview 📚: https://pymc--7011.org.readthedocs.build/en/7011/

welcome bot commented Nov 14, 2023

💖 Thanks for opening this pull request! 💖 The PyMC community really appreciates your time and effort to contribute to the project. Please make sure you have read our Contributing Guidelines and filled in our pull request template to the best of your ability.

@alvaropp marked this pull request as ready for review November 14, 2023 14:41
@alvaropp changed the title from "Added ReduceLROnPlateau and test." to "Added ReduceLROnPlateau." Nov 14, 2023
@alvaropp changed the title from "Added ReduceLROnPlateau." to "Added ReduceLROnPlateau callback for VI." Nov 14, 2023
@alvaropp (Author) commented Nov 14, 2023

@jessegrabowski raised a good point that it's not ideal to require the user to define a shared variable, so I've reworked my callback: it now takes a vanilla PyMC optimiser instance and modifies its learning rate directly:

with pm.Model() as model:
    [...]

    optimiser = pm.adam(learning_rate=0.1)
    fit = pm.fit(
        obj_optimizer=optimiser,
        callbacks=[
            ReduceLROnPlateau(optimiser=optimiser)
        ],
    )

@twiecki requested a review from ricardoV94 November 15, 2023 09:22
codecov bot commented Nov 15, 2023

Codecov Report

Merging #7011 (d375bf0) into main (ad450a6) will increase coverage by 1.46%.
Report is 20 commits behind head on main.
The diff coverage is 80.68%.


@@            Coverage Diff             @@
##             main    #7011      +/-   ##
==========================================
+ Coverage   87.66%   89.12%   +1.46%     
==========================================
  Files         101      101              
  Lines       16964    16982      +18     
==========================================
+ Hits        14871    15136     +265     
+ Misses       2093     1846     -247     
Files Coverage Δ
pymc/variational/opvi.py 74.17% <100.00%> (-13.22%) ⬇️
pymc/variational/updates.py 91.74% <98.86%> (-0.37%) ⬇️
pymc/variational/callbacks.py 55.55% <23.21%> (-40.68%) ⬇️

... and 42 files with indirect coverage changes

@alvaropp (Author) commented
Is it possible to add an example somewhere? Maybe here: https://www.pymc.io/projects/examples/en/latest/gallery.html#variational-inference

@jessegrabowski (Member) left a comment

I still prefer the Torch-style API to the callback API, but this PR is great so I'll let it go. The tests need to be a bit more robust, though. I also want to keep advocating for a slightly more general solution that allows for 1) different learning rate adjustment strategies, and 2) composition of learning rate strategies.

I'm not saying this PR needs to implement every single learning rate scheduler, but it would be nice to have a LearningRateStrategy base class that could be extended in the future, and to subclass ReduceLROnPlateau from that.
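Something along these lines, purely as an illustration (the class names and the callback signature (approx, loss_hist, i) are my assumptions, not code from this PR):

class LearningRateStrategy:
    """Hypothetical base class: subclasses decide how the learning rate evolves."""

    def __init__(self, learning_rate):
        self.learning_rate = learning_rate  # pytensor shared variable

    def new_learning_rate(self, loss_hist, i):
        raise NotImplementedError

    def __call__(self, approx, loss_hist, i):
        self.learning_rate.set_value(self.new_learning_rate(loss_hist, i))

class ExponentialDecayLR(LearningRateStrategy):
    """Example subclass: lr(i) = lr0 * decay_rate ** i."""

    def __init__(self, learning_rate, decay_rate):
        super().__init__(learning_rate)
        self.initial_lr = float(learning_rate.get_value())
        self.decay_rate = decay_rate

    def new_learning_rate(self, loss_hist, i):
        return self.initial_lr * self.decay_rate**i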


def __init__(
    self,
    optimizer,
@jessegrabowski (Member):

Does the user have to provide this? Can it instead be inferred somehow from the host VI object? It's ugly to have to pass the optimizer twice (once for the VI itself, then again in the callback)

@alvaropp (Author):

Well, this would be great, but I haven't figured out whether it's possible. Probably one for someone more familiar with the codebase :)

tests/variational/test_callbacks.py (outdated review thread, resolved)
        self.cooldown_counter = self.cooldown
        self.wait = 0

    def reduce_lr(self):
@jessegrabowski (Member):

I would still prefer that this was done symbolically with shared variables, because it will allow for composition between learning rate annealing strategies.

@alvaropp (Author) commented

> I still prefer the Torch-style API to the callback API, but this PR is great so I'll let it go. The tests need to be a bit more robust, though. I also want to keep advocating for a slightly more general solution that allows for 1) different learning rate adjustment strategies, and 2) composition of learning rate strategies.
>
> I'm not saying this PR needs to implement every single learning rate scheduler, but it would be nice to have a LearningRateStrategy base class that could be extended in the future, and to subclass ReduceLROnPlateau from that.

I've implemented an ExponentialDecay callback and refactored things a bit.
[Plots: learning-rate behaviour for ReduceLROnPlateau, ExponentialDecay, and ExponentialDecay with staircase mode]

@alvaropp (Author) commented

Hi @jessegrabowski, any concrete ideas for taking this PR forward?

As you said, it would be nice to infer the optimiser from the code rather than pass it to the callback, but I don't know how to go about that.

with pm.Model() as model:
    [...]

    optimiser = pm.adam(learning_rate=0.1)
    fit = pm.fit(
        obj_optimizer=optimiser,
        callbacks=[
            ReduceLROnPlateau(optimiser=optimiser)
        ],
    )

@jessegrabowski (Member) commented

@alvaropp sorry for letting this go stale. I'm going to do a careful review now/this weekend so you can get back on track and merged ASAP.

@jessegrabowski (Member) commented Dec 2, 2023

I'm going over the code carefully, and I don't think the callbacks as written do anything. There is a misunderstanding about how compiled pytensor functions work.

The implementation is very nice, and uses the following logic:

  1. There is a learning_rate parameter in every optimizer object.
  2. The optimizer object is called iteratively inside the _iterate_with_loss method of the Inference class.
  3. Therefore, we instantiate the optimizer class once and share it between a callback and the Inference class
  4. At each step, the callback will update the learning_rate of the optimizer
  5. Which will be propagated to the Inference class (since it's the same object).

Unfortunately, this is not at all how compiled pytensor functions work. First, the implementation of the optimizers is deceptive: they are not classes but partial functions, whose only role is to return an updates dictionary.
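This is easy to check directly (a quick illustration; the printed values assume current behaviour):

from functools import partial
import pymc as pm

opt = pm.adam(learning_rate=0.1)
print(isinstance(opt, partial))  # True: without loss/params, pm.adam returns
                                 # a functools.partial of itself
print(opt.keywords)              # {'learning_rate': 0.1}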

This updates dictionary is important, because after a pytensor function is compiled, changes to variables outside the function will have no effect. Consider the following graph:

import pytensor
import pytensor.tensor as pt

x = pt.dscalar('x')
y = 1.0
z = x + y
f = pytensor.function([x], z)

print(f(4))  # 5.0
print(f(5))  # 6.0

y = 2

print(f(4))  # 5.0 -- reassigning y has no effect on the compiled function
What happened? After running pytensor.function to compile the function, the variable y in the python namespace no longer has anything to do with the computation. If we try to naively update it, nothing will happen. This is exactly the situation we are in when running pm.fit. If you check the fit method in the Inference class, the line that actually does all the computation is this innocuous one. What, then, is step_func? It's created here, and as you can see, it's a compiled pytensor function (compiled using compile_pymc rather than pytensor.function, but if you keep digging you'll see pytensor.function is called by compile_pymc). So we're in exactly the case of my code snippet above.

To illustrate what's going on, I made a dummy "optimizer" that always adds the learning rate to the parameters:

from pymc.variational.updates import _get_call_kwargs, get_or_compute_grads
from functools import partial
from collections import OrderedDict

def dummy_optimizer(loss_or_grads=None, params=None, learning_rate=1e-3):
    if loss_or_grads is None and params is None:
        return partial(dummy_optimizer, **_get_call_kwargs(locals()))
    elif loss_or_grads is None or params is None:
        raise ValueError("Please provide both `loss_or_grads` and `params` to get updates")
    # grads = get_or_compute_grads(loss_or_grads, params)
    updates = OrderedDict()

    for param in params:
        new_param = param + learning_rate
        updates[param] = new_param

    return updates

We can use this optimizer to check the effect of the current schedulers. Helper functions:

import numpy as np
import pymc as pm

def make_regression():
    X = np.random.normal(size=(100,))
    beta = np.random.normal()
    alpha = np.random.normal()
    noise = np.random.normal(size=(100,))
    y = alpha + X * beta + noise

    return X, y

def make_model():
    X, y = make_regression()
    with pm.Model() as mod:
        b = pm.Normal('b')
        a = pm.Normal('a')
        # sigma = pm.HalfNormal('sigma')
    
        mu = a + X * b
        y_hat = pm.Normal('y_hat', mu, 1, observed=y)

    return mod

Here's a base case with no scheduling callback. I do use a tracker so we can see how the dummy works:

learning_rate = 1
optimizer = dummy_optimizer(learning_rate=learning_rate)

with make_model():
    advi = pm.ADVI()
    tracker = pm.callbacks.Tracker(mean=advi.approx.mean.eval)
    approx = advi.fit(100, callbacks=[tracker], obj_optimizer=optimizer)

Check that the parameters are deterministically updated by the learning rate at every step:

mean_history = np.r_[tracker['mean']]
np.all(np.diff(mean_history, axis=0) == learning_rate) # True

Now run it again with the ExponentialDecay schedule callback:

learning_rate = 1
optimizer = dummy_optimizer(learning_rate=learning_rate)
scheduler = pm.callbacks.ExponentialDecay(optimizer=optimizer, decay_steps=10, decay_rate=0.5)

with make_model():
    advi = pm.ADVI()
    tracker = pm.callbacks.Tracker(mean=advi.approx.mean.eval)
    approx = advi.fit(100, callbacks=[tracker, scheduler], obj_optimizer=optimizer)

Check the mean history:

mean_history = np.r_[tracker['mean']]
np.all(np.diff(mean_history, axis=0) == learning_rate) # True

As you can see, the learning rate has not changed. We can check that the scheduler updated as expected:

scheduler.optimizer.keywords['learning_rate'] == 1 * 0.5 ** 10 # True

We are in the case of the simple pytensor code snippet posted above. The learning_rate kwarg in the optimizer is updating (this variable plays the part of the y variable in my simple example). But the train has already left the station: the function is compiled, so this variable is totally irrelevant to the computation.

So what is the solution? To interact with a compiled function, you need to use shared variables. These are a special type of pytensor object that can be adjusted between function calls. Returning to the code snippet, let's use a shared variable:

x = pt.dscalar('x')
y = pytensor.shared(1.0, name='y')
z = x + y
f = pytensor.function([x], z)

print(f(4)) # 5.0
print(f(5)) # 6.0

y.set_value(2)

print(f(4)) # 6.0

After using a shared variable, we can adjust the value of y to adjust the output of f, even though it's already compiled.

The way that shared variables can be automatically updated from run to run of a function is through an update dictionary: a mapping from shared variables to the expressions that compute their new values. This is exactly the dictionary returned by the optimizer, as noted above. Let's take a look at the SGD function from here:

def sgd(loss_or_grads, params, learning_rate):
    grads = get_or_compute_grads(loss_or_grads, params)
    updates = OrderedDict()

    for param, grad in zip(params, grads):
        updates[param] = param - learning_rate * grad

    return updates

You can see that it computes the gradients of the loss function, then computes the SGD update rule for each parameter in turn. It stores these in an OrderedDict, where the keys are the old parameters and the values are the updated parameters, and returns the dictionary.
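To make the mechanics concrete, here is a tiny self-contained run (illustrative, using a toy quadratic loss):

import pytensor
from pymc.variational.updates import sgd

w = pytensor.shared(0.0, name='w')
loss = (w - 3.0) ** 2
updates = sgd(loss, [w], learning_rate=0.1)  # OrderedDict {w: w - 0.1 * dloss/dw}
step = pytensor.function([], loss, updates=updates)

for _ in range(5):
    print(step())  # the loss shrinks each call as the update pulls w toward 3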

As a final demonstration, let's adjust the dummy optimizer to have an exponential decay on the learning rate:

def dummy_optimizer_w_decay(loss_or_grads=None, params=None, learning_rate=1e-3):
    if loss_or_grads is None and params is None:
        return partial(dummy_optimizer_w_decay, **_get_call_kwargs(locals()))
    elif loss_or_grads is None or params is None:
        raise ValueError("Please provide both `loss_or_grads` and `params` to get updates")
    # grads = get_or_compute_grads(loss_or_grads, params)
    updates = OrderedDict()
    if not isinstance(learning_rate, pt.sharedvar.SharedVariable):
        learning_rate = pytensor.shared(pm.floatX(learning_rate), name='learning_rate')

    for param in params:
        new_param = param + learning_rate
        updates[param] = new_param

    updates[learning_rate] = learning_rate * 0.9
    return updates

And see how this looks in a model:

learning_rate = 1
optimizer = dummy_optimizer_w_decay(learning_rate=learning_rate)

with make_model():
    advi = pm.ADVI()
    tracker = pm.callbacks.Tracker(mean=advi.approx.mean.eval)
    approx = advi.fit(100, callbacks=[tracker], obj_optimizer=optimizer)

Here's a plot of the tracked parameters. As you can see, we now get the desired updating in the learning rate.

import matplotlib.pyplot as plt

mean_history = np.r_[tracker['mean']]
plt.plot(mean_history)

[Plot: the tracked means now increase in shrinking steps, showing the learning rate decaying each iteration]

@jessegrabowski (Member) commented
So can the callback approach be saved? Unfortunately, I don't think so. By the time the callbacks are invoked, in the fit loop, the step_func function is already compiled. At this stage it's too late to adjust the function without some pretty impressive gymnastics. This is going to have to go back to the drawing board using something like the wrapper approach.
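A minimal sketch of that wrapper idea (with_shared_lr is a hypothetical helper, not PyMC API): intercept the optimizer factory so the learning rate becomes a shared variable that a callback can mutate after compilation.

import pytensor
import pymc as pm

def with_shared_lr(opt_factory, initial_lr):
    # Hypothetical helper: the learning rate becomes a shared variable,
    # so later changes reach the already-compiled step function
    lr = pytensor.shared(pm.floatX(initial_lr), name='learning_rate')
    return opt_factory(learning_rate=lr), lr

optimizer, lr = with_shared_lr(pm.adam, 0.1)
# later, e.g. from inside a callback:
# lr.set_value(lr.get_value() * 0.5)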

@jessegrabowski (Member) commented Dec 2, 2023

I tried my hand at implementing the schedulers using the function approach I suggested on the discourse. Here's how it looks:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import pymc as pm
import pytensor
import pytensor.tensor as pt

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)

Xt = pytensor.shared(X_train)
yt = pytensor.shared(y_train)

with pm.Model() as iris_model:
    # Coefficients for features
    β = pm.Normal("β", 0, sigma=1e2, shape=(4, 3))
    # Transform to unit interval
    a = pm.Normal("a", sigma=1e4, shape=(3,))
    p = pt.special.softmax(Xt.dot(β) + a, axis=-1)

    observed = pm.Categorical("obs", p=p, observed=yt)

opt = pm.adamax(learning_rate=1)
scheduled_optimizer = pm.updates.reduce_lr_on_plateau_scheduler(opt, factor=0.25, cooldown=1000, patience=3000, min_lr=1e-12)
# scheduled_optimizer = pm.updates.exponential_decay_scheduler(opt, decay_steps=1, decay_rate=0.9999, staircase=False, min_lr=1e-8)

with iris_model:
    inference = pm.ADVI()

    updates = inference.objective.updates(obj_optimizer=scheduled_optimizer)
    inputs = list(updates.keys())
    outputs = list(updates.values())

    old_names = [x.name for x in inputs]
    new_names = [x.name for x in outputs]

    expected_names = ['best_loss', 'cooldown_counter', 'wait', 'learning_rate']
    outputs_of_interest = [outputs[new_names.index(f'{expected_name}__updated')] for expected_name in
                           expected_names]
    tracker = pm.callbacks.Tracker(best_loss = outputs_of_interest[0].eval, learning_rate = outputs_of_interest[3].eval,
                                   wait = outputs_of_interest[2].eval, cooldown_counter = outputs_of_interest[1].eval)
    convergence_trackers = [pm.callbacks.CheckParametersConvergence(every=100, diff='relative'),
                            pm.callbacks.CheckParametersConvergence(every=100, diff='absolute')]
    approx = inference.fit(1_000_000, callbacks=[tracker] + convergence_trackers, obj_optimizer=scheduled_optimizer)

[Plot: traces of the tracked quantities (best_loss and learning_rate) over the fit]

There are still bugs, and I don't fully understand the VI codebase. I refactored a lot of stuff in variational/updates.py to make it possible to pass the loss function around.

One bug is that the schedulers assume the loss function returned by the Approximations is always a scalar, which isn't true. I wasn't able to figure out what the extra graphs returned by e.g. FullRankADVI and SVGD are, so these work with ADVI only for now. Also, tracking cooldown_counter and wait doesn't work, but they are clearly doing something internally, because the learning rate updates according to them. I'm not sure if there's a better way to expose these new variables to tracking.

Consider it a suggestion; if you come up with a better approach, feel free to revert these commits. I was thinking your approach might work combined with some of the refactoring I did? You would need to intercept the optimizers and make the learning rate a shared variable (like I do here), then use learning_rate.set_value from your callback objects to get the updates (which you compute off the graph) into the graph. My approach is totally on the graph; I have no intuition about which is better. The one thing I can say is that my approach should be 100% composable, so we could implement a helper function like pm.schedule_learning_rate or whatever that takes a list of schedulers and applies them in sequence.
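One possible shape for such a helper (entirely hypothetical; the name and the scheduler protocol are assumptions, not code from this branch):

def schedule_learning_rate(optimizer, schedulers):
    # Hypothetical composition helper: each scheduler receives the updates
    # dict built so far and returns an augmented one (e.g. adding its own
    # shared variables and learning-rate updates)
    def scheduled_optimizer(loss_or_grads, params):
        updates = optimizer(loss_or_grads, params)
        for scheduler in schedulers:
            updates = scheduler(updates, loss_or_grads)
        return updates
    return scheduled_optimizer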

@alvaropp (Author) commented Dec 4, 2023

D'oh, good spot!

@fonnesbeck (Member) commented
It seems like a good idea to use schedulers universally, with a Constant scheduler as the default. This may be a cleaner API and allow us to easily infer the optimizer, etc.
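A sketch of what that default could look like (ConstantLR is hypothetical, following the scheduler protocol sketched above):

class ConstantLR:
    # Hypothetical no-op scheduler: keeps the learning rate fixed, so every
    # fit goes through the same scheduling code path by default
    def __call__(self, updates, loss_or_grads=None):
        return updates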

@jessegrabowski (Member) commented Jan 5, 2024

My proposed changes follow the pytorch API, which looks like this:

import torch
from torch.optim.lr_scheduler import ReduceLROnPlateau

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = ReduceLROnPlateau(optimizer, 'min')

So the user has to pass the optimizer to the scheduler explicitly in the code. It's also written so that users can compose schedulers by chaining them together.
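For reference, chaining two torch schedulers on one optimizer looks like this (standard torch API; the single parameter is just a stand-in for a model):

import torch

param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.SGD([param], lr=0.1, momentum=0.9)
scheduler1 = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)
scheduler2 = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[5], gamma=0.1)

for epoch in range(10):
    optimizer.step()  # training step elided
    scheduler1.step()
    scheduler2.step()  # both act on the same optimizer's lr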

From a clean-code perspective I like the idea of a ConstantOptimizer base class from which the others inherit, but I don't think it makes sense the way I have it here. If the scheduler were added as a keyword to pm.ADVI I would agree, but then we lose the ability to compose (plus we already have too many keywords).

@sandstromviktor commented
What is the status of this PR? I think this feature is highly needed.

@sandstromviktor commented
For anyone needing this sort of feature in their own work, here is a simple way to change the learning rate during training. It's not particularly nice, but it works and is simple.

import pymc as pm
import pytensor

base_lr = 0.1

# Create a shared tensor variable for the learning rate
lr = pytensor.shared(base_lr)

# Define your optimizer against the shared variable
optimizer = pm.adam(learning_rate=lr)

# Create a callback that changes the learning rate
def update_learning_rate(approx, loss, iteration):
    # Drop the learning rate after 1000 iterations
    if iteration > 1000:
        lr.set_value(0.01)

with model:  # your pm.Model
    advi = pm.ADVI()
    approx = advi.fit(
        10000,
        obj_optimizer=optimizer,
        callbacks=[update_learning_rate],
    )
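The same trick extends to smooth schedules, e.g. an exponential decay (illustrative, reusing the lr shared variable from above):

def exponential_decay(approx, loss, iteration):
    # Multiply the base rate by 0.999 each iteration, with a floor
    lr.set_value(max(base_lr * 0.999 ** iteration, 1e-6))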
