
Added ReduceLROnPlateau callback for VI. #7011

Open · wants to merge 16 commits into main

Conversation

@alvaropp commented Nov 14, 2023

This has been discussed in the forums and in this issue.

There are strong parallels between PyMC's variational inference and neural network training, where the choice of optimiser and learning rate has a huge impact on training quality and speed. A common technique when training neural networks is to use a learning rate scheduler: starting with a high learning rate gives fast initial progress, and reducing it over successive epochs allows more precision, so convergence is faster overall.

Currently, in PyMC, you need to pick a single suitable learning rate for fitting the model: too large and it won't converge, too small and it will be too slow. Training once with a largish learning rate and then using the result as the starting point for another round with a smaller learning rate is neither trivial nor elegant.

To address this, I've implemented a new callback for pymc.variational, modelled on Keras' ReduceLROnPlateau. It monitors the model loss at every iteration of VI.fit() and reduces the learning rate by a user-specified factor if the loss doesn't improve within a user-specified number of iterations.
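As a rough illustration of the mechanics, here is a minimal sketch of such a callback (this is not the PR's code; it assumes the learning rate lives in a pytensor shared variable, and that callbacks receive (approx, loss_hist, i) as pm.callbacks.Tracker does):

import numpy as np

class ReduceLROnPlateauSketch:
    """Minimal sketch: shrink a shared learning rate when the loss plateaus."""

    def __init__(self, learning_rate, factor=0.1, patience=10, min_lr=1e-8):
        self.learning_rate = learning_rate  # a pytensor shared variable
        self.factor = factor
        self.patience = patience
        self.min_lr = min_lr
        self.best = np.inf
        self.wait = 0

    def __call__(self, approx, loss_hist, i):
        current = np.atleast_1d(loss_hist)[-1]
        if current < self.best:
            self.best = current
            self.wait = 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                # Scale the learning rate down, respecting the floor
                new_lr = max(float(self.learning_rate.get_value()) * self.factor, self.min_lr)
                self.learning_rate.set_value(new_lr)
                self.wait = 0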

Major / Breaking Changes

None (I think!)

New features

Added a ReduceLROnPlateau callback.

Bugfixes

None.

Documentation

None so far, but it would be nice to add an example.

Maintenance

None.


📚 Documentation preview 📚: https://pymc--7011.org.readthedocs.build/en/7011/

welcome bot commented Nov 14, 2023

💖 Thanks for opening this pull request! 💖 The PyMC community really appreciates your time and effort to contribute to the project. Please make sure you have read our Contributing Guidelines and filled in our pull request template to the best of your ability.

@alvaropp marked this pull request as ready for review November 14, 2023 14:41
@alvaropp changed the title from "Added ReduceLROnPlateau and test." to "Added ReduceLROnPlateau." Nov 14, 2023
@alvaropp changed the title from "Added ReduceLROnPlateau." to "Added ReduceLROnPlateau callback for VI." Nov 14, 2023
@alvaropp (Author) commented Nov 14, 2023

@jessegrabowski raised a good point that it's not ideal to require the user to define a shared variable, so I've reworked my callback: it now takes a vanilla PyMC optimiser instance and modifies its learning rate directly:

with pm.Model() as model:
    [...]

    optimiser = pm.adam(learning_rate=0.1)
    fit = pm.fit(
        obj_optimizer=optimiser,
        callbacks=[
            ReduceLROnPlateau(optimiser=optimiser)
        ],
    )

@twiecki requested a review from ricardoV94 November 15, 2023 09:22
codecov bot commented Nov 15, 2023

Codecov Report

Merging #7011 (d375bf0) into main (ad450a6) will increase coverage by 1.46%.
Report is 20 commits behind head on main.
The diff coverage is 80.68%.


@@            Coverage Diff             @@
##             main    #7011      +/-   ##
==========================================
+ Coverage   87.66%   89.12%   +1.46%     
==========================================
  Files         101      101              
  Lines       16964    16982      +18     
==========================================
+ Hits        14871    15136     +265     
+ Misses       2093     1846     -247     
Files Coverage Δ
pymc/variational/opvi.py 74.17% <100.00%> (-13.22%) ⬇️
pymc/variational/updates.py 91.74% <98.86%> (-0.37%) ⬇️
pymc/variational/callbacks.py 55.55% <23.21%> (-40.68%) ⬇️

... and 42 files with indirect coverage changes

@alvaropp (Author) commented
Is it possible to add an example somewhere? Maybe here: https://www.pymc.io/projects/examples/en/latest/gallery.html#variational-inference

@jessegrabowski (Member) left a comment

I still prefer the Torch-style API to the callback API, but this PR is great so I'll let it go. The tests need to be a bit more robust, though. I also want to keep advocating for a slightly more general solution that allows for 1) different learning rate adjustment strategies, and 2) composition of learning rate strategies.

I'm not saying this PR needs to implement every single learning rate scheduler, but it would be nice to have a LearningRateStrategy base class that could be extended in the future, and to subclass ReduceLROnPlateau from that.
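Something along these lines, purely as an illustration (the class names and the callback signature (approx, loss_hist, i) are my assumptions, not code from this PR):

class LearningRateStrategy:
    """Hypothetical base class: subclasses decide how the learning rate evolves."""

    def __init__(self, learning_rate):
        self.learning_rate = learning_rate  # pytensor shared variable

    def new_learning_rate(self, loss_hist, i):
        raise NotImplementedError

    def __call__(self, approx, loss_hist, i):
        self.learning_rate.set_value(self.new_learning_rate(loss_hist, i))

class ExponentialDecayLR(LearningRateStrategy):
    """Example subclass: lr(i) = lr0 * decay_rate ** i."""

    def __init__(self, learning_rate, decay_rate):
        super().__init__(learning_rate)
        self.initial_lr = float(learning_rate.get_value())
        self.decay_rate = decay_rate

    def new_learning_rate(self, loss_hist, i):
        return self.initial_lr * self.decay_rate**i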


def __init__(
    self,
    optimizer,
@jessegrabowski (Member):

Does the user have to provide this? Can it instead be inferred somehow from the host VI object? It's ugly to have to pass the optimizer twice (once for the VI itself, then again in the callback)

@alvaropp (Author):

Well, this would be great, but I haven't figured out whether it's possible. Probably one for someone more familiar with the codebase :)

tests/variational/test_callbacks.py (outdated review thread, resolved)
        self.cooldown_counter = self.cooldown
        self.wait = 0

    def reduce_lr(self):
@jessegrabowski (Member):

I would still prefer that this was done symbolically with shared variables, because it will allow for composition between learning rate annealing strategies.

@alvaropp (Author) commented

> I still prefer the Torch-style API to the callback API, but this PR is great so I'll let it go. The tests need to be a bit more robust, though. I also want to keep advocating for a slightly more general solution that allows for 1) different learning rate adjustment strategies, and 2) composition of learning rate strategies.
>
> I'm not saying this PR needs to implement every single learning rate scheduler, but it would be nice to have a LearningRateStrategy base class that could be extended in the future, and to subclass ReduceLROnPlateau from that.

I've implemented an ExponentialDecay callback and refactored things a bit.
[Plots: learning-rate behaviour for ReduceLROnPlateau, ExponentialDecay, and ExponentialDecay with staircase mode]

@alvaropp (Author) commented

Hi @jessegrabowski, any concrete ideas for taking this PR forward?

As you said, it would be nice to infer the optimiser from the code rather than pass it to the callback, but I don't know how to go about that.

with pm.Model() as model:
    [...]

    optimiser = pm.adam(learning_rate=0.1)
    fit = pm.fit(
        obj_optimizer=optimiser,
        callbacks=[
            ReduceLROnPlateau(optimiser=optimiser)
        ],
    )

@jessegrabowski (Member) commented

@alvaropp sorry for letting this go stale. I'm going to do a careful review now/this weekend so you can get back on track and merged ASAP.

@jessegrabowski (Member) commented Dec 2, 2023

I'm going over the code carefully, and I don't think the callbacks as written do anything. There is a misunderstanding about how compiled pytensor functions work.

The implementation is very nice, and uses the following logic:

  1. There is a learning_rate parameter in every optimizer object.
  2. The optimizer object is called iteratively inside the _iterate_with_loss method of the Inference class.
  3. Therefore, we instantiate the optimizer class once and share it between a callback and the Inference class
  4. At each step, the callback will update the learning_rate of the optimizer
  5. Which will be propagated to the Inference class (since it's the same object).

Unfortunately, this is not at all how compiled pytensor functions work. First, the implementation of the optimizers is deceptive: they are not classes but partial functions, whose only role is to return an updates dictionary.
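This is easy to check directly (a quick illustration; the printed values assume current behaviour):

from functools import partial
import pymc as pm

opt = pm.adam(learning_rate=0.1)
print(isinstance(opt, partial))  # True: without loss/params, pm.adam returns
                                 # a functools.partial of itself
print(opt.keywords)              # {'learning_rate': 0.1}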

This updates dictionary is important, because after a pytensor function is compiled, changes to variables outside the function will have no effect. Consider the following graph:

import pytensor
import pytensor.tensor as pt

x = pt.dscalar('x')
y = 1.0
z = x + y
f = pytensor.function([x], z)

print(f(4))  # 5.0
print(f(5))  # 6.0

y = 2

print(f(4))  # 5.0 -- reassigning y has no effect on the compiled function
What happened? After running pytensor.function to compile the function, the variable y in the python namespace no longer has anything to do with the computation. If we try to naively update it, nothing will happen. This is exactly the situation we are in when running pm.fit. If you check the fit method in the Inference class, the line that actually does all the computation is this innocuous one. What, then, is step_func? It's created here, and as you can see, it's a compiled pytensor function (compiled using compile_pymc rather than pytensor.function, but if you keep digging you'll see pytensor.function is called by compile_pymc). So we're in exactly the case of my code snippet above.

To illustrate what's going on, I made a dummy "optimizer" that always adds the learning rate to the parameters:

from pymc.variational.updates import _get_call_kwargs, get_or_compute_grads
from functools import partial
from collections import OrderedDict

def dummy_optimizer(loss_or_grads=None, params=None, learning_rate=1e-3):
    if loss_or_grads is None and params is None:
        return partial(dummy_optimizer, **_get_call_kwargs(locals()))
    elif loss_or_grads is None or params is None:
        raise ValueError("Please provide both `loss_or_grads` and `params` to get updates")
    # grads = get_or_compute_grads(loss_or_grads, params)
    updates = OrderedDict()

    for param in params:
        new_param = param + learning_rate
        updates[param] = new_param

    return updates

We can use this optimizer to check the effect of the current schedulers. Helper functions:

import numpy as np
import pymc as pm

def make_regression():
    X = np.random.normal(size=(100,))
    beta = np.random.normal()
    alpha = np.random.normal()
    noise = np.random.normal(size=(100,))
    y = alpha + X * beta + noise

    return X, y

def make_model():
    X, y = make_regression()
    with pm.Model() as mod:
        b = pm.Normal('b')
        a = pm.Normal('a')
        # sigma = pm.HalfNormal('sigma')
    
        mu = a + X * b
        y_hat = pm.Normal('y_hat', mu, 1, observed=y)

    return mod

Here's a base case with no scheduling callback. I do use a tracker so we can see how the dummy works:

learning_rate = 1
optimizer = dummy_optimizer(learning_rate=learning_rate)

with make_model():
    advi = pm.ADVI()
    tracker = pm.callbacks.Tracker(mean=advi.approx.mean.eval)
    approx = advi.fit(100, callbacks=[tracker], obj_optimizer=optimizer)

Check that the parameters are deterministically updated by the learning rate at every step:

mean_history = np.r_[tracker['mean']]
np.all(np.diff(mean_history, axis=0) == learning_rate) # True

Now run it again with the ExponentialDecay schedule callback:

learning_rate = 1
optimizer = dummy_optimizer(learning_rate=learning_rate)
scheduler = pm.callbacks.ExponentialDecay(optimizer=optimizer, decay_steps=10, decay_rate=0.5)

with make_model():
    advi = pm.ADVI()
    tracker = pm.callbacks.Tracker(mean=advi.approx.mean.eval)
    approx = advi.fit(100, callbacks=[tracker, scheduler], obj_optimizer=optimizer)

Check the mean history:

mean_history = np.r_[tracker['mean']]
np.all(np.diff(mean_history, axis=0) == learning_rate) # True

As you can see, the learning rate has not changed. We can check that the scheduler updated as expected:

scheduler.optimizer.keywords['learning_rate'] == 1 * 0.5 ** 10 # True

We are in the case of the simple pytensor code snippet posted above. The learning_rate kwarg in the optimizer is updating (this variable plays the part of the y variable in my simple example). But the train has already left the station: the function is compiled, so this variable is totally irrelevant to the computation.

So what is the solution? To interact with a compiled function, you need to use shared variables. These are a special type of pytensor object that can be adjusted between function calls. Returning to the code snippet, let's use a shared variable:

x = pt.dscalar('x')
y = pytensor.shared(1.0, name='y')
z = x + y
f = pytensor.function([x], z)

print(f(4)) # 5.0
print(f(5)) # 6.0

y.set_value(2)

print(f(4)) # 6.0

After using a shared variable, we can adjust the value of y to adjust the output of f, even though it's already compiled.

The way that shared variables can be automatically updated from run to run of a function is through an update dictionary: a mapping from shared variables to the expressions that compute their new values. This is exactly the dictionary returned by the optimizer, as noted above. Let's take a look at the SGD function from here:

def sgd(loss_or_grads, params, learning_rate):
    grads = get_or_compute_grads(loss_or_grads, params)
    updates = OrderedDict()

    for param, grad in zip(params, grads):
        updates[param] = param - learning_rate * grad

    return updates

You can see that it computes the gradients of the loss function, then computes the SGD update rule for each parameter in turn. It stores these in an OrderedDict, where the keys are the old parameters and the values are the updated parameters, and returns the dictionary.
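To make the mechanics concrete, here is a tiny self-contained run (illustrative, using a toy quadratic loss):

import pytensor
from pymc.variational.updates import sgd

w = pytensor.shared(0.0, name='w')
loss = (w - 3.0) ** 2
updates = sgd(loss, [w], learning_rate=0.1)  # OrderedDict {w: w - 0.1 * dloss/dw}
step = pytensor.function([], loss, updates=updates)

for _ in range(5):
    print(step())  # the loss shrinks each call as the update pulls w toward 3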

As a final demonstration, let's adjust the dummy optimizer to have an exponential decay on the learning rate:

def dummy_optimizer_w_decay(loss_or_grads=None, params=None, learning_rate=1e-3):
    if loss_or_grads is None and params is None:
        return partial(dummy_optimizer_w_decay, **_get_call_kwargs(locals()))
    elif loss_or_grads is None or params is None:
        raise ValueError("Please provide both `loss_or_grads` and `params` to get updates")
    # grads = get_or_compute_grads(loss_or_grads, params)
    updates = OrderedDict()
    if not isinstance(learning_rate, pt.sharedvar.SharedVariable):
        learning_rate = pytensor.shared(pm.floatX(learning_rate), name='learning_rate')

    for param in params:
        new_param = param + learning_rate
        updates[param] = new_param

    updates[learning_rate] = learning_rate * 0.9
    return updates

And see how this looks in a model:

learning_rate = 1
optimizer = dummy_optimizer_w_decay(learning_rate=learning_rate)

with make_model():
    advi = pm.ADVI()
    tracker = pm.callbacks.Tracker(mean=advi.approx.mean.eval)
    approx = advi.fit(100, callbacks=[tracker], obj_optimizer=optimizer)

Here's a plot of the tracked parameters. As you can see, we now get the desired updating in the learning rate.

import matplotlib.pyplot as plt

mean_history = np.r_[tracker['mean']]
plt.plot(mean_history)

[Plot: the tracked means now increase in shrinking steps, showing the learning rate decaying each iteration]

@jessegrabowski (Member) commented
So can the callback approach be saved? Unfortunately, I don't think so. By the time the callbacks are invoked, in the fit loop, the step_func function is already compiled. At this stage it's too late to adjust the function without some pretty impressive gymnastics. This is going to have to go back to the drawing board using something like the wrapper approach.
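A minimal sketch of that wrapper idea (with_shared_lr is a hypothetical helper, not PyMC API): intercept the optimizer factory so the learning rate becomes a shared variable that a callback can mutate after compilation.

import pytensor
import pymc as pm

def with_shared_lr(opt_factory, initial_lr):
    # Hypothetical helper: the learning rate becomes a shared variable,
    # so later changes reach the already-compiled step function
    lr = pytensor.shared(pm.floatX(initial_lr), name='learning_rate')
    return opt_factory(learning_rate=lr), lr

optimizer, lr = with_shared_lr(pm.adam, 0.1)
# later, e.g. from inside a callback:
# lr.set_value(lr.get_value() * 0.5)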

@jessegrabowski (Member) commented Dec 2, 2023

I tried my hand at implementing the schedulers using the function approach I suggested on the discourse. Here's how it looks:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import pymc as pm
import pytensor
import pytensor.tensor as pt

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)

Xt = pytensor.shared(X_train)
yt = pytensor.shared(y_train)

with pm.Model() as iris_model:
    # Coefficients for features
    β = pm.Normal("β", 0, sigma=1e2, shape=(4, 3))
    # Transform to unit interval
    a = pm.Normal("a", sigma=1e4, shape=(3,))
    p = pt.special.softmax(Xt.dot(β) + a, axis=-1)

    observed = pm.Categorical("obs", p=p, observed=yt)

opt = pm.adamax(learning_rate=1)
scheduled_optimizer = pm.updates.reduce_lr_on_plateau_scheduler(opt, factor=0.25, cooldown=1000, patience=3000, min_lr=1e-12)
# scheduled_optimizer = pm.updates.exponential_decay_scheduler(opt, decay_steps=1, decay_rate=0.9999, staircase=False, min_lr=1e-8)

with iris_model:
    inference = pm.ADVI()

    updates = inference.objective.updates(obj_optimizer=scheduled_optimizer)
    inputs = list(updates.keys())
    outputs = list(updates.values())

    old_names = [x.name for x in inputs]
    new_names = [x.name for x in outputs]

    expected_names = ['best_loss', 'cooldown_counter', 'wait', 'learning_rate']
    outputs_of_interest = [outputs[new_names.index(f'{expected_name}__updated')] for expected_name in
                           expected_names]
    tracker = pm.callbacks.Tracker(best_loss = outputs_of_interest[0].eval, learning_rate = outputs_of_interest[3].eval,
                                   wait = outputs_of_interest[2].eval, cooldown_counter = outputs_of_interest[1].eval)
    convergence_trackers = [pm.callbacks.CheckParametersConvergence(every=100, diff='relative'),
                            pm.callbacks.CheckParametersConvergence(every=100, diff='absolute')]
    approx = inference.fit(1_000_000, callbacks=[tracker] + convergence_trackers, obj_optimizer=scheduled_optimizer)

[Plot: traces of the tracked quantities (best_loss and learning_rate) over the fit]

There are still bugs, and I don't fully understand the VI codebase. I refactored a lot of stuff in variational/updates.py to make it possible to pass the loss function around.

One bug is that the schedulers assume the loss function returned by the Approximations is always a scalar, which isn't true. I wasn't able to figure out what the extra graphs returned by e.g. FullRankADVI and SVGD are, so these work with ADVI only for now. Also, tracking cooldown_counter and wait doesn't work, but they are clearly doing something internally, because the learning rate updates according to them. I'm not sure if there's a better way to expose these new variables to tracking.

Consider it a suggestion; if you come up with a better approach, feel free to revert these commits. I was thinking your approach might work combined with some of the refactoring I did? You would need to intercept the optimizers and make the learning rate a shared variable (like I do here), then use learning_rate.set_value from your callback objects to get the updates (which you compute off the graph) into the graph. My approach is totally on the graph; I have no intuition about which is better. The one thing I can say is that my approach should be 100% composable, so we could implement a helper function like pm.schedule_learning_rate or whatever that takes a list of schedulers and applies them in sequence.
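One possible shape for such a helper (entirely hypothetical; the name and the scheduler protocol are assumptions, not code from this branch):

def schedule_learning_rate(optimizer, schedulers):
    # Hypothetical composition helper: each scheduler receives the updates
    # dict built so far and returns an augmented one (e.g. adding its own
    # shared variables and learning-rate updates)
    def scheduled_optimizer(loss_or_grads, params):
        updates = optimizer(loss_or_grads, params)
        for scheduler in schedulers:
            updates = scheduler(updates, loss_or_grads)
        return updates
    return scheduled_optimizer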

@alvaropp (Author) commented Dec 4, 2023

D'oh, good spot!

@fonnesbeck (Member) commented
It seems like a good idea to use schedulers universally, with a Constant scheduler as the default. This may be a cleaner API and allow us to easily infer the optimizer, etc.
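A sketch of what that default could look like (ConstantLR is hypothetical, following the scheduler protocol sketched above):

class ConstantLR:
    # Hypothetical no-op scheduler: keeps the learning rate fixed, so every
    # fit goes through the same scheduling code path by default
    def __call__(self, updates, loss_or_grads=None):
        return updates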

@jessegrabowski (Member) commented Jan 5, 2024

My proposed changes follow the pytorch API, which looks like this:

import torch
from torch.optim.lr_scheduler import ReduceLROnPlateau

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = ReduceLROnPlateau(optimizer, 'min')

So the user has to pass the optimizer to the scheduler explicitly in the code. It's also written so that users can compose schedulers by chaining them together.
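For reference, chaining two torch schedulers on one optimizer looks like this (standard torch API; the single parameter is just a stand-in for a model):

import torch

param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.SGD([param], lr=0.1, momentum=0.9)
scheduler1 = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)
scheduler2 = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[5], gamma=0.1)

for epoch in range(10):
    optimizer.step()  # training step elided
    scheduler1.step()
    scheduler2.step()  # both act on the same optimizer's lr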

From a clean-code perspective I like the idea of a ConstantOptimizer base class from which the others inherit, but I don't think it makes sense the way I have it here. If the scheduler were added as a keyword to pm.ADVI I would agree, but then we lose the ability to compose (plus we already have too many keywords).

@sandstromviktor commented
What is the status of this PR? I think this feature is highly needed.

@sandstromviktor commented
For anyone needing this sort of feature in their own work, here is a simple way to change the learning rate during training. It's not particularly nice, but it works and is simple.

import pymc as pm
import pytensor

base_lr = 0.1

# Create a shared tensor variable for the learning rate
lr = pytensor.shared(base_lr)

# Define your optimizer against the shared variable
optimizer = pm.adam(learning_rate=lr)

# Create a callback that changes the learning rate
def update_learning_rate(approx, loss, iteration):
    # Drop the learning rate after 1000 iterations
    if iteration > 1000:
        lr.set_value(0.01)

with model:  # your pm.Model
    advi = pm.ADVI()
    approx = advi.fit(
        10000,
        obj_optimizer=optimizer,
        callbacks=[update_learning_rate],
    )
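The same trick extends to smooth schedules, e.g. an exponential decay (illustrative, reusing the lr shared variable from above):

def exponential_decay(approx, loss, iteration):
    # Multiply the base rate by 0.999 each iteration, with a floor
    lr.set_value(max(base_lr * 0.999 ** iteration, 1e-6))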
