Added ReduceLROnPlateau callback for VI. #7011
base: main
Conversation
@jessegrabowski raised a good point that it's not ideal to have the user define a shared variable themselves, so I've reworked my callback so that it takes a vanilla PyMC optimiser instance and modifies its learning rate directly:

```python
with pm.Model() as model:
    [...]
    optimiser = pm.adam(learning_rate=0.1)
    fit = pm.fit(
        obj_optimizer=optimiser,
        callbacks=[ReduceLROnPlateau(optimiser=optimiser)],
    )
```
Codecov Report

Additional details and impacted files:

```
@@            Coverage Diff            @@
##             main    #7011     +/-  ##
=========================================
+ Coverage   87.66%   89.12%   +1.46%
=========================================
  Files         101      101
  Lines       16964    16982      +18
=========================================
+ Hits        14871    15136     +265
+ Misses       2093     1846     -247
```
Is it possible to add an example somewhere? Maybe here: https://www.pymc.io/projects/examples/en/latest/gallery.html#variational-inference
I still prefer the Torch-style API to the callback API, but this PR is great so I'll let it go. The tests need to be a bit more robust, though. I also want to keep advocating for a slightly more general solution that allows for 1) different learning-rate adjustment strategies, and 2) composition of learning-rate strategies.

I'm not saying this PR needs to implement every single learning rate scheduler, but it would be nice to have a `LearningRateStrategy` base class that could be extended in the future, and then subclass `ReduceLROnPlateau` from it.
```python
def __init__(
    self,
    optimizer,
```
Does the user have to provide this? Can it instead be inferred somehow from the host VI object? It's ugly to have to pass the optimizer twice (once for the VI itself, then again in the callback).
Well, this would be great, but I haven't figured out whether it's possible. Probably one for someone more familiar with the codebase :)
pymc/variational/callbacks.py (outdated):

```python
self.cooldown_counter = self.cooldown
self.wait = 0
```
```python
def reduce_lr(self):
```
I would still prefer that this was done symbolically with shared variables, because it will allow for composition between learning rate annealing strategies.
Hi @jessegrabowski, any concrete ideas for taking this PR forward? As you said, it would be nice to infer the optimiser from the code rather than pass it to the callback, but I don't know how to go about that.

```python
with pm.Model() as model:
    [...]
    optimiser = pm.adam(learning_rate=0.1)
    fit = pm.fit(
        obj_optimizer=optimiser,
        callbacks=[ReduceLROnPlateau(optimiser=optimiser)],
    )
```
@alvaropp sorry for letting this get stale, I am going to do a careful review now/this weekend so you can get back on track and get merged ASAP.
I'm going over the code carefully, and I don't think the callbacks as written do anything. There is a misunderstanding about how compiled pytensor functions work. The implementation is very nice, and uses roughly the following logic: keep a reference to the optimizer the user passed in, and mutate its learning-rate argument in place whenever the loss plateaus, expecting the next fitting step to pick up the new value.
Unfortunately, this is not at all how a compiled pytensor function works. First, the implementation of the optimizers is deceptive: they are not classes, they are partial functions whose only role is to return an updates dictionary. This updates dictionary is important, because after a pytensor function is compiled, changes to variables outside the function have no effect. Consider the following graph:
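(The original snippet was lost in extraction; here is a minimal sketch of the same point, with variable names of my choosing.)

```python
import pytensor
import pytensor.tensor as pt

learning_rate = 0.1  # a plain Python float, baked into the graph as a constant
x = pt.dscalar("x")
f = pytensor.function([x], x * learning_rate)

print(f(2.0))  # 0.2
learning_rate = 1000.0  # rebind the Python name after compilation...
print(f(2.0))  # ...still 0.2: the compiled function is unaffected
```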
What happened? After running `pytensor.function`, the Python variable `learning_rate` is no longer connected to the graph; its value was baked in as a constant at compile time. To illustrate what's going on in the context of the optimizers, I made a dummy "optimizer" that always adds the learning rate to the parameters:
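(The dummy optimizer didn't survive extraction either; the sketch below follows the description above and mimics the call signature of the optimizers in pymc/variational/updates.py. `make_dummy` is my own name.)

```python
import functools
from collections import OrderedDict

def dummy_optimizer(loss_or_grads=None, params=None, learning_rate=0.1):
    # Ignore the gradients entirely: each step just adds the learning rate
    # to every parameter, so the parameter trajectory reveals the learning
    # rate that was actually used at each iteration.
    updates = OrderedDict()
    for param in params:
        updates[param] = param + learning_rate
    return updates

def make_dummy(learning_rate=0.1):
    # pymc optimizers are used as partial functions, so mirror that here
    return functools.partial(dummy_optimizer, learning_rate=learning_rate)
```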
We can use this optimizer to check the effect of the current schedulers. Helper functions:
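(A hypothetical reconstruction: one helper that fits a trivial one-parameter model with a given optimizer and callbacks, tracking the approximation mean at every step. The name `fit_simple_model` is mine.)

```python
import numpy as np
import pymc as pm

def fit_simple_model(optimizer, callbacks=(), n=20):
    with pm.Model():
        pm.Normal("x", mu=0.0, sigma=1.0)
        advi = pm.ADVI()
        tracker = pm.callbacks.Tracker(mean=advi.approx.mean.eval)
        advi.fit(n, obj_optimizer=optimizer,
                 callbacks=[tracker, *callbacks], progressbar=False)
    return np.asarray(tracker["mean"]).squeeze()
```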
Here's a base case with no scheduling callback. I do use a tracker so we can see how the dummy works:
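(Sketched in terms of the helper above.)

```python
mean_hist = fit_simple_model(make_dummy(0.1))
```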
Check that the parameters are deterministically updated by the learning rate at every step:
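With the dummy optimizer, the tracked mean should move by exactly the learning rate at every iteration:

```python
print(np.diff(mean_hist))  # [0.1, 0.1, 0.1, ...] -- one lr-sized step per iteration
```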
Now run it again with the ExponentialDecay schedule callback:
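(Sketch only: `ExponentialDecay` is the scheduler callback from this PR, and the exact signature below is my guess at it.)

```python
optimizer = make_dummy(0.1)
scheduler = ExponentialDecay(optimiser=optimizer, decay_rate=0.9)  # signature assumed
mean_hist_sched = fit_simple_model(optimizer, callbacks=[scheduler])
```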
Check the mean history:
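The step sizes are unchanged despite the scheduler:

```python
print(np.diff(mean_hist_sched))  # still [0.1, 0.1, 0.1, ...] -- no effect
```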
As you can see, the learning rate has not changed. We can check that the scheduler updated as expected:
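(Sketch: inspecting the partial function's keyword argument, which is what the callback mutates; where the value lives is an assumption.)

```python
print(optimizer.keywords["learning_rate"])  # decayed as intended...
# ...but the already-compiled step function never saw any of these changes
```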
We are in the same situation as the simple pytensor snippet posted above: the `learning_rate` the callback modifies is disconnected from the compiled graph. So what is the solution? To interact with a compiled function, you need to use shared variables. These are a special type of pytensor object whose value can be adjusted between function calls. Returning to the code snippet, let's use a shared variable:
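(Continuing the earlier sketch, with the Python float replaced by a shared variable.)

```python
lr = pytensor.shared(0.1, name="learning_rate")
f = pytensor.function([x], x * lr)

print(f(2.0))      # 0.2
lr.set_value(0.5)  # shared variables CAN be changed after compilation
print(f(2.0))      # 1.0
```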
After switching to a shared variable, we can adjust its value between calls with `set_value`, and the compiled function picks up the change. Shared variables can also be updated automatically from run to run of a function through an update dictionary: a mapping from a shared variable's old value to the new value it should take after each call. This is exactly the dictionary returned by the optimizers, as noted above. Let's take a look at the SGD function from pymc/variational/updates.py:
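(Lightly paraphrased from `pymc/variational/updates.py`; the real function also handles partial application and input validation.)

```python
def sgd(loss_or_grads=None, params=None, learning_rate=1e-3):
    grads = get_or_compute_grads(loss_or_grads, params)
    updates = OrderedDict()
    for param, grad in zip(params, grads):
        # the classic SGD rule, stored as an entry in the update dictionary
        updates[param] = param - learning_rate * grad
    return updates
```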
You can see that it computes the gradients of the loss function, then applies the SGD update rule to each parameter in turn, storing the results in an `OrderedDict` that is returned to the caller. As a final demonstration, let's adjust the dummy optimizer to have an exponential decay on the learning rate:
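(Same assumptions as the dummy sketched above; the decay now lives inside the update dictionary itself.)

```python
def dummy_optimizer_with_decay(loss_or_grads=None, params=None,
                               initial_lr=0.1, decay=0.9):
    lr = pytensor.shared(initial_lr, name="lr")
    updates = OrderedDict()
    for param in params:
        updates[param] = param + lr
    # the learning rate decays *inside* the compiled step function
    updates[lr] = lr * decay
    return updates
```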
And see how this looks in a model:
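(Reusing the hypothetical helper from above.)

```python
mean_hist_decay = fit_simple_model(
    functools.partial(dummy_optimizer_with_decay, initial_lr=0.1, decay=0.9)
)
print(np.diff(mean_hist_decay))  # successive steps shrink geometrically by `decay`
```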
Here's a plot of the tracked parameters (not reproduced here). As you can see, we now get the desired updating of the learning rate.
So can the callback approach be saved? Unfortunately, I don't think so. By the time the callbacks are invoked in the fitting loop, the step function has already been compiled, so nothing a callback assigns to an ordinary Python variable can reach the graph.
I tried my hand at implementing the schedulers using the function approach I suggested on the discourse. Here's how it looks:
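(The actual commits aren't reproduced here; below is a rough sketch of the idea under discussion, with hypothetical names: a scheduler builds a shared learning rate plus the symbolic updates that advance it, and those updates get merged into the step function's update dictionary.)

```python
import pytensor

def exponential_decay(initial_lr=0.1, decay=0.9):
    # hypothetical sketch, not the code in the commits
    lr = pytensor.shared(initial_lr, name="lr")
    return lr, {lr: lr * decay}

lr, lr_updates = exponential_decay(0.1, 0.9)
optimizer = pm.adam(learning_rate=lr)  # a shared lr works symbolically here
# lr_updates must then be merged into the step function's update
# dictionary -- which is what the refactor of variational/updates.py enables
```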
There are still bugs, and I don't fully understand the VI codebase. I refactored a lot of stuff in variational/updates.py to make it possible to pass around the loss function. The known bugs are that the schedulers assume the loss function returned by the Approximations is always a scalar, which isn't true; I wasn't able to figure out what the extra graphs returned by e.g. FullRankADVI and SVGD are, so these work with ADVI only for now. Also tracking […]. Consider it a suggestion; if you come up with a better approach you can feel free to revert these commits. I was thinking your approach might work combined with some of the refactoring I did? You would need to intercept the optimizers and make the learning rate a shared variable (like I do here), then use `set_value` on it from inside the callback.
D'oh, good spot!
It seems like a good idea to use schedulers universally, with a constant (no-op) scheduler as the default.
My proposed changes follow the PyTorch API, which looks like this:

```python
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = ReduceLROnPlateau(optimizer, 'min')
```

So the user has to pass the optimizer to the scheduler explicitly in the code. It's also written so that users can compose schedulers by chaining them together. From a clean-code perspective I like the idea of a "ConstantOptimizer" base class from which the others inherit, but I don't think it makes sense the way I have it here. If the scheduler were added as a keyword to `pm.fit`, that might be a more natural place for it.
What is the status of this PR? I think this feature is highly needed.
For anyone needing this sort of feature in their own work, here is a simple way to change the learning rate during training:

```python
base_lr = 0.1

# Create a shared tensor variable
lr = pytensor.shared(base_lr)

# Define your optimizer
optimizer = pm.adam(learning_rate=lr)

# Create some function that changes the learning rate
def update_learning_rate(approx, loss, iteration):
    # Change learning rate after 1000 iterations
    if iteration > 1000:
        lr.set_value(0.01)

approx = advi.fit(
    10000,
    obj_optimizer=optimizer,
    callbacks=[update_learning_rate],
)
```
This has been discussed in the forums and in this issue.
There are strong parallels between PyMC's variational inference and training neural networks, where the choice of optimiser and learning rate has a huge impact on training quality and speed. A common technique when training neural networks is to use a learning-rate scheduler, which starts the learning rate high for fast initial convergence and reduces it in later epochs, where you want to be more precise.
Currently, in PyMC, you need to specify a single learning rate that is used for the whole fit. Too large and it won't converge; too small and it will be too slow. Training once with a largish learning rate and then using the result as the starting point for another training round with a smaller learning rate is neither trivial nor very elegant.
To address this issue, I've implemented a new callback for pymc.variational, following Keras' ReduceLROnPlateau: it monitors the model loss at every iteration of VI.fit() and reduces the learning rate by a user-specified amount if the loss doesn't improve within a user-specified number of iterations.
Major / Breaking Changes
None (I think!)
New features
Added a ReduceLROnPlateau callback.
Bugfixes
None.
Documentation
None so far, but it would be nice to add an example.
Maintenance
None.
📚 Documentation preview 📚: https://pymc--7011.org.readthedocs.build/en/7011/