These instructions will let you run this project on your local machine.

First, clone this repository:

git clone https://github.com/edwisdom/bbvi

Once you have a Python virtual environment, install the necessary packages with:

pip install -r requirements.txt

Run the models with:

python bbvi.py
Figure 1: Variational loss function and its gradient with 1 MC sample
Figure 2: Variational loss function and its gradient with 10 MC samples
Figure 3: Variational loss function and its gradient with 100 MC samples
Figure 4: Variational loss function and its gradient with 1000 MC samples
As the first columns of Figures 1-4 show, the loss function estimate becomes smoother and more accurate as the number of samples increases. The results make sense, since the ideal value of the mean-weight parameter should be 1, and since a linear regression model should have a quadratic loss function. The 1-sample loss is noisy but, on the whole, fairly accurate in its shape: the minimum is still at 1, and it still looks like a parabola, just a noisy one.
The second column shows gradients that seem reasonable given the loss function in the first column. They are mostly linear with a positive slope, which makes sense as the derivative of a parabolic loss. As before, the gradient becomes less noisy as the number of samples increases. However, the 1-sample estimates are not really accurate, especially as we move far from the optimal value of the mean; even the basic upward linear slope is not preserved. The same holds for 10 Monte Carlo samples, though with 100 or 1000 MC samples the gradient becomes clearly correct. Curiously, even 1000 MC samples produce somewhat noisy gradients at mean values far from the optimum.
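For concreteness, here is a minimal sketch of the kind of Monte Carlo estimator discussed above: a score-function (REINFORCE) estimate of the negative ELBO and its gradient with respect to the variational mean of a single weight. The model, prior, and function name are illustrative assumptions, not the exact implementation in bbvi.py.

```python
import numpy as np

def mc_loss_and_grad(mean, n_samples, x, y, std=1.0, rng=None):
    """Monte Carlo estimate of the negative ELBO and its score-function
    gradient with respect to the variational mean of a single weight.

    Assumes y ~ N(w * x, 1), a prior w ~ N(0, 1), and a variational
    posterior q(w) = N(mean, std**2); these choices are illustrative,
    not necessarily the exact model in bbvi.py.
    """
    rng = np.random.default_rng(rng)
    w = rng.normal(mean, std, size=n_samples)        # samples from q(w)

    # Per-sample log-likelihood, log prior, and log q(w) (up to constants).
    log_lik = np.array([-0.5 * np.sum((y - wi * x) ** 2) for wi in w])
    log_prior = -0.5 * w ** 2
    log_q = -0.5 * ((w - mean) / std) ** 2

    elbo_terms = log_lik + log_prior - log_q
    loss = -np.mean(elbo_terms)                      # negative ELBO estimate

    # Score-function (REINFORCE) gradient:
    # grad_m ELBO = E_q[(log p - log q) * d/dm log q(w)], with
    # d/dm log q(w) = (w - mean) / std**2.
    score = (w - mean) / std ** 2
    grad = -np.mean(elbo_terms * score)
    return loss, grad
```

Averaging over more samples via n_samples is exactly what smooths both columns of the figures, since the per-sample terms have high variance at mean values far from the optimum.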
Figure 5: Losses and Fitted Posteriors for Learning Rate 5e-6
Figure 6: Losses and Fitted Posteriors for Learning Rate 1e-5
Figure 7: Losses and Fitted Posteriors for Learning Rate 5e-5
My approach was somewhat brute-force: I tried a number of combinations of learning rates and MC sample counts. The highest learning rate, 5e-5, produced the best results, and, as expected, more samples also improved performance. Of course, taking more samples incurs a higher computational cost, so I capped the sample count at 500. I kept these two hyperparameter values for the following experiments.
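The sweep itself can be as simple as the loop below. Here run_bbvi is a toy stand-in that reuses the mc_loss_and_grad sketch above on synthetic data; it is not the actual training routine in bbvi.py.

```python
import itertools
import numpy as np

def run_bbvi(learning_rate, n_mc_samples, n_iters=500, seed=0):
    """Toy gradient-descent loop on synthetic data, reusing the
    mc_loss_and_grad sketch above; returns the final loss estimate."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=50)
    y = 1.0 * x + 0.1 * rng.normal(size=50)          # true weight is 1
    mean = 0.0
    loss = np.inf
    for _ in range(n_iters):
        loss, grad = mc_loss_and_grad(mean, n_mc_samples, x, y)
        mean -= learning_rate * grad
    return loss

# Brute-force grid over the two hyperparameters discussed above.
grid = itertools.product([5e-6, 1e-5, 5e-5], [10, 100, 500])
scores = {(lr, n): run_bbvi(lr, n) for lr, n in grid}
print("Best (learning rate, MC samples):", min(scores, key=scores.get))
```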
Figure 8: Estimated loss function over iterations for 3 chains of variational inference
Figure 9: 10 samples from the posterior and a plot of uncertainty (2 std.) for 3 VI chains with learning rate 5e-5 and 500 gradient samples
Compared to Hamiltonian Monte Carlo, variational inference does not fit the data as well. This is somewhat surprising, especially since we are approximating normally distributed data, so our choice of a normal distribution as the density family should make the problem easier.
Moreover, the results are very unstable in a way that they were not for Hamiltonian Monte Carlo, where the right choice of leapfrog steps and step size would usually guarantee a good posterior sample. Here, even with the right hyperparameters, a bad initialization can severely hamper variational inference. At the same time, variational inference severely underestimates the variance of the posterior: whereas HMC gives large uncertainty estimates far away from the data, variational inference does not, thereby weakening the strongest advantage of the Bayesian approach.
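For reference, the 2-standard-deviation bands in the figures above can be produced along the following lines. This is a sketch for a single-weight linear model with placeholder variational parameters, not the plotting code in bbvi.py.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder fitted variational parameters q(w) = N(post_mean, post_std**2);
# in practice these come from running bbvi.py.
post_mean, post_std = 1.0, 0.1

rng = np.random.default_rng(0)
x_grid = np.linspace(-3, 3, 200)

# Draw posterior weight samples and form one predictive curve per sample.
w_samples = rng.normal(post_mean, post_std, size=1000)
preds = np.outer(w_samples, x_grid)

mean_pred = preds.mean(axis=0)
std_pred = preds.std(axis=0)

plt.plot(x_grid, preds[:10].T, color="gray", alpha=0.5)   # 10 posterior draws
plt.plot(x_grid, mean_pred, color="black")
plt.fill_between(x_grid, mean_pred - 2 * std_pred,
                 mean_pred + 2 * std_pred, alpha=0.3)      # +/- 2 std band
plt.show()
```

Because the fitted post_std is small, the band stays narrow even far from the data, which is exactly the underestimated uncertainty criticized above.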
The value added by black-box variational inference is that it can be used with any complex model we want, even ones that aren't differentiable. That isn't, of course, possible with Hamiltonian Monte Carlo. However, these results make me skeptical of the claim that black-box variational inference can practically, not just theoretically, be used with any model. Given how noisy its gradient estimates are, my guess is that models more complex than our single-hidden-layer neural network would require many more Monte Carlo gradient samples to make variational inference work.
In the future, I would like to explore the following:
- Varying the learning rates using the Robbins-Monro sequence (see the sketch after this list)
- Applying this model to real-world data and comparing it to neural networks that take similar time to train
- Using a different density family (Gaussian scale mixtures) for the approximating distribution
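On the first item: a Robbins-Monro schedule only needs step sizes rho_t with sum rho_t = infinity and sum rho_t^2 < infinity. A common choice, sketched below with illustrative constants, is rho_t = (t + tau)^(-kappa) for 0.5 < kappa <= 1.

```python
def robbins_monro_steps(tau=1.0, kappa=0.9):
    """Generate step sizes rho_t = (t + tau) ** -kappa, which satisfy the
    Robbins-Monro conditions for 0.5 < kappa <= 1; tau and kappa here are
    illustrative defaults, not tuned values."""
    t = 0
    while True:
        t += 1
        yield (t + tau) ** -kappa

# Example: pair each gradient step with the next step size in the schedule.
# for rho, _ in zip(robbins_monro_steps(), range(1000)):
#     mean -= rho * grad
```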
A huge thanks to Prof. Michael Hughes, who supervised this work; Daniel Dinjian, who thought through architectures with me in the early phases; and Ramtin Hosseini, who sat with me to think about bugs that could lead to small gradient norms.