Dismissal of "bayesian methods don't overfit" is too quick #10
You say that using a very broad prior distribution leads to insufficient regularization and thus overfitting. I'm guessing you have in mind using the MAP estimator. But this isn't a Bayesian thing to do. A Bayesian would produce a predictive distribution (Section 3.3.2) or, if required to produce a point prediction for every input, might produce something like the predictive mean or predictive median, depending on what the ultimate loss function is.

Comments
Thank you very much for your comment.
I have a posterior distribution in mind. What I have tried to say in the corresponding errata item is that we can think of a very pessimistic model in which the posterior becomes a MAP or ML estimate and the predictive distribution gives a wrong answer very confidently.
Let us take the Bayesian linear regression model of Section 3.3 as an example and suppose that the precision $\beta$ of the target variable is very large and the precision $\alpha$ of the parameters is very small. Then, the posterior is very sharply peaked and is thus nearly a MAP or ML estimate; and the predictive distribution is also very sharply peaked and is thus nearly the target distribution (3.8) conditioned on the MAP or ML estimate of the parameters.
Of course, we can think of hyperpriors over $\beta$ and $\alpha$ and introduce more Bayesian averaging. However, again, if the extended model is wrong (e.g., the hyperpriors are again sharply peaked at wrong values), we shall have a wrong posterior and a wrong predictive distribution.
The point is that, since we do not know the true model and thus cannot even know whether the assumed model is reasonable without further knowledge, we cannot guarantee in general that "Bayesian methods don't overfit." (However, we can say that one model is better than another using Bayesian model comparison, for example.)
Please correct me if I am missing something. Thank you!
|
In your original errata I thought you were talking about very broad priors, which indeed is related to very little regularization, if you’re going to do MAP.
Now it seems like you're talking about the opposite, where you have a prior that's very peaked/concentrated on the wrong thing. You can think of this as a soft version of constraining the parameter space to just the region around the peak. This is much closer to underfitting than overfitting. The prior is so strong that it basically ignores the data until you have a ton of it.
(Sorry if I missed your point - I’m not at home right now so I can’t cross reference with the book to see exactly what you’re referring to. )
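A minimal sketch of that behavior (a Beta-Bernoulli toy of my own, with made-up numbers, just for concreteness): a prior sharply peaked at the wrong value barely moves until the data overwhelms it.

```python
import numpy as np

rng = np.random.default_rng(0)
true_theta = 0.9                 # true Bernoulli parameter
a0, b0 = 1000.0, 1000.0          # prior sharply (and wrongly) peaked at 0.5

for N in [10, 100, 1000, 10000]:
    x = rng.random(N) < true_theta          # simulated coin flips
    a, b = a0 + x.sum(), b0 + N - x.sum()   # conjugate Beta posterior update
    print(N, a / (a + b))                   # posterior mean stays near 0.5 until N is huge
```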
|
You were right; I am talking about a very broad prior (i.e., the precision $\alpha$ of the parameters is very small). |
Ok, now I think I know what you're talking about. Say, for example, we have a very expressive feature space, so we can fit the training data very well with a Gaussian linear model, say with maximum likelihood... and such a maximum likelihood fit does poorly out of sample. So, standard overfitting. Then if we take a Bayesian approach with a very broad prior on the coefficient vector, and the variance of the response (conditional on the input) is assumed to be very small, I see your point that the posterior variance would also be very small, and so the predictive variance would also be quite small, leading to poor out-of-sample performance again. I would just suggest elaborating a little bit in the errata, to say that you're talking about the posterior and predictive distributions having small variance under the scenario you describe.

Also, Bishop doubles down on this point of view in Section 3.4: "As we shall see, the over-fitting associated with maximum likelihood can be avoided by marginalizing (summing or integrating) over the model parameters instead of making point estimates of their values. Models can then be compared directly on the training data, without the need for a validation set." Since he's comparing to maximum likelihood, as opposed to regularized maximum likelihood, maybe his case is really more about regularizing in general vs. not regularizing. And the second part is something new, about using the marginal likelihood to choose the amount of regularization.

Anyway - your errata notes are awesome! Thanks for sharing.
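To make that first scenario concrete, here is a minimal numpy sketch (made-up data; the posterior and predictive formulas are the standard Gaussian linear-model ones from PRML Section 3.3):

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 10, 8                     # data points; polynomial basis functions
alpha, beta = 1e-6, 1e6          # very broad prior; tiny assumed noise variance

x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(N)  # true noise is large

Phi = np.vander(x, M)                                        # design matrix
S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)  # posterior covariance
m_N = beta * S_N @ Phi.T @ t                                 # posterior mean (~ ML fit)

phi = np.vander(np.array([0.5]), M)               # features of a test input
pred_var = float(1 / beta + phi @ S_N @ phi.T)    # predictive variance
print(float(phi @ m_N), pred_var)  # stated variance is orders of magnitude
                                   # below the true noise variance (0.09)
```
|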
Thank you again for your very helpful comment; I am very glad to hear that you have found my errata "awesome"! As you suggested, I would like to make it clearer in my errata that I am concerned here with a fully Bayesian treatment (not MAP or ML) but that, if the model is not sensible, it can effectively reduce to MAP or ML (the posterior can become nearly a point estimate), giving poor predictions (the predictive distribution is likewise not usable). The statement in Section 3.4 is, to me, more reasonable; it only says that overfitting can be avoided by Bayesian methods (if we assume sensible models). You are right that the statement that the marginal likelihood can be used for model selection is new. |
I see marginal likelihood as a conservative estimate of out-of-sample performance, in the sense that it's proportional to average( log[p(y_1)], log[p(y_2|y_1)], ..., log[p(y_n|y_1,...,y_{n-1})] ), where y_1, ..., y_n is your training set. Each thing you're averaging is the log-likelihood performance on an out-of-sample example. Of course, most predictions you're evaluating are using much less training data than you'll have after conditioning on the full training set, so the performance measure is conservative. Does this make sense? I guess LOOCV would be a better estimate on the training set. In other words, I see marginal likelihood as quite connected to assessing overfitting. |
Your point makes perfect sense!
However, since we do not know the true probability distribution in general, we must be careful about the distribution with which we take expectations. So please let me elaborate on your point by introducing some notation.
First, let X = { x_1, ..., x_N } be the training data (here I use x instead of y). The data are assumed to be i.i.d., generated from some true distribution p(x), so that
p(X) = p(x_1, ..., x_N) = \prod_{n=1}^{N} p(x_n).
As a shorthand, the probability distribution of our assumed model M is denoted by q(.) = p(.|M). If there exists a true model M^*, then the true distribution p(.) can be written as p(.) = p(.|M^*). The model M we learn is a pair of the likelihood q(x|w) and the prior q(w), where w is a set of parameters.
The marginal likelihood (or model evidence) is given by
q(X) = \int dw q(w) q(X|w) = \int dw q(w) \prod_{n=1}^{N} q(x_n|w).
Note that q(X) does not factorize in general because we have the unknown parameters w in the model M.
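As a quick illustration (a toy example of my own, not from the original thread): take q(x|w) to be Bernoulli with parameter w and a uniform prior q(w) on [0, 1]. Then
q(x_1 = 1, x_2 = 1) = \int_0^1 w^2 dw = 1/3
whereas
q(x_1 = 1) q(x_2 = 1) = (1/2)(1/2) = 1/4
so observing x_1 changes the predictive distribution for x_2 and the marginal likelihood does not factorize.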
The Bayes free energy F_N is defined as the negative log marginal likelihood, so that
F_N = - ln q(X) = - ln q(x_1, ..., x_N).
Taking its expectation over all the data X with respect to the true distribution p(X), we have
E[F_N] = - \int dX p(X) ln q(X) = N S + KL(p(X)||q(X))
where S = - \int dx p(x) ln p(x) is the (differential) entropy of x and is thus a constant independent of the model M (a short derivation is spelled out below).
Therefore, minimizing the expected Bayes free energy E[F_N] is equivalent to minimizing the Kullback-Leibler divergence from the true distribution p(X) to the learned model q(X).
Although we cannot compute the expected value E[F_N] of the Bayes free energy F_N in practice, it is not unreasonable to suppose that a smaller value of F_N tends to imply that q(X) better approximates p(X) on average, provided that the assumed model M is somewhat reasonable. See also the discussion around (3.73).
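(Spelling out the decomposition: writing - ln q(X) = ln [p(X)/q(X)] - ln p(X) and taking expectations,
E[F_N] = \int dX p(X) ln [p(X)/q(X)] - \int dX p(X) ln p(X) = KL(p(X)||q(X)) + N S
where the second term equals N S because p(X) = \prod_{n=1}^{N} p(x_n).)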
On the other hand, in much the same way as we have motivated maximum likelihood in (1.119), the generalization loss G_N is defined as the negative logarithm of the predictive q(x|X), expected with respect to a new input x, so that
G_N = E_x[- ln q(x|X)] = - \int dx p(x) ln q(x|X) = S + KL(p(x)||q(x|X))
where S is the (differential) entropy of x (see above). Therefore, minimizing the generalization loss is equivalent to minimizing the Kullback-Leibler divergence from p(x) to q(x|X).
Since, by the product rule, we have
F_N = - ln q(x_1) - ln q(x_2|x_1) - ln q(x_3|x_1, x_2) - ... - ln q(x_N|x_1, ..., x_{N - 1})
the expected values of the Bayes free energy F_N and the generalization loss are related by
E[F_N] = \sum_{n=0}^{N - 1} E[G_n]
where the expectations are taken with respect to all the data X and G_n denotes the generalization loss after observing the first n data points. Therefore, as you suggested, the Bayes free energy can be considered a somewhat conservative estimate of the generalization loss. The terminology here is borrowed from Watanabe (2013) (see References).
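A quick Monte Carlo check of this relation in a toy conjugate model (my own construction, not from the thread): for a Bernoulli likelihood with a Beta(a_0, b_0) prior, the predictive q(x = 1 | x_1, ..., x_n) = (a_0 + #heads)/(a_0 + b_0 + n) is available in closed form, so both sides of E[F_N] = \sum_{n=0}^{N - 1} E[G_n] can be estimated by simulation:

```python
import numpy as np

rng = np.random.default_rng(2)
p_true, N, reps = 0.7, 20, 5000
a0, b0 = 1.0, 1.0                # uniform Beta prior on the Bernoulli parameter

F_sum, G_sum = 0.0, 0.0
for _ in range(reps):
    x = rng.random(N) < p_true
    heads = 0
    for n in range(N):
        q1 = (a0 + heads) / (a0 + b0 + n)     # predictive q(x=1 | x_1..x_n)
        # one term of F_N = -sum_n ln q(x_{n+1} | x_1..x_n)
        F_sum -= np.log(q1) if x[n] else np.log(1.0 - q1)
        # one term of sum_n G_n, with E_x[.] computed exactly under p(x)
        G_sum -= p_true * np.log(q1) + (1 - p_true) * np.log(1.0 - q1)
        heads += x[n]

print(F_sum / reps, G_sum / reps)  # both estimate E[F_N]; they agree to MC error
```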
|
The paragraph has been renamed from "Bayes free energy vs. generalization loss". Since more revisions are required, I am now working on them in PR #12. Some more thoughts:
|
Very nice! I have never seen these precise connections between LOO cross-validation, generalization error, and Bayes free energy. Is there a reference to cite for these connections? |
As I said, the terminology here is due to Watanabe (2010, 2013). (The notation is slightly different, though.) I think many of these connections can be found therein. I also found in Section 28.3 of MacKay (2003) a brief explanation (similar to yours) of the connection between LOOCV and the log marginal likelihood. To be honest, Watanabe (2010, 2013) are too advanced for me. I too would appreciate it if you could point me to more accessible references, like MacKay (2003), for the connections between these quantities.
|
A quote from Section 28 "Model Comparison and Occam's Razor" of MacKay (2003) [I have adapted the notation to match the one I introduced above]:
It is in Watanabe (2010) that the asymptotic equivalence between the LOOCV loss and WAIC (the Watanabe-Akaike information criterion) is proved. Some references cited in WAIC and WBIC regarding WAIC:
Regarding the connections between the generalization loss, the Bayes free energy, and the LOOCV loss, I could not find a good reference other than Watanabe (2012), but these relations are probably obvious from their definitions.
Also, I could not find anywhere the relation between the LOOCV loss and the Bayes free energy that I noted above, which is, to me, very interesting but again probably obvious. I think a detailed discussion of the behaviors of the Bayes free energy and the LOOCV loss as random variables is well beyond the scope of my report (and, probably, that of PRML); it suffices to note their relationships, and also that the LOOCV loss can be considered an attempt to estimate the generalization loss. |