
Dismissal of "bayesian methods don't overfit" is too quick #10

Open
davidrosenberg opened this issue Mar 27, 2018 · 11 comments

Comments

@davidrosenberg

You say that using a very broad prior distribution leads to insufficient regularization and thus overfitting. I'm guessing you have in mind using the MAP estimator. But this isn't a Bayesian thing to do. A Bayesian would produce a predictive distribution (section 3.3.2), or if required to produce a point prediction for every input, might produce something like the predictive mean or predictive median, depending on what the ultimate loss function is.

@yousuketakada
Owner

Thank you very much for your comment.
I have a posterior distribution in mind. What I am trying to say in the corresponding errata item is that we can think of a very pessimistic model in which the posterior collapses to (nearly) a MAP or ML point estimate and the predictive distribution gives a wrong answer very confidently.
Let us take the Bayesian linear regression model of Section 3.3 as an example and suppose that the precision $\beta$ of the target variable is very large and the precision $\alpha$ of the parameters is very small. Then, the posterior is very sharply peaked and is thus nearly a point estimate at the MAP or ML solution; and the predictive distribution is also very sharply peaked and is thus nearly the target distribution (3.8) conditioned on the MAP or ML estimate of the parameters.
Of course, we can think of hyperpriors over $\beta$ and $\alpha$ and introduce more Bayesian averaging. However, again, if the extended model is wrong (e.g., the hyperpriors are in turn sharply peaked at wrong values), we shall obtain a wrong posterior and a wrong predictive distribution.
The point is that, since we do not know the true model and thus cannot even judge, without any prior knowledge, whether the assumed model is reasonable, we cannot guarantee in general that "Bayesian methods don't overfit." (However, we can say that one model is better than another using, for example, Bayesian model comparison.)
Please correct me if I am missing something. Thank you!
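This failure mode can be made concrete with a small numerical sketch (not part of the original thread; the data and parameter values are made up for illustration). Using the Bayesian linear regression predictive variance of PRML, $\sigma_N^2(x) = 1/\beta + \phi(x)^{\rm T} S_N \phi(x)$ with $S_N^{-1} = \alpha I + \beta \Phi^{\rm T} \Phi$, a broad prior (tiny $\alpha$) combined with a very narrow noise model (huge $\beta$) yields a misspecified model that is confidently wrong:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy inputs; we fit a (misspecified) straight-line model with basis
# phi(x) = (1, x).  Note the Gaussian predictive variance does not
# depend on the targets, only on the inputs and on alpha, beta.
x_train = rng.uniform(-1.0, 1.0, size=20)

def predictive_variance(x_train, alpha, beta, x_star):
    """sigma_N^2(x) = 1/beta + phi^T S_N phi (PRML eq. 3.59)."""
    Phi = np.column_stack([np.ones_like(x_train), x_train])
    S_N = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)  # eq. 3.54
    phi_star = np.array([1.0, x_star])
    return 1.0 / beta + phi_star @ S_N @ phi_star

# Broad prior (small alpha) plus a very narrow noise model (large beta):
# the predictive variance collapses even though the model is wrong.
var_confident = predictive_variance(x_train, alpha=1e-6, beta=1e6, x_star=0.5)
var_moderate = predictive_variance(x_train, alpha=1e-6, beta=1.0, x_star=0.5)
print(var_confident < var_moderate)  # True: the bad model is far more "sure"
```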

@davidrosenberg
Author

davidrosenberg commented Mar 29, 2018 via email

@yousuketakada
Owner

You were right; I am talking about a very broad prior (i.e., the precision $\alpha$ is very small).
If combined with a very narrow conditional over the target (i.e., the precision $\beta$ is very large), the model gives a posterior that is very narrow (sharply peaked around the MAP or ML estimate).
In this case, the predictive is also very narrow and cannot give very useful error estimates.

@davidrosenberg
Author

Ok now I think I know what you're talking about. So say, for example, we have a very expressive feature space, and so we can fit the training data very well with a Gaussian linear model, say with maximum likelihood... and such a maximum likelihood fit does poorly out of sample. So, standard overfitting. Then if we took a Bayesian approach with a very broad prior on the coefficient vector and the variance of the response is assumed to be very small, conditional on the input, then I see your point that the posterior variance would also be very small, and so the predictive variance would also be quite small, leading to poor out of sample performance again.

I would just suggest in the errata elaborating a little bit, to say that you're talking about the posterior and predictive distributions having small variance under the scenario you describe.

Also, Bishop doubles down on this point of view in 3.4: "As we shall see, the over-fitting associated with maximum likelihood can be avoided by marginalizing (summing or integrating) over the model parameters instead of making point estimates of their values. Models can then be compared directly on the training data, without the need for a validation set."

Since he's comparing to maximum likelihood, as opposed to regularized maximum likelihood, maybe his case is really more about regularization in general, vs not regularizing. And the second part is something new, about using marginal likelihood to choose the amount of regularization.
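As an illustration of that second point (not from the thread; the toy data, the degree-5 polynomial basis, and the fixed noise precision are my choices), the log evidence of PRML eq. (3.86) can be evaluated over a grid of prior precisions $\alpha$, and its maximizer selects the amount of regularization without any validation set:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 30
x = rng.uniform(-1.0, 1.0, size=N)
t = np.sin(3.0 * x) + 0.1 * rng.normal(size=N)  # made-up noisy data

Phi = np.vander(x, 6, increasing=True)  # degree-5 polynomial features
beta = 100.0                            # noise precision, assumed known
M = Phi.shape[1]

def log_evidence(alpha):
    """ln p(t | alpha, beta) for Bayesian linear regression (PRML eq. 3.86)."""
    A = alpha * np.eye(M) + beta * Phi.T @ Phi
    m_N = beta * np.linalg.solve(A, Phi.T @ t)
    E_mN = 0.5 * beta * np.sum((t - Phi @ m_N) ** 2) + 0.5 * alpha * m_N @ m_N
    _, logdetA = np.linalg.slogdet(A)
    return (0.5 * M * np.log(alpha) + 0.5 * N * np.log(beta)
            - E_mN - 0.5 * logdetA - 0.5 * N * np.log(2.0 * np.pi))

# Grid search over the regularization strength alpha: the evidence is
# maximized at an intermediate value, with no validation set involved.
alphas = np.logspace(-4, 4, 81)
best = alphas[np.argmax([log_evidence(a) for a in alphas])]
print(best)
```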

Anyway - your errata notes are awesome! Thanks for sharing.

@yousuketakada
Owner

Thank you again for your very helpful comment and I am very glad to hear that you have found my errata "awesome"!

As you suggested, I would like to make it clearer in my errata that I am here concerned with a fully Bayesian treatment (not MAP or ML), but that, if the model is not sensible, it can effectively reduce to MAP or ML (the posterior can become nearly a point estimate), giving poor predictions (the predictive distribution is then not usable either).
Stated differently, since Bayesian methods include MAP and ML as special cases, it is a logical consequence that, if MAP or ML can overfit, Bayesian methods can also overfit (although overfitting can also occur in more complicated models if they are, again, not sensible).

The statement in Section 3.4 seems more reasonable to me; it only says that the overfitting associated with maximum likelihood can be avoided by Bayesian methods (assuming a sensible model).

You are right that the statement that the marginal likelihood can be used for model selection is new.
Actually, the marginal likelihood (model evidence) is related to the problem of overfitting (i.e., prediction or generalization), but they are not the same thing (although they are often correlated in practice).

@davidrosenberg
Author

I see the marginal likelihood as a conservative estimate of out-of-sample performance, in the sense that its logarithm is proportional to average( log[p(y_1)], log[p(y_2|y_1)], ..., log[p(y_n|y_1,...,y_{n-1})] ), where y_1, ..., y_n is your training set. Each term you are averaging is the log-likelihood performance on an out-of-sample example. Of course, most of the predictions you are evaluating use much less training data than you will have after conditioning on the full training set, so the performance measure is conservative. Does this make sense? I guess LOOCV would be a better estimate on the training set.

In other words, I see marginal likelihood as quite connected to assessing overfitting.
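This decomposition can be checked numerically. The sketch below (not from the thread; it assumes a toy conjugate Gaussian model, $y \mid \mu \sim N(\mu, 1)$ with $\mu \sim N(0, s_0^2)$, chosen only because its predictives are available in closed form) verifies that the log marginal likelihood equals the sum of the sequential predictive log-likelihoods:

```python
import numpy as np

rng = np.random.default_rng(1)
s0_sq = 2.0                        # prior variance of the mean (assumed)
y = rng.normal(0.7, 1.0, size=8)   # training points y_1, ..., y_n

def log_norm_pdf(v, mean, var):
    return -0.5 * (np.log(2.0 * np.pi * var) + (v - mean) ** 2 / var)

# Sequential predictive log-likelihoods ln p(y_k | y_1, ..., y_{k-1})
# under the conjugate model y | mu ~ N(mu, 1), mu ~ N(0, s0_sq).
seq_logliks = []
for k in range(len(y)):
    post_var = 1.0 / (1.0 / s0_sq + k)      # posterior variance of mu
    post_mean = post_var * y[:k].sum()      # posterior mean of mu
    seq_logliks.append(log_norm_pdf(y[k], post_mean, post_var + 1.0))

# The log marginal likelihood computed directly: y is jointly Gaussian
# with zero mean and covariance I + s0_sq * ones * ones^T.
n = len(y)
Sigma = np.eye(n) + s0_sq * np.ones((n, n))
_, logdet = np.linalg.slogdet(Sigma)
log_marginal = -0.5 * (n * np.log(2.0 * np.pi) + logdet
                       + y @ np.linalg.solve(Sigma, y))

print(np.isclose(sum(seq_logliks), log_marginal))  # True
```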

@yousuketakada
Owner

yousuketakada commented Apr 1, 2018 via email

@yousuketakada
Owner

yousuketakada commented Apr 11, 2018

Since more revisions are required, I am now working on this in PR #12.

Some more thoughts:

  • The Bayes free energy $F_N$ and the generalization loss $G_N$, as defined above, are random variables depending on the data set $X$, so they are better written as $F_N(X)$ and $G_N(X)$, respectively, when we emphasize the dependence on $X$.
  • The leave-one-out cross-validation (LOOCV) loss $CV_N(X) = -\frac{1}{N} \sum_{n=1}^{N} \ln q(x_n|X \setminus x_n)$ is an attempt to estimate the generalization loss from the specific data set $X$, so that $CV_N(X) \approx E[G_{N-1}]$; on average, they are exactly equal: $E[CV_N] = E[G_{N-1}]$.
  • The Bayes free energy and the LOOCV loss are related by $F_N(X) = CV_N(X) + \frac{1}{N} \sum_{n=1}^{N} F_{N-1}(X \setminus x_n)$ because $p(X) = p(x_n|X \setminus x_n) \, p(X \setminus x_n)$ for any $n$; taking $-\ln$ of both sides and averaging over $n$ gives the identity.
  • Further expanding the $F_{N-1}$ terms recursively, we can write $F_N(X)$ as the sum of the average LOOCV losses over all the $n$-element subsets of $X$ from $n = 1$ to $N$ (this is a bit difficult to write in pseudo TeX here; I shall add a more precise equation in the report). Taking expectations, we again have $E[F_N] = \sum_{n=1}^{N} E[CV_n] = \sum_{n=0}^{N-1} E[G_n]$.
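The third identity above can be verified numerically. The following sketch (not part of the thread; it uses a toy conjugate Gaussian model, $x \mid \mu \sim N(\mu, 1)$ with $\mu \sim N(0, s_0^2)$, chosen for its closed-form marginal and predictive densities) checks $F_N(X) = CV_N(X) + \frac{1}{N}\sum_n F_{N-1}(X \setminus x_n)$ on a small random data set:

```python
import numpy as np

rng = np.random.default_rng(2)
s0_sq = 2.0                       # prior variance of the mean (assumed)
X = rng.normal(0.3, 1.0, size=6)  # a small toy data set

def free_energy(data):
    """F(data) = -ln p(data) for x | mu ~ N(mu, 1), mu ~ N(0, s0_sq):
    the data are jointly Gaussian with zero mean and covariance
    I + s0_sq * ones * ones^T."""
    n = len(data)
    Sigma = np.eye(n) + s0_sq * np.ones((n, n))
    _, logdet = np.linalg.slogdet(Sigma)
    return 0.5 * (n * np.log(2.0 * np.pi) + logdet
                  + data @ np.linalg.solve(Sigma, data))

def log_predictive(x_new, data):
    """ln q(x_new | data): the conjugate posterior predictive density."""
    post_var = 1.0 / (1.0 / s0_sq + len(data))
    post_mean = post_var * data.sum()
    var = post_var + 1.0
    return -0.5 * (np.log(2.0 * np.pi * var) + (x_new - post_mean) ** 2 / var)

N = len(X)
F_N = free_energy(X)
CV_N = np.mean([-log_predictive(X[n], np.delete(X, n)) for n in range(N)])
mean_F_rest = np.mean([free_energy(np.delete(X, n)) for n in range(N)])

# The identity F_N(X) = CV_N(X) + (1/N) sum_n F_{N-1}(X \ x_n):
print(np.isclose(F_N, CV_N + mean_F_rest))  # True
```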

@davidrosenberg
Author

davidrosenberg commented Apr 12, 2018

Very nice! I have never seen these precise connections between LOO cross-validation, generalization error, and Bayes free energy. Is there a reference to cite for these connections?

@yousuketakada
Owner

As I said, the terminology here is due to Watanabe (2010, 2013). (The notation is slightly different, though.) I think many of these connections can be found therein.
Another book I refer to is Theory and Method of Bayes Statistics (in Japanese).

I also found in Section 28.3 of MacKay (2003) a brief explanation (similar to yours) of the connection between LOOCV and the log marginal likelihood.

To be honest, Watanabe (2010, 2013) are too advanced for me. I too would appreciate it if you could point me to more accessible references, like MacKay (2003), for the connections between these quantities.

  • MacKay, D. J. C. (2003). Information Theory, Inference, and Learning Algorithms. Cambridge
    University Press.
  • Watanabe, S. (2010). Asymptotic equivalence of Bayes cross validation and widely applicable
    information criterion in singular learning theory. Journal of Machine Learning Research 11,
    3571–3594.
  • Watanabe, S. (2013). A widely applicable Bayesian information criterion. Journal of Machine
    Learning Research 14, 867–897.

@yousuketakada
Owner

A quote from Section 28 "Model Comparison and Occam's Razor" of MacKay (2003) [I have adapted the notation to match the one I introduced above]:

On-line learning and cross-validation.
In cases where the data consist of a sequence of points X = {x_1, x_2, ..., x_N}, the log evidence can be decomposed as a sum of "on-line" predictive performances:
ln q(X) = ln q(x_1) + ln q(x_2|x_1) + ln q(x_3|x_1, x_2) + ... + ln q(x_N|x_1, ..., x_{N - 1})
This decomposition can be used to explain the difference between the evidence and "leave-one-out cross-validation" as measures of predictive ability. Cross-validation examines the average value of just the last term, ln q(x_N|x_1, ..., x_{N - 1}), under random re-orderings of the data. The evidence, on the other hand, sums up how well the model predicted all the data, starting from scratch.

It is in Watanabe (2010) that the asymptotic equivalence between the LOOCV loss and the WAIC (widely applicable, or Watanabe-Akaike, information criterion) is proved.

Some references cited in WAIC and WBIC regarding WAIC:

  • A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin, Bayesian Data Analysis, 3rd Edition, Chapman and Hall/CRC, 2013.
  • Andrew Gelman, Jessica Hwang, and Aki Vehtari, "Understanding predictive information criteria for Bayesian models," Statistics and Computing, DOI 10.1007/s11222-013-9416-2, 2013.
  • Aki Vehtari and Janne Ojanen, "A survey of Bayesian predictive methods for model assessment, selection and comparison," Statistics Surveys, Vol. 6, pp. 142-228, 2012.

Regarding the connections between the generalization loss, the Bayes free energy, and the LOOCV loss, I could not find a good reference other than Watanabe (2012), but these relations are probably obvious from their definitions.

  • Watanabe, S. (2012). Theory and Method of Bayes Statistics. Corona Publishing Co., Ltd. (in Japanese)

Also, I could not find anywhere the relation between the LOOCV loss and the Bayes free energy I noted above, which I find very interesting, though it is again probably straightforward to derive from the definitions.

I think a detailed discussion of the behaviors of the Bayes free energy and the LOOCV loss as random variables is way beyond the scope of my report (and probably that of PRML); it suffices to note their relationships and that the LOOCV loss can be considered an attempt to estimate the generalization loss.
Also, since the Bayes free energy is closely related to the LOOCV loss, it is not unreasonable to use it (or, equivalently, the marginal likelihood) for model comparison, as also suggested in Section 3.4 of PRML.
