Dismissal of "bayesian methods don't overfit" is too quick #10
You say that using a very broad prior distribution leads to insufficient regularization and thus overfitting. I'm guessing you have in mind using the MAP estimator. But this isn't a Bayesian thing to do. A Bayesian would produce a predictive distribution (Section 3.3.2) or, if required to produce a point prediction for every input, might produce something like the predictive mean or predictive median, depending on what the ultimate loss function is.

Comments
Thank you very much for your comment.
I have a posterior distribution in mind. What I have tried to say in the corresponding errata item is that we can think of a very pessimistic model in which the posterior becomes a MAP or ML estimate and the predictive distribution gives a wrong answer very confidently.
Let us take the Bayesian linear regression model of Section 3.3 as an example and suppose that the precision $\beta$ of the target variable is very large and the precision $\alpha$ of the parameters is very small. Then, the posterior is very sharply peaked and is thus nearly a MAP or ML estimate; and the predictive distribution is also very sharply peaked and is thus nearly the target distribution (3.8) conditioned on the MAP or ML estimate of the parameters.
Of course, we can think of hyperpriors over $\beta$ and $\alpha$ and introduce more Bayesian averaging. However, again, if the extended model is wrong (e.g., the hyperpriors are again sharply peaked at wrong values), we shall have a wrong posterior and a wrong predictive distribution.
The point is that, since we do not know the true model and thus cannot even know whether the assumed model is reasonable without further knowledge, we cannot guarantee in general that "Bayesian methods don't overfit." (However, we can say that one model is better than another using Bayesian model comparison, for example.)
Please correct me if I am missing something. Thank you!
|
In your original errata I thought you were talking about very broad priors, which indeed is related to very little regularization, if you’re going to do MAP.
Now it seems like you're talking about the opposite, where you have a prior that's very peaked/concentrated on the wrong thing. You can think of this as a soft version of constraining the parameter space to just the region around the peak. This is much closer to underfitting than overfitting. The prior is so strong that it basically ignores the data until you have a ton of it.
(Sorry if I missed your point - I’m not at home right now so I can’t cross reference with the book to see exactly what you’re referring to. )
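A minimal sketch of that behavior (a Beta-Bernoulli toy of my own, with made-up numbers, just for concreteness): a prior sharply peaked at the wrong value barely moves until the data overwhelms it.

```python
import numpy as np

rng = np.random.default_rng(0)
true_theta = 0.9                 # true Bernoulli parameter
a0, b0 = 1000.0, 1000.0          # prior sharply (and wrongly) peaked at 0.5

for N in [10, 100, 1000, 10000]:
    x = rng.random(N) < true_theta          # simulated coin flips
    a, b = a0 + x.sum(), b0 + N - x.sum()   # conjugate Beta posterior update
    print(N, a / (a + b))                   # posterior mean stays near 0.5 until N is huge
```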
|
You were right; I am talking about a very broad prior (i.e., the precision $\alpha$ of the parameters is very small). |
Ok, now I think I know what you're talking about. Say, for example, we have a very expressive feature space, so we can fit the training data very well with a Gaussian linear model, say with maximum likelihood... and such a maximum likelihood fit does poorly out of sample. So, standard overfitting. Then if we take a Bayesian approach with a very broad prior on the coefficient vector, and the variance of the response (conditional on the input) is assumed to be very small, I see your point that the posterior variance would also be very small, and so the predictive variance would also be quite small, leading to poor out-of-sample performance again. I would just suggest elaborating a little bit in the errata, to say that you're talking about the posterior and predictive distributions having small variance under the scenario you describe.

Also, Bishop doubles down on this point of view in Section 3.4: "As we shall see, the over-fitting associated with maximum likelihood can be avoided by marginalizing (summing or integrating) over the model parameters instead of making point estimates of their values. Models can then be compared directly on the training data, without the need for a validation set." Since he's comparing to maximum likelihood, as opposed to regularized maximum likelihood, maybe his case is really more about regularizing in general vs. not regularizing. And the second part is something new, about using the marginal likelihood to choose the amount of regularization.

Anyway - your errata notes are awesome! Thanks for sharing.
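To make that first scenario concrete, here is a minimal numpy sketch (made-up data; the posterior and predictive formulas are the standard Gaussian linear-model ones from PRML Section 3.3):

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 10, 8                     # data points; polynomial basis functions
alpha, beta = 1e-6, 1e6          # very broad prior; tiny assumed noise variance

x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(N)  # true noise is large

Phi = np.vander(x, M)                                        # design matrix
S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)  # posterior covariance
m_N = beta * S_N @ Phi.T @ t                                 # posterior mean (~ ML fit)

phi = np.vander(np.array([0.5]), M)               # features of a test input
pred_var = float(1 / beta + phi @ S_N @ phi.T)    # predictive variance
print(float(phi @ m_N), pred_var)  # stated variance is orders of magnitude
                                   # below the true noise variance (0.09)
```
|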
Thank you again for your very helpful comment; I am very glad to hear that you have found my errata "awesome"! As you suggested, I would like to make it clearer in my errata that I am concerned here with a fully Bayesian treatment (not MAP or ML) but that, if the model is not sensible, it can effectively reduce to MAP or ML (the posterior can become nearly a point estimate), giving poor predictions (the predictive distribution is likewise not usable). The statement in Section 3.4 is, to me, more reasonable; it only says that overfitting can be avoided by Bayesian methods (if we assume sensible models). You are right that the statement that the marginal likelihood can be used for model selection is new. |
I see marginal likelihood as a conservative estimate of out-of-sample performance, in the sense that it's proportional to average( log[p(y_1)], log[p(y_2|y_1)], ..., log[p(y_n|y_1,...,y_{n-1})] ), where y_1, ..., y_n is your training set. Each thing you're averaging is the log-likelihood performance on an out-of-sample example. Of course, most predictions you're evaluating are using much less training data than you'll have after conditioning on the full training set, so the performance measure is conservative. Does this make sense? I guess LOOCV would be a better estimate on the training set. In other words, I see marginal likelihood as quite connected to assessing overfitting. |
Your point makes perfect sense!
However, since we do not know the true probability distribution in general, we must be careful about the distribution with which we take expectations. So please let me elaborate on your point by introducing some notation.
First, let X = { x_1, ..., x_N } be the training data (here I use x instead of y). The data are assumed to be i.i.d., generated from some true distribution p(x), so that
p(X) = p(x_1, ..., x_N) = \prod_{n=1}^{N} p(x_n).
As a shorthand, the probability distribution of our assumed model M is denoted by q(.) = p(.|M). If there exists a true model M^*, then the true distribution p(.) can be written as p(.) = p(.|M^*). The model M we learn is a pair of the likelihood q(x|w) and the prior q(w), where w is a set of parameters.
The marginal likelihood (or model evidence) is given by
q(X) = \int dw q(w) q(X|w) = \int dw q(w) \prod_{n=1}^{N} q(x_n|w).
Note that q(X) does not factorize in general because we have the unknown parameters w in the model M.
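As a quick illustration (a toy example of my own, not from the original thread): take q(x|w) to be Bernoulli with parameter w and a uniform prior q(w) on [0, 1]. Then
q(x_1 = 1, x_2 = 1) = \int_0^1 w^2 dw = 1/3
whereas
q(x_1 = 1) q(x_2 = 1) = (1/2)(1/2) = 1/4
so observing x_1 changes the predictive distribution for x_2 and the marginal likelihood does not factorize.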
The Bayes free energy F_N is defined as the negative log marginal likelihood, so that
F_N = - ln q(X) = - ln q(x_1, ..., x_N).
Taking its expectation over all the data X with respect to the true distribution p(X), we have
E[F_N] = - \int dX p(X) ln q(X) = N S + KL(p(X)||q(X))
where S = - \int dx p(x) ln p(x) is the (differential) entropy of x and is thus a constant independent of the model M (a short derivation is spelled out below).
Therefore, minimizing the expected Bayes free energy E[F_N] is equivalent to minimizing the Kullback-Leibler divergence from the true distribution p(X) to the learned model q(X).
Although we cannot compute the expected value E[F_N] of the Bayes free energy F_N in practice, it is not unreasonable to suppose that a smaller value of F_N tends to imply that q(X) better approximates p(X) on average, provided that the assumed model M is somewhat reasonable. See also the discussion around (3.73).
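(Spelling out the decomposition: writing - ln q(X) = ln [p(X)/q(X)] - ln p(X) and taking expectations,
E[F_N] = \int dX p(X) ln [p(X)/q(X)] - \int dX p(X) ln p(X) = KL(p(X)||q(X)) + N S
where the second term equals N S because p(X) = \prod_{n=1}^{N} p(x_n).)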
On the other hand, in much the same way as we have motivated maximum likelihood in (1.119), the generalization loss G_N is defined as the negative logarithm of the predictive q(x|X), expected with respect to a new input x, so that
G_N = E_x[- ln q(x|X)] = - \int dx p(x) ln q(x|X) = S + KL(p(x)||q(x|X))
where S is the (differential) entropy of x (see above). Therefore, minimizing the generalization loss is equivalent to minimizing the Kullback-Leibler divergence from p(x) to q(x|X).
Since, by the product rule, we have
F_N = - ln q(x_1) - ln q(x_2|x_1) - ln q(x_3|x_1, x_2) - ... - ln q(x_N|x_1, ..., x_{N - 1})
the expected values of the Bayes free energy F_N and the generalization loss are related by
E[F_N] = \sum_{n=0}^{N - 1} E[G_n]
where the expectations are taken with respect to all the data X and G_n denotes the generalization loss after observing the first n data points. Therefore, as you suggested, the Bayes free energy can be considered a somewhat conservative estimate of the generalization loss. The terminology here is borrowed from Watanabe (2013) (see References).
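A quick Monte Carlo check of this relation in a toy conjugate model (my own construction, not from the thread): for a Bernoulli likelihood with a Beta(a_0, b_0) prior, the predictive q(x = 1 | x_1, ..., x_n) = (a_0 + #heads)/(a_0 + b_0 + n) is available in closed form, so both sides of E[F_N] = \sum_{n=0}^{N - 1} E[G_n] can be estimated by simulation:

```python
import numpy as np

rng = np.random.default_rng(2)
p_true, N, reps = 0.7, 20, 5000
a0, b0 = 1.0, 1.0                # uniform Beta prior on the Bernoulli parameter

F_sum, G_sum = 0.0, 0.0
for _ in range(reps):
    x = rng.random(N) < p_true
    heads = 0
    for n in range(N):
        q1 = (a0 + heads) / (a0 + b0 + n)     # predictive q(x=1 | x_1..x_n)
        # one term of F_N = -sum_n ln q(x_{n+1} | x_1..x_n)
        F_sum -= np.log(q1) if x[n] else np.log(1.0 - q1)
        # one term of sum_n G_n, with E_x[.] computed exactly under p(x)
        G_sum -= p_true * np.log(q1) + (1 - p_true) * np.log(1.0 - q1)
        heads += x[n]

print(F_sum / reps, G_sum / reps)  # both estimate E[F_N]; they agree to MC error
```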
|
The paragraph has been renamed from "Bayes free energy vs. generalization loss". Since more revisions are required, I am now working on them in PR #12. Some more thoughts:
|
Very nice! I have never seen these precise connections between LOO cross-validation, generalization error, and Bayes free energy. Is there a reference to cite for these connections? |
As I said, the terminology here is due to Watanabe (2010, 2013). (The notation is slightly different, though.) I think many of these connections can be found therein. I also found in Section 28.3 of MacKay (2003) a brief explanation (similar to yours) of the connection between LOOCV and the log marginal likelihood. To be honest, Watanabe (2010, 2013) are too advanced for me. I too would appreciate it if you could point me to more accessible references, like MacKay (2003), for the connections between these quantities.
|
A quote from Section 28 "Model Comparison and Occam's Razor" of MacKay (2003) [I have adapted the notation to match the one I introduced above]:
It is in Watanabe (2010) that the asymptotic equivalence between the LOOCV loss and WAIC (the Watanabe-Akaike information criterion) is proved. Some references cited in WAIC and WBIC regarding WAIC:
Regarding the connections between the generalization loss, the Bayes free energy, and the LOOCV loss, I could not find a good reference other than Watanabe (2012), but these relations are probably obvious from their definitions.
Also, I could not find anywhere the relation between the LOOCV loss and the Bayes free energy that I noted above, which is, to me, very interesting but again probably obvious. I think a detailed discussion of the behaviors of the Bayes free energy and the LOOCV loss as random variables is well beyond the scope of my report (and, probably, that of PRML); it suffices to note their relationships, and also that the LOOCV loss can be considered an attempt to estimate the generalization loss. |