diff --git a/prml_errata.tex b/prml_errata.tex
index 0fd94f1..8ceeeb9 100644
--- a/prml_errata.tex
+++ b/prml_errata.tex
@@ -164,6 +164,7 @@ \subsubsection*{#1}
Specifically, (1.1) should read
\begin{equation}
y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M = \sum_{j=0}^{M} w_j x^j .
+\label{eq:polynomial_regression_function}
\end{equation}

\erratum{Page~10}
@@ -301,6 +302,32 @@ \subsubsection*{#1}
then the conditional~$p(y|x)$ is well-defined regardless of $x$ so that $p(y|x) = p(y)$
and, again, the product rule~(1.32) holds, which, in this case, reduces to $p(x, y) = p(x) p(y)$.

+\erratum{Page~33}
+Paragraph~3:
+Note that the two information criteria for model selection mentioned here, namely, AIC and BIC,
+are different criteria with different goals (see below).
+However, the difference between the two criteria (i.e., AIC and BIC) or, more generally,
+the difference between their goals
+(i.e., generalization and identification of the ``true'' model, respectively)
+seems not to be well recognized in PRML;
+we shall come back to this issue later in this report.
+
+\parhead{AIC vs.\ BIC}
+(i)~The \emph{Akaike information criterion} (AIC) and
+(ii)~Schwarz's \emph{Bayesian information criterion} (BIC; see Section~4.4.1)
+are different criteria with different goals, i.e.,
+(i)~to make better predictions given the training data set
+(an ability called \emph{generalization}) and
+(ii)~to identify the ``true'' model from which the data set has been generated
+(or to better explain the data set in terms of the \emph{marginal likelihood}; see Section~3.4),
+respectively.\footnote{%
+Since we cannot tell which goal (i.e., generalization or identification of the ``true'' model)
+is more ``Bayesian'' than the other,
+``Bayesian information criterion'' is a misnomer.}
+Although the two criteria are often seen as competing,
+one can see from the above that, since their goals are different,
+there is no point in asking which criterion is optimal unconditionally.
+
\erratum{Page~33}
The line after (1.73):
The best-fit \emph{log} likelihood~$p\left(\mathcal{D}\middle|\mathbf{w}_{\text{ML}}\right)$
@@ -2005,62 +2032,172 @@ \subsubsection*{#1}
\erratum{Page~147}
Paragraph~\textminus2:
-The argument that ``the phenomenon of [overfitting\footnote{%
-In this report, we use the term ``overfitting'' without hyphenation
-(i.e., instead of ``over-fitting'' as in PRML).}]
+The argument that ``the phenomenon of [overfitting]\footnote{%
+In this report, we use the term ``overfitting'' without hyphenation
+(i.e., instead of ``over-fitting'' as in PRML).}
+\dots
does not arise when we marginalize over parameters in a Bayesian setting''
is simply an overstatement.
Bayesian methods, like any other machine learning methods, can overfit
-because the \emph{true} model from which the data set has been generated is unknown in general
-so that one could possibly assume an inappropriate (too expressive) model
-that would give a terribly wrong prediction very confidently;
-this is true even when we take a ``fully'' Bayesian approach
-(i.e., \emph{not} maximum likelihood, MAP, or whatever) as discussed shortly.
-We also discuss in what follows
-the difference between the two criteria for assessing model complexity, namely,
-the \emph{generalization error} (see Section~3.2) and
-the \emph{marginal likelihood} (or the \emph{model evidence}; see Section~3.4),
-which is not well recognized in PRML.
+because the ``true'' model from which the data set has been generated is unknown in general
+so that one could possibly assume an inappropriate model,
+say, too expressive a model that would make terribly wrong predictions very confidently;
+this is true even when we take a ``fully Bayesian'' approach
+(i.e., \emph{not} maximum likelihood, MAP, or whatever).
+
+In the following, we first show that such an overfitting Bayesian model indeed exists,
+after which we discuss in some detail the difference between
+the two criteria for model selection (or model comparison)
+considered in Sections~3.2 and 3.4, namely,
+(i)~the \emph{generalization error} and
+(ii)~the \emph{marginal likelihood} (or the \emph{model evidence}), respectively.

\parhead{A Bayesian model that exhibits overfitting}
Let us take a Bayesian linear regression model of Section~3.3 as an example and
-suppose that the precision~$\beta$ of the target~$t$ in the likelihood~(3.8) is very large
-whereas the precision~$\alpha$ of the parameters~$\mathbf{w}$ in the prior~(3.52) is very small
-(i.e., the conditional distribution of $t$ given $\mathbf{w}$ is narrow whereas
-the prior over $\mathbf{w}$ is broad), leading to insufficient \emph{regularization}
-(see Section~3.1.4).
-Then, the posterior~$p(\mathbf{w}|\bm{\mathsf{t}})$ given the data set~$\bm{\mathsf{t}}$ will be
-sharply peaked around the maximum likelihood estimate~$\mathbf{w}_{\text{ML}}$ and
-the predictive~$p(t|\bm{\mathsf{t}})$ be also sharply peaked
-(well approximated by the likelihood conditioned on $\mathbf{w}_{\text{ML}}$).
-Stated differently, the assumed model reduces to the least squares method,
-which is known to suffer from overfitting (see Section~1.1).
-
-Of course, we can extend the model by incorporating hyperpriors over $\beta$ and $\alpha$,
+show that it can overfit.
+Suppose that the prior~(3.52) over the parameters~$\mathbf{w}$ is broad
+whereas the conditional~(3.8) over the target~$t$ given $\mathbf{w}$ is narrow,
+i.e., the precision~$\alpha$ of $\mathbf{w}$ is very small
+whereas the precision~$\beta$ of $t$ given $\mathbf{w}$ is very large,
+leading to insufficient \emph{regularization} (see Section~3.1.4).
+Then, the posterior~(3.49) over $\mathbf{w}$ given the data set~$\bm{\mathsf{t}}$ will be
+sharply peaked around the maximum likelihood estimate~$\mathbf{w}_{\text{ML}}$ given by (3.15)
+so that the predictive distribution~(3.58) over $t$ given $\bm{\mathsf{t}}$ will be
+well approximated by the conditional~(3.8) conditioned on $\mathbf{w}_{\text{ML}}$
+and also sharply peaked around the regression function~(3.3).
+Stated differently, learning with the Bayesian model thus assumed reduces to the least squares method,
+which is known to suffer from overfitting, so that
+the \emph{generalization error} can be very large if the regression function is too expressive,
+e.g., if the order~$M$ of the polynomial regression function~\eqref{eq:polynomial_regression_function}
+is large compared to the size of the training data set, as we have seen in Section~1.1.
+
+Of course, one can extend the model by incorporating hyperpriors over $\alpha$ and $\beta$,
thus introducing more Bayesian averaging.
However, if the extended model is not sensible
-(e.g., the hyperpriors are sharply peaked around wrong values),
+(e.g., the hyperpriors are tuned to wrong values),
we shall again end up with a wrong posterior and a wrong predictive.
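+
+To make explicit the limiting behavior that underlies the overfitting just described,
+here is a quick sketch in the notation of Section~3.3 (cf.~(3.53), (3.54), and (3.59)):
+\begin{equation}
+\mathbf{m}_N = \beta \mathbf{S}_N \bm{\Phi}^{\mathrm{T}} \bm{\mathsf{t}} ,
+\qquad
+\mathbf{S}_N^{-1} = \alpha \mathbf{I} + \beta \bm{\Phi}^{\mathrm{T}} \bm{\Phi} ,
+\qquad
+\sigma_N^2(\mathbf{x}) = \frac{1}{\beta} + \bm{\phi}(\mathbf{x})^{\mathrm{T}} \mathbf{S}_N \bm{\phi}(\mathbf{x})
+\end{equation}
+so that, as $\alpha \to 0$ (and assuming $\bm{\Phi}^{\mathrm{T}} \bm{\Phi}$ is invertible),
+\begin{equation}
+\mathbf{m}_N \to
+\left(\bm{\Phi}^{\mathrm{T}} \bm{\Phi}\right)^{-1} \bm{\Phi}^{\mathrm{T}} \bm{\mathsf{t}}
+= \mathbf{w}_{\text{ML}}
+\quad\text{and}\quad
+\sigma_N^2(\mathbf{x}) \to
+\frac{1}{\beta}
+\left\{ 1 + \bm{\phi}(\mathbf{x})^{\mathrm{T}}
+\left(\bm{\Phi}^{\mathrm{T}} \bm{\Phi}\right)^{-1} \bm{\phi}(\mathbf{x}) \right\} ,
+\end{equation}
+which also vanishes as $\beta \to \infty$;
+that is, the predictive distribution collapses onto the least squares fit,
+however poorly the latter generalizes.
+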
-The point here is that, since we do not know the true model (if any),
-we cannot know whether the assumed model is sensible in advance
+The point here is that, since we do not know the ``true'' model (if any),
+we cannot know if the assumed model is sensible in advance
(i.e., without any knowledge about data to be generated).
We can however assess, given a data set, whether a model is better than another
-by, say, \emph{Bayesian model comparison} (see Section~3.4),
-though a caveat is that we still need some (implicit) assumptions for the framework of
-Bayesian model comparison to work;
+in terms of, say, the \emph{marginal likelihood} as discussed in Section~3.4,
+though a caveat is that we still need some (implicit) assumptions for
+this Bayesian model comparison framework to work;
see the discussion around (3.73).

\parhead{Generalization error vs.\ marginal likelihood}
-Moreover, one should also be aware of a subtlety that
-(i)~the \emph{generalization error},
-which can be estimated by cross-validation (Section~3.2), and
-(ii)~the \emph{marginal likelihood},
-which is used in the Bayesian model comparison framework (Section~3.4),
-are closely related but different criteria for assessing model complexity,
-although, in practice, a higher marginal likelihood often tends to imply
-a lower generalization error and vice versa.
-For more (advanced) discussions, see \citet{Watanabe:WAIC,Watanabe:WBIC}.
+Moreover, we should also be aware of a subtlety here:
+(i)~the \emph{generalization error} and (ii)~the \emph{marginal likelihood}
+are closely related but different criteria for model selection
+(although, in practice, a higher marginal likelihood often tends to imply
+a lower generalization error and vice versa).
+
+As a general rule, one should adopt:
+(i)~the generalization error
+if one wants to make better predictions given the data set; and
+(ii)~the marginal likelihood
+if one wants to identify the ``true'' model from which the data set has been generated
+(see below for more discussion).
+
+Of course, nothing prevents us from examining the behavior of \emph{both} criteria,
+if possible, to assess the model concerned;
+it is even worth the effort to do so because, since the two criteria are different,
+we can gain more information from both of them than from either one alone.
+If, say, we find that both criteria do or do not prefer a model,
+then we can be more confident that the model is sensible or not, respectively.
+
+\parhead{More on generalization error vs.\ marginal likelihood
+(or generalization loss vs.\ Bayes free energy)}
+In what follows, I would like to further elaborate on
+the difference between the two criteria for model selection.
+In order to facilitate the discussion, we first introduce some terminology.
+Throughout the discussion,
+special care must be taken about the distribution with respect to which we take expectations
+because the ``true'' distribution is unknown in general
+and, therefore, we must work with some model (i.e., hypothetical) distributions,
+so that confusion can easily arise about which distribution is meant.
+To avoid such confusion, we also introduce some notation.
+
+Note that the terminology and the notation to be introduced here are somewhat different from
+those of PRML or other parts of this report.
+The terminology and the discussion here are largely due to \citet{Watanabe:BayesStatistics},
+though the notation is not, because I have tried to follow that of PRML as closely as possible.
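+
+Before introducing the notation, one elementary observation may help explain
+why the two criteria are closely related and yet different
+(this is only the product rule, so the following is a sketch of the idea rather than a new result).
+Writing the data set generically as $\mathcal{D} = \{ x_1, \dots, x_N \}$ and
+the model as $\mathcal{M}$, the log marginal likelihood decomposes as
+\begin{equation}
+\ln p(\mathcal{D}|\mathcal{M})
+= \sum_{n=1}^{N} \ln p(x_n | x_1, \dots, x_{n-1}, \mathcal{M}) ,
+\end{equation}
+i.e., it is the sum of the log predictive densities in which
+each data point is predicted from all the preceding ones.
+The marginal likelihood thus measures a form of predictive performance
+averaged over training sets of size $0, 1, \dots, N-1$
+(which is, roughly speaking, why it is sensitive to the prior),
+whereas the generalization error measures how well we predict \emph{unseen} data
+after having seen all the $N$ data points.
+This also suggests why, in practice, the two criteria often, but not always, agree.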
+
+First of all, we should point out that
+so far we have used the term \emph{generalization error} somewhat loosely
+in this report and also in PRML;
+it is used primarily in the context of frequentist inference
+(such as that of Section~3.2)
+and can generally be defined as the \emph{expected loss}~(see Section~1.5.5)
+evaluated for the predicted target value (i.e., a point estimate)
+under some given loss function (e.g., the squared error for regression).
+An alternative definition of the generalization error from a Bayesian perspective would be
+in terms of the predictive distribution.
+Specifically, we can define it as the expected negative log predictive distribution;
+to avoid ambiguity, we hereafter call this particular criterion for
+assessing a model's predictive performance the \emph{generalization loss}
+(see below for a precise definition).
+We shall also define another criterion called
+the \emph{Bayes free energy},
+which is nothing but the negative log marginal likelihood.
+The generalization loss is more naturally compared with the Bayes free energy, as we shall see shortly.
+
+Let us now introduce some notation.
+First, let $\mathbf{X} = \{ \mathbf{x}_1, \dots, \mathbf{x}_N \}$ be the training data set
+(we use a vector~$\mathbf{x}$ instead of a scalar~$t$ for the target variable here).
+We assume that the data set~$\mathbf{X}$ has been generated i.i.d.\ from
+some true distribution~$p(\cdot)$ so that
+\begin{equation}
+p(\mathbf{X}) = p(\mathbf{x}_1, \dots, \mathbf{x}_N) = \prod_{n=1}^{N} p(\mathbf{x}_n).
+\end{equation}
+Let $\mathcal{M}$ be the assumed model we wish to learn.
+As a shorthand, the probability distribution of our assumed model~$\mathcal{M}$ is denoted by
+\begin{equation}
+q(\cdot) \equiv p(\cdot|\mathcal{M}) ,
+\end{equation}
+i.e., the conditioning on $\mathcal{M}$ is implicit for $q(\cdot)$.\footnote{%
+We define $p(\cdot|a|b) \equiv p(\cdot|a, b)$ so that
+we can write the conditional~$q(\cdot|\cdot)$.}
+If there exists some true model~$\mathcal{M}^{\star}$, then
+\begin{equation}
+p(\cdot) \equiv p(\cdot|\mathcal{M}^{\star}).
+\end{equation}
+The model~$\mathcal{M}$ consists of a pair of distributions:
+(i)~a prior~$q(\mathbf{w})$ over a set of parameters~$\mathbf{w}$; and
+(ii)~a conditional~$q(\mathbf{x}|\mathbf{w})$ over $\mathbf{x}$ given $\mathbf{w}$.\footnote{%
+The model~$\mathcal{M}$ may include
+a hyperprior~$q(\bm{\xi})$ over a set of hyperparameters~$\bm{\xi}$ and so on.
+It is easy to see that the discussion here is applicable also to such a hierarchical model
+because we can consider the joint prior of the form~$q(\mathbf{w}, \bm{\xi}) =
+q(\mathbf{w}|\bm{\xi}) q(\bm{\xi})$ and, therefore, $\bm{\xi}$ can be absorbed into $\mathbf{w}$.}
+The marginal likelihood (or the evidence) of the model~$\mathcal{M}$ is given by
+\begin{equation}
+q(\mathbf{X}) = \int \mathrm{d}\mathbf{w} \, q(\mathbf{w}) \, q(\mathbf{X}|\mathbf{w}) =
+\int \mathrm{d}\mathbf{w} \, q(\mathbf{w}) \prod_{n=1}^{N} q(\mathbf{x}_n|\mathbf{w}).
+\end{equation}
+Note that, although the conditional~$q(\mathbf{X}|\mathbf{w})$ does factorize,
+the marginal likelihood~$q(\mathbf{X})$ does not in general
+because the unknown parameters~$\mathbf{w}$, which are shared by all the data points,
+have been marginalized out.
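+
+With this notation, one standard way to make the two criteria precise
+(largely following \citet{Watanabe:BayesStatistics}; the following is only a sketch and
+the conventions, e.g., constant factors, may differ slightly from those used elsewhere) is as follows.
+Given the training data set~$\mathbf{X}$, the predictive distribution of the model~$\mathcal{M}$ is
+\begin{equation}
+q(\mathbf{x}|\mathbf{X})
+= \int \mathrm{d}\mathbf{w} \, q(\mathbf{w}|\mathbf{X}) \, q(\mathbf{x}|\mathbf{w}) ,
+\qquad
+q(\mathbf{w}|\mathbf{X}) = \frac{q(\mathbf{w}) \, q(\mathbf{X}|\mathbf{w})}{q(\mathbf{X})} ,
+\end{equation}
+in terms of which we can define the \emph{generalization loss}~$G_N$ and
+the \emph{Bayes free energy}~$F_N$ as
+\begin{equation}
+G_N = - \int \mathrm{d}\mathbf{x} \, p(\mathbf{x}) \ln q(\mathbf{x}|\mathbf{X}) ,
+\qquad
+F_N = - \ln q(\mathbf{X}) .
+\end{equation}
+The two criteria are again closely related; e.g., since
+$q(\mathbf{x}_{N+1}|\mathbf{X}) = q(\mathbf{x}_1, \dots, \mathbf{x}_{N+1}) / q(\mathbf{X})$,
+taking the expectation with respect to the true distribution of
+$\mathbf{X}$ and $\mathbf{x}_{N+1}$ gives
+\begin{equation}
+\mathbb{E}\left[ G_N \right] = \mathbb{E}\left[ F_{N+1} \right] - \mathbb{E}\left[ F_N \right] ,
+\end{equation}
+i.e., the expected generalization loss is the expected increase in the Bayes free energy
+when one more data point is added.
+Nevertheless, they are different criteria and can prefer different models, as discussed above.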
+
+% TODO
+
+%[cf. \emph{cross-validation} (see Section~1.3)]
+
+%In \citet{Watanabe:BayesStatistics},
+%the generalization error is defined as the Kullback-Leibler divergence~$
+%\operatorname{KL}\left(p(\mathbf{x})\middle\|q(\mathbf{x}|\mathbf{X})\right)$
+%between the true distribution~$p(\mathbf{x})$ and
+%the predictive distribution~$q(\mathbf{x}|\mathbf{X})$.
+
+% TODO:
+% Section 3.5.1, generalization loss and evidence
+% Section 1.5.5
+
+%For WAIC and WBIC, see \citet{Watanabe:WAIC,Watanabe:WBIC}.
+
+% F_N is, as the subscript N suggests, a function of X

\erratum{Page~156}
Equation~(3.57):
@@ -2199,6 +2336,18 @@ \subsubsection*{#1}
Paragraph~1, Line~1:
``must related'' should be ``must be related.''

+\erratum{Page~217}
+Paragraph~3:
+Here, it is pointed out that information criteria such as AIC and BIC are no longer valid
+if the posterior cannot be approximated by a Gaussian;
+a model for which this is the case is called \emph{singular} and, in fact,
+many practical models are known to be singular
+\citep{Watanabe:BayesStatistics,Watanabe:WAIC}.
+It is also worth noting here that new information criteria applicable to singular models
+have recently been proposed, namely,
+WAIC~\citep{Watanabe:WAIC,Watanabe:BayesStatistics} and WBIC~\citep{Watanabe:WBIC},
+which are generalized versions of AIC and BIC, respectively.
+
\erratum{Page~218}
Equation~(4.144):
The covariance should be the one~$\mathbf{S}_N$ evaluated at $\mathbf{w}_{\text{MAP}}$.
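+To spell out why the point of evaluation matters here
+(a sketch in the notation of Section~4.5, without repeating the equation numbers):
+the Laplace approximation replaces the posterior by a Gaussian centered at its mode,
+\begin{equation}
+q(\mathbf{w}) = \mathcal{N}\left(\mathbf{w} \,\middle|\, \mathbf{w}_{\text{MAP}}, \mathbf{S}_N\right) ,
+\qquad
+\mathbf{S}_N^{-1}
+= - \left. \nabla\nabla \ln p(\mathbf{w}|\bm{\mathsf{t}}) \right|_{\mathbf{w} = \mathbf{w}_{\text{MAP}}}
+= \mathbf{S}_0^{-1} + \sum_{n=1}^{N} y_n (1 - y_n) \bm{\phi}_n \bm{\phi}_n^{\mathrm{T}} ,
+\end{equation}
+where $y_n = \sigma\left(\mathbf{w}_{\text{MAP}}^{\mathrm{T}} \bm{\phi}_n\right)$.
+Since the Hessian depends on $\mathbf{w}$ through $y_n$,
+the covariance~$\mathbf{S}_N$ is only defined once it is evaluated at $\mathbf{w}_{\text{MAP}}$;
+evaluating $y_n$ at any other value of $\mathbf{w}$ would not give
+the covariance of the approximating Gaussian.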