From 3c1477c6f79c855c3a671ca98a50d440d3f57ed4 Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Wed, 11 Apr 2018 00:22:58 +0900 Subject: [PATCH 01/70] Edit: A Bayesian model that exhibits overfitting --- prml_errata.tex | 44 +++++++++++++++++++++++--------------------- 1 file changed, 23 insertions(+), 21 deletions(-) diff --git a/prml_errata.tex b/prml_errata.tex index ba429a4..1f7cfa3 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -1956,37 +1956,39 @@ \subsubsection*{#1} Bayesian methods, like any other machine learning methods, can overfit because the \emph{true} model from which the data set has been generated is unknown in general so that one could possibly assume an inappropriate (too expressive) model -that would give a terribly wrong prediction very confidently; -this is true even when we take a ``fully'' Bayesian approach -(i.e., \emph{not} maximum likelihood, MAP, or whatever) as discussed shortly. -We also discuss in what follows -the difference between the two criteria for assessing model complexity, namely, -the \emph{generalization error} (see Section~3.2) and -the \emph{marginal likelihood} (or the \emph{model evidence}; see Section~3.4), +that would make terribly wrong predictions very confidently; +this is true even when we take a ``fully Bayesian'' approach +(i.e., \emph{not} maximum likelihood, MAP, or whatever). +In the following, we show such a Bayesian model that exhibits overfitting, +after which we also discuss the difference between the two criteria for model selection, namely, +(i)~the \emph{generalization error} (see Section~3.2) and +(ii)~the \emph{marginal likelihood} (or the \emph{model evidence}; see Section~3.4), which is not well recognized in PRML. \parhead{A Bayesian model that exhibits overfitting} Let us take a Bayesian linear regression model of Section~3.3 as an example and -suppose that the precision~$\beta$ of the target~$t$ in the likelihood~(3.8) is very large -whereas the precision~$\alpha$ of the parameters~$\mathbf{w}$ in the prior~(3.52) is very small -(i.e., the conditional distribution of $t$ given $\mathbf{w}$ is narrow whereas -the prior over $\mathbf{w}$ is broad), leading to insufficient \emph{regularization} -(see Section~3.1.4). -Then, the posterior~$p(\mathbf{w}|\bm{\mathsf{t}})$ given the data set~$\bm{\mathsf{t}}$ will be -sharply peaked around the maximum likelihood estimate~$\mathbf{w}_{\text{ML}}$ and -the predictive~$p(t|\bm{\mathsf{t}})$ be also sharply peaked -(well approximated by the likelihood conditioned on $\mathbf{w}_{\text{ML}}$). -Stated differently, the assumed model reduces to the least squares method, +show that it can overfit. +Suppose that the prior~(3.52) over the parameters~$\mathbf{w}$ is broad +whereas the conditional~(3.8) over the target~$t$ given $\mathbf{w}$ is narrow, +i.e., the precision~$\alpha$ of $\mathbf{w}$ is very small +whereas the precision~$\beta$ of $t$ is very large, +leading to insufficient \emph{regularization} (see Section~3.1.4). +Then, the posterior~(3.49) over $\mathbf{w}$ given the data set~$\bm{\mathsf{t}}$ will be +sharply peaked around the maximum likelihood estimate~$\mathbf{w}_{\text{ML}}$ given by (3.15); +and the predictive~(3.58) over $t$ given $\bm{\mathsf{t}}$ will be +well approximated by the likelihood~(3.8) conditioned on $\mathbf{w}_{\text{ML}}$ +and also sharply peaked around the regression function~(3.3). +Stated differently, learning the thus assumed model reduces to the least squares method, which is known to suffer from overfitting (see Section~1.1). 
-Of course, we can extend the model by incorporating hyperpriors over $\beta$ and $\alpha$, +Of course, one can extend the model by incorporating hyperpriors over $\alpha$ and $\beta$, thus introducing more Bayesian averaging. However, if the extended model is not sensible -(e.g., the hyperpriors are sharply peaked around wrong values), +(e.g., the hyperpriors are tuned to wrong values), we shall again end up with a wrong posterior and a wrong predictive. The point here is that, since we do not know the true model (if any), -we cannot know whether the assumed model is sensible in advance +we cannot know if the assumed model is sensible in advance (i.e., without any knowledge about data to be generated). We can however assess, given a data set, whether a model is better than another by, say, \emph{Bayesian model comparison} (see Section~3.4), @@ -2000,7 +2002,7 @@ \subsubsection*{#1} which can be estimated by cross-validation (Section~3.2), and (ii)~the \emph{marginal likelihood}, which is used in the Bayesian model comparison framework (Section~3.4), -are closely related but different criteria for assessing model complexity, +are closely related but different criteria for model selection, although, in practice, a higher marginal likelihood often tends to imply a lower generalization error and vice versa. For more (advanced) discussions, see \citet{Watanabe:WAIC,Watanabe:WBIC}. From ddfc1523e92f487b2c74e6bc43fe6234fa45a73f Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Sun, 15 Apr 2018 14:34:15 +0900 Subject: [PATCH 02/70] Edit: Acknowledgements --- prml_errata.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/prml_errata.tex b/prml_errata.tex index 1f7cfa3..cd5e2bc 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -130,7 +130,7 @@ \subsubsection*{#1} In particular, I am grateful to Christopher Sahnwaldt, Mark-Jan Nederhof, and -David Rosenberg +David S.\ Rosenberg for their invaluable comments and discussions. \Section{Corrections and Comments} From 75914b4b508c86ce941dcf759ce46333c1cb8a28 Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Sat, 28 Apr 2018 00:44:23 +0900 Subject: [PATCH 03/70] s/well recognized/well-recognized/ --- prml_errata.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/prml_errata.tex b/prml_errata.tex index b3186b4..166e39c 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -1974,7 +1974,7 @@ \subsubsection*{#1} after which we also discuss the difference between the two criteria for model selection, namely, (i)~the \emph{generalization error} (see Section~3.2) and (ii)~the \emph{marginal likelihood} (or the \emph{model evidence}; see Section~3.4), -which is not well recognized in PRML. +which is not well-recognized in PRML. \parhead{A Bayesian model that exhibits overfitting} Let us take a Bayesian linear regression model of Section~3.3 as an example and From e0e9b37d87646a4baf4a36c8232aa789ca6ca0a9 Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Sat, 28 Apr 2018 00:47:19 +0900 Subject: [PATCH 04/70] Add erratum on Page 33, Paragraph 3 --- prml_errata.tex | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/prml_errata.tex b/prml_errata.tex index 166e39c..b18b6cb 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -267,6 +267,15 @@ \subsubsection*{#1} then the conditional~$p(y|x)$ is well-defined regardless of $x$ so that $p(y|x) = p(y)$ and, again, the product rule~(1.32) holds, which in this case reduces to $p(x, y) = p(x) p(y)$. 
+\erratum{Page~33} +Paragraph~3: +Note that +the \emph{Akaike information criterion} (AIC) and +the Schwartz's \emph{Bayesian information criterion} (BIC) +have different goals and thus are different criteria for model selection. +However, the difference is not well-recognized in PRML. +We shall come back to this issue later in this report. + \erratum{Page~33} The line after (1.73): The best-fit \emph{log} likelihood~$p\left(\mathcal{D}\middle|\mathbf{w}_{\text{ML}}\right)$ From 1f5eb61420516e7cc5fbfff010d0d339e5a06338 Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Sat, 28 Apr 2018 00:49:28 +0900 Subject: [PATCH 05/70] Edit: A Bayesian model that exhibits overfitting --- prml_errata.tex | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/prml_errata.tex b/prml_errata.tex index b18b6cb..cb342fc 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -1994,8 +1994,8 @@ \subsubsection*{#1} whereas the precision~$\beta$ of $t$ is very large, leading to insufficient \emph{regularization} (see Section~3.1.4). Then, the posterior~(3.49) over $\mathbf{w}$ given the data set~$\bm{\mathsf{t}}$ will be -sharply peaked around the maximum likelihood estimate~$\mathbf{w}_{\text{ML}}$ given by (3.15); -and the predictive~(3.58) over $t$ given $\bm{\mathsf{t}}$ will be +sharply peaked around the maximum likelihood estimate~$\mathbf{w}_{\text{ML}}$ given by (3.15) +so that the predictive distribution~(3.58) over $t$ given $\bm{\mathsf{t}}$ will be well approximated by the likelihood~(3.8) conditioned on $\mathbf{w}_{\text{ML}}$ and also sharply peaked around the regression function~(3.3). Stated differently, learning the thus assumed model reduces to the least squares method, From 80f36faaab8ac6753344cd99a1924c60faee382f Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Sun, 29 Apr 2018 23:11:51 +0900 Subject: [PATCH 06/70] Edit erratum on Page 33, Paragraph 3 --- prml_errata.tex | 14 ++++++++++---- 1 file changed, 10 insertions(+), 4 deletions(-) diff --git a/prml_errata.tex b/prml_errata.tex index cb342fc..c07bc83 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -269,10 +269,16 @@ \subsubsection*{#1} \erratum{Page~33} Paragraph~3: -Note that -the \emph{Akaike information criterion} (AIC) and -the Schwartz's \emph{Bayesian information criterion} (BIC) -have different goals and thus are different criteria for model selection. +Note that the two information criteria for model selection mentioned here, namely, +(i)~the \emph{Akaike information criterion} (AIC) and +(ii)~Schwartz's \emph{Bayesian information criterion} (BIC; see Section~4.4.1) +are different criteria with different goals, i.e., +(i)~to make better prediction given the data (an ability called \emph{generalization}) and +(ii)~to guess the ``true'' model that has generated the data +(in terms of the \emph{marginal likelihood}; see Section~3.4), respectively.\footnote{% +Since we cannot tell which goal (generalization or to guess the ``true'' model) +is more ``Bayesian'' than the other, +the term ``Bayesian information criteria'' is a misnomer.} However, the difference is not well-recognized in PRML. We shall come back to this issue later in this report. 
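For reference, the two criteria contrasted in the erratum above take concrete forms in PRML; the following is a sketch based on (1.73) and the Laplace-approximation result of Section~4.4.1, written so that a larger value is better, with $M$ the number of adjustable parameters and $N$ the number of data points (the exact constants should be checked against the book):
\begin{align}
\text{AIC} &:\quad \ln p\left(\mathcal{D}\middle|\mathbf{w}_{\text{ML}}\right) - M , \\
\text{BIC} &:\quad \ln p\left(\mathcal{D}\middle|\mathbf{w}_{\text{MAP}}\right) - \frac{1}{2}\, M \ln N .
\end{align}
The AIC penalty is independent of $N$, reflecting goal~(i) of predictive performance, whereas the BIC penalty grows as $\ln N$ because BIC approximates the log marginal likelihood, reflecting goal~(ii) of identifying the model that generated the data.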
From 65b9973bb2d2b5e5c87d919704d5eb1e6e2b0480 Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Sat, 5 May 2018 17:19:51 +0900 Subject: [PATCH 07/70] Edit erratum on Page 147, Paragraph -2 --- prml_errata.tex | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/prml_errata.tex b/prml_errata.tex index 23aa13d..f0e0620 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -2013,11 +2013,12 @@ \subsubsection*{#1} that would make terribly wrong predictions very confidently; this is true even when we take a ``fully Bayesian'' approach (i.e., \emph{not} maximum likelihood, MAP, or whatever). -In the following, we show such a Bayesian model that exhibits overfitting, + +In the following, we first show such a Bayesian model that exhibits overfitting, after which we also discuss the difference between the two criteria for model selection, namely, (i)~the \emph{generalization error} (see Section~3.2) and -(ii)~the \emph{marginal likelihood} (or the \emph{model evidence}; see Section~3.4), -which is not well-recognized in PRML. +(ii)~the \emph{marginal likelihood} (or the \emph{model evidence}; see Section~3.4) +because it seems that the difference is, unfortunately, not well-recognized in PRML. \parhead{A Bayesian model that exhibits overfitting} Let us take a Bayesian linear regression model of Section~3.3 as an example and From c538a029eb9f064aa8dd661a8b86e3e3f07612e0 Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Sat, 5 May 2018 23:05:14 +0900 Subject: [PATCH 08/70] Edit erratum on Page 147, Paragraph -2 --- prml_errata.tex | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/prml_errata.tex b/prml_errata.tex index f0e0620..911546f 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -2015,10 +2015,12 @@ \subsubsection*{#1} (i.e., \emph{not} maximum likelihood, MAP, or whatever). In the following, we first show such a Bayesian model that exhibits overfitting, -after which we also discuss the difference between the two criteria for model selection, namely, +after which we also discuss in some detail +the difference between the two criteria for model selection, namely, (i)~the \emph{generalization error} (see Section~3.2) and (ii)~the \emph{marginal likelihood} (or the \emph{model evidence}; see Section~3.4) -because it seems that the difference is, unfortunately, not well-recognized in PRML. +because it seems that the difference between these two criteria to be discussed is, unfortunately, +not well-recognized in PRML. \parhead{A Bayesian model that exhibits overfitting} Let us take a Bayesian linear regression model of Section~3.3 as an example and From d747c7316d89194c29005f1461d142bfb6c6e0e2 Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Sat, 5 May 2018 23:31:38 +0900 Subject: [PATCH 09/70] Edit erratum on Page 147, Paragraph -2 --- prml_errata.tex | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/prml_errata.tex b/prml_errata.tex index 911546f..9693f08 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -2015,12 +2015,11 @@ \subsubsection*{#1} (i.e., \emph{not} maximum likelihood, MAP, or whatever). 
In the following, we first show such a Bayesian model that exhibits overfitting, -after which we also discuss in some detail +after which we discuss in some detail the difference between the two criteria for model selection, namely, (i)~the \emph{generalization error} (see Section~3.2) and (ii)~the \emph{marginal likelihood} (or the \emph{model evidence}; see Section~3.4) -because it seems that the difference between these two criteria to be discussed is, unfortunately, -not well-recognized in PRML. +because that difference is, unfortunately, not well-recognized in PRML. \parhead{A Bayesian model that exhibits overfitting} Let us take a Bayesian linear regression model of Section~3.3 as an example and From 49c28d6f24e476ef04ba7ce9b9f8a5c2164fd5a8 Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Sat, 5 May 2018 23:35:07 +0900 Subject: [PATCH 10/70] Edit: Generalization error vs. marginal likelihood --- prml_errata.tex | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/prml_errata.tex b/prml_errata.tex index 9693f08..fd9c95c 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -2058,9 +2058,12 @@ \subsubsection*{#1} which can be estimated by cross-validation (Section~3.2), and (ii)~the \emph{marginal likelihood}, which is used in the Bayesian model comparison framework (Section~3.4), -are closely related but different criteria for model selection, -although, in practice, a higher marginal likelihood often tends to imply -a lower generalization error and vice versa. +are closely related but different criteria for model selection +(although, in practice, a higher marginal likelihood often tends to imply +a lower generalization error and vice versa). +Generally speaking, one should use: +(i)~the generalization error if they want to make better prediction given the data; and +(ii)~the marginal likelihood if they want to guess the ``true'' model that has generated the data. For more (advanced) discussions, see \citet{Watanabe:WAIC,Watanabe:WBIC}. \erratum{Page~156} From a0af723836a944ab37fcf4eb1bbee6cdd8ad408d Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Sun, 6 May 2018 00:04:52 +0900 Subject: [PATCH 11/70] Edit erratum on Page 147, Paragraph -2 --- prml_errata.tex | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/prml_errata.tex b/prml_errata.tex index fd9c95c..98aeb07 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -2008,7 +2008,7 @@ \subsubsection*{#1} does not arise when we marginalize over parameters in a Bayesian setting'' is simply an overstatement. Bayesian methods, like any other machine learning methods, can overfit -because the \emph{true} model from which the data set has been generated is unknown in general +because the ``true'' model from which the data set has been generated is unknown in general so that one could possibly assume an inappropriate (too expressive) model that would make terribly wrong predictions very confidently; this is true even when we take a ``fully Bayesian'' approach @@ -2043,7 +2043,7 @@ \subsubsection*{#1} (e.g., the hyperpriors are tuned to wrong values), we shall again end up with a wrong posterior and a wrong predictive. -The point here is that, since we do not know the true model (if any), +The point here is that, since we do not know the ``true'' model (if any), we cannot know if the assumed model is sensible in advance (i.e., without any knowledge about data to be generated). 
We can however assess, given a data set, whether a model is better than another From 3148313168f235b54ef738325b579b5f1de9d17e Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Sun, 6 May 2018 00:05:51 +0900 Subject: [PATCH 12/70] Edit: Generalization error vs. marginal likelihood --- prml_errata.tex | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/prml_errata.tex b/prml_errata.tex index 98aeb07..c4f4cab 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -2053,14 +2053,15 @@ \subsubsection*{#1} see the discussion around (3.73). \parhead{Generalization error vs.\ marginal likelihood} -Moreover, one should also be aware of a subtlety that +Moreover, one should also be aware of a subtlety here that (i)~the \emph{generalization error}, -which can be estimated by cross-validation (Section~3.2), and +which can be estimated by cross-validation (see Section~3.2), and (ii)~the \emph{marginal likelihood}, -which is used in the Bayesian model comparison framework (Section~3.4), +which is employed in the Bayesian model comparison framework of Section~3.4, are closely related but different criteria for model selection (although, in practice, a higher marginal likelihood often tends to imply a lower generalization error and vice versa). + Generally speaking, one should use: (i)~the generalization error if they want to make better prediction given the data; and (ii)~the marginal likelihood if they want to guess the ``true'' model that has generated the data. From 72496129bf5a1f25504a0d35a8fed7f9d9933122 Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Tue, 8 May 2018 00:02:50 +0900 Subject: [PATCH 13/70] Edit erratum on Page 33, Paragraph 3 --- prml_errata.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/prml_errata.tex b/prml_errata.tex index 72be055..a6943a0 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -307,7 +307,7 @@ \subsubsection*{#1} Since we cannot tell which goal (generalization or to guess the ``true'' model) is more ``Bayesian'' than the other, the term ``Bayesian information criteria'' is a misnomer.} -However, the difference is not well-recognized in PRML. +However, it seems that the difference is, unfortunately, not well-recognized in PRML. We shall come back to this issue later in this report. \erratum{Page~33} From 4cbbb17eeb2e2fade7bb07a9d5fbecb706412325 Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Tue, 8 May 2018 00:18:39 +0900 Subject: [PATCH 14/70] Edit: Generalization error vs. marginal likelihood --- prml_errata.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/prml_errata.tex b/prml_errata.tex index a6943a0..cee5e11 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -2062,7 +2062,7 @@ \subsubsection*{#1} (although, in practice, a higher marginal likelihood often tends to imply a lower generalization error and vice versa). -Generally speaking, one should use: +Generally speaking, one should adopt: (i)~the generalization error if they want to make better prediction given the data; and (ii)~the marginal likelihood if they want to guess the ``true'' model that has generated the data. For more (advanced) discussions, see \citet{Watanabe:WAIC,Watanabe:WBIC}. 
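To see how the two criteria can behave on a concrete model, here is a minimal numpy sketch; the toy sinusoidal data set, the fixed values of $\alpha$ and $\beta$, and all variable names are assumptions made only for illustration. It scores Bayesian polynomial regression models of increasing order by (i)~a leave-one-out cross-validation estimate of the generalization error and (ii)~the log marginal likelihood in the form of (3.86).
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumed setup): noisy samples of sin(2*pi*x), as in Chapter 1.
N = 15
x = rng.uniform(0.0, 1.0, size=N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=N)

alpha, beta = 5e-3, 25.0  # prior precision and noise precision (fixed here)

def design(x, M):
    # Polynomial design matrix with columns x^0, ..., x^M.
    return np.vander(x, M + 1, increasing=True)

def log_evidence(x, t, M):
    # Log marginal likelihood of Bayesian linear regression, cf. (3.86);
    # the model has M + 1 basis functions for a polynomial of order M.
    Phi = design(x, M)
    A = alpha * np.eye(M + 1) + beta * Phi.T @ Phi
    m_N = beta * np.linalg.solve(A, Phi.T @ t)
    E = 0.5 * beta * np.sum((t - Phi @ m_N) ** 2) + 0.5 * alpha * m_N @ m_N
    return (0.5 * (M + 1) * np.log(alpha) + 0.5 * len(t) * np.log(beta)
            - E - 0.5 * np.linalg.slogdet(A)[1]
            - 0.5 * len(t) * np.log(2.0 * np.pi))

def loo_error(x, t, M):
    # Leave-one-out estimate of the generalization error (squared loss),
    # using the posterior mean as the point prediction.
    errs = []
    for i in range(len(x)):
        keep = np.arange(len(x)) != i
        Phi = design(x[keep], M)
        A = alpha * np.eye(M + 1) + beta * Phi.T @ Phi
        m_N = beta * np.linalg.solve(A, Phi.T @ t[keep])
        errs.append((design(x[i:i + 1], M) @ m_N - t[i]) ** 2)
    return float(np.mean(errs))

for M in range(10):
    print(M, round(log_evidence(x, t, M), 2), round(loo_error(x, t, M), 4))
\end{verbatim}
In runs of this kind the two criteria usually favor similar mid-range orders, but they need not rank the candidate models identically, which is the distinction drawn above.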
From b517c6754eb572acef948fab657c49f5cbd0acf4 Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Wed, 9 May 2018 23:55:45 +0900 Subject: [PATCH 15/70] Edit erratum on Page 147, Paragraph -2 --- prml_errata.tex | 11 +++++------ 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/prml_errata.tex b/prml_errata.tex index 976a07a..6ec4b43 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -2016,12 +2016,11 @@ \subsubsection*{#1} this is true even when we take a ``fully Bayesian'' approach (i.e., \emph{not} maximum likelihood, MAP, or whatever). -In the following, we first show such a Bayesian model that exhibits overfitting, -after which we discuss in some detail -the difference between the two criteria for model selection, namely, -(i)~the \emph{generalization error} (see Section~3.2) and -(ii)~the \emph{marginal likelihood} (or the \emph{model evidence}; see Section~3.4) -because that difference is, unfortunately, not well-recognized in PRML. +In the following, we first show that there exists such a Bayesian model that exhibits overfitting, +after which we discuss in some detail the difference between +the two criteria for model selection concerned in Sections~3.2 and 3.4, namely, +(i)~the \emph{generalization error} and +(ii)~the \emph{marginal likelihood} (or the \emph{model evidence}), respectively. \parhead{A Bayesian model that exhibits overfitting} Let us take a Bayesian linear regression model of Section~3.3 as an example and From a1d03eca949e16269bcdbd30f918469db3372276 Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Thu, 10 May 2018 00:15:11 +0900 Subject: [PATCH 16/70] Edit: A Bayesian model that exhibits overfitting --- prml_errata.tex | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/prml_errata.tex b/prml_errata.tex index 6ec4b43..dec6404 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -2036,7 +2036,9 @@ \subsubsection*{#1} well approximated by the likelihood~(3.8) conditioned on $\mathbf{w}_{\text{ML}}$ and also sharply peaked around the regression function~(3.3). Stated differently, learning the thus assumed model reduces to the least squares method, -which is known to suffer from overfitting (see Section~1.1). +which is known to suffer from overfitting, i.e., +the \emph{generalization error} becomes large if the model is too expressive +(see Section~1.1). Of course, one can extend the model by incorporating hyperpriors over $\alpha$ and $\beta$, thus introducing more Bayesian averaging. @@ -2048,9 +2050,9 @@ \subsubsection*{#1} we cannot know if the assumed model is sensible in advance (i.e., without any knowledge about data to be generated). We can however assess, given a data set, whether a model is better than another -by, say, \emph{Bayesian model comparison} (see Section~3.4), -though a caveat is that we still need some (implicit) assumptions for the framework of -Bayesian model comparison to work; +in terms of, say, the \emph{marginal likelihood} (see Section~3.4), +though a caveat is that we still need some (implicit) assumptions for +the Bayesian model comparison framework to work; see the discussion around (3.73). \parhead{Generalization error vs.\ marginal likelihood} From d92749503823a375d16074f145f0dcb5c009590f Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Thu, 10 May 2018 00:27:06 +0900 Subject: [PATCH 17/70] Edit: Generalization error vs. 
marginal likelihood --- prml_errata.tex | 12 +++++------- 1 file changed, 5 insertions(+), 7 deletions(-) diff --git a/prml_errata.tex b/prml_errata.tex index dec6404..e9d2401 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -2057,13 +2057,11 @@ \subsubsection*{#1} \parhead{Generalization error vs.\ marginal likelihood} Moreover, one should also be aware of a subtlety here that -(i)~the \emph{generalization error}, -which can be estimated by cross-validation (see Section~3.2), and -(ii)~the \emph{marginal likelihood}, -which is employed in the Bayesian model comparison framework of Section~3.4, -are closely related but different criteria for model selection -(although, in practice, a higher marginal likelihood often tends to imply -a lower generalization error and vice versa). +(i)~the \emph{generalization error} and +(ii)~the \emph{marginal likelihood} +are closely related but different criteria for model selection, +although, in practice, a higher marginal likelihood often tends to imply +a lower generalization error and vice versa. Generally speaking, one should adopt: (i)~the generalization error if they want to make better prediction given the data; and From 918dbb70ca59660557f5f6bdbd1ac9d4be48d977 Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Fri, 11 May 2018 23:00:57 +0900 Subject: [PATCH 18/70] Edit erratum on Page 147, Paragraph -2 --- prml_errata.tex | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/prml_errata.tex b/prml_errata.tex index e9d2401..ff0708b 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -2011,8 +2011,8 @@ \subsubsection*{#1} is simply an overstatement. Bayesian methods, like any other machine learning methods, can overfit because the ``true'' model from which the data set has been generated is unknown in general -so that one could possibly assume an inappropriate (too expressive) model -that would make terribly wrong predictions very confidently; +so that one could possibly assume an inappropriate model, +e.g., too expressive an model that would make terribly wrong predictions very confidently; this is true even when we take a ``fully Bayesian'' approach (i.e., \emph{not} maximum likelihood, MAP, or whatever). From fea7df980954e2295afac8e87d95a20897961383 Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Fri, 11 May 2018 23:29:21 +0900 Subject: [PATCH 19/70] Edit: A Bayesian model that exhibits overfitting --- prml_errata.tex | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/prml_errata.tex b/prml_errata.tex index ff0708b..6d5d6d2 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -2028,7 +2028,7 @@ \subsubsection*{#1} Suppose that the prior~(3.52) over the parameters~$\mathbf{w}$ is broad whereas the conditional~(3.8) over the target~$t$ given $\mathbf{w}$ is narrow, i.e., the precision~$\alpha$ of $\mathbf{w}$ is very small -whereas the precision~$\beta$ of $t$ is very large, +whereas the precision~$\beta$ of $t$ given $\mathbf{w}$ is very large, leading to insufficient \emph{regularization} (see Section~3.1.4). Then, the posterior~(3.49) over $\mathbf{w}$ given the data set~$\bm{\mathsf{t}}$ will be sharply peaked around the maximum likelihood estimate~$\mathbf{w}_{\text{ML}}$ given by (3.15) @@ -2037,8 +2037,9 @@ \subsubsection*{#1} and also sharply peaked around the regression function~(3.3). 
Stated differently, learning the thus assumed model reduces to the least squares method, which is known to suffer from overfitting, i.e., -the \emph{generalization error} becomes large if the model is too expressive -(see Section~1.1). +the \emph{generalization error} can be very large if the regression function is too expressive, +e.g., the order~$M$ of the polynomial function~(1.1) is large compared to +the size of the training data set as we have seen in Section~1.1. Of course, one can extend the model by incorporating hyperpriors over $\alpha$ and $\beta$, thus introducing more Bayesian averaging. From ac118114c05fe73370b4cf35afb90710d29cddb2 Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Fri, 11 May 2018 23:36:48 +0900 Subject: [PATCH 20/70] Edit: A Bayesian model that exhibits overfitting --- prml_errata.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/prml_errata.tex b/prml_errata.tex index 6d5d6d2..669dcb7 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -2051,7 +2051,7 @@ \subsubsection*{#1} we cannot know if the assumed model is sensible in advance (i.e., without any knowledge about data to be generated). We can however assess, given a data set, whether a model is better than another -in terms of, say, the \emph{marginal likelihood} (see Section~3.4), +in terms of, say, the \emph{marginal likelihood} as discussed in Section~3.4, though a caveat is that we still need some (implicit) assumptions for the Bayesian model comparison framework to work; see the discussion around (3.73). From b03030044aa95873274ede09994b4cb71ed4914a Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Fri, 11 May 2018 23:38:49 +0900 Subject: [PATCH 21/70] Edit: A Bayesian model that exhibits overfitting --- prml_errata.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/prml_errata.tex b/prml_errata.tex index 669dcb7..4c342f3 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -2053,7 +2053,7 @@ \subsubsection*{#1} We can however assess, given a data set, whether a model is better than another in terms of, say, the \emph{marginal likelihood} as discussed in Section~3.4, though a caveat is that we still need some (implicit) assumptions for -the Bayesian model comparison framework to work; +this Bayesian model comparison framework to work; see the discussion around (3.73). \parhead{Generalization error vs.\ marginal likelihood} From a5b6ade8a8ebe516d3253cf7043a02a72ded1503 Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Fri, 11 May 2018 23:45:18 +0900 Subject: [PATCH 22/70] Edit: A Bayesian model that exhibits overfitting --- prml_errata.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/prml_errata.tex b/prml_errata.tex index 4c342f3..e8e4a81 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -2038,7 +2038,7 @@ \subsubsection*{#1} Stated differently, learning the thus assumed model reduces to the least squares method, which is known to suffer from overfitting, i.e., the \emph{generalization error} can be very large if the regression function is too expressive, -e.g., the order~$M$ of the polynomial function~(1.1) is large compared to +e.g., the order~$M$ of the polynomial regression function~(1.1) is large compared to the size of the training data set as we have seen in Section~1.1. 
Of course, one can extend the model by incorporating hyperpriors over $\alpha$ and $\beta$, From 008c509ad8cb625f5bbb0afed959ea5735ae71bc Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Fri, 11 May 2018 23:48:50 +0900 Subject: [PATCH 23/70] Edit: A Bayesian model that exhibits overfitting --- prml_errata.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/prml_errata.tex b/prml_errata.tex index e8e4a81..09d1084 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -2035,7 +2035,7 @@ \subsubsection*{#1} so that the predictive distribution~(3.58) over $t$ given $\bm{\mathsf{t}}$ will be well approximated by the likelihood~(3.8) conditioned on $\mathbf{w}_{\text{ML}}$ and also sharply peaked around the regression function~(3.3). -Stated differently, learning the thus assumed model reduces to the least squares method, +Stated differently, learning thus assumed a Bayesian model reduces to the least squares method, which is known to suffer from overfitting, i.e., the \emph{generalization error} can be very large if the regression function is too expressive, e.g., the order~$M$ of the polynomial regression function~(1.1) is large compared to From 462d223cec235e1505e98ae238e3e5eec9f8e7e0 Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Sat, 12 May 2018 06:58:44 +0900 Subject: [PATCH 24/70] Edit erratum on Page 147, Paragraph -2 --- prml_errata.tex | 1 + 1 file changed, 1 insertion(+) diff --git a/prml_errata.tex b/prml_errata.tex index 09d1084..a16b694 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -2007,6 +2007,7 @@ \subsubsection*{#1} The argument that ``the phenomenon of [overfitting\footnote{% In this report, we use the term ``overfitting'' without hyphenation (i.e., instead of ``over-fitting'' as in PRML).}] +\dots does not arise when we marginalize over parameters in a Bayesian setting'' is simply an overstatement. Bayesian methods, like any other machine learning methods, can overfit From 2a1ac3c2799de43dccda081c04c502ff4c8b6dc1 Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Sat, 12 May 2018 07:03:10 +0900 Subject: [PATCH 25/70] Edit erratum on Page 147, Paragraph -2 --- prml_errata.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/prml_errata.tex b/prml_errata.tex index a16b694..6f3bf94 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -2013,7 +2013,7 @@ \subsubsection*{#1} Bayesian methods, like any other machine learning methods, can overfit because the ``true'' model from which the data set has been generated is unknown in general so that one could possibly assume an inappropriate model, -e.g., too expressive an model that would make terribly wrong predictions very confidently; +say, too expressive an model that would make terribly wrong predictions very confidently; this is true even when we take a ``fully Bayesian'' approach (i.e., \emph{not} maximum likelihood, MAP, or whatever). From d65f07bd5545b613b6b0859380d0a3f9fcd0fe85 Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Sat, 12 May 2018 07:05:02 +0900 Subject: [PATCH 26/70] Edit: A Bayesian model that exhibits overfitting --- prml_errata.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/prml_errata.tex b/prml_errata.tex index 6f3bf94..7804a21 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -2037,7 +2037,7 @@ \subsubsection*{#1} well approximated by the likelihood~(3.8) conditioned on $\mathbf{w}_{\text{ML}}$ and also sharply peaked around the regression function~(3.3). 
Stated differently, learning thus assumed a Bayesian model reduces to the least squares method, -which is known to suffer from overfitting, i.e., +which is known to suffer from overfitting so that the \emph{generalization error} can be very large if the regression function is too expressive, e.g., the order~$M$ of the polynomial regression function~(1.1) is large compared to the size of the training data set as we have seen in Section~1.1. From 22cdbaabd2e2c0ade4f11d31d27751ae3080e9c4 Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Sat, 12 May 2018 07:06:37 +0900 Subject: [PATCH 27/70] Edit erratum on Page 147, Paragraph -2 --- prml_errata.tex | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/prml_errata.tex b/prml_errata.tex index 7804a21..b96126b 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -2004,9 +2004,9 @@ \subsubsection*{#1} \erratum{Page~147} Paragraph~\textminus2: -The argument that ``the phenomenon of [overfitting\footnote{% +The argument that ``the phenomenon of [overfitting]\footnote{% In this report, we use the term ``overfitting'' without hyphenation -(i.e., instead of ``over-fitting'' as in PRML).}] +(i.e., instead of ``over-fitting'' as in PRML).} \dots does not arise when we marginalize over parameters in a Bayesian setting'' is simply an overstatement. From bcb88813854b5919b34581a1102122439f3d927a Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Mon, 14 May 2018 23:01:27 +0900 Subject: [PATCH 28/70] Edit: A Bayesian model that exhibits overfitting --- prml_errata.tex | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/prml_errata.tex b/prml_errata.tex index b96126b..c70487c 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -162,6 +162,7 @@ \subsubsection*{#1} Specifically, (1.1) should read \begin{equation} y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M = \sum_{j=0}^{M} w_j x^j . +\label{eq:polynomial_regression_function} \end{equation} \erratum{Page~10} @@ -2039,8 +2040,8 @@ \subsubsection*{#1} Stated differently, learning thus assumed a Bayesian model reduces to the least squares method, which is known to suffer from overfitting so that the \emph{generalization error} can be very large if the regression function is too expressive, -e.g., the order~$M$ of the polynomial regression function~(1.1) is large compared to -the size of the training data set as we have seen in Section~1.1. +e.g., the order~$M$ of the polynomial regression function~\eqref{eq:polynomial_regression_function} +is large compared to the size of the training data set as we have seen in Section~1.1. Of course, one can extend the model by incorporating hyperpriors over $\alpha$ and $\beta$, thus introducing more Bayesian averaging. 
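As a quick numerical illustration of the mechanism described in this erratum, the following sketch (assumed toy data, an assumed random seed, and deliberately extreme values of $\alpha$ and $\beta$) computes the posterior mean, cf.~(3.53) and (3.54), and the predictive variance, cf.~(3.59), for an order-9 polynomial model, and compares the resulting fit with the maximum likelihood solution.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy setting: an expressive (order-9) polynomial model, a broad
# prior (alpha very small) and a narrow conditional (beta very large).
N, M = 10, 9
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.15, size=N)
alpha, beta = 1e-12, 1e6

Phi = np.vander(x, M + 1, increasing=True)

# Posterior over w, cf. (3.53)-(3.54): S_N^{-1} = alpha*I + beta*Phi^T Phi.
S_N_inv = alpha * np.eye(M + 1) + beta * Phi.T @ Phi
m_N = beta * np.linalg.solve(S_N_inv, Phi.T @ t)

# Maximum likelihood (least squares) solution, cf. (3.15).
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
print("max |difference| between the two fits at the training inputs:",
      np.max(np.abs(Phi @ (m_N - w_ml))))

# Predictive mean and variance at a new input, cf. (3.58)-(3.59).
x_new = 0.95
phi = np.vander(np.array([x_new]), M + 1, increasing=True)[0]
mean = phi @ m_N
var = 1.0 / beta + phi @ np.linalg.solve(S_N_inv, phi)
print("predictive: %.3f +/- %.3g, truth: %.3f"
      % (mean, np.sqrt(var), np.sin(2 * np.pi * x_new)))
\end{verbatim}
With settings like these the posterior-mean fit is essentially indistinguishable from the least squares fit at the training inputs (up to numerical conditioning), and the reported predictive standard deviation is typically orders of magnitude smaller than the actual error at the new input, which is the overconfident behavior described above; exact figures depend on the random seed.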
From f4b2082a7361f5680553d6690a15f9a33500bbf9 Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Mon, 14 May 2018 23:11:22 +0900 Subject: [PATCH 29/70] Edit erratum on Page 147, Paragraph -2 --- prml_errata.tex | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/prml_errata.tex b/prml_errata.tex index c70487c..69908e0 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -2020,7 +2020,8 @@ \subsubsection*{#1} In the following, we first show that there exists such a Bayesian model that exhibits overfitting, after which we discuss in some detail the difference between -the two criteria for model selection concerned in Sections~3.2 and 3.4, namely, +the two criteria for model selection (or model comparison) +concerned in Sections~3.2 and 3.4, namely, (i)~the \emph{generalization error} and (ii)~the \emph{marginal likelihood} (or the \emph{model evidence}), respectively. From 535030b85eb1dd58c4602ae5eee99b6f7706f6d4 Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Mon, 14 May 2018 23:19:06 +0900 Subject: [PATCH 30/70] Edit: Generalization error vs. marginal likelihood --- prml_errata.tex | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/prml_errata.tex b/prml_errata.tex index 69908e0..614e744 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -2061,13 +2061,12 @@ \subsubsection*{#1} \parhead{Generalization error vs.\ marginal likelihood} Moreover, one should also be aware of a subtlety here that -(i)~the \emph{generalization error} and -(ii)~the \emph{marginal likelihood} +(i)~the generalization error and (ii)~the marginal likelihood are closely related but different criteria for model selection, although, in practice, a higher marginal likelihood often tends to imply a lower generalization error and vice versa. -Generally speaking, one should adopt: +As a general rule, one should adopt: (i)~the generalization error if they want to make better prediction given the data; and (ii)~the marginal likelihood if they want to guess the ``true'' model that has generated the data. For more (advanced) discussions, see \citet{Watanabe:WAIC,Watanabe:WBIC}. From b17b9a1d864dc57fd56412e85ae42342e0f996d7 Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Tue, 15 May 2018 23:48:28 +0900 Subject: [PATCH 31/70] Edit: Generalization error vs. marginal likelihood --- prml_errata.tex | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/prml_errata.tex b/prml_errata.tex index 614e744..f2621aa 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -2061,7 +2061,7 @@ \subsubsection*{#1} \parhead{Generalization error vs.\ marginal likelihood} Moreover, one should also be aware of a subtlety here that -(i)~the generalization error and (ii)~the marginal likelihood +(i)~the \emph{generalization error} and (ii)~the \emph{marginal likelihood} are closely related but different criteria for model selection, although, in practice, a higher marginal likelihood often tends to imply a lower generalization error and vice versa. @@ -2069,7 +2069,13 @@ \subsubsection*{#1} As a general rule, one should adopt: (i)~the generalization error if they want to make better prediction given the data; and (ii)~the marginal likelihood if they want to guess the ``true'' model that has generated the data. -For more (advanced) discussions, see \citet{Watanabe:WAIC,Watanabe:WBIC}. + +Of course, nothing prevents us from examining the behavior of both the two criteria, if possible, +for assessing the model concerned. 
+If both the criteria indicate that a model is favorable or not, +then it is probably safe to say that the model is sensible or not, respectively. +At least it is hoped that, since the two criteria are different, +we can gain more information from both of them than from one of them. \erratum{Page~156} Equation~(3.57): From 13f646005b248af07110b5798a031f30e1746d97 Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Wed, 16 May 2018 06:31:04 +0900 Subject: [PATCH 32/70] Edit: Generalization error vs. marginal likelihood --- prml_errata.tex | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/prml_errata.tex b/prml_errata.tex index f2621aa..c27084b 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -2060,22 +2060,22 @@ \subsubsection*{#1} see the discussion around (3.73). \parhead{Generalization error vs.\ marginal likelihood} -Moreover, one should also be aware of a subtlety here that +Moreover, we should also be aware of a subtlety here that (i)~the \emph{generalization error} and (ii)~the \emph{marginal likelihood} -are closely related but different criteria for model selection, -although, in practice, a higher marginal likelihood often tends to imply -a lower generalization error and vice versa. +are closely related but different criteria for model selection +(although, in practice, a higher marginal likelihood often tends to imply +a lower generalization error and vice versa). As a general rule, one should adopt: (i)~the generalization error if they want to make better prediction given the data; and (ii)~the marginal likelihood if they want to guess the ``true'' model that has generated the data. -Of course, nothing prevents us from examining the behavior of both the two criteria, if possible, -for assessing the model concerned. -If both the criteria indicate that a model is favorable or not, -then it is probably safe to say that the model is sensible or not, respectively. -At least it is hoped that, since the two criteria are different, +Of course, nothing prevents us from examining the behavior of \emph{both} the two criteria, +if possible, for assessing the model concerned; +it is even worth the effort to do so because, since the two criteria are different, we can gain more information from both of them than from one of them. +For example, if both the criteria indicate that a model is favorable or not, +then it is probably safe to say that the model is sensible or not, respectively. \erratum{Page~156} Equation~(3.57): From f9646f7118ecab3a66290b27fd3298ab37e383f2 Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Wed, 16 May 2018 22:43:11 +0900 Subject: [PATCH 33/70] Edit: Generalization error vs. marginal likelihood --- prml_errata.tex | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/prml_errata.tex b/prml_errata.tex index c27084b..483a88d 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -2068,14 +2068,15 @@ \subsubsection*{#1} As a general rule, one should adopt: (i)~the generalization error if they want to make better prediction given the data; and -(ii)~the marginal likelihood if they want to guess the ``true'' model that has generated the data. +(ii)~the marginal likelihood if they want to guess the ``true'' model that has generated the data +(see below). 
Of course, nothing prevents us from examining the behavior of \emph{both} the two criteria, -if possible, for assessing the model concerned; +if possible, to assess the model concerned; it is even worth the effort to do so because, since the two criteria are different, we can gain more information from both of them than from one of them. -For example, if both the criteria indicate that a model is favorable or not, -then it is probably safe to say that the model is sensible or not, respectively. +If we have learned that both the criteria do or do not prefer a model, +then we would be more confident that the model is sensible or not, respectively. \erratum{Page~156} Equation~(3.57): From bf8a0dbdd8bdcf49432ceb15a0dce676274bb5f7 Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Wed, 16 May 2018 23:12:38 +0900 Subject: [PATCH 34/70] Edit: Generalization error vs. marginal likelihood --- prml_errata.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/prml_errata.tex b/prml_errata.tex index 483a88d..c5c493b 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -2069,7 +2069,7 @@ \subsubsection*{#1} As a general rule, one should adopt: (i)~the generalization error if they want to make better prediction given the data; and (ii)~the marginal likelihood if they want to guess the ``true'' model that has generated the data -(see below). +(see also the following discussion). Of course, nothing prevents us from examining the behavior of \emph{both} the two criteria, if possible, to assess the model concerned; From 70660dc8282c729d0b9ccebfa7b2107efd4e681c Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Fri, 18 May 2018 23:17:53 +0900 Subject: [PATCH 35/70] Add: More on generalization error vs. marginal likelihood --- prml_errata.tex | 15 +++++++++++++++ 1 file changed, 15 insertions(+) diff --git a/prml_errata.tex b/prml_errata.tex index c5c493b..02c03f6 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -2078,6 +2078,21 @@ \subsubsection*{#1} If we have learned that both the criteria do or do not prefer a model, then we would be more confident that the model is sensible or not, respectively. +\parhead{More on generalization error vs.\ marginal likelihood +(or generalization loss vs.\ Bayes free energy)} +In what follows, I would like to further elaborate on +the difference between the two criteria for model selection. +In order to facilitate the discussion, +we first introduce some terminology due to \citet{Watanabe:BayesStatistics}. +Throughout the discussion, +special care must be taken about the distribution with which we take expectation +because the ``true'' distribution is unknown in general +and, therefore, we must assume some model (i.e., hypothetical) distributions +so that confusion can easily arise about which distribution is concerned. +To avoid such confusion, we also introduce some notation. +The terminology and the notation to be introduced here are somewhat different from +those of PRML or other part of this report. + \erratum{Page~156} Equation~(3.57): The new input vector~$\mathbf{x}$ is omitted in (3.57) as in, e.g., (3.74). From 10df9ed2bf67fcc5af9b4429ddf4ee834a5de7ec Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Sat, 19 May 2018 00:17:13 +0900 Subject: [PATCH 36/70] Edit: More on generalization error vs. 
marginal likelihood --- prml_errata.tex | 15 +++++++++++++++ 1 file changed, 15 insertions(+) diff --git a/prml_errata.tex b/prml_errata.tex index 02c03f6..e020386 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -2093,6 +2093,21 @@ \subsubsection*{#1} The terminology and the notation to be introduced here are somewhat different from those of PRML or other part of this report. +First of all, we should point out that +so far we have used the term \emph{generalization error} somewhat loosely; +it is used primarily in the context of a frequentist inference +(such as the one in Section~3.2; see also Section~1.5.5) +and is appeared to be defined as +the expected loss evaluated for the predicted target values (i.e., point estimates) +with some arbitrary loss function given (e.g., the squared error for regression). +Motivated from a Bayesian point of view, we can define it differently +so as to be a better criterion for assessing a model's predictive ability, i.e., +as the expected negative log predictive distribution (see below for a precise definition), +which, to avoid ambiguity, we hereafter call the \emph{generalization loss}. +We shall also define another criterion called the \emph{Bayes free energy}, +which is nothing but the negative log \emph{marginal likelihood}. +The Bayes free energy is better compared with the generalization loss as we shall see shortly. + \erratum{Page~156} Equation~(3.57): The new input vector~$\mathbf{x}$ is omitted in (3.57) as in, e.g., (3.74). From 8fc7ae4a04ced0e51ec5aecbf48046b64c95b820 Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Sat, 19 May 2018 22:23:30 +0900 Subject: [PATCH 37/70] Edit: More on generalization error vs. marginal likelihood --- prml_errata.tex | 14 +++++++++----- 1 file changed, 9 insertions(+), 5 deletions(-) diff --git a/prml_errata.tex b/prml_errata.tex index e020386..9574e1b 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -2098,12 +2098,16 @@ \subsubsection*{#1} it is used primarily in the context of a frequentist inference (such as the one in Section~3.2; see also Section~1.5.5) and is appeared to be defined as -the expected loss evaluated for the predicted target values (i.e., point estimates) +the expected loss evaluated for the predicted target value (i.e., a point estimate) with some arbitrary loss function given (e.g., the squared error for regression). -Motivated from a Bayesian point of view, we can define it differently -so as to be a better criterion for assessing a model's predictive ability, i.e., -as the expected negative log predictive distribution (see below for a precise definition), -which, to avoid ambiguity, we hereafter call the \emph{generalization loss}. +Although this definition can be considered fairly general, +an alternative definition from a Bayesian point of view would be in terms of +the predictive distribution. +Specifically, we can define the generalization error as +the expected negative log predictive distribution (see below for a precise definition); +to avoid ambiguity, we hereafter call this particular criterion for +assessing a model's predictive ability +the \emph{generalization loss}. We shall also define another criterion called the \emph{Bayes free energy}, which is nothing but the negative log \emph{marginal likelihood}. The Bayes free energy is better compared with the generalization loss as we shall see shortly. 
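The standard formulation these two terms refer to, following \citet{Watanabe:BayesStatistics}, can be sketched as follows; the symbols here, $p(\mathbf{x})$ for the ``true'' distribution, $q(\mathbf{x}|\mathbf{X})$ for the model's predictive distribution given the training data~$\mathbf{X}$, and $q(\mathbf{X})$ for its marginal likelihood, are used only as a sketch and may differ in detail from the precise definitions adopted later:
\begin{align}
G &= -\int \mathrm{d}\mathbf{x} \, p(\mathbf{x}) \ln q(\mathbf{x}|\mathbf{X})
&& \text{(generalization loss)} , \\
F &= -\ln q(\mathbf{X})
&& \text{(Bayes free energy)} .
\end{align}
Thus $G$ asks how well the predictive distribution accounts for \emph{future} data drawn from $p(\mathbf{x})$, whereas $F$ asks how well the model accounts for the \emph{observed} data set~$\mathbf{X}$ as a whole; the two need not order candidate models in the same way.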
From 9d472c740b53413dbd76c0490c41671f69cf3a27 Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Sat, 19 May 2018 23:26:25 +0900 Subject: [PATCH 38/70] Edit: More on generalization error vs. marginal likelihood --- prml_errata.tex | 32 ++++++++++++++++++++++++++++++++ 1 file changed, 32 insertions(+) diff --git a/prml_errata.tex b/prml_errata.tex index 9574e1b..071567c 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -2112,6 +2112,38 @@ \subsubsection*{#1} which is nothing but the negative log \emph{marginal likelihood}. The Bayes free energy is better compared with the generalization loss as we shall see shortly. +Let us now introduce some notation. +First, let $\mathbf{X} = \{ \mathbf{x}_1, \dots, \mathbf{x}_N \}$ be the training data +(we use a vector~$\mathbf{x}$ instead of a scalar~$t$ for the target variable here). +They are assumed to be i.i.d.\ and +have been generated from some true distribution $p(\mathbf{x})$ so that +\begin{equation} +p(\mathbf{X}) = p(\mathbf{x}_1, \dots, \mathbf{x}_N) = \prod_{n=1}^{N} p(\mathbf{x}_n). +\end{equation} +Let $\mathcal{M}$ be the assumed model we wish to learn. +As a shorthand, the probability distribution of our assumed model~$\mathcal{M}$ is denoted by +\begin{equation} +q(\cdot) = p(\cdot|\mathcal{M}) +\end{equation} +i.e., the conditioning on $\mathcal{M}$ is implicit for $q(\cdot)$.\footnote{% +We define $p(\cdot|a|b) \equiv p(\cdot|a, b)$ so that +we can write the conditional~$q(\cdot|\cdot)$.} +If there exists a true model~$\mathcal{M}^{\star}$, +then we can write the true distribution as +\begin{equation} +p(\cdot) = p(\cdot|\mathcal{M}^{\star}). +\end{equation} +The model~$\mathcal{M}$ consists of a pair of the likelihood~$q(\mathbf{x}|\mathbf{w})$ and +the prior~$q(\mathbf{w})$ where $\mathbf{w}$ is a set of parameters +(including hyperparameters and so on). +The marginal likelihood of the model~$\mathcal{M}$ is given by +\begin{equation} +q(\mathbf{X}) = \int \mathrm{d}\mathbf{w} \, q(\mathbf{w}) \, q(\mathbf{X}|\mathbf{w}) = +\int \mathrm{d}\mathbf{w} \, q(\mathbf{w}) \prod_{n=1}^{N} q(\mathbf{x}_n|\mathbf{w}). +\end{equation} +Note that $q(\mathbf{X})$ does not factorize in general because we have some unknown +parameters~$\mathbf{w}$ in the model~$\mathcal{M}$. + \erratum{Page~156} Equation~(3.57): The new input vector~$\mathbf{x}$ is omitted in (3.57) as in, e.g., (3.74). From f36095672d7de77272a2a2402536b0de834ca082 Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Mon, 21 May 2018 23:10:45 +0900 Subject: [PATCH 39/70] Edit: A Bayesian model that exhibits overfitting --- prml_errata.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/prml_errata.tex b/prml_errata.tex index 071567c..7a406ac 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -2036,7 +2036,7 @@ \subsubsection*{#1} Then, the posterior~(3.49) over $\mathbf{w}$ given the data set~$\bm{\mathsf{t}}$ will be sharply peaked around the maximum likelihood estimate~$\mathbf{w}_{\text{ML}}$ given by (3.15) so that the predictive distribution~(3.58) over $t$ given $\bm{\mathsf{t}}$ will be -well approximated by the likelihood~(3.8) conditioned on $\mathbf{w}_{\text{ML}}$ +well approximated by the conditional~(3.8) conditioned on $\mathbf{w}_{\text{ML}}$ and also sharply peaked around the regression function~(3.3). 
Stated differently, learning thus assumed a Bayesian model reduces to the least squares method, which is known to suffer from overfitting so that From 88adebefb583a4a84e552d226819d63b94c19f2a Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Tue, 22 May 2018 22:41:44 +0900 Subject: [PATCH 40/70] Edit erratum on Page 33, Paragraph 3 --- prml_errata.tex | 16 +++++++++------- 1 file changed, 9 insertions(+), 7 deletions(-) diff --git a/prml_errata.tex b/prml_errata.tex index 7a406ac..3c1c7df 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -298,18 +298,20 @@ \subsubsection*{#1} \erratum{Page~33} Paragraph~3: -Note that the two information criteria for model selection mentioned here, namely, +The two information criteria for model selection mentioned here, namely, (i)~the \emph{Akaike information criterion} (AIC) and (ii)~Schwartz's \emph{Bayesian information criterion} (BIC; see Section~4.4.1) are different criteria with different goals, i.e., -(i)~to make better prediction given the data (an ability called \emph{generalization}) and -(ii)~to guess the ``true'' model that has generated the data -(in terms of the \emph{marginal likelihood}; see Section~3.4), respectively.\footnote{% -Since we cannot tell which goal (generalization or to guess the ``true'' model) +(i)~to make better prediction given the training data set +(an ability called \emph{generalization}) and +(ii)~to find the ``true'' model from which the data set has been generated +(or to better explain the data set in terms of the \emph{marginal likelihood}; see Section~3.4), +respectively.\footnote{% +Since we cannot tell which goal (generalization or to find the ``true'' model) is more ``Bayesian'' than the other, the term ``Bayesian information criteria'' is a misnomer.} -However, it seems that the difference is, unfortunately, not well-recognized in PRML. -We shall come back to this issue later in this report. +However, the difference is not well-recognized in PRML; +we shall come back to this issue later in this report. \erratum{Page~33} The line after (1.73): From 5f82c0e5eeff386095ac7342fd1bc6b2c6bd50cd Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Tue, 22 May 2018 23:05:24 +0900 Subject: [PATCH 41/70] Edit: Generalization error vs. marginal likelihood --- prml_errata.tex | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/prml_errata.tex b/prml_errata.tex index 3c1c7df..6340b69 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -2069,15 +2069,16 @@ \subsubsection*{#1} a lower generalization error and vice versa). As a general rule, one should adopt: -(i)~the generalization error if they want to make better prediction given the data; and -(ii)~the marginal likelihood if they want to guess the ``true'' model that has generated the data -(see also the following discussion). +(i)~the generalization error +if they want to make better prediction given the data set; and +(ii)~the marginal likelihood +if they want to find the ``true'' model from which the data set has been generated. Of course, nothing prevents us from examining the behavior of \emph{both} the two criteria, if possible, to assess the model concerned; it is even worth the effort to do so because, since the two criteria are different, we can gain more information from both of them than from one of them. -If we have learned that both the criteria do or do not prefer a model, +If, say, we have learned that both the criteria do or do not prefer a model, then we would be more confident that the model is sensible or not, respectively. 
\parhead{More on generalization error vs.\ marginal likelihood From 02bbed32e64805a863821c8a137cfdf2699a4dff Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Tue, 22 May 2018 23:51:01 +0900 Subject: [PATCH 42/70] Edit: More on generalization error vs. marginal likelihood --- prml_errata.tex | 25 +++++++++++++------------ 1 file changed, 13 insertions(+), 12 deletions(-) diff --git a/prml_errata.tex b/prml_errata.tex index 6340b69..af385c9 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -2085,8 +2085,7 @@ \subsubsection*{#1} (or generalization loss vs.\ Bayes free energy)} In what follows, I would like to further elaborate on the difference between the two criteria for model selection. -In order to facilitate the discussion, -we first introduce some terminology due to \citet{Watanabe:BayesStatistics}. +In order to facilitate the discussion, we first introduce some terminology. Throughout the discussion, special care must be taken about the distribution with which we take expectation because the ``true'' distribution is unknown in general @@ -2097,23 +2096,25 @@ \subsubsection*{#1} those of PRML or other part of this report. First of all, we should point out that -so far we have used the term \emph{generalization error} somewhat loosely; +so far we have used the term \emph{generalization error} somewhat loosely +(in this report and also in PRML); it is used primarily in the context of a frequentist inference -(such as the one in Section~3.2; see also Section~1.5.5) -and is appeared to be defined as -the expected loss evaluated for the predicted target value (i.e., a point estimate) +(such as the one in Section~3.2) +and can be generally defined as the \emph{expected loss}~(see Section~1.5.5) +evaluated for the predicted target value (i.e., a point estimate) with some arbitrary loss function given (e.g., the squared error for regression). -Although this definition can be considered fairly general, -an alternative definition from a Bayesian point of view would be in terms of +An alternative definition from a Bayesian point of view would be in terms of the predictive distribution. Specifically, we can define the generalization error as -the expected negative log predictive distribution (see below for a precise definition); +the expected negative log predictive distribution; to avoid ambiguity, we hereafter call this particular criterion for -assessing a model's predictive ability -the \emph{generalization loss}. -We shall also define another criterion called the \emph{Bayes free energy}, +assessing a model's predictive performance the \emph{generalization loss} +(see below for a precise definition). +We shall also define another criterion called +the \emph{Bayes free energy}, which is nothing but the negative log \emph{marginal likelihood}. The Bayes free energy is better compared with the generalization loss as we shall see shortly. +The terminology here is due to \citet{Watanabe:BayesStatistics}. Let us now introduce some notation. First, let $\mathbf{X} = \{ \mathbf{x}_1, \dots, \mathbf{x}_N \}$ be the training data From 5805c4493ba55c787ab32e7f454ce6784704a133 Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Wed, 23 May 2018 00:07:18 +0900 Subject: [PATCH 43/70] Edit: More on generalization error vs. 
marginal likelihood --- prml_errata.tex | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/prml_errata.tex b/prml_errata.tex index af385c9..e657ceb 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -2085,7 +2085,8 @@ \subsubsection*{#1} (or generalization loss vs.\ Bayes free energy)} In what follows, I would like to further elaborate on the difference between the two criteria for model selection. -In order to facilitate the discussion, we first introduce some terminology. +In order to facilitate the discussion, +we first introduce some terminology due to \citet{Watanabe:BayesStatistics}. Throughout the discussion, special care must be taken about the distribution with which we take expectation because the ``true'' distribution is unknown in general @@ -2114,7 +2115,6 @@ \subsubsection*{#1} the \emph{Bayes free energy}, which is nothing but the negative log \emph{marginal likelihood}. The Bayes free energy is better compared with the generalization loss as we shall see shortly. -The terminology here is due to \citet{Watanabe:BayesStatistics}. Let us now introduce some notation. First, let $\mathbf{X} = \{ \mathbf{x}_1, \dots, \mathbf{x}_N \}$ be the training data From 6663a0046b86977dc5d1be3b95f624eaeb1017d5 Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Wed, 23 May 2018 22:47:02 +0900 Subject: [PATCH 44/70] Edit erratum on Page 33, Paragraph 3 --- prml_errata.tex | 17 +++++++++++------ 1 file changed, 11 insertions(+), 6 deletions(-) diff --git a/prml_errata.tex b/prml_errata.tex index e657ceb..252893c 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -298,20 +298,25 @@ \subsubsection*{#1} \erratum{Page~33} Paragraph~3: -The two information criteria for model selection mentioned here, namely, -(i)~the \emph{Akaike information criterion} (AIC) and +Note that the two information criteria for model selection mentioned here, namely, AIC and BIC +are different criteria with different goals (see below). +However, the difference between the two criteria (i.e., AIC and BIC) or, more generally, +the difference between the two goals (i.e., generalization or model identification) +seems not to be well-recognized in PRML; +we shall come back to this issue later in this report. + +\parhead{AIC vs.\ BIC} +(i)~The \emph{Akaike information criterion} (AIC) and (ii)~Schwartz's \emph{Bayesian information criterion} (BIC; see Section~4.4.1) are different criteria with different goals, i.e., (i)~to make better prediction given the training data set (an ability called \emph{generalization}) and -(ii)~to find the ``true'' model from which the data set has been generated +(ii)~to identify the ``true'' model from which the data set has been generated (or to better explain the data set in terms of the \emph{marginal likelihood}; see Section~3.4), respectively.\footnote{% -Since we cannot tell which goal (generalization or to find the ``true'' model) +Since we cannot tell which goal (generalization or model identification) is more ``Bayesian'' than the other, the term ``Bayesian information criteria'' is a misnomer.} -However, the difference is not well-recognized in PRML; -we shall come back to this issue later in this report. 
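The AIC-versus-BIC contrast is easy to see numerically. The following sketch scores polynomial regression models of increasing order under the common conventions AIC = 2k - 2 ln L and BIC = k ln N - 2 ln L, both to be minimized (references differ in sign and scaling, so these forms should be taken as one usual convention rather than the one used in PRML). The cubic data-generating function, the noise level, and the sample size are assumptions made only for this illustration.

    import numpy as np

    rng = np.random.default_rng(1)

    # Synthetic data from a cubic polynomial with Gaussian noise.
    N = 50
    x = np.linspace(-1.0, 1.0, N)
    t = 1.0 - 2.0 * x + 0.5 * x**3 + 0.2 * rng.standard_normal(N)

    def fit_and_score(M):
        """Fit a degree-M polynomial by maximum likelihood and return (AIC, BIC)."""
        Phi = np.vander(x, M + 1, increasing=True)
        w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
        resid = t - Phi @ w_ml
        sigma2_ml = np.mean(resid**2)                    # ML estimate of the noise variance
        log_lik = -0.5 * N * (np.log(2 * np.pi * sigma2_ml) + 1.0)
        k = M + 2                                        # weights plus the noise variance
        aic = 2 * k - 2 * log_lik
        bic = k * np.log(N) - 2 * log_lik
        return aic, bic

    for M in range(1, 10):
        aic, bic = fit_and_score(M)
        print(f"M={M}: AIC={aic:8.2f}  BIC={bic:8.2f}")
    # BIC penalizes extra parameters more heavily (log N > 2 once N >= 8),
    # so it tends to select a smaller model than AIC on the same data,
    # consistent with the two criteria having different goals.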
\erratum{Page~33} The line after (1.73): From cc7b6199d52155adc73dd5cabb30578b686fcb9e Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Wed, 23 May 2018 23:48:47 +0900 Subject: [PATCH 45/70] Add erratum on Page 217, Paragraph 3 --- prml_errata.tex | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/prml_errata.tex b/prml_errata.tex index 4c0945f..22424dd 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -2285,6 +2285,17 @@ \subsubsection*{#1} Paragraph~1, Line~1: ``must related'' should be ``must be related.'' +\erratum{Page~217} +Paragraph~3: +Here, it is pointed out that information criteria such as AIC and BIC are no longer valid +if the posterior cannot be approximated by a Gaussian +(such a model is called \emph{singular} and, in fact, +many practical models are known to be singular). +It is also worth noting here that new information criteria applicable for singular models +have been recently proposed, namely, +WAIC~\citep{Watanabe:WAIC,Watanabe:BayesStatistics} and WBIC~\citep{Watanabe:WBIC}, +which are considered generalized versions of AIC and BIC, respectively. + \erratum{Page~218} Equation~(4.144): The covariance should be the one~$\mathbf{S}_N$ evaluated at $\mathbf{w}_{\text{MAP}}$. From d29b1f8438769e9ca0dcbec88248744fadaddaab Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Wed, 23 May 2018 23:55:21 +0900 Subject: [PATCH 46/70] Edit: Generalization error vs. marginal likelihood --- prml_errata.tex | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/prml_errata.tex b/prml_errata.tex index 22424dd..25ee32c 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -2078,7 +2078,8 @@ \subsubsection*{#1} (i)~the generalization error if they want to make better prediction given the data set; and (ii)~the marginal likelihood -if they want to find the ``true'' model from which the data set has been generated. +if they want to identify the ``true'' model from which the data set has been generated +(see below for more discussion). Of course, nothing prevents us from examining the behavior of \emph{both} the two criteria, if possible, to assess the model concerned; From 2210cf5814a0f9280a06c9cd492e1f7adaf94c72 Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Thu, 24 May 2018 00:03:55 +0900 Subject: [PATCH 47/70] Edit: A Bayesian model that exhibits overfitting --- prml_errata.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/prml_errata.tex b/prml_errata.tex index 25ee32c..0778483 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -2046,7 +2046,7 @@ \subsubsection*{#1} so that the predictive distribution~(3.58) over $t$ given $\bm{\mathsf{t}}$ will be well approximated by the conditional~(3.8) conditioned on $\mathbf{w}_{\text{ML}}$ and also sharply peaked around the regression function~(3.3). -Stated differently, learning thus assumed a Bayesian model reduces to the least squares method, +Stated differently, learning thus assumed a model reduces to the least squares method, which is known to suffer from overfitting so that the \emph{generalization error} can be very large if the regression function is too expressive, e.g., the order~$M$ of the polynomial regression function~\eqref{eq:polynomial_regression_function} From 53bef82f732110916de690a228598943601df871 Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Thu, 24 May 2018 00:28:43 +0900 Subject: [PATCH 48/70] Edit: More on generalization error vs. 
marginal likelihood --- prml_errata.tex | 17 +++++++++-------- 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/prml_errata.tex b/prml_errata.tex index 0778483..eb6d4ae 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -2092,8 +2092,7 @@ \subsubsection*{#1} (or generalization loss vs.\ Bayes free energy)} In what follows, I would like to further elaborate on the difference between the two criteria for model selection. -In order to facilitate the discussion, -we first introduce some terminology due to \citet{Watanabe:BayesStatistics}. +In order to facilitate the discussion, we first introduce some terminology. Throughout the discussion, special care must be taken about the distribution with which we take expectation because the ``true'' distribution is unknown in general @@ -2111,17 +2110,19 @@ \subsubsection*{#1} and can be generally defined as the \emph{expected loss}~(see Section~1.5.5) evaluated for the predicted target value (i.e., a point estimate) with some arbitrary loss function given (e.g., the squared error for regression). -An alternative definition from a Bayesian point of view would be in terms of -the predictive distribution. -Specifically, we can define the generalization error as -the expected negative log predictive distribution; +An alternative definition of the generalization error from a Bayesian point of view would be +in terms of the predictive distribution. +Specifically, we can define it as the expected negative log predictive distribution; to avoid ambiguity, we hereafter call this particular criterion for assessing a model's predictive performance the \emph{generalization loss} (see below for a precise definition). We shall also define another criterion called the \emph{Bayes free energy}, -which is nothing but the negative log \emph{marginal likelihood}. -The Bayes free energy is better compared with the generalization loss as we shall see shortly. +which is nothing but the negative log marginal likelihood. +The Bayes free energy is better compared with the generalization loss +as we shall see shortly.\footnote{% +The terms~\emph{Bayes free energy} and \emph{generalization loss} are +due to \citet{Watanabe:BayesStatistics}.} Let us now introduce some notation. First, let $\mathbf{X} = \{ \mathbf{x}_1, \dots, \mathbf{x}_N \}$ be the training data From 63439ae77456c0ac2bcb3a9a8f0ef5398ef43943 Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Thu, 24 May 2018 06:23:58 +0900 Subject: [PATCH 49/70] Edit erratum on Page 217, Paragraph 3 --- prml_errata.tex | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/prml_errata.tex b/prml_errata.tex index eb6d4ae..902b9a6 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -2293,8 +2293,8 @@ \subsubsection*{#1} if the posterior cannot be approximated by a Gaussian (such a model is called \emph{singular} and, in fact, many practical models are known to be singular). -It is also worth noting here that new information criteria applicable for singular models -have been recently proposed, namely, +It is also worth noting here that there have been +new information criteria applicable for singular models recently proposed, namely, WAIC~\citep{Watanabe:WAIC,Watanabe:BayesStatistics} and WBIC~\citep{Watanabe:WBIC}, which are considered generalized versions of AIC and BIC, respectively. 
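For readers curious what applying WAIC involves in practice, the following sketch computes it from posterior samples in the usual way, as the log pointwise predictive density minus a variance-based penalty, following Watanabe's definition as it is commonly implemented. The S-by-N matrix of pointwise log-likelihoods and the conjugate Gaussian toy posterior used to exercise the function are interface assumptions made for this illustration, not anything prescribed by the erratum; WBIC is not shown, since it requires samples from a tempered posterior rather than the ordinary one.

    import numpy as np

    def waic(log_lik):
        """
        WAIC from a matrix of pointwise log-likelihoods.

        log_lik[s, n] = log p(x_n | w_s), where w_s (s = 1..S) are posterior draws
        and x_n (n = 1..N) are the training points.
        Returns WAIC on the deviance scale, -2 * (lppd - p_waic).
        """
        S, N = log_lik.shape
        # Log pointwise predictive density: log of the posterior-averaged likelihood.
        lppd = np.sum(np.logaddexp.reduce(log_lik, axis=0) - np.log(S))
        # Effective number of parameters: posterior variances of the log-likelihood.
        p_waic = np.sum(np.var(log_lik, axis=0, ddof=1))
        return -2.0 * (lppd - p_waic)

    # Tiny check with a conjugate toy model: unknown mean, known unit variance.
    rng = np.random.default_rng(2)
    x = rng.normal(0.5, 1.0, size=20)                                       # "observed" data
    mu = rng.normal(np.mean(x), 1.0 / np.sqrt(len(x)), size=2000)           # approximate posterior draws
    log_lik = -0.5 * np.log(2 * np.pi) - 0.5 * (x[None, :] - mu[:, None]) ** 2
    print("WAIC:", waic(log_lik))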
From fa563a301d10a48ab3f0b5b29a136a1ff91e183c Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Thu, 24 May 2018 07:42:09 +0900 Subject: [PATCH 50/70] Edit erratum on Page 217, Paragraph 3 --- prml_errata.tex | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/prml_errata.tex b/prml_errata.tex index 902b9a6..372199a 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -2293,10 +2293,10 @@ \subsubsection*{#1} if the posterior cannot be approximated by a Gaussian (such a model is called \emph{singular} and, in fact, many practical models are known to be singular). -It is also worth noting here that there have been -new information criteria applicable for singular models recently proposed, namely, +It is also worth noting here that there have been recently proposed +new information criteria applicable to singular models, namely, WAIC~\citep{Watanabe:WAIC,Watanabe:BayesStatistics} and WBIC~\citep{Watanabe:WBIC}, -which are considered generalized versions of AIC and BIC, respectively. +which are generalized versions of AIC and BIC, respectively. \erratum{Page~218} Equation~(4.144): From 01647af6c72a99744c24e0314d92c635d2f6de94 Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Fri, 25 May 2018 00:24:04 +0900 Subject: [PATCH 51/70] Edit erratum on Page 33, Paragraph 3 --- prml_errata.tex | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/prml_errata.tex b/prml_errata.tex index 372199a..480d8ba 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -302,7 +302,8 @@ \subsubsection*{#1} Note that the two information criteria for model selection mentioned here, namely, AIC and BIC are different criteria with different goals (see below). However, the difference between the two criteria (i.e., AIC and BIC) or, more generally, -the difference between the two goals (i.e., generalization or model identification) +the difference between their goals +(i.e., generalization and to identify the ``true'' model, respectively) seems not to be well-recognized in PRML; we shall come back to this issue later in this report. @@ -315,7 +316,7 @@ \subsubsection*{#1} (ii)~to identify the ``true'' model from which the data set has been generated (or to better explain the data set in terms of the \emph{marginal likelihood}; see Section~3.4), respectively.\footnote{% -Since we cannot tell which goal (generalization or model identification) +Since we cannot tell which goal (i.e., generalization or to identify the ``true'' model) is more ``Bayesian'' than the other, the term ``Bayesian information criteria'' is a misnomer.} From d2596b4a740747cd043fc09472d87c7adad5d77e Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Fri, 25 May 2018 01:13:12 +0900 Subject: [PATCH 52/70] Edit: More on generalization error vs. marginal likelihood --- prml_errata.tex | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/prml_errata.tex b/prml_errata.tex index 480d8ba..0438d8b 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -2126,10 +2126,10 @@ \subsubsection*{#1} due to \citet{Watanabe:BayesStatistics}.} Let us now introduce some notation. -First, let $\mathbf{X} = \{ \mathbf{x}_1, \dots, \mathbf{x}_N \}$ be the training data +First, let $\mathbf{X} = \{ \mathbf{x}_1, \dots, \mathbf{x}_N \}$ be the training data set (we use a vector~$\mathbf{x}$ instead of a scalar~$t$ for the target variable here). 
-They are assumed to be i.i.d.\ and -have been generated from some true distribution $p(\mathbf{x})$ so that +We assume that the data set~$\mathbf{X}$ is i.i.d.\ and +has been generated from some true distribution~$p(\cdot)$ so that \begin{equation} p(\mathbf{X}) = p(\mathbf{x}_1, \dots, \mathbf{x}_N) = \prod_{n=1}^{N} p(\mathbf{x}_n). \end{equation} @@ -2141,8 +2141,7 @@ \subsubsection*{#1} i.e., the conditioning on $\mathcal{M}$ is implicit for $q(\cdot)$.\footnote{% We define $p(\cdot|a|b) \equiv p(\cdot|a, b)$ so that we can write the conditional~$q(\cdot|\cdot)$.} -If there exists a true model~$\mathcal{M}^{\star}$, -then we can write the true distribution as +If there exists some true model~$\mathcal{M}^{\star}$, then we have \begin{equation} p(\cdot) = p(\cdot|\mathcal{M}^{\star}). \end{equation} From 5b623ee6925d49e5e6b740040346e3f435cbd952 Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Fri, 25 May 2018 06:30:08 +0900 Subject: [PATCH 53/70] Edit: A Bayesian model that exhibits overfitting --- prml_errata.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/prml_errata.tex b/prml_errata.tex index 0438d8b..94a26fd 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -2047,7 +2047,7 @@ \subsubsection*{#1} so that the predictive distribution~(3.58) over $t$ given $\bm{\mathsf{t}}$ will be well approximated by the conditional~(3.8) conditioned on $\mathbf{w}_{\text{ML}}$ and also sharply peaked around the regression function~(3.3). -Stated differently, learning thus assumed a model reduces to the least squares method, +Stated differently, learning thus assumed a Bayesian model reduces to the least squares method, which is known to suffer from overfitting so that the \emph{generalization error} can be very large if the regression function is too expressive, e.g., the order~$M$ of the polynomial regression function~\eqref{eq:polynomial_regression_function} From db4407e76bec46341e5b78df2f11c247f3737a56 Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Fri, 25 May 2018 23:20:23 +0900 Subject: [PATCH 54/70] Edit erratum on Page 147, Paragraph -2 --- prml_errata.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/prml_errata.tex b/prml_errata.tex index 94a26fd..50759b3 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -2023,7 +2023,7 @@ \subsubsection*{#1} Bayesian methods, like any other machine learning methods, can overfit because the ``true'' model from which the data set has been generated is unknown in general so that one could possibly assume an inappropriate model, -say, too expressive an model that would make terribly wrong predictions very confidently; +say, too expressive a model that would make terribly wrong predictions very confidently; this is true even when we take a ``fully Bayesian'' approach (i.e., \emph{not} maximum likelihood, MAP, or whatever). From facfee60505380796e30901a1bbb38550f745206 Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Sat, 26 May 2018 00:33:31 +0900 Subject: [PATCH 55/70] Edit: More on generalization error vs. 
marginal likelihood --- prml_errata.tex | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/prml_errata.tex b/prml_errata.tex index 50759b3..8502baa 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -2105,13 +2105,13 @@ \subsubsection*{#1} First of all, we should point out that so far we have used the term \emph{generalization error} somewhat loosely -(in this report and also in PRML); +in this report and also in PRML; it is used primarily in the context of a frequentist inference (such as the one in Section~3.2) -and can be generally defined as the \emph{expected loss}~(see Section~1.5.5) +and can generally be defined as the \emph{expected loss}~(see Section~1.5.5) evaluated for the predicted target value (i.e., a point estimate) with some arbitrary loss function given (e.g., the squared error for regression). -An alternative definition of the generalization error from a Bayesian point of view would be +An alternative definition of the generalization error from a Bayesian perspective would be in terms of the predictive distribution. Specifically, we can define it as the expected negative log predictive distribution; to avoid ambiguity, we hereafter call this particular criterion for @@ -2122,8 +2122,8 @@ \subsubsection*{#1} which is nothing but the negative log marginal likelihood. The Bayes free energy is better compared with the generalization loss as we shall see shortly.\footnote{% -The terms~\emph{Bayes free energy} and \emph{generalization loss} are -due to \citet{Watanabe:BayesStatistics}.} +The terms~\emph{generalization loss} and (Bayes) \emph{free energy} +are due to \citet{Watanabe:BayesStatistics}.} Let us now introduce some notation. First, let $\mathbf{X} = \{ \mathbf{x}_1, \dots, \mathbf{x}_N \}$ be the training data set From c2e6d500b57e41f44f48299b790f083650b14e22 Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Sat, 26 May 2018 01:57:26 +0900 Subject: [PATCH 56/70] Edit: More on generalization error vs. marginal likelihood --- prml_errata.tex | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/prml_errata.tex b/prml_errata.tex index 8502baa..6aa1e1d 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -2145,9 +2145,14 @@ \subsubsection*{#1} \begin{equation} p(\cdot) = p(\cdot|\mathcal{M}^{\star}). \end{equation} -The model~$\mathcal{M}$ consists of a pair of the likelihood~$q(\mathbf{x}|\mathbf{w})$ and -the prior~$q(\mathbf{w})$ where $\mathbf{w}$ is a set of parameters -(including hyperparameters and so on). +The model~$\mathcal{M}$ consists of a pair of +(i)~the prior~$q(\mathbf{w})$ over a set of parameters~$\mathbf{w}$ and +(ii)~the conditional~$q(\mathbf{x}|\mathbf{w})$ over $\mathbf{x}$ given $\mathbf{w}$.\footnote{% +The model~$\mathcal{M}$ may include +a hyperprior~$q(\bm{\xi})$ over some hyperparameters~$\bm{\xi}$ and so on. +It is easy to see that the discussion here still applies to such a model because we can consider +the joint prior of the form~$q(\mathbf{w}, \bm{\xi}) = +q(\mathbf{w}|\bm{\xi}) q(\bm{\xi})$ and, therefore, $\bm{\xi}$ can be absorbed into $\mathbf{w}$.} The marginal likelihood of the model~$\mathcal{M}$ is given by \begin{equation} q(\mathbf{X}) = \int \mathrm{d}\mathbf{w} \, q(\mathbf{w}) \, q(\mathbf{X}|\mathbf{w}) = From 05ddc4f87d08b8df6b6a68235e1750aab0de6be2 Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Sat, 26 May 2018 22:42:32 +0900 Subject: [PATCH 57/70] Edit: More on generalization error vs. 
marginal likelihood --- prml_errata.tex | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/prml_errata.tex b/prml_errata.tex index 6aa1e1d..a07196f 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -2100,8 +2100,12 @@ \subsubsection*{#1} and, therefore, we must assume some model (i.e., hypothetical) distributions so that confusion can easily arise about which distribution is concerned. To avoid such confusion, we also introduce some notation. -The terminology and the notation to be introduced here are somewhat different from + +Note that the terminology and the notation to be introduced here are somewhat different from those of PRML or other part of this report. +The terminology and the discussion are largely due to \citet{Watanabe:BayesStatistics}, +though the notation is not because, as always, +I have tried to follow that of PRML as closely as possible. First of all, we should point out that so far we have used the term \emph{generalization error} somewhat loosely @@ -2121,9 +2125,7 @@ \subsubsection*{#1} the \emph{Bayes free energy}, which is nothing but the negative log marginal likelihood. The Bayes free energy is better compared with the generalization loss -as we shall see shortly.\footnote{% -The terms~\emph{generalization loss} and (Bayes) \emph{free energy} -are due to \citet{Watanabe:BayesStatistics}.} +as we shall see shortly. Let us now introduce some notation. First, let $\mathbf{X} = \{ \mathbf{x}_1, \dots, \mathbf{x}_N \}$ be the training data set From 4c1d0eb8e1114e6421e9a61370aa68203e27ac20 Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Sat, 26 May 2018 22:47:57 +0900 Subject: [PATCH 58/70] Edit: More on generalization error vs. marginal likelihood --- prml_errata.tex | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/prml_errata.tex b/prml_errata.tex index a07196f..b6495d7 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -2124,8 +2124,7 @@ \subsubsection*{#1} We shall also define another criterion called the \emph{Bayes free energy}, which is nothing but the negative log marginal likelihood. -The Bayes free energy is better compared with the generalization loss -as we shall see shortly. +The generalization loss is better compared with the Bayes free energy as we shall see shortly. Let us now introduce some notation. First, let $\mathbf{X} = \{ \mathbf{x}_1, \dots, \mathbf{x}_N \}$ be the training data set From 31e24edde3626f7760f6de5036f086d594096f48 Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Sat, 26 May 2018 23:25:10 +0900 Subject: [PATCH 59/70] Edit: More on generalization error vs. marginal likelihood --- prml_errata.tex | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/prml_errata.tex b/prml_errata.tex index b6495d7..844d2aa 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -2137,14 +2137,14 @@ \subsubsection*{#1} Let $\mathcal{M}$ be the assumed model we wish to learn. As a shorthand, the probability distribution of our assumed model~$\mathcal{M}$ is denoted by \begin{equation} -q(\cdot) = p(\cdot|\mathcal{M}) +q(\cdot) \equiv p(\cdot|\mathcal{M}) \end{equation} i.e., the conditioning on $\mathcal{M}$ is implicit for $q(\cdot)$.\footnote{% We define $p(\cdot|a|b) \equiv p(\cdot|a, b)$ so that we can write the conditional~$q(\cdot|\cdot)$.} -If there exists some true model~$\mathcal{M}^{\star}$, then we have +If there exists some true model~$\mathcal{M}^{\star}$, then \begin{equation} -p(\cdot) = p(\cdot|\mathcal{M}^{\star}). +p(\cdot) \equiv p(\cdot|\mathcal{M}^{\star}). 
\end{equation} The model~$\mathcal{M}$ consists of a pair of (i)~the prior~$q(\mathbf{w})$ over a set of parameters~$\mathbf{w}$ and From 1b653793c786320dbd8e5fbb30dcb843f028a0f6 Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Sat, 26 May 2018 23:26:21 +0900 Subject: [PATCH 60/70] Edit: More on generalization error vs. marginal likelihood --- prml_errata.tex | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/prml_errata.tex b/prml_errata.tex index 844d2aa..f93a0ec 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -2146,9 +2146,9 @@ \subsubsection*{#1} \begin{equation} p(\cdot) \equiv p(\cdot|\mathcal{M}^{\star}). \end{equation} -The model~$\mathcal{M}$ consists of a pair of -(i)~the prior~$q(\mathbf{w})$ over a set of parameters~$\mathbf{w}$ and -(ii)~the conditional~$q(\mathbf{x}|\mathbf{w})$ over $\mathbf{x}$ given $\mathbf{w}$.\footnote{% +The model~$\mathcal{M}$ consists of a pair of: +(i)~a prior~$q(\mathbf{w})$ over a set of parameters~$\mathbf{w}$; and +(ii)~a conditional~$q(\mathbf{x}|\mathbf{w})$ over $\mathbf{x}$ given $\mathbf{w}$.\footnote{% The model~$\mathcal{M}$ may include a hyperprior~$q(\bm{\xi})$ over some hyperparameters~$\bm{\xi}$ and so on. It is easy to see that the discussion here still applies to such a model because we can consider From 62eb8e2b2d275d3db4c1c45c2c1d7bae72c4dd94 Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Sat, 26 May 2018 23:52:40 +0900 Subject: [PATCH 61/70] Edit: More on generalization error vs. marginal likelihood --- prml_errata.tex | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/prml_errata.tex b/prml_errata.tex index f93a0ec..f0a8dd2 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -2150,9 +2150,9 @@ \subsubsection*{#1} (i)~a prior~$q(\mathbf{w})$ over a set of parameters~$\mathbf{w}$; and (ii)~a conditional~$q(\mathbf{x}|\mathbf{w})$ over $\mathbf{x}$ given $\mathbf{w}$.\footnote{% The model~$\mathcal{M}$ may include -a hyperprior~$q(\bm{\xi})$ over some hyperparameters~$\bm{\xi}$ and so on. -It is easy to see that the discussion here still applies to such a model because we can consider -the joint prior of the form~$q(\mathbf{w}, \bm{\xi}) = +a hyperprior~$q(\bm{\xi})$ over a set of hyperparameters~$\bm{\xi}$ and so on. +It is easy to see that the discussion here is applicable also to such a hierarchical model +because we can consider the joint prior of the form~$q(\mathbf{w}, \bm{\xi}) = q(\mathbf{w}|\bm{\xi}) q(\bm{\xi})$ and, therefore, $\bm{\xi}$ can be absorbed into $\mathbf{w}$.} The marginal likelihood of the model~$\mathcal{M}$ is given by \begin{equation} From 99a52018d2e64a89b11779e7f8165db6dbfa0eae Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Sat, 26 May 2018 23:53:26 +0900 Subject: [PATCH 62/70] Edit: More on generalization error vs. 
marginal likelihood --- prml_errata.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/prml_errata.tex b/prml_errata.tex index f0a8dd2..930c9f1 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -2154,7 +2154,7 @@ \subsubsection*{#1} It is easy to see that the discussion here is applicable also to such a hierarchical model because we can consider the joint prior of the form~$q(\mathbf{w}, \bm{\xi}) = q(\mathbf{w}|\bm{\xi}) q(\bm{\xi})$ and, therefore, $\bm{\xi}$ can be absorbed into $\mathbf{w}$.} -The marginal likelihood of the model~$\mathcal{M}$ is given by +The marginal likelihood (or the evidence) of the model~$\mathcal{M}$ is given by \begin{equation} q(\mathbf{X}) = \int \mathrm{d}\mathbf{w} \, q(\mathbf{w}) \, q(\mathbf{X}|\mathbf{w}) = \int \mathrm{d}\mathbf{w} \, q(\mathbf{w}) \prod_{n=1}^{N} q(\mathbf{x}_n|\mathbf{w}). From eed64295ee6ececc7ffcee2968aa888e69c6ee72 Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Sat, 26 May 2018 23:55:18 +0900 Subject: [PATCH 63/70] Edit: More on generalization error vs. marginal likelihood --- prml_errata.tex | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/prml_errata.tex b/prml_errata.tex index 930c9f1..1a34ce6 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -2159,7 +2159,8 @@ \subsubsection*{#1} q(\mathbf{X}) = \int \mathrm{d}\mathbf{w} \, q(\mathbf{w}) \, q(\mathbf{X}|\mathbf{w}) = \int \mathrm{d}\mathbf{w} \, q(\mathbf{w}) \prod_{n=1}^{N} q(\mathbf{x}_n|\mathbf{w}). \end{equation} -Note that $q(\mathbf{X})$ does not factorize in general because we have some unknown +Note that, although the conditional~$q(\mathbf{X}|\mathbf{w})$ does factorize, +the marginal likelihood~$q(\mathbf{X})$ does not in general because we have some unknown parameters~$\mathbf{w}$ in the model~$\mathcal{M}$. \erratum{Page~156} From f1875bc0949db0a4c54fa1dc7c0f158c1e3566dd Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Sun, 27 May 2018 00:13:16 +0900 Subject: [PATCH 64/70] Edit: More on generalization error vs. marginal likelihood --- prml_errata.tex | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/prml_errata.tex b/prml_errata.tex index 1a34ce6..ac33a0e 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -2129,8 +2129,9 @@ \subsubsection*{#1} Let us now introduce some notation. First, let $\mathbf{X} = \{ \mathbf{x}_1, \dots, \mathbf{x}_N \}$ be the training data set (we use a vector~$\mathbf{x}$ instead of a scalar~$t$ for the target variable here). -We assume that the data set~$\mathbf{X}$ is i.i.d.\ and -has been generated from some true distribution~$p(\cdot)$ so that +We assume that the data set~$\mathbf{X}$ has been generated from +some true distribution~$p(\cdot)$ and +is i.i.d.\ (or \emph{independent and identically distributed}; see Section~1.2.4) so that \begin{equation} p(\mathbf{X}) = p(\mathbf{x}_1, \dots, \mathbf{x}_N) = \prod_{n=1}^{N} p(\mathbf{x}_n). \end{equation} From 1000810423ccbf83662eec24d51e65e558a97b2b Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Mon, 28 May 2018 23:07:39 +0900 Subject: [PATCH 65/70] Edit: AIC vs. 
BIC --- prml_errata.tex | 3 +++ 1 file changed, 3 insertions(+) diff --git a/prml_errata.tex b/prml_errata.tex index 0dffd7b..17a42dc 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -319,6 +319,9 @@ \subsubsection*{#1} Since we cannot tell which goal (i.e., generalization or to identify the ``true'' model) is more ``Bayesian'' than the other, the term ``Bayesian information criteria'' is a misnomer.} +Although the two criteria are often seen as competing, +one can see from the above that, since their goals are different, +there is no point in asking which criterion is optimal unconditionally. \erratum{Page~33} The line after (1.73): From dfba4db6706702d2535e03aac6eb699a892a8bb6 Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Mon, 28 May 2018 23:11:18 +0900 Subject: [PATCH 66/70] i.i.d. is well-known abbreviation --- prml_errata.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/prml_errata.tex b/prml_errata.tex index 17a42dc..6c5d6b0 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -2134,7 +2134,7 @@ \subsubsection*{#1} (we use a vector~$\mathbf{x}$ instead of a scalar~$t$ for the target variable here). We assume that the data set~$\mathbf{X}$ has been generated from some true distribution~$p(\cdot)$ and -is i.i.d.\ (or \emph{independent and identically distributed}; see Section~1.2.4) so that +is i.i.d.\ so that \begin{equation} p(\mathbf{X}) = p(\mathbf{x}_1, \dots, \mathbf{x}_N) = \prod_{n=1}^{N} p(\mathbf{x}_n). \end{equation} From d3128bcf3663417c28858739587c809a22bc3cbc Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Mon, 28 May 2018 23:20:08 +0900 Subject: [PATCH 67/70] Edit: More on generalization error vs. marginal likelihood --- prml_errata.tex | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/prml_errata.tex b/prml_errata.tex index 6c5d6b0..65d8816 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -2106,9 +2106,8 @@ \subsubsection*{#1} Note that the terminology and the notation to be introduced here are somewhat different from those of PRML or other part of this report. -The terminology and the discussion are largely due to \citet{Watanabe:BayesStatistics}, -though the notation is not because, as always, -I have tried to follow that of PRML as closely as possible. +The terminology and the discussion here are largely due to \citet{Watanabe:BayesStatistics}, +though the notation is not because I have yet tried to follow that of PRML as closely as possible. First of all, we should point out that so far we have used the term \emph{generalization error} somewhat loosely From f481c44984689b6383dc844c10a42ff2ffd2ecb1 Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Mon, 28 May 2018 23:34:33 +0900 Subject: [PATCH 68/70] Edit: AIC vs. BIC --- prml_errata.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/prml_errata.tex b/prml_errata.tex index 65d8816..61ad735 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -318,7 +318,7 @@ \subsubsection*{#1} respectively.\footnote{% Since we cannot tell which goal (i.e., generalization or to identify the ``true'' model) is more ``Bayesian'' than the other, -the term ``Bayesian information criteria'' is a misnomer.} +``Bayesian information criterion'' is a misnomer.} Although the two criteria are often seen as competing, one can see from the above that, since their goals are different, there is no point in asking which criterion is optimal unconditionally. 
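As a concrete, if naive, illustration of the Bayes free energy, i.e. the negative log marginal likelihood discussed above, the following sketch estimates q(X) by simple Monte Carlo over prior draws for a toy conjugate Gaussian model and checks the result against the closed-form evidence. The model, its hyperparameters, and the use of plain prior sampling (which scales poorly with N and with model dimension) are assumptions made only for this illustration.

    import numpy as np

    rng = np.random.default_rng(3)

    # Toy model: prior w ~ N(0, tau^2), conditional x | w ~ N(w, sigma^2), i.i.d. data.
    tau, sigma = 1.0, 0.5
    N = 8
    x = rng.normal(0.3, sigma, size=N)

    # Simple Monte Carlo estimate of q(X) = E_{q(w)}[ prod_n q(x_n | w) ], in log space.
    S = 200_000
    w = rng.normal(0.0, tau, size=S)
    log_cond = -0.5 * np.log(2 * np.pi * sigma**2) - (x[None, :] - w[:, None]) ** 2 / (2 * sigma**2)
    log_qX_mc = np.logaddexp.reduce(log_cond.sum(axis=1)) - np.log(S)
    print("Bayes free energy, -ln q(X), by Monte Carlo:", -log_qX_mc)

    # Exact evidence for this conjugate model: X ~ N(0, sigma^2 I + tau^2 1 1^T).
    cov = sigma**2 * np.eye(N) + tau**2 * np.ones((N, N))
    _, logdet = np.linalg.slogdet(cov)
    log_qX_exact = -0.5 * (N * np.log(2 * np.pi) + logdet + x @ np.linalg.solve(cov, x))
    print("Bayes free energy, exact:                   ", -log_qX_exact)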
From 9de3cfb979f50fac04f35ec008b09d7df03b8899 Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Tue, 5 Jun 2018 22:47:26 +0900 Subject: [PATCH 69/70] Edit erratum on Page 217, Paragraph 3 --- prml_errata.tex | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/prml_errata.tex b/prml_errata.tex index 61ad735..8187062 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -2300,9 +2300,10 @@ \subsubsection*{#1} \erratum{Page~217} Paragraph~3: Here, it is pointed out that information criteria such as AIC and BIC are no longer valid -if the posterior cannot be approximated by a Gaussian -(such a model is called \emph{singular} and, in fact, -many practical models are known to be singular). +if the posterior cannot be approximated by a Gaussian; +such a model is called \emph{singular} and, in fact, +many practical models are known to be singular +\citep{Watanabe:BayesStatistics,Watanabe:WAIC}. It is also worth noting here that there have been recently proposed new information criteria applicable to singular models, namely, WAIC~\citep{Watanabe:WAIC,Watanabe:BayesStatistics} and WBIC~\citep{Watanabe:WBIC}, From bc84481c15ca41736fb102b133a175c07e3d8ff9 Mon Sep 17 00:00:00 2001 From: Yousuke Takada Date: Wed, 2 Jan 2019 22:11:33 +0900 Subject: [PATCH 70/70] Edit prml_errata.tex --- prml_errata.tex | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+) diff --git a/prml_errata.tex b/prml_errata.tex index 5c4b552..8ceeeb9 100644 --- a/prml_errata.tex +++ b/prml_errata.tex @@ -2181,6 +2181,24 @@ \subsubsection*{#1} the marginal likelihood~$q(\mathbf{X})$ does not in general because we have some unknown parameters~$\mathbf{w}$ in the model~$\mathcal{M}$. +% TODO + +%[cf. \emph{cross-validation} (see Section~1.3)] + +%In \citet{Watanabe:BayesStatistics}, +%the generalization error is defined as the Kullback-Leibler divergence~$ +%\operatorname{KL}\left(p(\mathbf{x})\middle\|q(\mathbf{x}|\mathbf{X})\right)$ +%between the true distribution~$p(\mathbf{x})$ and +%the predictive distribution~$q(\mathbf{x}|\mathbf{X})$. + +% TODO: +% Section 3.5.1, generalization loss and evidence +% Section 1.5.5 + +%For WAIC and WBIC, see \citet{Watanabe:WAIC,Watanabe:WBIC}. + +% F_N is, as the subscript N suggests, a function of X + \erratum{Page~156} Equation~(3.57): The new input vector~$\mathbf{x}$ is omitted in (3.57) as in, e.g., (3.74).
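The commented-out notes above point at cross-validation and at defining the generalization error through the predictive distribution. As a companion to the evidence sketch earlier, the following computes a leave-one-out estimate of the generalization loss, the average negative log predictive density of held-out points, for the same toy conjugate Gaussian model. The model is again an assumption made for illustration, and the exact predictive used below is available only because the model is conjugate; in general the predictive would itself have to be approximated.

    import numpy as np

    rng = np.random.default_rng(4)

    # Toy model: prior w ~ N(0, tau^2), conditional x | w ~ N(w, sigma^2), i.i.d. data.
    tau, sigma = 1.0, 0.5
    N = 8
    x = rng.normal(0.3, sigma, size=N)

    def log_predictive(x_new, x_train):
        """log q(x_new | x_train) for the conjugate Gaussian model (exact)."""
        n = len(x_train)
        v = 1.0 / (1.0 / tau**2 + n / sigma**2)      # posterior variance of w
        m = v * np.sum(x_train) / sigma**2           # posterior mean of w
        var = v + sigma**2                           # predictive variance
        return -0.5 * (np.log(2 * np.pi * var) + (x_new - m) ** 2 / var)

    # Leave-one-out estimate of the generalization loss, i.e. of -E_p[ log q(x | X) ].
    loo = -np.mean([log_predictive(x[n], np.delete(x, n)) for n in range(N)])
    print("LOO estimate of the generalization loss:", loo)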