---
title: "Bayesian Data Analysis course - BDA3 notes"
date: "Page updated: `r format.Date(file.mtime('BDA3_notes.Rmd'),'%Y-%m-%d')`"
editor_options:
markdown:
wrap: 72
---
\newcommand{\E}{\operatorname{E}}
\newcommand{\N}{\operatorname{N}}
\newcommand{\Pr}{\operatorname{Pr}}
\newcommand{\Beta}{\operatorname{Beta}}
\newcommand{\Bin}{\operatorname{Bin}}
\newcommand{\BetaBinomial}{\operatorname{Beta-Binomial}}
\newcommand{\Invchisqr}{\operatorname{Inv-\chi^2}}
\newcommand{\NInvchisqr}{\operatorname{N-Inv-\chi^2}}
\newcommand{\tr}{\operatorname{tr}}
\newcommand{\trace}{\operatorname{trace}}
\newcommand{\logit}{\operatorname{logit}}
These notes help you to focus on the most important parts of each
chapter related to the Bayesian Data Analysis course. Before reading a
chapter, you can check below which sections, pages, and terms are the
most important. After reading the chapter or following the
corresponding lecture, you can check here for additional
clarifications. There are also some notes for the chapters not included
in the course.
## Chapter 1 Probability and inference {#ch1}
[Chapter 1](https://users.aalto.fi/~ave/BDA3.pdf#page=13) is related to the pre-requisites and Lecture 1 *Introduction*.
### Outline
- 1.1-1.3 important terms, especially 1.3 for the notation
- 1.4 an example related to the first exercise, and another practical
example
- 1.5 foundations
- 1.6 good example related to visualization exercise
- 1.7 example which can be skipped
- 1.8 background material, good to read before doing the first
assignment
- 1.9 background material, good to read before doing the second
assignment
- 1.10 a point of view for using Bayesian inference
### The most important terms
Find all the terms and symbols listed below. Note that some of the terms
are now only briefly introduced and will be covered later in more
detail. When reading the chapter, write down questions related to things
unclear for you or things you think might be unclear for others.
- full probability model
- posterior distribution
- potentially observable quantity
- quantities that are not directly observable
- exchangeability
- independently and identically distributed
- $\theta, y, \tilde{y}, x, X, p(\cdot|\cdot), p(\cdot), \Pr(\cdot), \sim, H$
- sd, E, var
- Bayes rule
- prior distribution
- sampling distribution, data distribution
- joint probability distribution
- posterior density
- probability
- density
- distribution
- $p(y|\theta)$ as a function of $y$ or $\theta$
- likelihood
- posterior predictive distribution
- probability as measure of uncertainty
- subjectivity and objectivity
- transformation of variables
- simulation
- inverse cumulative distribution function
### Recommended exercises
Optional but recommended end of the chapter exercises in BDA3 to get a better understanding of the chapter topic:
- [1.1-1.4, 1.6-1.8](https://users.aalto.fi/~ave/BDA3.pdf#page=37) ([model solutions available for 1.1-1.6](http://www.stat.columbia.edu/~gelman/book/solutions3.pdf))
### Distributed as, $\sim$
It is common to write statistical models using the following notation:
$$
\begin{aligned}
y & \sim \mathrm{normal}(\mu, \sigma) \\
\mu & \sim \mathrm{normal}(0, 10) \\
\sigma & \sim \mathrm{normal}^+(0, 1),
\end{aligned}
$$
where the symbol $\sim$ is called *tilde* (`\sim` in LaTeX).
In general, we can read $\sim$ as *"is distributed as"*, and overall
this notation is used as a shorthand for defining distributions, so
that the above example can also be written as
$$
\begin{aligned}
p(y| \mu, \sigma) & = \mathrm{normal}(y | \mu, \sigma)\\
p(\mu) & = \mathrm{normal}(\mu | 0, 10)\\
p(\sigma) & = \mathrm{normal}^+(\sigma | 0, 1).
\end{aligned}
$$
A collection of distribution statements define a joint
distribution as the product of component distributions
$$
p(y,\mu,\sigma) = p(y| \mu, \sigma )p(\mu) p(\sigma).
$$
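As a concrete illustration, here is a minimal R sketch (the values of `y`, `mu`, and `sigma` are hypothetical) of how the three distribution statements above combine into a joint density: on the log scale, the log joint is the sum of the component log densities.

```
# A minimal sketch (assumed example values) of the joint density defined by
# the distribution statements above; the log joint is the sum of the
# component log densities.
log_joint <- function(y, mu, sigma) {
  stopifnot(sigma > 0)                                  # half-normal prior restricts sigma > 0
  dnorm(y, mean = mu, sd = sigma, log = TRUE) +         # log p(y | mu, sigma)
    dnorm(mu, mean = 0, sd = 10, log = TRUE) +          # log p(mu)
    dnorm(sigma, mean = 0, sd = 1, log = TRUE) + log(2) # log p(sigma), half-normal
}
log_joint(y = 1.2, mu = 0.5, sigma = 0.8)               # evaluate at hypothetical values
```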
### Proportional to, $\propto$
The symbol $\propto$ means *proportional to*: the left hand side
is equal to the right hand side up to a constant multiplier. For instance, if
$y=2x$, then $y \propto x$. It's `\propto` in LaTeX. See [Proportionality in Wikipedia](https://en.wikipedia.org/wiki/Proportionality_(mathematics)).
The Bayes rule is
$$
p(\theta | y) = \frac{p(y | \theta)p(\theta)}{p(y)},
$$
where dividing by $p(y)$ makes $\int p(\theta | y) d\theta = 1$. $p(y)$ is often infeasible to compute, but luckily we also often don't need to know it, and then we write the Bayes rule as
$$
p(\theta | y) \propto p(y | \theta)p(\theta).
$$
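As a small illustration of the proportional form, the following R sketch (with an assumed binomial observation model, assumed data, and a Beta(1,1) prior) evaluates the unnormalized posterior on a grid and normalizes it numerically; the normalization changes only the scale, not the shape.

```
# Sketch of the proportionality in Bayes rule: evaluate the unnormalized
# posterior p(y|theta)p(theta) on a grid and normalize it numerically.
# The data (y successes out of n) and the Beta(1,1) prior are assumptions
# chosen just for illustration.
theta <- seq(0.001, 0.999, length.out = 1000)
y <- 7; n <- 10
unnormalized <- dbinom(y, size = n, prob = theta) * dbeta(theta, 1, 1)
posterior <- unnormalized / sum(unnormalized * (theta[2] - theta[1]))  # approximate division by p(y)
plot(theta, posterior, type = "l")
```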
### Model and likelihood
The term $p(y|\theta,M)$ has two different names depending on the situation.
Due to the short notation used, there is a possibility of confusion.
- Term $p(y|\theta,M)$ is called a *model* (sometimes more
specifically *observation model* or *statistical model*) when it is
used to describe uncertainty about $y$ given $\theta$ and $M$.
Longer notation $p_y(y|\theta,M)$ shows explicitly that it is a
function of $y$.
- In Bayes rule, the term $p(y|\theta,M)$ is called *likelihood
function*. Posterior distribution describes the probability (or
probability density) for different values of $\theta$ given a fixed
$y$, and thus when the posterior is computed the terms on the right
hand side (in Bayes rule) are also evaluated as a function of
$\theta$ given fixed $y$. Longer notation $p_\theta(y|\theta,M)$
shows explicitly that it is a function of $\theta$. The term has its
own name (likelihood) to distinguish it from the model. The
likelihood function is an unnormalized probability distribution
describing uncertainty related to $\theta$ (and that's why Bayes
rule has the normalization term to get the posterior distribution).
### Two types of uncertainty
Epistemic and aleatory uncertainty are reviewed nicely in the article:
[Tony O'Hagan, ``Dicing with the unknown'' Significance 1(3):132-133,
2004.](http://onlinelibrary.wiley.com/doi/10.1111/j.1740-9713.2004.00050.x/abstract)
In that paper, there is one typo using the word *aleatory* instead of
*epistemic* (if you notice this, it's then quite obvious).
### Transformation of variables
- See [BDA3 p. 21](https://users.aalto.fi/~ave/BDA3.pdf#page=31)
### Ambiguous notation in statistics
- In $p(y|\theta)$
- $y$ can be variable or value
- we could clarify by using $p(Y|\theta)$ or $p(y|\theta)$
- $\theta$ can be variable or value
- we could clarify by using $p(y|\Theta)$ or $p(y|\theta)$
- $p$ can be a discrete or continuous function of $y$ or $\theta$
- we could clarify by using $P_Y$, $P_\Theta$, $p_Y$ or
$p_\Theta$
- $P_Y(Y|\Theta=\theta)$ is a probability mass function, sampling
distribution, observation model
- $P(Y=y|\Theta=\theta)$ is a probability
- $P_\Theta(Y=y|\Theta)$ is a likelihood function (can be discrete
or continuous)
- $p_Y(Y|\Theta=\theta)$ is a probability density function,
sampling distribution, observation model
- $p(Y=y|\Theta=\theta)$ is a density
- $p_\Theta(Y=y|\Theta)$ is a likelihood function (can be discrete
or continuous)
- $y$ and $\theta$ can also be mix of continuous and discrete
- Due to the sloppiness, sometimes the likelihood is used to refer to
$P_{Y,\theta}(Y|\Theta)$ or $p_{Y,\theta}(Y|\Theta)$
### Exchangeability
You don't need to understand or use the term exchangeability before
Chapter 5 and Lecture 7. At this point and until Chapter 5 and Lecture
7, it is sufficient that you know that 1) independence is a stronger
condition than exchangeability, 2) independence implies exchangeability,
3) exchangeability does not imply independence, 4) exchangeability is
related to what information is available instead of the properties of
unknown underlying data generating mechanism. If you want to know more
about exchangeability right now, then read BDA3 Section 5.2 and
[notes for Chapter 5](#ch5).
## Chapter 2 Single-parameter models {#ch2}
[Chapter 2](https://users.aalto.fi/~ave/BDA3.pdf#page=39) is related to the prerequisites and Lecture 2 *Basics of Bayesian inference*.
### Outline
- 2.1 Binomial model (e.g. biased coin flipping)
- 2.2 Posterior as compromise between data and prior information
- 2.3 Posterior summaries
- 2.4 Informative prior distributions (skip exponential families and
sufficient statistics)
- 2.5 Gaussian model with known variance
- 2.6 Other single parameter models
- in this course the normal distribution with known mean but
unknown variance is the most important
- glance through Poisson and exponential
- 2.7 glance through this example, which illustrates the benefits of prior
information; no need to read all the details (it's quite a long
example)
- 2.8 Noninformative priors
- 2.9 Weakly informative priors
Laplace's approach for approximating integrals is discussed in more
detail in Chapter 4.
### The most important terms
Find all the terms and symbols listed below. When reading the chapter,
write down questions related to things unclear for you or things you
think might be unclear for others.
- binomial model
- Bernoulli trial
- $\mathop{\mathrm{Bin}}$, $\binom{n}{y}$
- Laplace's law of succession
- think about which distributions the expectations in eqs. 2.7-2.8 are over
- summarizing posterior inference
- mode, mean, median, standard deviation, variance, quantile
- central posterior interval
- highest posterior density interval / region
- uninformative / informative prior distribution
- principle of insufficient reason
- hyperparameter
- conjugacy, conjugate family, conjugate prior distribution, natural
conjugate prior
- nonconjugate prior
- normal distribution
- conjugate prior for mean of normal distribution with known variance
- posterior for mean of normal distribution with known variance
- precision
- posterior predictive distribution
- normal model with known mean but unknown variance
- proper and improper prior
- unnormalized density
- difficulties with noninformative priors
- weakly informative priors
### R and Python demos
- 2.1: Binomial model and Beta posterior. [R](https://avehtari.github.io/BDA_R_demos/demos_ch2/demo2_1.html). [Python](https://github.com/avehtari/BDA_py_demos/blob/master/demos_ch2/demo2_1.ipynb).
- 2.2: Comparison of posterior distributions with different
parameter values for the Beta prior distribution. [R](https://avehtari.github.io/BDA_R_demos/demos_ch2/demo2_2.html). [Python](https://github.com/avehtari/BDA_py_demos/blob/master/demos_ch2/demo2_2.ipynb).
- 2.3: Use samples to plot histogram with quantiles, and the same
for a transformed variable. [R](https://avehtari.github.io/BDA_R_demos/demos_ch2/demo2_3.html). [Python](https://github.com/avehtari/BDA_py_demos/blob/master/demos_ch2/demo2_3.ipynb).
- 2.4: Grid sampling using inverse-cdf method. [R](https://avehtari.github.io/BDA_R_demos/demos_ch2/demo2_4.html). [Python](https://github.com/avehtari/BDA_py_demos/blob/master/demos_ch2/demo2_4.ipynb).
### Recommended exercises
Optional but recommended end of the chapter exercises in BDA3 to get a better understanding of the chapter topic:
- [2.1-2.5, 2.8, 2.9, 2.14, 2.17, 2.22](https://users.aalto.fi/~ave/BDA3.pdf#page=67) ([model solutions available for
2.1-2.5, 2.7-2.13, 2.16, 2.17,
2.20](http://www.stat.columbia.edu/~gelman/book/solutions3.pdf),
and 2.14 is in course slides)
### Posterior, credible, and confidence intervals
- *Confidence interval* is used in frequentist statistics and not used
in Bayesian statistics, but we mention it here to make that fact
explicit. Given a confidence level $\gamma$ (95% and 99% are typical
values), a *confidence interval* is a random interval which contains
the parameter being estimated $\gamma$\% of the time ([Wikipedia](https://en.wikipedia.org/wiki/Confidence_interval)).
- *Credible interval* is used in Bayesian statistics (and by choice
the acronym is CI as for the confidence interval). *Credible
interval* is defined such that an unobserved parameter value has a
particular probability $\alpha$ to fall within it
([Wikipedia](https://en.wikipedia.org/wiki/Credible_interval)).
*Credible interval* can be used to characterize any probability
distribution.
- *Posterior interval* is a credible interval specifically
characterizing posterior distribution.
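As a quick illustration of a central credible (posterior) interval computed from posterior draws; the Beta posterior below is only an assumed example:

```
# Sketch: a 90% central posterior interval computed from posterior draws
# (the Beta(8, 4) posterior here is just an assumed example).
draws <- rbeta(4000, 8, 4)
quantile(draws, probs = c(0.05, 0.95))   # 90% central posterior interval
```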
### Integration over Beta distribution
Chapter 2 has an example of analysing the ratio of girls born in Paris
1745--1770. Laplace used a binomial model and a uniform prior, which
produces a Beta distribution as the posterior distribution. Laplace
wanted to calculate $p(\theta \geq 0.5)$, which is obtained as
$$
\begin{aligned}
p(\theta \geq 0.5) &= \int_{0.5}^1 p(\theta|y,n,M) d\theta \\
&= \frac{493473!}{241945!251527!} \int_{0.5}^1 \theta^y(1-\theta)^{n-y} d\theta.
\end{aligned}
$$
Note that $\Gamma(n)=(n-1)!$. The integral has a form which is called the
*incomplete Beta function*. Bayes and Laplace had difficulties in
computing it, but nowadays there are several series and continued
fraction expressions. Furthermore, usually the normalization term is
computed via $\log(\Gamma(\cdot))$ directly without explicitly
computing $\Gamma(\cdot)$. Bayes was able to solve the integral for small
$n$ and $y$. For large $n$ and $y$, Laplace used a Gaussian
approximation of the posterior (more in Chapter 4). In this specific
case, R's `pbeta` gives the same result as Laplace's with at least 3-digit
accuracy.
### Numerical accuracy
Laplace calculated
$$
p(\theta \geq 0.5 | y, n, M) \approx 1.15 \times 10^{-42}.
$$
Correspondingly Laplace could have calculated
$$
p(\theta \geq 0.5 | y, n, M) = 1 - p(\theta \leq 0.5 | y, n, M),
$$
which in theory could be computed in R with
`1-pbeta(0.5,y+1,n-y+1)`.
In practice this fails due to the limitations of the floating point
representation used by computers. In R the largest floating point
number which is smaller than 1 is about 1-eps/2, where eps is about
$2.22 \times 10^{-16}$ (the smallest floating point number larger than 1
is 1+eps). Thus the result from `pbeta(0.5,y+1,n-y+1)` will be rounded
to 1 and $1-1=0\neq 1.15
\times 10^{-42}$. We can compute $p(\theta \geq 0.5 | y, n, M)$ in R
with `pbeta(0.5, y+1, n-y+1, lower.tail=FALSE)`.
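The following R sketch repeats this comparison using the birth counts from the Paris example above (as read off the factorial expression earlier in these notes):

```
# Sketch using the counts of the Paris births example above
# (y girls out of n births): the naive form underflows to 0, while
# lower.tail=FALSE returns the small tail probability directly.
y <- 241945; n <- 493472
1 - pbeta(0.5, y + 1, n - y + 1)                   # rounds to 0
pbeta(0.5, y + 1, n - y + 1, lower.tail = FALSE)   # approximately 1.15e-42
```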
### Highest Posterior Density interval
HPD interval is not invariant to reparametrization. Here's an
illustrative example (using R and package `HDInterval`):
```
> library(HDInterval)
> r <- exp(rnorm(1000))
> quantile(log(r),c(.05, .95))
5% 95%
-1.532931 1.655137
> log(quantile(r,c(.05, .95)))
5% 95%
-1.532925 1.655139
> hdi(log(r), credMass = 0.9)
lower upper
-1.449125 1.739169
attr(,"credMass")
[1] 0.9
> log(hdi(r, credMass = 0.9))
lower upper
-2.607574 1.318569
attr(,"credMass")
[1] 0.9
```
### Gaussian distribution in more complex models and methods
The Gaussian distribution is commonly used in mixture models, hierarchical
models, hierarchical prior structures, scale mixture distributions,
Gaussian latent variable models, Gaussian processes, Gaussian Markov
random fields, Kalman filters, proposal distributions in Monte Carlo
methods, etc.
### Predictive distribution
Often the predictive distribution is more interesting than the posterior
distribution. The posterior distribution describes the uncertainty in
the parameters (like the proportion of red chips in the bag), while the
predictive distribution also describes the uncertainty about the future
event (like which color is picked next). This difference is important,
for example, if we want to know what could happen if some treatment is
given to a patient.
In case of Gaussian distribution with known variance $\sigma^2$ the
model is
$$
\begin{aligned}
y\sim \N(\theta,\sigma^2),
\end{aligned}
$$
where
$\sigma^2$ describes aleatoric uncertainty. Using a uniform prior, the
posterior is
$$
\begin{aligned}
p(\theta|y) = \N(\theta|\bar{y},\sigma^2/n),
\end{aligned}
$$
where $\sigma^2/n$ describes epistemic uncertainty related to $\theta$.
Using a uniform prior, the posterior predictive distribution for a new
$\tilde{y}$ is
$$
\begin{aligned}
p(\tilde{y}|y) = \N(\tilde{y}|\bar{y},\sigma^2+\sigma^2/n),
\end{aligned}
$$
where the uncertainty is the sum of the epistemic ($\sigma^2/n$) and
aleatoric ($\sigma^2$) uncertainty.
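A minimal R sketch of this difference, with hypothetical data and an assumed known $\sigma$: the posterior of $\theta$ carries only the epistemic variance $\sigma^2/n$, while the posterior predictive distribution adds the aleatoric variance $\sigma^2$.

```
# Minimal sketch with assumed data and a known sigma: posterior draws of theta
# (epistemic uncertainty only) vs. posterior predictive draws of a new ytilde
# (epistemic + aleatoric uncertainty).
sigma <- 2
y <- c(9.1, 10.3, 8.7, 9.9)                                  # hypothetical observations
n <- length(y); ybar <- mean(y)
theta_draws <- rnorm(4000, ybar, sigma / sqrt(n))            # p(theta | y)
ytilde_draws <- rnorm(4000, ybar, sqrt(sigma^2 + sigma^2/n)) # p(ytilde | y)
c(sd_theta = sd(theta_draws), sd_ytilde = sd(ytilde_draws))
```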
### Non-informative and weakly informative priors
Our thinking has advanced since sections 2.8 and 2.9 were written. We're
even more strongly in favor of weakly informative priors, and in favor of
including more information in the priors. Non-informative priors are likely to
produce more unstable estimates (higher variance), and the lectures
also include examples of how seemingly non-informative priors can
actually be informative on some aspects. See further discussion and
examples in the [Prior Choice Recommendations Wiki](https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations).
The Prior Choice Recommendations Wiki will also see some further updates (we're doing
research and learning more all the time).
### Should we worry about rigged priors?
[Andrew Gelman's blog post answering worries that data analyst would
choose a too optimistic prior](http://andrewgelman.com/2017/10/04/worry-rigged-priors/).
### Prior knowledge elicitation
Prior elicitation transforms domain knowledge of various kinds into well-defined prior distributions. There are challenges in how to gather domain knowledge and how to transform it into a mathematical form. We come back to the topic later in the course. A recent review, [*"Prior Knowledge Elicitation: The Past, Present, and Future"* by Mikkola et al. (2023)](https://doi.org/10.1214/23-BA1381), provides more information for those interested in going beyond the material in this course.
### Exchangeability
You don't need to understand or use the term exchangeability before
Chapter 5 and Lecture 7. At this point and until Chapter 5 and Lecture
7, it is sufficient that you know that 1) independence is stronger
condition than exchangeability, 2) independence implies exchangeability,
3) exchangeability does not imply independence, 4) exchangeability is
related to what information is available instead of the properties of
unknown underlying data generating mechanism. If you want to know more
about exchangeability right now, then read BDA3 Section 5.2 and
[notes for Chapter 5](#ch5).
### The number of left-handed students in the class
- What we know and don't know
- $N=L+R$ is the total number of students in the lecture hall,
$N$ is known in the beginning
- $L$ and $R$ are the number of left and right handed students, not known before we start asking
- $n=l+r$ is the total number of students we have asked
- $l$ and $r$ are the numbers of left and right handed students from the students we asked
- we also know that $l \leq L \leq (N-r)$ and $r \leq R \leq (N-l)$
- After observing $n$ students with $l$ left handed, what do we know about $L$?
- We define $L=l+\tilde{l}$, where $\tilde{l}$ is the unobserved number of left handed students among those who we did not yet ask
- Posterior distribution for $\theta$ is $\Beta(\alpha+l, \beta+r)$
- Posterior predictive distribution for $\tilde{l}$ is
$\BetaBinomial(\tilde{l} | N-n, \alpha+l, \beta+r)=\int_0^1\Bin(\tilde{l} | N-n, \theta)\Beta(\theta | \alpha+l, \beta+r)d\theta$
- Eventually, when we have asked everyone, $n=N$, there is no uncertainty about the number of left-handed students present, and $l=L$ and $\tilde{l}=0$. There is still uncertainty about $\theta$, but that is relevant only if we would like to make predictions beyond the students in the lecture hall.
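The Beta-Binomial posterior predictive above is easy to simulate by first drawing $\theta$ and then $\tilde{l}$; here is a short R sketch in which the class size, the observed counts, and the Beta prior parameters are all assumed example values.

```
# Sketch with assumed numbers: posterior predictive for the number of
# left-handed students among those not yet asked.
N <- 100; n <- 20; l <- 3; r <- n - l                           # assumed class size and counts
alpha <- 2; beta <- 10                                          # assumed Beta prior parameters
theta_draws <- rbeta(4000, alpha + l, beta + r)                 # posterior draws of theta
ltilde_draws <- rbinom(4000, size = N - n, prob = theta_draws)  # Beta-Binomial draws of ltilde
L_draws <- l + ltilde_draws                                     # posterior draws of the total L
quantile(L_draws, c(0.05, 0.5, 0.95))
```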
## Chapter 3 Introduction to multiparameter models {#ch3}
[Chapter 3](https://users.aalto.fi/~ave/BDA3.pdf#page=73) is related to the Lecture 3 *Multidimensional posterior*.
### Outline
- 3.1 Marginalisation
- 3.2 Normal distribution with a noninformative prior (very important)
- 3.3 Normal distribution with a conjugate prior (very important)
- 3.4 Multinomial model (can be skipped)
- 3.5 Multivariate normal with known variance (needed later)
- 3.6 Multivariate normal with unknown variance (glance through)
- 3.7 Bioassay example (very important, related to one of the
exercises)
- 3.8 Summary (summary)
The normal model is used a lot as a building block of the models in the
later chapters, so it is important to learn it now. The bioassay example is
a good example used to illustrate many important concepts, and it is used
in several exercises during the course.
### The most important terms
Find all the terms and symbols listed below. When reading the chapter,
write down questions related to things unclear for you or things you
think might be unclear for others.
- marginal distribution/density
- conditional distribution/density
- joint distribution/density
- nuisance parameter
- mixture
- normal distribution with a noninformative prior
- normal distribution with a conjugate prior
- sample variance
- sufficient statistics
- $\mu$, $\sigma^2$, $\bar{y}$, $s^2$
- a simple normal integral
- $\Invchisqr$
- factored density
- $t_{n-1}$
- degrees of freedom
- posterior predictive distribution
- $\NInvchisqr$
- variance matrix $\Sigma$
- nonconjugate model
- generalized linear model
- binomial model
- logistic transformation
- density at a grid
### R and Python demos
- 3.1: Visualize joint density and marginal densities of posterior
of normal distribution with unknown mean and variance. [R](https://avehtari.github.io/BDA_R_demos/demos_ch3/demo3_1.html). [Python](https://github.com/avehtari/BDA_py_demos/blob/master/demos_ch3/demo3_1.ipynb).
- 3.2: Visualize factored sampling and corresponding marginal and
conditional density. [R](https://avehtari.github.io/BDA_R_demos/demos_ch3/demo3_2.html). [Python](https://github.com/avehtari/BDA_py_demos/blob/master/demos_ch3/demo3_2.ipynb).
- 3.3: Visualize marginal distribution of $\mu$ as a mixture of
normals. [R](https://avehtari.github.io/BDA_R_demos/demos_ch3/demo3_3.html). [Python](https://github.com/avehtari/BDA_py_demos/blob/master/demos_ch3/demo3_3.ipynb).
- 3.4: Visualize sampling from the posterior predictive
distribution. [R](https://avehtari.github.io/BDA_R_demos/demos_ch3/demo3_4.html). [Python](https://github.com/avehtari/BDA_py_demos/blob/master/demos_ch3/demo3_4.ipynb).
- 3.5: Visualize Newcomb's data. [R](https://avehtari.github.io/BDA_R_demos/demos_ch3/demo3_5.html). [Python](https://github.com/avehtari/BDA_py_demos/blob/master/demos_ch3/demo3_5.ipynb).
- 3.6: Visualize posterior in bioassay example. [R](https://avehtari.github.io/BDA_R_demos/demos_ch3/demo3_6.html). [Python](https://github.com/avehtari/BDA_py_demos/blob/master/demos_ch3/demo3_6.ipynb).
### Recommended exercises
Optional but recommended end of the chapter exercises in BDA3 to get a better understanding of the chapter topic:
- [3.2, 3.3, 3.9](https://users.aalto.fi/~ave/BDA3.pdf#page=89) (model solutions available for 3.1-3.3, 3.5, 3.9, 3.10)
### Conjugate prior for normal distribution
[BDA3 p. 67](https://users.aalto.fi/~ave/BDA3.pdf#page=77)
mentions that the conjugate prior for normal
distribution has to have a product form $p(\sigma^2)p(\mu|\sigma^2)$.
The book refers to (3.2) and the following discussion. As an additional
hint, it is useful to think about the relation of the terms $(n-1)s^2$ and
$n(\bar{y}-\mu)^2$ in (3.2) to equations (3.3) and (3.4).
### Trace of square matrix
The trace of a square matrix, denoted $\trace$,
$\tr A$, $\trace(A)$, or
$\tr(A)$, is the sum of its diagonal elements. To derive
equation (3.11), the following property has been used:
$\tr(ABC) = \tr(CAB) = \tr(BCA)$.
### History and naming of distributions
See [Earliest Known Uses of Some of the Words of Mathematics](http://jeff560.tripod.com/mathword.html).
### Using Monte Carlo to obtain draws from the posterior of generated quantities
Chapter 3 discusses closed form posteriors for binomial and normal
models given conjugate priors. These are also used as part of the
assignment. The assignment also requires forming a posterior for derived
quantities, and these posteriors don't have closed forms (so no need to
try to derive them). As we know how to sample from the posteriors of the
binomial and normal models, we can use these posterior draws to get
draws from the posterior of a derived quantity.
For example, given posteriors $p(\theta_1|y_1)$ and $p(\theta_2|y_2)$ we
want to find the posterior for the difference
$p(\theta_1-\theta_2|y_1,y_2)$.
1. Sample $\theta_1^s$ from $p(\theta_1|y_1)$ and $\theta_2^s$ from
$p(\theta_2|y_2)$, and compute posterior draws of the derived
quantity as $\delta^s=\theta_1^s-\theta_2^s$ ($s=1,\ldots,S$).
2. The $\delta^s$ are then draws from $p(\delta|y_1,y_2)$, and they can
be used to illustrate the posterior $p(\delta|y_1,y_2)$ with a
histogram, and to compute the posterior mean, sd, and quantiles.
This is one reason why Monte Carlo approaches are so commonly used.
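For example, with two binomial observation models and Beta(1,1) priors (the counts below are made up for illustration), the posterior of the difference of the two proportions can be obtained as follows.

```
# Sketch with assumed data: posterior of the difference of two binomial
# proportions, obtained by combining independent draws from two Beta posteriors.
S <- 4000
theta1 <- rbeta(S, 1 + 10, 1 + 40)   # e.g. 10 successes out of 50, Beta(1,1) prior
theta2 <- rbeta(S, 1 + 20, 1 + 30)   # e.g. 20 successes out of 50, Beta(1,1) prior
delta <- theta1 - theta2             # draws from the posterior of the difference
hist(delta)
c(mean = mean(delta), sd = sd(delta), quantile(delta, c(0.025, 0.975)))
```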
### The number of required Monte Carlo draws
This will be discussed in Lecture 4 and Chapter 10. Meanwhile, e.g., 1000 draws is
sufficient.
### Bioassay
The bioassay example is an example of a very common statistical inference
task, typical, for example, in medicine, pharmacology, health care,
cognitive science, genomics, industrial processes, etc.
The example is from Racine et al. (1986) (see the reference at the end of
BDA3). A Swiss company classifies chemicals into different
toxicity categories defined by authorities (like the EU). Toxicity
classification is based on the lethal dose 50% (LD50), which tells what
amount of the chemical kills 50% of the subjects. The smaller the LD50, the
more lethal the chemical is. The original paper mentions the \"1983 Swiss poison
Regulation\", which defines the following categories for chemicals orally
given to rats (LD50 in mg/ml):

| Class | LD50      |
|-------|-----------|
| 1     | <5        |
| 2     | 5-50      |
| 3     | 50-500    |
| 4     | 500-2000  |
| 5     | 2000-5000 |

To reduce the number of rats needed in the experiments, the company
started to use Bayesian methods. The paper mentions that in those days
using just 20 rats to define the classification was considered very few. The
book gives LD50 in log(g/ml). When the result from 3.6 is transformed to
the mg/ml scale, we see that the mean LD50 is about 900 and
$p(500<\text{LD50}<2000)\approx 0.99$. Thus, the tested chemical can be
classified as category 4 toxic.
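As a rough sketch of the transformation to the mg/ml scale: in the bioassay model of BDA3, LD50 $=-\alpha/\beta$ on the log dose (log(g/ml)) scale, so posterior draws can be transformed with `exp()` plus a unit conversion. The draws below are hypothetical placeholders only; in practice you would use the posterior draws of $\alpha$ and $\beta$ (e.g. from demo 3.6).

```
# Sketch of the LD50 transformation described above. alpha_draws and beta_draws
# stand for posterior draws of the bioassay parameters; the values below are
# hypothetical placeholders, not the real posterior.
alpha_draws <- rnorm(4000, 1, 1)                   # placeholder draws
beta_draws <- abs(rnorm(4000, 10, 5))              # placeholder draws, restricted to beta > 0
ld50_log_gml <- -alpha_draws / beta_draws          # LD50 on the log(g/ml) scale
ld50_mgml <- exp(ld50_log_gml) * 1000              # transform to mg/ml
mean(ld50_mgml > 500 & ld50_mgml < 2000)           # P(500 < LD50 < 2000)
```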
Note that chemical testing is moving away from using rats and other
animals to using, for example, human cells grown in chips, tissue models,
and human blood cells. The human-cell-based approaches are also more
accurate in predicting the effects in humans.
The $\logit$ transformation can be justified information-theoretically
when the binomial likelihood is used.
Code in demo 3.6 ([R](https://avehtari.github.io/BDA_R_demos/demos_ch3/demo3_6.html), [Python](https://github.com/avehtari/BDA_py_demos/blob/master/demos_ch3/demo3_6.ipynb)) can be helpful in exercises related to Bioassay
example.
### Bayesian vs. frequentist statements in two group comparisons
When asked to compare groups, some students get confused because
frequentist testing is quite different. Frequentist testing often
focuses on a) differently named tests for different models and b) null
hypothesis testing. In Bayesian inference a) the same Bayes rule and
investigation of the posterior is used for all models, and b) null hypothesis
testing is less common. We come back later to decision making given the
posterior and a utility/cost function (Lecture 10.1) and to null hypothesis
testing (Lecture 12.1). For now it is assumed you will report the posterior
(e.g. a histogram), possible summaries, and report what you can infer from
that. Specifically, as in this assignment the group comparisons are based
on a continuous model parameter, the probability of a difference of exactly 0
is 0 (the later Lecture 12.1 covers null hypothesis testing). Instead of
forcing a dichotomous answer (yes/no) about whether there is a difference,
report the whole posterior, which also tells how big that difference might be.
What "big" means depends on the application, which brings us back to the
importance of domain expertise. You are not experts on the
application examples used in the assignment, but you can think about how
you would report what you have learned to a domain expert.
[Frank Harrell's recommendations how to state results in two group
comparisons](https://hbiostat.org/blog/post/bayes-freq-stmts/) are excellent.
### Unimodal and multimodal
From [Wikipedia Unimodality](https://en.wikipedia.org/wiki/Unimodality):
- In statistics, a unimodal probability distribution or unimodal distribution is a probability distribution which has a single peak. The term "mode" in this context refers to any peak of the distribution.
- If it has more modes it is "bimodal" (2), "trimodal" (3), etc., or in general, "multimodal".
BDA3 Section 2.3 discusses posterior summaries and illustrates bimodal distribution in Figure 2.2.
## Chapter 4 Asymptotics and connections to non-Bayesian approaches {#ch4}
[Chapter 4](https://users.aalto.fi/~ave/BDA3.pdf#page=93) is related to the Lecture 11 *Normal approximation,
frequency properties*.
### Outline
- 4.1 Normal approximation (Laplace's method)
- 4.2 Large-sample theory
- 4.3 Counterexamples
- 4.4 Frequency evaluation (not part of the course, but interesting)
- 4.5 Other statistical methods (not part of the course, but interesting)
Normal approximation is often used as part of posterior
computation (more about this in [Chapter 13](#ch13), which is not a
part of the course).
### The most important terms
Find all the terms and symbols listed below. When reading the chapter,
write down questions related to things unclear for you or things you
think might be unclear for others.
- sample size
- asymptotic theory
- normal approximation
- quadratic function
- Taylor series expansion
- observed information
- positive definite
- why $\log \sigma$?
- Jacobian of the transformation
- point estimates and standard errors
- large-sample theory
- asymptotic normality
- consistency
- underidentified
- nonidentified
- number of parameters increasing with sample size
- aliasing
- unbounded likelihood
- improper posterior
- edge of parameter space
- tails of distribution
### R and Python demos
- 4.1: Bioassay example [R](https://avehtari.github.io/BDA_R_demos/demos_ch4/demo4_1.html). [Python](https://github.com/avehtari/BDA_py_demos/blob/master/demos_ch4/demo4_1.ipynb).
### Normal approximation
Other normal posterior approximations are discussed in [Chapter
13](#ch13). For example, variational and expectation propagation methods
improve the approximation by global fitting instead of using just the
curvature at the mode. The normal approximation at the mode is often
also called the Laplace method, as Laplace used it first.
Several researchers have provided partial proofs that the posterior
converges towards a normal distribution. In the mid 20th century Le
Cam was the first to provide a rigorous proof.
### Observed information
When $n\rightarrow\infty$, the posterior distribution approaches a
normal distribution. As the log density of the normal is a
quadratic function, the higher derivatives of the log posterior
approach zero. The curvature at the mode describes the information
only in the case of asymptotic normality. In the case of the normal
distribution, the curvature also describes the width of the
normal. The information matrix is a *precision matrix*, which is the
inverse of a covariance matrix.
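A small R sketch of using the curvature at the mode, here for a Beta posterior with assumed parameters so that the exact answer is available for comparison: the observed information is estimated with a finite-difference second derivative, and its inverse gives the variance of the normal approximation.

```
# Sketch: normal approximation at the mode for a Beta(a, b) posterior
# (a and b are assumed example values). The observed information is minus
# the second derivative of the log posterior at the mode; its inverse
# approximates the posterior variance.
a <- 8; b <- 4
log_post <- function(theta) dbeta(theta, a, b, log = TRUE)
theta_mode <- (a - 1) / (a + b - 2)
h <- 1e-5
info <- -(log_post(theta_mode + h) - 2 * log_post(theta_mode) + log_post(theta_mode - h)) / h^2
c(approx_sd = sqrt(1 / info),
  exact_sd = sqrt(a * b / ((a + b)^2 * (a + b + 1))))
```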
### Aliasing
In Finnish: valetoisto.
Aliasing is a special case of under-identifiability, where the likelihood
repeats in separate points of the parameter space. That is, the likelihood
gets exactly the same values and has the same shape, although possibly
mirrored or otherwise projected. For example, the following mixture model
$$
p(y_i|\mu_1,\mu_2,\sigma_1^2,\sigma_2^2,\lambda)=\lambda\N(y_i|\mu_1,\sigma_1^2)+(1-\lambda)\N(y_i|\mu_2,\sigma_2^2),
$$
has two normal components with their own means and variances. With
probability $\lambda$ the observation comes from $\N(\mu_1,\sigma_1^2)$ and
with probability $1-\lambda$ from $\N(\mu_2,\sigma_2^2)$. This kind of
model could be used, for example, for Newcomb's data, so that the
second normal component would model faulty measurements. The model
does not state which of the components, 1 or 2, models the good
measurements and which models the faulty measurements. Thus it is
possible to interchange the values of $(\mu_1,\mu_2)$ and
$(\sigma_1^2,\sigma_2^2)$ and replace $\lambda$ with $(1-\lambda)$ to
get an equivalent model. The posterior distribution then has two modes
which are mirror images of each other. When $n\rightarrow\infty$ the modes
get narrower, but the posterior does not converge to a single
point.
If we can integrate over the whole posterior, aliasing is not a
problem. However, aliasing makes approximate inference more
difficult.
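A tiny R sketch of this label switching (the parameter values are arbitrary examples): swapping the components and replacing $\lambda$ by $1-\lambda$ leaves the mixture density unchanged, which is exactly the aliasing described above.

```
# Sketch of the aliasing in the mixture above: swapping the component labels
# and replacing lambda with 1 - lambda gives exactly the same density.
# The parameter values are arbitrary examples.
dmix <- function(y, mu1, mu2, sigma1, sigma2, lambda) {
  lambda * dnorm(y, mu1, sigma1) + (1 - lambda) * dnorm(y, mu2, sigma2)
}
y <- seq(-5, 10, length.out = 7)
all.equal(dmix(y, mu1 = 0, mu2 = 5, sigma1 = 1, sigma2 = 2, lambda = 0.3),
          dmix(y, mu1 = 5, mu2 = 0, sigma1 = 2, sigma2 = 1, lambda = 0.7))  # TRUE
```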
### Frequency property vs. frequentist
Bayesians can evaluate frequency properties of Bayesian estimates
without being frequentist. For Bayesians the starting point is the
Bayes rule and decision theory. Bayesians care more about efficiency
than unbiasedness. For frequentists the starting point is to find an
estimator with desired frequency properties and quite often
unbiasedness is chosen as the first restriction.
### Transformation of variables
See [BDA3 p. 21](https://users.aalto.fi/~ave/BDA3.pdf#page=31)
for the explanation how to derive densities for transformed
variables. This explains, for example, why uniform prior
$p(\log(\sigma^2))\propto 1$ for $\log(\sigma^2)$ corresponds to prior
$p(\sigma^2)=\sigma^{-2}$ for $\sigma^{2}$.
### On derivation
Here's a reminder of how to differentiate with respect to a transformed variable $g(\theta)$. For example,
$$
\frac{d}{d\log\sigma}\sigma^{-2}=-2 \sigma^{-2}
$$
is easily solved by setting $z=\log\sigma$ to get
$$
\frac{d}{dz}\exp(z)^{-2}=-2\exp(z)^{-3}\exp(z)=-2\exp(z)^{-2}=-2\sigma^{-2}.
$$
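A quick numerical check of this derivative in R (the value of $\sigma$ is arbitrary):

```
# Numerical check (sketch) of the derivative above: d/d(log sigma) of sigma^(-2)
# should equal -2 * sigma^(-2). The test value of sigma is arbitrary.
sigma <- 1.7
z <- log(sigma)
h <- 1e-6
(exp(z + h)^(-2) - exp(z - h)^(-2)) / (2 * h)  # central finite difference
-2 * sigma^(-2)                                # analytical result
```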
## Chapter 5 Hierarchical models {#ch5}
[Chapter 5](https://users.aalto.fi/~ave/BDA3.pdf#page=111) is related to the Lecture 7 *Hierarchical models and exchangeability*.
### Outline
- 5.1 Lead-in to hierarchical models
- 5.2 Exchangeability (a useful theoretical concept)
- 5.3 Bayesian analysis of hierarchical models (discusses factorized computation which can be skipped)
- 5.4 Hierarchical normal model (discusses factorized computation which can be skipped)
- 5.5 Example: parallel experiments in eight schools (useful discussion, skip the details of computation)
- 5.6 Meta-analysis (can be skipped in this course)
- 5.7 Weakly informative priors for hierarchical variance parameters
(more recent discussion can be found in [Prior Choice Recommendation Wiki](https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations))
The hierarchical models in the chapter are simple to keep computation
simple. More advanced computational tools are presented in Chapters
[10](#ch10), [11](#ch11) and [12](#ch12) (part of the course), and [13](#ch13) (not part of the course).
### The most important terms
Find all the terms and symbols listed below. When reading the chapter,
write down questions related to things unclear for you or things you
think might be unclear for others.
- population distribution
- hyperparameter
- hierarchical model
- exchangeability
- invariant to permutations
- independent and identically distributed
- ignorance
<!-- - de Finetti's theorem -->
- partially exchangeable
- conditionally exchangeable
- conditional independence
- hyperprior
- different posterior predictive distributions
- the conditional probability formula
### R and Python demos
- 5.1: Rats example. [R](https://avehtari.github.io/BDA_R_demos/demos_ch5/demo5_1.html). [Python](https://github.com/avehtari/BDA_py_demos/blob/master/demos_ch5/demo5_1.ipynb).
- 5.2: SAT example. [R](https://avehtari.github.io/BDA_R_demos/demos_ch5/demo5_2.html). [Python](https://github.com/avehtari/BDA_py_demos/blob/master/demos_ch5/demo5_2.ipynb).
### Recommended exercises
Optional but recommended end of the chapter exercises in BDA3 to get a better understanding of the chapter topic:
- [5.1 and 5.2](https://users.aalto.fi/~ave/BDA3.pdf#page=144) ([model solution available for 5.3-5.5,
5.7-5.12](http://www.stat.columbia.edu/~gelman/book/solutions3.pdf))
### Computation
Examples in BDA3 Sections 5.3 and 5.4 continue the computation with factorization
and a grid, but there is no need to go deep into the computational details, as
in the assignments you will use MCMC and Stan instead.
### Exchangeability vs. independence
Exchangeability and independence are two separate concepts; neither
necessarily implies the other. Independent identically distributed
variables/parameters are exchangeable. Exchangeability is a less strict
condition than independence. Often we may assume that observations or
unobserved quantities are in fact dependent, but if we can't get
information about these dependencies we may treat those observations or
unobserved quantities as exchangeable: ``ignorance implies
exchangeability.''
In the case of exchangeable observations, we may sometimes act *as if*
the observations were independent if the additional potential information
gained from the dependencies is very small. This is related to de
Finetti's theorem
([BDA3 p. 105](https://users.aalto.fi/~ave/BDA3.pdf#page=115)),
which applies formally only when
$J\rightarrow\infty$, but in practice the difference may be small if $J$ is
finite but relatively large (see the examples below).
- If no other information than data $y$ is available to distinguish
$\theta_j$ from each other and parameters can not be ordered or
grouped, we may assume symmetry between parameters in their prior
distribution
- This symmetry can be represented with exchangeability
- Parameters $\theta_1,\ldots,\theta_J$ are exchangeable in their
joint distribution if $p(\theta_1,\ldots,\theta_J)$ is invariant to
permutation of indexes $(1,\ldots,J)$
Here are some examples you may consider.
Ex 5.1. Exchangeability with known model parameters: For each of the
following three examples, answer: (i) Are observations $y_1$ and $y_2$
exchangeable? (ii) Are observations $y_1$ and $y_2$ independent? (iii)
Can we act *as if* the two observations are independent?
1. A box has one black ball and one white ball. We pick a ball $y_1$ at
random, put it back, and pick another ball $y_2$ at random.
2. A box has one black ball and one white ball. We pick a ball $y_1$ at
random, we do not put it back, then we pick ball $y_2$.
3. A box has a million black balls and a million white balls. We pick a
ball $y_1$ at random, we do not put it back, then we pick ball $y_2$
at random.
Ex 5.2. Exchangeability with unknown model parameters: For each of the
following three examples, answer: (i) Are observations $y_1$ and $y_2$
exchangeable? (ii) Are observations $y_1$ and $y_2$ independent? (iii)
Can we act *as if* the two observations are independent?
1. A box has $n$ black and white balls but we do not know how many of
each color. We pick a ball $y_1$ at random, put it back, and pick
another ball $y_2$ at random.
2. A box has $n$ black and white balls but we do not know how many of
each color. We pick a ball $y_1$ at random, we do not put it back,
then we pick ball $y_2$ at random.
3. Same as 2, but we know that there are many balls of each color in
the box.
Note that, for example, in opinion polls the balls, i.e. humans, are not put
back, and there is a large but finite number of humans.
The following complements the divorce example in the book by discussing the
effect of additional observations.
- Example: divorce rate per 1000 population in 8 states of the USA in
1981
- without any other knowledge $y_1,\ldots,y_8$ are exchangeable
- it is reasonable to assume prior independence given the population
distribution $p(y_i|\theta)$
- The divorce rates in the first seven states are $5.6, 6.6, 7.8, 5.6,
7.0, 7.2, 5.4$
- now we have some additional information, but changing the
indexing still does not affect the joint distribution. For example, if
we were told that the divorce rates were not for the first seven but
the last seven states, it would not change the joint distribution,
and thus $y_1,\ldots,y_8$ are exchangeable
- a sensible assumption is prior independence given the population
distribution $p(y_i|\theta)$
- if the \"true\" $\theta_0$ were known, $y_1,\ldots,y_8$ would be
independent given the \"true\" $\theta_0$
- since $\theta$ is estimated using the observations, the $y_i$ are a
posteriori dependent, which is obvious, e.g., from the predictive
density $p(y_8|y_1,\ldots,y_7,M)$; e.g. if $y_1,\ldots,y_7$
are large then probably $y_8$ is large
- if we were told that the given rates were for the last seven states,
then $p(y_1|y_2,\ldots,y_8,M)$ would be exactly the same as
$p(y_8|y_1,\ldots,y_7,M)$ above, i.e. changing the indexing does
not have an effect since $y_1,\ldots,y_8$ are exchangeable
- Additionally we know that $y_8$ is Nevada and the rates of the other
states are $5.6, 6.6, 7.8, 5.6, 7.0, 7.2, 5.4$
- based on what we were told about Nevada, the predictive density
$p(y_8|y_1,\ldots,y_7,M)$ should take into account that the
probability $p(y_8>\max(y_1,\ldots,y_7)|y_1,\ldots,y_7)$ should
be large
- if we were told that Nevada is $y_3$ (not $y_8$ as above), then
the new predictive density $p(y_8|y_1,\ldots,y_7,M)$ would be
different, because $y_1,\ldots,y_8$ are no longer exchangeable
### What if observations are not exchangeable
Often observations are not fully exchangeable, but are partially or
conditionally exchangeable. Two basic cases:
- If observations can be grouped, we may make a hierarchical model, where
each group has its own subpart, but the group properties are unknown. If
we assume that the group properties are exchangeable, we may use a common
prior for the group properties.
- If $y_i$ has additional information $x_i$, then $y_i$ are not
exchangeable, but $(y_i,x_i)$ still are exchangeable, and then we can
make a joint model for $(y_i,x_i)$ or a conditional model $p(y_i|x_i)$.
Here are additional examples (Bernardo & Smith, Bayesian Theory, 1994),
which illustrate the above basic cases. Think of an old-fashioned thumb
tack. This kind of pin can land flat on its base or slanted so that the
pin head and the edge of the base touch the table. This kind of pin
represents a realistic version of an \"unfair\" coin.
- Let's throw the pin $n$ times and mark $x_i=1$ when the pin stands on its
base. Let's assume that the throwing conditions stay the same all the time.
Most would accept the throws as exchangeable.
- Same experiment, but odd-numbered throws are made with a full
metal pin and even-numbered throws with a plastic-coated pin. Most
would accept exchangeability for all odd and all even throws
separately, but not necessarily for both series combined. Thus we
have partial exchangeability.
- Laboratory experiments $x_1,...,x_n$ are real-valued measurements
of a chemical property of some substance. If all experiments
are from the same sample, in the same laboratory with the same
procedure, most would accept exchangeability. If the experiments were
made, for example, in different laboratories, we could assume partial
exchangeability.
- $x_1,...,x_n$ are real-valued measurements of the physiological
reactions to a certain medicine. Different test persons get different
amounts of the medicine. Test persons are males and females of different
ages. If the attributes of the test persons were known, most would
not accept the results as exchangeable. In a group with a certain dose,
sex, and age, the measurements could be assumed exchangeable. We
could use grouping, or if the doses and attributes are continuous we
could use regression, that is, assume conditional independence.
### Weakly informative priors for hierarchical variance parameters
Our thinking has advanced since Section 5.7 was written. Section 5.7
([BDA3 p. 128--](https://users.aalto.fi/~ave/BDA3.pdf#page=138))
recommends the use of the half-Cauchy as a weakly informative
prior for hierarchical variance parameters. A more recent recommendation
is half-normal if you have substantial information on the high-end
values, or half-$t_4$ if there might be a possibility of
surprises. Often we don't have so much prior information that we would
be able to define the exact tail shape of the prior well, but
half-normal usually produces more sensible prior predictive
distributions and is thus better justified. Half-normal also
usually leads to easier inference.
See the [Prior Choice Wiki](https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations)
for more recent general discussion and model specific recommendations.
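The difference in tail behaviour is easy to see from prior draws; here is a short R sketch (the common scale of 5 is an arbitrary example value):

```
# Sketch: prior draws from half-Cauchy and half-normal priors with the same
# scale (the scale 5 is an arbitrary example), illustrating the much heavier
# tail of the half-Cauchy for a hierarchical scale parameter tau.
tau_half_cauchy <- abs(rcauchy(1e5, location = 0, scale = 5))
tau_half_normal <- abs(rnorm(1e5, mean = 0, sd = 5))
rbind(half_cauchy = quantile(tau_half_cauchy, c(0.5, 0.9, 0.99)),
      half_normal = quantile(tau_half_normal, c(0.5, 0.9, 0.99)))
```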
## Chapter 6 Model checking {#ch6}
[Chapter 6](https://users.aalto.fi/~ave/BDA3.pdf#page=151) is related to the Lecture 8 *Model checking & cross-validation*.
### Outline
- 6.1 The place of model checking in applied Bayesian statistics
- 6.2 Do the inferences from the model make sense?
- 6.3 Posterior predictive checking ($p$-values can be skipped)