---
fig-width: 4
fig-height: 3
fig-dpi: 150
fig-format: png
engine: julia
julia:
  exeflags: ["--project"]
---
# GLMM log-likelihood {#sec-GLMMdeviance}
\newcommand\bbb{{\mathbf{b}}}
\newcommand\bbg{{\mathbf{g}}}
\newcommand\bbI{{\mathbf{I}}}
\newcommand\bbL{{\mathbf{L}}}
\newcommand\bbo{{\mathbf{o}}}
\newcommand\bbr{{\mathbf{r}}}
\newcommand\bbu{{\mathbf{u}}}
\newcommand\bbW{{\mathbf{W}}}
\newcommand\bbX{{\mathbf{X}}}
\newcommand\bby{{\mathbf{y}}}
\newcommand\bbZ{{\mathbf{Z}}}
\newcommand\bbzero{{\mathbf{0}}}
\newcommand\bbbeta{{\boldsymbol{\beta}}}
\newcommand\bbeta{{\boldsymbol{\eta}}}
\newcommand\bbLambda{{\boldsymbol{\Lambda}}}
\newcommand\bbmu{{\boldsymbol{\mu}}}
\newcommand\bbtheta{{\boldsymbol{\theta}}}
\newcommand\mcB{{\mathcal{B}}}
\newcommand\mcD{{\mathcal{D}}}
\newcommand\mcI{{\mathcal{I}}}
\newcommand\mcL{{\mathcal{L}}}
\newcommand\mcN{{\mathcal{N}}}
\newcommand\mcU{{\mathcal{U}}}
\newcommand\mcX{{\mathcal{X}}}
\newcommand\mcY{{\mathcal{Y}}}
The log-likelihood for a linear mixed model (LMM) was derived in @sec-lmmtheory, where some computational methods for fitting such models with [MixedModels.jl](https://github.com/JuliaStats/MixedModels.jl), by optimizing a *profiled log-likelihood*, were illustrated.
In this appendix we outline the evaluation of the log-likelihood for a generalized linear mixed model (GLMM) with a binary response, which is modelled using the [Bernoulli distribution](https://en.wikipedia.org/wiki/Bernoulli_distribution).
## The Bernoulli GLMM
The Bernoulli GLMM model defines the conditional distribution $({\mcY}|{\mcB}={\bbb})$ as independent Bernoulli random variables with expected values ${\bbmu}={\bbg}^{-1}({\bbeta})$, where ${\bbeta}={\bbX}{\bbbeta}+{\bbZ}{\bbb}$ is the *linear predictor* and $\bbg^{-1}$ is an *inverse link function*.
We will use the *logit link function*, $\bbeta=\bbg(\bbmu)$, defined component-wise from the scalar logit link, $g$, as
$$
\eta_i=g(\mu_i)=\mathrm{logit}(\mu_i)=\log\left(\frac{\mu_i}{1-\mu_i}\right)\quad i=1,\dots,n .
$$
The inverse link, $\bbmu=\bbg^{-1}(\bbeta)$, is similarly defined component-wise from the inverse of the scalar logit, which is the scalar *logistic* function,
$$
\mu_i=g^{-1}(\eta_i)=\mathrm{logistic}(\eta_i)=\frac{1}{1+e^{-\eta_i}}\quad i=1,\dots,n .
$$ {#eq-scalarlogistic}
The logit is the *canonical link function* (@sec-logitlink) for the Bernoulli distribution.
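As a quick numerical illustration of this pair of maps, the scalar link and inverse link can be written directly (a standalone, non-executed sketch; the names `logit` and `logistic` are defined here only for this check and are not used elsewhere in this appendix):
```{julia}
#| eval: false
# scalar logit link and its inverse, the logistic function
logit(μ) = log(μ / (1 - μ))        # η = g(μ)
logistic(η) = inv(1 + exp(-η))     # μ = g⁻¹(η)
logistic(logit(0.25)), logit(logistic(-1.3))  # ≈ (0.25, -1.3)
```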
As in the linear mixed model discussed in @sec-lmmtheory, the $q$-dimensional random effects, $\mcB$, are expressed as $\mcB=\bbLambda_{\bbtheta}\,\mcU$ where $\mcU$ has a standard, multivariate Gaussian distribution (@sec-multivariateGaussian)
$$
\mcU\sim\mcN(\bbzero,\bbI_q) ,
$$ {#eq-MVNUdistGLMM}
with probability density function
$$
f_{\mcU}(\bbu)=\frac{1}{\sqrt{2\pi}^q}e^{-\|\bbu\|^2/2} .
$$ {#eq-uunconddense}
For a linear mixed model the distribution of these *spherical random effects* was given as $\mcU\sim\mcN(\bbzero,\sigma^2\bbI_q)$ (@eq-sphericalre).
A dispersion parameter like $\sigma^2$ is not present in @eq-MVNUdistGLMM because the Bernoulli distribution does not have a separate dispersion parameter --- it is entirely determined by its mean.
As is the case for the linear mixed model, the *covariance factor*, $\bbLambda_{\bbtheta}$, is sparse and patterned.
It is not uncommon in practical examples, such as the one in @sec-PIRLS, for
$\bbtheta$ to be one-dimensional and $\bbLambda_{\bbtheta}=\theta\,\bbI_q$ to be a scalar multiple of the $q\times q$ identity matrix.
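Such a covariance factor could be constructed explicitly with the `LinearAlgebra` package, as in the non-executed sketch below, although in practice only the scalar $\theta$ needs to be stored. The values of `θ` and `q` here are arbitrary and purely illustrative.
```{julia}
#| eval: false
using LinearAlgebra
θ, q = 0.75, 4              # arbitrary illustrative values
Λ = θ * I(q)                # Λθ = θ Iq, stored as a Diagonal matrix
Λ * ones(q)                 # multiplication by Λθ is just scaling by θ
```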
### Log-likelihood for a Bernoulli GLMM
The likelihood for the parameters, $\bbtheta$ and $\bbbeta$, given the observed data, $\bby$, is the value of the marginal probability mass function for the response, $\mcY$, evaluated at $\bby$, the observed vector of {0,1} responses.
We obtain this value by integrating the product of the probability mass function for the conditional distribution, $(\mcY|\mcU=\bbu)$, and unconditional density of $\mcU$ (@eq-uunconddense), with respect to $\bbu$.
Recall that the probability mass for a single Bernoulli response can be written as $(1-\mu)^{1-y}\mu^y$, which is the specialization to $n=1$ of the probability mass function for the [binomial distribution](https://en.wikipedia.org/wiki/Binomial_distribution)
$$
\binom{n}{y}(1-\mu)^{n-y}\mu^y ,\quad 0\le\mu\le 1, \quad y\in\{0,\dots,n\} .
$$
Because the components of the vector-valued conditional distribution, $(\mcY|\mcU=\bbu)$, are assumed to be independent, its probability mass function can be written as the product of the probability masses for each component
$$
f_{\mcY|\mcU=\bbu}(\bby|\bbu)=\prod_{i=1}^n \left[(1-\mu_i)^{1-y_i}\mu_i^{y_i}\right]
\quad\mathrm{where}\quad
\bbmu=\bbg^{-1}(\bbX\bbbeta+\bbZ\bbLambda_{\bbtheta}\bbu) ,
$$
providing the likelihood as
$$
\begin{aligned}
L(\bbbeta,\bbtheta|\bby)&=
\int_{\bbu}f_{\mcY|\mcU=\bbu}(\bby|\bbu)f_{\mcU}(\bbu)\,d\bbu\\
&=\int_{\bbu}\frac{1}{\sqrt{2\pi}^q}e^{\sum_{i=1}^n(1-y_i)\log(1-\mu_i)+y_i\,\log(\mu_i)}
e^{-\left\|\bbu\right\|^2/2}\,d\bbu\\
&=\int_{\bbu}\frac{1}{\sqrt{2\pi}^q}\exp\left(\frac{\left\|\bbu\right\|^2+\sum_{i=1}^n d(y_i,\mu_i)}{-2}\right)\,d\bbu
\end{aligned}
$$ {#eq-GLMMlikelihood}
where the *unit deviances*, $d(y_i,\mu_i)$, are
$$
d(y_i,\mu_i)=-2\left[(1-y_i)\log(1-\mu_i)+y_i\log(\mu_i)\right]\quad i=1,\dots,n .
$$ {#eq-unitdeviances}
By converting from the logarithm of the probability mass function to the [deviance](https://en.wikipedia.org/wiki/Deviance_(statistics)) scale, which is negative twice the log-probability, we get a quantity, $\sum_{i=1}^n d(y_i,\mu_i)$, which is on the same scale as the squared length, $\|\bbu\|^2$, of a standard multivariate Gaussian.
The sum of the unit deviances is analogous to the sum of squared residuals, $\|\bby-\bbX\bbbeta\|^2$, in a linear model.
In @sec-lmmtheory we showed that the integral defining the likelihood for a linear mixed model, @eq-likelihood-integral, has an analytic solution.
In general, the integral in @eq-GLMMlikelihood does not.
We will approximate the value of this integral using a quadratic approximation to the argument of the exponential function in @eq-GLMMlikelihood at the value of $\bbu$ that maximizes the integrand, which is the density of the conditional distribution, $(\mcU|\mcY=\bby)$, up to a scale factor.
Because the scale factor does not affect the location of the maximum, the value of $\bbu$ that maximizes the integrand,
$$
\begin{aligned}
\tilde{\bbu}(\bby|\bbtheta,\bbbeta)
&=\arg\max_{\bbu}\exp\left(\frac{\left\|\bbu\right\|^2 + \sum_{i=1}^n d(y_i,\mu_i)}{-2}\right)\\
&=\arg\min_{\bbu}\left(\left\|\bbu\right\|^2 + \sum_{i=1}^n d(y_i,\mu_i)\right)
\end{aligned} ,
$$ {#eq-condmode}
is also the *conditional mode* --- the value of $\bbu$ that maximizes the conditional density.
The expression being minimized in @eq-condmode, $\left\|\bbu\right\|^2 + \sum_{i=1}^n d(y_i,\mu_i)$, is called the *penalized deviance*.
Using a quadratic approximation to the penalized deviance at this conditional mode (i.e. the mode of the conditional distribution of $\mcU$ given $\mcY=\bby$) is equivalent to using a multivariate Gaussian approximation to this conditional distribution.
Approximating an integral like @eq-GLMMlikelihood by approximating the integrand as a scaled multivariate Gaussian distribution at its mode is called [Laplace's approximation](https://en.wikipedia.org/wiki/Laplace%27s_approximation) (@TierneyKadane1986).
The *penalized iteratively re-weighted least squares* (PIRLS) algorithm (@sec-PIRLS) provides a fast and stable method of determining the conditional mode, $\tilde{\bbu}(\bby|\bbtheta,\bbbeta)$ (@eq-condmode), thereby making it feasible to use Laplace's approximation at scale.
Before discussing PIRLS, however, we will describe generalized linear models (GLMs) without random effects (@sec-BernoulliGLM), for which the *deviance* is defined as the sum of the unit deviances and the maximum likelihood estimate of the coefficient vector, $\widehat{\bbbeta}$, is the value that minimizes the deviance.
In @sec-IRLS we describe the *iteratively re-weighted least squares* (IRLS) algorithm, which is a stable, fast algorithm to minimize the deviance.
We will illustrate the IRLS algorithm with the `contra` data discussed in @sec-glmmbinomial and a model like `com05`, which was fit in that chapter, but without the random effects. Later we will use the full `com05` model to illustrate some of the computations for GLMMs.
Although 0/1 responses and the Bernoulli distribution are easy to describe, the theory of the generalized linear mixed model (GLMM) and the details of the implementation are not. Readers who wish to focus on practical applications more than on the theory should feel free to skim this appendix.
Load the packages to be used
```{julia}
#| code-fold: true
#| output: false
#| label: packagesA03
using AlgebraOfGraphics
using BenchmarkTools
using CairoMakie
using EmbraceUncertainty: dataset
using FreqTables
using LinearAlgebra
using MixedModels
using MixedModelsMakie
using NLopt
using PooledArrays
using StatsAPI
```
and define some constants
```{julia}
#| code-fold: true
#| output: false
#| label: constantsA03
@isdefined(contrasts) || const contrasts = Dict{Symbol,Any}()
@isdefined(progress) || const progress = false
```
## Generalized linear models for binary data {#sec-BernoulliGLM}
To introduce some terms and workflows we first consider the generalized linear model (GLM) for the Bernoulli distribution and the logit link. The linear predictor for a GLM, a model without random effects, is simply
$$
{\bbeta}= {\bbX}{\bbbeta} ,
$$
and the mean response vector, ${\bbmu}=\bbg^{-1}(\bbeta)$, is obtained by component-wise application of the scalar logistic function (@eq-scalarlogistic).
The probability mass function for the Bernoulli distribution is
$$
f_{\mcY}(y|\mu) = \mu^y\,(1-\mu)^{(1-y)}\quad\mathrm{for}\quad y\in\{0,1\} .
$$
Because the elements of ${\mcY}|{\bbmu}$ are assumed to be independent, the log-likelihood is simply the sum of contributions from each element, which, on the deviance scale, can be written in terms of the *unit deviances*
$$
\begin{aligned}
-2\,\ell({\bbmu}|{\mathbf{y}})&= -2\,\log(L({\bbmu}|{\mathbf{y}}))\\
&=-2\,\sum_{i=1}^n\left[y_i\log(\mu_i)+(1-y_i)\log(1-\mu_i)\right] .
\end{aligned}
$$ {#eq-Bernoulliloglik}
As described above, it is customary when working with GLMs to convert the log-likelihood to a [deviance](https://en.wikipedia.org/wiki/Deviance_(statistics)), which, for the Bernoulli distribution, is negative twice the log-likelihood. (For other distributions, the deviance may incorporate an additional term that depends only on ${\mathbf{y}}$.)
One reason for preferring the deviance scale is that the change in deviance for nested models has approximately a $\chi^2$ distribution with degrees of freedom determined by the number of independent constraints on the parameters in the simpler model.
Especially for GLMs, the deviance plays a role similar to the sum of squared residuals in linear models.
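For example, a likelihood-ratio comparison of two nested GLMs could be carried out on the deviance scale roughly as in the sketch below, which assumes the Distributions package (not loaded in this appendix) and placeholder deviances `dev_small` and `dev_big` from nested fits differing by `Δdf` coefficients.
```{julia}
#| eval: false
using Distributions                # assumed available; not loaded above
Δdev = dev_small - dev_big         # drop in deviance from the extra coefficients
pval = ccdf(Chisq(Δdf), Δdev)      # approximate likelihood-ratio-test p-value
```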
For greater numerical precision we avoid calculating $1-\mu$ directly when evaluating expressions like @eq-Bernoulliloglik and instead use
$$
1 - \mu = 1 - \frac{1}{1+e^{-\eta}}=\frac{e^{-\eta}}{1+e^{-\eta}} .
$$
Evaluating the last expression gives greater precision for large positive values of $\eta$ (corresponding to values of $\mu$ close to one, where $1-\mu$ is small) than does first evaluating $\mu$ and then subtracting it from one.
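A small, self-contained check (the value $\eta=40$ is an arbitrary large value, and the chunk is shown as a non-executed sketch):
```{julia}
#| eval: false
let η = 40.0
    μ = inv(1 + exp(-η))             # rounds to 1.0 in Float64
    naive = 1 - μ                    # 0.0: all precision lost
    stable = exp(-η) / (1 + exp(-η)) # ≈ 4.25e-18, essentially exact
    (; naive, stable)
end
```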
After some algebra, we write the *unit deviance*, $d(y_i,\eta_i)$, which is the contribution to the deviance from the $i$th observation, as
$$
\begin{aligned}
d(y_i, \eta_i)&=-2\left[y_i\log(\mu_i)+(1-y_i)\log(1-\mu_i)\right]\\
&=2\left[(1-y_i)\eta_i+\log(1+e^{-\eta_i})\right]
\end{aligned}
\quad i=1,\dots,n
$$
A Julia function to evaluate both the mean and the unit deviance can be written as
```{julia}
#| output: false
function meanunitdev(y::T, η::T) where {T<:AbstractFloat}
expmη = exp(-η)
return (; μ=inv(1 + expmη), dev=2 * ((1 - y) * η + log1p(expmη)))
end
```
::: {.callout-note collapse="true"}
### log1p
Mathematically `log1p`, read *log of 1 plus*, is defined as $\mathrm{log1p}(x)=\log(1+x)$ but it is implemented in such a way as to provide greater accuracy when $x$ is small. For example,
```{julia}
let small = eps() / 10
@show small
@show 1 + small
@show log(1 + small)
@show log1p(small)
end;
```
`1 + small` evaluates to `1.0` in floating point arithmetic because of round-off, producing 0 for the expression `log(1 + small)`, whereas `log1p(small) ≈ small`, as it should be.
:::
This function returns a `NamedTuple` of values from scalar arguments.
For example,
```{julia}
meanunitdev(0.0, 0.21)
```
A `Vector` of such `NamedTuple`s is a *row-table* (@sec-Tablesjl), which can be updated in place by [dot-vectorization](https://docs.julialang.org/en/v1/manual/functions/#man-vectorized) of the scalar `meanunitdev` function, as shown below.
### An example: fixed-effects only from com05
We illustrate some of these computations using only the fixed-effects specification for `com05`, a GLMM fit to the `contra` data set in @sec-glmmbinomial. Because we will use the full GLMM later we reproduce `com05` by loading the data, creating the binary `ch` variable indicating children/no-children, defining the contrasts and formula to be used, and fitting the model as in @sec-glmmbinomial.
```{julia}
#| code-fold: show
#| output: false
#| label: com05
contra = let tbl = dataset(:contra)
Table(tbl; ch=tbl.livch .≠ "0")
end
contrasts[:urban] = HelmertCoding()
contrasts[:ch] = HelmertCoding()
com05 =
let d = contra,
ds = Bernoulli(),
f = @formula(
use ~ 1 + urban + ch * age + age & age + (1 | dist & urban)
)
fit(MixedModel, f, d, ds; contrasts, nAGQ=9, progress)
end
```
Initialize the coefficient vector, $\bbbeta$, to a copy (in case we modify it) of the estimated fixed effects; the fixed-effects model matrix, $\bbX$, is available as `com05.X`.
```{julia}
βm05 = copy(com05.β)
```
As stated above, the `meanunitdev` function can be applied to the vectors, ${\mathbf{y}}$ and ${\bbeta}$, via dot-vectorization to produce a `Vector{NamedTuple}`, which is the typical form of a row-table.
```{julia}
rowtbl = meanunitdev.(com05.y, com05.X * βm05)
typeof(rowtbl)
```
For display we convert the row-table to a column-table and prepend another column-table consisting of $\bby$ and $\bbeta$.
```{julia}
Table((; y=com05.y, η=com05.X * βm05), rowtbl) # display as a Table
```
The deviance for this value of ${\bbbeta}$ in this model is the sum of the unit deviances, which we write as `sum` applied to a [generator expression](https://docs.julialang.org/en/v1/manual/arrays/#Generator-Expressions). (In general we extract columns of a row-table with generator expressions that produce [iterators](https://docs.julialang.org/en/v1/base/collections/#lib-collections-iteration).)
```{julia}
sum(r.dev for r in rowtbl)
```
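For instance, a single column can be materialized as a `Vector` by collecting such a generator (a small, non-executed illustration):
```{julia}
#| eval: false
μvec = collect(r.μ for r in rowtbl)   # the μ column of the row-table as a Vector
extrema(μvec)                         # fitted probabilities lie strictly in (0, 1)
```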
### Encapsulating the model in a struct
When minimizing the deviance it is convenient to have the different components of the model encapsulated in a user-created `struct` type so we can update the parameter values and evaluate the deviance without needing to keep track of all the pieces of the model.
```{julia}
struct BernoulliGLM{T<:AbstractFloat}
X::Matrix{T}
β::Vector{T}
ytbl::NamedTuple{(:y, :η),NTuple{2,Vector{T}}}
rtbl::Vector{NamedTuple{(:μ, :dev),NTuple{2,T}}}
end
```
We also create an *external constructor*, which is a function defined outside the struct and of the same name as the struct, that constructs and returns an object of that type.
In this case the external constructor creates a `BernoulliGLM` from the model matrix and the response vector, after some consistency checks on the arguments passed to it.
```{julia}
#| output: false
function BernoulliGLM(
X::Matrix{T},
y::Vector{T},
) where {T<:AbstractFloat}
# check consistency of arguments
n = size(X, 1) # number of rows in X
if length(y) ≠ n || any(!in([0, 1]), y)
throw(ArgumentError("y is not an $n-vector of 0's and 1's"))
end
# initial β from linear regression of y in {-1,1} coding
β = X \ replace(y, 0 => -1)
η = X * β
return BernoulliGLM(X, β, (; y, η), meanunitdev.(y, η))
end
```
To optimize the deviance we define an *extractor* method that returns the deviance
```{julia}
StatsAPI.deviance(m::BernoulliGLM) = sum(r.dev for r in m.rtbl)
```
::: {.callout-note collapse="true"}
### Why StatsAPI.deviance and not just deviance?
This extractor is written as a method for the generic `deviance` function defined in the `StatsAPI` package.
Doing so allows us to use the `deviance` name for the extractor without interfering with `deviance` methods defined for other model types.
:::
We also define a mutating function, `setβ!`, that installs a new value of `β` then updates `η` and `rtbl` in place.
```{julia}
#| output: false
function setβ!(m::BernoulliGLM, newβ)
(; y, η) = m.ytbl # destructure ytbl
mul!(η, m.X, copyto!(m.β, newβ)) # η = X * newβ in place
m.rtbl .= meanunitdev.(y, η) # update rtbl in place
return m
end
```
Create such a struct from `X` and `y` for model `com05`.
```{julia}
com05fe = BernoulliGLM(com05.X, com05.y)
β₀ = copy(com05fe.β) # keep a copy of the initial values
```
These initial values of $\bbbeta$ are from a least squares fit of $\bby$, converted from `{0,1}` coding to `{-1,1}` coding, on the model matrix, $\bbX$.
As a simple test of the `setβ!` and `deviance` methods we can check that `com05fe` produces the same deviance value for `βm05` as was evaluated above.
```{julia}
deviance(setβ!(com05fe, βm05))
```
For fairness in later comparisons we restore the initial values `β₀` to the model.
These are rough starting estimates with a deviance that is considerably greater than that at `βm05`.
```{julia}
deviance(setβ!(com05fe, β₀))
```
### Fit the GLM using a general optimizer
We can use a general optimizer like those available in [NLopt.jl](https://github.com/JuliaOpt/NLopt.jl) to minimize the deviance.
Following the instructions given at that package's repository, we create an `Opt` object specifying the algorithm to be used, BOBYQA (@powell2009bobyqa), and the dimension of the problem, then define and assign the objective function in the required form, and call `optimize`
```{julia}
function StatsAPI.fit!(m::BernoulliGLM{T}) where {T}
opt = Opt(:LN_BOBYQA, length(m.β))
function objective(x::Vector{T}, g::Vector{T}) where {T}
isempty(g) || throw(
ArgumentError("Gradient not available, g must be empty"),
)
return deviance(setβ!(m, x))
end
opt.min_objective = objective
minf, minx, ret = optimize(opt, copy(m.β))
@info (; code=ret, nevals=opt.numevals, minf)
return m
end
```
```{julia}
fit!(com05fe);
```
The optimizer has determined a coefficient vector that reduces the deviance to 2409.38, at which point convergence was declared because changes in the objective are limited by round-off.
This required about 500 evaluations of the deviance at candidate values of $\bbbeta$.
Each evaluation of the deviance is fast, requiring only a fraction of a millisecond on a laptop computer,
```{julia}
βopt = copy(com05fe.β)
@benchmark deviance(setβ!(m, β)) seconds = 1 setup =
(m = com05fe; β = βopt)
```
but the already large number of evaluations for these six coefficients would not scale well as this dimension increases.
Fortunately there is an algorithm, called *iteratively reweighted least squares* (IRLS), that uses the special structure of the GLM to provide fast and stable convergence to estimates of the coefficients, even for models with a large number of coefficients.
This will be important to us in fitting GLMMs where we must optimize with respect to the random effects, whose dimension can be large.
## The IRLS algorithm {#sec-IRLS}
As we have seen, in a GLM we are modeling the responses and the predicted values on two scales --- the *linear predictor scale*, for $\bbeta$, and the *response scale*, for $\bby$ and $\bbmu$.
The scalar link function, $\eta=g(\mu)$, and the inverse link, $\mu=g^{-1}(\eta)$, map vectors component-wise between these two scales.
For operations like determining a new candidate value of $\bbbeta$, the linear predictor scale is preferred, because, on that scale, $\bbeta=\bbX\bbbeta$ is a linear function of $\bbbeta$.
Thus it would be convenient if we could transform the response, ${\mathbf{y}}$, to the linear predictor scale where we could define a residual and use some form of minimizing a sum of squared residuals to evaluate a new coefficient vector (or, alternatively, evaluate an increment that will be added to the current coefficient vector).
Unfortunately, a naive approach of transforming $\bby$ to the linear predictor scale won't work because the elements of ${\mathbf{y}}$ are all $0$ or $1$ and the logit link function maps these values to $-\infty$ and $\infty$, respectively.
For an iterative algorithm, however, we can use a local linear approximation to the link function to define a *working residual*, from which to evaluate an increment to the coefficient vector, or a *working response*, from which we evaluate the new coefficient vector directly.
Because the link and inverse link functions are defined component-wise we will define the approximation for scalars $y_i$, $\mu_i$, and $\eta_i$ and for the scalar link function, $g$, with the understanding that these definitions apply component-wise to the vectors.
The *working residual* is evaluated by mapping the residual on the response scale, $y_i-\mu_i$, through the linear approximation to the link, $g(\mu)$, at $\mu_i$.
That is,
$$
\tilde{r_i}=(y_i-\mu_i)g'(\mu_i)\quad i=1,\dots,n .
$$
Because the derivative, $g'(\mu_i)$, for the logit link function is $1/[\mu_i(1-\mu_i)]$, these working residuals are
$$
\tilde{r}_i = (y_i-\mu_i)g'(\mu_i) = \frac{y_i - \mu_i}{\mu_i(1-\mu_i)}\quad i=1,\dots,n .
$$
Similarly, the *working response* on the linear predictor scale, is defined by adding the working residual to the current linear predictor value,
$$
\tilde{y_i}=\eta_i + \tilde{r_i}=\eta_i +(y_i-\mu_i)g'(\mu_i)=
\eta_i + \frac{y_i - \mu_i}{\mu_i(1-\mu_i)}\quad i=1,\dots,n .
$$
On the linear predictor scale we can fit a linear model to the working response to obtain a new parameter vector, but we must take into account that the variances of the *noise terms* in this linear model, which are the working residuals, are not constant.
We use *weighted least squares* where the weights are inversely proportional to the variance of the working residual.
The variance of the random variable $\mcY_i$ is $\mu_i(1-\mu_i)$, hence the variance of the working residual is
$$
\mathrm{Var}(\tilde{r_i})=g'(\mu_i)^2 \mathrm{Var}(\mcY_i)=\frac{\mu_i(1-\mu_i)}{\left[\mu_i(1-\mu_i)\right]^2}=\frac{1}{\mu_i(1-\mu_i)}
\quad i=1,\dots,n .
$$
Thus the working weights are
$$
\begin{aligned}
w_i&=\mu_i(1-\mu_i)\\
&=\frac{1}{1+e^{-\eta_i}}\frac{e^{-\eta_i}}{1+e^{-\eta_i}}
\end{aligned}
,\quad i=1,\dots,n.
$$
In practice we will use the square roots of the working weights, evaluated as
$$
\sqrt{w_i}=\frac{\sqrt{e^{-\eta_i}}}{1+e^{-\eta_i}}=\frac{e^{-\eta_i/2}}{1+e^{-\eta_i}}\quad i=1,\dots,n .
$$
Note that $\mathrm{Var}(\mcY_i)$ happens to be the inverse of $g'(\mu_i)$ for a Bernoulli response and the logit link function.
This will always be true for distributions in the [exponential family](https://en.wikipedia.org/wiki/Exponential_family) and their canonical links.
At the $k$th iteration the IRLS algorithm updates the coefficient vector to $\bbbeta^{(k)}$, which is a weighted least squares solution that could be written as
$$
\bbbeta^{(k)}= \left(\bbX'\bbW\bbX\right)^{-1}\left(\bbX'\bbW\tilde{\bby}\right) ,
$$
where $\bbW$ is an $n\times n$ diagonal matrix of the working weights and $\tilde{\bby}$ is the working response, both evaluated at $\bbbeta^{(k-1)}$, the coefficient vector from the previous iteration.
In practice we use the square roots of the working weights, which we write as a diagonal matrix, $\bbW^{1/2}$, and a QR decomposition (@sec-matrixdecomp) of a weighted model matrix, $\bbW^{1/2}\bbX$, to solve for the updated coefficient vector from the weighted working response, $\bbW^{1/2}\tilde{\bby}$, with elements
$$
\begin{aligned}
\sqrt{w_i}(\eta_i+\tilde{r}_i)&=\sqrt{\mu_i(1-\mu_i)}(\eta_i+\tilde{r}_i)\\
&=\sqrt{w_i}\eta_i +\frac{(y_i-\mu_i)\sqrt{\mu_i(1-\mu_i)}}{\mu_i(1-\mu_i)}\\
&=\sqrt{w_i}\eta_i +\frac{y_i-\mu_i}{\sqrt{w_i}}
\end{aligned},\quad i=1,\dots,n
$$
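In matrix form one such update could be sketched as below, where `X`, `y`, `η`, and `μ` stand for the quantities defined above rather than for objects created elsewhere in this appendix; the actual implementation later in this section works in place and re-uses storage.
```{julia}
#| eval: false
using LinearAlgebra
w = μ .* (1 .- μ)                           # working weights
Whalf = Diagonal(sqrt.(w))                  # W^{1/2}
wwresp = Whalf * η .+ (y .- μ) ./ sqrt.(w)  # weighted working response W^{1/2}ỹ
βnew = qr(Whalf * X) \ wwresp               # weighted least squares via QR
```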
It is possible to write the IRLS algorithm using a weighted least squares fit of the working residuals on the model matrix to determine a parameter increment.
However, in the PIRLS algorithm it is necessary to use the working response, not the working residual, so we define the IRLS algorithm in those terms too.
Furthermore, in the PIRLS algorithm we will need to allow for an *offset* when calculating the working response.
In the presence of an offset, $\bbo$, a constant vector of length $n$, the linear predictor is defined as
$$
\bbeta = \bbo + \bbX\bbbeta .
$$
The mean, $\bbmu$, the working weights and the working residuals are defined as before but the working response becomes
$$
\tilde{\bby}=\tilde{\bbr} + \bbeta - \bbo .
$$
For a linear model there is rarely a reason for using an offset.
Instead we can simply subtract the constant vector, $\bbo$, from the response, $\bby$, because the response and the linear predictor are on the same scale.
However, this is not the case for a GLM where we must deal with the effects of the constant offset on the linear predictor scale, not on the response scale.
### Implementation of IRLS for Bernoulli-Logit
We define a `BernoulliIRLS` struct with three additional elements in the rowtable: the square roots of the working weights, `rtwwt`, the weighted working residuals, `wwres`, and the weighted working response, `wwresp`.
In the discussion above, `rtwwt` is the diagonal of $\bbW^{1/2}$, `wwres` is $\bbW^{1/2}\tilde{\bbr}$ and `wwresp` is $\bbW^{1/2}\tilde{\bby}$.
We also add fields `Xqr`, in which the weighted model matrix, $\bbW^{1/2}\bbX$, is formed followed by its QR decomposition, and `βcp`, which holds a copy of the previous coefficient vector.
```{julia}
#| output: false
struct BernoulliIRLS{T<:AbstractFloat}
X::Matrix{T}
Xqr::Matrix{T} # copy of X used in the QR decomp
β::Vector{T}
βcp::Vector{T} # copy of previous β
Whalf::Diagonal{T,Vector{T}} # rtwwt as a Diagonal matrix
ytbl::NamedTuple{(:y, :η),NTuple{2,Vector{T}}}
rtbl::Vector{
NamedTuple{(:μ, :dev, :rtwwt, :wwres, :wwresp),NTuple{5,T}},
}
end
```
with constructor
```{julia}
#| output: false
function BernoulliIRLS(
X::Matrix{T},
y::Vector{T},
) where {T<:AbstractFloat}
n = size(X, 1) # number of rows of X
if length(y) ≠ n || !all(v -> (iszero(v) || isone(v)), y)
throw(ArgumentError("y is not an $n-vector of 0's and 1's"))
end
# initial β from linear least squares fit of y in {-1,1} coding
Xqr = copy(X)
β = qr!(Xqr) \ replace(y, 0 => -1)
βcp = copy(β)
η = X * β
rtbl = tblrow.(y, η)
Whalf = Diagonal([r.rtwwt for r in rtbl])
return BernoulliIRLS(X, Xqr, β, βcp, Whalf, (; y, η), rtbl)
end
```
The `tblrow` function evaluates the mean, unit deviance, square root of the weight, and the weighted, working residual and weighted, working response for scalar $y$ and $\eta$.
The `offset` argument, which defaults to zero, is not used in calls for `BernoulliIRLS` models, but will be used in @sec-PIRLS when we discuss the PIRLS algorithm.
```{julia}
#| output: false
function tblrow(
y::T,
η::T,
offset::T=zero(T),
) where {T<:AbstractFloat}
rtexpmη = exp(-η / 2) # square root of exp(-η)
expmη = abs2(rtexpmη) # exp(-η)
denom = 1 + expmη
μ = inv(denom)
dev = 2 * ((1 - y) * η + log1p(expmη))
rtwwt = rtexpmη / denom # sqrt of working wt
wwres = (y - μ) / rtwwt # weighted working resid
wwresp = wwres + rtwwt * (η - offset)
return (; μ, dev, rtwwt, wwres, wwresp)
end
```
```{julia}
#| output: false
StatsAPI.deviance(m::BernoulliIRLS) = sum(r.dev for r in m.rtbl)
```
Next we define a mutating function, `updateβ!`, that evaluates $\bbbeta^{(k)}$, the updated coefficient vector at iteration $k$, in place by weighted least squares then updates the response table.
```{julia}
#| output: false
function updateβ!(m::BernoulliIRLS)
(; X, Xqr, β, βcp, Whalf, ytbl, rtbl) = m # destructure m & ytbl
(; y, η) = ytbl
copyto!(βcp, β) # keep a copy of β
copyto!(Whalf.diag, r.rtwwt for r in rtbl) # rtwwt -> Whalf
mul!(Xqr, Whalf, X) # weighted model matrix
copyto!(η, r.wwresp for r in rtbl) # use η as temp storage
ldiv!(β, qr!(Xqr), η) # weighted least squares
rtbl .= tblrow.(y, mul!(η, X, β)) # update η and rtbl
return m
end
```
For our example, we start at the same coefficient vector as we did with the general optimizer.
```{julia}
com05fe = BernoulliIRLS(com05.X, com05.y)
deviance(com05fe)
```
The first IRLS iteration
```{julia}
deviance(updateβ!(com05fe))
```
reduces the deviance substantially.
We create a `fit!` method to iterate to convergence.
```{julia}
#| output: false
function StatsAPI.fit!(m::BernoulliIRLS, β₀=m.β; verbose::Bool=true)
(; X, β, βcp, ytbl, rtbl) = m
(; y, η) = ytbl
rtbl .= tblrow.(y, mul!(η, X, copyto!(β, β₀)))
olddev = deviance(m)
verbose && @info 0, olddev # record the deviance at initial β
for i in 1:100 # perform at most 100 iterations
newdev = deviance(updateβ!(m))
verbose && @info i, newdev # iteration number and deviance
if newdev > olddev
@warn "failure to decrease deviance"
copyto!(β, βcp) # roll back changes to β, η, and rtbl
rtbl .= tblrow.(y, mul!(η, X, β))
break
elseif (olddev - newdev) < (1.0e-10 * olddev)
break # exit loop if deviance is stable
else
olddev = newdev
end
end
return m
end
```
```{julia}
fit!(com05fe, β₀);
```
The IRLS algorithm has converged in 4 iterations to essentially the same deviance as the general optimizer achieved after around 500 function evaluations.
Each iteration of the IRLS algorithm takes more time than a deviance evaluation, but still only a fraction of a millisecond on a laptop computer.
```{julia}
@benchmark deviance(updateβ!(m)) seconds = 1 setup = (m = com05fe)
```
## GLMMs and the PIRLS algorithm {#sec-PIRLS}
In @sec-lmmtheory we showed that, given a value of $\bbtheta$, which determines the relative covariance factor, $\bbLambda_{\bbtheta}$, of the random effects, $\mcB$, the *conditional mode*, $\tilde{\bbb}$, of the random effects can be evaluated as the solution to a *penalized least squares* (PLS) problem.
It is convenient to write the PLS problem in terms of the *spherical random effects*, $\mcU\sim\mcN(\bbzero,\sigma^2\bbI)$, with the defining relationship $\mcB=\bbLambda_{\bbtheta}\mcU$, as in @eq-penalized-rss
$$
\tilde{\bbu}=\arg\min_{\bbu}\left(
\left\|\bby-\bbX\bbbeta-\bbZ\bbLambda_{\bbtheta}\bbu\right\|^2 +
\left\|\bbu\right\|^2
\right) .
$$
We wrote @eq-penalized-rss for the LMM case as minimizing the penalized sum of squared residuals with respect to both $\bbbeta$ and $\bbu$.
Here, and in @eq-condmodeu below, we minimize with respect to $\bbu$ only while holding $\bbbeta$ fixed.
The solution of this PLS problem, $\tilde\bbu$, is the *conditional mode* of $\mcU$, in that it maximizes the density of the conditional distribution, $(\mcU|\mcY=\bby)$, at the observed $\bby$.
(In the case of a LMM, where the conditional distributions, $(\mcB|\mcY=\bby)$ and $(\mcU|\mcY=\bby)$, are multivariate Gaussian, the solution of the PLS problem is also the *mean* of the conditional distribution, but this property doesn't carry over to GLMMs.)
In a Bernoulli generalized linear mixed model (GLMM) the mode of the conditional distribution, $(\mcU|\mcY=\bby)$, minimizes the *penalized GLM deviance*,
$$
\tilde{\bbu}=\arg\min_{\bbu}\left(
\left\|\bbu\right\|^2+\sum_{i=1}^n d(y_i,\eta_i(\bbu))
\right) ,
$$ {#eq-condmodeu}
where $d(y_i,\eta_i),\,i=1,\dots,n$ are the unit deviances defined in @sec-BernoulliGLM.
We modify the IRLS algorithm as *penalized iteratively re-weighted least squares* (PIRLS) to determine these values.
As with IRLS, each iteration of the PIRLS algorithm involves using the current linear predictor, $\bbeta(\bbu)=\bbX\bbbeta+\bbZ\bbLambda_{\bbtheta}\bbu$, ($\bbbeta$ and $\bbtheta$ are assumed known and fixed, and $\bbX\bbbeta$ is an offset) to evaluate the mean, $\bbmu=\bbg^{-1}(\bbeta)$, of the conditional distribution, $(\mcY|\mcU=\bbu)$, as well as the unit deviances, $d(y_i,\eta_i)$, the square roots of the working weights, which are on the diagonal of $\bbW^{1/2}$, and the weighted, working response, $\bbW^{1/2}\tilde{\bby}$.
The updated spherical random effects vector, $\bbu$, is the solution to
$$
(\bbLambda'\bbZ'\bbW\bbZ\bbLambda+\bbI)\bbu=\bbLambda'\bbZ'\bbW\tilde{\bby}
$$
and is evaluated using the Cholesky factor, $\bbL$, of $\bbLambda'\bbZ'\bbW\bbZ\bbLambda+\bbI$.
As in the solution of the PLS problem in @sec-lmmtheory, the fact that $\bbZ$ is sparse and that the sparsity is also present in $\bbL$, makes it feasible to solve for $\bbu$ even when its dimension is large.
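In the general case this solve could be sketched with dense matrices as below, where `Z`, `Λ`, `Whalf` ($\bbW^{1/2}$), and `wwresp` ($\bbW^{1/2}\tilde{\bby}$) stand for the quantities described above. This is only an illustration; the actual implementation exploits the sparsity of these matrices and, for the model considered here, the fact that they are diagonal.
```{julia}
#| eval: false
using LinearAlgebra
A = Λ' * Z' * Whalf * Whalf * Z * Λ + I    # Λ'Z'WZΛ + I
L = cholesky(Symmetric(A)).L               # lower Cholesky factor
u = L' \ (L \ (Λ' * Z' * Whalf * wwresp))  # solve L L' u = Λ'Z'W ỹ
```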
### PIRLS for com05
To illustrate the calculations we again use the `com05` model, which has a single, scalar random-effects term, `(1 | dist & urban)`, in its formula.
The matrix $\bbZ$ is displayed as
```{julia}
com05re = only(com05.reterms)
Int.(collect(com05re)) # Int values for compact printing
```
but internally it is stored much more compactly because it is an
*indicator matrix* (also called *one-hot* encoding), which means that all $Z_{i,j}\in\{0,1\}$ and in each row there is exactly one value that is one (and all the others are zero).
The column in which the non-zero element of each row occurs is given as an integer vector in the `refs` property of `com05re`.
```{julia}
com05re.refs' # transpose for compact printing
```
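For example, the dense indicator matrix displayed above could be reconstructed from `refs` as in the non-executed sketch below; this is only an illustration, not how the matrix is stored internally.
```{julia}
#| eval: false
refs = com05re.refs
Zind = zeros(Int, length(refs), maximum(refs))
for (i, j) in enumerate(refs)
    Zind[i, j] = 1            # the single non-zero in row i is in column refs[i]
end
Zind == Int.(collect(com05re))  # should be true
```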
We define a struct
```{julia}
#| output: false
struct BernoulliPIRLS{T<:AbstractFloat,S<:Integer}
X::Matrix{T}
θβ::Vector{T}
ytbl::NamedTuple{ # column-table
(:refs, :y, :η, :offset),
Tuple{Vector{S},Vector{T},Vector{T},Vector{T}},
}
utbl::NamedTuple{ # column-table
(:u, :u0, :Ldiag, :pdev, :pdev0, :aGHQ),
NTuple{6,Vector{T}},
}
rtbl::Vector{ # row-table
NamedTuple{(:μ, :dev, :rtwwt, :wwres, :wwresp),NTuple{5,T}},
}
end
```
with an external constructor
```{julia}
#| output: false
function BernoulliPIRLS(
X::Matrix{T},
y::Vector{T},
refs::Vector{S},
) where {T<:AbstractFloat,S<:Integer}
# use IRLS to check X and y, obtain initial β, and establish rtbl
irls = fit!(BernoulliIRLS(X, y); verbose=false)
β = irls.β
θβ = append!(ones(T, 1), β) # initial θ = 1
η = X * β
# refs should contain all values from 1 to maximum(refs)
refvals = sort!(unique(refs))
q = length(refvals)
if refvals ≠ 1:q
throw(ArgumentError("sort!(unique(refs)) must be 1:$q"))
end
length(refs) == length(y) ||
throw(ArgumentError("lengths of y and refs aren't equal"))
ytbl = (; refs, y, η, offset=copy(η))
utbl = NamedTuple(
nm => zeros(T, q) for
nm in (:u, :u0, :Ldiag, :pdev, :pdev0, :aGHQ)
)
return updatetbl!(BernoulliPIRLS(X, θβ, ytbl, utbl, irls.rtbl))
end
```
::: {.callout-note collapse="true"}
### Why are θ and β stored in a single vector?
The reason for storing both $\bbtheta$ and $\bbbeta$ in a single vector is to provide for their simultaneous optimization with an optimizer such as those in [NLopt.jl](https://github.com/JuliaOpt/NLopt.jl).
:::
The `utbl` field in a `BernoulliPIRLS` struct contains vectors named `u0`, `pdev`, `pdev0`, and `aGHQ`, in addition to `u` and `Ldiag`.
These are not used in the PIRLS algorithm (other than for keeping a copy of the previous $\bbu$) or for optimizing Laplace's approximation to the objective.
However, they will be used in adaptive Gauss-Hermite quadrature evaluation of the objective (@sec-aGHQ), so we keep them in the struct throughout.
The `updatetbl!` method for this type first evaluates $\bbeta$ via a "virtual" multiplication that forms $\bbZ\bbLambda_{\bbtheta}\bbu$ plus the stored `offset`, which is $\bbX\bbbeta$, then updates the rowtable from $\bby$, $\bbeta$, and the offset.
For this model $\bbtheta$ is one-dimensional and $\bbLambda_{\bbtheta}$ is a scalar multiple of $\bbI_q$, the identity matrix of size $q$, and thus the matrix multiplication by $\bbLambda_{\bbtheta}$ can be expressed as scalar products.
```{julia}
#| output: false
function updatetbl!(m::BernoulliPIRLS)
(; refs, y, η, offset) = m.ytbl
u = m.utbl.u
θ = first(m.θβ)
# evaluate η = offset + ZΛu where Λ is θ * I and Z is one-hot
fill!(η, 0)
@inbounds for i in eachindex(η, refs, offset)
η[i] += offset[i] + u[refs[i]] * θ
end
m.rtbl .= tblrow.(y, η, offset)
return m
end
```
The `pdeviance` method returns the deviance for the GLM model plus the penalty on the squared length of `u`.
```{julia}
#| output: false
function pdeviance(m::BernoulliPIRLS)
return sum(r.dev for r in m.rtbl) + sum(abs2, m.utbl.u)
end
```
The `updateu!` method is similar to `updateβ!` for the `BernoulliIRLS` type except that it is based on the diagonal matrix $\bbLambda'\bbZ'\bbW\bbZ\bbLambda + \bbI$.
Only the diagonal elements of this matrix are constructed and used to solve for the updated $\bbu$ vector.
At convergence of the PIRLS algorithm the elements of `Ldiag` are replaced by their square roots.
```{julia}
#| output: false
function updateu!(m::BernoulliPIRLS)
(; u, u0, Ldiag) = m.utbl
copyto!(u0, u) # keep a copy of u
θ = first(m.θβ) # extract the scalar θ
fill!(u, 0)
if iszero(θ) # skip the update if θ == 0
fill!(Ldiag, 1) # L is the identity if θ == 0
return updatetbl!(m)
end
fill!(Ldiag, 0)
@inbounds for (ri, ti) in zip(m.ytbl.refs, m.rtbl)
rtWΛ = θ * ti.rtwwt # non-zero in i'th row of √WZΛ
Ldiag[ri] += abs2(rtWΛ) # accumulate Λ'Z'WZΛ
u[ri] += rtWΛ * ti.wwresp # accumulate Λ'Z'Wỹ
end
Ldiag .+= 1 # form diagonal of Λ'Z'WZΛ + I = LL'
u ./= Ldiag # solve for u with diagonal LL'
return updatetbl!(m) # and update η and rtbl
end
```
Create a `BernoulliPIRLS` struct for the `com05` model and check the penalized deviance at the initial values
```{julia}
m = BernoulliPIRLS(com05.X, com05.y, only(com05.reterms).refs)
pdeviance(m)
```
As with IRLS, the first iteration of PIRLS reduces the objective, which is the penalized deviance in this case, substantially.
```{julia}
pdeviance(updateu!(m))
```
Create a `pirls!` method for this struct.
```{julia}
#| output: false
function pirls!(m::BernoulliPIRLS; verbose::Bool=false)
(; u, u0, Ldiag) = m.utbl
fill!(u, 0) # start from u == 0
copyto!(u0, u) # keep a copy of u
oldpdev = pdeviance(updatetbl!(m))
verbose && @info 0, oldpdev
for i in 1:10 # maximum of 10 PIRLS iterations
newpdev = pdeviance(updateu!(m))
verbose && @info i, newpdev
if newpdev > oldpdev # PIRLS iteration failed
@warn "PIRLS iteration did not reduce penalized deviance"
copyto!(u, u0) # restore previous u
updatetbl!(m) # restore η and rtbl
break
elseif (oldpdev - newpdev) < (1.0e-8 * oldpdev)
copyto!(u0, u) # keep a copy of u
break
else
copyto!(u0, u) # keep a copy of u
oldpdev = newpdev
end
end
map!(sqrt, Ldiag, Ldiag) # replace diag(LL') by diag(L)
return m
end
```
The PIRLS iterations always start from $\bbu=\mathbf{0}$ so that the converged value of the penalized deviance is reproducible for given values of $\theta$ and $\bbbeta$.
If we allowed the algorithm to start at whatever values are currently stored in $\bbu$ then there could be slight differences in the value of the penalized deviance at convergence of PIRLS, which can cause problems when trying to optimize with respect to $\theta$ and $\bbbeta$.
```{julia}
pirls!(m; verbose=true);
```
As with IRLS, PIRLS is a fast and stable algorithm for determining the mode of the conditional distribution $(\mcU|\mcY=\bby)$ with $\bbtheta$ and $\bbbeta$ held fixed.
```{julia}
@benchmark pirls!(mm) seconds = 1 setup = (mm = m)
```
The time taken for the four iterations to determine the conditional mode of $\bbu$ is comparable to the time taken for a single call to `updateβ!`.
Most of the time in `updateβ!` is spent in the QR factorization to solve the weighted least squares problem, whereas in `updateu!`, and thus in `pirls!`, we take advantage of the fact that the solution of the penalized, weighted least squares problem is based on a diagonal matrix.
## Laplace's approximation to the log-likelihood {#sec-GLMMLaplace}
The PIRLS algorithm determines the value of $\bbu$ that minimizes the penalized deviance
$$
\tilde{\bbu}=\arg\min_{\bbu}\left(\left\|\bbu\right\|^2+\sum_{i=1}^n d(y_i,\eta_i)\right) ,
$$
where $\eta_i$, $i=1,\dots,n$, is the $i$th component of $\bbeta=\bbX\bbbeta+\bbZ\bbLambda_{\bbtheta}\bbu$.
A quadratic approximation to the penalized deviance at $\tilde{\bbu}$ is
$$
\begin{aligned}
\tilde{d}(\bby,\bbu)&=\|\bbu\|^2+\sum_{i=1}^n d(y_i,\eta_i)\\
&\approx\|\tilde{\bbu}\|^2+\sum_{i=1}^n d(y_i,\tilde{\eta}_i)+
(\bbu-\tilde{\bbu})'(\bbLambda'\bbZ'\bbW\bbZ\bbLambda+\bbI)(\bbu-\tilde{\bbu})\\
&=\|\tilde{\bbu}\|^2+\sum_{i=1}^n d(y_i,\tilde{\eta}_i)+
(\bbu-\tilde{\bbu})'\bbL\bbL'(\bbu-\tilde{\bbu})\\
&=\tilde{d}(\bby,\tilde{\bbu})+
(\bbu-\tilde{\bbu})'\bbL\bbL'(\bbu-\tilde{\bbu})
\end{aligned}
$$ {#eq-pdevquad}
where $\bbL$ is the lower Cholesky factor of $\bbLambda'\bbZ'\bbW\bbZ\bbLambda+\bbI$.
(In @eq-pdevquad the linear term in $(\bbu-\tilde{\bbu})$ that would normally occur in such an expression is omitted because the gradient of $\tilde{d}(\bby,\bbu)$ is zero at $\tilde{\bbu}$.)
[Laplace's approximation](https://en.wikipedia.org/wiki/Laplace%27s_approximation) to the log-likelihood uses this quadratic approximation to the penalized deviance, which is negative one-half the logarithm of the integrand, to approximate the value of the integral.
On the deviance scale, which is negative twice the log-likelihood, the approximation is
$$
\begin{aligned}
-2\,\ell(\bbtheta,\bbbeta|\bby)&=-2\,\log\left(L(\bbtheta,\bbbeta|\bby)\right)\\
&=-2\,\log\left(\int_{\bbu}\frac{1}{\sqrt{2\pi}^q}\exp\left(\frac{\left\|\bbu\right\|^2+\sum_{i=1}^n d(y_i,\mu_i)}{-2}\right)\,d\bbu\right)\\
&\approx\tilde{d}(\bby,\tilde{\bbu})-2\,\log\left(
\int_\bbu\frac{1}{\sqrt{2\pi}^q}\exp\left(\frac{[\bbu-\tilde{\bbu}]'\bbL\bbL'[\bbu-\tilde{\bbu}]}{-2}\right)\,d\bbu
\right)\\
&=\tilde{d}(\bby,\tilde{\bbu})-2\,\log\left(
\int_\bbu\frac{1}{\sqrt{2\pi}^q}\exp\left(\frac{\left\|\bbL'[\bbu-\tilde{\bbu}]\right\|^2}{-2}\right)\,d\bbu
\right)\\
&=\tilde{d}(\bby,\tilde{\bbu})-2\,\log\left(|\bbL|^{-1}\right)\\
&=\tilde{d}(\bby,\tilde{\bbu})+\log\left(|\bbL|^2\right)
\end{aligned}
$$
```{julia}
#| output: false
function laplaceapprox(m::BernoulliPIRLS)
return pdeviance(m) + 2 * sum(log, m.utbl.Ldiag)
end
```
```{julia}
laplaceapprox(pirls!(m))
```
The remaining step is to optimize Laplace's approximation to the GLMM deviance with respect to $\theta$ and $\bbbeta$, which we do using the BOBYQA optimizer from [NLopt.jl](https://github.com/JuliaOpt/NLopt.jl)
```{julia}
#| output: false
function StatsAPI.fit!(m::BernoulliPIRLS)
θβ = m.θβ
pp1 = length(θβ) # length(β) = p and length(θ) = 1
opt = Opt(:LN_BOBYQA, pp1)
mβ = view(θβ, 2:pp1)
function obj(x, g)
if !isempty(g)
throw(ArgumentError("gradient not provided, g must be empty"))
end
copyto!(θβ, x)
mul!(m.ytbl.offset, m.X, mβ)
return laplaceapprox(pirls!(m))
end
opt.min_objective = obj
lb = fill!(similar(θβ), -Inf) # vector of lower bounds
lb[1] = 0 # scalar θ must be non-negative
NLopt.lower_bounds!(opt, lb)
minf, minx, ret = optimize(opt, copy(θβ))
@info (; ret, fevals=opt.numevals, minf)
return m
end
```
```{julia}
fit!(m);
```
```{julia}
#| code-fold: true
print(
"Converged to θ = ",
first(m.θβ),
" and β =",
view(m.θβ, 2:lastindex(m.θβ)),
)
```
These estimates differ somewhat from those for model `com05`.
```{julia}
#| code-fold: true
print(
"Estimates for com05: θ = ",
only(com05.θ),
", fmin = ",
deviance(com05),
", and β =",
com05.β,
)
```
The discrepancy in the results is because the `com05` results are based on a more accurate approximation to the integral called *adaptive Gauss-Hermite quadrature*, which is discussed in @sec-aGHQ.
### Generalizations to more complex structure
There is an implicit assumption in the `BernoulliPIRLS` structure that random effects in the model are simple, scalar random effects associated with a single grouping factor, which is represented by `m.ytbl.refs`.
For such models the random effects model matrix, $\bbZ$, is an $n\times q$ indicator matrix, the covariance parameter, $\bbtheta$, is one-dimensional and the covariance factor, $\bbLambda_{\bbtheta}=\theta\,\bbI_q$ is a scalar multiple of the $q\times q$ identity matrix, $\bbI_q$.
Furthermore, $\bbLambda_{\bbtheta}'\bbZ'\bbW\bbZ\bbLambda_{\bbtheta}+\bbI_q$ is also diagonal, as is its Cholesky factor, $\bbL$.
We have taken advantage of the special structure of these matrices both in representations --- storing $\bbL$ by storing only the diagonal values in the vector `m.utbl.Ldiag` --- and in some algorithms for the PIRLS iterative step.
The PIRLS algorithm to determine the conditional mode, $\tilde{\bbu}$, of the random effects, $\mcU$, and Laplace's approximation to the log-likelihood for GLMMs can be generalized to models with vector-valued random effects, or with random effects associated with more than one grouping factor, or with both.
The more general representation of $\bbZ$ and $\bbL$ used with linear mixed models can be adapted for GLMMs as well.
We chose to specialize the representation of GLMMs in this appendix to this specific type of random effects to be able to demonstrate adaptive Gauss-Hermite quadrature in @sec-aGHQ, which, at present, is restricted to models with a single, simple, scalar random-effects term.
## Adaptive Gauss-Hermite quadrature {#sec-aGHQ}
Recall from @eq-pdevquad that Laplace's approximation to the likelihood is based on a quadratic approximation to the penalized (GLM) deviance, $\tilde{d}(\bby,\bbu)$, at the conditional mode $\tilde{\bbu}$.
In the case of a model with a single, scalar, random effects term, like the model `com05`, each linear predictor value, $\eta_i,\,i=1,\dots,n$ depends on only one element of $\bbu$.
Writing the set of indices $i$ for which `refs[i] == j` as $\mcI(j)$, we can express the penalized deviance, and its quadratic approximation, as sums of scalar contributions,