---
fig-width: 4
fig-height: 3
fig-dpi: 150
fig-format: png
engine: julia
julia:
  exeflags: ["--project"]
---
# A simple, linear, mixed-effects model {#sec-examlmm}
\newcommand\bbA{{\mathbf{A}}}
\newcommand\bbb{{\mathbf{b}}}
\newcommand\bbI{{\mathbf{I}}}
\newcommand\bbR{{\mathbf{R}}}
\newcommand\bbX{{\mathbf{X}}}
\newcommand\bbx{{\mathbf{x}}}
\newcommand\bby{{\mathbf{y}}}
\newcommand\bbZ{{\mathbf{Z}}}
\newcommand\bbbeta{{\boldsymbol{\beta}}}
\newcommand\bbeta{{\boldsymbol{\eta}}}
\newcommand\bbLambda{{\boldsymbol{\Lambda}}}
\newcommand\bbOmega{{\boldsymbol{\Omega}}}
\newcommand\bbmu{{\boldsymbol{\mu}}}
\newcommand\bbSigma{{\boldsymbol{\Sigma}}}
\newcommand\bbtheta{{\boldsymbol{\theta}}}
\newcommand\mcN{{\mathcal{N}}}
\newcommand\mcB{{\mathcal{B}}}
\newcommand\mcY{{\mathcal{Y}}}
In this book we describe the theory behind a type of statistical model called *mixed-effects* models and the practice of fitting and analyzing such models using the [MixedModels](https://github.com/JuliaStats/MixedModels.jl) package for [Julia](https://julialang.org).
These models are used in many different disciplines.
Because the descriptions of the models can vary markedly between disciplines, we begin by describing what mixed-effects models are and by exploring a very simple example of one type of mixed model, the *linear mixed model*.
This simple example allows us to illustrate the use of the `LinearMixedModel` type in the `MixedModels` package for fitting such models and for analyzing the fitted model.
We also describe methods of assessing the precision of the parameter estimates and of visualizing the conditional distribution of the random effects, given the observed data.
## Mixed-effects models {#sec-memod}
Mixed-effects models, like many other types of statistical models, describe a relationship between a *response* variable and some of the *covariates* that have been measured or observed along with the response.
In mixed-effects models at least one of the covariates is a *categorical* covariate representing experimental or observational "units" in the data set.
In the example from the chemical industry that is given in this chapter, the observational unit is the batch of an intermediate product used in production of a dye.
In medical and social sciences the observational units are often the human or animal subjects in the study.
In agriculture the experimental units may be the plots of land or the specific plants being studied.
In all of these cases the categorical covariate or covariates are observed at a set of discrete *levels*.
We may use numbers, such as subject identifiers, to designate the particular levels that we observed but these numbers are simply labels.
The important characteristic of a categorical covariate is that, at each observed value of the response, the covariate takes on the value of one of a set of distinct levels.
Parameters associated with the particular levels of a covariate are sometimes called the "effects" of the levels.
If the set of possible levels of the covariate is fixed and reproducible we model the covariate using *fixed-effects* parameters.
If the levels that we observed represent a random sample from the set of all possible levels we incorporate *random effects* in the model.
There are two things to notice about this distinction between fixed-effects parameters and random effects.
First, the names are misleading because the distinction between fixed and random is more a property of the levels of the categorical covariate than a property of the effects associated with them.
Second, we distinguish between "fixed-effects parameters", which are indeed parameters in the statistical model, and "random effects", which, strictly speaking, are not parameters.
As we will see shortly, random effects are unobserved random variables.
To make the distinction more concrete, suppose that we wish to model the annual reading test scores for students in a school district and that the covariates recorded with the score include a student identifier and the student's gender.
Both of these are categorical covariates.
The levels of the gender covariate, male and female, are fixed.
If we consider data from another school district or we incorporate scores from earlier tests, we will not change those levels.
On the other hand, the students whose scores we observed would generally be regarded as a sample from the set of all possible students whom we could have observed.
Adding more data, either from more school districts or from results on previous or subsequent tests, will increase the number of distinct levels of the student identifier.
*Mixed-effects models* or, more simply, *mixed models* are statistical models that incorporate both fixed-effects parameters and random effects.
Because of the way that we will define random effects, a model with random effects always includes at least one fixed-effects parameter.
Thus, any model with random effects is a mixed model.
We characterize the statistical model in terms of two random variables: a $q$-dimensional vector of random effects represented by the random variable $\mcB$ and an $n$-dimensional response vector represented by the random variable $\mcY$.
(We use upper-case "script" characters to denote random variables.
The corresponding lower-case upright letter denotes a particular value of the random variable, with vector-valued random variable values in boldface.)
We observe the value, $\bby$, of $\mcY$.
We do not observe the value, $\bbb$, of $\mcB$.
When formulating the model we describe the unconditional distribution of $\mcB$ and the conditional distribution, $(\mcY|\mcB=\bbb)$.
The descriptions of the distributions involve the form of the distribution and the values of certain parameters.
We use the observed values of the response and the covariates to estimate these
parameters and to make inferences about them.
That's the big picture.
Now let's make this more concrete by describing a particular, versatile class of mixed models called linear mixed models and by studying a simple example of such a model.
First we describe the data in the example.
## The dyestuff and dyestuff2 data {#sec-DyestuffData}
Models with random effects have been in use for a long time.
The first edition of the classic book, *Statistical Methods in Research and Production*, edited by O.L. Davies, was published in 1947 and contained examples of the use of random effects to characterize batch-to-batch variability in chemical processes.
The data from one of these examples are available as the `dyestuff` data in the `MixedModels` package.
In this section we describe and plot these data and introduce a second example, the `dyestuff2` data, described in @box73:_bayes_infer_statis_analy.
### The dyestuff data {#sec-dyestuff}
The data are described in @davies72:_statis_method_in_resear_and_produc [, Table 6.3, p. 131], the fourth edition of the book mentioned above, as coming from
> an investigation to find out how much the variation from batch to
> batch in the quality of an intermediate product (H-acid) contributes
> to the variation in the yield of the dyestuff (Naphthalene Black 12B)
> made from it. In the experiment six samples of the intermediate,
> representing different batches of works manufacture, were obtained,
> and five preparations of the dyestuff were made in the laboratory from
> each sample. The equivalent yield of each preparation as grams of
> standard colour was determined by dye-trial.
To access these data within `Julia` we must first attach this package, and others that we will use, to our session.
```{julia}
#| code-fold: show
#| output: false
#| label: packages01
using AlgebraOfGraphics # high-level graphics
using CairoMakie # graphics back-end
using DataFrameMacros # elegant DataFrame manipulation
using DataFrames # DataFrame implementation
using MixedModels # fit and examine mixed-effects models
using MixedModelsMakie # graphics for mixed-effects models
using Random # random number generation
using StatsBase # basic statistical summaries
using EmbraceUncertainty: dataset # `dataset` means this one
```
A package must be attached before any of the data sets or functions in the package can be used.
If entering these lines results in an error report stating that there is no package by one of these names, then you must first install the missing package(s).
If you are using Julia version 1.8.0 or later the error report will include instructions on how to do this.
In what follows, we will assume that these packages have been installed and attached to the session before any of the code shown has been run.
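If any of the registered packages listed above are missing, they can be installed with the built-in package manager, as sketched below (`Random` is part of the standard library, and we assume `EmbraceUncertainty` is provided by this book's project environment rather than the general registry).
```{julia}
#| eval: false
using Pkg
Pkg.add([
  "AlgebraOfGraphics", "CairoMakie", "DataFrameMacros", "DataFrames",
  "MixedModels", "MixedModelsMakie", "StatsBase",
])
```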
The `EmbraceUncertainty` package provides access to the datasets used in this book.
The directive
```
using EmbraceUncertainty: dataset
```
provides access to the `dataset` function in that package without having to "qualify" the name by writing `EmbraceUncertainty.dataset`.
The definition of this particular `dataset` function also provides access to the datasets used
in examples and in tests of the `MixedModels` package.
:::{.callout-note collapse="true"}
### Add metadata to Arrow files for datasets
:::
```{julia}
#| label: dyestuff_data
dyestuff = dataset(:dyestuff)
```
The output indicates that this dataset is a `Table` read from a file in the [Arrow](https://arrow.apache.org) data format.
A `Table` is a general, but "bare bones", tabular form defined in the [Tables](https://github.com/JuliaData/Tables.jl) package.
This particular table, read from an `Arrow` file, will be read-only.
Often it is convenient to convert the read-only table form to a `DataFrame` to be able to use the full power of the [DataFrames](https://github.com/JuliaData/DataFrames.jl) package.
```{julia}
#| label: dyestuff_dataframe
dyestuff = DataFrame(dyestuff)
```
The `describe` method for a `DataFrame` provides a concise description of the structure of the data,
```{julia}
#| label: describe_dyestuff
describe(dyestuff)
```
Combining this information with the initial description of the `Table` we see that the data frame consists of 30 observations of the `yield`, the response variable, and of the covariate, `batch`, which is a categorical variable whose levels are character strings.
```{julia}
typeof(dyestuff.batch)
```
(`DictEncoded` is the `Arrow` term for a categorical structure where the levels form a dictionary or lookup table and the values are stored as indices into this lookup table.)
```{julia}
show(levels(dyestuff.batch))
```
If the labels for the factor levels are arbitrary, as they are here, we will use letters instead of numbers for the labels.
That is, we label the batches as `"A"` through `"F"` rather than `1` through `6`.
When the labels are letters it is clear that the variable is categorical.
When the labels are numbers a categorical covariate can be mistaken for a numeric covariate, with unintended consequences.
It is a good practice to apply `describe` to any data frame the first time you work with it and to check carefully that any categorical variables are indeed represented as factors.
The data in a data frame are viewed as a table with columns corresponding to variables and rows to observations.
The functions `first` and `last` select the first or last few rows of the table.
```{julia}
first(dyestuff, 7)
```
```{julia}
last(dyestuff, 7)
```
Alternatively, we could tabulate the data using `groupby` and `combine` from [DataFrames](https://github.com/JuliaData/DataFrames.jl).
```{julia}
#| label: dyestuff_summary
combine(groupby(dyestuff, :batch), :yield => mean, nrow => :n)
```
Although this table does show us an important property of the data, namely that there are exactly 5 observations on each batch --- a property that we will describe by saying that the data are *balanced* with respect to `batch` --- we usually learn much more about the structure of such data from plots like @fig-dyestuffdata
```{julia}
#| fig-cap: Yield of dyestuff by batch. The line joins the mean yields.
#| label: fig-dyestuffdata
#| code-fold: true
#| warning: false
let
sumry = sort!(
combine(groupby(dyestuff, :batch), :yield => mean => :yield),
:yield,
)
mp = mapping(
:yield => "Yield of dyestuff [g]",
:batch =>
sorter(sumry.batch) => "Batch of intermediate product",
)
draw(
(
data(dyestuff) *
mp *
visual(Scatter; marker='○', markersize=12)
) + (data(sumry) * mp * visual(Lines));
figure=(; size=(600, 300)),
)
end
```
than we do from numerical summaries.
In @fig-dyestuffdata we can see that there is considerable variability in yield, even for preparations from the same batch, but there is also noticeable batch-to-batch variability.
For example, four of the five preparations from batch F produced lower yields than did any of the preparations from batches B, C, and E.
This plot, and essentially all the other plots in this book, were created using the [Makie](https://makie.juliaplots.io) package [@DanischKrumbiegel2021].
In `yet-to-be-written-appendix` we review some of the principles of data graphics, such as reordering the levels of the factor by increasing mean response, that enhance the informativeness of the plot.
For example, in this plot the levels of `batch` are sorted by increasing mean yield, to make visual comparisons between batches easier, and the vertical positions are *jittered* to avoid overplotting of points.
(Note that the two lowest yields of samples from batch `A` are identical.)
::: {.callout-note}
### Jittering has not yet been added.
:::
At this point we will concentrate on the information conveyed by the plot and not on how the plot is created.
In @sec-fittinglmms we will use mixed models to quantify the variability in yield between batches.
For the time being let us just note that the particular batches used in this experiment are a selection or sample from the set of all batches that we wish to consider.
Furthermore, the extent to which one particular batch tends to increase or decrease the mean yield of the process --- in other words, the "effect" of that particular batch on the yield --- is not as interesting to us as is the extent of the variability between batches.
For the purposes of designing, monitoring and controlling a process we want to predict the yield from future batches, taking into account the batch-to-batch variability and the within-batch variability.
Being able to estimate the extent to which a particular batch in the past increased or decreased the yield is not usually an important goal for us.
We will model the effects of the batches as random effects rather than as fixed-effects parameters.
### The dyestuff2 data {#sec-dyestuff2}
The data are simulated data presented in @box73:_bayes_infer_statis_analy [, Table 5.1.4, p. 247] where the authors state
> These data had to be constructed for although examples of this sort
> undoubtedly occur in practice they seem to be rarely published.
The structure and summary
```{julia}
#| label: dyestuff2
dyestuff2 = dataset(:dyestuff2)
```
```{julia}
#| label: dyestuff2_dataframe
dyestuff2 = DataFrame(dyestuff2)
describe(dyestuff2)
```
are intentionally similar to those of the `dyestuff` data.
A data plot (@fig-dyestuff2data)
```{julia}
#| fig-cap: Artificial data of yield by batch. The line joins the mean yields.
#| label: fig-dyestuff2data
#| code-fold: true
#| warning: false
let
sumry = sort!(
combine(groupby(dyestuff2, :batch), :yield => mean => :yield),
:yield,
)
mp = mapping(
:yield => "Simulated yield",
:batch => sorter(sumry.batch) => "Batch",
)
draw(
(
data(dyestuff2) *
mp *
visual(Scatter; marker='○', markersize=12)
) + (data(sumry) * mp * visual(Lines));
figure=(; size=(600, 300)),
)
end
```
shows that the batch-to-batch variability in these data is small compared to the within-batch variability.
In some approaches to mixed models it can be difficult to fit models to such data.
Paradoxically, small "variance components" can be more difficult to estimate than large variance components.
The methods we will present are not compromised when estimating small variance components.
## Fitting linear mixed models {#sec-fittinglmms}
Before we formally define a linear mixed model, let's go ahead and fit models to these data sets using `MixedModels`.
The simplest way to do this is to use the generic `fit` function with arguments describing the type of model to be fit (i.e. `MixedModel`), a *formula* specifying the model and the *data* on which to evaluate the formula.
We will explain the structure of the formula after we have considered an example.
### A model for the dyestuff data {#sec-dyestuffLMM}
We fit a model to the data allowing for an overall level of the `yield` and for an additive random effect for each level of `batch`.
```{julia}
#| label: dsm01
dsm01 = let f = @formula(yield ~ 1 + (1 | batch))
fit(MixedModel, f, dyestuff; progress=false)
end
```
By default, this compact display of mixed-effects model fits is used in [Jupyter](https://jupyter.org) notebooks, which are the evaluation engine for [quarto books](https://quarto.org/docs/books/) with Julia code chunks.
To obtain more information we can `print` the model, as in
```{julia}
print(dsm01)
```
The call to `fit` constructs a `LinearMixedModel` object, evaluates the [maximum likelihood](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation) parameter estimates, assigns the results to the name `dsm01`, and displays a summary of the fitted model.
:::{.callout-note collapse="true"}
### The named argument, progress=false
The estimates are determined via an iterative optimization algorithm.
In an interactive session a progress bar showing the number of iterations, the current value of the objective and the average time per iteration is displayed during this optimization.
For a static display of the results in a book we suppress this display with the optional, named argument `progress=false`.
Note that the punctuation used to separate the positional arguments (the first three arguments) from the named argument is a semi-colon, not a comma.
In this function call the semi-colon is optional -- a comma could be used instead.
In @sec-stylistic we describe a form of argument passing for which the semi-colon is required.
:::
### Details of the printed display {#sec-printeddetails}
The display of the fitted model has four major sections:
1. a description of the model that was fit
2. some statistics characterizing the model fit
3. a summary of properties of the random effects and
4. a summary of the fixed-effects parameter estimates.
We consider each of these sections in turn.
The description section states that this is a linear mixed model in which the parameters have been estimated by maximum likelihood (ML).
The model formula is displayed for later reference.
The display of a model fit by maximum likelihood provides several model-fit statistics such as Akaike's Information Criterion [@saka:ishi:kita:1986], Schwarz's Bayesian Information Criterion [@schw:1978], the log-likelihood at the parameter estimates, and negative twice the log-likelihood, which is the estimation criterion transformed to the scale of the *deviance*.
For linear mixed models we refer to `-2 loglik` as the value of the *objective* because this is the value that is minimized during the optimization phase of fitting the model.
To evaluate the *deviance* we should subtract the value of this criterion at a *saturated* or baseline model but it is not clear how to define such a baseline model in these cases.
However, it is still possible to perform [likelihood ratio tests](https://en.wikipedia.org/wiki/Likelihood-ratio_test) of different models fit to the same data using the difference in the minimized objectives, because this difference is the same as the difference in the deviances.
(Recall that a ratio of likelihoods corresponds to a difference in log-likelihoods.)
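These statistics can also be extracted programmatically, assuming the usual `StatsBase` extractor generics, which these model fits support.
```{julia}
(; loglik=loglikelihood(dsm01), AIC=aic(dsm01), BIC=bic(dsm01), objective=dsm01.objective)
```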
The third section is the table of estimates of parameters associated with the random effects.
There are two sources of variability in the model we have fit, a batch-to-batch variability in the level of the response and the residual or per-observation variability --- also called the within-batch variability.
The name "residual" is used in statistical modeling to denote the part of the variability that cannot be explained or modeled with the other terms.
It is the variation in the observed data that is "left over" after we have determined the estimates of the parameters in the other parts of the model.
Some of the variability in the response is associated with the fixed-effects terms.
In this model there is only one such term, labeled as the `(Intercept)`.
The name "intercept", which is better suited to models based on straight lines written in a slope/intercept form, should be understood to represent an overall "typical" or mean level of the response in this case.
(In case you are wondering about the parentheses around the name `(Intercept)`, they are included so that you can't accidentally create a variable with a name that conflicts with this name.)
The line labeled `batch` in the random effects table shows that the random effects added to the term, one for each level of the factor `batch`, are modeled as random variables whose unconditional variance is estimated as 1388.33 g$^2$ in the ML fit.
The corresponding standard deviation is 37.26 g.
Note that the last column in the random effects summary table is the estimate of the variability expressed as a standard deviation rather than as a variance.
These values are provided because usually it is easier to visualize standard deviations, which are on the scale of the response, than it is to visualize the magnitude of a variance.
The values in this column are a simple re-expression (the square root) of the estimated variances.
Do not confuse them with standard errors of the variance estimators, which are not given here.
In `add-section-reference-here` we explain why we do not provide standard errors of variance estimates.
The line labeled `Residual` in this table gives the estimate of the variance of the residuals (also in g$^2$) and its corresponding standard deviation.
The estimated standard deviation of the residuals is 49.5 g.
The last line in the random effects table states the number of observations to which the model was fit and the number of levels of any "grouping factors" for the random effects.
In this case we have a single random effects term, `(1|batch)`, in the model formula and the grouping factor for that term is `batch`.
There will be a total of six random effects, one for each level of `batch`.
The final part of the printed display gives the estimates and standard errors of any fixed-effects parameters in the model.
The only fixed-effects term in the model formula is the `1`, denoting a constant which, as explained above, is labeled as `(Intercept)`.
The estimate of this parameter is 1527.5 g, which happens to be the mean yield across all the data - a consequence of the experiment being *balanced*, in the sense that there was the same number of observations for each batch.
This *coefficient table* also includes the *standard error* of the estimate, which is the estimated standard deviation of the estimator; a `z` ratio, which is the ratio of the estimate to its standard error; and the probability that the absolute value of a standard normal random variable exceeds the absolute value of the `z` ratio.
If the hypothesis that the coefficient is zero is of interest, which is not the case here, then the value in the fourth column is an approximate p-value for the hypothesis test.
An alternative measure of the precision of the estimate --- a coverage interval based on a *parametric bootstrap* --- is presented in @sec-furtherassess.
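These quantities can also be computed directly from the fitted model; for example, the `z` ratio is just the estimate divided by its standard error.
```{julia}
only(dsm01.β) / only(stderror(dsm01))   # the z ratio for (Intercept)
```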
### A model for the dyestuff2 data {#sec-dyestuff2lmm}
Fitting a similar model to the `dyestuff2` data produces an estimate $\widehat{\sigma}_1=0$.
```{julia}
#| label: dsm02
dsm02 = let f = @formula(yield ~ 1 + (1 | batch))
fit(MixedModel, f, dyestuff2; progress=false)
end
print(dsm02)
```
An estimate of $0$ for $\sigma_1$ does not mean that there is no variation between the groups.
Indeed @fig-dyestuff2data shows that there is some small amount of variability between the groups.
The estimate, $\widehat{\sigma}_1=0$, simply indicates that the level of "between-group" variability is not sufficient to warrant incorporating random effects in the model.
This point is worth reiterating.
An estimate of zero for a variance component does not imply that there is no variability between the groups.
There will always be variability between groups that is induced by the per-observation variability.
If we arbitrarily divided 30 observations from a homogeneous distribution into six groups of five observations, which is likely the way that this sample was simulated, the sample averages would inherit a level of variability from the original sample.
What is being estimated in the variance component for `batch` is the *excess* variability between groups beyond that induced by the residual variability.
In other words, an estimate $\widehat{\sigma}_1=0$ is not inconsistent with the model.
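A quick simulation illustrates the point: even when all 30 responses come from a single homogeneous distribution, the six batch means still differ. (The mean and standard deviation in this sketch are arbitrary values chosen for illustration.)
```{julia}
sim = DataFrame(;
  batch=repeat('A':'F'; inner=5),                # six "batches" of five observations each
  yield=1500 .+ 50 .* randn(Xoshiro(1), 30),     # one homogeneous N(1500, 50²) sample
)
combine(groupby(sim, :batch), :yield => mean => :mean_yield)
```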
The important point to take away from this example is that we must allow for the estimates of variance components to be zero.
We describe such a model as being *degenerate*, in the sense that the estimated distribution of the random effects is a [degenerate distribution](https://en.wikipedia.org/wiki/Degenerate_distribution), representing a point mass at zero.
It corresponds to a linear model in which we have removed the random effects associated with `batch`.
Degenerate models can and do occur in practice.
Even when the final fitted model is not degenerate, we must allow for such models when determining the parameter estimates through numerical optimization.
To reiterate, this model can be reduced to a linear model because the random effects are inert, in the sense that they have a variance of zero.
## Model formulation {#sec-modelformulation}
A common way of writing the statistical model being fit in `dsm01` and `dsm02` is as an expression for the $i$th observation in the $j$th batch
$$
y_{i,j}=\mu+b_j+\epsilon_{i,j},\quad i = 1,\dots,5;\;j=1,\dots,6
$$
with the further specification that the per-observation noise terms, $\epsilon_{i,j}$, are independently and identically distributed as $\mcN(0, \sigma^2)$ and the random effects, $b_j$, are independently and identically distributed as $\mcN(0, \sigma_1^2)$.
The three parameters in the model are the overall mean, $\mu$, the standard deviation of the random effects, $\sigma_1$, and the standard deviation of the per-observation noise, $\sigma$.
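A short simulation from this generative formulation may help to make the roles of the three parameters concrete. (The parameter values in this sketch are arbitrary and chosen only for illustration.)
```{julia}
μ, σ₁, σ = 1500.0, 40.0, 50.0      # illustrative values, not estimates from the data
rng = Xoshiro(42)                  # reproducible random number generator
b = σ₁ .* randn(rng, 6)            # bⱼ ~ N(0, σ₁²), one random effect per batch
y = [μ + b[j] + σ * randn(rng) for j in 1:6 for _ in 1:5]  # yᵢⱼ = μ + bⱼ + εᵢⱼ
length(y)                          # 30 simulated observations, 5 per batch
```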
Generalizing this formulation to unbalanced data sets and more complex models can become unwieldy.
For the range of models shown in this book it is more convenient to use a matrix-vector representation in which the random effects are described as a $q$-dimensional multivariate Gaussian random variable
$$
\mcB\sim\mcN(\mathbf{0},\sigma_1^2\bbI) .
$$ {#eq-mvbdist}
The multivariate Gaussian distribution is described in more detail in @sec-mvgaussian.
At this point the important thing to know is that @eq-mvbdist describes a vector of independent Gaussian random variables, each of which has mean $0$ and variance $\sigma_1^2$.
The random effects are associated with the observations through a *model matrix*, $\bbZ$.
For a model with $n$ observations and a total of $q$ random effects, $\bbZ$ will be of size $n\times q$.
Furthermore, it is sparse - meaning that most of the elements of the matrix are zeros.
In this case, $\bbZ$ is the indicator matrix for the `batch` covariate.
```{julia}
Zdsm01 = Int.(Matrix(only(dsm01.reterms)))
```
The expression that produces this output first extracts the random-effects term information from `dsm01`, uses `only` to check that there is exactly one such term and to extract it, converts that term to a matrix, and converts the matrix elements to integers to save space when printing.
In cases like this we will often print out the transpose of the model matrix to save more space
```{julia}
Zdsm01'
```
or, even more compact, as a sparse matrix pattern
```{julia}
sparse(Zdsm01')
```
The fixed-effects parameters, of which there is only one here, are gathered into a vector, $\bbbeta$, of length $p$, and multiplied by another model matrix, $\bbX$, of size $n\times p$.
```{julia}
dsm01.β
```
```{julia}
Int.(dsm01.X')
```
A linear model without random effects can be described as $\mcY\sim\mcN(\bbX\bbbeta, \sigma^2\bbI)$.
That is, the expected value of the response vector is the *linear predictor*, $\bbX\bbbeta$, and the covariance matrix of the multivariate Gaussian distribution for the response is $\sigma^2\bbI$, corresponding to independent, constant variance per-observation noise terms.
For the linear mixed model, we describe the conditional distribution of the response, given a particular value, $\bbb$, of the random effects as multivariate Gaussian with mean determined by a linear predictor expression involving both the assumed random effects value, $\bbb$, and the fixed-effects parameter vector, $\bbbeta$,
$$
(\mcY|\mcB=\bbb)\sim\mcN(\bbX\bbbeta+\bbZ\bbb,\sigma^2\bbI) .
$$
The complete model can be written as
$$
\begin{aligned}
(\mcY|\mcB=\bbb)&\sim\mcN(\bbX\bbbeta+\bbZ\bbb,\sigma^2\bbI)\\
\mcB&\sim\mcN(\mathbf{0},\bbSigma_\theta) .
\end{aligned}
$$ {#eq-lmmdist}
From these two distributions, the joint distribution of $\mcY$ and $\mcB$, which is also multivariate Gaussian, can be determined and from that the likelihood for the parameters, given the observed value $\bby$ of $\mcY$.
Details on the derivation and evaluation of the log-likelihood are given in @sec-lmmtheory.
At this point the important results are that the *profiled* log-likelihood for models `dsm01` and `dsm02` can be evaluated directly (i.e. without an iterative optimization) from a single parameter, $\theta=\sigma_1/\sigma$.
Furthermore, this profiled log-likelihood can be evaluated from a matrix of $\bbX'\bbX$-like products,
$$
\bbA=
\begin{bmatrix}
\bbZ'\bbZ & \bbZ'\bbX & \bbZ'\bby \\
\bbX'\bbZ & \bbX'\bbX & \bbX'\bby \\
\bby'\bbZ & \bby'\bbX & \bby'\bby
\end{bmatrix}
$$
which is stored in three pieces
```{julia}
first(dsm01.A)
```
```{julia}
dsm01.A[2]
```
```{julia}
last(dsm01.A)
```
The first, diagonal matrix is $\bbZ'\bbZ$.
A simple, scalar random effects term like `(1|batch)` in the formula for models `dsm01` and `dsm02` results in $\bbZ$ being the indicator matrix for the levels of the grouping factor and, hence, in a diagonal $\bbZ'\bbZ$, whose diagonal elements are the frequencies of the levels.
Thus this block shows that there are exactly 5 observations on each of the 6 batches.
The second, rectangular matrix is $\begin{bmatrix}
\bbZ'\bbX \\
\bbZ'\bby
\end{bmatrix}$.
Again, the fact that $\bbZ$ is the indicator matrix for the levels of the grouping factor and that $\bbX$ is a single column of 1's results in the frequencies of the levels of the factor in the first row of this block.
The second row of this block consists of the sums of the yields for each batch.
The third, square symmetric matrix is
$$
\begin{bmatrix}
\bbX'\bbX & \bbX'\bby \\
\bby'\bbX & \bby'\bby
\end{bmatrix}
$$
The parameters enter into the computation via a *relative covariance factor*, $\bbLambda_{\boldsymbol\theta}$, which is the $6\times 6$ matrix, $\theta\,\bbI_6$, in this case.
The method hinges on being able to evaluate efficiently the *Cholesky factor* (see @sec-cholesky for details) of
$$
\begin{bmatrix}
\bbLambda_\theta'\mathbf{Z'Z}\bbLambda_\theta+\bbI &
\bbLambda_\theta'\mathbf{Z'X} & \bbLambda_\theta'\mathbf{Z'y} \\
\mathbf{X'Z}\bbLambda_\theta & \mathbf{X'X} & \mathbf{X'y} \\
\mathbf{y'Z}\bbLambda_\theta & \mathbf{y'X} & \mathbf{y'y}
\end{bmatrix}
$$
The optimal value of $\theta$ for model `dsm01` is
```{julia}
only(dsm01.θ)
```
producing a lower Cholesky factor of
```{julia}
sparseL(dsm01; full=true)
```
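As a check on the interpretation of $\theta$ as $\sigma_1/\sigma$, we can compare this optimal value to the ratio of the estimated standard deviations from the fitted model.
```{julia}
only(dsm01.θ) ≈ only(dsm01.σs.batch) / dsm01.σ
```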
## Variability of parameter estimates {#sec-furtherassess}
The parameter estimates in a statistical model represent our "best guess" at the unknown values of the model parameters and, as such, are important results in statistical modeling.
However, they are not the whole story.
Statistical models characterize the variability in the data and we should assess the effect of this variability on the precision of the parameter estimates and of predictions made from the model.
One method of assessing the variability in the parameter estimates is through a [parametric bootstrap](https://en.wikipedia.org/wiki/Bootstrapping_(statistics)#Parametric_bootstrap), a process where a large number of data sets are simulated from the assumed model using the estimated parameter values as the "true" parameter values.
The distribution of the parameter estimators is inferred from the distribution of the parameter estimates from these generated data sets.
This method is well-suited to Julia code because refitting an existing model to a simulated data set can be very fast.
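To make the mechanism concrete, a single bootstrap replicate can be sketched by hand: simulate a new response vector from the fitted model, then refit. (This sketch assumes the `simulate!` and `refit!` utilities; the `parametricbootstrap` function used below carries out these steps far more efficiently.)
```{julia}
#| eval: false
let m = deepcopy(dsm01)        # work on a copy so that dsm01 is left untouched
  simulate!(Xoshiro(42), m)    # replace the response with one simulated from the model
  refit!(m; progress=false)    # re-estimate the parameters from the simulated response
  (; β=m.β, σ=m.σ, θ=m.θ)      # one bootstrap replicate of the parameter estimates
end
```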
For methods that involve simulation, it is best to initialize a random number generator to a known state so that the "random" sample can be reproduced if desired.
Beginning with Julia v1.7.0 the default random number generator is the `Xoshiro` generator, which we initialize to an arbitrary, but reproducible, value.
The object returned by a call to `parametricbootstrap` has a somewhat complex internal structure, which allows simulation from complex models while maintaining a comparatively small storage footprint.
To examine the distribution of the parameter estimates we extract a table of all the estimated parameters and convert it to a `DataFrame`.
```{julia}
#| label: dsm01samp
#| warning: false
Random.seed!(4321234) # initialize the default random number generator to a reproducible state
dsm01samp = parametricbootstrap(10_000, dsm01; progress=false)
dsm01pars = DataFrame(dsm01samp.allpars)
first(dsm01pars, 7)
```
:::{.callout-note}
### Switch to Table(dsm01samp.tbl) formulation
:::
```{julia}
last(dsm01pars, 7)
```
Plots of the bootstrap estimates for individual parameters are obtained by extracting subsets of the rows of this data frame using `subset` methods from the `DataFrames` package or the `@subset` macro from the `DataFrameMacros` package.
For example,
```{julia}
βdf = @subset(dsm01pars, :type == "β")
```
We begin by examining density plots constructed using the [AlgebraOfGraphics](https://github.com/JuliaGraphics/AlgebraOfGraphics.jl) package.
(In general we "fold" the code used to generate plots as it tends to have considerable detail that may distract from the plot itself.
You can check the details by clicking on the "Code" button in the HTML version of the plot or in the quarto files on the [github repository](https://github.com/JuliaMixedModels/EmbraceUncertainty) for this book.)
```{julia}
#| code-fold: true
#| fig-cap: Kernel density plot of bootstrap fixed-effects parameter estimates from dsm01
#| label: fig-dsm01_bs_beta_density
#| warning: false
draw(
data(@subset(dsm01pars, :type == "β")) *
mapping(
:value => "Bootstrap samples of β";
color=(:names => "Names"),
) *
AlgebraOfGraphics.density();
figure=(; size=(600, 340)),
)
```
The distribution of the estimates of `β₁` is more-or-less Gaussian (or "normal") in shape, with a mean value of `{julia} repr(mean(βdf.value), context=:compact=>true)`, which is close to the estimated `β₁` of `{julia} repr(only(dsm01.β), context=:compact=>true)`.
Similarly, the standard deviation of the simulated β values, `{julia} repr(std(βdf.value), context=:compact=>true)`, is close to the standard error of the parameter, `{julia} repr(only(stderror(dsm01)), context=:compact=>true)`.
In other words, the estimator of the fixed-effects parameter in this case behaves as we would expect.
The estimates are approximately normally distributed centered about the "true" parameter value with a standard deviation given by the standard error of the parameter.
The situation is different for the estimates of the standard deviation parameters, $\sigma$ and $\sigma_1$, as shown in @fig-dsm01_bs_sigma_density.
```{julia}
#| code-fold: true
#| fig-cap: Kernel density plot of bootstrap variance-component parameter estimates from model dsm01
#| label: fig-dsm01_bs_sigma_density
#| warning: false
draw(
data(@subset(dsm01pars, :type == "σ")) *
mapping(
:value => "Bootstrap samples of σ";
color=(:group => "Group"),
) *
AlgebraOfGraphics.density();
figure=(; size=(600, 340)),
)
```
The estimator for the residual standard deviation, $\sigma$, is approximately normally distributed but the estimator for $\sigma_1$, the standard deviation of the `batch` random effects is bimodal (i.e. has two "modes" or local maxima).
There is one peak around the "true" value for the simulation, `{julia} repr(only(dsm01.σs.batch), context=:compact=>true)`, and another peak at zero.
The apparent distribution of the estimates of $\sigma_1$ in @fig-dsm01_bs_sigma_density is being distorted by the method of approximating the density.
A [kernel density estimate](https://en.wikipedia.org/wiki/Kernel_density_estimation) approximates a probability density from a finite sample by blurring or smearing the positions of the sample values according to a *kernel* such as a narrow Gaussian distribution (see the linked article for details).
In this case the distribution of the estimates is a combination of a continuous distribution and a spike or point mass at zero as shown in a histogram, @fig-dsm01_bs_sigma_hist.
:::{.callout-note collapse="true"}
### Adjust the alpha in multiple histograms
Use a lower alpha in the colors for multiple histograms so the bars behind another color are more visible
:::
```{julia}
#| code-fold: true
#| fig-cap: Histogram of bootstrap variance-components as standard deviations from model dsm01
#| label: fig-dsm01_bs_sigma_hist
#| warning: false
draw(
data(@subset(dsm01pars, :type == "σ")) *
mapping(
:value => "Bootstrap parameter estimates of σ";
color=(:group => "Group"),
) *
AlgebraOfGraphics.histogram(; bins=80);
figure=(; size=(600, 340)),
)
```
Nearly 1000 of the 10,000 bootstrap fits are "singular" in that the estimated unconditional distribution of the random effects is degenerate.
(A model with multiple grouping factors is singular if one or more of the unconditional distributions of the random effects is degenerate.
For this model the only way singularity can occur is for $\sigma_1$ to be zero.)
```{julia}
#| label: dsm01samp_singular
count(issingular(dsm01samp))
```
The distribution of the estimator of $\sigma_1$ with this point mass at zero is not at all like a Gaussian or normal distribution.
This is why characterizing the uncertainty of these parameter estimates with a standard error is misleading.
Quoting an estimate and a standard error for the estimate is only meaningful if these values adequately characterize the distribution of the values of the estimator.
In many cases standard errors are quoted for estimates of the variance components, $\sigma_1^2$ and $\sigma^2$, whose distribution (@fig-dsm01_bs_sigmasqr_hist) in the bootstrap sample is even less like a normal or Gaussian distribution than is the distribution of estimates of $\sigma_1$.
```{julia}
#| code-fold: true
#| fig-cap: Histogram of bootstrap variance-components from model dsm01
#| label: fig-dsm01_bs_sigmasqr_hist
#| warning: false
draw(
data(@transform(@subset(dsm01pars, :type == "σ"), abs2(:value))) *
mapping(
:value_abs2 => "Bootstrap sample of estimates of σ²",
color=:group,
) *
AlgebraOfGraphics.histogram(; bins=200);
figure=(; size=(600, 340)),
)
```
The approach of creating coverage intervals from a bootstrap sample of parameter estimates, described in the next section, does accommodate non-normal distributions.
### Confidence intervals on the parameters {#sec-confints}
A bootstrap sample of parameter estimates also provides us with a method of creating bootstrap coverage intervals, which can be regarded as confidence intervals on the parameters.
For, say, a 95% coverage interval based on 10,000 parameter estimates we determine an interval that contains 95% (9,500, in this case) of the estimated parameter values.
There are many such intervals.
For example, from the sorted parameter values we could return the interval from the smallest value up to the 9,500th smallest, or from the second smallest up to the 9,501st smallest, and so on.
The `shortestcovint` method returns the shortest of all these potential intervals which will correspond to the interval with the highest empirical density.
```{julia}
DataFrame(shortestcovint(dsm01samp))
```
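Conceptually, the shortest coverage interval can be found by sliding a window containing a fixed number of values across the sorted sample and keeping the narrowest window. The following is a sketch of that idea, not the implementation used by `shortestcovint`.
```{julia}
function shortest_interval(x::AbstractVector{<:Real}, level=0.95)
  xs = sort(x)
  m = ceil(Int, level * length(xs))    # number of values the interval must contain
  widths = [xs[i + m - 1] - xs[i] for i in 1:(length(xs) - m + 1)]
  i = argmin(widths)                   # start of the narrowest window
  return (xs[i], xs[i + m - 1])
end

shortest_interval(βdf.value)           # compare with the β₁ interval shown above
```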
For the `(Intercept)` fixed-effects parameter the interval is similar to an "asymptotic" or Wald interval constructed as the estimate plus or minus two standard errors.
```{julia}
show(only(dsm01.β) .+ [-2, 2] * only(dsm01.stderror))
```
The interval on the residual standard deviation, $\sigma$, is reasonable, given that there are only 30 observations, but the interval of the standard deviation of the random effects, $\sigma_1$, extends all the way down to zero.
### Tracking the progress of the iterations {#sec-dsm01trace}
The optional argument, `thin=1`, in a call to `fit` causes all the values of $\boldsymbol\theta$, and the corresponding values of the objective, from the iterative optimization to be stored in the `optsum` property.
```{julia}
#| label: dsm01trace
dsm01trace = fit(
MixedModel,
@formula(yield ~ 1 + (1 | batch)),
dyestuff;
thin=1,
progress=false,
)
DataFrame(dsm01trace.optsum.fitlog)
```
Here the algorithm converges after 18 function evaluations to a profiled deviance of `{julia} repr(dsm01trace.objective, context=:compact=>true)` at θ = `{julia} repr(only(dsm01trace.theta), context=:compact=>true)`.
## Assessing the random effects {#sec-assessRE}
What are sometimes called the BLUPs (or best linear unbiased predictors) of the random effects, $\mcB$, are the mode (location of the maximum probability density) of the conditional distribution, $(\mcB|\mcY=\bby)$, evaluated at the parameter estimates and the observed response vector, $\bby$.
Although BLUP is an appealing acronym, we don't find the term particularly instructive (what is a "linear unbiased predictor" and in what sense are these the "best"?) and prefer the term "conditional mode" to describe this value.
These values are often considered as some sort of "estimates" of the random effects.
It can be helpful to think of them this way but it can also be misleading.
As we have stated, the random effects are not, strictly speaking, parameters --- they are unobserved random variables.
We don't estimate the random effects in the same sense that we estimate parameters.
Instead, we consider the conditional distribution of $\mcB$ given the observed data, $(\mcB|\mcY=\bby)$.
Because the unconditional distribution, $\mcB\sim\mcN(\mathbf{0},\bbSigma_\theta)$, is continuous, the conditional distribution, $(\mcB|\mcY=\bby)$, will also be continuous.
In general, the mode of a probability density is the point of maximum density, so the phrase "conditional mode" refers to the point at which this conditional density is maximized.
Because this definition relates to the probability model, the values of the parameters are assumed to be known.
In practice, of course, we don't know the values of the parameters (if we did there would be no purpose in forming the parameter estimates), so we use the estimated values of the parameters to evaluate the conditional modes.
Those who are familiar with the multivariate Gaussian distribution may recognize that, because both $\mcB$ and $(\mcY|\mcB=\bbb)$ are multivariate Gaussian, $(\mcB|\mcY=\bby)$ will also be multivariate Gaussian and the conditional mode will also be the conditional mean of $\mcB$, given $\mcY=\bby$.
This is the case for a linear mixed model but it does not carry over to other forms of mixed models.
In the general case all we can say about $\tilde{\bbb}$ is that they maximize a conditional density, which is why we use the term "conditional mode" to describe these values.
We will only use the term "conditional mean" and the symbol, $\bbmu$, in reference to $\mathrm{E}(\mcY|\mcB=\bbb)$, which is the conditional mean of $\mcY$ given $\mcB$, and an important part of the formulation of all types of mixed-effects models.
The conditional modes are available as a vector of matrices
```{julia}
only(dsm01.b)
```
In this case the vector consists of a single matrix because there is only one random-effects term, `(1|batch)`, in the model and, hence, only one grouping factor, `batch`, for the random effects.
There is only one row in the matrix because the random-effects term, `(1|batch)`, is a simple, scalar term.
To make this more explicit, random-effects terms in the model formula are those that contain the vertical bar (`|`) character.
The variable or expression on the right of the `|` is the grouping factor for the random effects generated by this term.
If the expression on the left of the vertical bar is `1`, as it is here, we describe the term as a *simple, scalar, random-effects term*.
The designation "scalar" means there will be exactly one random effect generated for each level of the grouping factor.
A simple, scalar term generates a block of indicator columns --- the indicators for the grouping factor --- in $\bbZ$.
Because there is only one random-effects term in this model and because that term is a simple, scalar term, the model matrix $\bbZ$ for this model is the indicator matrix for the levels of `batch`, as shown in @sec-modelformulation.
In the next chapter we fit models with multiple simple, scalar terms and, in subsequent chapters, we extend random-effects terms beyond simple, scalar terms.
When we have only simple, scalar terms in the model, each term has a unique grouping factor and the elements of the vector returned as the `b` property can be considered as associated with terms or with grouping factors.
In more complex models a particular grouping factor may occur in more than one term.
In such cases the terms associated with the same grouping factor are internally amalgamated into a single term.
Thus internally the random effects are associated with grouping factors, not the terms in the model formula.
Given the data, $\bby$, and the parameter estimates, we can evaluate a measure of the dispersion of $(\mcB|\mcY=\bby)$.
In the case of a linear mixed model, this is the conditional standard deviation, from which we can obtain a prediction interval.
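These conditional standard deviations can also be extracted numerically (here we assume the `condVar` extractor, which returns one variance-covariance array per grouping factor).
```{julia}
only(condVar(dsm01))
```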
A plot of these prediction intervals is sometimes called a *caterpillar plot* because it can look like a fuzzy caterpillar (e.g. @fig-iem01qqcaterpillard) when there are many levels of the grouping factor.
```{julia}
#| code-fold: true
#| fig-cap: Caterpillar plot of prediction intervals for dsm01 random effects
#| label: fig-caterpillar_dsm01
#| warning: false
caterpillar!(Figure(size=(600, 200)), ranefinfo(dsm01, :batch))
```
The `caterpillar` function returns a plot with linear spacing of the
levels on the y axis.
An alternative, the `qqcaterpillar` function
```{julia}
#| code-fold: true
#| fig-cap: Quantile caterpillar plot of prediction intervals for dsm01 random effects
#| label: fig-qqcaterpillar_dsm01
#| warning: false
qqcaterpillar!(Figure(size=(600, 200)), ranefinfo(dsm01, :batch))
```
returns a plot like @fig-qqcaterpillar_dsm01 where the intervals are plotted with vertical spacing corresponding to the quantiles of the standard normal distribution.
The `caterpillar` plot is preferred when there are only a few levels of the grouping factor, as in this case.
When there are hundreds or thousands of random effects the `qqcaterpillar` form is preferred because it focuses attention on the "important few" at the extremes and de-emphasizes the "trivial many" that are close to zero.
## Some stylistic conventions {#sec-stylistic}
To make it easier to recognize the form of the many models that will be fit in this book, we will adopt certain conventions regarding the argument specifications.
In particular we will establish the convention of specifying `contrasts` for categorical covariates not used as grouping factors, and the possibility of using certain transformations, such as centering and scaling, of numeric covariates.
The name `contrasts` is used in a general sense here to specify certain transformations that are to take place during the process of converting a formula and the structure, or [schema](https://en.wikipedia.org/wiki/Database_schema), of the data into model matrices.
The [StatsModels.jl package](https://github.com/JuliaStats/StatsModels.jl) allows for contrasts to be specified as a key-value dictionary where the keys are symbols and the values are of a type that specializes `StatsModels.AbstractContrasts`.
The [StandardizedPredictors.jl package](https://github.com/beacon-biosignals/StandardizedPredictors.jl) allows transformations to be coded as contrasts.
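As a purely hypothetical illustration, a contrasts dictionary might combine a coding scheme with a standardizing transformation. (The covariate names here are invented and neither entry is needed for the dyestuff models; `EffectsCoding` is defined in StatsModels and `Center` in StandardizedPredictors.)
```{julia}
#| eval: false
using StatsModels: EffectsCoding
using StandardizedPredictors: Center

contrasts = Dict(
  :treatment => EffectsCoding(),   # hypothetical categorical covariate
  :age => Center(),                # hypothetical numeric covariate, centered at its mean
)
```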
For models `dsm01` and `dsm02` the only covariate in the model formula is `batch`, which is a grouping factor for the random effects.
In the past it was helpful to specify a `Grouping()` contrast for grouping factors as
```{julia}
#| code-fold: false
#| eval: false
contrasts = Dict(:batch => Grouping())
```
but version 4.21 or later of the `MixedModels` package does this automatically.
(Symbols in Julia can be written as a colon followed by the name of the symbol.
The colon creates an expression but, if the expression consists of a single name, then the expression is the symbol.)
There is an advantage in assigning the contrasts dictionary to the name `contrasts` because a call to `fit` with the optional, named argument `contrasts`
```{julia}
#| code-fold: false
#| eval: false
dsm01 = fit(
MixedModel,
@formula(yield ~ 1 + (1 | batch)),
dyestuff;
contrasts=contrasts,
progress=false,
)
```
can be condensed to
```{julia}
#| code-fold: false
#| eval: false
dsm01 = fit(
MixedModel,
@formula(yield ~ 1 + (1 | batch)),
dyestuff;
contrasts,
progress=false,
)
```
That is, if the name of the object to be passed as a named argument is the same as the name of the argument, like `contrasts=contrasts`, then the name does not need to be repeated.
Note that the comma after the positional arguments must be changed to a semicolon in this convention.
This is necessary because it indicates that arguments following the semicolon are named arguments, not positional.
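A minimal illustration of this shorthand with a hypothetical function:
```{julia}
rescale(v; scalefactor=1.0) = v .* scalefactor   # hypothetical helper with a named argument

scalefactor = 2.5
rescale([1.0, 2.0]; scalefactor)   # equivalent to rescale([1.0, 2.0]; scalefactor=scalefactor)
```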
Similarly, we could (and will, in later chapters) bind the name `progress` to `false` in the Main environment and use `progress` as a named argument instead of `progress=false` to suppress output of the progress of the iterations.
Another convention we will use is assigning the formula separately from the call to `fit` as part of a `let` block.
Often the formula can become rather long, sometimes needing multiple lines in the call, and it becomes difficult to keep track of the other arguments.
Assigning the formula separately helps in keeping track of the arguments to `fit`.
A `let` block is a way of making temporary assignments that do not affect the global state.
An assignment to a variable name inside a `let` block is local to the block.
Thus `dsm01` can be assigned as
```{julia}
#| code-fold: false
#| eval: false
dsm01 = let f = @formula(yield ~ 1 + (1 | batch))
fit(MixedModel, f, dyestuff; contrasts, progress=false)
end
```
## Chapter summary {#sec-ChIntroSummary}
A considerable amount of material has been presented in this chapter, especially considering the word "simple" in its title (it's the model that is simple, not the material).
A summary may be in order.
A mixed-effects model incorporates fixed-effects parameters and random effects, which are unobserved random variables, $\mcB$.
In a linear mixed model, both the unconditional distribution of $\mcB$ and the conditional distribution, $(\mcY|\mcB=\bbb)$, are multivariate Gaussian distributions.
Furthermore, this conditional distribution is a spherical Gaussian with mean, $\bbmu$, determined by the linear predictor, $\bbZ\bbb+\bbX\bbbeta$.
That is,
$$
(\mcY|\mcB=\bbb)\sim
\mcN(\bbZ\bbb+\bbX\bbbeta, \sigma^2\bbI_n) .
$$
The unconditional distribution of $\mcB$ has mean $\mathbf{0}$ and a parameterized $q\times q$ variance-covariance matrix, $\bbSigma_\theta$.
In the models we considered in this chapter, $\bbSigma_\theta$ is a scalar multiple of the identity matrix, $\bbI_6$.
This matrix is always a multiple of the identity in models with just one random-effects term that is a simple, scalar term.
The reason for introducing all the machinery that we did is to allow for more general model specifications.
The maximum likelihood estimates of the parameters are obtained by minimizing the deviance.
For linear mixed models we can minimize the profiled deviance, which is a function of $\bbtheta$ only, thereby considerably simplifying the optimization problem.
To assess the precision of the parameter estimates we use a parametric bootstrap sample, when feasible.
Re-fitting a simple model, such as those shown in this chapter, to a randomly generated response vector is quite fast and it is reasonable to work with bootstrap samples of tens of thousands of replicates.
With larger data sets and more complex models, large bootstrap samples of parameter estimates could take much longer.
Prediction intervals from the conditional distribution of the random effects, given the observed data, allow us to assess the precision of the random effects.
*This page was rendered from git revision {{< git-rev short=true >}}.*