1705 significance and assumptions
mvanrongen committed May 17, 2024
1 parent 6d93e36 commit be3e935
Showing 23 changed files with 1,154 additions and 207 deletions.
2 changes: 1 addition & 1 deletion .quarto/idx/index.qmd.json


17 changes: 17 additions & 0 deletions _freeze/materials/checking-assumptions/execute-results/html.json

---
title: "Checking assumptions"
output: html_document
---

::: {.cell}

:::

::: {.cell}

:::

Generalised linear models allow us to relax certain assumptions that standard linear models make (linearity, equality of variance of the residuals, and normality of the residuals).

However, we cannot relax all of them. This section of the materials talks through the assumptions that remain important for GLMs, and how to assess them.
"markdown": "---\ntitle: \"Checking assumptions\"\noutput: html_document\n---\n\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n\n\nAlthough generalised linear models do allow us to relax certain assumptions compared to standard linear models (linearity, equality of variance of residuals, and normality of residuals).\n\nHowever, we cannot relax all of them. This section of the materials will talk through the important assumptions for GLMs, and how to assess them.\n\n## Libraries and functions\n\n::: {.callout-note collapse=\"true\"}\n## Click to expand\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(ggResidpanel)\n```\n:::\n\n## Python\n\n::: {.cell}\n\n```{.python .cell-code}\nfrom scipy.stats import *\n```\n:::\n\n:::\n\n:::\n\n## Assumption 1: Distribution of response variable\n\nAlthough we don't expect our response variable $y$ to be continuous and normally distributed (as we did in linear modelling), we do still expect its distribution to come from the \"exponential family\" of distributions.\n\nThe exponential family contains the following distributions, among others:\n\n- normal\n- exponential\n- Poisson \n- Bernoulli\n- binomial (for fixed number of trials)\n- chi-squared\n\nYou can use a histogram to visualise the distribution of your response variable, but it is typically most useful just to think about the nature of your response variable. For instance, binary variables will follow a Bernoulli distribution, proportional variables follow a binomial distribution, and most count variables will follow a Poisson distribution.\n\nIf you have a very unusual variable that doesn't follow one of these exponential family distributions, however, then a GLM will not be an appropriate choice. In other words, a GLM is not necessarily a magic fix!\n\n## Assumption 2: Correct link function\n\nA closely-related assumption to assumption 1 above, is that we have chosen the correct link function for our model.\n\nIf we have done so, then there should be a linear relationship between our *transformed* model and our response variable; in other words, if we have chosen the right link function, then we have correctly \"linearised\" our model.\n\n## Assumption 3: Independence\n\nWe expect that the each observation or datapoint in our sample is independent of all the others. Specifically, we expect that our set of $y$ response variables are independent of one another.\n\nFor this to be true, we have to make sure:\n\n- that we aren't treating technical replicates as true/biological replicates;\n- that we don't have observations/datapoints in our sample that are artificially similar to each other (compared to other datapoints);\n- that we don't have any nuisance/confounding variables that create \"clusters\" or hierarchy in our dataset;\n- that we haven't got repeated measures, i.e., multiple measurements/rows per individual in our sample\n\nThere is no diagnostic plot for assessing this assumption. 
## Assumption 2: Correct link function

Closely related to assumption 1 is the assumption that we have chosen the correct link function for our model.

If we have done so, then there should be a linear relationship between the *transformed* expected value of our response variable and our predictors; in other words, if we have chosen the right link function, then we have correctly "linearised" our model.
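There is no single diagnostic plot for this assumption, but one pragmatic check is to fit the same model with different candidate link functions and compare how well each fits. A minimal sketch, not from the course materials, assuming a hypothetical binary response `y` and predictor `x` in a data frame `df`:

```r
# Sketch: compare candidate link functions for a binary response.
# `df`, `y` and `x` are placeholder names, not objects from this course.
glm_logit  <- glm(y ~ x, family = binomial(link = "logit"),  data = df)
glm_probit <- glm(y ~ x, family = binomial(link = "probit"), data = df)

# Lower AIC suggests the better-fitting link function, all else being equal
AIC(glm_logit, glm_probit)
```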
## Assumption 3: Independence

We expect that each observation or datapoint in our sample is independent of all the others. Specifically, we expect that our set of $y$ response variables are independent of one another.

For this to be true, we have to make sure:

- that we aren't treating technical replicates as true/biological replicates;
- that we don't have observations/datapoints in our sample that are artificially similar to each other (compared to other datapoints);
- that we don't have any nuisance/confounding variables that create "clusters" or hierarchy in our dataset;
- that we haven't got repeated measures, i.e., multiple measurements/rows per individual in our sample.

There is no diagnostic plot for assessing this assumption. To determine whether your data are independent, you need to understand your experimental design.

You might find [this page](https://cambiotraining.github.io/experimental-design/materials/04-replication.html#criteria-for-true-independent-replication) useful if you're looking for more information on what counts as truly independent data.

## Good science: No influential observations

As with linear models, though this isn't always considered a "formal" assumption, we do want to ensure that there aren't any datapoints that are overly influencing our model.

A datapoint is overly influential, i.e., has high leverage, if removing that point from the dataset would cause large changes in the model coefficients. Datapoints with high leverage are typically those that don't follow the same general "trend" as the rest of the data.

The easiest way to check for overly influential points is to construct a Cook's distance plot.

Let's try that out, using the `diabetes` example dataset.

::: {.panel-tabset group="language"}
## R

::: {.cell}

```{.r .cell-code}
diabetes <- read_csv("data/diabetes.csv")
```

::: {.cell-output .cell-output-stderr}

```
Rows: 728 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (3): glucose, diastolic, test_result

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```

:::

```{.r .cell-code}
glm_dia <- glm(test_result ~ glucose * diastolic,
               family = "binomial",
               data = diabetes)
```
:::

## Python

::: {.cell}

```{.python .cell-code}
diabetes_py = pd.read_csv("data/diabetes.csv")

model = smf.glm(formula = "test_result ~ glucose * diastolic",
                family = sm.families.Binomial(),
                data = diabetes_py)

glm_dia_py = model.fit()
```
:::

:::

Once our model is fitted, we can construct a Cook's distance plot:

::: {.panel-tabset group="language"}
## R

::: {.cell}

```{.r .cell-code}
resid_panel(glm_dia, plots = "cookd")
```

::: {.cell-output-display}
![](checking-assumptions_files/figure-html/unnamed-chunk-7-1.png){width=672}
:::
:::

## Python
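The course materials leave this cell empty. As a sketch of a possible equivalent (an assumption on our part, not course code), `statsmodels` exposes Cook's distances for a fitted GLM through `get_influence()`, which can then be plotted with `matplotlib`:

```python
# Sketch (not from the course): Cook's distances for a statsmodels GLM.
import matplotlib.pyplot as plt

influence = glm_dia_py.get_influence()
cooks_d = influence.cooks_distance[0]  # element [1] holds the p-values

plt.stem(cooks_d)
plt.xlabel("observation")
plt.ylabel("Cook's distance")
plt.show()
```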
:::

Good news - there don't appear to be any overly influential points!

## Dispersion

Another thing that we want to check, primarily in Poisson regression, is whether our dispersion parameter is correct.

::: {.callout-note collapse="true"}

#### First, let's unpack what dispersion is!

Dispersion, in statistics, is a general term to describe the variability, scatter, or spread of a distribution. Variance is a common measure of dispersion that hopefully you are familiar with.

In a normal distribution, the mean (average) and the variance (dispersion) are independent of each other; we need both numbers, or parameters, to understand the shape of the distribution.

Other distributions, however, require different parameters to describe them in full. For a Poisson distribution, we need just one parameter, $\lambda$, which captures the expected rate of occurrences/expected count. The mean and variance of a Poisson distribution are actually expected to be the same.

In the context of a model, you can think about the dispersion as the degree to which the data are spread out around the model curve. A dispersion parameter of 1 means the data are spread out exactly as we expect; <1 is called underdispersion; and >1 is called overdispersion.
:::
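It's easy to convince yourself that the mean and variance of a Poisson distribution match by simulation. A quick sketch, not part of the course materials:

```r
# Simulate a large Poisson sample: the mean and the variance
# should both come out close to lambda.
set.seed(1)

x <- rpois(n = 100000, lambda = 4)

mean(x)  # approximately 4
var(x)   # also approximately 4
```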
### A "hidden assumption"

When we fit a linear model, because we're assuming a normal distribution, we take the time to estimate the dispersion - by measuring the variance.

When performing Poisson regression, however, we make an extra "hidden" assumption, in setting the dispersion parameter to 1. In other words, we expect the errors to have a certain spread to them that matches our theoretical distribution/model. This means we don't have to waste time and statistical power in estimating the dispersion.

However, if our data are underdispersed or overdispersed, then we might be violating this assumption we've made.

Underdispersion is quite rare. It's far more likely that you'll encounter overdispersion; in Poisson regression, this is usually caused by the presence of lots of zeroes in your response variable (known as zero-inflation).

In these situations, you may wish to fit a different GLM to the data. Negative binomial regression, for instance, is a common alternative for zero-inflated count data (see the sketch below).
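As an illustration of that alternative, here is a minimal sketch, not part of the course materials. It assumes the `MASS` package is installed and reuses the `islands` data that are loaded in the next section:

```r
# Sketch (not from the course): negative binomial regression as an
# alternative to Poisson regression when counts are overdispersed.
# Assumes MASS is installed and `islands` has been read in (see below).
library(MASS)

glm_nb <- glm.nb(species ~ area, data = islands)
summary(glm_nb)
```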
"supporting": [
"checking-assumptions_files"
],
"filters": [
"rmarkdown/pagebreak.lua"
],
"includes": {},
"engineDependencies": {},
"preserve": {},
"postProcess": true
}
}
17 changes: 17 additions & 0 deletions _freeze/materials/significance-testing/execute-results/html.json


1 change: 1 addition & 0 deletions _quarto.yml
@@ -33,6 +33,7 @@ metadata-files:
 # - "materials/_sidebar.yml"

 book:
+  bread-crumbs: false
   search:
     location: sidebar
   favicon: "_extensions/cambiotraining/courseformat/img/university-of-cambridge-favicon.ico"
