diff --git a/.quarto/idx/index.qmd.json b/.quarto/idx/index.qmd.json
index 215fdbf..0a143e7 100644
--- a/.quarto/idx/index.qmd.json
+++ b/.quarto/idx/index.qmd.json
@@ -1 +1 @@
-{"title":"Course overview","markdown":{"yaml":{"title":"Course overview","number-sections":false},"headingText":"Core aims","containsRefs":false,"markdown":"\n\nWelcome to the wonderful world of generalised linear models!\n\nThese sessions are intended to enable you to construct and use generalised linear models confidently.\n\nAs with all of our statistics courses our focus is not on mathematical derivations, but on developing an intuitive understanding of the underlying statistical concepts.\n\nAt the same time this is also *not* a \"how to mindlessly use a stats program\" course! We hope that at the end of this course you feel like you have a better grasp on what it is we're trying to do, and gained sufficient confidence in your coding skills to implement these statistical concepts in your own research!\n\n\nTo introduce sufficient understanding and coding experience for analysing data with non-continuous response variables.\n\n::: callout-note\n## Course aims\n\nTo know what to do when presented with an arbitrary data set e.g.\n\n1. Construct\n a. a logistic model for binary response variables\n b. a logistic model for proportion response variables\n c. a Poisson model for count response variables\n d. ~~a Negative Binomial model for count response variables~~ (to be added later)\n2. Plot the data and the fitted curve in each case for both continuous and categorical predictors\n3. Assess the significance of fit\n4. Assess assumption of the model\n:::\n","srcMarkdownNoYaml":"\n\nWelcome to the wonderful world of generalised linear models!\n\nThese sessions are intended to enable you to construct and use generalised linear models confidently.\n\nAs with all of our statistics courses our focus is not on mathematical derivations, but on developing an intuitive understanding of the underlying statistical concepts.\n\nAt the same time this is also *not* a \"how to mindlessly use a stats program\" course! 
We hope that at the end of this course you feel like you have a better grasp on what it is we're trying to do, and gained sufficient confidence in your coding skills to implement these statistical concepts in your own research!\n\n## Core aims\n\nTo introduce sufficient understanding and coding experience for analysing data with non-continuous response variables.\n\n::: callout-note\n## Course aims\n\nTo know what to do when presented with an arbitrary data set e.g.\n\n1. Construct\n a. a logistic model for binary response variables\n b. a logistic model for proportion response variables\n c. a Poisson model for count response variables\n d. ~~a Negative Binomial model for count response variables~~ (to be added later)\n2. Plot the data and the fitted curve in each case for both continuous and categorical predictors\n3. Assess the significance of fit\n4. Assess assumption of the model\n:::\n"},"formats":{"courseformat-html":{"identifier":{"display-name":"HTML","target-format":"courseformat-html","base-format":"html","extension-name":"courseformat"},"execute":{"fig-width":7,"fig-height":5,"fig-format":"retina","fig-dpi":96,"df-print":"default","error":false,"eval":true,"cache":null,"freeze":"auto","echo":true,"output":true,"warning":true,"include":true,"keep-md":false,"keep-ipynb":false,"ipynb":null,"enabled":null,"daemon":null,"daemon-restart":false,"debug":false,"ipynb-filters":[],"ipynb-shell-interactivity":null,"plotly-connected":true,"engine":"markdown"},"render":{"keep-tex":false,"keep-typ":false,"keep-source":false,"keep-hidden":false,"prefer-html":false,"output-divs":true,"output-ext":"html","fig-align":"default","fig-pos":null,"fig-env":null,"code-fold":"none","code-overflow":"scroll","code-link":true,"code-line-numbers":false,"code-tools":false,"tbl-colwidths":"auto","merge-includes":true,"inline-includes":false,"preserve-yaml":false,"latex-auto-mk":true,"latex-auto-install":true,"latex-clean":true,"latex-min-runs":1,"latex-max-runs":10,"latex-makeindex":"m
akeindex","latex-makeindex-opts":[],"latex-tlmgr-opts":[],"latex-input-paths":[],"latex-output-dir":null,"link-external-icon":false,"link-external-newwindow":false,"self-contained-math":false,"format-resources":[],"notebook-links":true,"shortcodes":[]},"pandoc":{"standalone":true,"wrap":"none","default-image-extension":"png","to":"html","toc":true,"number-sections":false,"filters":["courseformat"],"output-file":"index.html"},"language":{"toc-title-document":"Table of contents","toc-title-website":"On this page","related-formats-title":"Other Formats","related-notebooks-title":"Notebooks","source-notebooks-prefix":"Source","other-links-title":"Other Links","code-links-title":"Code Links","launch-dev-container-title":"Launch Dev Container","launch-binder-title":"Launch Binder","article-notebook-label":"Article Notebook","notebook-preview-download":"Download Notebook","notebook-preview-download-src":"Download Source","notebook-preview-back":"Back to Article","manuscript-meca-bundle":"MECA Bundle","section-title-abstract":"Abstract","section-title-appendices":"Appendices","section-title-footnotes":"Footnotes","section-title-references":"References","section-title-reuse":"Reuse","section-title-copyright":"Copyright","section-title-citation":"Citation","appendix-attribution-cite-as":"For attribution, please cite this work as:","appendix-attribution-bibtex":"BibTeX citation:","title-block-author-single":"Author","title-block-author-plural":"Authors","title-block-affiliation-single":"Affiliation","title-block-affiliation-plural":"Affiliations","title-block-published":"Published","title-block-modified":"Modified","title-block-keywords":"Keywords","callout-tip-title":"Tip","callout-note-title":"Note","callout-warning-title":"Warning","callout-important-title":"Important","callout-caution-title":"Caution","code-summary":"Code","code-tools-menu-caption":"Code","code-tools-show-all-code":"Show All Code","code-tools-hide-all-code":"Hide All Code","code-tools-view-source":"View 
Source","code-tools-source-code":"Source Code","tools-share":"Share","tools-download":"Download","code-line":"Line","code-lines":"Lines","copy-button-tooltip":"Copy to Clipboard","copy-button-tooltip-success":"Copied!","repo-action-links-edit":"Edit this page","repo-action-links-source":"View source","repo-action-links-issue":"Report an issue","back-to-top":"Back to top","search-no-results-text":"No results","search-matching-documents-text":"matching documents","search-copy-link-title":"Copy link to search","search-hide-matches-text":"Hide additional matches","search-more-match-text":"more match in this document","search-more-matches-text":"more matches in this document","search-clear-button-title":"Clear","search-text-placeholder":"","search-detached-cancel-button-title":"Cancel","search-submit-button-title":"Submit","search-label":"Search","toggle-section":"Toggle section","toggle-sidebar":"Toggle sidebar navigation","toggle-dark-mode":"Toggle dark mode","toggle-reader-mode":"Toggle reader mode","toggle-navigation":"Toggle navigation","crossref-fig-title":"Figure","crossref-tbl-title":"Table","crossref-lst-title":"Listing","crossref-thm-title":"Theorem","crossref-lem-title":"Lemma","crossref-cor-title":"Corollary","crossref-prp-title":"Proposition","crossref-cnj-title":"Conjecture","crossref-def-title":"Definition","crossref-exm-title":"Example","crossref-exr-title":"Exercise","crossref-ch-prefix":"Chapter","crossref-apx-prefix":"Appendix","crossref-sec-prefix":"Section","crossref-eq-prefix":"Equation","crossref-lof-title":"List of Figures","crossref-lot-title":"List of Tables","crossref-lol-title":"List of Listings","environment-proof-title":"Proof","environment-remark-title":"Remark","environment-solution-title":"Solution","listing-page-order-by":"Order By","listing-page-order-by-default":"Default","listing-page-order-by-date-asc":"Oldest","listing-page-order-by-date-desc":"Newest","listing-page-order-by-number-desc":"High to 
Low","listing-page-order-by-number-asc":"Low to High","listing-page-field-date":"Date","listing-page-field-title":"Title","listing-page-field-description":"Description","listing-page-field-author":"Author","listing-page-field-filename":"File Name","listing-page-field-filemodified":"Modified","listing-page-field-subtitle":"Subtitle","listing-page-field-readingtime":"Reading Time","listing-page-field-wordcount":"Word Count","listing-page-field-categories":"Categories","listing-page-minutes-compact":"{0} min","listing-page-category-all":"All","listing-page-no-matches":"No matching items","listing-page-words":"{0} words"},"metadata":{"lang":"en","fig-responsive":true,"quarto-version":"1.4.546","theme":["default","_extensions/cambiotraining/courseformat/theme.scss"],"number-depth":3,"code-copy":true,"revealjs-plugins":[],"bibliography":["references.bib"],"knitr":{"opts_knit":{"cache.path":".knitr_cache"}},"title":"Course overview"},"extensions":{"book":{"multiFile":true}}}},"projectFormats":["courseformat-html"]}
\ No newline at end of file
+{"title":"Course overview","markdown":{"yaml":{"title":"Course overview","author":["Vicki Hodgson, Matt Castle, Martin van Rongen"],"number-sections":false},"headingText":"Core aims","containsRefs":false,"markdown":"\n\nWelcome to the wonderful world of generalised linear models!\n\nThese sessions are intended to enable you to construct and use generalised linear models confidently.\n\nAs with all of our statistics courses our focus is not on mathematical derivations, but on developing an intuitive understanding of the underlying statistical concepts.\n\nAt the same time this is also *not* a \"how to mindlessly use a stats program\" course! We hope that at the end of this course you feel like you have a better grasp on what it is we're trying to do, and gained sufficient confidence in your coding skills to implement these statistical concepts in your own research!\n\n\nTo introduce sufficient understanding and coding experience for analysing data with non-continuous response variables.\n\n::: callout-note\n## Course aims\n\nTo know what to do when presented with an arbitrary data set e.g.\n\n1. Construct\n a. a logistic model for binary response variables\n b. a logistic model for proportion response variables\n c. a Poisson model for count response variables\n d. ~~a Negative Binomial model for count response variables~~ (to be added later)\n2. Plot the data and the fitted curve in each case for both continuous and categorical predictors\n3. Assess the significance of fit\n4. 
Assess the assumptions of the model\n:::\n\n## Authors\n\nAbout the author(s):\n\n- **Vicki Hodgson**\n \n \n _Affiliation_: Bioinformatics Training Facility, University of Cambridge \n _Roles_: writing - review & editing; conceptualisation; coding\n- **Martin van Rongen**\n \n \n _Affiliation_: Bioinformatics Training Facility, University of Cambridge \n _Roles_: writing - review & editing; conceptualisation; coding\n- **Matt Castle**\n \n _Affiliation_: Bioinformatics Training Facility, University of Cambridge \n _Roles_: conceptualisation; writing\n\n","srcMarkdownNoYaml":"\n\nWelcome to the wonderful world of generalised linear models!\n\nThese sessions are intended to enable you to construct and use generalised linear models confidently.\n\nAs with all of our statistics courses our focus is not on mathematical derivations, but on developing an intuitive understanding of the underlying statistical concepts.\n\nAt the same time this is also *not* a \"how to mindlessly use a stats program\" course! We hope that at the end of this course you feel like you have a better grasp on what it is we're trying to do, and gained sufficient confidence in your coding skills to implement these statistical concepts in your own research!\n\n## Core aims\n\nTo introduce sufficient understanding and coding experience for analysing data with non-continuous response variables.\n\n::: callout-note\n## Course aims\n\nTo know what to do when presented with an arbitrary data set e.g.\n\n1. Construct\n a. a logistic model for binary response variables\n b. a logistic model for proportion response variables\n c. a Poisson model for count response variables\n d. ~~a Negative Binomial model for count response variables~~ (to be added later)\n2. Plot the data and the fitted curve in each case for both continuous and categorical predictors\n3. Assess the significance of fit\n4. 
Assess the assumptions of the model\n:::\n\n## Authors\n\nAbout the author(s):\n\n- **Vicki Hodgson**\n \n \n _Affiliation_: Bioinformatics Training Facility, University of Cambridge \n _Roles_: writing - review & editing; conceptualisation; coding\n- **Martin van Rongen**\n \n \n _Affiliation_: Bioinformatics Training Facility, University of Cambridge \n _Roles_: writing - review & editing; conceptualisation; coding\n- **Matt Castle**\n \n _Affiliation_: Bioinformatics Training Facility, University of Cambridge \n _Roles_: conceptualisation; writing\n\n"},"formats":{"courseformat-html":{"identifier":{"display-name":"HTML","target-format":"courseformat-html","base-format":"html","extension-name":"courseformat"},"execute":{"fig-width":7,"fig-height":5,"fig-format":"retina","fig-dpi":96,"df-print":"default","error":false,"eval":true,"cache":null,"freeze":"auto","echo":true,"output":true,"warning":true,"include":true,"keep-md":false,"keep-ipynb":false,"ipynb":null,"enabled":null,"daemon":null,"daemon-restart":false,"debug":false,"ipynb-filters":[],"ipynb-shell-interactivity":null,"plotly-connected":true,"engine":"markdown"},"render":{"keep-tex":false,"keep-typ":false,"keep-source":false,"keep-hidden":false,"prefer-html":false,"output-divs":true,"output-ext":"html","fig-align":"default","fig-pos":null,"fig-env":null,"code-fold":"none","code-overflow":"scroll","code-link":true,"code-line-numbers":false,"code-tools":false,"tbl-colwidths":"auto","merge-includes":true,"inline-includes":false,"preserve-yaml":false,"latex-auto-mk":true,"latex-auto-install":true,"latex-clean":true,"latex-min-runs":1,"latex-max-runs":10,"latex-makeindex":"makeindex","latex-makeindex-opts":[],"latex-tlmgr-opts":[],"latex-input-paths":[],"latex-output-dir":null,"link-external-icon":false,"link-external-newwindow":false,"self-contained-math":false,"format-resources":[],"notebook-links":true,"shortcodes":[]},"pandoc":{"standalone":true,"wrap":"none","default-image-extension":"png","to":"html","toc":true
,"number-sections":false,"filters":["courseformat"],"output-file":"index.html"},"language":{"toc-title-document":"Table of contents","toc-title-website":"On this page","related-formats-title":"Other Formats","related-notebooks-title":"Notebooks","source-notebooks-prefix":"Source","other-links-title":"Other Links","code-links-title":"Code Links","launch-dev-container-title":"Launch Dev Container","launch-binder-title":"Launch Binder","article-notebook-label":"Article Notebook","notebook-preview-download":"Download Notebook","notebook-preview-download-src":"Download Source","notebook-preview-back":"Back to Article","manuscript-meca-bundle":"MECA Bundle","section-title-abstract":"Abstract","section-title-appendices":"Appendices","section-title-footnotes":"Footnotes","section-title-references":"References","section-title-reuse":"Reuse","section-title-copyright":"Copyright","section-title-citation":"Citation","appendix-attribution-cite-as":"For attribution, please cite this work as:","appendix-attribution-bibtex":"BibTeX citation:","title-block-author-single":"Author","title-block-author-plural":"Authors","title-block-affiliation-single":"Affiliation","title-block-affiliation-plural":"Affiliations","title-block-published":"Published","title-block-modified":"Modified","title-block-keywords":"Keywords","callout-tip-title":"Tip","callout-note-title":"Note","callout-warning-title":"Warning","callout-important-title":"Important","callout-caution-title":"Caution","code-summary":"Code","code-tools-menu-caption":"Code","code-tools-show-all-code":"Show All Code","code-tools-hide-all-code":"Hide All Code","code-tools-view-source":"View Source","code-tools-source-code":"Source Code","tools-share":"Share","tools-download":"Download","code-line":"Line","code-lines":"Lines","copy-button-tooltip":"Copy to Clipboard","copy-button-tooltip-success":"Copied!","repo-action-links-edit":"Edit this page","repo-action-links-source":"View source","repo-action-links-issue":"Report an 
issue","back-to-top":"Back to top","search-no-results-text":"No results","search-matching-documents-text":"matching documents","search-copy-link-title":"Copy link to search","search-hide-matches-text":"Hide additional matches","search-more-match-text":"more match in this document","search-more-matches-text":"more matches in this document","search-clear-button-title":"Clear","search-text-placeholder":"","search-detached-cancel-button-title":"Cancel","search-submit-button-title":"Submit","search-label":"Search","toggle-section":"Toggle section","toggle-sidebar":"Toggle sidebar navigation","toggle-dark-mode":"Toggle dark mode","toggle-reader-mode":"Toggle reader mode","toggle-navigation":"Toggle navigation","crossref-fig-title":"Figure","crossref-tbl-title":"Table","crossref-lst-title":"Listing","crossref-thm-title":"Theorem","crossref-lem-title":"Lemma","crossref-cor-title":"Corollary","crossref-prp-title":"Proposition","crossref-cnj-title":"Conjecture","crossref-def-title":"Definition","crossref-exm-title":"Example","crossref-exr-title":"Exercise","crossref-ch-prefix":"Chapter","crossref-apx-prefix":"Appendix","crossref-sec-prefix":"Section","crossref-eq-prefix":"Equation","crossref-lof-title":"List of Figures","crossref-lot-title":"List of Tables","crossref-lol-title":"List of Listings","environment-proof-title":"Proof","environment-remark-title":"Remark","environment-solution-title":"Solution","listing-page-order-by":"Order By","listing-page-order-by-default":"Default","listing-page-order-by-date-asc":"Oldest","listing-page-order-by-date-desc":"Newest","listing-page-order-by-number-desc":"High to Low","listing-page-order-by-number-asc":"Low to High","listing-page-field-date":"Date","listing-page-field-title":"Title","listing-page-field-description":"Description","listing-page-field-author":"Author","listing-page-field-filename":"File 
Name","listing-page-field-filemodified":"Modified","listing-page-field-subtitle":"Subtitle","listing-page-field-readingtime":"Reading Time","listing-page-field-wordcount":"Word Count","listing-page-field-categories":"Categories","listing-page-minutes-compact":"{0} min","listing-page-category-all":"All","listing-page-no-matches":"No matching items","listing-page-words":"{0} words"},"metadata":{"lang":"en","fig-responsive":true,"quarto-version":"1.4.546","theme":["default","_extensions/cambiotraining/courseformat/theme.scss"],"number-depth":3,"code-copy":true,"revealjs-plugins":[],"bibliography":["references.bib"],"knitr":{"opts_knit":{"cache.path":".knitr_cache"}},"title":"Course overview","author":["Vicki Hodgson, Matt Castle, Martin van Rongen"]},"extensions":{"book":{"multiFile":true}}}},"projectFormats":["courseformat-html"]}
\ No newline at end of file
diff --git a/_freeze/materials/checking-assumptions/execute-results/html.json b/_freeze/materials/checking-assumptions/execute-results/html.json
new file mode 100644
index 0000000..c947048
--- /dev/null
+++ b/_freeze/materials/checking-assumptions/execute-results/html.json
@@ -0,0 +1,17 @@
+{
+ "hash": "e60a019a08a121704741c223e9cc90c0",
+ "result": {
+ "engine": "knitr",
+ "markdown": "---\ntitle: \"Checking assumptions\"\noutput: html_document\n---\n\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n\n\nGeneralised linear models allow us to relax certain assumptions compared to standard linear models (linearity, equality of variance of residuals, and normality of residuals).\n\nHowever, we cannot relax all of them. This section of the materials will talk through the important assumptions for GLMs, and how to assess them.\n\n## Libraries and functions\n\n::: {.callout-note collapse=\"true\"}\n## Click to expand\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(ggResidpanel)\n```\n:::\n\n## Python\n\n::: {.cell}\n\n```{.python .cell-code}\nfrom scipy.stats import *\n```\n:::\n\n:::\n\n:::\n\n## Assumption 1: Distribution of response variable\n\nAlthough we don't expect our response variable $y$ to be continuous and normally distributed (as we did in linear modelling), we do still expect its distribution to come from the \"exponential family\" of distributions.\n\nThe exponential family contains the following distributions, among others:\n\n- normal\n- exponential\n- Poisson \n- Bernoulli\n- binomial (for fixed number of trials)\n- chi-squared\n\nYou can use a histogram to visualise the distribution of your response variable, but it is typically most useful just to think about the nature of your response variable. For instance, binary variables will follow a Bernoulli distribution, proportional variables follow a binomial distribution, and most count variables will follow a Poisson distribution.\n\nIf you have a very unusual variable that doesn't follow one of these exponential family distributions, however, then a GLM will not be an appropriate choice. 
In other words, a GLM is not necessarily a magic fix!\n\n## Assumption 2: Correct link function\n\nA closely related assumption to assumption 1 above is that we have chosen the correct link function for our model.\n\nIf we have done so, then there should be a linear relationship between our *transformed* model and our response variable; in other words, if we have chosen the right link function, then we have correctly \"linearised\" our model.\n\n## Assumption 3: Independence\n\nWe expect that each observation or datapoint in our sample is independent of all the others. Specifically, we expect that our $y$ response variables are independent of one another.\n\nFor this to be true, we have to make sure:\n\n- that we aren't treating technical replicates as true/biological replicates;\n- that we don't have observations/datapoints in our sample that are artificially similar to each other (compared to other datapoints);\n- that we don't have any nuisance/confounding variables that create \"clusters\" or hierarchy in our dataset;\n- that we haven't got repeated measures, i.e., multiple measurements/rows per individual in our sample.\n\nThere is no diagnostic plot for assessing this assumption. To determine whether your data are independent, you need to understand your experimental design.\n\nYou might find [this page](https://cambiotraining.github.io/experimental-design/materials/04-replication.html#criteria-for-true-independent-replication) useful if you're looking for more information on what counts as truly independent data.\n\n## Good science: No influential observations\n\nAs with linear models, though this isn't always considered a \"formal\" assumption, we do want to ensure that there aren't any datapoints that are overly influencing our model.\n\nA datapoint is overly influential, i.e., has high leverage, if removing that point from the dataset would cause large changes in the model coefficients. 
Datapoints with high leverage are typically those that don't follow the same general \"trend\" as the rest of the data.\n\nThe easiest way to check for overly influential points is to construct a Cook's distance plot.\n\nLet's try that out, using the `diabetes` example dataset.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndiabetes <- read_csv(\"data/diabetes.csv\")\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nRows: 728 Columns: 3\n── Column specification ────────────────────────────────────────────────────────\nDelimiter: \",\"\ndbl (3): glucose, diastolic, test_result\n\nℹ Use `spec()` to retrieve the full column specification for this data.\nℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\n```\n\n\n:::\n\n```{.r .cell-code}\nglm_dia <- glm(test_result ~ glucose * diastolic,\n family = \"binomial\",\n data = diabetes)\n```\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\ndiabetes_py = pd.read_csv(\"data/diabetes.csv\")\n\nmodel = smf.glm(formula = \"test_result ~ glucose * diastolic\", \n family = sm.families.Binomial(), \n data = diabetes_py)\n \nglm_dia_py = model.fit()\n```\n:::\n\n:::\n\nOnce our model is fitted, we can construct a Cook's distance plot:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n::: {.cell}\n\n```{.r .cell-code}\nresid_panel(glm_dia, plots = \"cookd\")\n```\n\n::: {.cell-output-display}\n![](checking-assumptions_files/figure-html/unnamed-chunk-7-1.png){width=672}\n:::\n:::\n\n\n## Python\n\n::: {.cell}\n\n:::\n\n:::\n\nGood news - there don't appear to be any overly influential points!\n\n## Dispersion\n\nAnother thing that we want to check, primarily in Poisson regression, is whether our dispersion parameter is correct.\n\n::: {.callout-note collapse=\"true\"}\n\n#### First, let's unpack what dispersion is!\n\nDispersion, in statistics, is a general term to describe the variability, scatter, or spread of a distribution. 
Variance is a common measure of dispersion that hopefully you are familiar with.\n\nIn a normal distribution, the mean (average) and the variance (dispersion) are independent of each other; we need both numbers, or parameters, to understand the shape of the distribution.\n\nOther distributions, however, require different parameters to describe them in full. For a Poisson distribution, we need just one parameter $\\lambda$, which captures the expected rate of occurrences/expected count. The mean and variance of a Poisson distribution are actually expected to be the same.\n\nIn the context of a model, you can think about the dispersion as the degree to which the data are spread out around the model curve. A dispersion parameter of 1 means the data are spread out exactly as we expect; <1 is called underdispersion; and >1 is called overdispersion.\n:::\n\n### A \"hidden assumption\"\n\nWhen we fit a linear model, because we're assuming a normal distribution, we take the time to estimate the dispersion by measuring the variance.\n\nWhen performing Poisson regression, however, we make an extra \"hidden\" assumption by setting the dispersion parameter to 1. In other words, we expect the errors to have a certain spread to them that matches our theoretical distribution/model. This means we don't have to waste time and statistical power in estimating the dispersion.\n\nHowever, if our data are underdispersed or overdispersed, then we might be violating this assumption we've made.\n\nUnderdispersion is quite rare. It's far more likely that you'll encounter overdispersion; in Poisson regression, this is usually caused by the presence of lots of zeroes in your response variable (known as zero-inflation).\n\nIn these situations, you may wish to fit a different GLM to the data. 
Negative binomial regression, for instance, is a common alternative for zero-inflated count data.\n\n### Checking the dispersion parameter\n\nThe easiest way to check dispersion in a model is to calculate the ratio of the residual deviance to the residual degrees of freedom.\n\nLet's practice doing this using a Poisson regression fitted to the `islands` dataset that you saw earlier in the course.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n::: {.cell}\n\n```{.r .cell-code}\nislands <- read_csv(\"data/islands.csv\")\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nRows: 35 Columns: 2\n── Column specification ────────────────────────────────────────────────────────\nDelimiter: \",\"\ndbl (2): species, area\n\nℹ Use `spec()` to retrieve the full column specification for this data.\nℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\n```\n\n\n:::\n\n```{.r .cell-code}\nglm_isl <- glm(species ~ area,\n data = islands, family = \"poisson\")\n```\n:::\n\n\n## Python\n\n::: {.cell}\n\n```{.python .cell-code}\nislands_py = pd.read_csv(\"data/islands.csv\")\n\nmodel = smf.glm(formula = \"species ~ area\",\n family = sm.families.Poisson(),\n data = islands_py)\n\nglm_isl_py = model.fit()\n```\n:::\n\n:::\n\nIf we take a look at the model output, we can see the two quantities we care about - residual deviance and residual degrees of freedom:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(glm_isl)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nglm(formula = species ~ area, family = \"poisson\", data = islands)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) 4.241129 0.041322 102.64 <2e-16 ***\narea 0.035613 0.001247 28.55 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n\n(Dispersion parameter for poisson family taken to be 1)\n\n Null deviance: 856.899 on 34 degrees of freedom\nResidual deviance: 30.437 on 33 degrees of freedom\nAIC: 282.66\n\nNumber of Fisher Scoring iterations: 3\n```\n\n\n:::\n:::\n\n\n## Python\n\n::: {.cell}\n\n```{.python .cell-code}\nprint(glm_isl_py.summary())\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n Generalized Linear Model Regression Results \n==============================================================================\nDep. Variable: species No. Observations: 35\nModel: GLM Df Residuals: 33\nModel Family: Poisson Df Model: 1\nLink Function: Log Scale: 1.0000\nMethod: IRLS Log-Likelihood: -139.33\nDate: Fri, 17 May 2024 Deviance: 30.437\nTime: 12:40:44 Pearson chi2: 30.3\nNo. Iterations: 4 Pseudo R-squ. (CS): 1.000\nCovariance Type: nonrobust \n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept 4.2411 0.041 102.636 0.000 4.160 4.322\narea 0.0356 0.001 28.551 0.000 0.033 0.038\n==============================================================================\n```\n\n\n:::\n:::\n\n:::\n\nThe residual deviance is 30.437, on 33 residual degrees of freedom. All we need to do is divide one by the other to get our dispersion parameter.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n::: {.cell}\n\n```{.r .cell-code}\nglm_isl$deviance/glm_isl$df.residual\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0.922334\n```\n\n\n:::\n:::\n\n\n## Python\n\n::: {.cell}\n\n```{.python .cell-code}\nprint(glm_isl_py.deviance/glm_isl_py.df_resid)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n0.9223340414458532\n```\n\n\n:::\n:::\n\n:::\n\nThe dispersion parameter here is 0.922. That's pretty good - not far off 1 at all.\n\nBut how can we check whether it is *significantly* different from 1? 
\n\nWell, you've actually already got the knowledge you need to do this, from the previous course section on significance testing. Specifically, the chi-squared goodness-of-fit test can be used to check whether the dispersion is within sensible limits.\n\nYou may have noticed that the two values we're using for the dispersion parameter are the same two numbers that we used in those chi-squared tests. For this Poisson regression fitted to the `islands` dataset, that goodness-of-fit test would look like this:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n::: {.cell}\n\n```{.r .cell-code}\n1 - pchisq(glm_isl$deviance, glm_isl$df.residual)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0.595347\n```\n\n\n:::\n:::\n\n\n## Python\n\n::: {.cell}\n\n```{.python .cell-code}\npvalue = chi2.sf(glm_isl_py.deviance, glm_isl_py.df_resid)\n\nprint(pvalue)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n0.5953470127463187\n```\n\n\n:::\n:::\n\n:::\n\nIf our chi-squared goodness-of-fit test returns a large (insignificant) p-value, as it does here, that tells us that we don't need to worry about the dispersion.\n\nIf our chi-squared goodness-of-fit test returned a small, significant p-value, this would tell us our model doesn't fit the data well. And, since dispersion is all about the spread of points around the model, it makes sense that these two things are so closely related!\n\n## Summary\n\nWhile generalised linear models make fewer assumptions than standard linear models, we do still expect certain things to be true about the model and our variables for GLMs to be valid. 
Checking most of these assumptions requires understanding your dataset, and diagnostic plots play a smaller role than they do for standard linear models.\n\n::: {.callout-tip}\n#### Key points\n\n- For a generalised linear model, we assume that we have chosen the correct link function, that our response variable follows a distribution from the exponential family, and that our data are independent\n- To assess these assumptions, we need to understand our dataset and variables\n- We can also use visualisation to determine whether we have overly influential (high leverage) data points\n- For Poisson regression, we should also investigate the dispersion parameter of our model, which we expect to be close to 1\n:::\n",
+ "supporting": [
+ "checking-assumptions_files"
+ ],
+ "filters": [
+ "rmarkdown/pagebreak.lua"
+ ],
+ "includes": {},
+ "engineDependencies": {},
+ "preserve": {},
+ "postProcess": true
+ }
+}
\ No newline at end of file
diff --git a/_freeze/materials/checking-assumptions/figure-html/unnamed-chunk-7-1.png b/_freeze/materials/checking-assumptions/figure-html/unnamed-chunk-7-1.png
new file mode 100644
index 0000000..b1a0a52
Binary files /dev/null and b/_freeze/materials/checking-assumptions/figure-html/unnamed-chunk-7-1.png differ
diff --git a/_freeze/materials/glm-practical-poisson/execute-results/html.json b/_freeze/materials/glm-practical-poisson/execute-results/html.json
index 5a4ad40..d6865f9 100644
--- a/_freeze/materials/glm-practical-poisson/execute-results/html.json
+++ b/_freeze/materials/glm-practical-poisson/execute-results/html.json
@@ -1,8 +1,8 @@
{
- "hash": "a7944edaa2f705428f83c869d555dfd3",
+ "hash": "bb8efe99ebfffabd6145784d8da531a1",
"result": {
"engine": "knitr",
- "markdown": "---\ntitle: \"Count data\"\n---\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n\n\n::: {.callout-tip}\n## Learning outcomes\n\n**Questions**\n\n- How do we analyse count data?\n\n**Objectives**\n\n- Be able to perform a poisson regression on count data\n:::\n\n## Libraries and functions\n\n::: {.callout-note collapse=\"true\"}\n## Click to expand\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n### Libraries\n### Functions\n\n## Python\n\n### Libraries\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# A maths library\nimport math\n# A Python data analysis and manipulation tool\nimport pandas as pd\n\n# Python equivalent of `ggplot2`\nfrom plotnine import *\n\n# Statistical models, conducting tests and statistical data exploration\nimport statsmodels.api as sm\n\n# Convenience interface for specifying models using formula strings and DataFrames\nimport statsmodels.formula.api as smf\n\n# Needed for additional probability functionality\nfrom scipy.stats import *\n```\n:::\n\n\n### Functions\n\n:::\n:::\n\nThe examples in this section use the following data sets:\n\n`data/islands.csv`\n\nThis is a data set comprising 35 observations of two variables (one dependent and one predictor). This records the number of species recorded on different small islands along with the area (km2) of the islands. The variables are `species` and `area`.\n\nThe second data set is on seat belts.\n\nThe `seatbelts` data set is a multiple time-series data set that was commissioned by the Department of Transport in 1984 to measure differences in deaths before and after front seat belt legislation was introduced on 31st January 1983. It provides monthly total numerical data on a number of incidents including those related to death and injury in Road Traffic Accidents (RTA's). 
The data set starts in January 1969 and observations run until December 1984.\n\nYou can find the file in `data/seatbelts.csv`\n\n## Load and visualise the data\n\nFirst we load the data, then we visualise it.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nislands <- read_csv(\"data/islands.csv\")\n```\n:::\n\n\nLet's have a glimpse at the data:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nislands\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 35 × 2\n species area\n \n 1 114 12.1\n 2 130 13.4\n 3 113 13.7\n 4 109 14.5\n 5 118 16.8\n 6 136 19.0\n 7 149 19.6\n 8 162 20.6\n 9 145 20.9\n10 148 21.0\n# ℹ 25 more rows\n```\n\n\n:::\n:::\n\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nislands_py = pd.read_csv(\"data/islands.csv\")\n```\n:::\n\n\nLet's have a glimpse at the data:\n\n\n::: {.cell}\n\n```{.python .cell-code}\nislands_py.head()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n species area\n0 114 12.076133\n1 130 13.405439\n2 113 13.723525\n3 109 14.540359\n4 118 16.792122\n```\n\n\n:::\n:::\n\n\n:::\n\nLooking at the data, we can see that there are two columns: `species`, which contains the number of species recorded on each island and `area`, which contains the surface area of the island in square kilometers.\n\nWe can plot the data:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(islands, aes(x = area, y = species)) +\n geom_point()\n```\n\n::: {.cell-output-display}\n![](glm-practical-poisson_files/figure-html/unnamed-chunk-8-1.png){width=672}\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(islands_py, aes(x = \"area\", y = \"species\")) +\n geom_point())\n```\n\n::: {.cell-output-display}\n![](glm-practical-poisson_files/figure-html/unnamed-chunk-9-1.png){width=614}\n:::\n:::\n\n:::\n\nIt looks as though `area` may have an effect on the number of species that we observe on each island. 
We note that the response variable is count data and so we try to construct a Poisson regression.\n\n## Constructing a model\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nglm_isl <- glm(species ~ area,\n data = islands, family = \"poisson\")\n```\n:::\n\n\nand we look at the model summary:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(glm_isl)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nglm(formula = species ~ area, family = \"poisson\", data = islands)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) 4.241129 0.041322 102.64 <2e-16 ***\narea 0.035613 0.001247 28.55 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for poisson family taken to be 1)\n\n Null deviance: 856.899 on 34 degrees of freedom\nResidual deviance: 30.437 on 33 degrees of freedom\nAIC: 282.66\n\nNumber of Fisher Scoring iterations: 3\n```\n\n\n:::\n:::\n\n\nThe output is strikingly similar to the logistic regression models (who’d have guessed, eh?) and the main numbers to extract from the output are the two numbers underneath `Estimate.Std` in the `Coefficients` table:\n\n```\n(Intercept) 4.241129\narea 0.035613\n```\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# create a generalised linear model\nmodel = smf.glm(formula = \"species ~ area\",\n family = sm.families.Poisson(),\n data = islands_py)\n# and get the fitted parameters of the model\nglm_isl_py = model.fit()\n```\n:::\n\n\nLet's look at the model output:\n\n\n::: {.cell}\n\n```{.python .cell-code}\nprint(glm_isl_py.summary())\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n Generalized Linear Model Regression Results \n==============================================================================\nDep. Variable: species No. 
Observations: 35\nModel: GLM Df Residuals: 33\nModel Family: Poisson Df Model: 1\nLink Function: Log Scale: 1.0000\nMethod: IRLS Log-Likelihood: -139.33\nDate: Tue, 06 Feb 2024 Deviance: 30.437\nTime: 16:16:33 Pearson chi2: 30.3\nNo. Iterations: 4 Pseudo R-squ. (CS): 1.000\nCovariance Type: nonrobust \n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept 4.2411 0.041 102.636 0.000 4.160 4.322\narea 0.0356 0.001 28.551 0.000 0.033 0.038\n==============================================================================\n```\n\n\n:::\n:::\n\n\n:::\n\nThese are the coefficients of the Poisson model equation and need to be placed in the following formula in order to estimate the expected number of species as a function of island size:\n\n$$ E(species) = \\exp(4.24 + 0.036 \\times area) $$\n\nInterpreting this requires a bit of thought (not much, but a bit).\nThe intercept coefficient, `4.24`, is related to the number of species we would expect on an island of zero area (this is statistics, not real life. You’d do well to remember that before you worry too much about what that even means). But in order to turn this number into something meaningful we have to exponentiate it. Since `exp(4.24) ≈ 70`, we can say that the baseline number of species the model expects on any island is 70. This isn’t actually the interesting bit though.\n\nThe coefficient of `area` is the fun bit. For starters we can see that it is a positive number which does mean that increasing `area` leads to increasing numbers of `species`. Good so far.\n\nBut what does the value `0.036` actually mean? Well, if we exponentiate it as well, we get `exp(0.036) ≈ 1.04`. This means that for every increase in `area` of 1 km^2 (the original units of the area variable), the number of species on the island is multiplied by `1.04`. 
So, an island of area 1 km^2 will have `1.04 x 70 ≈ 72` species.\n\nSo, in order to interpret Poisson coefficients, you have to exponentiate them.\n\n## Plotting the Poisson regression\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(islands, aes(area, species)) +\n geom_point() +\n geom_smooth(method = \"glm\", se = FALSE, fullrange = TRUE, \n method.args = list(family = poisson)) +\n xlim(10,50)\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\n`geom_smooth()` using formula = 'y ~ x'\n```\n\n\n:::\n\n::: {.cell-output-display}\n![](glm-practical-poisson_files/figure-html/unnamed-chunk-14-1.png){width=672}\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nmodel = pd.DataFrame({'area': list(range(10, 50))})\n\nmodel[\"pred\"] = glm_isl_py.predict(model)\n\nmodel.head()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n area pred\n0 10 99.212463\n1 11 102.809432\n2 12 106.536811\n3 13 110.399326\n4 14 114.401877\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(islands_py,\n aes(x = \"area\",\n y = \"species\")) +\n geom_point() +\n geom_line(model, aes(x = \"area\", y = \"pred\"), colour = \"blue\", size = 1))\n```\n\n::: {.cell-output-display}\n![](glm-practical-poisson_files/figure-html/unnamed-chunk-16-1.png){width=614}\n:::\n:::\n\n:::\n\n## Assumptions\n\nAs we mentioned earlier, Poisson regressions require that the variance of the data at any point is the same as the mean of the data at that point. 
We checked that earlier by looking at the residual deviance values.\n\nWe can look for influential points using the Cook’s distance plot:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nplot(glm_isl , which=4)\n```\n\n::: {.cell-output-display}\n![](glm-practical-poisson_files/figure-html/unnamed-chunk-17-3.png){width=672}\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# extract the Cook's distances\nglm_isl_py_resid = pd.DataFrame(glm_isl_py.\n get_influence().\n summary_frame()[\"cooks_d\"])\n\n# add row index \nglm_isl_py_resid['obs'] = glm_isl_py_resid.reset_index().index\n```\n:::\n\n\nWe can use these to create the plot:\n\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(glm_isl_py_resid,\n aes(x = \"obs\",\n y = \"cooks_d\")) +\n geom_segment(aes(x = \"obs\", y = \"cooks_d\", xend = \"obs\", yend = 0)) +\n geom_point())\n```\n\n::: {.cell-output-display}\n![](glm-practical-poisson_files/figure-html/unnamed-chunk-19-1.png){width=614}\n:::\n:::\n\n\n:::\n\nNone of our points have particularly large Cook’s distances and so life is rosy.\n\n## Assessing significance\n\nWe can ask the same three questions we asked before.\n\n1. Is the model well-specified?\n2. Is the overall model better than the null model?\n3. 
Are any of the individual predictors significant?\n\nAgain, in this case, questions 2 and 3 are effectively asking the same thing because we still only have a single predictor variable.\n\nTo assess if the model is any good we’ll again use the residual deviance and the residual degrees of freedom.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\n1 - pchisq(30.437, 33)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0.5953482\n```\n\n\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nchi2.sf(30.437, 33)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n0.5953481872979622\n```\n\n\n:::\n:::\n\n\n:::\n\nThis gives a probability of `0.60`. This suggests that this model is actually a reasonably decent one and that the data are pretty well supported by the model. For Poisson models this has an extra interpretation: it can be used to assess whether we have significant over-dispersion in our data.\n\nFor a Poisson model to be appropriate we need the variance of the data to be the same as the mean of the data. Visually, this would correspond to the data spreading out more for higher predicted values of `species`. However, we don’t want the data to spread out too much. If that happens then a Poisson model wouldn’t be appropriate.\n\nThe easy way to check this is to look at the ratio of the residual deviance to the residual degrees of freedom (in this case `0.922`). For a Poisson model to be valid, this ratio should be about 1. 
If the ratio is significantly bigger than 1 then we say that we have over-dispersion in the model and we wouldn’t be able to trust any of the significance testing that we are about to do using a Poisson regression.\n\nThankfully the probability we have just created (`0.60`) is exactly the right one we need to look at to assess whether we have significant over-dispersion in our model.\n\nSecondly, to assess whether the overall model, with all of the terms, is better than the null model we’ll look at the difference in deviances and the difference in degrees of freedom:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\n1 - pchisq(856.899 - 30.437, 34 - 33)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0\n```\n\n\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nchi2.sf(856.899 - 30.437, 34 - 33)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n9.524927068555617e-182\n```\n\n\n:::\n:::\n\n:::\n\nThis gives a reported p-value of pretty much zero, which is pretty damn small. So, yes, this model is better than nothing at all and species does appear to change with some of our predictors\n\nFinally, we’ll construct an analysis of deviance table to look at the individual terms:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nanova(glm_isl , test = \"Chisq\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nAnalysis of Deviance Table\n\nModel: poisson, link: log\n\nResponse: species\n\nTerms added sequentially (first to last)\n\n Df Deviance Resid. Df Resid. Dev Pr(>Chi) \nNULL 34 856.90 \narea 1 826.46 33 30.44 < 2.2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n```\n\n\n:::\n:::\n\n\nThe p-value in this table is just as small as we’d expect given our previous result (`<2.2e-16` is pretty close to 0), and we have the nice consistent result that `area` definitely has an effect on `species`.\n\n## Python\n\nAs mentioned before, this is not quite possible in Python.\n:::\n\n## Exercises\n\n### Seat belts {#sec-exr_seatbelts}\n\n:::{.callout-exercise}\n\n\n{{< level 2 >}}\n\n\n\nFor this exercise we'll be using the data from `data/seatbelts.csv`.\n\nI'd like you to do the following:\n\n1. Load the data\n2. Visualise the data and create a poisson regression model\n3. Plot the regression model on top of the data\n4. Assess if the model is a decent predictor for the number of fatalities\n\n::: {.callout-answer collapse=\"true\"}\n\n#### Load and visualise the data\n\nFirst we load the data, then we visualise it.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseatbelts <- read_csv(\"data/seatbelts.csv\")\n```\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nseatbelts_py = pd.read_csv(\"data/seatbelts.csv\")\n```\n:::\n\n\nLet's have a glimpse at the data:\n\n\n::: {.cell}\n\n```{.python .cell-code}\nseatbelts_py.head()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n casualties drivers front rear ... van_killed law year month\n0 107 1687 867 269 ... 12 0 1969 Jan\n1 97 1508 825 265 ... 6 0 1969 Feb\n2 102 1507 806 319 ... 12 0 1969 Mar\n3 87 1385 814 407 ... 8 0 1969 Apr\n4 119 1632 991 454 ... 10 0 1969 May\n\n[5 rows x 10 columns]\n```\n\n\n:::\n:::\n\n:::\n\nThe data tracks the number of drivers killed in road traffic accidents, before and after the seat belt law was introduced. 
The information on whether the law was in place is encoded in the `law` column as `0` (law not in place) or `1` (law in place).\n\nThere are many more observations when the law was *not* in place, so we need to keep this in mind when we're interpreting the data.\n\nFirst we have a look at the data comparing no law vs law:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nWe have to convert the `law` column to a factor, otherwise R will see it as numerical.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseatbelts %>% \n ggplot(aes(as_factor(law), casualties)) +\n geom_boxplot()\n```\n\n::: {.cell-output-display}\n![](glm-practical-poisson_files/figure-html/unnamed-chunk-28-1.png){width=672}\n:::\n:::\n\n\nThe data are recorded by month and year, so we can also display the number of drivers killed by year:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseatbelts %>% \n ggplot(aes(year, casualties)) +\n geom_point()\n```\n\n::: {.cell-output-display}\n![](glm-practical-poisson_files/figure-html/unnamed-chunk-29-1.png){width=672}\n:::\n:::\n\n\n## Python\n\nWe have to convert the `law` column to a categorical variable, otherwise Python will see it as numerical.\n\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(seatbelts_py,\n aes(x = seatbelts_py.law.astype(object),\n y = \"casualties\")) +\n geom_boxplot())\n```\n\n::: {.cell-output-display}\n![](glm-practical-poisson_files/figure-html/unnamed-chunk-30-1.png){width=614}\n:::\n:::\n\n\nThe data are recorded by month and year, so we can also display the number of casualties by year:\n\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(seatbelts_py,\n aes(x = \"year\",\n y = \"casualties\")) +\n geom_point())\n```\n\n::: {.cell-output-display}\n![](glm-practical-poisson_files/figure-html/unnamed-chunk-31-3.png){width=614}\n:::\n:::\n\n\n:::\n\nThe data look a bit weird. There is quite some variation within years (keeping in mind that the data are aggregated monthly). The data also seem to wave around a bit... with some vague peaks (e.g. 
1972 - 1973) and some troughs (e.g. around 1976).\n\nSo my initial thought is that these data are going to be a bit tricky to interpret. But that's OK.\n\n#### Constructing a model\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nglm_stb <- glm(casualties ~ year,\n data = seatbelts, family = \"poisson\")\n```\n:::\n\n\nand we look at the model summary:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(glm_stb)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nglm(formula = casualties ~ year, family = \"poisson\", data = seatbelts)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) 37.168958 2.796636 13.29 <2e-16 ***\nyear -0.016373 0.001415 -11.57 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for poisson family taken to be 1)\n\n Null deviance: 984.50 on 191 degrees of freedom\nResidual deviance: 850.41 on 190 degrees of freedom\nAIC: 2127.2\n\nNumber of Fisher Scoring iterations: 4\n```\n\n\n:::\n:::\n\n\n```\n(Intercept) 37.168958\nyear -0.016373\n```\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# create a linear model\nmodel = smf.glm(formula = \"casualties ~ year\",\n family = sm.families.Poisson(),\n data = seatbelts_py)\n# and get the fitted parameters of the model\nglm_stb_py = model.fit()\n```\n:::\n\n::: {.cell}\n\n```{.python .cell-code}\nprint(glm_stb_py.summary())\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n Generalized Linear Model Regression Results \n==============================================================================\nDep. Variable: casualties No. Observations: 192\nModel: GLM Df Residuals: 190\nModel Family: Poisson Df Model: 1\nLink Function: Log Scale: 1.0000\nMethod: IRLS Log-Likelihood: -1061.6\nDate: Tue, 06 Feb 2024 Deviance: 850.41\nTime: 16:16:38 Pearson chi2: 862.\nNo. Iterations: 4 Pseudo R-squ. 
(CS): 0.5026\nCovariance Type: nonrobust \n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept 37.1690 2.797 13.291 0.000 31.688 42.650\nyear -0.0164 0.001 -11.569 0.000 -0.019 -0.014\n==============================================================================\n```\n\n\n:::\n:::\n\n\n```\n======================\n coef \n----------------------\nIntercept 37.1690 \nyear -0.0164 \n======================\n```\n:::\n\nThese are the coefficients of the Poisson model equation and need to be placed in the following formula in order to estimate the expected number of casualties as a function of year:\n\n$$ E(casualties) = \\exp(37.17 - 0.016 \\times year) $$\n\n#### Assessing significance\n\nIs the model well-specified?\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\n1 - pchisq(850.41, 190)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0\n```\n\n\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nchi2.sf(850.41, 190)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n3.1319689119997022e-84\n```\n\n\n:::\n:::\n\n:::\n\nThis p-value is essentially zero. Remember, for this goodness-of-fit test it is a large (insignificant) p-value that tells us the model fits well, so a value this small suggests the model is not well-specified. The ratio of residual deviance to residual degrees of freedom, `850.41 / 190 ≈ 4.5`, is well above 1, so we have over-dispersion and should interpret the following significance tests with caution.\n\nHow about the overall fit?\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\n1 - pchisq(984.50 - 850.41, 191 - 190)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0\n```\n\n\n:::\n:::\n\n\n## Python\n\nFirst we need to define the null model:\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# create a generalised linear model\nmodel = smf.glm(formula = \"casualties ~ 1\",\n family = sm.families.Poisson(),\n data = seatbelts_py)\n# and get the fitted parameters of the model\nglm_stb_null_py = model.fit()\n\nprint(glm_stb_null_py.summary())\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n Generalized Linear Model Regression Results \n==============================================================================\nDep. Variable: casualties No. Observations: 192\nModel: GLM Df Residuals: 191\nModel Family: Poisson Df Model: 0\nLink Function: Log Scale: 1.0000\nMethod: IRLS Log-Likelihood: -1128.6\nDate: Tue, 06 Feb 2024 Deviance: 984.50\nTime: 16:16:38 Pearson chi2: 1.00e+03\nNo. Iterations: 4 Pseudo R-squ. 
(CS): 1.942e-13\nCovariance Type: nonrobust \n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept 4.8106 0.007 738.670 0.000 4.798 4.823\n==============================================================================\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.python .cell-code}\nchi2.sf(984.50 - 850.41, 191 - 190)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n5.2214097202831414e-31\n```\n\n\n:::\n:::\n\n:::\n\nAgain, this indicates that the model is markedly better than the null model.\n\n#### Plotting the regression\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(seatbelts, aes(year, casualties)) +\n geom_point() +\n geom_smooth(method = \"glm\", se = FALSE, fullrange = TRUE, \n method.args = list(family = poisson)) +\n xlim(1970,1985)\n```\n\n::: {.cell-output-display}\n![](glm-practical-poisson_files/figure-html/unnamed-chunk-41-1.png){width=672}\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nmodel = pd.DataFrame({'year': list(range(1968, 1985))})\n\nmodel[\"pred\"] = glm_stb_py.predict(model)\n\nmodel.head()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n year pred\n0 1968 140.737690\n1 1969 138.452153\n2 1970 136.203733\n3 1971 133.991827\n4 1972 131.815842\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(seatbelts_py,\n aes(x = \"year\",\n y = \"casualties\")) +\n geom_point() +\n geom_line(model, aes(x = \"year\", y = \"pred\"), colour = \"blue\", size = 1))\n```\n\n::: {.cell-output-display}\n![](glm-practical-poisson_files/figure-html/unnamed-chunk-43-1.png){width=614}\n:::\n:::\n\n:::\n\n\n#### Conclusions\n\nThe model we constructed captures the overall downward trend in the number of fatalities, but given the amount of scatter in the data we should be cautious about treating it as a reliable predictor.\n\n:::\n:::\n\n## Summary\n\n::: {.callout-tip}\n#### Key points\n\n- Poisson regression is useful when 
dealing with count data\n:::\n",
+ "markdown": "---\ntitle: \"Count data\"\n---\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n\n\n::: {.callout-tip}\n## Learning outcomes\n\n**Questions**\n\n- How do we analyse count data?\n\n**Objectives**\n\n- Be able to perform a Poisson regression on count data\n:::\n\n## Libraries and functions\n\n::: {.callout-note collapse=\"true\"}\n## Click to expand\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n### Libraries\n### Functions\n\n## Python\n\n### Libraries\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# A maths library\nimport math\n# A Python data analysis and manipulation tool\nimport pandas as pd\n\n# Python equivalent of `ggplot2`\nfrom plotnine import *\n\n# Statistical models, conducting tests and statistical data exploration\nimport statsmodels.api as sm\n\n# Convenience interface for specifying models using formula strings and DataFrames\nimport statsmodels.formula.api as smf\n\n# Needed for additional probability functionality\nfrom scipy.stats import *\n```\n:::\n\n\n### Functions\n\n:::\n:::\n\nThe examples in this section use the following data sets:\n\n`data/islands.csv`\n\nThis is a data set comprising 35 observations of two variables (one dependent and one predictor). It records the number of species recorded on different small islands along with the area (km2) of the islands. The variables are `species` and `area`.\n\nThe second data set is on seat belts.\n\nThe `seatbelts` data set is a multiple time-series data set that was commissioned by the Department of Transport in 1984 to measure differences in deaths before and after front seat belt legislation was introduced on 31st January 1983. It provides monthly total numerical data on a number of incidents including those related to death and injury in Road Traffic Accidents (RTAs). 
The data set starts in January 1969 and observations run until December 1984.\n\nYou can find the file in `data/seatbelts.csv`\n\n## Load and visualise the data\n\nFirst we load the data, then we visualise it.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nislands <- read_csv(\"data/islands.csv\")\n```\n:::\n\n\nLet's have a glimpse at the data:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nislands\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 35 × 2\n species area\n \n 1 114 12.1\n 2 130 13.4\n 3 113 13.7\n 4 109 14.5\n 5 118 16.8\n 6 136 19.0\n 7 149 19.6\n 8 162 20.6\n 9 145 20.9\n10 148 21.0\n# ℹ 25 more rows\n```\n\n\n:::\n:::\n\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nislands_py = pd.read_csv(\"data/islands.csv\")\n```\n:::\n\n\nLet's have a glimpse at the data:\n\n\n::: {.cell}\n\n```{.python .cell-code}\nislands_py.head()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n species area\n0 114 12.076133\n1 130 13.405439\n2 113 13.723525\n3 109 14.540359\n4 118 16.792122\n```\n\n\n:::\n:::\n\n\n:::\n\nLooking at the data, we can see that there are two columns: `species`, which contains the number of species recorded on each island and `area`, which contains the surface area of the island in square kilometers.\n\nWe can plot the data:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(islands, aes(x = area, y = species)) +\n geom_point()\n```\n\n::: {.cell-output-display}\n![](glm-practical-poisson_files/figure-html/unnamed-chunk-8-1.png){width=672}\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(islands_py, aes(x = \"area\", y = \"species\")) +\n geom_point())\n```\n\n::: {.cell-output-display}\n![](glm-practical-poisson_files/figure-html/unnamed-chunk-9-1.png){width=614}\n:::\n:::\n\n:::\n\nIt looks as though `area` may have an effect on the number of species that we observe on each island. 
We note that the response variable is count data and so we try to construct a Poisson regression.\n\n## Constructing a model\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nglm_isl <- glm(species ~ area,\n data = islands, family = \"poisson\")\n```\n:::\n\n\nand we look at the model summary:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(glm_isl)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nglm(formula = species ~ area, family = \"poisson\", data = islands)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) 4.241129 0.041322 102.64 <2e-16 ***\narea 0.035613 0.001247 28.55 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for poisson family taken to be 1)\n\n Null deviance: 856.899 on 34 degrees of freedom\nResidual deviance: 30.437 on 33 degrees of freedom\nAIC: 282.66\n\nNumber of Fisher Scoring iterations: 3\n```\n\n\n:::\n:::\n\n\nThe output is strikingly similar to that of the logistic regression models (who’d have guessed, eh?) and the main numbers to extract from the output are the two numbers underneath `Estimate` in the `Coefficients` table:\n\n```\n(Intercept) 4.241129\narea 0.035613\n```\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# create a generalised linear model\nmodel = smf.glm(formula = \"species ~ area\",\n family = sm.families.Poisson(),\n data = islands_py)\n# and get the fitted parameters of the model\nglm_isl_py = model.fit()\n```\n:::\n\n\nLet's look at the model output:\n\n\n::: {.cell}\n\n```{.python .cell-code}\nprint(glm_isl_py.summary())\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n Generalized Linear Model Regression Results \n==============================================================================\nDep. Variable: species No. 
Observations: 35\nModel: GLM Df Residuals: 33\nModel Family: Poisson Df Model: 1\nLink Function: Log Scale: 1.0000\nMethod: IRLS Log-Likelihood: -139.33\nDate: Fri, 17 May 2024 Deviance: 30.437\nTime: 12:30:01 Pearson chi2: 30.3\nNo. Iterations: 4 Pseudo R-squ. (CS): 1.000\nCovariance Type: nonrobust \n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept 4.2411 0.041 102.636 0.000 4.160 4.322\narea 0.0356 0.001 28.551 0.000 0.033 0.038\n==============================================================================\n```\n\n\n:::\n:::\n\n\n:::\n\nThese are the coefficients of the Poisson model equation and need to be placed in the following formula in order to estimate the expected number of species as a function of island size:\n\n$$ E(species) = \\exp(4.24 + 0.036 \\times area) $$\n\nInterpreting this requires a bit of thought (not much, but a bit).\nThe intercept coefficient, `4.24`, is related to the number of species we would expect on an island of zero area (this is statistics, not real life. You’d do well to remember that before you worry too much about what that even means). But in order to turn this number into something meaningful we have to exponentiate it. Since `exp(4.24) ≈ 70`, we can say that the baseline number of species the model expects on any island is 70. This isn’t actually the interesting bit though.\n\nThe coefficient of `area` is the fun bit. For starters we can see that it is a positive number which does mean that increasing `area` leads to increasing numbers of `species`. Good so far.\n\nBut what does the value `0.036` actually mean? Well, if we exponentiate it as well, we get `exp(0.036) ≈ 1.04`. This means that for every increase in `area` of 1 km^2 (the original units of the area variable), the number of species on the island is multiplied by `1.04`. 
So, an island of area 1 km^2 will have `1.04 x 70 ≈ 72` species.\n\nSo, in order to interpret Poisson coefficients, you have to exponentiate them.\n\n## Plotting the Poisson regression\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(islands, aes(area, species)) +\n geom_point() +\n geom_smooth(method = \"glm\", se = FALSE, fullrange = TRUE, \n method.args = list(family = poisson)) +\n xlim(10,50)\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\n`geom_smooth()` using formula = 'y ~ x'\n```\n\n\n:::\n\n::: {.cell-output-display}\n![](glm-practical-poisson_files/figure-html/unnamed-chunk-14-1.png){width=672}\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nmodel = pd.DataFrame({'area': list(range(10, 50))})\n\nmodel[\"pred\"] = glm_isl_py.predict(model)\n\nmodel.head()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n area pred\n0 10 99.212463\n1 11 102.809432\n2 12 106.536811\n3 13 110.399326\n4 14 114.401877\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(islands_py,\n aes(x = \"area\",\n y = \"species\")) +\n geom_point() +\n geom_line(model, aes(x = \"area\", y = \"pred\"), colour = \"blue\", size = 1))\n```\n\n::: {.cell-output-display}\n![](glm-practical-poisson_files/figure-html/unnamed-chunk-16-1.png){width=614}\n:::\n:::\n\n:::\n\n## Assumptions\n\nAs we mentioned earlier, Poisson regressions require that the variance of the data at any point is the same as the mean of the data at that point. 
We checked that earlier by looking at the residual deviance values.\n\nWe can look for influential points using the Cook’s distance plot:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nplot(glm_isl , which = 4)\n```\n\n::: {.cell-output-display}\n![](glm-practical-poisson_files/figure-html/unnamed-chunk-17-3.png){width=672}\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# extract the Cook's distances\nglm_isl_py_resid = pd.DataFrame(glm_isl_py.\n get_influence().\n summary_frame()[\"cooks_d\"])\n\n# add row index \nglm_isl_py_resid['obs'] = glm_isl_py_resid.reset_index().index\n```\n:::\n\n\nWe can use these to create the plot:\n\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(glm_isl_py_resid,\n aes(x = \"obs\",\n y = \"cooks_d\")) +\n geom_segment(aes(x = \"obs\", y = \"cooks_d\", xend = \"obs\", yend = 0)) +\n geom_point())\n```\n\n::: {.cell-output-display}\n![](glm-practical-poisson_files/figure-html/unnamed-chunk-19-1.png){width=614}\n:::\n:::\n\n\n:::\n\nNone of our points have particularly large Cook’s distances and so life is rosy.\n\n## Assessing significance\n\nWe can ask the same three questions we asked before.\n\n1. Is the model well-specified?\n2. Is the overall model better than the null model?\n3. 
Are any of the individual predictors significant?\n\nAgain, in this case, questions 2 and 3 are effectively asking the same thing because we still only have a single predictor variable.\n\nTo assess if the model is any good we’ll again use the residual deviance and the residual degrees of freedom.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\n1 - pchisq(30.437, 33)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0.5953482\n```\n\n\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nchi2.sf(30.437, 33)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n0.5953481872979622\n```\n\n\n:::\n:::\n\n\n:::\n\nThis gives a probability of `0.60`. This suggests that this model is actually a reasonably decent one and that the data are pretty well supported by the model. For Poisson models this has an extra interpretation: it can be used to assess whether we have significant over-dispersion in our data.\n\nFor a Poisson model to be appropriate we need the variance of the data to be the same as the mean of the data. Visually, this would correspond to the data spreading out more for higher predicted values of `species`. However, we don’t want the data to spread out too much. If that happens then a Poisson model wouldn’t be appropriate.\n\nThe easy way to check this is to look at the ratio of the residual deviance to the residual degrees of freedom (in this case `0.922`). For a Poisson model to be valid, this ratio should be about 1. 
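That check is quick to sketch, plugging in the residual deviance and residual degrees of freedom from the model summary above:

```python
from scipy.stats import chi2

# residual deviance and residual degrees of freedom from the summary
resid_deviance = 30.437
resid_df = 33

# dispersion ratio: should be close to 1 for a well-behaved Poisson model
print(round(resid_deviance / resid_df, 3))            # 0.922

# the same goodness-of-fit p-value we calculated above
print(round(chi2.sf(resid_deviance, resid_df), 2))    # 0.6
```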
If the ratio is significantly bigger than 1 then we say that we have over-dispersion in the model and we wouldn’t be able to trust any of the significance testing that we are about to do using a Poisson regression.\n\nThankfully the probability we have just created (`0.60`) is exactly the right one we need to look at to assess whether we have significant over-dispersion in our model.\n\nSecondly, to assess whether the overall model, with all of the terms, is better than the null model we’ll look at the difference in deviances and the difference in degrees of freedom:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\n1 - pchisq(856.899 - 30.437, 34 - 33)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0\n```\n\n\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nchi2.sf(856.899 - 30.437, 34 - 33)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n9.524927068555617e-182\n```\n\n\n:::\n:::\n\n:::\n\nThis gives a reported p-value of pretty much zero, which is pretty damn small. So, yes, this model is better than nothing at all and `species` does appear to change with `area` (our only predictor).\n\nFinally, we’ll construct an analysis of deviance table to look at the individual terms:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nanova(glm_isl , test = \"Chisq\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nAnalysis of Deviance Table\n\nModel: poisson, link: log\n\nResponse: species\n\nTerms added sequentially (first to last)\n\n Df Deviance Resid. Df Resid. Dev Pr(>Chi) \nNULL 34 856.90 \narea 1 826.46 33 30.44 < 2.2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n```\n\n\n:::\n:::\n\n\nThe p-value in this table is just as small as we’d expect given our previous result (`<2.2e-16` is pretty close to 0), and we have the nice consistent result that `area` definitely has an effect on `species`.\n\n## Python\n\nAs mentioned before, this is not quite possible in Python.\n:::\n\n## Exercises\n\n### Seat belts {#sec-exr_seatbelts}\n\n:::{.callout-exercise}\n\n\n{{< level 2 >}}\n\n\n\nFor this exercise we'll be using the data from `data/seatbelts.csv`.\n\nI'd like you to do the following:\n\n1. Load the data\n2. Visualise the data and create a Poisson regression model\n3. Plot the regression model on top of the data\n4. Assess if the model is a decent predictor for the number of fatalities\n\n::: {.callout-answer collapse=\"true\"}\n\n#### Load and visualise the data\n\nFirst we load the data, then we visualise it.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseatbelts <- read_csv(\"data/seatbelts.csv\")\n```\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nseatbelts_py = pd.read_csv(\"data/seatbelts.csv\")\n```\n:::\n\n\nLet's have a glimpse at the data:\n\n\n::: {.cell}\n\n```{.python .cell-code}\nseatbelts_py.head()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n casualties drivers front rear ... van_killed law year month\n0 107 1687 867 269 ... 12 0 1969 Jan\n1 97 1508 825 265 ... 6 0 1969 Feb\n2 102 1507 806 319 ... 12 0 1969 Mar\n3 87 1385 814 407 ... 8 0 1969 Apr\n4 119 1632 991 454 ... 10 0 1969 May\n\n[5 rows x 10 columns]\n```\n\n\n:::\n:::\n\n:::\n\nThe data track the number of drivers killed in road traffic accidents, before and after the seat belt law was introduced. 
The information on whether the law was in place is encoded in the `law` column as `0` (law not in place) or `1` (law in place).\n\nThere are many more observations when the law was *not* in place, so we need to keep this in mind when we're interpreting the data.\n\nFirst we have a look at the data comparing no law vs law:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nWe have to convert the `law` column to a factor, otherwise R will see it as numerical.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseatbelts %>% \n ggplot(aes(as_factor(law), casualties)) +\n geom_boxplot()\n```\n\n::: {.cell-output-display}\n![](glm-practical-poisson_files/figure-html/unnamed-chunk-28-1.png){width=672}\n:::\n:::\n\n\nThe data are recorded by month and year, so we can also display the number of drivers killed by year:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseatbelts %>% \n ggplot(aes(year, casualties)) +\n geom_point()\n```\n\n::: {.cell-output-display}\n![](glm-practical-poisson_files/figure-html/unnamed-chunk-29-1.png){width=672}\n:::\n:::\n\n\n## Python\n\nWe have to convert the `law` column to an `object` type, otherwise Python will see it as numerical.\n\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(seatbelts_py,\n aes(x = seatbelts_py.law.astype(object),\n y = \"casualties\")) +\n geom_boxplot())\n```\n\n::: {.cell-output-display}\n![](glm-practical-poisson_files/figure-html/unnamed-chunk-30-1.png){width=614}\n:::\n:::\n\n\nThe data are recorded by month and year, so we can also display the number of casualties by year:\n\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(seatbelts_py,\n aes(x = \"year\",\n y = \"casualties\")) +\n geom_point())\n```\n\n::: {.cell-output-display}\n![](glm-practical-poisson_files/figure-html/unnamed-chunk-31-3.png){width=614}\n:::\n:::\n\n\n:::\n\nThe data look a bit weird. There is quite some variation within years (keeping in mind that the data are aggregated monthly). The data also seem to wave around a bit... with some vague peaks (e.g. 
1972 - 1973) and some troughs (e.g. around 1976).\n\nSo my initial thought is that these data are going to be a bit tricky to interpret. But that's OK.\n\n#### Constructing a model\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nglm_stb <- glm(casualties ~ year,\n data = seatbelts, family = \"poisson\")\n```\n:::\n\n\nand we look at the model summary:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(glm_stb)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nglm(formula = casualties ~ year, family = \"poisson\", data = seatbelts)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) 37.168958 2.796636 13.29 <2e-16 ***\nyear -0.016373 0.001415 -11.57 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for poisson family taken to be 1)\n\n Null deviance: 984.50 on 191 degrees of freedom\nResidual deviance: 850.41 on 190 degrees of freedom\nAIC: 2127.2\n\nNumber of Fisher Scoring iterations: 4\n```\n\n\n:::\n:::\n\n\n```\n(Intercept) 37.168958\nyear -0.016373\n```\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# create a linear model\nmodel = smf.glm(formula = \"casualties ~ year\",\n family = sm.families.Poisson(),\n data = seatbelts_py)\n# and get the fitted parameters of the model\nglm_stb_py = model.fit()\n```\n:::\n\n::: {.cell}\n\n```{.python .cell-code}\nprint(glm_stb_py.summary())\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n Generalized Linear Model Regression Results \n==============================================================================\nDep. Variable: casualties No. Observations: 192\nModel: GLM Df Residuals: 190\nModel Family: Poisson Df Model: 1\nLink Function: Log Scale: 1.0000\nMethod: IRLS Log-Likelihood: -1061.6\nDate: Fri, 17 May 2024 Deviance: 850.41\nTime: 12:30:06 Pearson chi2: 862.\nNo. Iterations: 4 Pseudo R-squ. 
(CS): 0.5026\nCovariance Type: nonrobust \n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept 37.1690 2.797 13.291 0.000 31.688 42.650\nyear -0.0164 0.001 -11.569 0.000 -0.019 -0.014\n==============================================================================\n```\n\n\n:::\n:::\n\n\n```\n======================\n coef \n----------------------\nIntercept 37.1690 \nyear -0.0164 \n======================\n```\n:::\n\nThese are the coefficients of the Poisson model equation and need to be placed in the following formula in order to estimate the expected number of casualties as a function of year:\n\n$$ E(casualties) = \\exp(37.17 - 0.0164 \\times year) $$\n\n#### Assessing significance\n\nIs the model well-specified?\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\n1 - pchisq(850.41, 190)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0\n```\n\n\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nchi2.sf(850.41, 190)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n3.1319689119997022e-84\n```\n\n\n:::\n:::\n\n:::\n\nThis p-value is essentially zero, which tells us that the model is *not* well-specified: the residual deviance (850.41) is much larger than the residual degrees of freedom (190), a ratio of around 4.5 rather than the 1 we would want. This suggests that there is significant over-dispersion in these data, so we should treat the significance tests that follow with some caution.\n\nHow about the overall fit?\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\n1 - pchisq(984.50 - 850.41, 191 - 190)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0\n```\n\n\n:::\n:::\n\n\n## Python\n\nFirst we need to define the null model:\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# create a generalised linear model\nmodel = smf.glm(formula = \"casualties ~ 1\",\n family = sm.families.Poisson(),\n data = seatbelts_py)\n# and get the fitted parameters of the model\nglm_stb_null_py = model.fit()\n\nprint(glm_stb_null_py.summary())\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n Generalized Linear Model Regression Results \n==============================================================================\nDep. Variable: casualties No. Observations: 192\nModel: GLM Df Residuals: 191\nModel Family: Poisson Df Model: 0\nLink Function: Log Scale: 1.0000\nMethod: IRLS Log-Likelihood: -1128.6\nDate: Fri, 17 May 2024 Deviance: 984.50\nTime: 12:30:07 Pearson chi2: 1.00e+03\nNo. Iterations: 4 Pseudo R-squ. 
(CS): 1.942e-13\nCovariance Type: nonrobust \n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept 4.8106 0.007 738.670 0.000 4.798 4.823\n==============================================================================\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.python .cell-code}\nchi2.sf(984.50 - 850.41, 191 - 190)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n5.2214097202831414e-31\n```\n\n\n:::\n:::\n\n:::\n\nAgain, this indicates that the model is markedly better than the null model.\n\n#### Plotting the regression\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(seatbelts, aes(year, casualties)) +\n geom_point() +\n geom_smooth(method = \"glm\", se = FALSE, fullrange = TRUE, \n method.args = list(family = poisson)) +\n xlim(1970, 1985)\n```\n\n::: {.cell-output-display}\n![](glm-practical-poisson_files/figure-html/unnamed-chunk-41-1.png){width=672}\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nmodel = pd.DataFrame({'year': list(range(1968, 1985))})\n\nmodel[\"pred\"] = glm_stb_py.predict(model)\n\nmodel.head()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n year pred\n0 1968 140.737690\n1 1969 138.452153\n2 1970 136.203733\n3 1971 133.991827\n4 1972 131.815842\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(seatbelts_py,\n aes(x = \"year\",\n y = \"casualties\")) +\n geom_point() +\n geom_line(model, aes(x = \"year\", y = \"pred\"), colour = \"blue\", size = 1))\n```\n\n::: {.cell-output-display}\n![](glm-practical-poisson_files/figure-html/unnamed-chunk-43-1.png){width=614}\n:::\n:::\n\n:::\n\n\n#### Conclusions\n\nThe overall model is significantly better than the null model, with `casualties` clearly declining over `year`. However, the residual deviance (850.41) is much larger than the residual degrees of freedom (190), which points to over-dispersion, so a Poisson model may not be the most appropriate choice for these data.\n\n:::\n:::\n\n## Summary\n\n::: {.callout-tip}\n#### Key points\n\n- Poisson regression is useful when 
dealing with count data\n:::\n",
"supporting": [
"glm-practical-poisson_files"
],
diff --git a/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-16-1.png b/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-16-1.png
index 1c0b16a..416b5d3 100644
Binary files a/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-16-1.png and b/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-16-1.png differ
diff --git a/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-19-1.png b/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-19-1.png
index 829445c..f466d73 100644
Binary files a/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-19-1.png and b/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-19-1.png differ
diff --git a/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-30-1.png b/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-30-1.png
index 6ea8e6f..437228b 100644
Binary files a/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-30-1.png and b/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-30-1.png differ
diff --git a/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-31-3.png b/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-31-3.png
index b7ce4a8..d55a9ec 100644
Binary files a/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-31-3.png and b/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-31-3.png differ
diff --git a/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-43-1.png b/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-43-1.png
index a1b6b49..52baf93 100644
Binary files a/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-43-1.png and b/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-43-1.png differ
diff --git a/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-9-1.png b/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-9-1.png
index 72005bc..c9bcee1 100644
Binary files a/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-9-1.png and b/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-9-1.png differ
diff --git a/_freeze/materials/significance-testing/execute-results/html.json b/_freeze/materials/significance-testing/execute-results/html.json
new file mode 100644
index 0000000..d77988d
--- /dev/null
+++ b/_freeze/materials/significance-testing/execute-results/html.json
@@ -0,0 +1,17 @@
+{
+ "hash": "9b1997b4e655bbd2efb84094264a9b0c",
+ "result": {
+ "engine": "knitr",
+ "markdown": "---\ntitle: \"Significance & goodness-of-fit\"\noutput: html_document\n---\n\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n\n\nGeneralised linear models are fitted a little differently to standard linear models - namely, using maximum likelihood estimation instead of ordinary least squares for estimating the model coefficients.\n\nAs a result, we can no longer use F-tests for significance, or interpret $R^2$ values in quite the same way. This section will discuss new techniques for significance and goodness-of-fit testing, specifically for use with GLMs.\n\n## Libraries and functions\n\n::: {.callout-note collapse=\"true\"}\n## Click to expand\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n::: {.cell}\n\n```{.r .cell-code}\ninstall.packages(\"lmtest\")\nlibrary(lmtest)\n```\n:::\n\n## Python\n\n::: {.cell}\n\n```{.python .cell-code}\nfrom scipy.stats import *\n```\n:::\n\n:::\n\n:::\n\n## Deviance\n\nSeveral of the tests and metrics we'll discuss below are based heavily on deviance. So, what is deviance, and where does it come from?\n\nFitting a model using maximum likelihood estimation - the method that we use for GLMs, among other models - is all about finding the parameters that maximise the **likelihood**, or joint probability, of the sample. In other words, how likely is it that you would sample a set of data points like these, if they were being drawn from an underlying population where your model is true? Each model that you fit has its own likelihood.\n\nNow, for each dataset, there is a \"saturated\", or perfect, model. This model has the same number of parameters in it as there are data points, meaning the data are fitted exactly - as if connecting the dots between them. The **saturated model** has the largest possible likelihood of any model fitted to the dataset.\n\nOf course, we don't actually use the saturated model for drawing real conclusions, but we can use it as a baseline for comparison. 
We compare each model that we fit to this saturated model, to calculate the **deviance**. Deviance is defined as the difference between the log-likelihood of your fitted model and the log-likelihood of the saturated model (multiplied by 2). \n\nBecause deviance is all about capturing the discrepancy between fitted and actual values, it's performing a similar function to the residual sum of squares (RSS) in a standard linear model. In fact, the RSS is really just a specific type of deviance.\n\n![Different models and their deviances](images/deviance.png){width=70%}\n\n## Significance testing\n\nThere are a few different potential sources of p-values for a generalised linear model. \n\nHere, we'll briefly discuss the p-values that are reported \"as standard\" in a typical GLM model output.\n\nThen, we'll spend most of our time focusing on likelihood ratio tests, perhaps the most effective way to assess significance in a GLM.\n\n### Revisiting the diabetes dataset\n\nAs a worked example, we'll use a logistic regression fitted to the `diabetes` dataset that we saw in a previous section.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndiabetes <- read_csv(\"data/diabetes.csv\")\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nRows: 728 Columns: 3\n── Column specification ────────────────────────────────────────────────────────\nDelimiter: \",\"\ndbl (3): glucose, diastolic, test_result\n\nℹ Use `spec()` to retrieve the full column specification for this data.\nℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\n```\n\n\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\ndiabetes_py = pd.read_csv(\"data/diabetes.csv\")\n\ndiabetes_py.head()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n glucose diastolic test_result\n0 148 72 1\n1 85 66 0\n2 183 64 1\n3 89 66 0\n4 137 40 1\n```\n\n\n:::\n:::\n\n:::\n\nAs a reminder, this dataset contains three variables:\n\n- 
`test_result`, binary results of a diabetes test result (1 for positive, 0 for negative)\n- `glucose`, the results of a glucose tolerance test\n- `diastolic` blood pressure\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nglm_dia <- glm(test_result ~ glucose * diastolic,\n family = \"binomial\",\n data = diabetes)\n```\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nmodel = smf.glm(formula = \"test_result ~ glucose * diastolic\", \n family = sm.families.Binomial(), \n data = diabetes_py)\n \nglm_dia_py = model.fit()\n```\n:::\n\n:::\n\n### Wald tests\n\nLet's use the `summary` function to see the model we've just fitted.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(glm_dia)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nglm(formula = test_result ~ glucose * diastolic, family = \"binomial\", \n data = diabetes)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) -8.5710565 2.7032318 -3.171 0.00152 **\nglucose 0.0547050 0.0209256 2.614 0.00894 **\ndiastolic 0.0423651 0.0363681 1.165 0.24406 \nglucose:diastolic -0.0002221 0.0002790 -0.796 0.42590 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for binomial family taken to be 1)\n\n Null deviance: 936.60 on 727 degrees of freedom\nResidual deviance: 748.01 on 724 degrees of freedom\nAIC: 756.01\n\nNumber of Fisher Scoring iterations: 4\n```\n\n\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nprint(glm_dia_py.summary())\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n Generalized Linear Model Regression Results \n==============================================================================\nDep. Variable: test_result No. 
Observations: 728\nModel: GLM Df Residuals: 724\nModel Family: Binomial Df Model: 3\nLink Function: Logit Scale: 1.0000\nMethod: IRLS Log-Likelihood: -374.00\nDate: Fri, 17 May 2024 Deviance: 748.01\nTime: 12:43:46 Pearson chi2: 720.\nNo. Iterations: 5 Pseudo R-squ. (CS): 0.2282\nCovariance Type: nonrobust \n=====================================================================================\n coef std err z P>|z| [0.025 0.975]\n-------------------------------------------------------------------------------------\nIntercept -8.5711 2.703 -3.171 0.002 -13.869 -3.273\nglucose 0.0547 0.021 2.614 0.009 0.014 0.096\ndiastolic 0.0424 0.036 1.165 0.244 -0.029 0.114\nglucose:diastolic -0.0002 0.000 -0.796 0.426 -0.001 0.000\n=====================================================================================\n```\n\n\n:::\n:::\n\n:::\n\nWhichever language you're using, you may have spotted some p-values being reported directly here in the model summaries. Specifically, each individual parameter, or coefficient, has its own z-value and associated p-value.\n\nA hypothesis test has automatically been performed for each of the parameters in your model, including the intercept and interaction. In each case, something called a **Wald test** has been performed.\n\nThe null hypothesis for these Wald tests is that the value of the coefficient = 0. The idea is that if a coefficient isn't significantly different from 0, then that parameter isn't useful and could be dropped from the model. These tests are the equivalent of the t-tests that are calculated as part of the `summary` output for standard linear models.\n\nImportantly, these Wald tests *don't* tell you about the significance of the overall model. 
For that, we're going to need something else: a likelihood ratio test.\n\n### Likelihood ratio tests (LRTs)\n\nWhen we were assessing the significance of standard linear models, we were able to use the F-statistic to determine:\n\n- the significance of the model versus a null model, and\n- the significance of individual predictors.\n\nWe can't use these F-tests for GLMs, but we can use LRTs in a really similar way, to calculate p-values for both the model as a whole, and for individual variables.\n\nThese tests are all built on the idea of deviance, or the likelihood ratio, as discussed above on this page. We can compare any two models fitted to the same dataset by looking at the difference in their deviances (twice the difference in their log-likelihoods), more simply known as a likelihood ratio.\n\nHelpfully, this likelihood ratio approximately follows a chi-square distribution, which we can capitalise on to calculate a p-value. All we need is the number of degrees of freedom, which is equal to the difference in the number of parameters of the two models you're comparing.\n\n::: {.callout-warning}\nImportantly, we are only able to use this sort of test when one of the two models that we are comparing is a \"simpler\" version of the other, i.e., one model has a subset of the parameters of the other model. \n\nSo while we could perform an LRT just fine between these two models: `Y ~ A + B + C` and `Y ~ A + B + C + D`, or between any model and the null (`Y ~ 1`), we would not be able to use this test to compare `Y ~ A + B + C` and `Y ~ A + B + D`.\n:::\n\n#### Testing the model versus the null\n\nSince LRTs involve making a comparison between two models, we must first decide which models we're comparing, and check that one model is a \"subset\" of the other.\n\nLet's use an example from a previous section of the course, where we fitted a logistic regression to the `diabetes` dataset. 
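Written out as a formula, the test statistic described above is:

$$ \chi^2 = 2\left(\ell_{\text{full}} - \ell_{\text{reduced}}\right), \qquad df = k_{\text{full}} - k_{\text{reduced}} $$

where $\ell$ is the log-likelihood of each model and $k$ is the number of parameters it contains; under the null hypothesis that the simpler model is adequate, this statistic approximately follows a $\chi^2$ distribution with $df$ degrees of freedom.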
\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nThe first step is to create the two models that we want to compare: our original model, and the null model (with and without predictors, respectively).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nglm_dia <- glm(test_result ~ glucose * diastolic,\n family = \"binomial\",\n data = diabetes)\n\nglm_null <- glm(test_result ~ 1, \n family = binomial, \n data = diabetes)\n```\n:::\n\n\nThen, we use the `lrtest` function from the `lmtest` package to perform the test itself; we include both the models that we want to compare, listing them one after another.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlrtest(glm_dia, glm_null)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nLikelihood ratio test\n\nModel 1: test_result ~ glucose * diastolic\nModel 2: test_result ~ 1\n #Df LogLik Df Chisq Pr(>Chisq) \n1 4 -374.0 \n2 1 -468.3 -3 188.59 < 2.2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n```\n\n\n:::\n:::\n\n\nWe can see from the output that our chi-square statistic is significant, with a really small p-value. This tells us that, for the difference in degrees of freedom (here, that's 3), the change in deviance is actually quite big. 
(In this case, you can use `summary(glm_dia)` to see those deviances - 936 versus 748!)\n\nIn other words, our model is better than the null.\n\n## Python\n\nThe first step is to create the two models that we want to compare: our original model, and the null model (with and without our predictors, respectively).\n\n\n::: {.cell}\n\n```{.python .cell-code}\nmodel = smf.glm(formula = \"test_result ~ glucose * diastolic\", \n family = sm.families.Binomial(), \n data = diabetes_py)\n \nglm_dia_py = model.fit()\n\nmodel = smf.glm(formula = \"test_result ~ 1\",\n family = sm.families.Binomial(),\n data = diabetes_py)\n\nglm_null_py = model.fit()\n```\n:::\n\n\nUnlike in R, there isn't a nice neat function for extracting the $\\chi^2$ value, so we have to do a little bit of work by hand.\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# calculate the likelihood ratio (i.e. the chi-square value)\nlrstat = -2*(glm_null_py.llf - glm_dia_py.llf)\n\n# calculate the associated p-value\npvalue = chi2.sf(lrstat, glm_dia_py.df_model - glm_null_py.df_model)\n\nprint(lrstat, pvalue)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n188.59314837444526 1.2288700360045209e-40\n```\n\n\n:::\n:::\n\n\nThis gives us the likelihood ratio, based on the log-likelihoods that we've extracted directly from the models, which approximates a chi-square distribution. \n\nWe've also calculated the associated p-value, by providing the difference in degrees of freedom between the two models (in this case, that's 3, but for more complicated models it's easier to extract the degrees of freedom directly from the model as we've done here).\n\nHere, we have a large chi-square statistic and a small p-value. This tells us that, for the difference in degrees of freedom (here, that's 3), the change in deviance is actually quite big. 
(In this case, you can use `print(glm_dia_py.summary())` to see those deviances - 936 versus 748!)\n\nIn other words, our model is better than the null.\n:::\n\n### Testing individual predictors\n\nAs well as testing the overall model versus the null, we might want to test particular predictors to determine whether they are individually significant.\n\nThe way to achieve this is essentially to perform a series of \"targeted\" likelihood ratio tests. In each LRT, we'll compare two models that are almost identical - one with, and one without, our variable of interest in each case.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nThe first step is to construct a new model that doesn't contain our predictor of interest. Let's test the `glucose:diastolic` interaction.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nglm_dia_add <- glm(test_result ~ glucose + diastolic,\n family = \"binomial\",\n data = diabetes)\n```\n:::\n\n\nNow, we use the `lrtest` function to compare the models with and without the interaction:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlrtest(glm_dia, glm_dia_add)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nLikelihood ratio test\n\nModel 1: test_result ~ glucose * diastolic\nModel 2: test_result ~ glucose + diastolic\n #Df LogLik Df Chisq Pr(>Chisq)\n1 4 -374.00 \n2 3 -374.32 -1 0.6288 0.4278\n```\n\n\n:::\n:::\n\n\nThis tells us that our interaction `glucose:diastolic` isn't significant - our more complex model doesn't have a meaningful reduction in deviance.\n\nThis might, however, seem like a slightly clunky way to test each individual predictor. Luckily, we can also use our trusty `anova` function with an extra argument to tell us about individual predictors. 
\n\nBy specifying that we want to use a chi-squared test, we are able to construct an analysis of deviance table (as opposed to an analysis of variance table) that will perform the likelihood ratio tests for us for each predictor:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nanova(glm_dia, test=\"Chisq\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nAnalysis of Deviance Table\n\nModel: binomial, link: logit\n\nResponse: test_result\n\nTerms added sequentially (first to last)\n\n Df Deviance Resid. Df Resid. Dev Pr(>Chi) \nNULL 727 936.60 \nglucose 1 184.401 726 752.20 < 2e-16 ***\ndiastolic 1 3.564 725 748.64 0.05905 . \nglucose:diastolic 1 0.629 724 748.01 0.42779 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n```\n\n\n:::\n:::\n\n\nYou'll spot that the p-values we get from the analysis of deviance table match the p-values you could calculate yourself using `lrtest`; this is just more efficient when you have a complex model! Do note, though, that the table is built sequentially (terms added first to last), so the p-values can depend on the order in which you specified the predictors in your model formula.\n\n## Python\n\nThe first step is to construct a new model that doesn't contain our predictor of interest. 
Let's test the `glucose:diastolic` interaction.\n\n\n::: {.cell}\n\n```{.python .cell-code}\nmodel = smf.glm(formula = \"test_result ~ glucose + diastolic\", \n family = sm.families.Binomial(), \n data = diabetes_py)\n \nglm_dia_add_py = model.fit()\n```\n:::\n\n\nWe'll then use the same code we used above, to compare the models with and without the interaction:\n\n\n::: {.cell}\n\n```{.python .cell-code}\nlrstat = -2*(glm_dia_add_py.llf - glm_dia_py.llf)\n\npvalue = chi2.sf(lrstat, glm_dia_py.df_model - glm_dia_add_py.df_model)\n\nprint(lrstat, pvalue)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n0.6288201373599804 0.42778842576800746\n```\n\n\n:::\n:::\n\n\nThis tells us that our interaction `glucose:diastolic` isn't significant - our more complex model doesn't have a meaningful reduction in deviance.\n:::\n\n## Goodness-of-fit\n\nGoodness-of-fit is all about how well a model fits the data, and typically involves summarising the discrepancy between the actual data points, and the fitted/predicted values that the model produces.\n\nThough closely linked, it's important to realise that goodness-of-fit and significance don't come hand-in-hand automatically: we might find a model that is significantly better than the null, but is still overall pretty rubbish at matching the data. So, to understand the quality of our model better, we should ideally perform both types of test. \n\n### Chi-square tests\n\nOnce again, we can make use of deviance and chi-square tests, this time to assess goodness-of-fit.\n\nAbove, we used likelihood ratio tests to assess the null hypothesis that our candidate fitted model and the null model had the same deviance.\n\nNow, however, we will test the null hypothesis that the fitted model and the saturated (perfect) model have the same deviance, i.e., that they both fit the data equally well. 
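\n\nFor a binary 0/1 response, the saturated model predicts every single observation perfectly, so its log-likelihood is zero - which means the residual deviance is simply minus twice the fitted model's log-likelihood. Here's a minimal sketch of that relationship (simulated data, using statsmodels' attribute names):\n\n::: {.cell}\n\n```{.python .cell-code}\nimport numpy as np\nimport pandas as pd\nimport statsmodels.api as sm\nimport statsmodels.formula.api as smf\n\n# hypothetical binary data, purely for illustration\nrng = np.random.default_rng(1)\nsim = pd.DataFrame({\"x\": rng.normal(size = 200)})\nsim[\"y\"] = rng.binomial(1, 1 / (1 + np.exp(-sim[\"x\"])))\n\nfitted = smf.glm(\"y ~ x\", family = sm.families.Binomial(), data = sim).fit()\n\n# saturated log-likelihood is 0, so deviance is just -2 * log-likelihood\nprint(np.isclose(fitted.deviance, -2 * fitted.llf))\n```\n:::\n\n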
In most hypothesis tests, we want to reject the null hypothesis, but in this case, we'd like it to be true.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nRunning a goodness-of-fit chi-square test in R can be done using the `pchisq` function. We need to include two arguments: 1) the residual deviance, and 2) the residual degrees of freedom. Both of these can be found in the `summary` output, but you can use the `$` syntax to call these properties directly like so:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n1 - pchisq(glm_dia$deviance, glm_dia$df.residual)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0.2605931\n```\n\n\n:::\n:::\n\n\n## Python\n\nThe syntax is very similar to the LRT we ran above, but now, instead of including information about both our candidate model and the null, we just need 1) the residual deviance, and 2) the residual degrees of freedom:\n\n\n::: {.cell}\n\n```{.python .cell-code}\npvalue = chi2.sf(glm_dia_py.deviance, glm_dia_py.df_resid)\n\nprint(pvalue)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n0.26059314630406843\n```\n\n\n:::\n:::\n\n:::\n\nStrictly, this p-value is the probability of observing a deviance at least this large if our model really did fit the data well - so you can think of it, very roughly, as \"the probability that this model is good\". We're not below our significance threshold, which means that we don't reject our null hypothesis (which is a good thing here) - but it's also not a huge probability. This suggests that there are probably other variables we could measure and include in a future experiment, to give a better overall model.\n\n### AIC values\n\nYou might remember AIC values from standard linear modelling. 
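\n\nAs a quick reminder of where the number comes from: AIC is computed directly from the maximised log-likelihood and the number of estimated parameters. Here's a hedged sketch (simulated data; `aic`, `llf` and `df_model` are statsmodels' attribute names) that reproduces it by hand:\n\n::: {.cell}\n\n```{.python .cell-code}\nimport numpy as np\nimport pandas as pd\nimport statsmodels.api as sm\nimport statsmodels.formula.api as smf\n\n# hypothetical binary data, purely for illustration\nrng = np.random.default_rng(1)\nsim = pd.DataFrame({\"x\": rng.normal(size = 200)})\nsim[\"y\"] = rng.binomial(1, 1 / (1 + np.exp(-sim[\"x\"])))\n\nfitted = smf.glm(\"y ~ x\", family = sm.families.Binomial(), data = sim).fit()\n\nk = fitted.df_model + 1 # estimated parameters, including the intercept\nprint(np.isclose(2 * k - 2 * fitted.llf, fitted.aic))\n```\n:::\n\n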
AIC values are useful, because they tell us about overall model quality, factoring in both goodness-of-fit and model complexity.\n\nOne of the best things about the Akaike information criterion (AIC) is that it isn't specific to linear models - it works for any model fitted via maximum likelihood estimation.\n\nIn fact, if you look at the formula for AIC, you'll see why:\n\n$$\nAIC = 2k - 2\\ln(\\hat{L})\n$$\n\nwhere $k$ represents the number of parameters in the model, and $\\hat{L}$ is the maximised value of the likelihood function. In other words, the two terms of the equation balance the complexity of the model against its log-likelihood.\n\nThis means that AIC can be used for model comparison for GLMs in precisely the same way as it's used for linear models: lower AIC indicates a better-quality model.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nThe AIC value is given as standard, near the bottom of the `summary` output (just below the deviance values). You can also print it directly using the `$` syntax:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(glm_dia)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nglm(formula = test_result ~ glucose * diastolic, family = \"binomial\", \n data = diabetes)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) -8.5710565 2.7032318 -3.171 0.00152 **\nglucose 0.0547050 0.0209256 2.614 0.00894 **\ndiastolic 0.0423651 0.0363681 1.165 0.24406 \nglucose:diastolic -0.0002221 0.0002790 -0.796 0.42590 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n\n(Dispersion parameter for binomial family taken to be 1)\n\n Null deviance: 936.60 on 727 degrees of freedom\nResidual deviance: 748.01 on 724 degrees of freedom\nAIC: 756.01\n\nNumber of Fisher Scoring iterations: 4\n```\n\n\n:::\n\n```{.r .cell-code}\nglm_dia$aic\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 756.0069\n```\n\n\n:::\n:::\n\n\nIn even better news for R users, the `step` function works for GLMs just as it does for linear models, so long as you include the `test = \"LRT\"` argument.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstep(glm_dia, test = \"LRT\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nStart: AIC=756.01\ntest_result ~ glucose * diastolic\n\n Df Deviance AIC LRT Pr(>Chi)\n- glucose:diastolic 1 748.64 754.64 0.62882 0.4278\n 748.01 756.01 \n\nStep: AIC=754.64\ntest_result ~ glucose + diastolic\n\n Df Deviance AIC LRT Pr(>Chi) \n 748.64 754.64 \n- diastolic 1 752.20 756.20 3.564 0.05905 . \n- glucose 1 915.52 919.52 166.884 < 2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall: glm(formula = test_result ~ glucose + diastolic, family = \"binomial\", \n data = diabetes)\n\nCoefficients:\n(Intercept) glucose diastolic \n -6.49941 0.03836 0.01407 \n\nDegrees of Freedom: 727 Total (i.e. Null); 725 Residual\nNull Deviance:\t 936.6 \nResidual Deviance: 748.6 \tAIC: 754.6\n```\n\n\n:::\n:::\n\n\n## Python\n\nThe AIC value isn't printed as standard with the model summary, but you can access it easily like so:\n\n\n::: {.cell}\n\n```{.python .cell-code}\nprint(glm_dia_py.aic)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n756.0068586069744\n```\n\n\n:::\n:::\n\n:::\n\n### Pseudo r-squared\n\nWe can't use $R^2$ values to represent the amount of variance explained in a GLM. 
This is primarily because, while linear models are fitted by minimising the squared residuals, GLMs are fitted by maximising the likelihood - an entirely different procedure.\n\nHowever, because $R^2$ values are so useful in linear modelling, statisticians have developed something called a \"pseudo $R^2$\" for GLMs.\n\n::: {.callout-note}\n#### Debate about pseudo $R^2$ values\n\nThere are two main areas of debate:\n\n1. Which version of pseudo $R^2$ to use? \n\nThere are many. Some of the most popular are McFadden's, Nagelkerke's, Cox & Snell's, and Tjur's. They all have slightly different formulae and in some cases can give quite different results. [This post](https://stats.oarc.ucla.edu/other/mult-pkg/faq/general/faq-what-are-pseudo-r-squareds/) does a nice job of discussing some of them and providing some comparisons.\n\n2. Should pseudo $R^2$ values be calculated at all? \n\nWell, it depends what you want them for. Most statisticians tend to advise that pseudo $R^2$ values are only really useful for model comparisons (i.e., comparing different GLMs fitted to the same dataset). This is in contrast to the way that we use $R^2$ values in linear models, as a measure of effect size that is generalisable across studies.\n\nSo, if you choose to use pseudo $R^2$ values, try to be thoughtful about it; and avoid the temptation to over-interpret! \n:::\n\n## Summary\n\nLikelihood and deviance are very important in generalised linear models - not just for fitting the model via maximum likelihood estimation, but for assessing significance and goodness-of-fit. 
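\n\nEven the pseudo-$R^2$ values from the previous section lean on the likelihood: McFadden's version, for example, is just one minus the ratio of the fitted and null models' log-likelihoods. A hedged sketch (simulated data, not the diabetes set):\n\n::: {.cell}\n\n```{.python .cell-code}\nimport numpy as np\nimport pandas as pd\nimport statsmodels.api as sm\nimport statsmodels.formula.api as smf\n\n# hypothetical binary data, purely for illustration\nrng = np.random.default_rng(1)\nsim = pd.DataFrame({\"x\": rng.normal(size = 200)})\nsim[\"y\"] = rng.binomial(1, 1 / (1 + np.exp(-sim[\"x\"])))\n\nfitted = smf.glm(\"y ~ x\", family = sm.families.Binomial(), data = sim).fit()\nnull = smf.glm(\"y ~ 1\", family = sm.families.Binomial(), data = sim).fit()\n\n# McFadden's pseudo-R2: 1 - llf(fitted) / llf(null)\nmcfadden = 1 - fitted.llf / null.llf\nprint(0 <= mcfadden <= 1)\n```\n:::\n\n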
To determine the quality of a model and draw conclusions from it, it's important to assess both of these things.\n\n::: {.callout-tip}\n#### Key points\n\n- Deviance measures the discrepancy between the model's predictions and the actual data, and is calculated by comparing the model's log-likelihood to that of the perfect \"saturated\" model \n- Using deviance, likelihood ratio tests can be used in lieu of F-tests for generalised linear models\n- Similarly, a chi-square goodness-of-fit test can also be performed using likelihood/deviance\n- The Akaike information criterion is also based on likelihood, and can be used to compare the quality of GLMs fitted to the same dataset\n- Other metrics that may be of use are Wald test p-values and pseudo $R^2$ values\n:::\n",
+ "supporting": [
+ "significance-testing_files"
+ ],
+ "filters": [
+ "rmarkdown/pagebreak.lua"
+ ],
+ "includes": {},
+ "engineDependencies": {},
+ "preserve": {},
+ "postProcess": true
+ }
+}
\ No newline at end of file
diff --git a/_quarto.yml b/_quarto.yml
index 0c1ef0f..2cc13c2 100644
--- a/_quarto.yml
+++ b/_quarto.yml
@@ -33,6 +33,7 @@ metadata-files:
# - "materials/_sidebar.yml"
book:
+ bread-crumbs: false
search:
location: sidebar
favicon: "_extensions/cambiotraining/courseformat/img/university-of-cambridge-favicon.ico"
diff --git a/_site/index.html b/_site/index.html
index ff094d9..c2ea896 100644
--- a/_site/index.html
+++ b/_site/index.html
@@ -6,6 +6,7 @@
+
Course overview