diff --git a/_freeze/materials/glm-practical-logistic-binary/execute-results/html.json b/_freeze/materials/glm-practical-logistic-binary/execute-results/html.json index 315269c..5a9eab1 100644 --- a/_freeze/materials/glm-practical-logistic-binary/execute-results/html.json +++ b/_freeze/materials/glm-practical-logistic-binary/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "6cd63fec01955d0a10967c5e12ce0878", + "hash": "8e7bf90c5ceb5c589f9315ca1d1b87aa", "result": { "engine": "knitr", - "markdown": "---\ntitle: \"Binary response\"\n---\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n\n\n::: {.callout-tip}\n## Learning outcomes\n\n**Questions**\n\n- How do we analyse data with a binary outcome?\n- Can we test if our model is any good?\n- Be able to perform a logistic regression with a binary outcome\n- Predict outcomes of new data, based on a defined model\n\n**Objectives**\n\n- Be able to analyse binary outcome data\n- Understand different methods of testing model fit\n- Be able to make model predictions\n:::\n\n## Libraries and functions\n\n::: {.callout-note collapse=\"true\"}\n## Click to expand\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n### Libraries\n### Functions\n\n## Python\n\n### Libraries\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# A maths library\nimport math\n# A Python data analysis and manipulation tool\nimport pandas as pd\n\n# Python equivalent of `ggplot2`\nfrom plotnine import *\n\n# Statistical models, conducting tests and statistical data exploration\nimport statsmodels.api as sm\n\n# Convenience interface for specifying models using formula strings and DataFrames\nimport statsmodels.formula.api as smf\n\n# Needed for additional probability functionality\nfrom scipy.stats import *\n```\n:::\n\n\n### Functions\n:::\n:::\n\nThe example in this section uses the following data set:\n\n`data/finches_early.csv`\n\nThese data come from an analysis of gene flow across two finch species [@lamichhaney2020]. They are slightly adapted here for illustrative purposes.\n\nThe data focus on two species, _Geospiza fortis_ and _G. scandens_. The original measurements are split by a uniquely timed event: a particularly strong El Niño event in 1983. This event changed the vegetation and food supply of the finches, allowing F1 hybrids of the two species to survive, whereas before 1983 they could not. The measurements are classed as `early` (pre-1983) and `late` (1983 onwards).\n\nHere we are looking only at the `early` data. We are specifically focussing on the beak shape classification, which we saw earlier in @fig-beak_shape_glm.\n\n## Load and visualise the data\n\nFirst we load the data, then we visualise it.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nearly_finches <- read_csv(\"data/finches_early.csv\")\n```\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nearly_finches_py = pd.read_csv(\"data/finches_early.csv\")\n```\n:::\n\n\n:::\n\nLooking at the data, we can see that the `pointed_beak` column contains zeros and ones. These are actually yes/no classification outcomes and not numeric representations.\n\nWe'll have to deal with this soon. 
For now, we can plot the data:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(early_finches,\n aes(x = factor(pointed_beak),\n y = blength)) +\n geom_boxplot()\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-binary_files/figure-html/unnamed-chunk-6-1.png){width=672}\n:::\n:::\n\n\n## Python\n\nWe could just give Python the `pointed_beak` data directly, but then it would view the values as numeric. Which doesn't really work, because we have two groups as such: those with a pointed beak (`1`), and those with a blunt one (`0`).\n\nWe can force Python to temporarily covert the data to a factor, by making the `pointed_beak` column an `object` type. We can do this directly inside the `ggplot()` function.\n\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(early_finches_py,\n aes(x = early_finches_py.pointed_beak.astype(object),\n y = \"blength\")) +\n geom_boxplot())\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-binary_files/figure-html/unnamed-chunk-7-1.png){width=614}\n:::\n:::\n\n:::\n\nIt looks as though the finches with blunt beaks generally have shorter beak lengths.\n\nWe can visualise that differently by plotting all the data points as a classic binary response plot:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(early_finches,\n aes(x = blength, y = pointed_beak)) +\n geom_point()\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-binary_files/figure-html/unnamed-chunk-8-3.png){width=672}\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(early_finches_py,\n aes(x = \"blength\",\n y = \"pointed_beak\")) +\n geom_point())\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-binary_files/figure-html/unnamed-chunk-9-1.png){width=614}\n:::\n:::\n\n\n:::\n\nThis presents us with a bit of an issue. We could fit a linear regression model to these data, although we already know that this is a bad idea...\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(early_finches,\n aes(x = blength, y = pointed_beak)) +\n geom_point() +\n geom_smooth(method = \"lm\", se = FALSE)\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-binary_files/figure-html/unnamed-chunk-10-3.png){width=672}\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(early_finches_py,\n aes(x = \"blength\",\n y = \"pointed_beak\")) +\n geom_point() +\n geom_smooth(method = \"lm\",\n colour = \"blue\",\n se = False))\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-binary_files/figure-html/unnamed-chunk-11-1.png){width=614}\n:::\n:::\n\n\n:::\n\nOf course this is rubbish - we can't have a beak classification outside the range of $[0, 1]$. 
It's either blunt (`0`) or pointed (`1`).\n\nBut for the sake of exploration, let's look at the assumptions:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlm_bks <- lm(pointed_beak ~ blength,\n data = early_finches)\n\nresid_panel(lm_bks,\n plots = c(\"resid\", \"qq\", \"ls\", \"cookd\"),\n smoother = TRUE)\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-binary_files/figure-html/unnamed-chunk-12-3.png){width=672}\n:::\n:::\n\n\n## Python\n\nFirst, we create a linear model:\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# create a linear model\nmodel = smf.ols(formula = \"pointed_beak ~ blength\",\n data = early_finches_py)\n# and get the fitted parameters of the model\nlm_bks_py = model.fit()\n```\n:::\n\n\nNext, we can create the diagnostic plots:\n\n::: {.cell}\n\n```{.python .cell-code}\ndgplots(lm_bks_py)\n```\n:::\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n![](images/dgplots/2024_02_01-07-29-38_AM_dgplots.png){width=805}\n:::\n:::\n\n\n:::\n\nThey're ~~pretty~~ extremely bad.\n\n- The response is not linear (Residual Plot, binary response plot, common sense).\n- The residuals do not appear to be distributed normally (Q-Q Plot)\n- The variance is not homogeneous across the predicted values (Location-Scale Plot)\n- But - there is always a silver lining - we don't have influential data points.\n\n## Creating a suitable model\n\nSo far we've established that using a simple linear model to describe a potential relationship between beak length and the probability of having a pointed beak is not a good idea. So, what _can_ we do?\n\nOne of the ways we can deal with binary outcome data is by performing a logistic regression. Instead of fitting a straight line to our data, and performing a regression on that, we fit a line that has an S shape. This avoids the model making predictions outside the $[0, 1]$ range.\n\nWe described our standard linear relationship as follows:\n\n$Y = \\beta_0 + \\beta_1X$\n\nWe can now map this to our non-linear relationship via the **logistic link function**:\n\n$Y = \\frac{\\exp(\\beta_0 + \\beta_1X)}{1 + \\exp(\\beta_0 + \\beta_1X)}$\n\nNote that the $\\beta_0 + \\beta_1X$ part is identical to the formula of a straight line.\n\nThe rest of the function is what makes the straight line curve into its characteristic S shape. \n\n:::{.callout-note collapse=true}\n## Euler's number ($\\exp$): would you like to know more?\n\nIn mathematics, $\\rm e$ represents a constant of around 2.718. Another notation is $\\exp$, which is often used when notations become a bit cumbersome. Here, I exclusively use the $\\exp$ notation for consistency.\n:::\n\n::: {.callout-important}\n## The logistic function\n\nThe shape of the logistic function is hugely influenced by the different parameters, in particular $\\beta_1$. 
The plots below show different situations, where $\\beta_0 = 0$ in all cases, but $\\beta_1$ varies.\n\nThe first plot shows the logistic function in its simplest form, with the others showing the effect of varying $\\beta_1$.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](glm-practical-logistic-binary_files/figure-html/unnamed-chunk-17-1.png){width=672}\n:::\n:::\n\n\n* when $\\beta_1 = 1$, this gives the simplest logistic function\n* when $\\beta_1 = 0$ gives a horizontal line, with $Y = \\frac{\\exp(\\beta_0)}{1+\\exp(\\beta_0)}$\n* when $\\beta_1$ is negative flips the curve around, so it slopes down\n* when $\\beta_1$ is very large then the curve becomes extremely steep\n\n:::\n\nWe can fit such an S-shaped curve to our `early_finches` data set, by creating a generalised linear model.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nIn R we have a few options to do this, and by far the most familiar function would be `glm()`. Here we save the model in an object called `glm_bks`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nglm_bks <- glm(pointed_beak ~ blength,\n family = binomial,\n data = early_finches)\n```\n:::\n\n\nThe format of this function is similar to that used by the `lm()` function for linear models. The important difference is that we must specify the _family_ of error distribution to use. For logistic regression we must set the family to **binomial**.\n\nIf you forget to set the `family` argument, then the `glm()` function will perform a standard linear model fit, identical to what the `lm()` function would do.\n\n## Python\n\nIn Python we have a few options to do this, and by far the most familiar function would be `glm()`. Here we save the model in an object called `glm_bks_py`:\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# create a linear model\nmodel = smf.glm(formula = \"pointed_beak ~ blength\",\n family = sm.families.Binomial(),\n data = early_finches_py)\n# and get the fitted parameters of the model\nglm_bks_py = model.fit()\n```\n:::\n\n\nThe format of this function is similar to that used by the `ols()` function for linear models. The important difference is that we must specify the _family_ of error distribution to use. For logistic regression we must set the family to **binomial**. This is buried deep inside the `statsmodels` package and needs to be defined as `sm.families.Binomial()`.\n\n:::\n\n## Model output\n\nThat's the easy part done! The trickier part is interpreting the output. First of all, we'll get some summary information.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(glm_bks)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nglm(formula = pointed_beak ~ blength, family = binomial, data = early_finches)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) -43.410 15.250 -2.847 0.00442 **\nblength 3.387 1.193 2.839 0.00452 **\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for binomial family taken to be 1)\n\n Null deviance: 84.5476 on 60 degrees of freedom\nResidual deviance: 9.1879 on 59 degrees of freedom\nAIC: 13.188\n\nNumber of Fisher Scoring iterations: 8\n```\n\n\n:::\n:::\n\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nprint(glm_bks_py.summary())\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n Generalized Linear Model Regression Results \n==============================================================================\nDep. Variable: pointed_beak No. 
Observations: 61\nModel: GLM Df Residuals: 59\nModel Family: Binomial Df Model: 1\nLink Function: Logit Scale: 1.0000\nMethod: IRLS Log-Likelihood: -4.5939\nDate: Thu, 01 Feb 2024 Deviance: 9.1879\nTime: 07:29:39 Pearson chi2: 15.1\nNo. Iterations: 8 Pseudo R-squ. (CS): 0.7093\nCovariance Type: nonrobust \n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept -43.4096 15.250 -2.847 0.004 -73.298 -13.521\nblength 3.3866 1.193 2.839 0.005 1.049 5.724\n==============================================================================\n```\n\n\n:::\n:::\n\n\n:::\n\nThere’s a lot to unpack here, but let's start with what we're familiar with: coefficients!\n\n## Parameter interpretation\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nThe coefficients or parameters can be found in the `Coefficients` block. The main numbers to extract from the output are the two numbers underneath `Estimate.Std`:\n\n```\nCoefficients:\n Estimate Std.\n(Intercept) -43.410\nblength 3.387 \n```\n\n## Python\n\nRight at the bottom is a table showing the model coefficients. The main numbers to extract from the output are the two numbers in the `coef` column:\n\n```\n======================\n coef\n----------------------\nIntercept -43.4096\nblength 3.3866\n======================\n```\n\n:::\n\nThese are the coefficients of the logistic model equation and need to be placed in the correct equation if we want to be able to calculate the probability of having a pointed beak for a given beak length.\n\nThe $p$ values at the end of each coefficient row merely show whether that particular coefficient is significantly different from zero. This is similar to the $p$ values obtained in the summary output of a linear model. As with continuous predictors in simple models, these $p$ values can be used to decide whether that predictor is important (so in this case beak length appears to be significant). However, these $p$ values aren’t great to work with when we have multiple predictor variables, or when we have categorical predictors with multiple levels (since the output will give us a $p$ value for each level rather than for the predictor as a whole).\n\nWe can use the coefficients to calculate the probability of having a pointed beak for a given beak length:\n\n$$ P(pointed \\ beak) = \\frac{\\exp(-43.41 + 3.39 \\times blength)}{1 + \\exp(-43.41 + 3.39 \\times blength)} $$\n\nHaving this formula means that we can calculate the probability of having a pointed beak for any beak length. How do we work this out in practice? 
\n\n::: {.panel-tabset group=\"language\"}\n\n## R\n\nWell, the probability of having a pointed beak if the beak length is large (for example 15 mm) can be calculated as follows:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nexp(-43.41 + 3.39 * 15) / (1 + exp(-43.41 + 3.39 * 15))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0.9994131\n```\n\n\n:::\n:::\n\n\nIf the beak length is small (for example 10 mm), the probability of having a pointed beak is extremely low:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nexp(-43.41 + 3.39 * 10) / (1 + exp(-43.41 + 3.39 * 10))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 7.410155e-05\n```\n\n\n:::\n:::\n\n\n## Python\n\nWell, the probability of having a pointed beak if the beak length is large (for example 15 mm) can be calculated as follows:\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# import the math library\nimport math\n```\n:::\n\n::: {.cell}\n\n```{.python .cell-code}\nmath.exp(-43.41 + 3.39 * 15) / (1 + math.exp(-43.41 + 3.39 * 15))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n0.9994130595039192\n```\n\n\n:::\n:::\n\n\nIf the beak length is small (for example 10 mm), the probability of having a pointed beak is extremely low:\n\n\n::: {.cell}\n\n```{.python .cell-code}\nmath.exp(-43.41 + 3.39 * 10) / (1 + math.exp(-43.41 + 3.39 * 10))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n7.410155028945912e-05\n```\n\n\n:::\n:::\n\n:::\n\nWe can calculate the the probabilities for all our observed values and if we do that then we can see that the larger the beak length is, the higher the probability that a beak shape would be pointed. I'm visualising this together with the logistic curve, where the blue points are the calculated probabilities:\n\n::: {.callout-note collapse=true}\n## Code available here\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nglm_bks %>% \n augment(type.predict = \"response\") %>% \n ggplot() +\n geom_point(aes(x = blength, y = pointed_beak)) +\n geom_line(aes(x = blength, y = .fitted),\n linetype = \"dashed\",\n colour = \"blue\") +\n geom_point(aes(x = blength, y = .fitted),\n colour = \"blue\", alpha = 0.5) +\n labs(x = \"beak length (mm)\",\n y = \"Probability\")\n```\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(early_finches_py) +\n geom_point(aes(x = \"blength\", y = \"pointed_beak\")) +\n geom_line(aes(x = \"blength\", y = glm_bks_py.fittedvalues),\n linetype = \"dashed\",\n colour = \"blue\") +\n geom_point(aes(x = \"blength\", y = glm_bks_py.fittedvalues),\n colour = \"blue\", alpha = 0.5) +\n labs(x = \"beak length (mm)\",\n y = \"Probability\"))\n```\n:::\n\n:::\n:::\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Predicted probabilities for beak classification](glm-practical-logistic-binary_files/figure-html/fig-beak_class_glm_probs-2.png){#fig-beak_class_glm_probs width=672}\n:::\n:::\n\n\nThe graph shows us that, based on the data that we have and the model we used to make predictions about our response variable, the probability of seeing a pointed beak increases with beak length.\n\nShort beaks are more closely associated with the bluntly shaped beaks, whereas long beaks are more closely associated with the pointed shape. 
It's also clear that there is a range of beak lengths (around 13 mm) where the probability of getting one shape or another is much more even.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n## Assumptions\n\nAs explained in the background chapter, we can't really use the standard diagnostic plots to assess assumptions. We're not going to go into a lot of detail for now, but there is one thing that we can do: look for influential points using the Cook’s distance plot.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nplot(glm_bks , which = 4)\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-binary_files/figure-html/unnamed-chunk-30-1.png){width=672}\n:::\n:::\n\n\n:::{.callout-note collapse=true}\n## Extracting the Cook's distances from the `glm` object\n\nInstead of using the `plot()` function, we can also extract the values directly from the `glm` object. We can use the `augment()` function to do this and create a lollipop or stem plot:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nglm_bks %>% \n augment() %>% # get underlying data\n select(.cooksd) %>% # select the Cook's d\n mutate(obs = 1:n()) %>% # create an index column\n ggplot(aes(x = obs, y = .cooksd)) +\n geom_point() +\n geom_segment(aes(x = obs, y = .cooksd, xend = obs, yend = 0))\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-binary_files/figure-html/unnamed-chunk-31-1.png){width=672}\n:::\n:::\n\n:::\n\n## Python\n\nAs always, there are different ways of doing this. Here we extract the Cook's d values from the `glm` object and put them in a Pandas DataFrame. We can then use that to plot them in a lollipop or stem plot.\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# extract the Cook's distances\nglm_bks_py_resid = pd.DataFrame(glm_bks_py.\n get_influence().\n summary_frame()[\"cooks_d\"])\n\n# add row index \nglm_bks_py_resid['obs'] = glm_bks_py_resid.reset_index().index\n```\n:::\n\n\nWe now have two columns:\n\n\n::: {.cell}\n\n```{.python .cell-code}\nglm_bks_py_resid\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n cooks_d obs\n0 1.854360e-07 0\n1 3.388262e-07 1\n2 3.217960e-05 2\n3 1.194847e-05 3\n4 6.643975e-06 4\n.. ... ...\n56 1.225519e-05 56\n57 2.484468e-05 57\n58 6.781364e-06 58\n59 1.850240e-07 59\n60 3.532360e-05 60\n\n[61 rows x 2 columns]\n```\n\n\n:::\n:::\n\n\nWe can use these to create the plot:\n\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(glm_bks_py_resid,\n aes(x = \"obs\",\n y = \"cooks_d\")) +\n geom_segment(aes(x = \"obs\", y = \"cooks_d\", xend = \"obs\", yend = 0)) +\n geom_point())\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-binary_files/figure-html/unnamed-chunk-34-1.png){width=614}\n:::\n:::\n\n\n:::\n\nThis shows that there are no very obvious influential points. You could regard point `34` as potentially influential (it's got a Cook's distance of around `0.8`), but I'm not overly worried.\n\nIf we were worried, we'd remove the troublesome data point, re-run the analysis and see if that changes the statistical outcome. If so, then our entire (statistical) conclusion hinges on one data point, which is not a very robust bit of research. 
If it *doesn't* change our significance, then all is well, even though that data point is influential.\n\n## Assessing significance\n\nWe can ask several questions.\n\n**Is the model well-specified?**\n\nRoughly speaking this asks \"can our model predict our data reasonably well?\"\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nUnfortunately, there isn’t a single command that does this for us, and we have to lift some of the numbers from the summary output ourselves. \n\n\n::: {.cell}\n\n```{.r .cell-code}\npchisq(9.1879, 59, lower.tail = FALSE)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 1\n```\n\n\n:::\n:::\n\n\n\nHere, we’ve used the `pchisq` function (which calculates the correct probability for us – ask if you want a hand-wavy explanation). The first argument to it is the residual deviance value from the summary table, the second argument to is the residual degrees of freedom argument from the same table.\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nfrom scipy.stats import chi2\n\nchi2.sf(9.1879, 59)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n0.9999999999999916\n```\n\n\n:::\n:::\n\n\n:::\n\nThis gives us a probability of `1`. We can interpret this as the probability that the model is actually good. There aren’t any strict conventions on how to interpret this value but, for me, a tiny value would indicate a rubbish model.\n\n**Is the overall model better than the null model?**\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\npchisq(84.5476 - 9.1879, 60 - 59, lower.tail = FALSE)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 3.923163e-18\n```\n\n\n:::\n:::\n\n\nHere we’ve used the `pchisq` function again (if you didn’t ask before, you probably aren’t going to ask now).\n\n## Python\n\nFirst we need to define the null model:\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# create a linear model\nmodel = smf.glm(formula = \"pointed_beak ~ 1\",\n family = sm.families.Binomial(),\n data = early_finches_py)\n# and get the fitted parameters of the model\nglm_bks_null_py = model.fit()\n\nprint(glm_bks_null_py.summary())\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n Generalized Linear Model Regression Results \n==============================================================================\nDep. Variable: pointed_beak No. Observations: 61\nModel: GLM Df Residuals: 60\nModel Family: Binomial Df Model: 0\nLink Function: Logit Scale: 1.0000\nMethod: IRLS Log-Likelihood: -42.274\nDate: Thu, 01 Feb 2024 Deviance: 84.548\nTime: 07:29:43 Pearson chi2: 61.0\nNo. Iterations: 3 Pseudo R-squ. 
(CS): 0.000\nCovariance Type: nonrobust \n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept 0.0328 0.256 0.128 0.898 -0.469 0.535\n==============================================================================\n```\n\n\n:::\n:::\n\n\nIn order to compare our original fitted model to the null model we need to know the deviances of both models and the residual degrees of freedom of both models, which we could get from the summary method.\n\n\n::: {.cell}\n\n```{.python .cell-code}\nchi2.sf(84.5476 - 9.1879, 60 - 59)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n3.9231627082752525e-18\n```\n\n\n:::\n:::\n\n\n:::\n\nThe first argument is the difference between the null and residual deviances and the second argument is the difference in degrees of freedom between the null and residual models. All of these values can be lifted from the summary table.\n\nThis gives us a probability of pretty much zero. This value is doing a formal test to see whether our fitted model is significantly different from the null model. Here we can treat this a classical hypothesis test and since this p-value is less than 0.05 then we can say that our fitted model (with `blength` as a predictor variable) is definitely better than the null model (which has no predictor variables). Woohoo!\n\n**Are any of the individual predictors significant?**\n\nFinally, we’ll use the anova function from before to determine which predictor variables are important, and specifically in this case whether the beak length predictor is significant.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nanova(glm_bks, test = \"Chisq\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nAnalysis of Deviance Table\n\nModel: binomial, link: logit\n\nResponse: pointed_beak\n\nTerms added sequentially (first to last)\n\n Df Deviance Resid. Df Resid. Dev Pr(>Chi) \nNULL 60 84.548 \nblength 1 75.36 59 9.188 < 2.2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n```\n\n\n:::\n:::\n\n\nThe `anova()` function is a true workhorse within R! This time we’ve used it to create an Analysis of Deviance table. This is exactly equivalent to an ordinary ANOVA table where we have rows corresponding to each predictor variable and a p-value telling us whether that variable is significant or not.\n\nThe p-value for the `blength` predictor is written under then Pr(>Chi) column and we can see that it is less than `< 2.2e-16`. So, beak length is a significant predictor.\n\nThis shouldn’t be surprising since we already saw that our overall model was better than the null model, which in this case is exactly the same as asking whether the beak length term is significant. However, in more complicated models with multiple predictors these two comparisons (and p-values) won’t be the same.\n\n## Python\n\nAlas, for some inexplicable reason this is not (yet?) possible to do in Python. At least, unbeknownst to me...\n\n:::\n\n\n## Exercises\n\n### Diabetes {#sec-exr_diabetes}\n\n:::{.callout-exercise}\n\n\n{{< level 2 >}}\n\n\n\nFor this exercise we'll be using the data from `data/diabetes.csv`.\n\nThis is a data set comprising 768 observations of three variables (one dependent and two predictor variables). 
This records the results of a diabetes test result as a binary variable (1 is a positive result, 0 is a negative result), along with the result of a glucose tolerance test and the diastolic blood pressure for each of 768 women. The variables are called `test_result`, `glucose` and `diastolic`.\n\nWe want to see if the `glucose` tolerance is a meaningful predictor for predictions on a diabetes test. To investigate this, do the following:\n\n1. Load and visualise the data\n2. Create a suitable model\n3. Determine if there are any statistically significant predictors\n4. Calculate the probability of a positive diabetes test result for a glucose tolerance test value of `glucose = 150`\n\n::: {.callout-answer collapse=\"true\"}\n\n#### Load and visualise the data\n\nFirst we load the data, then we visualise it.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndiabetes <- read_csv(\"data/diabetes.csv\")\n```\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\ndiabetes_py = pd.read_csv(\"data/diabetes.csv\")\n```\n:::\n\n\n:::\n\nLooking at the data, we can see that the `test_result` column contains zeros and ones. These are yes/no test result outcomes and not actually numeric representations.\n\nWe'll have to deal with this soon. For now, we can plot the data, by outcome:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(diabetes,\n aes(x = factor(test_result),\n y = glucose)) +\n geom_boxplot()\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-binary_files/figure-html/unnamed-chunk-43-1.png){width=672}\n:::\n:::\n\n\n## Python\n\nWe could just give Python the `test_result` data directly, but then it would view the values as numeric. Which doesn't really work, because we have two groups as such: those with a negative diabetes test result, and those with a positive one.\n\nWe can force Python to temporarily covert the data to a factor, by making the `test_result` column an `object` type. We can do this directly inside the `ggplot()` function.\n\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(diabetes_py,\n aes(x = diabetes_py.test_result.astype(object),\n y = \"glucose\")) +\n geom_boxplot())\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-binary_files/figure-html/unnamed-chunk-44-1.png){width=614}\n:::\n:::\n\n:::\n\nIt looks as though the patients with a positive diabetes test have slightly higher glucose levels than those with a negative diabetes test.\n\nWe can visualise that differently by plotting all the data points as a classic binary response plot:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(diabetes,\n aes(x = glucose,\n y = test_result)) +\n geom_point()\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-binary_files/figure-html/unnamed-chunk-45-3.png){width=672}\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(diabetes_py,\n aes(x = \"glucose\",\n y = \"test_result\")) +\n geom_point())\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-binary_files/figure-html/unnamed-chunk-46-1.png){width=614}\n:::\n:::\n\n\n:::\n\n#### Create a suitable model\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nWe'll use the `glm()` function to create a generalised linear model. 
Here we save the model in an object called `glm_dia`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nglm_dia <- glm(test_result ~ glucose,\n family = binomial,\n data = diabetes)\n```\n:::\n\n\nThe format of this function is similar to that used by the `lm()` function for linear models. The important difference is that we must specify the _family_ of error distribution to use. For logistic regression we must set the family to **binomial**.\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# create a linear model\nmodel = smf.glm(formula = \"test_result ~ glucose\",\n family = sm.families.Binomial(),\n data = diabetes_py)\n# and get the fitted parameters of the model\nglm_dia_py = model.fit()\n```\n:::\n\n\n:::\n\nLet's look at the model parameters:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(glm_dia)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nglm(formula = test_result ~ glucose, family = binomial, data = diabetes)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) -5.611732 0.442289 -12.69 <2e-16 ***\nglucose 0.039510 0.003398 11.63 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for binomial family taken to be 1)\n\n Null deviance: 936.6 on 727 degrees of freedom\nResidual deviance: 752.2 on 726 degrees of freedom\nAIC: 756.2\n\nNumber of Fisher Scoring iterations: 4\n```\n\n\n:::\n:::\n\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nprint(glm_dia_py.summary())\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n Generalized Linear Model Regression Results \n==============================================================================\nDep. Variable: test_result No. Observations: 728\nModel: GLM Df Residuals: 726\nModel Family: Binomial Df Model: 1\nLink Function: Logit Scale: 1.0000\nMethod: IRLS Log-Likelihood: -376.10\nDate: Thu, 01 Feb 2024 Deviance: 752.20\nTime: 07:29:45 Pearson chi2: 713.\nNo. Iterations: 4 Pseudo R-squ. (CS): 0.2238\nCovariance Type: nonrobust \n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept -5.6117 0.442 -12.688 0.000 -6.479 -4.745\nglucose 0.0395 0.003 11.628 0.000 0.033 0.046\n==============================================================================\n```\n\n\n:::\n:::\n\n:::\n\nWe can see that `glucose` is a significant predictor for the `test_result` (the $p$ value is much smaller than 0.05).\n\nKnowing this, we're interested in the coefficients. We have an intercept of `-5.61` and `0.0395` for `glucose`. We can use these coefficients to write a formula that describes the potential relationship between the probability of having a positive test result, dependent on the glucose tolerance level value:\n\n$$ P(positive \\ test\\ result) = \\frac{\\exp(-5.61 + 0.04 \\times glucose)}{1 + \\exp(-5.61 + 0.04 \\times glucose)} $$\n\n#### Calculating probabilities\n\nUsing the formula above, we can now calculate the probability of having a positive test result, for a given `glucose` value. 
If we do this for `glucose = 150`, we get the following:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nexp(-5.61 + 0.04 * 150) / (1 + exp(-5.61 + 0.04 * 150))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0.5962827\n```\n\n\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nmath.exp(-5.61 + 0.04 * 150) / (1 + math.exp(-5.61 + 0.04 * 150))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n0.5962826992967878\n```\n\n\n:::\n:::\n\n:::\n\nThis tells us that the probability of having a positive diabetes test result, given a glucose tolerance level of 150 is around 60 %.\n\n:::\n:::\n\n## Summary\n\n::: {.callout-tip}\n#### Key points\n\n- We use a logistic regression to model a binary response\n- We can feed new observations into the model and get probabilities for the outcome\n:::\n", + "markdown": "---\ntitle: \"Binary response\"\n---\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n\n\n::: {.callout-tip}\n## Learning outcomes\n\n**Questions**\n\n- How do we analyse data with a binary outcome?\n- Can we test if our model is any good?\n- Be able to perform a logistic regression with a binary outcome\n- Predict outcomes of new data, based on a defined model\n\n**Objectives**\n\n- Be able to analyse binary outcome data\n- Understand different methods of testing model fit\n- Be able to make model predictions\n:::\n\n## Libraries and functions\n\n::: {.callout-note collapse=\"true\"}\n## Click to expand\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n### Libraries\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(broom)\nlibrary(tidyverse)\nlibrary(ggResidpanel)\n```\n:::\n\n\n### Functions\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# create diagnostic plots\nggResidpanel::resid_panel()\n```\n:::\n\n\n## Python\n\n### Libraries\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# A maths library\nimport math\n# A Python data analysis and manipulation tool\nimport pandas as pd\n\n# Python equivalent of `ggplot2`\nfrom plotnine import *\n\n# Statistical models, conducting tests and statistical data exploration\nimport statsmodels.api as sm\n\n# Convenience interface for specifying models using formula strings and DataFrames\nimport statsmodels.formula.api as smf\n\n# Needed for additional probability functionality\nfrom scipy.stats import *\n```\n:::\n\n\n### Functions\n:::\n:::\n\nThe example in this section uses the following data set:\n\n`data/finches_early.csv`\n\nThese data come from an analysis of gene flow across two finch species [@lamichhaney2020]. They are slightly adapted here for illustrative purposes.\n\nThe data focus on two species, _Geospiza fortis_ and _G. scandens_. The original measurements are split by a uniquely timed event: a particularly strong El Niño event in 1983. This event changed the vegetation and food supply of the finches, allowing F1 hybrids of the two species to survive, whereas before 1983 they could not. The measurements are classed as `early` (pre-1983) and `late` (1983 onwards).\n\nHere we are looking only at the `early` data. 
We are specifically focussing on the beak shape classification, which we saw earlier in @fig-beak_shape_glm.\n\n## Load and visualise the data\n\nFirst we load the data, then we visualise it.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nearly_finches <- read_csv(\"data/finches_early.csv\")\n```\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nearly_finches_py = pd.read_csv(\"data/finches_early.csv\")\n```\n:::\n\n\n:::\n\nLooking at the data, we can see that the `pointed_beak` column contains zeros and ones. These are actually yes/no classification outcomes and not numeric representations.\n\nWe'll have to deal with this soon. For now, we can plot the data:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(early_finches,\n aes(x = factor(pointed_beak),\n y = blength)) +\n geom_boxplot()\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-binary_files/figure-html/unnamed-chunk-8-1.png){width=672}\n:::\n:::\n\n\n## Python\n\nWe could just give Python the `pointed_beak` data directly, but then it would view the values as numeric. This doesn't really work, because we have two groups: those with a pointed beak (`1`), and those with a blunt one (`0`).\n\nWe can force Python to temporarily convert the data to a factor, by making the `pointed_beak` column an `object` type. We can do this directly inside the `ggplot()` function.\n\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(early_finches_py,\n aes(x = early_finches_py.pointed_beak.astype(object),\n y = \"blength\")) +\n geom_boxplot())\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-binary_files/figure-html/unnamed-chunk-9-1.png){width=614}\n:::\n:::\n\n:::\n\nIt looks as though the finches with blunt beaks generally have shorter beak lengths.\n\nWe can visualise that differently by plotting all the data points as a classic binary response plot:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(early_finches,\n aes(x = blength, y = pointed_beak)) +\n geom_point()\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-binary_files/figure-html/unnamed-chunk-10-3.png){width=672}\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(early_finches_py,\n aes(x = \"blength\",\n y = \"pointed_beak\")) +\n geom_point())\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-binary_files/figure-html/unnamed-chunk-11-1.png){width=614}\n:::\n:::\n\n\n:::\n\nThis presents us with a bit of an issue. We could fit a linear regression model to these data, although we already know that this is a bad idea...\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(early_finches,\n aes(x = blength, y = pointed_beak)) +\n geom_point() +\n geom_smooth(method = \"lm\", se = FALSE)\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-binary_files/figure-html/unnamed-chunk-12-3.png){width=672}\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(early_finches_py,\n aes(x = \"blength\",\n y = \"pointed_beak\")) +\n geom_point() +\n geom_smooth(method = \"lm\",\n colour = \"blue\",\n se = False))\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-binary_files/figure-html/unnamed-chunk-13-1.png){width=614}\n:::\n:::\n\n\n:::\n\nOf course this is rubbish - we can't have a beak classification outside the range of $[0, 1]$. 
It's either blunt (`0`) or pointed (`1`).\n\nBut for the sake of exploration, let's look at the assumptions:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlm_bks <- lm(pointed_beak ~ blength,\n data = early_finches)\n\nresid_panel(lm_bks,\n plots = c(\"resid\", \"qq\", \"ls\", \"cookd\"),\n smoother = TRUE)\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-binary_files/figure-html/unnamed-chunk-14-3.png){width=672}\n:::\n:::\n\n\n## Python\n\nFirst, we create a linear model:\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# create a linear model\nmodel = smf.ols(formula = \"pointed_beak ~ blength\",\n data = early_finches_py)\n# and get the fitted parameters of the model\nlm_bks_py = model.fit()\n```\n:::\n\n\nNext, we can create the diagnostic plots:\n\n::: {.cell}\n\n```{.python .cell-code}\ndgplots(lm_bks_py)\n```\n:::\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n![](images/dgplots/2024_06_14-08-41-54_AM_dgplots.png){width=804}\n:::\n:::\n\n\n:::\n\nThey're ~~pretty~~ extremely bad.\n\n- The response is not linear (Residual Plot, binary response plot, common sense).\n- The residuals do not appear to be normally distributed (Q-Q Plot).\n- The variance is not homogeneous across the predicted values (Location-Scale Plot).\n- But - there is always a silver lining - we don't have influential data points.\n\n## Creating a suitable model\n\nSo far we've established that using a simple linear model to describe a potential relationship between beak length and the probability of having a pointed beak is not a good idea. So, what _can_ we do?\n\nOne of the ways we can deal with binary outcome data is by performing a logistic regression. Instead of fitting a straight line to our data, and performing a regression on that, we fit a line that has an S shape. This avoids the model making predictions outside the $[0, 1]$ range.\n\nWe described our standard linear relationship as follows:\n\n$Y = \\beta_0 + \\beta_1X$\n\nWe can now map this to our non-linear relationship via the **logistic link function**:\n\n$Y = \\frac{\\exp(\\beta_0 + \\beta_1X)}{1 + \\exp(\\beta_0 + \\beta_1X)}$\n\nNote that the $\\beta_0 + \\beta_1X$ part is identical to the formula of a straight line.\n\nThe rest of the function is what makes the straight line curve into its characteristic S shape. \n\n:::{.callout-note collapse=true}\n## Euler's number ($\\exp$): would you like to know more?\n\nIn mathematics, $\\rm e$ represents a constant of around 2.718, known as Euler's number. The notation $\\exp(x)$ is just another way of writing $e^x$, which is often used when the exponent becomes a bit cumbersome. Here, I exclusively use the $\\exp$ notation for consistency.\n:::
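\n\nTo make the link function a little less abstract, here is a minimal numerical sketch (our own addition, not part of the original analysis) that evaluates it for $\\beta_0 = 0$ and $\\beta_1 = 1$:\n\n::: {.cell}\n\n```{.python .cell-code}\n# evaluate the logistic function at a few x values to see the S shape\nimport math\n\ndef logistic(x, b0 = 0, b1 = 1):\n    return math.exp(b0 + b1 * x) / (1 + math.exp(b0 + b1 * x))\n\n[round(logistic(x), 3) for x in (-6, -2, 0, 2, 6)]\n# [0.002, 0.119, 0.5, 0.881, 0.998] - flat at the ends, steep in the middle\n```\n:::\n\nHowever extreme $x$ becomes, the output stays within $[0, 1]$, which is exactly the property we need.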
\n\n::: {.callout-important}\n## The logistic function\n\nThe shape of the logistic function is hugely influenced by the different parameters, in particular $\\beta_1$. The plots below show different situations, where $\\beta_0 = 0$ in all cases, but $\\beta_1$ varies.\n\nThe first plot shows the logistic function in its simplest form, with the others showing the effect of varying $\\beta_1$.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](glm-practical-logistic-binary_files/figure-html/unnamed-chunk-19-1.png){width=672}\n:::\n:::\n\n\n* when $\\beta_1 = 1$, this gives the simplest logistic function\n* when $\\beta_1 = 0$, we get a horizontal line, with $Y = \\frac{\\exp(\\beta_0)}{1+\\exp(\\beta_0)}$\n* when $\\beta_1$ is negative, the curve is flipped around, so it slopes down\n* when $\\beta_1$ is very large, the curve becomes extremely steep\n\n:::\n\nWe can fit such an S-shaped curve to our `early_finches` data set, by creating a generalised linear model.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nIn R we have a few options to do this, and by far the most familiar function would be `glm()`. Here we save the model in an object called `glm_bks`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nglm_bks <- glm(pointed_beak ~ blength,\n family = binomial,\n data = early_finches)\n```\n:::\n\n\nThe format of this function is similar to that used by the `lm()` function for linear models. The important difference is that we must specify the _family_ of error distribution to use. For logistic regression we must set the family to **binomial**.\n\nIf you forget to set the `family` argument, then the `glm()` function will perform a standard linear model fit, identical to what the `lm()` function would do.\n\n## Python\n\nIn Python we have a few options to do this, and by far the most familiar function would be `glm()`. Here we save the model in an object called `glm_bks_py`:\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# create a linear model\nmodel = smf.glm(formula = \"pointed_beak ~ blength\",\n family = sm.families.Binomial(),\n data = early_finches_py)\n# and get the fitted parameters of the model\nglm_bks_py = model.fit()\n```\n:::\n\n\nThe format of this function is similar to that used by the `ols()` function for linear models. The important difference is that we must specify the _family_ of error distribution to use. For logistic regression we must set the family to **binomial**. This is buried deep inside the `statsmodels` package and needs to be defined as `sm.families.Binomial()`.\n\n:::\n\n## Model output\n\nThat's the easy part done! The trickier part is interpreting the output. First of all, we'll get some summary information.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(glm_bks)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nglm(formula = pointed_beak ~ blength, family = binomial, data = early_finches)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) -43.410 15.250 -2.847 0.00442 **\nblength 3.387 1.193 2.839 0.00452 **\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for binomial family taken to be 1)\n\n Null deviance: 84.5476 on 60 degrees of freedom\nResidual deviance: 9.1879 on 59 degrees of freedom\nAIC: 13.188\n\nNumber of Fisher Scoring iterations: 8\n```\n\n\n:::\n:::\n\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nprint(glm_bks_py.summary())\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n Generalized Linear Model Regression Results \n==============================================================================\nDep. Variable: pointed_beak No. Observations: 61\nModel: GLM Df Residuals: 59\nModel Family: Binomial Df Model: 1\nLink Function: Logit Scale: 1.0000\nMethod: IRLS Log-Likelihood: -4.5939\nDate: Fri, 14 Jun 2024 Deviance: 9.1879\nTime: 08:41:55 Pearson chi2: 15.1\nNo. Iterations: 8 Pseudo R-squ. (CS): 0.7093\nCovariance Type: nonrobust \n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept -43.4096 15.250 -2.847 0.004 -73.298 -13.521\nblength 3.3866 1.193 2.839 0.005 1.049 5.724\n==============================================================================\n```\n\n\n:::\n:::\n\n\n:::\n\nThere’s a lot to unpack here, but let's start with what we're familiar with: coefficients!\n\n## Parameter interpretation\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nThe coefficients or parameters can be found in the `Coefficients` block. The main numbers to extract from the output are the two numbers underneath `Estimate`:\n\n```\nCoefficients:\n Estimate\n(Intercept) -43.410\nblength 3.387 \n```\n\n## Python\n\nRight at the bottom is a table showing the model coefficients. The main numbers to extract from the output are the two numbers in the `coef` column:\n\n```\n======================\n coef\n----------------------\nIntercept -43.4096\nblength 3.3866\n======================\n```\n\n:::\n\nThese are the coefficients of the logistic model equation and need to be placed in the correct equation if we want to be able to calculate the probability of having a pointed beak for a given beak length.\n\nThe $p$ values at the end of each coefficient row merely show whether that particular coefficient is significantly different from zero. This is similar to the $p$ values obtained in the summary output of a linear model. As with continuous predictors in simple models, these $p$ values can be used to decide whether that predictor is important (so in this case beak length appears to be significant). However, these $p$ values aren’t great to work with when we have multiple predictor variables, or when we have categorical predictors with multiple levels (since the output will give us a $p$ value for each level rather than for the predictor as a whole).\n\nWe can use the coefficients to calculate the probability of having a pointed beak for a given beak length:\n\n$$ P(pointed \\ beak) = \\frac{\\exp(-43.41 + 3.39 \\times blength)}{1 + \\exp(-43.41 + 3.39 \\times blength)} $$\n\nHaving this formula means that we can calculate the probability of having a pointed beak for any beak length. How do we work this out in practice?
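\n\nOne option is to wrap the equation in a small helper function, so that we don't have to retype the right-hand side for every beak length. Below is a minimal sketch of this idea in Python; the function name `beak_probability` and the rounded, hard-coded coefficients are our own (lifted from the summary output above), so treat it as illustrative rather than part of the original analysis:\n\n::: {.cell}\n\n```{.python .cell-code}\n# a reusable version of the logistic equation for this model\nimport math\n\ndef beak_probability(blength, intercept = -43.41, slope = 3.39):\n    linear = intercept + slope * blength  # the straight-line part\n    return math.exp(linear) / (1 + math.exp(linear))  # the logistic link\n\nbeak_probability(15)  # roughly 0.9994\nbeak_probability(10)  # roughly 7.4e-05\n```\n:::\n\nEquivalently, we could ask the fitted model for predictions on new data (e.g. with the model's `predict()` method), which avoids copying coefficients by hand. Below we do the calculation manually, so that the role of each coefficient stays visible.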
\n\n::: {.panel-tabset group=\"language\"}\n\n## R\n\nWell, the probability of having a pointed beak if the beak length is large (for example 15 mm) can be calculated as follows:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nexp(-43.41 + 3.39 * 15) / (1 + exp(-43.41 + 3.39 * 15))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0.9994131\n```\n\n\n:::\n:::\n\n\nIf the beak length is small (for example 10 mm), the probability of having a pointed beak is extremely low:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nexp(-43.41 + 3.39 * 10) / (1 + exp(-43.41 + 3.39 * 10))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 7.410155e-05\n```\n\n\n:::\n:::\n\n\n## Python\n\nWell, the probability of having a pointed beak if the beak length is large (for example 15 mm) can be calculated as follows:\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# import the math library\nimport math\n```\n:::\n\n::: {.cell}\n\n```{.python .cell-code}\nmath.exp(-43.41 + 3.39 * 15) / (1 + math.exp(-43.41 + 3.39 * 15))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n0.9994130595039192\n```\n\n\n:::\n:::\n\n\nIf the beak length is small (for example 10 mm), the probability of having a pointed beak is extremely low:\n\n\n::: {.cell}\n\n```{.python .cell-code}\nmath.exp(-43.41 + 3.39 * 10) / (1 + math.exp(-43.41 + 3.39 * 10))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n7.410155028945912e-05\n```\n\n\n:::\n:::\n\n:::\n\nWe can calculate the probabilities for all our observed values, and if we do that we can see that the larger the beak length is, the higher the probability that a beak shape would be pointed. I'm visualising this together with the logistic curve, where the blue points are the calculated probabilities:\n\n::: {.callout-note collapse=true}\n## Code available here\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nglm_bks %>% \n augment(type.predict = \"response\") %>% \n ggplot() +\n geom_point(aes(x = blength, y = pointed_beak)) +\n geom_line(aes(x = blength, y = .fitted),\n linetype = \"dashed\",\n colour = \"blue\") +\n geom_point(aes(x = blength, y = .fitted),\n colour = \"blue\", alpha = 0.5) +\n labs(x = \"beak length (mm)\",\n y = \"Probability\")\n```\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(early_finches_py) +\n geom_point(aes(x = \"blength\", y = \"pointed_beak\")) +\n geom_line(aes(x = \"blength\", y = glm_bks_py.fittedvalues),\n linetype = \"dashed\",\n colour = \"blue\") +\n geom_point(aes(x = \"blength\", y = glm_bks_py.fittedvalues),\n colour = \"blue\", alpha = 0.5) +\n labs(x = \"beak length (mm)\",\n y = \"Probability\"))\n```\n:::\n\n:::\n:::\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Predicted probabilities for beak classification](glm-practical-logistic-binary_files/figure-html/fig-beak_class_glm_probs-2.png){#fig-beak_class_glm_probs width=672}\n:::\n:::\n\n\nThe graph shows us that, based on the data that we have and the model we used to make predictions about our response variable, the probability of seeing a pointed beak increases with beak length.\n\nShort beaks are more closely associated with the bluntly shaped beaks, whereas long beaks are more closely associated with the pointed shape. It's also clear that there is a range of beak lengths (around 13 mm) where the probability of getting one shape or another is much more even.\n\n## Assumptions\n\nAs explained in the background chapter, we can't really use the standard diagnostic plots to assess assumptions. We're not going to go into a lot of detail for now, but there is one thing that we can do: look for influential points using the Cook’s distance plot.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nresid_panel(glm_bks, plots = \"cookd\")\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-binary_files/figure-html/unnamed-chunk-32-1.png){width=672}\n:::\n:::\n\n\n## Python\n\nAs always, there are different ways of doing this. Here we extract the Cook's d values from the `glm` object and put them in a Pandas DataFrame. We can then use that to plot them in a lollipop or stem plot.\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# extract the Cook's distances\nglm_bks_py_resid = pd.DataFrame(glm_bks_py.\n get_influence().\n summary_frame()[\"cooks_d\"])\n\n# add row index \nglm_bks_py_resid['obs'] = glm_bks_py_resid.reset_index().index\n```\n:::\n\n\nWe now have two columns:\n\n\n::: {.cell}\n\n```{.python .cell-code}\nglm_bks_py_resid\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n cooks_d obs\n0 1.854360e-07 0\n1 3.388262e-07 1\n2 3.217960e-05 2\n3 1.194847e-05 3\n4 6.643975e-06 4\n.. ... ...\n56 1.225519e-05 56\n57 2.484468e-05 57\n58 6.781364e-06 58\n59 1.850240e-07 59\n60 3.532360e-05 60\n\n[61 rows x 2 columns]\n```\n\n\n:::\n:::\n\n\nWe can use these to create the plot:\n\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(glm_bks_py_resid,\n aes(x = \"obs\",\n y = \"cooks_d\")) +\n geom_segment(aes(x = \"obs\", y = \"cooks_d\", xend = \"obs\", yend = 0)) +\n geom_point())\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-binary_files/figure-html/unnamed-chunk-35-1.png){width=614}\n:::\n:::\n\n\n:::\n\nThis shows that there are no very obvious influential points. You could regard point `34` as potentially influential (it's got a Cook's distance of around `0.8`), but I'm not overly worried.\n\nIf we were worried, we'd remove the troublesome data point, re-run the analysis and see if that changes the statistical outcome. If so, then our entire (statistical) conclusion hinges on one data point, which is not a very robust bit of research. If it *doesn't* change our significance, then all is well, even though that data point is influential.\n\n## Assessing significance\n\nWe can ask several questions.\n\n**Is the model well-specified?**\n\nRoughly speaking, this asks \"can our model predict our data reasonably well?\"\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nUnfortunately, there isn’t a single command that does this for us, and we have to lift some of the numbers from the summary output ourselves. \n\n\n::: {.cell}\n\n```{.r .cell-code}\npchisq(9.1879, 59, lower.tail = FALSE)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 1\n```\n\n\n:::\n:::\n\n\n\nHere, we’ve used the `pchisq` function (which calculates the correct probability for us – ask if you want a hand-wavy explanation). The first argument to it is the residual deviance value from the summary table, and the second argument is the residual degrees of freedom from the same table.\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nfrom scipy.stats import chi2\n\nchi2.sf(9.1879, 59)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n0.9999999999999916\n```\n\n\n:::\n:::\n\n\n:::\n\nThis gives us a probability of `1`. We can interpret this as the probability that the model is actually good. There aren’t any strict conventions on how to interpret this value but, for me, a tiny value would indicate a rubbish model.
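\n\nRather than copying these numbers over by hand, we can also pull them straight from the fitted model object. Here is a minimal sketch of that idea (our own addition), reusing the `glm_bks_py` object from above; `statsmodels` stores the residual deviance and the residual degrees of freedom as attributes of the fitted model:\n\n::: {.cell}\n\n```{.python .cell-code}\nfrom scipy.stats import chi2\n\n# the same test as above, but using the stored model attributes\n# instead of numbers copied from the summary table\nchi2.sf(glm_bks_py.deviance, glm_bks_py.df_resid)\n```\n:::\n\nThis returns the same probability as before, and is less error-prone than retyping the values.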
\n\n**Is the overall model better than the null model?**\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\npchisq(84.5476 - 9.1879, 60 - 59, lower.tail = FALSE)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 3.923163e-18\n```\n\n\n:::\n:::\n\n\nHere we’ve used the `pchisq` function again (if you didn’t ask before, you probably aren’t going to ask now).\n\n## Python\n\nFirst we need to define the null model:\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# create a linear model\nmodel = smf.glm(formula = \"pointed_beak ~ 1\",\n family = sm.families.Binomial(),\n data = early_finches_py)\n# and get the fitted parameters of the model\nglm_bks_null_py = model.fit()\n\nprint(glm_bks_null_py.summary())\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n Generalized Linear Model Regression Results \n==============================================================================\nDep. Variable: pointed_beak No. Observations: 61\nModel: GLM Df Residuals: 60\nModel Family: Binomial Df Model: 0\nLink Function: Logit Scale: 1.0000\nMethod: IRLS Log-Likelihood: -42.274\nDate: Fri, 14 Jun 2024 Deviance: 84.548\nTime: 08:41:56 Pearson chi2: 61.0\nNo. Iterations: 3 Pseudo R-squ. (CS): 0.000\nCovariance Type: nonrobust \n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept 0.0328 0.256 0.128 0.898 -0.469 0.535\n==============================================================================\n```\n\n\n:::\n:::\n\n\nIn order to compare our original fitted model to the null model we need to know the deviances of both models and the residual degrees of freedom of both models, which we could get from the summary method.\n\n\n::: {.cell}\n\n```{.python .cell-code}\nchi2.sf(84.5476 - 9.1879, 60 - 59)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n3.9231627082752525e-18\n```\n\n\n:::\n:::\n\n\n:::\n\nThe first argument is the difference between the null and residual deviances and the second argument is the difference in degrees of freedom between the null and residual models. All of these values can be lifted from the summary table.\n\nThis gives us a probability of pretty much zero. This value is doing a formal test to see whether our fitted model is significantly different from the null model. Here we can treat this as a classical hypothesis test, and since this p-value is less than 0.05 we can say that our fitted model (with `blength` as a predictor variable) is significantly better than the null model (which has no predictor variables). Woohoo!\n\n**Are any of the individual predictors significant?**\n\nFinally, we’ll use the `anova()` function from before to determine which predictor variables are important, and specifically in this case whether the beak length predictor is significant.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nanova(glm_bks, test = \"Chisq\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nAnalysis of Deviance Table\n\nModel: binomial, link: logit\n\nResponse: pointed_beak\n\nTerms added sequentially (first to last)\n\n Df Deviance Resid. Df Resid. Dev Pr(>Chi) \nNULL 60 84.548 \nblength 1 75.36 59 9.188 < 2.2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n```\n\n\n:::\n:::\n\n\nThe `anova()` function is a true workhorse within R! This time we’ve used it to create an Analysis of Deviance table. This is exactly equivalent to an ordinary ANOVA table, where we have rows corresponding to each predictor variable and a p-value telling us whether that variable is significant or not.\n\nThe p-value for the `blength` predictor is written under the `Pr(>Chi)` column and we can see that it is `< 2.2e-16`. So, beak length is a significant predictor.\n\nThis shouldn’t be surprising since we already saw that our overall model was better than the null model, which in this case is exactly the same as asking whether the beak length term is significant. However, in more complicated models with multiple predictors these two comparisons (and p-values) won’t be the same.\n\n## Python\n\nAlas, for some inexplicable reason this is not (yet?) directly possible in Python. At least, not as far as I know... For a single predictor, however, we can reconstruct the analysis of deviance test by hand, as shown below.
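\n\nA minimal sketch of that idea (our own addition, reusing the `glm_bks_py` and `glm_bks_null_py` objects fitted above): the test for dropping `blength` is just the difference in deviance between the two nested models, compared against a chi-squared distribution:\n\n::: {.cell}\n\n```{.python .cell-code}\nfrom scipy.stats import chi2\n\n# analysis of deviance by hand: deviance drop between nested models,\n# tested against a chi-squared distribution on the df difference\nlr_stat = glm_bks_null_py.deviance - glm_bks_py.deviance\ndf_diff = glm_bks_null_py.df_resid - glm_bks_py.df_resid\nchi2.sf(lr_stat, df_diff)\n```\n:::\n\nFor this single-predictor model the result matches the null model comparison we did above, just as it does in R.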
**Is the overall model better than the null model?**\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\npchisq(84.5476 - 9.1879, 60 - 59, lower.tail = FALSE)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 3.923163e-18\n```\n\n\n:::\n:::\n\n\nHere we’ve used the `pchisq` function again (if you didn’t ask before, you probably aren’t going to ask now).\n\n## Python\n\nFirst we need to define the null model:\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# create a generalised linear model\nmodel = smf.glm(formula = \"pointed_beak ~ 1\",\n family = sm.families.Binomial(),\n data = early_finches_py)\n# and get the fitted parameters of the model\nglm_bks_null_py = model.fit()\n\nprint(glm_bks_null_py.summary())\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n Generalized Linear Model Regression Results \n==============================================================================\nDep. Variable: pointed_beak No. Observations: 61\nModel: GLM Df Residuals: 60\nModel Family: Binomial Df Model: 0\nLink Function: Logit Scale: 1.0000\nMethod: IRLS Log-Likelihood: -42.274\nDate: Fri, 14 Jun 2024 Deviance: 84.548\nTime: 08:41:56 Pearson chi2: 61.0\nNo. Iterations: 3 Pseudo R-squ. (CS): 0.000\nCovariance Type: nonrobust \n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept 0.0328 0.256 0.128 0.898 -0.469 0.535\n==============================================================================\n```\n\n\n:::\n:::\n\n\nIn order to compare our original fitted model to the null model, we need the deviance and the residual degrees of freedom of both models, which we can get from the summary output.\n\n\n::: {.cell}\n\n```{.python .cell-code}\nchi2.sf(84.5476 - 9.1879, 60 - 59)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n3.9231627082752525e-18\n```\n\n\n:::\n:::\n\n\n:::\n\nThe first argument is the difference between the null and residual deviances and the second argument is the difference in degrees of freedom between the null and residual models. All of these values can be lifted from the summary table.\n\nThis gives us a probability of pretty much zero. This value is doing a formal test to see whether our fitted model is significantly different from the null model. Here we can treat this as a classical hypothesis test and, since this p-value is less than 0.05, we can say that our fitted model (with `blength` as a predictor variable) is significantly better than the null model (which has no predictor variables). Woohoo!
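As a sketch, the same comparison can also be written without copying any numbers by hand, using the two fitted model objects (`glm_bks_py` and `glm_bks_null_py`) from above:\n\n::: {.cell}\n\n```{.python .cell-code}\n# difference in deviance between the null and fitted models, tested\n# against the difference in their residual degrees of freedom\nfrom scipy.stats import chi2\n\nchi2.sf(glm_bks_null_py.deviance - glm_bks_py.deviance,\n glm_bks_null_py.df_resid - glm_bks_py.df_resid)\n```\n:::\n\n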
**Are any of the individual predictors significant?**\n\nFinally, we’ll use the `anova()` function from before to determine which predictor variables are important, and specifically in this case whether the beak length predictor is significant.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nanova(glm_bks, test = \"Chisq\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nAnalysis of Deviance Table\n\nModel: binomial, link: logit\n\nResponse: pointed_beak\n\nTerms added sequentially (first to last)\n\n Df Deviance Resid. Df Resid. Dev Pr(>Chi) \nNULL 60 84.548 \nblength 1 75.36 59 9.188 < 2.2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n```\n\n\n:::\n:::\n\n\nThe `anova()` function is a true workhorse within R! This time we’ve used it to create an Analysis of Deviance table. This is exactly equivalent to an ordinary ANOVA table where we have rows corresponding to each predictor variable and a p-value telling us whether that variable is significant or not.\n\nThe p-value for the `blength` predictor is written under the `Pr(>Chi)` column and we can see that it is less than `2.2e-16`. So, beak length is a significant predictor.\n\nThis shouldn’t be surprising since we already saw that our overall model was better than the null model, which in this case is exactly the same as asking whether the beak length term is significant. However, in more complicated models with multiple predictors these two comparisons (and p-values) won’t be the same.\n\n## Python\n\nAlas, for some inexplicable reason this is not (yet?) possible to do in Python. At least, not that I’m aware of...\n\n:::\n\n\n## Exercises\n\n### Diabetes {#sec-exr_diabetes}\n\n:::{.callout-exercise}\n\n\n{{< level 2 >}}\n\n\n\nFor this exercise we'll be using the data from `data/diabetes.csv`.\n\nThis is a data set comprising 768 observations of three variables (one dependent and two predictor variables). It records the result of a diabetes test as a binary variable (1 is a positive result, 0 is a negative result), along with the result of a glucose tolerance test and the diastolic blood pressure for each of 768 women. The variables are called `test_result`, `glucose` and `diastolic`.\n\nWe want to see if the `glucose` tolerance test result is a meaningful predictor of a positive diabetes test. To investigate this, do the following:\n\n1. Load and visualise the data\n2. Create a suitable model\n3. Determine if there are any statistically significant predictors\n4. Calculate the probability of a positive diabetes test result for a glucose tolerance test value of `glucose = 150`\n\n::: {.callout-answer collapse=\"true\"}\n\n#### Load and visualise the data\n\nFirst we load the data, then we visualise it.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndiabetes <- read_csv(\"data/diabetes.csv\")\n```\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\ndiabetes_py = pd.read_csv(\"data/diabetes.csv\")\n```\n:::\n\n\n:::\n\nLooking at the data, we can see that the `test_result` column contains zeros and ones. These are yes/no test result outcomes and not actually numeric representations.\n\nWe'll have to deal with this soon. For now, we can plot the data, by outcome:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(diabetes,\n aes(x = factor(test_result),\n y = glucose)) +\n geom_boxplot()\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-binary_files/figure-html/unnamed-chunk-44-1.png){width=672}\n:::\n:::\n\n\n## Python\n\nWe could just give Python the `test_result` data directly, but then it would view the values as numeric. That doesn't really work, because we have two distinct groups: those with a negative diabetes test result, and those with a positive one.\n\nWe can force Python to temporarily convert the data to a factor, by making the `test_result` column an `object` type. We can do this directly inside the `ggplot()` function.\n\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(diabetes_py,\n aes(x = diabetes_py.test_result.astype(object),\n y = \"glucose\")) +\n geom_boxplot())\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-binary_files/figure-html/unnamed-chunk-45-1.png){width=614}\n:::\n:::\n\n:::\n\nIt looks as though the patients with a positive diabetes test have slightly higher glucose levels than those with a negative diabetes test.\n\nWe can visualise that differently by plotting all the data points as a classic binary response plot:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(diabetes,\n aes(x = glucose,\n y = test_result)) +\n geom_point()\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-binary_files/figure-html/unnamed-chunk-46-3.png){width=672}\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(diabetes_py,\n aes(x = \"glucose\",\n y = \"test_result\")) +\n geom_point())\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-binary_files/figure-html/unnamed-chunk-47-1.png){width=614}\n:::\n:::\n\n\n:::\n\n#### Create a suitable model\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nWe'll use the `glm()` function to create a generalised linear model. Here we save the model in an object called `glm_dia`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nglm_dia <- glm(test_result ~ glucose,\n family = binomial,\n data = diabetes)\n```\n:::\n\n\nThe format of this function is similar to that used by the `lm()` function for linear models. The important difference is that we must specify the _family_ of error distribution to use. For logistic regression we must set the family to **binomial**.\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# create a generalised linear model\nmodel = smf.glm(formula = \"test_result ~ glucose\",\n family = sm.families.Binomial(),\n data = diabetes_py)\n# and get the fitted parameters of the model\nglm_dia_py = model.fit()\n```\n:::\n\n\n:::\n\nLet's look at the model parameters:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(glm_dia)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nglm(formula = test_result ~ glucose, family = binomial, data = diabetes)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) -5.611732 0.442289 -12.69 <2e-16 ***\nglucose 0.039510 0.003398 11.63 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for binomial family taken to be 1)\n\n Null deviance: 936.6 on 727 degrees of freedom\nResidual deviance: 752.2 on 726 degrees of freedom\nAIC: 756.2\n\nNumber of Fisher Scoring iterations: 4\n```\n\n\n:::\n:::\n\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nprint(glm_dia_py.summary())\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n Generalized Linear Model Regression Results \n==============================================================================\nDep. Variable: test_result No. Observations: 728\nModel: GLM Df Residuals: 726\nModel Family: Binomial Df Model: 1\nLink Function: Logit Scale: 1.0000\nMethod: IRLS Log-Likelihood: -376.10\nDate: Fri, 14 Jun 2024 Deviance: 752.20\nTime: 08:41:57 Pearson chi2: 713.\nNo. Iterations: 4 Pseudo R-squ. (CS): 0.2238\nCovariance Type: nonrobust \n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept -5.6117 0.442 -12.688 0.000 -6.479 -4.745\nglucose 0.0395 0.003 11.628 0.000 0.033 0.046\n==============================================================================\n```\n\n\n:::\n:::\n\n:::\n\nWe can see that `glucose` is a significant predictor for the `test_result` (the $p$ value is much smaller than 0.05).\n\nKnowing this, we're interested in the coefficients. We have an intercept of `-5.61` and `0.0395` for `glucose`. We can use these coefficients to write a formula that describes how the probability of having a positive test result depends on the glucose tolerance value:\n\n$$ P(positive \\ test\\ result) = \\frac{\\exp(-5.61 + 0.04 \\times glucose)}{1 + \\exp(-5.61 + 0.04 \\times glucose)} $$\n\n#### Calculating probabilities\n\nUsing the formula above, we can now calculate the probability of having a positive test result, for a given `glucose` value. If we do this for `glucose = 150`, we get the following:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nexp(-5.61 + 0.04 * 150) / (1 + exp(-5.61 + 0.04 * 150))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0.5962827\n```\n\n\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nmath.exp(-5.61 + 0.04 * 150) / (1 + math.exp(-5.61 + 0.04 * 150))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n0.5962826992967878\n```\n\n\n:::\n:::\n\n:::\n\nThis tells us that the probability of having a positive diabetes test result, given a glucose tolerance level of 150, is around 60%.
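We could also skip the manual arithmetic (and the rounding of the coefficients) by asking the fitted model for a prediction directly. A minimal sketch, assuming the `glm_dia_py` object and the libraries loaded earlier; because this uses the unrounded coefficients, the result will differ slightly from the hand calculation:\n\n::: {.cell}\n\n```{.python .cell-code}\n# predict the probability of a positive test result\n# for a glucose tolerance value of 150\nglm_dia_py.predict(pd.DataFrame({'glucose': [150]}))\n```\n:::\n\n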
:::\n:::\n\n## Summary\n\n::: {.callout-tip}\n#### Key points\n\n- We use a logistic regression to model a binary response\n- We can feed new observations into the model and get probabilities for the outcome\n:::\n", "supporting": [ "glm-practical-logistic-binary_files" ], diff --git a/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-10-3.png b/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-10-3.png index b897fc1..be8caa0 100644 Binary files a/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-10-3.png and b/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-10-3.png differ diff --git a/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-11-1.png b/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-11-1.png index c9d36d1..29b50de 100644 Binary files a/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-11-1.png and b/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-11-1.png differ diff --git a/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-12-3.png b/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-12-3.png index acf091e..b897fc1 100644 Binary files a/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-12-3.png and b/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-12-3.png differ diff --git a/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-13-1.png b/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-13-1.png new file mode 100644 index 0000000..ddce7d0 Binary files 
/dev/null and b/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-13-1.png differ diff --git a/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-14-3.png b/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-14-3.png new file mode 100644 index 0000000..f15f76b Binary files /dev/null and b/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-14-3.png differ diff --git a/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-17-1.png b/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-17-1.png index eb0a49a..0c56b86 100644 Binary files a/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-17-1.png and b/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-17-1.png differ diff --git a/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-19-1.png b/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-19-1.png new file mode 100644 index 0000000..eb0a49a Binary files /dev/null and b/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-19-1.png differ diff --git a/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-29-1.png b/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-29-1.png new file mode 100644 index 0000000..56a33c5 Binary files /dev/null and b/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-29-1.png differ diff --git a/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-30-1.png b/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-30-1.png index fe8b9b7..648423a 100644 Binary files a/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-30-1.png and b/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-30-1.png differ diff --git a/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-32-1.png b/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-32-1.png new file mode 100644 index 0000000..5c77a39 Binary files /dev/null and b/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-32-1.png differ diff --git a/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-35-1.png b/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-35-1.png new file mode 100644 index 0000000..b8dd693 Binary files /dev/null and b/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-35-1.png differ diff --git a/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-44-1.png b/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-44-1.png index a849e0c..68a5fa8 100644 Binary files a/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-44-1.png and b/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-44-1.png differ diff --git a/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-45-1.png b/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-45-1.png new file mode 100644 index 0000000..129fd3d Binary files /dev/null and b/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-45-1.png differ diff --git a/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-46-3.png 
b/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-46-3.png new file mode 100644 index 0000000..0d59157 Binary files /dev/null and b/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-46-3.png differ diff --git a/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-47-1.png b/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-47-1.png new file mode 100644 index 0000000..7041659 Binary files /dev/null and b/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-47-1.png differ diff --git a/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-8-1.png b/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-8-1.png new file mode 100644 index 0000000..ef52e71 Binary files /dev/null and b/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-8-1.png differ diff --git a/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-9-1.png b/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-9-1.png index f1a858b..d21f78c 100644 Binary files a/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-9-1.png and b/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-9-1.png differ diff --git a/_freeze/materials/glm-practical-logistic-proportion/execute-results/html.json b/_freeze/materials/glm-practical-logistic-proportion/execute-results/html.json index f285248..faba804 100644 --- a/_freeze/materials/glm-practical-logistic-proportion/execute-results/html.json +++ b/_freeze/materials/glm-practical-logistic-proportion/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "173861ccb267f6eb833aef4954211086", + "hash": "2eeb0509103d7e969b67af04d77cf4f5", "result": { "engine": "knitr", - "markdown": "---\ntitle: \"Proportional response\"\n---\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n\n\n::: {.callout-tip}\n## Learning outcomes\n\n- How do I analyse proportion responses?\n- Be able to create a logistic model to test proportion response variables\n- Be able to plot the data and fitted curve\n- Assess the significance of the fit\n:::\n\n## Libraries and functions\n\n::: {.callout-note collapse=\"true\"}\n## Click to expand\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n### Libraries\n### Functions\n\n## Python\n\n### Libraries\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# A maths library\nimport math\n# A Python data analysis and manipulation tool\nimport pandas as pd\n\n# Python equivalent of `ggplot2`\nfrom plotnine import *\n\n# Statistical models, conducting tests and statistical data exploration\nimport statsmodels.api as sm\n\n# Convenience interface for specifying models using formula strings and DataFrames\nimport statsmodels.formula.api as smf\n\n# Needed for additional probability functionality\nfrom scipy.stats import *\n```\n:::\n\n\n### Functions\n:::\n:::\n\nThe example in this section uses the following data set:\n\n`data/challenger.csv`\n\nThese data, obtained from the [faraway package](https://www.rdocumentation.org/packages/faraway/versions/1.0.7) in R, contain information related to the explosion of the USA Space Shuttle Challenger on 28 January, 1986. An investigation after the disaster traced back to certain joints on one of the two solid booster rockets, each containing O-rings that ensured no exhaust gases could escape from the booster.\n\nThe night before the launch was unusually cold, with temperatures below freezing. 
The final report suggested that the cold snap during the night made the o-rings stiff, and unable to adjust to changes in pressure. As a result, exhaust gases leaked away from the solid booster rockets, causing one of them to break loose and rupture the main fuel tank, leading to the final explosion.\n\nThe question we're trying to answer in this session is: based on the data from the previous flights, would it have been possible to predict the failure of most o-rings on the Challenger flight?\n\n## Load and visualise the data\n\nFirst we load the data, then we visualise it.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchallenger <- read_csv(\"data/challenger.csv\")\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nRows: 23 Columns: 2\n── Column specification ────────────────────────────────────────────────────────\nDelimiter: \",\"\ndbl (2): temp, damage\n\nℹ Use `spec()` to retrieve the full column specification for this data.\nℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\n```\n\n\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nchallenger_py = pd.read_csv(\"data/challenger.csv\")\n```\n:::\n\n\n:::\n\nThe data set contains several columns:\n\n1. `temp`, the launch temperature in degrees Fahrenheit\n2. `damage`, the number of o-rings that showed erosion\n\nBefore we have a further look at the data, let's calculate the proportion of damaged o-rings (`prop_damaged`) and the total number of o-rings (`total`) and update our data set.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchallenger <-\nchallenger %>%\n mutate(total = 6, # total number of o-rings\n intact = 6 - damage, # number of undamaged o-rings\n prop_damaged = damage / total) # proportion damaged o-rings\n\nchallenger\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 23 × 5\n temp damage total intact prop_damaged\n \n 1 53 5 6 1 0.833\n 2 57 1 6 5 0.167\n 3 58 1 6 5 0.167\n 4 63 1 6 5 0.167\n 5 66 0 6 6 0 \n 6 67 0 6 6 0 \n 7 67 0 6 6 0 \n 8 67 0 6 6 0 \n 9 68 0 6 6 0 \n10 69 0 6 6 0 \n# ℹ 13 more rows\n```\n\n\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nchallenger_py['total'] = 6\nchallenger_py['intact'] = challenger_py['total'] - challenger_py['damage']\nchallenger_py['prop_damaged'] = challenger_py['damage'] / challenger_py['total']\n```\n:::\n\n\n:::\n\nPlotting the proportion of damaged o-rings against the launch temperature shows the following picture:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(challenger, aes(x = temp, y = prop_damaged)) +\n geom_point()\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-proportion_files/figure-html/unnamed-chunk-8-1.png){width=672}\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(challenger_py,\n aes(x = \"temp\",\n y = \"prop_damaged\")) +\n geom_point())\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-proportion_files/figure-html/unnamed-chunk-9-1.png){width=614}\n:::\n:::\n\n\n:::\n\nThe point on the left is the data point corresponding to the coldest flight experienced before the disaster, where five damaged o-rings were found. 
Fortunately, this did not result in a disaster.\n\nHere we'll explore if we could have reasonably predicted the failure of both o-rings on the Challenger flight, where the launch temperature was 31 degrees Fahrenheit.\n\n## Creating a suitable model\n\nWe only have 23 data points in total. So we're building a model on not that much data - we should keep this in mind when we draw our conclusions!\n\nWe are using a logistic regression for a proportion response in this case, since we're interested in the proportion of o-rings that are damaged.\n\nWe can define this as follows:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nglm_chl <- glm(cbind(damage, intact) ~ temp,\n family = binomial,\n data = challenger)\n```\n:::\n\n\nDefining the relationship for proportion responses is a bit annoying, where you have to give the `glm` model a two-column matrix to specify the response variable.\n\nHere, the first column corresponds to the number of damaged o-rings, whereas the second column refers to the number of intact o-rings. We use the `cbind()` function to bind these two together into a matrix.\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# create a generalised linear model\nmodel = smf.glm(formula = \"damage + intact ~ temp\",\n family = sm.families.Binomial(),\n data = challenger_py)\n# and get the fitted parameters of the model\nglm_chl_py = model.fit()\n```\n:::\n\n\n:::\n\n## Model output\n\nThat's the easy part done! The trickier part is interpreting the output. First of all, we'll get some summary information.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nNext, we can have a closer look at the results:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(glm_chl)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nglm(formula = cbind(damage, intact) ~ temp, family = binomial, \n data = challenger)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) 11.66299 3.29626 3.538 0.000403 ***\ntemp -0.21623 0.05318 -4.066 4.78e-05 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for binomial family taken to be 1)\n\n Null deviance: 38.898 on 22 degrees of freedom\nResidual deviance: 16.912 on 21 degrees of freedom\nAIC: 33.675\n\nNumber of Fisher Scoring iterations: 6\n```\n\n\n:::\n:::\n\n\nWe can see that the p-values of the `intercept` and `temp` are significant. We can also use the intercept and `temp` coefficients to construct the logistic equation, which we can use to sketch the logistic curve.\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nprint(glm_chl_py.summary())\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n Generalized Linear Model Regression Results \n================================================================================\nDep. Variable: ['damage', 'intact'] No. Observations: 23\nModel: GLM Df Residuals: 21\nModel Family: Binomial Df Model: 1\nLink Function: Logit Scale: 1.0000\nMethod: IRLS Log-Likelihood: -14.837\nDate: Tue, 06 Feb 2024 Deviance: 16.912\nTime: 16:12:12 Pearson chi2: 28.1\nNo. Iterations: 7 Pseudo R-squ. 
(CS): 0.6155\nCovariance Type: nonrobust \n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept 11.6630 3.296 3.538 0.000 5.202 18.124\ntemp -0.2162 0.053 -4.066 0.000 -0.320 -0.112\n==============================================================================\n```\n\n\n:::\n:::\n\n:::\n\n$$E(prop \\ failed\\ orings) = \\frac{\\exp{(11.66 - 0.22 \\times temp)}}{1 + \\exp{(11.66 - 0.22 \\times temp)}}$$\n\nLet's see how well our model would have performed if we would have fed it the data from the ill-fated Challenger launch.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(challenger, aes(temp, prop_damaged)) +\n geom_point() +\n geom_smooth(method = \"glm\", se = FALSE, fullrange = TRUE, \n method.args = list(family = binomial)) +\n xlim(25,85)\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in eval(family$initialize): non-integer #successes in a binomial glm!\n```\n\n\n:::\n\n::: {.cell-output-display}\n![](glm-practical-logistic-proportion_files/figure-html/unnamed-chunk-14-1.png){width=672}\n:::\n:::\n\n\n## Python\n\nWe can get the predicted values for the model as follows:\n\n\n::: {.cell}\n\n```{.python .cell-code}\nchallenger_py['predicted_values'] = glm_chl_py.predict()\n\nchallenger_py.head()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n temp damage total intact prop_damaged predicted_values\n0 53 5 6 1 0.833333 0.550479\n1 57 1 6 5 0.166667 0.340217\n2 58 1 6 5 0.166667 0.293476\n3 63 1 6 5 0.166667 0.123496\n4 66 0 6 6 0.000000 0.068598\n```\n\n\n:::\n:::\n\n\nThis would only give us the predicted values for the data we already have. Instead we want to extrapolate to what would have been predicted for a wider range of temperatures. Here, we use a range of $[25, 85]$ degrees Fahrenheit.\n\n\n::: {.cell}\n\n```{.python .cell-code}\nmodel = pd.DataFrame({'temp': list(range(25, 86))})\n\nmodel[\"pred\"] = glm_chl_py.predict(model)\n\nmodel.head()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n temp pred\n0 25 0.998087\n1 26 0.997626\n2 27 0.997055\n3 28 0.996347\n4 29 0.995469\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(challenger_py,\n aes(x = \"temp\",\n y = \"prop_damaged\")) +\n geom_point() +\n geom_line(model, aes(x = \"temp\", y = \"pred\"), colour = \"blue\", size = 1))\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-proportion_files/figure-html/unnamed-chunk-17-1.png){width=614}\n:::\n:::\n\n\n\n::: {.callout-note collapse=true}\n## Generating predicted values\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nAnother way of doing this it to generate a table with data for a range of temperatures, from 25 to 85 degrees Fahrenheit, in steps of 1. 
We can then use these data to generate the logistic curve, based on the fitted model.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# create a table with sequential numbers ranging from 25 to 85\nmodel <- tibble(temp = seq(25, 85, by = 1)) %>% \n # add a new column containing the predicted values\n mutate(.pred = predict(glm_chl, newdata = ., type = \"response\"))\n\nggplot(model, aes(temp, .pred)) +\n geom_line()\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-proportion_files/figure-html/unnamed-chunk-18-3.png){width=672}\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\n# plot the curve and the original data\nggplot(model, aes(temp, .pred)) +\n geom_line(colour = \"blue\") +\n geom_point(data = challenger, aes(temp, prop_damaged)) +\n # add a vertical line at the disaster launch temperature\n geom_vline(xintercept = 31, linetype = \"dashed\")\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-proportion_files/figure-html/unnamed-chunk-19-1.png){width=672}\n:::\n:::\n\n\nIt seems that there was a high probability of both o-rings failing at that launch temperature. One thing that the graph shows is that there is a lot of uncertainty involved in this model. We can tell, because the fit of the line is very poor at the lower temperature range. There is just very little data to work on, with the data point at 53 F having a large influence on the fit.\n\n## Python\n\nWe already did this above, since this is the most straightforward way of plotting the model in Python.\n\n:::\n:::\n:::\n\n## Exercises\n\n### Predicting failure {#sec-exr_failure}\n\n:::{.callout-exercise}\n\n\n{{< level 2 >}}\n\n\n\nThe data point at 53 degrees Fahrenheit is quite influential for the analysis. Remove this data point and repeat the analysis. Is there still a predicted link between launch temperature and o-ring failure?\n\n::: {.callout-answer collapse=\"true\"}\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nFirst, we need to remove the influential data point:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchallenger_new <- challenger %>% filter(temp != 53)\n```\n:::\n\n\nWe can create a new generalised linear model, based on these data:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nglm_chl_new <- glm(cbind(damage, intact) ~ temp,\n family = binomial,\n data = challenger_new)\n```\n:::\n\n\nWe can get the model parameters as follows:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(glm_chl_new)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nglm(formula = cbind(damage, intact) ~ temp, family = binomial, \n data = challenger_new)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) 5.68223 4.43138 1.282 0.1997 \ntemp -0.12817 0.06697 -1.914 0.0556 .\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n\n(Dispersion parameter for binomial family taken to be 1)\n\n Null deviance: 16.375 on 21 degrees of freedom\nResidual deviance: 12.633 on 20 degrees of freedom\nAIC: 27.572\n\nNumber of Fisher Scoring iterations: 5\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(challenger_new, aes(temp, prop_damaged)) +\n geom_point() +\n geom_smooth(method = \"glm\", se = FALSE, fullrange = TRUE, \n method.args = list(family = binomial)) +\n xlim(25,85) +\n # add a vertical line at 53 F temperature\n geom_vline(xintercept = 53, linetype = \"dashed\")\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in eval(family$initialize): non-integer #successes in a binomial glm!\n```\n\n\n:::\n\n::: {.cell-output-display}\n![](glm-practical-logistic-proportion_files/figure-html/unnamed-chunk-23-1.png){width=672}\n:::\n:::\n\n\nThe prediction proportion of damaged o-rings is markedly less than what was observed.\n\nBefore we can make any firm conclusions, though, we need to check our model:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n1- pchisq(12.633,20)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0.8925695\n```\n\n\n:::\n:::\n\n\nWe get quite a high score (around 0.9) for this, which tells us that our goodness of fit is pretty rubbish – our points are not very close to our curve, overall.\n\nIs the model any better than the null though?\n\n\n::: {.cell}\n\n```{.r .cell-code}\n1 - pchisq(16.375 - 12.633, 21 - 20)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0.0530609\n```\n\n\n:::\n\n```{.r .cell-code}\nanova(glm_chl_new, test = 'Chisq')\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nAnalysis of Deviance Table\n\nModel: binomial, link: logit\n\nResponse: cbind(damage, intact)\n\nTerms added sequentially (first to last)\n\n Df Deviance Resid. Df Resid. Dev Pr(>Chi) \nNULL 21 16.375 \ntemp 1 3.7421 20 12.633 0.05306 .\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n```\n\n\n:::\n:::\n\n\nHowever, the model is not significantly better than the null in this case, with a p-value here of just over 0.05 for both of these tests (they give a similar result since, yet again, we have just the one predictor variable).\n\n## Python\n\nFirst, we need to remove the influential data point:\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nchallenger_new_py = challenger_py.query(\"temp != 53\")\n```\n:::\n\n\nWe can create a new generalised linear model, based on these data:\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# create a generalised linear model\nmodel = smf.glm(formula = \"damage + intact ~ temp\",\n family = sm.families.Binomial(),\n data = challenger_new_py)\n# and get the fitted parameters of the model\nglm_chl_new_py = model.fit()\n```\n:::\n\n\nWe can get the model parameters as follows:\n\n\n::: {.cell}\n\n```{.python .cell-code}\nprint(glm_chl_new_py.summary())\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n Generalized Linear Model Regression Results \n================================================================================\nDep. Variable: ['damage', 'intact'] No. Observations: 22\nModel: GLM Df Residuals: 20\nModel Family: Binomial Df Model: 1\nLink Function: Logit Scale: 1.0000\nMethod: IRLS Log-Likelihood: -11.786\nDate: Tue, 06 Feb 2024 Deviance: 12.633\nTime: 16:12:15 Pearson chi2: 16.6\nNo. Iterations: 6 Pseudo R-squ. 
(CS): 0.1564\nCovariance Type: nonrobust \n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept 5.6822 4.431 1.282 0.200 -3.003 14.368\ntemp -0.1282 0.067 -1.914 0.056 -0.259 0.003\n==============================================================================\n```\n\n\n:::\n:::\n\n\nGenerate new model data:\n\n\n::: {.cell}\n\n```{.python .cell-code}\nmodel = pd.DataFrame({'temp': list(range(25, 86))})\n\nmodel[\"pred\"] = glm_chl_new_py.predict(model)\n\nmodel.head()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n temp pred\n0 25 0.922585\n1 26 0.912920\n2 27 0.902177\n3 28 0.890269\n4 29 0.877107\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(challenger_new_py,\n aes(x = \"temp\",\n y = \"prop_damaged\")) +\n geom_point() +\n geom_line(model, aes(x = \"temp\", y = \"pred\"), colour = \"blue\", size = 1) +\n # add a vertical line at 53 F temperature\n geom_vline(xintercept = 53, linetype = \"dashed\"))\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-proportion_files/figure-html/unnamed-chunk-30-1.png){width=614}\n:::\n:::\n\n\nThe prediction proportion of damaged o-rings is markedly less than what was observed.\n\nBefore we can make any firm conclusions, though, we need to check our model:\n\n\n::: {.cell}\n\n```{.python .cell-code}\nchi2.sf(12.633, 20)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n0.8925694610786307\n```\n\n\n:::\n:::\n\n\nWe get quite a high score (around 0.9) for this, which tells us that our goodness of fit is pretty rubbish – our points are not very close to our curve, overall.\n\nIs the model any better than the null though?\n\nFirst we need to define the null model:\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# create a linear model\nmodel = smf.glm(formula = \"damage + intact ~ 1\",\n family = sm.families.Binomial(),\n data = challenger_new_py)\n# and get the fitted parameters of the model\nglm_chl_new_null_py = model.fit()\n\nprint(glm_chl_new_null_py.summary())\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n Generalized Linear Model Regression Results \n================================================================================\nDep. Variable: ['damage', 'intact'] No. Observations: 22\nModel: GLM Df Residuals: 21\nModel Family: Binomial Df Model: 0\nLink Function: Logit Scale: 1.0000\nMethod: IRLS Log-Likelihood: -13.657\nDate: Tue, 06 Feb 2024 Deviance: 16.375\nTime: 16:12:16 Pearson chi2: 16.8\nNo. Iterations: 6 Pseudo R-squ. (CS): -2.220e-16\nCovariance Type: nonrobust \n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept -3.0445 0.418 -7.286 0.000 -3.864 -2.226\n==============================================================================\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.python .cell-code}\nchi2.sf(16.375 - 12.633, 21 - 20)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n0.053060897703157646\n```\n\n\n:::\n:::\n\n\nHowever, the model is not significantly better than the null in this case, with a p-value here of just over 0.05 for both of these tests (they give a similar result since, yet again, we have just the one predictor variable).\n:::\n\nSo, could NASA have predicted what happened? 
This model is not significantly different from the null, i.e., temperature is not a significant predictor. Note that it’s only marginally non-significant, and we do have a high goodness-of-fit value.\n\nIt is possible that if more data points were available that followed a similar trend, the story might be different. Even if we did use our non-significant model to make a prediction, it doesn’t give us a value anywhere near 5 failures for a temperature of 53 degrees Fahrenheit. So overall, based on the model we’ve fitted with these data, there was no indication that a temperature just a few degrees cooler than previous missions could have been so disastrous for the Challenger.\n:::\n:::\n\n## Summary\n\n::: {.callout-tip}\n#### Key points\n\n- We can use a logistic model for proportion response variables\n\n:::\n", + "markdown": "---\ntitle: \"Proportional response\"\n---\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n\n\n::: {.callout-tip}\n## Learning outcomes\n\n- How do I analyse proportion responses?\n- Be able to create a logistic model to test proportion response variables\n- Be able to plot the data and fitted curve\n- Assess the significance of the fit\n:::\n\n## Libraries and functions\n\n::: {.callout-note collapse=\"true\"}\n## Click to expand\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n### Libraries\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(broom)\nlibrary(tidyverse)\nlibrary(ggResidpanel)\n```\n:::\n\n\n### Functions\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# create diagnostic plots\nggResidpanel::resid_panel()\n```\n:::\n\n\n## Python\n\n### Libraries\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# A maths library\nimport math\n# A Python data analysis and manipulation tool\nimport pandas as pd\n\n# Python equivalent of `ggplot2`\nfrom plotnine import *\n\n# Statistical models, conducting tests and statistical data exploration\nimport statsmodels.api as sm\n\n# Convenience interface for specifying models using formula strings and DataFrames\nimport statsmodels.formula.api as smf\n\n# Needed for additional probability functionality\nfrom scipy.stats import *\n```\n:::\n\n\n### Functions\n:::\n:::\n\nThe example in this section uses the following data set:\n\n`data/challenger.csv`\n\nThese data, obtained from the [faraway package](https://www.rdocumentation.org/packages/faraway/versions/1.0.7) in R, contain information related to the explosion of the USA Space Shuttle Challenger on 28 January, 1986. An investigation after the disaster traced the cause back to certain joints on one of the two solid booster rockets, each containing O-rings that ensured no exhaust gases could escape from the booster.\n\nThe night before the launch was unusually cold, with temperatures below freezing. The final report suggested that the cold snap during the night made the o-rings stiff, and unable to adjust to changes in pressure. 
As a result, exhaust gases leaked away from the solid booster rockets, causing one of them to break loose and rupture the main fuel tank, leading to the final explosion.\n\nThe question we're trying to answer in this session is: based on the data from the previous flights, would it have been possible to predict the failure of most o-rings on the Challenger flight?\n\n## Load and visualise the data\n\nFirst we load the data, then we visualise it.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchallenger <- read_csv(\"data/challenger.csv\")\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nRows: 23 Columns: 2\n── Column specification ────────────────────────────────────────────────────────\nDelimiter: \",\"\ndbl (2): temp, damage\n\nℹ Use `spec()` to retrieve the full column specification for this data.\nℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\n```\n\n\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nchallenger_py = pd.read_csv(\"data/challenger.csv\")\n```\n:::\n\n\n:::\n\nThe data set contains several columns:\n\n1. `temp`, the launch temperature in degrees Fahrenheit\n2. `damage`, the number of o-rings that showed erosion\n\nBefore we have a further look at the data, let's calculate the proportion of damaged o-rings (`prop_damaged`) and the total number of o-rings (`total`) and update our data set.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchallenger <-\nchallenger %>%\n mutate(total = 6, # total number of o-rings\n intact = 6 - damage, # number of undamaged o-rings\n prop_damaged = damage / total) # proportion damaged o-rings\n\nchallenger\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 23 × 5\n temp damage total intact prop_damaged\n <dbl> <dbl> <dbl> <dbl> <dbl>\n 1 53 5 6 1 0.833\n 2 57 1 6 5 0.167\n 3 58 1 6 5 0.167\n 4 63 1 6 5 0.167\n 5 66 0 6 6 0 \n 6 67 0 6 6 0 \n 7 67 0 6 6 0 \n 8 67 0 6 6 0 \n 9 68 0 6 6 0 \n10 69 0 6 6 0 \n# ℹ 13 more rows\n```\n\n\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nchallenger_py['total'] = 6\nchallenger_py['intact'] = challenger_py['total'] - challenger_py['damage']\nchallenger_py['prop_damaged'] = challenger_py['damage'] / challenger_py['total']\n```\n:::\n\n\n:::\n\nPlotting the proportion of damaged o-rings against the launch temperature shows the following picture:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(challenger, aes(x = temp, y = prop_damaged)) +\n geom_point()\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-proportion_files/figure-html/unnamed-chunk-10-1.png){width=672}\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(challenger_py,\n aes(x = \"temp\",\n y = \"prop_damaged\")) +\n geom_point())\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-proportion_files/figure-html/unnamed-chunk-11-1.png){width=614}\n:::\n:::\n\n\n:::\n\nThe point on the left is the data point corresponding to the coldest flight experienced before the disaster, where five damaged o-rings were found. Fortunately, this did not result in a disaster.\n\nHere we'll explore if we could have reasonably predicted o-ring failure on the Challenger flight, where the launch temperature was 31 degrees Fahrenheit.\n\n## Creating a suitable model\n\nWe only have 23 data points in total. 
So we're building a model on not that much data - we should keep this in mind when we draw our conclusions!\n\nWe are using a logistic regression for a proportion response in this case, since we're interested in the proportion of o-rings that are damaged.\n\nWe can define this as follows:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nglm_chl <- glm(cbind(damage, intact) ~ temp,\n family = binomial,\n data = challenger)\n```\n:::\n\n\nDefining the relationship for proportion responses is a bit annoying: you have to give the `glm` model a two-column matrix to specify the response variable.\n\nHere, the first column corresponds to the number of damaged o-rings, whereas the second column refers to the number of intact o-rings. We use the `cbind()` function to bind these two together into a matrix.\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# create a generalised linear model\nmodel = smf.glm(formula = \"damage + intact ~ temp\",\n family = sm.families.Binomial(),\n data = challenger_py)\n# and get the fitted parameters of the model\nglm_chl_py = model.fit()\n```\n:::\n\n\n:::
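As an aside: in Python there is, as far as I'm aware, an equivalent way to specify the same model without the two-column response, by modelling the proportion directly and telling statsmodels how many o-rings each proportion is based on. A sketch, using the `prop_damaged` and `total` columns we created earlier (`var_weights` is the statsmodels argument for this, and the fitted coefficients should match the counts-based specification):\n\n::: {.cell}\n\n```{.python .cell-code}\n# model the proportion of damaged o-rings directly, weighting each\n# observation by the number of o-rings that the proportion is based on\nmodel_prop = smf.glm(formula = \"prop_damaged ~ temp\",\n family = sm.families.Binomial(),\n var_weights = challenger_py[\"total\"],\n data = challenger_py)\n\nglm_chl_prop_py = model_prop.fit()\n```\n:::\n\n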
## Model output\n\nThat's the easy part done! The trickier part is interpreting the output. First of all, we'll get some summary information.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nNext, we can have a closer look at the results:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(glm_chl)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nglm(formula = cbind(damage, intact) ~ temp, family = binomial, \n data = challenger)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) 11.66299 3.29626 3.538 0.000403 ***\ntemp -0.21623 0.05318 -4.066 4.78e-05 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for binomial family taken to be 1)\n\n Null deviance: 38.898 on 22 degrees of freedom\nResidual deviance: 16.912 on 21 degrees of freedom\nAIC: 33.675\n\nNumber of Fisher Scoring iterations: 6\n```\n\n\n:::\n:::\n\n\nWe can see that the p-values of the `intercept` and `temp` are significant. We can also use the intercept and `temp` coefficients to construct the logistic equation, which we can use to sketch the logistic curve.\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nprint(glm_chl_py.summary())\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n Generalized Linear Model Regression Results \n================================================================================\nDep. Variable: ['damage', 'intact'] No. Observations: 23\nModel: GLM Df Residuals: 21\nModel Family: Binomial Df Model: 1\nLink Function: Logit Scale: 1.0000\nMethod: IRLS Log-Likelihood: -14.837\nDate: Fri, 14 Jun 2024 Deviance: 16.912\nTime: 08:42:07 Pearson chi2: 28.1\nNo. Iterations: 7 Pseudo R-squ. (CS): 0.6155\nCovariance Type: nonrobust \n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept 11.6630 3.296 3.538 0.000 5.202 18.124\ntemp -0.2162 0.053 -4.066 0.000 -0.320 -0.112\n==============================================================================\n```\n\n\n:::\n:::\n\n:::\n\n$$E(prop \\ failed\\ orings) = \\frac{\\exp{(11.66 - 0.22 \\times temp)}}{1 + \\exp{(11.66 - 0.22 \\times temp)}}$$\n\nLet's see how well our model would have performed if we had fed it the data from the ill-fated Challenger launch.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(challenger, aes(temp, prop_damaged)) +\n geom_point() +\n geom_smooth(method = \"glm\", se = FALSE, fullrange = TRUE, \n method.args = list(family = binomial)) +\n xlim(25,85)\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in eval(family$initialize): non-integer #successes in a binomial glm!\n```\n\n\n:::\n\n::: {.cell-output-display}\n![](glm-practical-logistic-proportion_files/figure-html/unnamed-chunk-16-1.png){width=672}\n:::\n:::\n\n\n## Python\n\nWe can get the predicted values for the model as follows:\n\n\n::: {.cell}\n\n```{.python .cell-code}\nchallenger_py['predicted_values'] = glm_chl_py.predict()\n\nchallenger_py.head()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n temp damage total intact prop_damaged predicted_values\n0 53 5 6 1 0.833333 0.550479\n1 57 1 6 5 0.166667 0.340217\n2 58 1 6 5 0.166667 0.293476\n3 63 1 6 5 0.166667 0.123496\n4 66 0 6 6 0.000000 0.068598\n```\n\n\n:::\n:::\n\n\nThis would only give us the predicted values for the data we already have. Instead we want to extrapolate to what would have been predicted for a wider range of temperatures. Here, we use a range of $[25, 85]$ degrees Fahrenheit.\n\n\n::: {.cell}\n\n```{.python .cell-code}\nmodel = pd.DataFrame({'temp': list(range(25, 86))})\n\nmodel[\"pred\"] = glm_chl_py.predict(model)\n\nmodel.head()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n temp pred\n0 25 0.998087\n1 26 0.997626\n2 27 0.997055\n3 28 0.996347\n4 29 0.995469\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(challenger_py,\n aes(x = \"temp\",\n y = \"prop_damaged\")) +\n geom_point() +\n geom_line(model, aes(x = \"temp\", y = \"pred\"), colour = \"blue\", size = 1))\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-proportion_files/figure-html/unnamed-chunk-19-1.png){width=614}\n:::\n:::\n\n\n\n::: {.callout-note collapse=true}\n## Generating predicted values\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nAnother way of doing this is to generate a table with data for a range of temperatures, from 25 to 85 degrees Fahrenheit, in steps of 1. 
We can then use these data to generate the logistic curve, based on the fitted model.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# create a table with sequential numbers ranging from 25 to 85\nmodel <- tibble(temp = seq(25, 85, by = 1)) %>% \n # add a new column containing the predicted values\n mutate(.pred = predict(glm_chl, newdata = ., type = \"response\"))\n\nggplot(model, aes(temp, .pred)) +\n geom_line()\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-proportion_files/figure-html/unnamed-chunk-20-3.png){width=672}\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\n# plot the curve and the original data\nggplot(model, aes(temp, .pred)) +\n geom_line(colour = \"blue\") +\n geom_point(data = challenger, aes(temp, prop_damaged)) +\n # add a vertical line at the disaster launch temperature\n geom_vline(xintercept = 31, linetype = \"dashed\")\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-proportion_files/figure-html/unnamed-chunk-21-1.png){width=672}\n:::\n:::\n\n\nIt seems that there was a high probability of o-ring failure at that launch temperature. One thing that the graph shows is that there is a lot of uncertainty involved in this model. We can tell because the fit of the line is very poor at the lower temperature range. There is just very little data to work on, with the data point at 53 F having a large influence on the fit.\n\n## Python\n\nWe already did this above, since this is the most straightforward way of plotting the model in Python.\n\n:::\n:::\n:::\n\n## Exercises\n\n### Predicting failure {#sec-exr_failure}\n\n:::{.callout-exercise}\n\n\n{{< level 2 >}}\n\n\n\nThe data point at 53 degrees Fahrenheit is quite influential for the analysis. Remove this data point and repeat the analysis. Is there still a predicted link between launch temperature and o-ring failure?\n\n::: {.callout-answer collapse=\"true\"}\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nFirst, we need to remove the influential data point:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchallenger_new <- challenger %>% filter(temp != 53)\n```\n:::\n\n\nWe can create a new generalised linear model, based on these data:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nglm_chl_new <- glm(cbind(damage, intact) ~ temp,\n family = binomial,\n data = challenger_new)\n```\n:::\n\n\nWe can get the model parameters as follows:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(glm_chl_new)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nglm(formula = cbind(damage, intact) ~ temp, family = binomial, \n data = challenger_new)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) 5.68223 4.43138 1.282 0.1997 \ntemp -0.12817 0.06697 -1.914 0.0556 .\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n\n(Dispersion parameter for binomial family taken to be 1)\n\n Null deviance: 16.375 on 21 degrees of freedom\nResidual deviance: 12.633 on 20 degrees of freedom\nAIC: 27.572\n\nNumber of Fisher Scoring iterations: 5\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(challenger_new, aes(temp, prop_damaged)) +\n geom_point() +\n geom_smooth(method = \"glm\", se = FALSE, fullrange = TRUE, \n method.args = list(family = binomial)) +\n xlim(25,85) +\n # add a vertical line at 53 F temperature\n geom_vline(xintercept = 53, linetype = \"dashed\")\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in eval(family$initialize): non-integer #successes in a binomial glm!\n```\n\n\n:::\n\n::: {.cell-output-display}\n![](glm-practical-logistic-proportion_files/figure-html/unnamed-chunk-25-1.png){width=672}\n:::\n:::\n\n\nThe predicted proportion of damaged o-rings is markedly less than what was observed.\n\nBefore we can make any firm conclusions, though, we need to check our model:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n1 - pchisq(12.633, 20)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0.8925695\n```\n\n\n:::\n:::\n\n\nWe get quite a high probability (around 0.9) for this which, by the same logic as before, tells us that there is no evidence that the model fits the data poorly – overall, our points are reasonably close to our curve.\n\nIs the model any better than the null though?\n\n\n::: {.cell}\n\n```{.r .cell-code}\n1 - pchisq(16.375 - 12.633, 21 - 20)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0.0530609\n```\n\n\n:::\n\n```{.r .cell-code}\nanova(glm_chl_new, test = 'Chisq')\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nAnalysis of Deviance Table\n\nModel: binomial, link: logit\n\nResponse: cbind(damage, intact)\n\nTerms added sequentially (first to last)\n\n Df Deviance Resid. Df Resid. Dev Pr(>Chi) \nNULL 21 16.375 \ntemp 1 3.7421 20 12.633 0.05306 .\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n```\n\n\n:::\n:::\n\n\nHowever, the model is not significantly better than the null in this case, with a p-value here of just over 0.05 for both of these tests (they give a similar result since, yet again, we have just the one predictor variable).\n\n## Python\n\nFirst, we need to remove the influential data point:\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nchallenger_new_py = challenger_py.query(\"temp != 53\")\n```\n:::\n\n\nWe can create a new generalised linear model, based on these data:\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# create a generalised linear model\nmodel = smf.glm(formula = \"damage + intact ~ temp\",\n family = sm.families.Binomial(),\n data = challenger_new_py)\n# and get the fitted parameters of the model\nglm_chl_new_py = model.fit()\n```\n:::\n\n\nWe can get the model parameters as follows:\n\n\n::: {.cell}\n\n```{.python .cell-code}\nprint(glm_chl_new_py.summary())\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n Generalized Linear Model Regression Results \n================================================================================\nDep. Variable: ['damage', 'intact'] No. Observations: 22\nModel: GLM Df Residuals: 20\nModel Family: Binomial Df Model: 1\nLink Function: Logit Scale: 1.0000\nMethod: IRLS Log-Likelihood: -11.786\nDate: Fri, 14 Jun 2024 Deviance: 12.633\nTime: 08:42:08 Pearson chi2: 16.6\nNo. Iterations: 6 Pseudo R-squ. 
(CS): 0.1564\nCovariance Type: nonrobust \n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept 5.6822 4.431 1.282 0.200 -3.003 14.368\ntemp -0.1282 0.067 -1.914 0.056 -0.259 0.003\n==============================================================================\n```\n\n\n:::\n:::\n\n\nGenerate new model data:\n\n\n::: {.cell}\n\n```{.python .cell-code}\nmodel = pd.DataFrame({'temp': list(range(25, 86))})\n\nmodel[\"pred\"] = glm_chl_new_py.predict(model)\n\nmodel.head()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n temp pred\n0 25 0.922585\n1 26 0.912920\n2 27 0.902177\n3 28 0.890269\n4 29 0.877107\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(challenger_new_py,\n aes(x = \"temp\",\n y = \"prop_damaged\")) +\n geom_point() +\n geom_line(model, aes(x = \"temp\", y = \"pred\"), colour = \"blue\", size = 1) +\n # add a vertical line at 53 F temperature\n geom_vline(xintercept = 53, linetype = \"dashed\"))\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-proportion_files/figure-html/unnamed-chunk-32-1.png){width=614}\n:::\n:::\n\n\nThe predicted proportion of damaged o-rings is markedly less than what was observed.\n\nBefore we can make any firm conclusions, though, we need to check our model:\n\n\n::: {.cell}\n\n```{.python .cell-code}\nchi2.sf(12.633, 20)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n0.8925694610786307\n```\n\n\n:::\n:::\n\n\nWe get quite a high probability (around 0.9) for this which, by the same logic as before, tells us that there is no evidence that the model fits the data poorly – overall, our points are reasonably close to our curve.\n\nIs the model any better than the null though?\n\nFirst we need to define the null model:\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# create a generalised linear model\nmodel = smf.glm(formula = \"damage + intact ~ 1\",\n family = sm.families.Binomial(),\n data = challenger_new_py)\n# and get the fitted parameters of the model\nglm_chl_new_null_py = model.fit()\n\nprint(glm_chl_new_null_py.summary())\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n Generalized Linear Model Regression Results \n================================================================================\nDep. Variable: ['damage', 'intact'] No. Observations: 22\nModel: GLM Df Residuals: 21\nModel Family: Binomial Df Model: 0\nLink Function: Logit Scale: 1.0000\nMethod: IRLS Log-Likelihood: -13.657\nDate: Fri, 14 Jun 2024 Deviance: 16.375\nTime: 08:42:08 Pearson chi2: 16.8\nNo. Iterations: 6 Pseudo R-squ. (CS): -2.220e-16\nCovariance Type: nonrobust \n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept -3.0445 0.418 -7.286 0.000 -3.864 -2.226\n==============================================================================\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.python .cell-code}\nchi2.sf(16.375 - 12.633, 21 - 20)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n0.053060897703157646\n```\n\n\n:::\n:::\n\n\nHowever, the model is not significantly better than the null in this case, with a p-value here of just over 0.05 for both of these tests (they give a similar result since, yet again, we have just the one predictor variable).\n:::\n\nSo, could NASA have predicted what happened? 
This model is not significantly different from the null, i.e., temperature is not a significant predictor. Note that it’s only marginally non-significant, and we do have a high goodness-of-fit value.\n\nIt is possible that, if more data points following a similar trend had been available, the story might have been different. Even if we did use our non-significant model to make a prediction, it doesn’t give us a value anywhere near 5 failures for a temperature of 53 degrees Fahrenheit. So overall, based on the model we’ve fitted with these data, there was no indication that a temperature just a few degrees cooler than previous missions could have been so disastrous for the Challenger.\n:::\n:::\n\n## Summary\n\n::: {.callout-tip}\n#### Key points\n\n- We can use a logistic model for proportional response variables\n\n:::\n", "supporting": [ "glm-practical-logistic-proportion_files" ], diff --git a/_freeze/materials/glm-practical-logistic-proportion/figure-html/unnamed-chunk-10-1.png b/_freeze/materials/glm-practical-logistic-proportion/figure-html/unnamed-chunk-10-1.png index 9b3e566..cd17d72 100644 Binary files a/_freeze/materials/glm-practical-logistic-proportion/figure-html/unnamed-chunk-10-1.png and b/_freeze/materials/glm-practical-logistic-proportion/figure-html/unnamed-chunk-10-1.png differ diff --git a/_freeze/materials/glm-practical-logistic-proportion/figure-html/unnamed-chunk-11-1.png b/_freeze/materials/glm-practical-logistic-proportion/figure-html/unnamed-chunk-11-1.png new file mode 100644 index 0000000..8e32c18 Binary files /dev/null and b/_freeze/materials/glm-practical-logistic-proportion/figure-html/unnamed-chunk-11-1.png differ diff --git a/_freeze/materials/glm-practical-logistic-proportion/figure-html/unnamed-chunk-16-1.png b/_freeze/materials/glm-practical-logistic-proportion/figure-html/unnamed-chunk-16-1.png new file mode 100644 index 0000000..7c097f1 Binary files /dev/null and b/_freeze/materials/glm-practical-logistic-proportion/figure-html/unnamed-chunk-16-1.png differ diff --git a/_freeze/materials/glm-practical-logistic-proportion/figure-html/unnamed-chunk-19-1.png b/_freeze/materials/glm-practical-logistic-proportion/figure-html/unnamed-chunk-19-1.png index 9b3e566..9666f33 100644 Binary files a/_freeze/materials/glm-practical-logistic-proportion/figure-html/unnamed-chunk-19-1.png and b/_freeze/materials/glm-practical-logistic-proportion/figure-html/unnamed-chunk-19-1.png differ diff --git a/_freeze/materials/glm-practical-logistic-proportion/figure-html/unnamed-chunk-20-3.png b/_freeze/materials/glm-practical-logistic-proportion/figure-html/unnamed-chunk-20-3.png new file mode 100644 index 0000000..101c3bb Binary files /dev/null and b/_freeze/materials/glm-practical-logistic-proportion/figure-html/unnamed-chunk-20-3.png differ diff --git a/_freeze/materials/glm-practical-logistic-proportion/figure-html/unnamed-chunk-21-1.png b/_freeze/materials/glm-practical-logistic-proportion/figure-html/unnamed-chunk-21-1.png new file mode 100644 index 0000000..9b3e566 Binary files /dev/null and b/_freeze/materials/glm-practical-logistic-proportion/figure-html/unnamed-chunk-21-1.png differ diff --git a/_freeze/materials/glm-practical-logistic-proportion/figure-html/unnamed-chunk-25-1.png b/_freeze/materials/glm-practical-logistic-proportion/figure-html/unnamed-chunk-25-1.png new file mode 100644 index 0000000..5ea53a3 Binary files /dev/null and b/_freeze/materials/glm-practical-logistic-proportion/figure-html/unnamed-chunk-25-1.png differ diff --git 
a/_freeze/materials/glm-practical-logistic-proportion/figure-html/unnamed-chunk-32-1.png b/_freeze/materials/glm-practical-logistic-proportion/figure-html/unnamed-chunk-32-1.png new file mode 100644 index 0000000..7e6cb68 Binary files /dev/null and b/_freeze/materials/glm-practical-logistic-proportion/figure-html/unnamed-chunk-32-1.png differ diff --git a/_site/materials/glm-practical-logistic-binary.html b/_site/materials/glm-practical-logistic-binary.html index 8902af1..08b1b1f 100644 --- a/_site/materials/glm-practical-logistic-binary.html +++ b/_site/materials/glm-practical-logistic-binary.html @@ -418,30 +418,39 @@

7.1.1 Libraries

+

7.1.2 Functions

+
+
# create diagnostic plots
+ggResidpanel::resid_panel()
+

7.1.3 Libraries

-
# A maths library
-import math
-# A Python data analysis and manipulation tool
-import pandas as pd
-
-# Python equivalent of `ggplot2`
-from plotnine import *
-
-# Statistical models, conducting tests and statistical data exploration
-import statsmodels.api as sm
-
-# Convenience interface for specifying models using formula strings and DataFrames
-import statsmodels.formula.api as smf
-
-# Needed for additional probability functionality
-from scipy.stats import *
+
# A maths library
+import math
+# A Python data analysis and manipulation tool
+import pandas as pd
+
+# Python equivalent of `ggplot2`
+from plotnine import *
+
+# Statistical models, conducting tests and statistical data exploration
+import statsmodels.api as sm
+
+# Convenience interface for specifying models using formula strings and DataFrames
+import statsmodels.formula.api as smf
+
+# Needed for additional probability functionality
+from scipy.stats import *

7.1.4 Functions

@@ -468,12 +477,12 @@

-
early_finches <- read_csv("data/finches_early.csv")
+
early_finches <- read_csv("data/finches_early.csv")
-
early_finches_py = pd.read_csv("data/finches_early.csv")
+
early_finches_py = pd.read_csv("data/finches_early.csv")
@@ -488,13 +497,13 @@

-
ggplot(early_finches,
-       aes(x = factor(pointed_beak),
+
ggplot(early_finches,
+       aes(x = factor(pointed_beak),
           y = blength)) +
-  geom_boxplot()
+ geom_boxplot()
-

+

@@ -502,15 +511,15 @@

We could just give Python the pointed_beak data directly, but then it would view the values as numeric. Which doesn’t really work, because we have two groups as such: those with a pointed beak (1), and those with a blunt one (0).

-

We can force Python to temporarily covert the data to a factor, by making the pointed_beak column an object type. We can do this directly inside the ggplot() function.

+

We can force Python to temporarily convert the data to a factor, by making the pointed_beak column an object type. We can do this directly inside the ggplot() function.

-
(ggplot(early_finches_py,
-         aes(x = early_finches_py.pointed_beak.astype(object),
-             y = "blength")) +
-     geom_boxplot())
+
(ggplot(early_finches_py,
+         aes(x = early_finches_py.pointed_beak.astype(object),
+             y = "blength")) +
+     geom_boxplot())
-

+

@@ -528,12 +537,12 @@

-
ggplot(early_finches,
-       aes(x = blength, y = pointed_beak)) +
-  geom_point()
+
ggplot(early_finches,
+       aes(x = blength, y = pointed_beak)) +
+  geom_point()
-

+

@@ -541,13 +550,13 @@

-
(ggplot(early_finches_py,
-         aes(x = "blength",
-             y = "pointed_beak")) +
-     geom_point())
+
(ggplot(early_finches_py,
+         aes(x = "blength",
+             y = "pointed_beak")) +
+     geom_point())
-

+

@@ -564,13 +573,13 @@

-
ggplot(early_finches,
-       aes(x = blength, y = pointed_beak)) +
-  geom_point() +
-  geom_smooth(method = "lm", se = FALSE)
+
ggplot(early_finches,
+       aes(x = blength, y = pointed_beak)) +
+  geom_point() +
+  geom_smooth(method = "lm", se = FALSE)
-

+

@@ -578,16 +587,16 @@

-
(ggplot(early_finches_py,
-         aes(x = "blength",
-             y = "pointed_beak")) +
-     geom_point() +
-     geom_smooth(method = "lm",
-                 colour = "blue",
-                 se = False))
+
(ggplot(early_finches_py,
+         aes(x = "blength",
+             y = "pointed_beak")) +
+     geom_point() +
+     geom_smooth(method = "lm",
+                 colour = "blue",
+                 se = False))
-

+

@@ -605,15 +614,15 @@

-
lm_bks <- lm(pointed_beak ~ blength,
+
lm_bks <- lm(pointed_beak ~ blength,
              data = early_finches)
 
-resid_panel(lm_bks,
+resid_panel(lm_bks,
             plots = c("resid", "qq", "ls", "cookd"),
             smoother = TRUE)
-

+

@@ -622,20 +631,20 @@

First, we create a linear model:

-
# create a linear model
-model = smf.ols(formula = "pointed_beak ~ blength",
-                data = early_finches_py)
-# and get the fitted parameters of the model
-lm_bks_py = model.fit()
+
# create a linear model
+model = smf.ols(formula = "pointed_beak ~ blength",
+                data = early_finches_py)
+# and get the fitted parameters of the model
+lm_bks_py = model.fit()

Next, we can create the diagnostic plots:

-
dgplots(lm_bks_py)
+
dgplots(lm_bks_py)
-

+

@@ -690,7 +699,7 @@

-

+

@@ -714,7 +723,7 @@

In R we have a few options to do this, and by far the most familiar function would be glm(). Here we save the model in an object called glm_bks:

-
glm_bks <- glm(pointed_beak ~ blength,
+
glm_bks <- glm(pointed_beak ~ blength,
                family = binomial,
                data = early_finches)
@@ -724,12 +733,12 @@

In Python we have a few options to do this, and by far the most familiar function would be glm(). Here we save the model in an object called glm_bks_py:

-
# create a linear model
-model = smf.glm(formula = "pointed_beak ~ blength",
-                family = sm.families.Binomial(),
-                data = early_finches_py)
-# and get the fitted parameters of the model
-glm_bks_py = model.fit()
+
# create a linear model
+model = smf.glm(formula = "pointed_beak ~ blength",
+                family = sm.families.Binomial(),
+                data = early_finches_py)
+# and get the fitted parameters of the model
+glm_bks_py = model.fit()

The format of this function is similar to that used by the ols() function for linear models. The important difference is that we must specify the family of error distribution to use. For logistic regression we must set the family to binomial. This is buried deep inside the statsmodels package and needs to be defined as sm.families.Binomial().

@@ -746,7 +755,7 @@

-
summary(glm_bks)
+
summary(glm_bks)

 Call:
@@ -771,7 +780,7 @@ 

-
print(glm_bks_py.summary())
+
print(glm_bks_py.summary())
                 Generalized Linear Model Regression Results                  
 ==============================================================================
@@ -780,8 +789,8 @@ 

Model Family: Binomial Df Model: 1 Link Function: Logit Scale: 1.0000 Method: IRLS Log-Likelihood: -4.5939 -Date: Thu, 01 Feb 2024 Deviance: 9.1879 -Time: 07:29:39 Pearson chi2: 15.1 +Date: Fri, 14 Jun 2024 Deviance: 9.1879 +Time: 08:41:55 Pearson chi2: 15.1 No. Iterations: 8 Pseudo R-squ. (CS): 0.7093 Covariance Type: nonrobust ============================================================================== @@ -836,14 +845,14 @@

Well, the probability of having a pointed beak if the beak length is large (for example 15 mm) can be calculated as follows:

-
exp(-43.41 + 3.39 * 15) / (1 + exp(-43.41 + 3.39 * 15))
+
exp(-43.41 + 3.39 * 15) / (1 + exp(-43.41 + 3.39 * 15))
[1] 0.9994131

If the beak length is small (for example 10 mm), the probability of having a pointed beak is extremely low:

-
exp(-43.41 + 3.39 * 10) / (1 + exp(-43.41 + 3.39 * 10))
+
exp(-43.41 + 3.39 * 10) / (1 + exp(-43.41 + 3.39 * 10))
[1] 7.410155e-05
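If you find yourself typing out this back-transformation repeatedly, it can be handy to wrap it in a small function. A minimal R sketch, using the (rounded) coefficients from the summary above; the inv_logit() helper is our own invention, not something defined in these materials:

```r
# inverse logit: converts intercept + slope * x back to a probability
inv_logit <- function(intercept, slope, x) {
  exp(intercept + slope * x) / (1 + exp(intercept + slope * x))
}

inv_logit(-43.41, 3.39, 15)  # long beak, ~0.9994
inv_logit(-43.41, 3.39, 10)  # short beak, ~7.4e-05
```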
@@ -852,18 +861,18 @@

Well, the probability of having a pointed beak if the beak length is large (for example 15 mm) can be calculated as follows:

-
# import the math library
-import math
+
# import the math library
+import math
-
math.exp(-43.41 + 3.39 * 15) / (1 + math.exp(-43.41 + 3.39 * 15))
+
math.exp(-43.41 + 3.39 * 15) / (1 + math.exp(-43.41 + 3.39 * 15))
0.9994130595039192

If the beak length is small (for example 10 mm), the probability of having a pointed beak is extremely low:

-
math.exp(-43.41 + 3.39 * 10) / (1 + math.exp(-43.41 + 3.39 * 10))
+
math.exp(-43.41 + 3.39 * 10) / (1 + math.exp(-43.41 + 3.39 * 10))
7.410155028945912e-05
@@ -892,30 +901,30 @@

-
glm_bks %>% 
-  augment(type.predict = "response") %>% 
-  ggplot() +
-  geom_point(aes(x = blength, y = pointed_beak)) +
-  geom_line(aes(x = blength, y = .fitted),
+
glm_bks %>% 
+  augment(type.predict = "response") %>% 
+  ggplot() +
+  geom_point(aes(x = blength, y = pointed_beak)) +
+  geom_line(aes(x = blength, y = .fitted),
             linetype = "dashed",
             colour = "blue") +
-  geom_point(aes(x = blength, y = .fitted),
+  geom_point(aes(x = blength, y = .fitted),
              colour = "blue", alpha = 0.5) +
-  labs(x = "beak length (mm)",
+  labs(x = "beak length (mm)",
        y = "Probability")
-
(ggplot(early_finches_py) +
-  geom_point(aes(x = "blength", y = "pointed_beak")) +
-  geom_line(aes(x = "blength", y = glm_bks_py.fittedvalues),
-            linetype = "dashed",
-            colour = "blue") +
-  geom_point(aes(x = "blength", y = glm_bks_py.fittedvalues),
-             colour = "blue", alpha = 0.5) +
-  labs(x = "beak length (mm)",
-       y = "Probability"))
+
(ggplot(early_finches_py) +
+  geom_point(aes(x = "blength", y = "pointed_beak")) +
+  geom_line(aes(x = "blength", y = glm_bks_py.fittedvalues),
+            linetype = "dashed",
+            colour = "blue") +
+  geom_point(aes(x = "blength", y = glm_bks_py.fittedvalues),
+             colour = "blue", alpha = 0.5) +
+  labs(x = "beak length (mm)",
+       y = "Probability"))
@@ -937,13 +946,6 @@

The graph shows us that, based on the data that we have and the model we used to make predictions about our response variable, the probability of seeing a pointed beak increases with beak length.

Short beaks are more closely associated with the bluntly shaped beaks, whereas long beaks are more closely associated with the pointed shape. It’s also clear that there is a range of beak lengths (around 13 mm) where the probability of getting one shape or another is much more even.

- - - - - - -

7.6 Assumptions

As explained in the background chapter, we can’t really use the standard diagnostic plots to assess assumptions. We’re not going to go into a lot of detail for now, but there is one thing that we can do: look for influential points using the Cook’s distance plot.

@@ -955,60 +957,29 @@

-
plot(glm_bks , which = 4)
+
resid_panel(glm_bks, plots = "cookd")
-

+

-
- -
-
-

Instead of using the plot() function, we can also extract the values directly from the glm object. We can use the augment() function to do this and create a lollipop or stem plot:

-
-
glm_bks %>% 
-  augment() %>%            # get underlying data
-  select(.cooksd) %>%      # select the Cook's d
-  mutate(obs = 1:n()) %>%  # create an index column
-  ggplot(aes(x = obs, y = .cooksd)) +
-  geom_point() +
-  geom_segment(aes(x = obs, y = .cooksd, xend = obs, yend = 0))
-
-
-

-
-
-
-
-
-
-

As always, there are different ways of doing this. Here we extract the Cook’s d values from the glm object and put them in a Pandas DataFrame. We can then use that to plot them in a lollipop or stem plot.

-
# extract the Cook's distances
-glm_bks_py_resid = pd.DataFrame(glm_bks_py.
-                                get_influence().
-                                summary_frame()["cooks_d"])
-
-# add row index 
-glm_bks_py_resid['obs'] = glm_bks_py_resid.reset_index().index
+
# extract the Cook's distances
+glm_bks_py_resid = pd.DataFrame(glm_bks_py.
+                                get_influence().
+                                summary_frame()["cooks_d"])
+
+# add row index 
+glm_bks_py_resid['obs'] = glm_bks_py_resid.reset_index().index

We now have two columns:

-
glm_bks_py_resid
+
glm_bks_py_resid
         cooks_d  obs
 0   1.854360e-07    0
@@ -1028,14 +999,14 @@ 

We can use these to create the plot:

-
(ggplot(glm_bks_py_resid,
-         aes(x = "obs",
-             y = "cooks_d")) +
-     geom_segment(aes(x = "obs", y = "cooks_d", xend = "obs", yend = 0)) +
-     geom_point())
+
(ggplot(glm_bks_py_resid,
+         aes(x = "obs",
+             y = "cooks_d")) +
+     geom_segment(aes(x = "obs", y = "cooks_d", xend = "obs", yend = 0)) +
+     geom_point())
-

+
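There is no hard cut-off for what counts as "influential", but one common rule of thumb (our assumption here, not a rule from these materials) is to look more closely at observations with a Cook's distance above 4/n. A quick R sketch using the fitted glm_bks model:

```r
# extract Cook's distances from the fitted glm and flag any observation
# exceeding the (heuristic) 4/n threshold
cooks <- cooks.distance(glm_bks)
which(cooks > 4 / length(cooks))
```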

@@ -1059,7 +1030,7 @@

Unfortunately, there isn’t a single command that does this for us, and we have to lift some of the numbers from the summary output ourselves.

-
pchisq(9.1879, 59, lower.tail = FALSE)
+
pchisq(9.1879, 59, lower.tail = FALSE)
[1] 1
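That said, the numbers don't have to be copied by hand: the residual deviance and its degrees of freedom can be read straight off the fitted model object. A small sketch using base R's accessor functions:

```r
# identical test, but extracting the values from the model itself
# instead of lifting them from the summary output
pchisq(deviance(glm_bks), df.residual(glm_bks), lower.tail = FALSE)
```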
@@ -1068,9 +1039,9 @@

-
from scipy.stats import chi2
-
-chi2.sf(9.1879, 59)
+
from scipy.stats import chi2
+
+chi2.sf(9.1879, 59)
0.9999999999999916
@@ -1088,7 +1059,7 @@

-
pchisq(84.5476 - 9.1879, 60 - 59, lower.tail = FALSE)
+
pchisq(84.5476 - 9.1879, 60 - 59, lower.tail = FALSE)
[1] 3.923163e-18
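As above, the hand-copied values can be avoided, since a fitted glm object also stores its null deviance and the corresponding degrees of freedom. A small sketch:

```r
# same comparison, using the values stored on the glm object
pchisq(glm_bks$null.deviance - deviance(glm_bks),
       glm_bks$df.null - df.residual(glm_bks),
       lower.tail = FALSE)
```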
@@ -1098,14 +1069,14 @@

First we need to define the null model:

-
# create a linear model
-model = smf.glm(formula = "pointed_beak ~ 1",
-                family = sm.families.Binomial(),
-                data = early_finches_py)
-# and get the fitted parameters of the model
-glm_bks_null_py = model.fit()
-
-print(glm_bks_null_py.summary())
+
# create a linear model
+model = smf.glm(formula = "pointed_beak ~ 1",
+                family = sm.families.Binomial(),
+                data = early_finches_py)
+# and get the fitted parameters of the model
+glm_bks_null_py = model.fit()
+
+print(glm_bks_null_py.summary())
                 Generalized Linear Model Regression Results                  
 ==============================================================================
@@ -1114,8 +1085,8 @@ 

Model Family: Binomial Df Model: 0 Link Function: Logit Scale: 1.0000 Method: IRLS Log-Likelihood: -42.274 -Date: Thu, 01 Feb 2024 Deviance: 84.548 -Time: 07:29:43 Pearson chi2: 61.0 +Date: Fri, 14 Jun 2024 Deviance: 84.548 +Time: 08:41:56 Pearson chi2: 61.0 No. Iterations: 3 Pseudo R-squ. (CS): 0.000 Covariance Type: nonrobust ============================================================================== @@ -1127,7 +1098,7 @@

In order to compare our original fitted model to the null model we need to know the deviances of both models and the residual degrees of freedom of both models, which we could get from the summary method.

-
chi2.sf(84.5476 - 9.1879, 60 - 59)
+
chi2.sf(84.5476 - 9.1879, 60 - 59)
3.9231627082752525e-18
@@ -1147,7 +1118,7 @@

-
anova(glm_bks, test = "Chisq")
+
anova(glm_bks, test = "Chisq")
Analysis of Deviance Table
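If you want the deviance table in a form you can compute on, the broom package (which the libraries section loads) can tidy it into a data frame; a sketch, assuming tidy() accepts this analysis of deviance table:

```r
# turn the analysis of deviance table into a tidy tibble, which makes
# the p-value straightforward to extract programmatically
library(broom)
tidy(anova(glm_bks, test = "Chisq"))
```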
 
@@ -1178,7 +1149,7 @@ 

7.8.1 Diabetes

-
+
@@ -1187,7 +1158,7 @@

-
+
@@ -1203,7 +1174,7 @@

- -
+
@@ -1226,12 +1197,12 @@

-
diabetes <- read_csv("data/diabetes.csv")
+
diabetes <- read_csv("data/diabetes.csv")
-
diabetes_py = pd.read_csv("data/diabetes.csv")
+
diabetes_py = pd.read_csv("data/diabetes.csv")
@@ -1246,13 +1217,13 @@

-
ggplot(diabetes,
-       aes(x = factor(test_result),
+
ggplot(diabetes,
+       aes(x = factor(test_result),
            y = glucose)) +
-  geom_boxplot()
+ geom_boxplot()
-

+

@@ -1260,15 +1231,15 @@

We could just give Python the test_result data directly, but then it would view the values as numeric. Which doesn’t really work, because we have two groups as such: those with a negative diabetes test result, and those with a positive one.

-

We can force Python to temporarily covert the data to a factor, by making the test_result column an object type. We can do this directly inside the ggplot() function.

+

We can force Python to temporarily convert the data to a factor, by making the test_result column an object type. We can do this directly inside the ggplot() function.

-
(ggplot(diabetes_py,
-         aes(x = diabetes_py.test_result.astype(object),
-             y = "glucose")) +
-     geom_boxplot())
+
(ggplot(diabetes_py,
+         aes(x = diabetes_py.test_result.astype(object),
+             y = "glucose")) +
+     geom_boxplot())
-

+

@@ -1286,13 +1257,13 @@

-
ggplot(diabetes,
-       aes(x = glucose,
+
ggplot(diabetes,
+       aes(x = glucose,
            y = test_result)) +
-  geom_point()
+ geom_point()
-

+

@@ -1300,13 +1271,13 @@

-
(ggplot(diabetes_py,
-         aes(x = "glucose",
-             y = "test_result")) +
-     geom_point())
+
(ggplot(diabetes_py,
+         aes(x = "glucose",
+             y = "test_result")) +
+     geom_point())
-

+

@@ -1324,7 +1295,7 @@

We’ll use the glm() function to create a generalised linear model. Here we save the model in an object called glm_dia:

-
glm_dia <- glm(test_result ~ glucose,
+
glm_dia <- glm(test_result ~ glucose,
                family = binomial,
                data = diabetes)
@@ -1332,12 +1303,12 @@

-
# create a linear model
-model = smf.glm(formula = "test_result ~ glucose",
-                family = sm.families.Binomial(),
-                data = diabetes_py)
-# and get the fitted parameters of the model
-glm_dia_py = model.fit()
+
# create a linear model
+model = smf.glm(formula = "test_result ~ glucose",
+                family = sm.families.Binomial(),
+                data = diabetes_py)
+# and get the fitted parameters of the model
+glm_dia_py = model.fit()
@@ -1351,7 +1322,7 @@

-
summary(glm_dia)
+
summary(glm_dia)

 Call:
@@ -1376,7 +1347,7 @@ 

-
print(glm_dia_py.summary())
+
print(glm_dia_py.summary())
                 Generalized Linear Model Regression Results                  
 ==============================================================================
@@ -1385,8 +1356,8 @@ 

Model Family: Binomial Df Model: 1 Link Function: Logit Scale: 1.0000 Method: IRLS Log-Likelihood: -376.10 -Date: Thu, 01 Feb 2024 Deviance: 752.20 -Time: 07:29:45 Pearson chi2: 713. +Date: Fri, 14 Jun 2024 Deviance: 752.20 +Time: 08:41:57 Pearson chi2: 713. No. Iterations: 4 Pseudo R-squ. (CS): 0.2238 Covariance Type: nonrobust ============================================================================== @@ -1413,7 +1384,7 @@

-
exp(-5.61 + 0.04 * 150) / (1 + exp(-5.61 + 0.04 * 150))
+
exp(-5.61 + 0.04 * 150) / (1 + exp(-5.61 + 0.04 * 150))
[1] 0.5962827
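The same probability can also come from predict() on the fitted model, which uses the full-precision coefficients rather than the rounded ones typed above; a quick sketch:

```r
# predicted probability of a positive test at glucose = 150, using the
# unrounded coefficients stored in glm_dia (differs slightly from 0.596)
predict(glm_dia, newdata = data.frame(glucose = 150), type = "response")
```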
@@ -1421,7 +1392,7 @@

-
math.exp(-5.61 + 0.04 * 150) / (1 + math.exp(-5.61 + 0.04 * 150))
+
math.exp(-5.61 + 0.04 * 150) / (1 + math.exp(-5.61 + 0.04 * 150))
0.5962826992967878
diff --git a/_site/materials/glm-practical-logistic-binary_files/figure-html/unnamed-chunk-9-1.png b/_site/materials/glm-practical-logistic-binary_files/figure-html/unnamed-chunk-9-1.png index f1a858b..d21f78c 100644 Binary files a/_site/materials/glm-practical-logistic-binary_files/figure-html/unnamed-chunk-9-1.png and b/_site/materials/glm-practical-logistic-binary_files/figure-html/unnamed-chunk-9-1.png differ diff --git a/_site/search.json b/_site/search.json index d0a41e2..05b9ebb 100644 --- a/_site/search.json +++ b/_site/search.json @@ -191,7 +191,7 @@ "href": "materials/glm-practical-logistic-binary.html#libraries-and-functions", "title": "\n7  Binary response\n", "section": "", - "text": "Click to expand\n\n\n\n\n\n\n\nR\nPython\n\n\n\n\n7.1.1 Libraries\n\n7.1.2 Functions\n\n\n\n\n7.1.3 Libraries\n\n# A maths library\nimport math\n# A Python data analysis and manipulation tool\nimport pandas as pd\n\n# Python equivalent of `ggplot2`\nfrom plotnine import *\n\n# Statistical models, conducting tests and statistical data exploration\nimport statsmodels.api as sm\n\n# Convenience interface for specifying models using formula strings and DataFrames\nimport statsmodels.formula.api as smf\n\n# Needed for additional probability functionality\nfrom scipy.stats import *\n\n\n7.1.4 Functions", + "text": "Click to expand\n\n\n\n\n\n\n\nR\nPython\n\n\n\n\n7.1.1 Libraries\n\nlibrary(broom)\nlibrary(tidyverse)\nlibrary(ggResidpanel)\n\n\n7.1.2 Functions\n\n# create diagnostic plots\nggResidpanel::resid_panel()\n\n\n\n\n\n7.1.3 Libraries\n\n# A maths library\nimport math\n# A Python data analysis and manipulation tool\nimport pandas as pd\n\n# Python equivalent of `ggplot2`\nfrom plotnine import *\n\n# Statistical models, conducting tests and statistical data exploration\nimport statsmodels.api as sm\n\n# Convenience interface for specifying models using formula strings and DataFrames\nimport statsmodels.formula.api as smf\n\n# Needed for additional probability functionality\nfrom scipy.stats import *\n\n\n7.1.4 Functions", "crumbs": [ "Binary responses", "7  Binary response" @@ -224,7 +224,7 @@ "href": "materials/glm-practical-logistic-binary.html#model-output", "title": "\n7  Binary response\n", "section": "\n7.4 Model output", - "text": "7.4 Model output\nThat’s the easy part done! The trickier part is interpreting the output. First of all, we’ll get some summary information.\n\n\nR\nPython\n\n\n\n\nsummary(glm_bks)\n\n\nCall:\nglm(formula = pointed_beak ~ blength, family = binomial, data = early_finches)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) -43.410 15.250 -2.847 0.00442 **\nblength 3.387 1.193 2.839 0.00452 **\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for binomial family taken to be 1)\n\n Null deviance: 84.5476 on 60 degrees of freedom\nResidual deviance: 9.1879 on 59 degrees of freedom\nAIC: 13.188\n\nNumber of Fisher Scoring iterations: 8\n\n\n\n\n\nprint(glm_bks_py.summary())\n\n Generalized Linear Model Regression Results \n==============================================================================\nDep. Variable: pointed_beak No. Observations: 61\nModel: GLM Df Residuals: 59\nModel Family: Binomial Df Model: 1\nLink Function: Logit Scale: 1.0000\nMethod: IRLS Log-Likelihood: -4.5939\nDate: Thu, 01 Feb 2024 Deviance: 9.1879\nTime: 07:29:39 Pearson chi2: 15.1\nNo. Iterations: 8 Pseudo R-squ. 
(CS): 0.7093\nCovariance Type: nonrobust \n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept -43.4096 15.250 -2.847 0.004 -73.298 -13.521\nblength 3.3866 1.193 2.839 0.005 1.049 5.724\n==============================================================================\n\n\n\n\n\nThere’s a lot to unpack here, but let’s start with what we’re familiar with: coefficients!", + "text": "7.4 Model output\nThat’s the easy part done! The trickier part is interpreting the output. First of all, we’ll get some summary information.\n\n\nR\nPython\n\n\n\n\nsummary(glm_bks)\n\n\nCall:\nglm(formula = pointed_beak ~ blength, family = binomial, data = early_finches)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) -43.410 15.250 -2.847 0.00442 **\nblength 3.387 1.193 2.839 0.00452 **\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for binomial family taken to be 1)\n\n Null deviance: 84.5476 on 60 degrees of freedom\nResidual deviance: 9.1879 on 59 degrees of freedom\nAIC: 13.188\n\nNumber of Fisher Scoring iterations: 8\n\n\n\n\n\nprint(glm_bks_py.summary())\n\n Generalized Linear Model Regression Results \n==============================================================================\nDep. Variable: pointed_beak No. Observations: 61\nModel: GLM Df Residuals: 59\nModel Family: Binomial Df Model: 1\nLink Function: Logit Scale: 1.0000\nMethod: IRLS Log-Likelihood: -4.5939\nDate: Fri, 14 Jun 2024 Deviance: 9.1879\nTime: 08:41:55 Pearson chi2: 15.1\nNo. Iterations: 8 Pseudo R-squ. (CS): 0.7093\nCovariance Type: nonrobust \n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept -43.4096 15.250 -2.847 0.004 -73.298 -13.521\nblength 3.3866 1.193 2.839 0.005 1.049 5.724\n==============================================================================\n\n\n\n\n\nThere’s a lot to unpack here, but let’s start with what we’re familiar with: coefficients!", "crumbs": [ "Binary responses", "7  Binary response" @@ -246,7 +246,7 @@ "href": "materials/glm-practical-logistic-binary.html#assumptions", "title": "\n7  Binary response\n", "section": "\n7.6 Assumptions", - "text": "7.6 Assumptions\nAs explained in the background chapter, we can’t really use the standard diagnostic plots to assess assumptions. We’re not going to go into a lot of detail for now, but there is one thing that we can do: look for influential points using the Cook’s distance plot.\n\n\nR\nPython\n\n\n\n\nplot(glm_bks , which = 4)\n\n\n\n\n\n\n\n\n\n\n\n\n\nExtracting the Cook’s distances from the glm object\n\n\n\n\n\nInstead of using the plot() function, we can also extract the values directly from the glm object. We can use the augment() function to do this and create a lollipop or stem plot:\n\nglm_bks %>% \n augment() %>% # get underlying data\n select(.cooksd) %>% # select the Cook's d\n mutate(obs = 1:n()) %>% # create an index column\n ggplot(aes(x = obs, y = .cooksd)) +\n geom_point() +\n geom_segment(aes(x = obs, y = .cooksd, xend = obs, yend = 0))\n\n\n\n\n\n\n\n\n\n\n\n\nAs always, there are different ways of doing this. Here we extract the Cook’s d values from the glm object and put them in a Pandas DataFrame. 
We can then use that to plot them in a lollipop or stem plot.\n\n# extract the Cook's distances\nglm_bks_py_resid = pd.DataFrame(glm_bks_py.\n get_influence().\n summary_frame()[\"cooks_d\"])\n\n# add row index \nglm_bks_py_resid['obs'] = glm_bks_py_resid.reset_index().index\n\nWe now have two columns:\n\nglm_bks_py_resid\n\n cooks_d obs\n0 1.854360e-07 0\n1 3.388262e-07 1\n2 3.217960e-05 2\n3 1.194847e-05 3\n4 6.643975e-06 4\n.. ... ...\n56 1.225519e-05 56\n57 2.484468e-05 57\n58 6.781364e-06 58\n59 1.850240e-07 59\n60 3.532360e-05 60\n\n[61 rows x 2 columns]\n\n\nWe can use these to create the plot:\n\n(ggplot(glm_bks_py_resid,\n aes(x = \"obs\",\n y = \"cooks_d\")) +\n geom_segment(aes(x = \"obs\", y = \"cooks_d\", xend = \"obs\", yend = 0)) +\n geom_point())\n\n\n\n\n\n\n\n\n\n\nThis shows that there are no very obvious influential points. You could regard point 34 as potentially influential (it’s got a Cook’s distance of around 0.8), but I’m not overly worried.\nIf we were worried, we’d remove the troublesome data point, re-run the analysis and see if that changes the statistical outcome. If so, then our entire (statistical) conclusion hinges on one data point, which is not a very robust bit of research. If it doesn’t change our significance, then all is well, even though that data point is influential.", + "text": "7.6 Assumptions\nAs explained in the background chapter, we can’t really use the standard diagnostic plots to assess assumptions. We’re not going to go into a lot of detail for now, but there is one thing that we can do: look for influential points using the Cook’s distance plot.\n\n\nR\nPython\n\n\n\n\nresid_panel(glm_bks, plots = \"cookd\")\n\n\n\n\n\n\n\n\n\nAs always, there are different ways of doing this. Here we extract the Cook’s d values from the glm object and put them in a Pandas DataFrame. We can then use that to plot them in a lollipop or stem plot.\n\n# extract the Cook's distances\nglm_bks_py_resid = pd.DataFrame(glm_bks_py.\n get_influence().\n summary_frame()[\"cooks_d\"])\n\n# add row index \nglm_bks_py_resid['obs'] = glm_bks_py_resid.reset_index().index\n\nWe now have two columns:\n\nglm_bks_py_resid\n\n cooks_d obs\n0 1.854360e-07 0\n1 3.388262e-07 1\n2 3.217960e-05 2\n3 1.194847e-05 3\n4 6.643975e-06 4\n.. ... ...\n56 1.225519e-05 56\n57 2.484468e-05 57\n58 6.781364e-06 58\n59 1.850240e-07 59\n60 3.532360e-05 60\n\n[61 rows x 2 columns]\n\n\nWe can use these to create the plot:\n\n(ggplot(glm_bks_py_resid,\n aes(x = \"obs\",\n y = \"cooks_d\")) +\n geom_segment(aes(x = \"obs\", y = \"cooks_d\", xend = \"obs\", yend = 0)) +\n geom_point())\n\n\n\n\n\n\n\n\n\n\nThis shows that there are no very obvious influential points. You could regard point 34 as potentially influential (it’s got a Cook’s distance of around 0.8), but I’m not overly worried.\nIf we were worried, we’d remove the troublesome data point, re-run the analysis and see if that changes the statistical outcome. If so, then our entire (statistical) conclusion hinges on one data point, which is not a very robust bit of research. 
If it doesn’t change our significance, then all is well, even though that data point is influential.", "crumbs": [ "Binary responses", "7  Binary response" @@ -257,7 +257,7 @@ "href": "materials/glm-practical-logistic-binary.html#assessing-significance", "title": "\n7  Binary response\n", "section": "\n7.7 Assessing significance", - "text": "7.7 Assessing significance\nWe can ask several questions.\nIs the model well-specified?\nRoughly speaking this asks “can our model predict our data reasonably well?”\n\n\nR\nPython\n\n\n\nUnfortunately, there isn’t a single command that does this for us, and we have to lift some of the numbers from the summary output ourselves.\n\npchisq(9.1879, 59, lower.tail = FALSE)\n\n[1] 1\n\n\nHere, we’ve used the pchisq function (which calculates the correct probability for us – ask if you want a hand-wavy explanation). The first argument to it is the residual deviance value from the summary table, the second argument to is the residual degrees of freedom argument from the same table.\n\n\n\nfrom scipy.stats import chi2\n\nchi2.sf(9.1879, 59)\n\n0.9999999999999916\n\n\n\n\n\nThis gives us a probability of 1. We can interpret this as the probability that the model is actually good. There aren’t any strict conventions on how to interpret this value but, for me, a tiny value would indicate a rubbish model.\nIs the overall model better than the null model?\n\n\nR\nPython\n\n\n\n\npchisq(84.5476 - 9.1879, 60 - 59, lower.tail = FALSE)\n\n[1] 3.923163e-18\n\n\nHere we’ve used the pchisq function again (if you didn’t ask before, you probably aren’t going to ask now).\n\n\nFirst we need to define the null model:\n\n# create a linear model\nmodel = smf.glm(formula = \"pointed_beak ~ 1\",\n family = sm.families.Binomial(),\n data = early_finches_py)\n# and get the fitted parameters of the model\nglm_bks_null_py = model.fit()\n\nprint(glm_bks_null_py.summary())\n\n Generalized Linear Model Regression Results \n==============================================================================\nDep. Variable: pointed_beak No. Observations: 61\nModel: GLM Df Residuals: 60\nModel Family: Binomial Df Model: 0\nLink Function: Logit Scale: 1.0000\nMethod: IRLS Log-Likelihood: -42.274\nDate: Thu, 01 Feb 2024 Deviance: 84.548\nTime: 07:29:43 Pearson chi2: 61.0\nNo. Iterations: 3 Pseudo R-squ. (CS): 0.000\nCovariance Type: nonrobust \n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept 0.0328 0.256 0.128 0.898 -0.469 0.535\n==============================================================================\n\n\nIn order to compare our original fitted model to the null model we need to know the deviances of both models and the residual degrees of freedom of both models, which we could get from the summary method.\n\nchi2.sf(84.5476 - 9.1879, 60 - 59)\n\n3.9231627082752525e-18\n\n\n\n\n\nThe first argument is the difference between the null and residual deviances and the second argument is the difference in degrees of freedom between the null and residual models. All of these values can be lifted from the summary table.\nThis gives us a probability of pretty much zero. This value is doing a formal test to see whether our fitted model is significantly different from the null model. 
Here we can treat this a classical hypothesis test and since this p-value is less than 0.05 then we can say that our fitted model (with blength as a predictor variable) is definitely better than the null model (which has no predictor variables). Woohoo!\nAre any of the individual predictors significant?\nFinally, we’ll use the anova function from before to determine which predictor variables are important, and specifically in this case whether the beak length predictor is significant.\n\n\nR\nPython\n\n\n\n\nanova(glm_bks, test = \"Chisq\")\n\nAnalysis of Deviance Table\n\nModel: binomial, link: logit\n\nResponse: pointed_beak\n\nTerms added sequentially (first to last)\n\n Df Deviance Resid. Df Resid. Dev Pr(>Chi) \nNULL 60 84.548 \nblength 1 75.36 59 9.188 < 2.2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n\nThe anova() function is a true workhorse within R! This time we’ve used it to create an Analysis of Deviance table. This is exactly equivalent to an ordinary ANOVA table where we have rows corresponding to each predictor variable and a p-value telling us whether that variable is significant or not.\nThe p-value for the blength predictor is written under then Pr(>Chi) column and we can see that it is less than < 2.2e-16. So, beak length is a significant predictor.\nThis shouldn’t be surprising since we already saw that our overall model was better than the null model, which in this case is exactly the same as asking whether the beak length term is significant. However, in more complicated models with multiple predictors these two comparisons (and p-values) won’t be the same.\n\n\nAlas, for some inexplicable reason this is not (yet?) possible to do in Python. At least, unbeknownst to me…", + "text": "7.7 Assessing significance\nWe can ask several questions.\nIs the model well-specified?\nRoughly speaking this asks “can our model predict our data reasonably well?”\n\n\nR\nPython\n\n\n\nUnfortunately, there isn’t a single command that does this for us, and we have to lift some of the numbers from the summary output ourselves.\n\npchisq(9.1879, 59, lower.tail = FALSE)\n\n[1] 1\n\n\nHere, we’ve used the pchisq function (which calculates the correct probability for us – ask if you want a hand-wavy explanation). The first argument to it is the residual deviance value from the summary table, the second argument to is the residual degrees of freedom argument from the same table.\n\n\n\nfrom scipy.stats import chi2\n\nchi2.sf(9.1879, 59)\n\n0.9999999999999916\n\n\n\n\n\nThis gives us a probability of 1. We can interpret this as the probability that the model is actually good. There aren’t any strict conventions on how to interpret this value but, for me, a tiny value would indicate a rubbish model.\nIs the overall model better than the null model?\n\n\nR\nPython\n\n\n\n\npchisq(84.5476 - 9.1879, 60 - 59, lower.tail = FALSE)\n\n[1] 3.923163e-18\n\n\nHere we’ve used the pchisq function again (if you didn’t ask before, you probably aren’t going to ask now).\n\n\nFirst we need to define the null model:\n\n# create a linear model\nmodel = smf.glm(formula = \"pointed_beak ~ 1\",\n family = sm.families.Binomial(),\n data = early_finches_py)\n# and get the fitted parameters of the model\nglm_bks_null_py = model.fit()\n\nprint(glm_bks_null_py.summary())\n\n Generalized Linear Model Regression Results \n==============================================================================\nDep. Variable: pointed_beak No. 
Observations: 61\nModel: GLM Df Residuals: 60\nModel Family: Binomial Df Model: 0\nLink Function: Logit Scale: 1.0000\nMethod: IRLS Log-Likelihood: -42.274\nDate: Fri, 14 Jun 2024 Deviance: 84.548\nTime: 08:41:56 Pearson chi2: 61.0\nNo. Iterations: 3 Pseudo R-squ. (CS): 0.000\nCovariance Type: nonrobust \n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept 0.0328 0.256 0.128 0.898 -0.469 0.535\n==============================================================================\n\n\nIn order to compare our original fitted model to the null model we need to know the deviances of both models and the residual degrees of freedom of both models, which we could get from the summary method.\n\nchi2.sf(84.5476 - 9.1879, 60 - 59)\n\n3.9231627082752525e-18\n\n\n\n\n\nThe first argument is the difference between the null and residual deviances and the second argument is the difference in degrees of freedom between the null and residual models. All of these values can be lifted from the summary table.\nThis gives us a probability of pretty much zero. This value is doing a formal test to see whether our fitted model is significantly different from the null model. Here we can treat this a classical hypothesis test and since this p-value is less than 0.05 then we can say that our fitted model (with blength as a predictor variable) is definitely better than the null model (which has no predictor variables). Woohoo!\nAre any of the individual predictors significant?\nFinally, we’ll use the anova function from before to determine which predictor variables are important, and specifically in this case whether the beak length predictor is significant.\n\n\nR\nPython\n\n\n\n\nanova(glm_bks, test = \"Chisq\")\n\nAnalysis of Deviance Table\n\nModel: binomial, link: logit\n\nResponse: pointed_beak\n\nTerms added sequentially (first to last)\n\n Df Deviance Resid. Df Resid. Dev Pr(>Chi) \nNULL 60 84.548 \nblength 1 75.36 59 9.188 < 2.2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n\nThe anova() function is a true workhorse within R! This time we’ve used it to create an Analysis of Deviance table. This is exactly equivalent to an ordinary ANOVA table where we have rows corresponding to each predictor variable and a p-value telling us whether that variable is significant or not.\nThe p-value for the blength predictor is written under then Pr(>Chi) column and we can see that it is less than < 2.2e-16. So, beak length is a significant predictor.\nThis shouldn’t be surprising since we already saw that our overall model was better than the null model, which in this case is exactly the same as asking whether the beak length term is significant. However, in more complicated models with multiple predictors these two comparisons (and p-values) won’t be the same.\n\n\nAlas, for some inexplicable reason this is not (yet?) possible to do in Python. At least, unbeknownst to me…", "crumbs": [ "Binary responses", "7  Binary response" @@ -268,7 +268,7 @@ "href": "materials/glm-practical-logistic-binary.html#exercises", "title": "\n7  Binary response\n", "section": "\n7.8 Exercises", - "text": "7.8 Exercises\n\n7.8.1 Diabetes\n\n\n\n\n\n\nExercise\n\n\n\n\n\n\n\nLevel: \nFor this exercise we’ll be using the data from data/diabetes.csv.\nThis is a data set comprising 768 observations of three variables (one dependent and two predictor variables). 
This records the results of a diabetes test result as a binary variable (1 is a positive result, 0 is a negative result), along with the result of a glucose tolerance test and the diastolic blood pressure for each of 768 women. The variables are called test_result, glucose and diastolic.\nWe want to see if the glucose tolerance is a meaningful predictor for predictions on a diabetes test. To investigate this, do the following:\n\nLoad and visualise the data\nCreate a suitable model\nDetermine if there are any statistically significant predictors\nCalculate the probability of a positive diabetes test result for a glucose tolerance test value of glucose = 150\n\n\n\n\n\n\n\n\nAnswer\n\n\n\n\n\n\n\nLoad and visualise the data\nFirst we load the data, then we visualise it.\n\n\nR\nPython\n\n\n\n\ndiabetes <- read_csv(\"data/diabetes.csv\")\n\n\n\n\ndiabetes_py = pd.read_csv(\"data/diabetes.csv\")\n\n\n\n\nLooking at the data, we can see that the test_result column contains zeros and ones. These are yes/no test result outcomes and not actually numeric representations.\nWe’ll have to deal with this soon. For now, we can plot the data, by outcome:\n\n\nR\nPython\n\n\n\n\nggplot(diabetes,\n aes(x = factor(test_result),\n y = glucose)) +\n geom_boxplot()\n\n\n\n\n\n\n\n\n\nWe could just give Python the test_result data directly, but then it would view the values as numeric. Which doesn’t really work, because we have two groups as such: those with a negative diabetes test result, and those with a positive one.\nWe can force Python to temporarily covert the data to a factor, by making the test_result column an object type. We can do this directly inside the ggplot() function.\n\n(ggplot(diabetes_py,\n aes(x = diabetes_py.test_result.astype(object),\n y = \"glucose\")) +\n geom_boxplot())\n\n\n\n\n\n\n\n\n\n\nIt looks as though the patients with a positive diabetes test have slightly higher glucose levels than those with a negative diabetes test.\nWe can visualise that differently by plotting all the data points as a classic binary response plot:\n\n\nR\nPython\n\n\n\n\nggplot(diabetes,\n aes(x = glucose,\n y = test_result)) +\n geom_point()\n\n\n\n\n\n\n\n\n\n\n(ggplot(diabetes_py,\n aes(x = \"glucose\",\n y = \"test_result\")) +\n geom_point())\n\n\n\n\n\n\n\n\n\n\nCreate a suitable model\n\n\nR\nPython\n\n\n\nWe’ll use the glm() function to create a generalised linear model. Here we save the model in an object called glm_dia:\n\nglm_dia <- glm(test_result ~ glucose,\n family = binomial,\n data = diabetes)\n\nThe format of this function is similar to that used by the lm() function for linear models. The important difference is that we must specify the family of error distribution to use. For logistic regression we must set the family to binomial.\n\n\n\n# create a linear model\nmodel = smf.glm(formula = \"test_result ~ glucose\",\n family = sm.families.Binomial(),\n data = diabetes_py)\n# and get the fitted parameters of the model\nglm_dia_py = model.fit()\n\n\n\n\nLet’s look at the model parameters:\n\n\nR\nPython\n\n\n\n\nsummary(glm_dia)\n\n\nCall:\nglm(formula = test_result ~ glucose, family = binomial, data = diabetes)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) -5.611732 0.442289 -12.69 <2e-16 ***\nglucose 0.039510 0.003398 11.63 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n\n(Dispersion parameter for binomial family taken to be 1)\n\n Null deviance: 936.6 on 727 degrees of freedom\nResidual deviance: 752.2 on 726 degrees of freedom\nAIC: 756.2\n\nNumber of Fisher Scoring iterations: 4\n\n\n\n\n\nprint(glm_dia_py.summary())\n\n Generalized Linear Model Regression Results \n==============================================================================\nDep. Variable: test_result No. Observations: 728\nModel: GLM Df Residuals: 726\nModel Family: Binomial Df Model: 1\nLink Function: Logit Scale: 1.0000\nMethod: IRLS Log-Likelihood: -376.10\nDate: Thu, 01 Feb 2024 Deviance: 752.20\nTime: 07:29:45 Pearson chi2: 713.\nNo. Iterations: 4 Pseudo R-squ. (CS): 0.2238\nCovariance Type: nonrobust \n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept -5.6117 0.442 -12.688 0.000 -6.479 -4.745\nglucose 0.0395 0.003 11.628 0.000 0.033 0.046\n==============================================================================\n\n\n\n\n\nWe can see that glucose is a significant predictor for the test_result (the \\(p\\) value is much smaller than 0.05).\nKnowing this, we’re interested in the coefficients. We have an intercept of -5.61 and 0.0395 for glucose. We can use these coefficients to write a formula that describes the potential relationship between the probability of having a positive test result, dependent on the glucose tolerance level value:\n\\[ P(positive \\ test\\ result) = \\frac{\\exp(-5.61 + 0.04 \\times glucose)}{1 + \\exp(-5.61 + 0.04 \\times glucose)} \\]\nCalculating probabilities\nUsing the formula above, we can now calculate the probability of having a positive test result, for a given glucose value. If we do this for glucose = 150, we get the following:\n\n\nR\nPython\n\n\n\n\nexp(-5.61 + 0.04 * 150) / (1 + exp(-5.61 + 0.04 * 150))\n\n[1] 0.5962827\n\n\n\n\n\nmath.exp(-5.61 + 0.04 * 150) / (1 + math.exp(-5.61 + 0.04 * 150))\n\n0.5962826992967878\n\n\n\n\n\nThis tells us that the probability of having a positive diabetes test result, given a glucose tolerance level of 150 is around 60 %.", + "text": "7.8 Exercises\n\n7.8.1 Diabetes\n\n\n\n\n\n\nExercise\n\n\n\n\n\n\n\nLevel: \nFor this exercise we’ll be using the data from data/diabetes.csv.\nThis is a data set comprising 768 observations of three variables (one dependent and two predictor variables). This records the results of a diabetes test result as a binary variable (1 is a positive result, 0 is a negative result), along with the result of a glucose tolerance test and the diastolic blood pressure for each of 768 women. The variables are called test_result, glucose and diastolic.\nWe want to see if the glucose tolerance is a meaningful predictor for predictions on a diabetes test. To investigate this, do the following:\n\nLoad and visualise the data\nCreate a suitable model\nDetermine if there are any statistically significant predictors\nCalculate the probability of a positive diabetes test result for a glucose tolerance test value of glucose = 150\n\n\n\n\n\n\n\n\nAnswer\n\n\n\n\n\n\n\nLoad and visualise the data\nFirst we load the data, then we visualise it.\n\n\nR\nPython\n\n\n\n\ndiabetes <- read_csv(\"data/diabetes.csv\")\n\n\n\n\ndiabetes_py = pd.read_csv(\"data/diabetes.csv\")\n\n\n\n\nLooking at the data, we can see that the test_result column contains zeros and ones. 
These are yes/no test result outcomes and not actually numeric representations.\nWe’ll have to deal with this soon. For now, we can plot the data, by outcome:\n\n\nR\nPython\n\n\n\n\nggplot(diabetes,\n aes(x = factor(test_result),\n y = glucose)) +\n geom_boxplot()\n\n\n\n\n\n\n\n\n\nWe could just give Python the test_result data directly, but then it would view the values as numeric. Which doesn’t really work, because we have two groups as such: those with a negative diabetes test result, and those with a positive one.\nWe can force Python to temporarily covert the data to a factor, by making the test_result column an object type. We can do this directly inside the ggplot() function.\n\n(ggplot(diabetes_py,\n aes(x = diabetes_py.test_result.astype(object),\n y = \"glucose\")) +\n geom_boxplot())\n\n\n\n\n\n\n\n\n\n\nIt looks as though the patients with a positive diabetes test have slightly higher glucose levels than those with a negative diabetes test.\nWe can visualise that differently by plotting all the data points as a classic binary response plot:\n\n\nR\nPython\n\n\n\n\nggplot(diabetes,\n aes(x = glucose,\n y = test_result)) +\n geom_point()\n\n\n\n\n\n\n\n\n\n\n(ggplot(diabetes_py,\n aes(x = \"glucose\",\n y = \"test_result\")) +\n geom_point())\n\n\n\n\n\n\n\n\n\n\nCreate a suitable model\n\n\nR\nPython\n\n\n\nWe’ll use the glm() function to create a generalised linear model. Here we save the model in an object called glm_dia:\n\nglm_dia <- glm(test_result ~ glucose,\n family = binomial,\n data = diabetes)\n\nThe format of this function is similar to that used by the lm() function for linear models. The important difference is that we must specify the family of error distribution to use. For logistic regression we must set the family to binomial.\n\n\n\n# create a linear model\nmodel = smf.glm(formula = \"test_result ~ glucose\",\n family = sm.families.Binomial(),\n data = diabetes_py)\n# and get the fitted parameters of the model\nglm_dia_py = model.fit()\n\n\n\n\nLet’s look at the model parameters:\n\n\nR\nPython\n\n\n\n\nsummary(glm_dia)\n\n\nCall:\nglm(formula = test_result ~ glucose, family = binomial, data = diabetes)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) -5.611732 0.442289 -12.69 <2e-16 ***\nglucose 0.039510 0.003398 11.63 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for binomial family taken to be 1)\n\n Null deviance: 936.6 on 727 degrees of freedom\nResidual deviance: 752.2 on 726 degrees of freedom\nAIC: 756.2\n\nNumber of Fisher Scoring iterations: 4\n\n\n\n\n\nprint(glm_dia_py.summary())\n\n Generalized Linear Model Regression Results \n==============================================================================\nDep. Variable: test_result No. Observations: 728\nModel: GLM Df Residuals: 726\nModel Family: Binomial Df Model: 1\nLink Function: Logit Scale: 1.0000\nMethod: IRLS Log-Likelihood: -376.10\nDate: Fri, 14 Jun 2024 Deviance: 752.20\nTime: 08:41:57 Pearson chi2: 713.\nNo. Iterations: 4 Pseudo R-squ. 
(CS): 0.2238\nCovariance Type: nonrobust \n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept -5.6117 0.442 -12.688 0.000 -6.479 -4.745\nglucose 0.0395 0.003 11.628 0.000 0.033 0.046\n==============================================================================\n\n\n\n\n\nWe can see that glucose is a significant predictor for the test_result (the \\(p\\) value is much smaller than 0.05).\nKnowing this, we’re interested in the coefficients. We have an intercept of -5.61 and 0.0395 for glucose. We can use these coefficients to write a formula that describes the potential relationship between the probability of having a positive test result, dependent on the glucose tolerance level value:\n\\[ P(positive \\ test\\ result) = \\frac{\\exp(-5.61 + 0.04 \\times glucose)}{1 + \\exp(-5.61 + 0.04 \\times glucose)} \\]\nCalculating probabilities\nUsing the formula above, we can now calculate the probability of having a positive test result, for a given glucose value. If we do this for glucose = 150, we get the following:\n\n\nR\nPython\n\n\n\n\nexp(-5.61 + 0.04 * 150) / (1 + exp(-5.61 + 0.04 * 150))\n\n[1] 0.5962827\n\n\n\n\n\nmath.exp(-5.61 + 0.04 * 150) / (1 + math.exp(-5.61 + 0.04 * 150))\n\n0.5962826992967878\n\n\n\n\n\nThis tells us that the probability of having a positive diabetes test result, given a glucose tolerance level of 150 is around 60 %.", "crumbs": [ "Binary responses", "7  Binary response" @@ -301,7 +301,7 @@ "href": "materials/glm-practical-logistic-proportion.html#libraries-and-functions", "title": "\n8  Proportional response\n", "section": "", - "text": "Click to expand\n\n\n\n\n\n\n\nR\nPython\n\n\n\n\n8.1.1 Libraries\n\n8.1.2 Functions\n\n\n\n\n8.1.3 Libraries\n\n# A maths library\nimport math\n# A Python data analysis and manipulation tool\nimport pandas as pd\n\n# Python equivalent of `ggplot2`\nfrom plotnine import *\n\n# Statistical models, conducting tests and statistical data exploration\nimport statsmodels.api as sm\n\n# Convenience interface for specifying models using formula strings and DataFrames\nimport statsmodels.formula.api as smf\n\n# Needed for additional probability functionality\nfrom scipy.stats import *\n\n\n8.1.4 Functions", + "text": "Click to expand\n\n\n\n\n\n\n\nR\nPython\n\n\n\n\n8.1.1 Libraries\n\nlibrary(broom)\nlibrary(tidyverse)\nlibrary(ggResidpanel)\n\n\n8.1.2 Functions\n\n# create diagnostic plots\nggResidpanel::resid_panel()\n\n\n\n\n\n8.1.3 Libraries\n\n# A maths library\nimport math\n# A Python data analysis and manipulation tool\nimport pandas as pd\n\n# Python equivalent of `ggplot2`\nfrom plotnine import *\n\n# Statistical models, conducting tests and statistical data exploration\nimport statsmodels.api as sm\n\n# Convenience interface for specifying models using formula strings and DataFrames\nimport statsmodels.formula.api as smf\n\n# Needed for additional probability functionality\nfrom scipy.stats import *\n\n\n8.1.4 Functions", "crumbs": [ "Binary responses", "8  Proportional response" @@ -334,7 +334,7 @@ "href": "materials/glm-practical-logistic-proportion.html#model-output", "title": "\n8  Proportional response\n", "section": "\n8.4 Model output", - "text": "8.4 Model output\nThat’s the easy part done! The trickier part is interpreting the output. 
First of all, we’ll get some summary information.\n\n\nR\nPython\n\n\n\nNext, we can have a closer look at the results:\n\nsummary(glm_chl)\n\n\nCall:\nglm(formula = cbind(damage, intact) ~ temp, family = binomial, \n data = challenger)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) 11.66299 3.29626 3.538 0.000403 ***\ntemp -0.21623 0.05318 -4.066 4.78e-05 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for binomial family taken to be 1)\n\n Null deviance: 38.898 on 22 degrees of freedom\nResidual deviance: 16.912 on 21 degrees of freedom\nAIC: 33.675\n\nNumber of Fisher Scoring iterations: 6\n\n\nWe can see that the p-values of the intercept and temp are significant. We can also use the intercept and temp coefficients to construct the logistic equation, which we can use to sketch the logistic curve.\n\n\n\nprint(glm_chl_py.summary())\n\n Generalized Linear Model Regression Results \n================================================================================\nDep. Variable: ['damage', 'intact'] No. Observations: 23\nModel: GLM Df Residuals: 21\nModel Family: Binomial Df Model: 1\nLink Function: Logit Scale: 1.0000\nMethod: IRLS Log-Likelihood: -14.837\nDate: Tue, 06 Feb 2024 Deviance: 16.912\nTime: 16:12:12 Pearson chi2: 28.1\nNo. Iterations: 7 Pseudo R-squ. (CS): 0.6155\nCovariance Type: nonrobust \n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept 11.6630 3.296 3.538 0.000 5.202 18.124\ntemp -0.2162 0.053 -4.066 0.000 -0.320 -0.112\n==============================================================================\n\n\n\n\n\n\\[E(prop \\ failed\\ orings) = \\frac{\\exp{(11.66 - 0.22 \\times temp)}}{1 + \\exp{(11.66 - 0.22 \\times temp)}}\\]\nLet’s see how well our model would have performed if we would have fed it the data from the ill-fated Challenger launch.\n\n\nR\nPython\n\n\n\n\nggplot(challenger, aes(temp, prop_damaged)) +\n geom_point() +\n geom_smooth(method = \"glm\", se = FALSE, fullrange = TRUE, \n method.args = list(family = binomial)) +\n xlim(25,85)\n\nWarning in eval(family$initialize): non-integer #successes in a binomial glm!\n\n\n\n\n\n\n\n\n\n\nWe can get the predicted values for the model as follows:\n\nchallenger_py['predicted_values'] = glm_chl_py.predict()\n\nchallenger_py.head()\n\n temp damage total intact prop_damaged predicted_values\n0 53 5 6 1 0.833333 0.550479\n1 57 1 6 5 0.166667 0.340217\n2 58 1 6 5 0.166667 0.293476\n3 63 1 6 5 0.166667 0.123496\n4 66 0 6 6 0.000000 0.068598\n\n\nThis would only give us the predicted values for the data we already have. Instead we want to extrapolate to what would have been predicted for a wider range of temperatures. Here, we use a range of \\([25, 85]\\) degrees Fahrenheit.\n\nmodel = pd.DataFrame({'temp': list(range(25, 86))})\n\nmodel[\"pred\"] = glm_chl_py.predict(model)\n\nmodel.head()\n\n temp pred\n0 25 0.998087\n1 26 0.997626\n2 27 0.997055\n3 28 0.996347\n4 29 0.995469\n\n\n\n(ggplot(challenger_py,\n aes(x = \"temp\",\n y = \"prop_damaged\")) +\n geom_point() +\n geom_line(model, aes(x = \"temp\", y = \"pred\"), colour = \"blue\", size = 1))\n\n\n\n\n\n\n\n\n\n\n\n\n\nGenerating predicted values\n\n\n\n\n\n\n\nR\nPython\n\n\n\nAnother way of doing this it to generate a table with data for a range of temperatures, from 25 to 85 degrees Fahrenheit, in steps of 1. 
We can then use these data to generate the logistic curve, based on the fitted model.\n\n# create a table with sequential numbers ranging from 25 to 85\nmodel <- tibble(temp = seq(25, 85, by = 1)) %>% \n # add a new column containing the predicted values\n mutate(.pred = predict(glm_chl, newdata = ., type = \"response\"))\n\nggplot(model, aes(temp, .pred)) +\n geom_line()\n\n\n\n\n\n\n\n\n# plot the curve and the original data\nggplot(model, aes(temp, .pred)) +\n geom_line(colour = \"blue\") +\n geom_point(data = challenger, aes(temp, prop_damaged)) +\n # add a vertical line at the disaster launch temperature\n geom_vline(xintercept = 31, linetype = \"dashed\")\n\n\n\n\n\n\n\nIt seems that there was a high probability of both o-rings failing at that launch temperature. One thing that the graph shows is that there is a lot of uncertainty involved in this model. We can tell, because the fit of the line is very poor at the lower temperature range. There is just very little data to work on, with the data point at 53 F having a large influence on the fit.\n\n\nWe already did this above, since this is the most straightforward way of plotting the model in Python.", + "text": "8.4 Model output\nThat’s the easy part done! The trickier part is interpreting the output. First of all, we’ll get some summary information.\n\n\nR\nPython\n\n\n\nNext, we can have a closer look at the results:\n\nsummary(glm_chl)\n\n\nCall:\nglm(formula = cbind(damage, intact) ~ temp, family = binomial, \n data = challenger)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) 11.66299 3.29626 3.538 0.000403 ***\ntemp -0.21623 0.05318 -4.066 4.78e-05 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for binomial family taken to be 1)\n\n Null deviance: 38.898 on 22 degrees of freedom\nResidual deviance: 16.912 on 21 degrees of freedom\nAIC: 33.675\n\nNumber of Fisher Scoring iterations: 6\n\n\nWe can see that the p-values of the intercept and temp are significant. We can also use the intercept and temp coefficients to construct the logistic equation, which we can use to sketch the logistic curve.\n\n\n\nprint(glm_chl_py.summary())\n\n Generalized Linear Model Regression Results \n================================================================================\nDep. Variable: ['damage', 'intact'] No. Observations: 23\nModel: GLM Df Residuals: 21\nModel Family: Binomial Df Model: 1\nLink Function: Logit Scale: 1.0000\nMethod: IRLS Log-Likelihood: -14.837\nDate: Fri, 14 Jun 2024 Deviance: 16.912\nTime: 08:42:07 Pearson chi2: 28.1\nNo. Iterations: 7 Pseudo R-squ. 
(CS): 0.6155\nCovariance Type: nonrobust \n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept 11.6630 3.296 3.538 0.000 5.202 18.124\ntemp -0.2162 0.053 -4.066 0.000 -0.320 -0.112\n==============================================================================\n\n\n\n\n\n\\[E(prop \\ failed\\ orings) = \\frac{\\exp{(11.66 - 0.22 \\times temp)}}{1 + \\exp{(11.66 - 0.22 \\times temp)}}\\]\nLet’s see how well our model would have performed had we fed it the data from the ill-fated Challenger launch.\n\n\nR\nPython\n\n\n\n\nggplot(challenger, aes(temp, prop_damaged)) +\n geom_point() +\n geom_smooth(method = \"glm\", se = FALSE, fullrange = TRUE, \n method.args = list(family = binomial)) +\n xlim(25,85)\n\nWarning in eval(family$initialize): non-integer #successes in a binomial glm!\n\n\n\n\n\n\n\n\n\n\nWe can get the predicted values for the model as follows:\n\nchallenger_py['predicted_values'] = glm_chl_py.predict()\n\nchallenger_py.head()\n\n temp damage total intact prop_damaged predicted_values\n0 53 5 6 1 0.833333 0.550479\n1 57 1 6 5 0.166667 0.340217\n2 58 1 6 5 0.166667 0.293476\n3 63 1 6 5 0.166667 0.123496\n4 66 0 6 6 0.000000 0.068598\n\n\nThis would only give us the predicted values for the data we already have. Instead we want to extrapolate to what would have been predicted for a wider range of temperatures. Here, we use a range of \\([25, 85]\\) degrees Fahrenheit.\n\nmodel = pd.DataFrame({'temp': list(range(25, 86))})\n\nmodel[\"pred\"] = glm_chl_py.predict(model)\n\nmodel.head()\n\n temp pred\n0 25 0.998087\n1 26 0.997626\n2 27 0.997055\n3 28 0.996347\n4 29 0.995469\n\n\n\n(ggplot(challenger_py,\n aes(x = \"temp\",\n y = \"prop_damaged\")) +\n geom_point() +\n geom_line(model, aes(x = \"temp\", y = \"pred\"), colour = \"blue\", size = 1))\n\n\n\n\n\n\n\n\n\n\n\n\n\nGenerating predicted values\n\n\n\n\n\n\n\nR\nPython\n\n\n\nAnother way of doing this is to generate a table with data for a range of temperatures, from 25 to 85 degrees Fahrenheit, in steps of 1. We can then use these data to generate the logistic curve, based on the fitted model.\n\n# create a table with sequential numbers ranging from 25 to 85\nmodel <- tibble(temp = seq(25, 85, by = 1)) %>% \n # add a new column containing the predicted values\n mutate(.pred = predict(glm_chl, newdata = ., type = \"response\"))\n\nggplot(model, aes(temp, .pred)) +\n geom_line()\n\n\n\n\n\n\n\n\n# plot the curve and the original data\nggplot(model, aes(temp, .pred)) +\n geom_line(colour = \"blue\") +\n geom_point(data = challenger, aes(temp, prop_damaged)) +\n # add a vertical line at the disaster launch temperature\n geom_vline(xintercept = 31, linetype = \"dashed\")\n\n\n\n\n\n\n\nIt seems that there was a high probability of both o-rings failing at that launch temperature. One thing that the graph shows is that there is a lot of uncertainty involved in this model. We can tell because the fit of the line is very poor at the lower temperature range. 
There is just very little data to work on, with the data point at 53 F having a large influence on the fit.\n\n\nWe already did this above, since this is the most straightforward way of plotting the model in Python.", "crumbs": [ "Binary responses", "8  Proportional response" @@ -345,7 +345,7 @@ "href": "materials/glm-practical-logistic-proportion.html#exercises", "title": "\n8  Proportional response\n", "section": "\n8.5 Exercises", - "text": "8.5 Exercises\n\n8.5.1 Predicting failure\n\n\n\n\n\n\nExercise\n\n\n\n\n\n\n\nLevel: \nThe data point at 53 degrees Fahrenheit is quite influential for the analysis. Remove this data point and repeat the analysis. Is there still a predicted link between launch temperature and o-ring failure?\n\n\n\n\n\n\nAnswer\n\n\n\n\n\n\n\n\n\nR\nPython\n\n\n\nFirst, we need to remove the influential data point:\n\nchallenger_new <- challenger %>% filter(temp != 53)\n\nWe can create a new generalised linear model, based on these data:\n\nglm_chl_new <- glm(cbind(damage, intact) ~ temp,\n family = binomial,\n data = challenger_new)\n\nWe can get the model parameters as follows:\n\nsummary(glm_chl_new)\n\n\nCall:\nglm(formula = cbind(damage, intact) ~ temp, family = binomial, \n data = challenger_new)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) 5.68223 4.43138 1.282 0.1997 \ntemp -0.12817 0.06697 -1.914 0.0556 .\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for binomial family taken to be 1)\n\n Null deviance: 16.375 on 21 degrees of freedom\nResidual deviance: 12.633 on 20 degrees of freedom\nAIC: 27.572\n\nNumber of Fisher Scoring iterations: 5\n\n\n\nggplot(challenger_new, aes(temp, prop_damaged)) +\n geom_point() +\n geom_smooth(method = \"glm\", se = FALSE, fullrange = TRUE, \n method.args = list(family = binomial)) +\n xlim(25,85) +\n # add a vertical line at 53 F temperature\n geom_vline(xintercept = 53, linetype = \"dashed\")\n\nWarning in eval(family$initialize): non-integer #successes in a binomial glm!\n\n\n\n\n\n\n\n\nThe prediction proportion of damaged o-rings is markedly less than what was observed.\nBefore we can make any firm conclusions, though, we need to check our model:\n\n1- pchisq(12.633,20)\n\n[1] 0.8925695\n\n\nWe get quite a high score (around 0.9) for this, which tells us that our goodness of fit is pretty rubbish – our points are not very close to our curve, overall.\nIs the model any better than the null though?\n\n1 - pchisq(16.375 - 12.633, 21 - 20)\n\n[1] 0.0530609\n\nanova(glm_chl_new, test = 'Chisq')\n\nAnalysis of Deviance Table\n\nModel: binomial, link: logit\n\nResponse: cbind(damage, intact)\n\nTerms added sequentially (first to last)\n\n Df Deviance Resid. Df Resid. Dev Pr(>Chi) \nNULL 21 16.375 \ntemp 1 3.7421 20 12.633 0.05306 .\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n\n\nHowever, the model is not significantly better than the null in this case, with a p-value here of just over 0.05 for both of these tests (they give a similar result since, yet again, we have just the one predictor variable).\n\n\nFirst, we need to remove the influential data point:\n\nchallenger_new_py = challenger_py.query(\"temp != 53\")\n\nWe can create a new generalised linear model, based on these data:\n\n# create a generalised linear model\nmodel = smf.glm(formula = \"damage + intact ~ temp\",\n family = sm.families.Binomial(),\n data = challenger_new_py)\n# and get the fitted parameters of the model\nglm_chl_new_py = model.fit()\n\nWe can get the model parameters as follows:\n\nprint(glm_chl_new_py.summary())\n\n Generalized Linear Model Regression Results \n================================================================================\nDep. Variable: ['damage', 'intact'] No. Observations: 22\nModel: GLM Df Residuals: 20\nModel Family: Binomial Df Model: 1\nLink Function: Logit Scale: 1.0000\nMethod: IRLS Log-Likelihood: -11.786\nDate: Tue, 06 Feb 2024 Deviance: 12.633\nTime: 16:12:15 Pearson chi2: 16.6\nNo. Iterations: 6 Pseudo R-squ. (CS): 0.1564\nCovariance Type: nonrobust \n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept 5.6822 4.431 1.282 0.200 -3.003 14.368\ntemp -0.1282 0.067 -1.914 0.056 -0.259 0.003\n==============================================================================\n\n\nGenerate new model data:\n\nmodel = pd.DataFrame({'temp': list(range(25, 86))})\n\nmodel[\"pred\"] = glm_chl_new_py.predict(model)\n\nmodel.head()\n\n temp pred\n0 25 0.922585\n1 26 0.912920\n2 27 0.902177\n3 28 0.890269\n4 29 0.877107\n\n\n\n(ggplot(challenger_new_py,\n aes(x = \"temp\",\n y = \"prop_damaged\")) +\n geom_point() +\n geom_line(model, aes(x = \"temp\", y = \"pred\"), colour = \"blue\", size = 1) +\n # add a vertical line at 53 F temperature\n geom_vline(xintercept = 53, linetype = \"dashed\"))\n\n\n\n\n\n\n\nThe prediction proportion of damaged o-rings is markedly less than what was observed.\nBefore we can make any firm conclusions, though, we need to check our model:\n\nchi2.sf(12.633, 20)\n\n0.8925694610786307\n\n\nWe get quite a high score (around 0.9) for this, which tells us that our goodness of fit is pretty rubbish – our points are not very close to our curve, overall.\nIs the model any better than the null though?\nFirst we need to define the null model:\n\n# create a linear model\nmodel = smf.glm(formula = \"damage + intact ~ 1\",\n family = sm.families.Binomial(),\n data = challenger_new_py)\n# and get the fitted parameters of the model\nglm_chl_new_null_py = model.fit()\n\nprint(glm_chl_new_null_py.summary())\n\n Generalized Linear Model Regression Results \n================================================================================\nDep. Variable: ['damage', 'intact'] No. Observations: 22\nModel: GLM Df Residuals: 21\nModel Family: Binomial Df Model: 0\nLink Function: Logit Scale: 1.0000\nMethod: IRLS Log-Likelihood: -13.657\nDate: Tue, 06 Feb 2024 Deviance: 16.375\nTime: 16:12:16 Pearson chi2: 16.8\nNo. Iterations: 6 Pseudo R-squ. 
(CS): -2.220e-16\nCovariance Type: nonrobust \n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept -3.0445 0.418 -7.286 0.000 -3.864 -2.226\n==============================================================================\n\n\n\nchi2.sf(16.375 - 12.633, 21 - 20)\n\n0.053060897703157646\n\n\nHowever, the model is not significantly better than the null in this case, with a p-value here of just over 0.05 for both of these tests (they give a similar result since, yet again, we have just the one predictor variable).\n\n\n\nSo, could NASA have predicted what happened? This model is not significantly different from the null, i.e., temperature is not a significant predictor. Note that it’s only marginally non-significant, and we do have a high goodness-of-fit value.\nIt is possible that if more data points were available that followed a similar trend, the story might be different). Even if we did use our non-significant model to make a prediction, it doesn’t give us a value anywhere near 5 failures for a temperature of 53 degrees Fahrenheit. So overall, based on the model we’ve fitted with these data, there was no indication that a temperature just a few degrees cooler than previous missions could have been so disastrous for the Challenger.", + "text": "8.5 Exercises\n\n8.5.1 Predicting failure\n\n\n\n\n\n\nExercise\n\n\n\n\n\n\n\nLevel: \nThe data point at 53 degrees Fahrenheit is quite influential for the analysis. Remove this data point and repeat the analysis. Is there still a predicted link between launch temperature and o-ring failure?\n\n\n\n\n\n\nAnswer\n\n\n\n\n\n\n\n\n\nR\nPython\n\n\n\nFirst, we need to remove the influential data point:\n\nchallenger_new <- challenger %>% filter(temp != 53)\n\nWe can create a new generalised linear model, based on these data:\n\nglm_chl_new <- glm(cbind(damage, intact) ~ temp,\n family = binomial,\n data = challenger_new)\n\nWe can get the model parameters as follows:\n\nsummary(glm_chl_new)\n\n\nCall:\nglm(formula = cbind(damage, intact) ~ temp, family = binomial, \n data = challenger_new)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) 5.68223 4.43138 1.282 0.1997 \ntemp -0.12817 0.06697 -1.914 0.0556 .\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n\n(Dispersion parameter for binomial family taken to be 1)\n\n Null deviance: 16.375 on 21 degrees of freedom\nResidual deviance: 12.633 on 20 degrees of freedom\nAIC: 27.572\n\nNumber of Fisher Scoring iterations: 5\n\n\n\nggplot(challenger_new, aes(temp, prop_damaged)) +\n geom_point() +\n geom_smooth(method = \"glm\", se = FALSE, fullrange = TRUE, \n method.args = list(family = binomial)) +\n xlim(25,85) +\n # add a vertical line at 53 F temperature\n geom_vline(xintercept = 53, linetype = \"dashed\")\n\nWarning in eval(family$initialize): non-integer #successes in a binomial glm!\n\n\n\n\n\n\n\n\nThe predicted proportion of damaged o-rings is markedly less than what was observed.\nBefore we can make any firm conclusions, though, we need to check our model:\n\n1 - pchisq(12.633, 20)\n\n[1] 0.8925695\n\n\nWe get quite a high p-value (around 0.9) for this, which tells us that there is no evidence of a significant lack of fit – even though, visually, our points are not that close to our curve.\nIs the model any better than the null though?\n\n1 - pchisq(16.375 - 12.633, 21 - 20)\n\n[1] 0.0530609\n\nanova(glm_chl_new, test = 'Chisq')\n\nAnalysis of Deviance Table\n\nModel: binomial, link: logit\n\nResponse: cbind(damage, intact)\n\nTerms added sequentially (first to last)\n\n Df Deviance Resid. Df Resid. Dev Pr(>Chi) \nNULL 21 16.375 \ntemp 1 3.7421 20 12.633 0.05306 .\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n\nHowever, the model is not significantly better than the null in this case, with a p-value of just over 0.05 for both of these tests (they give a similar result since, yet again, we have just the one predictor variable).\n\n\nFirst, we need to remove the influential data point:\n\nchallenger_new_py = challenger_py.query(\"temp != 53\")\n\nWe can create a new generalised linear model, based on these data:\n\n# create a generalised linear model\nmodel = smf.glm(formula = \"damage + intact ~ temp\",\n family = sm.families.Binomial(),\n data = challenger_new_py)\n# and get the fitted parameters of the model\nglm_chl_new_py = model.fit()\n\nWe can get the model parameters as follows:\n\nprint(glm_chl_new_py.summary())\n\n Generalized Linear Model Regression Results \n================================================================================\nDep. Variable: ['damage', 'intact'] No. Observations: 22\nModel: GLM Df Residuals: 20\nModel Family: Binomial Df Model: 1\nLink Function: Logit Scale: 1.0000\nMethod: IRLS Log-Likelihood: -11.786\nDate: Fri, 14 Jun 2024 Deviance: 12.633\nTime: 08:42:08 Pearson chi2: 16.6\nNo. Iterations: 6 Pseudo R-squ. 
(CS): 0.1564\nCovariance Type: nonrobust \n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept 5.6822 4.431 1.282 0.200 -3.003 14.368\ntemp -0.1282 0.067 -1.914 0.056 -0.259 0.003\n==============================================================================\n\n\nGenerate new model data:\n\nmodel = pd.DataFrame({'temp': list(range(25, 86))})\n\nmodel[\"pred\"] = glm_chl_new_py.predict(model)\n\nmodel.head()\n\n temp pred\n0 25 0.922585\n1 26 0.912920\n2 27 0.902177\n3 28 0.890269\n4 29 0.877107\n\n\n\n(ggplot(challenger_new_py,\n aes(x = \"temp\",\n y = \"prop_damaged\")) +\n geom_point() +\n geom_line(model, aes(x = \"temp\", y = \"pred\"), colour = \"blue\", size = 1) +\n # add a vertical line at 53 F temperature\n geom_vline(xintercept = 53, linetype = \"dashed\"))\n\n\n\n\n\n\n\nThe predicted proportion of damaged o-rings is markedly less than what was observed.\nBefore we can make any firm conclusions, though, we need to check our model:\n\nchi2.sf(12.633, 20)\n\n0.8925694610786307\n\n\nWe get quite a high p-value (around 0.9) for this, which tells us that there is no evidence of a significant lack of fit – even though, visually, our points are not that close to our curve.\nIs the model any better than the null though?\nFirst we need to define the null model:\n\n# create a generalised linear model (the null model)\nmodel = smf.glm(formula = \"damage + intact ~ 1\",\n family = sm.families.Binomial(),\n data = challenger_new_py)\n# and get the fitted parameters of the model\nglm_chl_new_null_py = model.fit()\n\nprint(glm_chl_new_null_py.summary())\n\n Generalized Linear Model Regression Results \n================================================================================\nDep. Variable: ['damage', 'intact'] No. Observations: 22\nModel: GLM Df Residuals: 21\nModel Family: Binomial Df Model: 0\nLink Function: Logit Scale: 1.0000\nMethod: IRLS Log-Likelihood: -13.657\nDate: Fri, 14 Jun 2024 Deviance: 16.375\nTime: 08:42:08 Pearson chi2: 16.8\nNo. Iterations: 6 Pseudo R-squ. (CS): -2.220e-16\nCovariance Type: nonrobust \n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept -3.0445 0.418 -7.286 0.000 -3.864 -2.226\n==============================================================================\n\n\n\nchi2.sf(16.375 - 12.633, 21 - 20)\n\n0.053060897703157646\n\n\nHowever, the model is not significantly better than the null in this case, with a p-value of just over 0.05 for both of these tests (they give a similar result since, yet again, we have just the one predictor variable).\n\n\n\nSo, could NASA have predicted what happened? This model is not significantly different from the null, i.e., temperature is not a significant predictor. Note that it’s only marginally non-significant, and the goodness-of-fit check did not flag any problems with the model.\nIt is possible that, had more data points been available that followed a similar trend, the story might have been different. Even if we did use our non-significant model to make a prediction, it doesn’t give us a value anywhere near 5 failures for a temperature of 53 degrees Fahrenheit. 
So overall, based on the model we’ve fitted with these data, there was no indication that a temperature just a few degrees cooler than previous missions could have been so disastrous for the Challenger.", "crumbs": [ "Binary responses", "8  Proportional response" diff --git a/materials/glm-practical-logistic-binary.qmd b/materials/glm-practical-logistic-binary.qmd index c80b463..2658254 100644 --- a/materials/glm-practical-logistic-binary.qmd +++ b/materials/glm-practical-logistic-binary.qmd @@ -42,8 +42,22 @@ exec(open('setup_files/setup.py').read()) ## R ### Libraries + +```{r} +#| eval: false +library(broom) +library(tidyverse) +library(ggResidpanel) +``` + ### Functions +```{r} +#| eval: false +# create diagnostic plots +ggResidpanel::resid_panel() +``` + ## Python ### Libraries @@ -509,19 +523,6 @@ The graph shows us that, based on the data that we have and the model we used to Short beaks are more closely associated with the bluntly shaped beaks, whereas long beaks are more closely associated with the pointed shape. It's also clear that there is a range of beak lengths (around 13 mm) where the probability of getting one shape or another is much more even. - - - - - - - - - - - - - ## Assumptions As explained in the background chapter, we can't really use the standard diagnostic plots to assess assumptions. We're not going to go into a lot of detail for now, but there is one thing that we can do: look for influential points using the Cook’s distance plot. @@ -530,24 +531,8 @@ As explained in the background chapter, we can't really use the standard diagnos ## R ```{r} -plot(glm_bks , which = 4) -``` - -:::{.callout-note collapse=true} -## Extracting the Cook's distances from the `glm` object - -Instead of using the `plot()` function, we can also extract the values directly from the `glm` object. We can use the `augment()` function to do this and create a lollipop or stem plot: - -```{r} -glm_bks %>% - augment() %>% # get underlying data - select(.cooksd) %>% # select the Cook's d - mutate(obs = 1:n()) %>% # create an index column - ggplot(aes(x = obs, y = .cooksd)) + - geom_point() + - geom_segment(aes(x = obs, y = .cooksd, xend = obs, yend = 0)) +resid_panel(glm_bks, plots = "cookd") ``` -::: ## Python diff --git a/materials/glm-practical-logistic-proportion.qmd b/materials/glm-practical-logistic-proportion.qmd index b1e671e..99f9677 100644 --- a/materials/glm-practical-logistic-proportion.qmd +++ b/materials/glm-practical-logistic-proportion.qmd @@ -34,8 +34,22 @@ exec(open('setup_files/setup.py').read()) ## R ### Libraries + +```{r} +#| eval: false +library(broom) +library(tidyverse) +library(ggResidpanel) +``` + ### Functions +```{r} +#| eval: false +# create diagnostic plots +ggResidpanel::resid_panel() +``` + ## Python ### Libraries diff --git a/materials/images/dgplots/2024_06_14-08-35-57_AM_dgplots.png b/materials/images/dgplots/2024_06_14-08-35-57_AM_dgplots.png new file mode 100644 index 0000000..e848eae Binary files /dev/null and b/materials/images/dgplots/2024_06_14-08-35-57_AM_dgplots.png differ diff --git a/materials/images/dgplots/2024_06_14-08-41-54_AM_dgplots.png b/materials/images/dgplots/2024_06_14-08-41-54_AM_dgplots.png new file mode 100644 index 0000000..e848eae Binary files /dev/null and b/materials/images/dgplots/2024_06_14-08-41-54_AM_dgplots.png differ
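A note on the probability calculation in the chapter 7 glucose example above: rather than typing out the exponentials by hand, base R's built-in inverse-logit function can be used. This is a minimal sketch, assuming the intercept (-5.61) and glucose coefficient (0.0395) reported in the model summary; `plogis()` computes `exp(x) / (1 + exp(x))`.

```r
# probability of a positive test result for a range of glucose values,
# using the (rounded) coefficients reported in the model summary
beta_0 <- -5.61    # intercept
beta_1 <- 0.0395   # glucose coefficient

glucose <- c(100, 125, 150, 175)

# plogis(x) is exp(x) / (1 + exp(x)), the inverse of the logit link
p_positive <- plogis(beta_0 + beta_1 * glucose)

round(p_positive, 3)
```

Note that using 0.0395 rather than the more heavily rounded 0.04 gives a probability of roughly 0.58 at glucose = 150, which is why hand calculations with rounded coefficients can drift away from what `predict()` returns.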
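The dashed line at 31 degrees Fahrenheit in section 8.4 invites an obvious follow-up: asking the fitted model directly for the predicted failure proportion at the launch temperature. A short sketch, assuming the `glm_chl` object fitted earlier with `glm(cbind(damage, intact) ~ temp, family = binomial, data = challenger)`:

```r
# predicted proportion of damaged o-rings at the Challenger launch
# temperature of 31 F; type = "response" returns a proportion
# rather than log-odds
predict(glm_chl,
        newdata = data.frame(temp = 31),
        type = "response")
```

Plugging 31 into the logistic equation by hand, exp(11.66 - 0.22 * 31) / (1 + exp(11.66 - 0.22 * 31)), gives a value close to 1, matching the near-certain failure that the extrapolated curve suggests at that temperature.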
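The two deviance-based checks used in the exercise (goodness of fit, and comparison against the null model) follow the same pattern for any binomial GLM, so they can be wrapped up for reuse. This is a sketch rather than part of the course code; the helper names are made up, and it assumes a model fitted with `glm()`:

```r
# p-value for the goodness-of-fit check: residual deviance compared
# against a chi-squared distribution on the residual degrees of freedom;
# a large p-value means no evidence of lack of fit
gof_pvalue <- function(model) {
  pchisq(deviance(model), df.residual(model), lower.tail = FALSE)
}

# p-value for the comparison against the null model: the drop in
# deviance compared against a chi-squared distribution on the
# difference in degrees of freedom
null_comparison_pvalue <- function(model) {
  pchisq(model$null.deviance - deviance(model),
         model$df.null - df.residual(model),
         lower.tail = FALSE)
}
```

For the model without the 53 F data point, `gof_pvalue(glm_chl_new)` should reproduce the 0.893 above and `null_comparison_pvalue(glm_chl_new)` the 0.053, since `pchisq(..., lower.tail = FALSE)` is equivalent to the `1 - pchisq(...)` calls.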