diff --git a/_freeze/materials/glm-practical-logistic-proportion/execute-results/html.json b/_freeze/materials/glm-practical-logistic-proportion/execute-results/html.json index 6622bd8..f285248 100644 --- a/_freeze/materials/glm-practical-logistic-proportion/execute-results/html.json +++ b/_freeze/materials/glm-practical-logistic-proportion/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "6fcde4ac55ce6bd3230fc1ec75513c59", + "hash": "173861ccb267f6eb833aef4954211086", "result": { "engine": "knitr", - "markdown": "---\ntitle: \"Proportional response\"\n---\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n\n\n::: {.callout-tip}\n## Learning outcomes\n\n- How do I analyse proportion responses?\n- Be able to create a logistic model to test proportion response variables\n- Be able to plot the data and fitted curve\n- Assess the significance of the fit\n:::\n\n## Libraries and functions\n\n::: {.callout-note collapse=\"true\"}\n## Click to expand\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n### Libraries\n### Functions\n\n## Python\n\n### Libraries\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# A maths library\nimport math\n# A Python data analysis and manipulation tool\nimport pandas as pd\n\n# Python equivalent of `ggplot2`\nfrom plotnine import *\n\n# Statistical models, conducting tests and statistical data exploration\nimport statsmodels.api as sm\n\n# Convenience interface for specifying models using formula strings and DataFrames\nimport statsmodels.formula.api as smf\n\n# Needed for additional probability functionality\nfrom scipy.stats import *\n```\n:::\n\n\n### Functions\n:::\n:::\n\nThe example in this section uses the following data set:\n\n`data/challenger.csv`\n\nThese data, obtained from the [faraway package](https://www.rdocumentation.org/packages/faraway/versions/1.0.7) in R, contain information related to the explosion of the USA Space Shuttle Challenger on 28 January, 1986. 
An investigation after the disaster traced back to certain joints on one of the two solid booster rockets, each containing O-rings that ensured no exhaust gases could escape from the booster.\n\nThe night before the launch was unusually cold, with temperatures below freezing. The final report suggested that the cold snap during the night made the o-rings stiff, and unable to adjust to changes in pressure. As a result, exhaust gases leaked away from the solid booster rockets, causing one of them to break loose and rupture the main fuel tank, leading to the final explosion.\n\nThe question we're trying to answer in this session is: based on the data from the previous flights, would it have been possible to predict the failure of most o-rings on the Challenger flight?\n\n## Load and visualise the data\n\nFirst we load the data, then we visualise it.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchallenger <- read_csv(\"data/challenger.csv\")\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nRows: 23 Columns: 2\n── Column specification ────────────────────────────────────────────────────────\nDelimiter: \",\"\ndbl (2): temp, damage\n\nℹ Use `spec()` to retrieve the full column specification for this data.\nℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\n```\n\n\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nchallenger_py = pd.read_csv(\"data/challenger.csv\")\n```\n:::\n\n\n:::\n\nThe data set contains several columns:\n\n1. `temp`, the launch temperature in degrees Fahrenheit\n2. 
`damage`, the number of o-rings that showed erosion\n\nBefore we have a further look at the data, let's calculate the proportion of damaged o-rings (`prop_damaged`) and the total number of o-rings (`total`) and update our data set.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchallenger <-\nchallenger %>%\n mutate(total = 6, # total number of o-rings\n intact = 6 - damage, # number of undamaged o-rings\n prop_damaged = damage / total) # proportion damaged o-rings\n\nchallenger\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 23 × 5\n temp damage total intact prop_damaged\n \n 1 53 5 6 1 0.833\n 2 57 1 6 5 0.167\n 3 58 1 6 5 0.167\n 4 63 1 6 5 0.167\n 5 66 0 6 6 0 \n 6 67 0 6 6 0 \n 7 67 0 6 6 0 \n 8 67 0 6 6 0 \n 9 68 0 6 6 0 \n10 69 0 6 6 0 \n# ℹ 13 more rows\n```\n\n\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nchallenger_py['total'] = 6\nchallenger_py['intact'] = challenger_py['total'] - challenger_py['damage']\nchallenger_py['prop_damaged'] = challenger_py['damage'] / challenger_py['total']\n```\n:::\n\n\n:::\n\nPlotting the proportion of damaged o-rings against the launch temperature shows the following picture:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(challenger, aes(x = temp, y = prop_damaged)) +\n geom_point()\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-proportion_files/figure-html/unnamed-chunk-8-1.png){width=672}\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(challenger_py,\n aes(x = \"temp\",\n y = \"prop_damaged\")) +\n geom_point())\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-proportion_files/figure-html/unnamed-chunk-9-1.png){width=614}\n:::\n:::\n\n\n:::\n\nThe point on the left is the data point corresponding to the coldest flight experienced before the disaster, where five damaged o-rings were found. 
Fortunately, this did not result in a disaster.\n\nHere we'll explore if we could have reasonably predicted the failure of both o-rings on the Challenger flight, where the launch temperature was 31 degrees Fahrenheit.\n\n## Creating a suitable model\n\nWe only have 23 data points in total. So we're building a model on not that much data - we should keep this in mind when we draw our conclusions!\n\nWe are using a logistic regression for a proportion response in this case, since we're interested in the proportion of o-rings that are damaged.\n\nWe can define this as follows:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nglm_chl <- glm(cbind(damage, intact) ~ temp,\n family = binomial,\n data = challenger)\n```\n:::\n\n\nDefining the relationship for proportion responses is a bit annoying, where you have to give the `glm` model a two-column matrix to specify the response variable.\n\nHere, the first column corresponds to the number of damaged o-rings, whereas the second column refers to the number of intact o-rings. We use the `cbind()` function to bind these two together into a matrix.\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# create a generalised linear model\nmodel = smf.glm(formula = \"damage + intact ~ temp\",\n family = sm.families.Binomial(),\n data = challenger_py)\n# and get the fitted parameters of the model\nglm_chl_py = model.fit()\n```\n:::\n\n\n:::\n\n## Model output\n\nThat's the easy part done! The trickier part is interpreting the output. First of all, we'll get some summary information.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nNext, we can have a closer look at the results:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(glm_chl)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nglm(formula = cbind(damage, intact) ~ temp, family = binomial, \n data = challenger)\n\nCoefficients:\n Estimate Std. 
Error z value Pr(>|z|) \n(Intercept) 11.66299 3.29626 3.538 0.000403 ***\ntemp -0.21623 0.05318 -4.066 4.78e-05 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for binomial family taken to be 1)\n\n Null deviance: 38.898 on 22 degrees of freedom\nResidual deviance: 16.912 on 21 degrees of freedom\nAIC: 33.675\n\nNumber of Fisher Scoring iterations: 6\n```\n\n\n:::\n:::\n\n\nWe can see that the p-values of the `intercept` and `temp` are significant. We can also use the intercept and `temp` coefficients to construct the logistic equation, which we can use to sketch the logistic curve.\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nprint(glm_chl_py.summary())\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n Generalized Linear Model Regression Results \n================================================================================\nDep. Variable: ['damage', 'intact'] No. Observations: 23\nModel: GLM Df Residuals: 21\nModel Family: Binomial Df Model: 1\nLink Function: Logit Scale: 1.0000\nMethod: IRLS Log-Likelihood: -14.837\nDate: Tue, 06 Feb 2024 Deviance: 16.912\nTime: 15:43:32 Pearson chi2: 28.1\nNo. Iterations: 7 Pseudo R-squ. 
(CS): 0.6155\nCovariance Type: nonrobust \n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept 11.6630 3.296 3.538 0.000 5.202 18.124\ntemp -0.2162 0.053 -4.066 0.000 -0.320 -0.112\n==============================================================================\n```\n\n\n:::\n:::\n\n:::\n\n$$E(prop \\ failed\\ orings) = \\frac{\\exp{(11.66 - 0.22 \\times temp)}}{1 + \\exp{(11.66 - 0.22 \\times temp)}}$$\n\nLet's see how well our model would have performed if we would have fed it the data from the ill-fated Challenger launch.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(challenger, aes(temp, prop_damaged)) +\n geom_point() +\n geom_smooth(method = \"glm\", se = FALSE, fullrange = TRUE, \n method.args = list(family = binomial)) +\n xlim(25,85)\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in eval(family$initialize): non-integer #successes in a binomial glm!\n```\n\n\n:::\n\n::: {.cell-output-display}\n![](glm-practical-logistic-proportion_files/figure-html/unnamed-chunk-14-1.png){width=672}\n:::\n:::\n\n\n## Python\n\nWe can get the predicted values for the model as follows:\n\n\n::: {.cell}\n\n```{.python .cell-code}\nchallenger_py['predicted_values'] = glm_chl_py.predict()\n\nchallenger_py.head()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n temp damage total intact prop_damaged predicted_values\n0 53 5 6 1 0.833333 0.550479\n1 57 1 6 5 0.166667 0.340217\n2 58 1 6 5 0.166667 0.293476\n3 63 1 6 5 0.166667 0.123496\n4 66 0 6 6 0.000000 0.068598\n```\n\n\n:::\n:::\n\n\nThis would only give us the predicted values for the data we already have. Instead we want to extrapolate to what would have been predicted for a wider range of temperatures. 
Here, we use a range of $[25, 85]$ degrees Fahrenheit.\n\n\n::: {.cell}\n\n```{.python .cell-code}\nmodel = pd.DataFrame({'temp': list(range(25, 86))})\n\nmodel[\"pred\"] = glm_chl_py.predict(model)\n\nmodel.head()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n temp pred\n0 25 0.998087\n1 26 0.997626\n2 27 0.997055\n3 28 0.996347\n4 29 0.995469\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(challenger_py,\n aes(x = \"temp\",\n y = \"prop_damaged\")) +\n geom_point() +\n geom_line(model, aes(x = \"temp\", y = \"pred\"), colour = \"blue\", size = 1))\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-proportion_files/figure-html/unnamed-chunk-17-1.png){width=614}\n:::\n:::\n\n\n\n::: {.callout-note collapse=true}\n## Generating predicted values\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nAnother way of doing this it to generate a table with data for a range of temperatures, from 25 to 85 degrees Fahrenheit, in steps of 1. We can then use these data to generate the logistic curve, based on the fitted model.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# create a table with sequential numbers ranging from 25 to 85\nmodel <- tibble(temp = seq(25, 85, by = 1)) %>% \n # add a new column containing the predicted values\n mutate(.pred = predict(glm_chl, newdata = ., type = \"response\"))\n\nggplot(model, aes(temp, .pred)) +\n geom_line()\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-proportion_files/figure-html/unnamed-chunk-18-3.png){width=672}\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\n# plot the curve and the original data\nggplot(model, aes(temp, .pred)) +\n geom_line(colour = \"blue\") +\n geom_point(data = challenger, aes(temp, prop_damaged)) +\n # add a vertical line at the disaster launch temperature\n geom_vline(xintercept = 31, linetype = \"dashed\")\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-proportion_files/figure-html/unnamed-chunk-19-1.png){width=672}\n:::\n:::\n\n\nIt 
seems that there was a high probability of both o-rings failing at that launch temperature. One thing that the graph shows is that there is a lot of uncertainty involved in this model. We can tell, because the fit of the line is very poor at the lower temperature range. There is just very little data to work on, with the data point at 53 F having a large influence on the fit.\n\n## Python\n\nWe already did this above, since this is the most straightforward way of plotting the model in Python.\n\n:::\n:::\n:::\n\n## Exercises\n\n### Predicting failure {#sec-exr_failure}\n\n:::{.callout-exercise}\n\n\n{{< level 2 >}}\n\n\n\nThe data point at 53 degrees Fahrenheit is quite influential for the analysis. Remove this data point and repeat the analysis. Is there still a predicted link between launch temperature and o-ring failure?\n\n::: {.callout-answer collapse=\"true\"}\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nFirst, we need to remove the influential data point:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchallenger_new <- challenger %>% filter(temp != 53)\n```\n:::\n\n\nWe can create a new generalised linear model, based on these data:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nglm_chl_new <- glm(cbind(damage, intact) ~ temp,\n family = binomial,\n data = challenger_new)\n```\n:::\n\n\nWe can get the model parameters as follows:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(glm_chl_new)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nglm(formula = cbind(damage, intact) ~ temp, family = binomial, \n data = challenger_new)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) 5.68223 4.43138 1.282 0.1997 \ntemp -0.12817 0.06697 -1.914 0.0556 .\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n\n(Dispersion parameter for binomial family taken to be 1)\n\n Null deviance: 16.375 on 21 degrees of freedom\nResidual deviance: 12.633 on 20 degrees of freedom\nAIC: 27.572\n\nNumber of Fisher Scoring iterations: 5\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(challenger_new, aes(temp, prop_damaged)) +\n geom_point() +\n geom_smooth(method = \"glm\", se = FALSE, fullrange = TRUE, \n method.args = list(family = binomial)) +\n xlim(25,85) +\n # add a vertical line at 53 F temperature\n geom_vline(xintercept = 53, linetype = \"dashed\")\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in eval(family$initialize): non-integer #successes in a binomial glm!\n```\n\n\n:::\n\n::: {.cell-output-display}\n![](glm-practical-logistic-proportion_files/figure-html/unnamed-chunk-23-1.png){width=672}\n:::\n:::\n\n\nThe prediction proportion of damaged o-rings is markedly less than what was observed.\n\nBefore we can make any firm conclusions, though, we need to check our model:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n1- pchisq(12.633,20)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0.8925695\n```\n\n\n:::\n:::\n\n\nWe get quite a high score (around 0.9) for this, which tells us that our goodness of fit is pretty rubbish – our points are not very close to our curve, overall.\n\nIs the model any better than the null though?\n\n\n::: {.cell}\n\n```{.r .cell-code}\n1 - pchisq(16.375 - 12.633, 1)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0.0530609\n```\n\n\n:::\n\n```{.r .cell-code}\nanova(glm_chl_new, test = 'Chisq')\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nAnalysis of Deviance Table\n\nModel: binomial, link: logit\n\nResponse: cbind(damage, intact)\n\nTerms added sequentially (first to last)\n\n Df Deviance Resid. Df Resid. Dev Pr(>Chi) \nNULL 21 16.375 \ntemp 1 3.7421 20 12.633 0.05306 .\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n```\n\n\n:::\n:::\n\n\nHowever, the model is not significantly better than the null in this case, with a p-value here of just over 0.05 for both of these tests (they give a similar result since, yet again, we have just the one predictor variable).\n\n## Python\n\nFirst, we need to remove the influential data point:\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nchallenger_new_py = challenger_py.query(\"temp != 53\")\n```\n:::\n\n\nWe can create a new generalised linear model, based on these data:\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# create a generalised linear model\nmodel = smf.glm(formula = \"damage + intact ~ temp\",\n family = sm.families.Binomial(),\n data = challenger_new_py)\n# and get the fitted parameters of the model\nglm_chl_new_py = model.fit()\n```\n:::\n\n\nWe can get the model parameters as follows:\n\n\n::: {.cell}\n\n```{.python .cell-code}\nprint(glm_chl_new_py.summary())\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n Generalized Linear Model Regression Results \n================================================================================\nDep. Variable: ['damage', 'intact'] No. Observations: 22\nModel: GLM Df Residuals: 20\nModel Family: Binomial Df Model: 1\nLink Function: Logit Scale: 1.0000\nMethod: IRLS Log-Likelihood: -11.786\nDate: Tue, 06 Feb 2024 Deviance: 12.633\nTime: 15:43:35 Pearson chi2: 16.6\nNo. Iterations: 6 Pseudo R-squ. 
(CS): 0.1564\nCovariance Type: nonrobust \n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept 5.6822 4.431 1.282 0.200 -3.003 14.368\ntemp -0.1282 0.067 -1.914 0.056 -0.259 0.003\n==============================================================================\n```\n\n\n:::\n:::\n\n\nGenerate new model data:\n\n\n::: {.cell}\n\n```{.python .cell-code}\nmodel = pd.DataFrame({'temp': list(range(25, 86))})\n\nmodel[\"pred\"] = glm_chl_new_py.predict(model)\n\nmodel.head()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n temp pred\n0 25 0.922585\n1 26 0.912920\n2 27 0.902177\n3 28 0.890269\n4 29 0.877107\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(challenger_new_py,\n aes(x = \"temp\",\n y = \"prop_damaged\")) +\n geom_point() +\n geom_line(model, aes(x = \"temp\", y = \"pred\"), colour = \"blue\", size = 1) +\n # add a vertical line at 53 F temperature\n geom_vline(xintercept = 53, linetype = \"dashed\"))\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-proportion_files/figure-html/unnamed-chunk-30-1.png){width=614}\n:::\n:::\n\n\nThe prediction proportion of damaged o-rings is markedly less than what was observed.\n\nBefore we can make any firm conclusions, though, we need to check our model:\n\n\n::: {.cell}\n\n```{.python .cell-code}\nchi2.sf(12.633, 20)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n0.8925694610786307\n```\n\n\n:::\n:::\n\n\nWe get quite a high score (around 0.9) for this, which tells us that our goodness of fit is pretty rubbish – our points are not very close to our curve, overall.\n\nIs the model any better than the null though?\n\n\n::: {.cell}\n\n```{.python .cell-code}\nchi2.sf(16.375 - 12.633, 23 - 22)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n0.053060897703157646\n```\n\n\n:::\n:::\n\n\nHowever, the model is not 
significantly better than the null in this case, with a p-value here of just over 0.05 for both of these tests (they give a similar result since, yet again, we have just the one predictor variable).\n:::\n\nSo, could NASA have predicted what happened? This model is not significantly different from the null, i.e., temperature is not a significant predictor. Note that it’s only marginally non-significant, and we do have a high goodness-of-fit value.\n\nIt is possible that if more data points were available that followed a similar trend, the story might be different). Even if we did use our non-significant model to make a prediction, it doesn’t give us a value anywhere near 5 failures for a temperature of 53 degrees Fahrenheit. So overall, based on the model we’ve fitted with these data, there was no indication that a temperature just a few degrees cooler than previous missions could have been so disastrous for the Challenger.\n:::\n:::\n\n## Summary\n\n::: {.callout-tip}\n#### Key points\n\n- We can use a logistic model for proportion response variables\n\n:::\n", + "markdown": "---\ntitle: \"Proportional response\"\n---\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n\n\n::: {.callout-tip}\n## Learning outcomes\n\n- How do I analyse proportion responses?\n- Be able to create a logistic model to test proportion response variables\n- Be able to plot the data and fitted curve\n- Assess the significance of the fit\n:::\n\n## Libraries and functions\n\n::: {.callout-note collapse=\"true\"}\n## Click to expand\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n### Libraries\n### Functions\n\n## Python\n\n### Libraries\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# A maths library\nimport math\n# A Python data analysis and manipulation tool\nimport pandas as pd\n\n# Python equivalent of `ggplot2`\nfrom plotnine import *\n\n# Statistical models, conducting tests and statistical data exploration\nimport statsmodels.api as sm\n\n# Convenience interface for specifying models 
using formula strings and DataFrames\nimport statsmodels.formula.api as smf\n\n# Needed for additional probability functionality\nfrom scipy.stats import *\n```\n:::\n\n\n### Functions\n:::\n:::\n\nThe example in this section uses the following data set:\n\n`data/challenger.csv`\n\nThese data, obtained from the [faraway package](https://www.rdocumentation.org/packages/faraway/versions/1.0.7) in R, contain information related to the explosion of the USA Space Shuttle Challenger on 28 January, 1986. An investigation after the disaster traced back to certain joints on one of the two solid booster rockets, each containing O-rings that ensured no exhaust gases could escape from the booster.\n\nThe night before the launch was unusually cold, with temperatures below freezing. The final report suggested that the cold snap during the night made the o-rings stiff, and unable to adjust to changes in pressure. As a result, exhaust gases leaked away from the solid booster rockets, causing one of them to break loose and rupture the main fuel tank, leading to the final explosion.\n\nThe question we're trying to answer in this session is: based on the data from the previous flights, would it have been possible to predict the failure of most o-rings on the Challenger flight?\n\n## Load and visualise the data\n\nFirst we load the data, then we visualise it.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchallenger <- read_csv(\"data/challenger.csv\")\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nRows: 23 Columns: 2\n── Column specification ────────────────────────────────────────────────────────\nDelimiter: \",\"\ndbl (2): temp, damage\n\nℹ Use `spec()` to retrieve the full column specification for this data.\nℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\n```\n\n\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nchallenger_py = 
pd.read_csv(\"data/challenger.csv\")\n```\n:::\n\n\n:::\n\nThe data set contains several columns:\n\n1. `temp`, the launch temperature in degrees Fahrenheit\n2. `damage`, the number of o-rings that showed erosion\n\nBefore we have a further look at the data, let's calculate the proportion of damaged o-rings (`prop_damaged`) and the total number of o-rings (`total`) and update our data set.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchallenger <-\nchallenger %>%\n mutate(total = 6, # total number of o-rings\n intact = 6 - damage, # number of undamaged o-rings\n prop_damaged = damage / total) # proportion damaged o-rings\n\nchallenger\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 23 × 5\n temp damage total intact prop_damaged\n \n 1 53 5 6 1 0.833\n 2 57 1 6 5 0.167\n 3 58 1 6 5 0.167\n 4 63 1 6 5 0.167\n 5 66 0 6 6 0 \n 6 67 0 6 6 0 \n 7 67 0 6 6 0 \n 8 67 0 6 6 0 \n 9 68 0 6 6 0 \n10 69 0 6 6 0 \n# ℹ 13 more rows\n```\n\n\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nchallenger_py['total'] = 6\nchallenger_py['intact'] = challenger_py['total'] - challenger_py['damage']\nchallenger_py['prop_damaged'] = challenger_py['damage'] / challenger_py['total']\n```\n:::\n\n\n:::\n\nPlotting the proportion of damaged o-rings against the launch temperature shows the following picture:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(challenger, aes(x = temp, y = prop_damaged)) +\n geom_point()\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-proportion_files/figure-html/unnamed-chunk-8-1.png){width=672}\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(challenger_py,\n aes(x = \"temp\",\n y = \"prop_damaged\")) +\n geom_point())\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-proportion_files/figure-html/unnamed-chunk-9-1.png){width=614}\n:::\n:::\n\n\n:::\n\nThe point on the left is the 
data point corresponding to the coldest flight experienced before the disaster, where five damaged o-rings were found. Fortunately, this did not result in a disaster.\n\nHere we'll explore if we could have reasonably predicted the failure of both o-rings on the Challenger flight, where the launch temperature was 31 degrees Fahrenheit.\n\n## Creating a suitable model\n\nWe only have 23 data points in total. So we're building a model on not that much data - we should keep this in mind when we draw our conclusions!\n\nWe are using a logistic regression for a proportion response in this case, since we're interested in the proportion of o-rings that are damaged.\n\nWe can define this as follows:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nglm_chl <- glm(cbind(damage, intact) ~ temp,\n family = binomial,\n data = challenger)\n```\n:::\n\n\nDefining the relationship for proportion responses is a bit annoying, where you have to give the `glm` model a two-column matrix to specify the response variable.\n\nHere, the first column corresponds to the number of damaged o-rings, whereas the second column refers to the number of intact o-rings. We use the `cbind()` function to bind these two together into a matrix.\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# create a generalised linear model\nmodel = smf.glm(formula = \"damage + intact ~ temp\",\n family = sm.families.Binomial(),\n data = challenger_py)\n# and get the fitted parameters of the model\nglm_chl_py = model.fit()\n```\n:::\n\n\n:::\n\n## Model output\n\nThat's the easy part done! The trickier part is interpreting the output. 
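As a reminder of what the model is doing under the hood: the binomial GLM models the logit (log-odds) of the damage proportion as a straight line in temperature, and predictions are mapped back through the inverse logit, which always lands in $(0, 1)$. A minimal sketch of that link pair, using only the standard library (the function names are just illustrative):

```python
import math

def logit(p):
    # link function: proportion -> log-odds
    return math.log(p / (1 - p))

def inv_logit(x):
    # inverse link: linear predictor -> proportion, always strictly between 0 and 1
    return 1 / (1 + math.exp(-x))

# the two functions undo each other, and a log-odds of 0 maps to a proportion of 0.5
p = inv_logit(logit(0.25))
half = inv_logit(0)
print(p, half)
```

This inverse-logit mapping is exactly what `type = "response"` in R and `.predict()` in `statsmodels` apply to the fitted linear predictor.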
First of all, we'll get some summary information.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nNext, we can have a closer look at the results:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(glm_chl)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nglm(formula = cbind(damage, intact) ~ temp, family = binomial, \n data = challenger)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) 11.66299 3.29626 3.538 0.000403 ***\ntemp -0.21623 0.05318 -4.066 4.78e-05 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for binomial family taken to be 1)\n\n Null deviance: 38.898 on 22 degrees of freedom\nResidual deviance: 16.912 on 21 degrees of freedom\nAIC: 33.675\n\nNumber of Fisher Scoring iterations: 6\n```\n\n\n:::\n:::\n\n\nWe can see that the p-values of the `intercept` and `temp` are significant. We can also use the intercept and `temp` coefficients to construct the logistic equation, which we can use to sketch the logistic curve.\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nprint(glm_chl_py.summary())\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n Generalized Linear Model Regression Results \n================================================================================\nDep. Variable: ['damage', 'intact'] No. Observations: 23\nModel: GLM Df Residuals: 21\nModel Family: Binomial Df Model: 1\nLink Function: Logit Scale: 1.0000\nMethod: IRLS Log-Likelihood: -14.837\nDate: Tue, 06 Feb 2024 Deviance: 16.912\nTime: 16:12:12 Pearson chi2: 28.1\nNo. Iterations: 7 Pseudo R-squ. 
(CS): 0.6155\nCovariance Type: nonrobust \n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept 11.6630 3.296 3.538 0.000 5.202 18.124\ntemp -0.2162 0.053 -4.066 0.000 -0.320 -0.112\n==============================================================================\n```\n\n\n:::\n:::\n\n:::\n\n$$E(prop \\ failed\\ orings) = \\frac{\\exp{(11.66 - 0.22 \\times temp)}}{1 + \\exp{(11.66 - 0.22 \\times temp)}}$$\n\nLet's see how well our model would have performed if we would have fed it the data from the ill-fated Challenger launch.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(challenger, aes(temp, prop_damaged)) +\n geom_point() +\n geom_smooth(method = \"glm\", se = FALSE, fullrange = TRUE, \n method.args = list(family = binomial)) +\n xlim(25,85)\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in eval(family$initialize): non-integer #successes in a binomial glm!\n```\n\n\n:::\n\n::: {.cell-output-display}\n![](glm-practical-logistic-proportion_files/figure-html/unnamed-chunk-14-1.png){width=672}\n:::\n:::\n\n\n## Python\n\nWe can get the predicted values for the model as follows:\n\n\n::: {.cell}\n\n```{.python .cell-code}\nchallenger_py['predicted_values'] = glm_chl_py.predict()\n\nchallenger_py.head()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n temp damage total intact prop_damaged predicted_values\n0 53 5 6 1 0.833333 0.550479\n1 57 1 6 5 0.166667 0.340217\n2 58 1 6 5 0.166667 0.293476\n3 63 1 6 5 0.166667 0.123496\n4 66 0 6 6 0.000000 0.068598\n```\n\n\n:::\n:::\n\n\nThis would only give us the predicted values for the data we already have. Instead we want to extrapolate to what would have been predicted for a wider range of temperatures. 
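Before extrapolating, we can sanity-check the fitted equation by hand at the Challenger launch temperature of 31 degrees Fahrenheit, using the rounded coefficients read off the summaries above (a minimal sketch; the helper name `predicted_prop` is just illustrative):

```python
import math

# fitted coefficients, read off the model summary above
intercept = 11.6630
slope = -0.2162  # change in log-odds per degree Fahrenheit

def predicted_prop(temp):
    # inverse logit of the linear predictor: expected proportion of damaged o-rings
    eta = intercept + slope * temp
    return math.exp(eta) / (1 + math.exp(eta))

# expected proportion damaged at the Challenger launch temperature
p_31 = predicted_prop(31)

# each 1 degree F increase multiplies the odds of damage by exp(slope)
odds_ratio = math.exp(slope)

print(p_31, odds_ratio)
```

At 31 degrees Fahrenheit the model predicts that almost all o-rings fail (a proportion above 0.99), and each extra degree of warmth multiplies the odds of damage by roughly 0.81.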
Here, we use a range of $[25, 85]$ degrees Fahrenheit.\n\n\n::: {.cell}\n\n```{.python .cell-code}\nmodel = pd.DataFrame({'temp': list(range(25, 86))})\n\nmodel[\"pred\"] = glm_chl_py.predict(model)\n\nmodel.head()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n temp pred\n0 25 0.998087\n1 26 0.997626\n2 27 0.997055\n3 28 0.996347\n4 29 0.995469\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(challenger_py,\n aes(x = \"temp\",\n y = \"prop_damaged\")) +\n geom_point() +\n geom_line(model, aes(x = \"temp\", y = \"pred\"), colour = \"blue\", size = 1))\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-proportion_files/figure-html/unnamed-chunk-17-1.png){width=614}\n:::\n:::\n\n\n\n::: {.callout-note collapse=true}\n## Generating predicted values\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nAnother way of doing this is to generate a table with data for a range of temperatures, from 25 to 85 degrees Fahrenheit, in steps of 1. We can then use these data to generate the logistic curve, based on the fitted model.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# create a table with sequential numbers ranging from 25 to 85\nmodel <- tibble(temp = seq(25, 85, by = 1)) %>% \n # add a new column containing the predicted values\n mutate(.pred = predict(glm_chl, newdata = ., type = \"response\"))\n\nggplot(model, aes(temp, .pred)) +\n geom_line()\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-proportion_files/figure-html/unnamed-chunk-18-3.png){width=672}\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\n# plot the curve and the original data\nggplot(model, aes(temp, .pred)) +\n geom_line(colour = \"blue\") +\n geom_point(data = challenger, aes(temp, prop_damaged)) +\n # add a vertical line at the disaster launch temperature\n geom_vline(xintercept = 31, linetype = \"dashed\")\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-proportion_files/figure-html/unnamed-chunk-19-1.png){width=672}\n:::\n:::\n\n\nIt 
seems that there was a high probability of both o-rings failing at that launch temperature. One thing that the graph shows is that there is a lot of uncertainty involved in this model. We can tell, because the fit of the line is very poor at the lower temperature range. There is just very little data to work on, with the data point at 53 F having a large influence on the fit.\n\n## Python\n\nWe already did this above, since this is the most straightforward way of plotting the model in Python.\n\n:::\n:::\n:::\n\n## Exercises\n\n### Predicting failure {#sec-exr_failure}\n\n:::{.callout-exercise}\n\n\n{{< level 2 >}}\n\n\n\nThe data point at 53 degrees Fahrenheit is quite influential for the analysis. Remove this data point and repeat the analysis. Is there still a predicted link between launch temperature and o-ring failure?\n\n::: {.callout-answer collapse=\"true\"}\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nFirst, we need to remove the influential data point:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchallenger_new <- challenger %>% filter(temp != 53)\n```\n:::\n\n\nWe can create a new generalised linear model, based on these data:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nglm_chl_new <- glm(cbind(damage, intact) ~ temp,\n family = binomial,\n data = challenger_new)\n```\n:::\n\n\nWe can get the model parameters as follows:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(glm_chl_new)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nglm(formula = cbind(damage, intact) ~ temp, family = binomial, \n data = challenger_new)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) 5.68223 4.43138 1.282 0.1997 \ntemp -0.12817 0.06697 -1.914 0.0556 .\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n\n(Dispersion parameter for binomial family taken to be 1)\n\n Null deviance: 16.375 on 21 degrees of freedom\nResidual deviance: 12.633 on 20 degrees of freedom\nAIC: 27.572\n\nNumber of Fisher Scoring iterations: 5\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(challenger_new, aes(temp, prop_damaged)) +\n geom_point() +\n geom_smooth(method = \"glm\", se = FALSE, fullrange = TRUE, \n method.args = list(family = binomial)) +\n xlim(25,85) +\n # add a vertical line at 53 F temperature\n geom_vline(xintercept = 53, linetype = \"dashed\")\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in eval(family$initialize): non-integer #successes in a binomial glm!\n```\n\n\n:::\n\n::: {.cell-output-display}\n![](glm-practical-logistic-proportion_files/figure-html/unnamed-chunk-23-1.png){width=672}\n:::\n:::\n\n\nThe predicted proportion of damaged o-rings is markedly less than what was observed.\n\nBefore we can make any firm conclusions, though, we need to check our model:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n1 - pchisq(12.633, 20)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0.8925695\n```\n\n\n:::\n:::\n\n\nWe get quite a high probability (around 0.9) for this, which tells us that the goodness of fit is actually reasonable – the residual deviance (12.633) is no larger than we would expect given the residual degrees of freedom (20), so there is no evidence of a poor fit.\n\nIs the model any better than the null though?\n\n\n::: {.cell}\n\n```{.r .cell-code}\n1 - pchisq(16.375 - 12.633, 21 - 20)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0.0530609\n```\n\n\n:::\n\n```{.r .cell-code}\nanova(glm_chl_new, test = 'Chisq')\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nAnalysis of Deviance Table\n\nModel: binomial, link: logit\n\nResponse: cbind(damage, intact)\n\nTerms added sequentially (first to last)\n\n Df Deviance Resid. Df Resid. Dev Pr(>Chi) \nNULL 21 16.375 \ntemp 1 3.7421 20 12.633 0.05306 .\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n```\n\n\n:::\n:::\n\n\nHowever, the model is not significantly better than the null in this case, with a p-value here of just over 0.05 for both of these tests (they give a similar result since, yet again, we have just the one predictor variable).\n\n## Python\n\nFirst, we need to remove the influential data point:\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nchallenger_new_py = challenger_py.query(\"temp != 53\")\n```\n:::\n\n\nWe can create a new generalised linear model, based on these data:\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# create a generalised linear model\nmodel = smf.glm(formula = \"damage + intact ~ temp\",\n family = sm.families.Binomial(),\n data = challenger_new_py)\n# and get the fitted parameters of the model\nglm_chl_new_py = model.fit()\n```\n:::\n\n\nWe can get the model parameters as follows:\n\n\n::: {.cell}\n\n```{.python .cell-code}\nprint(glm_chl_new_py.summary())\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n Generalized Linear Model Regression Results \n================================================================================\nDep. Variable: ['damage', 'intact'] No. Observations: 22\nModel: GLM Df Residuals: 20\nModel Family: Binomial Df Model: 1\nLink Function: Logit Scale: 1.0000\nMethod: IRLS Log-Likelihood: -11.786\nDate: Tue, 06 Feb 2024 Deviance: 12.633\nTime: 16:12:15 Pearson chi2: 16.6\nNo. Iterations: 6 Pseudo R-squ. 
(CS): 0.1564\nCovariance Type: nonrobust \n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept 5.6822 4.431 1.282 0.200 -3.003 14.368\ntemp -0.1282 0.067 -1.914 0.056 -0.259 0.003\n==============================================================================\n```\n\n\n:::\n:::\n\n\nGenerate new model data:\n\n\n::: {.cell}\n\n```{.python .cell-code}\nmodel = pd.DataFrame({'temp': list(range(25, 86))})\n\nmodel[\"pred\"] = glm_chl_new_py.predict(model)\n\nmodel.head()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n temp pred\n0 25 0.922585\n1 26 0.912920\n2 27 0.902177\n3 28 0.890269\n4 29 0.877107\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(challenger_new_py,\n aes(x = \"temp\",\n y = \"prop_damaged\")) +\n geom_point() +\n geom_line(model, aes(x = \"temp\", y = \"pred\"), colour = \"blue\", size = 1) +\n # add a vertical line at 53 F temperature\n geom_vline(xintercept = 53, linetype = \"dashed\"))\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-proportion_files/figure-html/unnamed-chunk-30-1.png){width=614}\n:::\n:::\n\n\nThe predicted proportion of damaged o-rings is markedly less than what was observed.\n\nBefore we can make any firm conclusions, though, we need to check our model:\n\n\n::: {.cell}\n\n```{.python .cell-code}\nchi2.sf(12.633, 20)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n0.8925694610786307\n```\n\n\n:::\n:::\n\n\nWe get quite a high probability (around 0.9) for this, which tells us that the goodness of fit is actually reasonable – the residual deviance (12.633) is no larger than we would expect given the residual degrees of freedom (20), so there is no evidence of a poor fit.\n\nIs the model any better than the null though?\n\nFirst we need to define the null model:\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# create a generalised linear model (the null model)\nmodel = smf.glm(formula = \"damage + intact ~ 1\",\n family = sm.families.Binomial(),\n data = 
challenger_new_py)\n# and get the fitted parameters of the model\nglm_chl_new_null_py = model.fit()\n\nprint(glm_chl_new_null_py.summary())\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n Generalized Linear Model Regression Results \n================================================================================\nDep. Variable: ['damage', 'intact'] No. Observations: 22\nModel: GLM Df Residuals: 21\nModel Family: Binomial Df Model: 0\nLink Function: Logit Scale: 1.0000\nMethod: IRLS Log-Likelihood: -13.657\nDate: Tue, 06 Feb 2024 Deviance: 16.375\nTime: 16:12:16 Pearson chi2: 16.8\nNo. Iterations: 6 Pseudo R-squ. (CS): -2.220e-16\nCovariance Type: nonrobust \n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept -3.0445 0.418 -7.286 0.000 -3.864 -2.226\n==============================================================================\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.python .cell-code}\nchi2.sf(16.375 - 12.633, 21 - 20)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n0.053060897703157646\n```\n\n\n:::\n:::\n\n\nHowever, the model is not significantly better than the null in this case, with a p-value here of just over 0.05 for both of these tests (they give a similar result since, yet again, we have just the one predictor variable).\n:::\n\nSo, could NASA have predicted what happened? This model is not significantly different from the null, i.e., temperature is not a significant predictor. Note that it’s only marginally non-significant, and we do have a high goodness-of-fit value.\n\nIt is possible that if more data points were available that followed a similar trend, the story might be different). Even if we did use our non-significant model to make a prediction, it doesn’t give us a value anywhere near 5 failures for a temperature of 53 degrees Fahrenheit. 
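We can make that concrete by plugging 53 degrees Fahrenheit into the reduced model by hand, using the rounded coefficients from the summary above (5.6822 and -0.1282). This is a minimal sketch with Python's standard library rather than `glm_chl_new_py.predict()`:

```python
import math

# Rounded coefficients of the model fitted without the 53 F observation
intercept = 5.6822
slope = -0.1282

# predicted proportion of damaged o-rings at 53 F (inverse logit)
eta = intercept + slope * 53
prop = math.exp(eta) / (1 + math.exp(eta))

# each flight carried six o-rings, so convert to an expected count
print(round(prop, 2), round(prop * 6, 1))  # ~0.25, ~1.5
```

That works out at roughly 1.5 damaged o-rings out of six – nowhere near the five that were actually damaged at 53 degrees Fahrenheit.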
So overall, based on the model we’ve fitted with these data, there was no indication that a temperature just a few degrees cooler than previous missions could have been so disastrous for the Challenger.\n:::\n:::\n\n## Summary\n\n::: {.callout-tip}\n#### Key points\n\n- We can use a logistic model for proportion response variables\n\n:::\n", "supporting": [ "glm-practical-logistic-proportion_files" ], diff --git a/_freeze/materials/glm-practical-poisson/execute-results/html.json b/_freeze/materials/glm-practical-poisson/execute-results/html.json index e459571..5a4ad40 100644 --- a/_freeze/materials/glm-practical-poisson/execute-results/html.json +++ b/_freeze/materials/glm-practical-poisson/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "fc1b580e26839e30f21b8ad25b20203a", + "hash": "a7944edaa2f705428f83c869d555dfd3", "result": { "engine": "knitr", - "markdown": "---\ntitle: \"Count data\"\n---\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n\n\n::: {.callout-tip}\n## Learning outcomes\n\n**Questions**\n\n- How do we analyse count data?\n\n**Objectives**\n\n- Be able to perform a poisson regression on count data\n:::\n\n## Libraries and functions\n\n::: {.callout-note collapse=\"true\"}\n## Click to expand\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n### Libraries\n### Functions\n\n## Python\n\n### Libraries\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# A maths library\nimport math\n# A Python data analysis and manipulation tool\nimport pandas as pd\n\n# Python equivalent of `ggplot2`\nfrom plotnine import *\n\n# Statistical models, conducting tests and statistical data exploration\nimport statsmodels.api as sm\n\n# Convenience interface for specifying models using formula strings and DataFrames\nimport statsmodels.formula.api as smf\n\n# Needed for additional probability functionality\nfrom scipy.stats import *\n```\n:::\n\n\n### Functions\n\n:::\n:::\n\nThe examples in this section use the following data sets:\n\n`data/islands.csv`\n\nThis is a data set 
comprising 35 observations of two variables (one dependent and one predictor). This records the number of species recorded on different small islands along with the area (km2) of the islands. The variables are `species` and `area`.\n\nThe second data set is on seat belts.\n\nThe `seatbelts` data set is a multiple time-series data set that was commissioned by the Department of Transport in 1984 to measure differences in deaths before and after front seat belt legislation was introduced on 31st January 1983. It provides monthly total numerical data on a number of incidents including those related to death and injury in Road Traffic Accidents (RTA's). The data set starts in January 1969 and observations run until December 1984.\n\nYou can find the file in `data/seatbelts.csv`\n\n## Load and visualise the data\n\nFirst we load the data, then we visualise it.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nislands <- read_csv(\"data/islands.csv\")\n```\n:::\n\n\nLet's have a glimpse at the data:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nislands\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 35 × 2\n species area\n \n 1 114 12.1\n 2 130 13.4\n 3 113 13.7\n 4 109 14.5\n 5 118 16.8\n 6 136 19.0\n 7 149 19.6\n 8 162 20.6\n 9 145 20.9\n10 148 21.0\n# ℹ 25 more rows\n```\n\n\n:::\n:::\n\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nislands_py = pd.read_csv(\"data/islands.csv\")\n```\n:::\n\n\nLet's have a glimpse at the data:\n\n\n::: {.cell}\n\n```{.python .cell-code}\nislands_py.head()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n species area\n0 114 12.076133\n1 130 13.405439\n2 113 13.723525\n3 109 14.540359\n4 118 16.792122\n```\n\n\n:::\n:::\n\n\n:::\n\nLooking at the data, we can see that there are two columns: `species`, which contains the number of species recorded on each island and `area`, which contains the surface area of the island in square kilometers.\n\nWe can plot the 
data:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(islands, aes(x = area, y = species)) +\n geom_point()\n```\n\n::: {.cell-output-display}\n![](glm-practical-poisson_files/figure-html/unnamed-chunk-8-1.png){width=672}\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(islands_py, aes(x = \"area\", y = \"species\")) +\n geom_point())\n```\n\n::: {.cell-output-display}\n![](glm-practical-poisson_files/figure-html/unnamed-chunk-9-1.png){width=614}\n:::\n:::\n\n:::\n\nIt looks as though `area` may have an effect on the number of species that we observe on each island. We note that the response variable is count data and so we try to construct a Poisson regression.\n\n## Constructing a model\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nglm_isl <- glm(species ~ area,\n data = islands, family = \"poisson\")\n```\n:::\n\n\nand we look at the model summary:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(glm_isl)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nglm(formula = species ~ area, family = \"poisson\", data = islands)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) 4.241129 0.041322 102.64 <2e-16 ***\narea 0.035613 0.001247 28.55 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for poisson family taken to be 1)\n\n Null deviance: 856.899 on 34 degrees of freedom\nResidual deviance: 30.437 on 33 degrees of freedom\nAIC: 282.66\n\nNumber of Fisher Scoring iterations: 3\n```\n\n\n:::\n:::\n\n\nThe output is strikingly similar to the logistic regression models (who’d have guessed, eh?) 
and the main numbers to extract from the output are the two numbers underneath `Estimate.Std` in the `Coefficients` table:\n\n```\n(Intercept) 4.241129\narea 0.035613\n```\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# create a generalised linear model\nmodel = smf.glm(formula = \"species ~ area\",\n family = sm.families.Poisson(),\n data = islands_py)\n# and get the fitted parameters of the model\nglm_isl_py = model.fit()\n```\n:::\n\n\nLet's look at the model output:\n\n\n::: {.cell}\n\n```{.python .cell-code}\nprint(glm_isl_py.summary())\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n Generalized Linear Model Regression Results \n==============================================================================\nDep. Variable: species No. Observations: 35\nModel: GLM Df Residuals: 33\nModel Family: Poisson Df Model: 1\nLink Function: Log Scale: 1.0000\nMethod: IRLS Log-Likelihood: -139.33\nDate: Tue, 06 Feb 2024 Deviance: 30.437\nTime: 14:14:58 Pearson chi2: 30.3\nNo. Iterations: 4 Pseudo R-squ. (CS): 1.000\nCovariance Type: nonrobust \n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept 4.2411 0.041 102.636 0.000 4.160 4.322\narea 0.0356 0.001 28.551 0.000 0.033 0.038\n==============================================================================\n```\n\n\n:::\n:::\n\n\n:::\n\nThese are the coefficients of the Poisson model equation and need to be placed in the following formula in order to estimate the expected number of species as a function of island size:\n\n$$ E(species) = \\exp(4.24 + 0.036 \\times area) $$\n\nInterpreting this requires a bit of thought (not much, but a bit).\nThe intercept coefficient, `4.24`, is related to the number of species we would expect on an island of zero area (this is statistics, not real life. 
You’d do well to remember that before you worry too much about what that even means). But in order to turn this number into something meaningful we have to exponentiate it. Since `exp(4.24) ≈ 70`, we can say that the baseline number of species the model expects on any island is 70. This isn’t actually the interesting bit though.\n\nThe coefficient of `area` is the fun bit. For starters we can see that it is a positive number which does mean that increasing `area` leads to increasing numbers of `species`. Good so far.\n\nBut what does the value `0.036` actually mean? Well, if we exponentiate it as well, we get `exp(0.036) ≈ 1.04`. This means that for every increase in `area` of 1 km^2 (the original units of the area variable), the number of species on the island is multiplied by `1.04`. So, an island of area 1 km^2 will have `1.04 x 70 ≈ 72` species.\n\nSo, in order to interpret Poisson coefficients, you have to exponentiate them.\n\n## Plotting the Poisson regression\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(islands, aes(area, species)) +\n geom_point() +\n geom_smooth(method = \"glm\", se = FALSE, fullrange = TRUE, \n method.args = list(family = poisson)) +\n xlim(10,50)\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\n`geom_smooth()` using formula = 'y ~ x'\n```\n\n\n:::\n\n::: {.cell-output-display}\n![](glm-practical-poisson_files/figure-html/unnamed-chunk-14-1.png){width=672}\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nmodel = pd.DataFrame({'area': list(range(10, 50))})\n\nmodel[\"pred\"] = glm_isl_py.predict(model)\n\nmodel.head()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n area pred\n0 10 99.212463\n1 11 102.809432\n2 12 106.536811\n3 13 110.399326\n4 14 114.401877\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(islands_py,\n aes(x = \"area\",\n y = \"species\")) +\n geom_point() +\n geom_line(model, aes(x = \"area\", y = \"pred\"), colour = 
\"blue\", size = 1))\n```\n\n::: {.cell-output-display}\n![](glm-practical-poisson_files/figure-html/unnamed-chunk-16-1.png){width=614}\n:::\n:::\n\n:::\n\n## Assumptions\n\nAs we mentioned earlier, Poisson regressions require that the variance of the data at any point is the same as the mean of the data at that point. We checked that earlier by looking at the residual deviance values.\n\nWe can look for influential points using the Cook’s distance plot:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nplot(glm_isl , which=4)\n```\n\n::: {.cell-output-display}\n![](glm-practical-poisson_files/figure-html/unnamed-chunk-17-3.png){width=672}\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# extract the Cook's distances\nglm_isl_py_resid = pd.DataFrame(glm_isl_py.\n get_influence().\n summary_frame()[\"cooks_d\"])\n\n# add row index \nglm_isl_py_resid['obs'] = glm_isl_py_resid.reset_index().index\n```\n:::\n\n\nWe can use these to create the plot:\n\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(glm_isl_py_resid,\n aes(x = \"obs\",\n y = \"cooks_d\")) +\n geom_segment(aes(x = \"obs\", y = \"cooks_d\", xend = \"obs\", yend = 0)) +\n geom_point())\n```\n\n::: {.cell-output-display}\n![](glm-practical-poisson_files/figure-html/unnamed-chunk-19-1.png){width=614}\n:::\n:::\n\n\n:::\n\nNone of our points have particularly large Cook’s distances and so life is rosy.\n\n## Assessing significance\n\nWe can ask the same three questions we asked before.\n\n1. Is the model well-specified?\n2. Is the overall model better than the null model?\n3. 
Are any of the individual predictors significant?\n\nAgain, in this case, questions 2 and 3 are effectively asking the same thing because we still only have a single predictor variable.\n\nTo assess if the model is any good we’ll again use the residual deviance and the residual degrees of freedom.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\n1 - pchisq(30.437, 33)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0.5953482\n```\n\n\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nchi2.sf(30.437, 33)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n0.5953481872979622\n```\n\n\n:::\n:::\n\n\n:::\n\nThis gives a probability of `0.60`. This suggests that this model is actually a reasonably decent one and that the data are pretty well supported by the model. For Poisson models this has an extra interpretation: it can be used to assess whether we have significant over-dispersion in our data.\n\nFor a Poisson model to be appropriate we need the variance of the data to be the same as the mean of the data. Visually, this would correspond to the data spreading out more for higher predicted values of `species`. However, we don’t want the data to spread out too much. If that happens then a Poisson model wouldn’t be appropriate.\n\nThe easy way to check this is to look at the ratio of the residual deviance to the residual degrees of freedom (in this case `0.922`). For a Poisson model to be valid, this ratio should be about 1. 
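The ratio itself comes straight from two numbers in the model summary; as a minimal sketch:

```python
# Residual deviance and residual degrees of freedom,
# taken from the islands model summary above
resid_deviance = 30.437
resid_df = 33

dispersion_ratio = resid_deviance / resid_df
print(round(dispersion_ratio, 3))  # ~0.922
```

A ratio close to 1, as we have here, is consistent with the Poisson assumption that the variance matches the mean.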
If the ratio is significantly bigger than 1 then we say that we have over-dispersion in the model and we wouldn’t be able to trust any of the significance testing that we are about to do using a Poisson regression.\n\nThankfully the probability we have just created (`0.60`) is exactly the right one we need to look at to assess whether we have significant over-dispersion in our model.\n\nSecondly, to assess whether the overall model, with all of the terms, is better than the null model we’ll look at the difference in deviances and the difference in degrees of freedom:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\n1 - pchisq(856.899 - 30.437, 34 - 33)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0\n```\n\n\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nchi2.sf(856.899 - 30.437, 34 - 33)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n9.524927068555617e-182\n```\n\n\n:::\n:::\n\n:::\n\nThis gives a reported p-value of pretty much zero, which is pretty damn small. So, yes, this model is better than nothing at all and species does appear to change with some of our predictors\n\nFinally, we’ll construct an analysis of deviance table to look at the individual terms:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nanova(glm_isl , test = \"Chisq\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nAnalysis of Deviance Table\n\nModel: poisson, link: log\n\nResponse: species\n\nTerms added sequentially (first to last)\n\n Df Deviance Resid. Df Resid. Dev Pr(>Chi) \nNULL 34 856.90 \narea 1 826.46 33 30.44 < 2.2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n```\n\n\n:::\n:::\n\n\nThe p-value in this table is just as small as we’d expect given our previous result (`<2.2e-16` is pretty close to 0), and we have the nice consistent result that `area` definitely has an effect on `species`.\n\n## Python\n\nAs mentioned before, this is not quite possible in Python.\n:::\n\n## Exercises\n\n### Seat belts {#sec-exr_seatbelts}\n\n:::{.callout-exercise}\n\n\n{{< level 2 >}}\n\n\n\nFor this exercise we'll be using the data from `data/seatbelts.csv`.\n\nI'd like you to do the following:\n\n1. Load the data\n2. Visualise the data and create a poisson regression model\n3. Plot the regression model on top of the data\n4. Assess if the model is a decent predictor for the number of fatalities\n\n::: {.callout-answer collapse=\"true\"}\n\n#### Load and visualise the data\n\nFirst we load the data, then we visualise it.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseatbelts <- read_csv(\"data/seatbelts.csv\")\n```\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nseatbelts_py = pd.read_csv(\"data/seatbelts.csv\")\n```\n:::\n\n\nLet's have a glimpse at the data:\n\n\n::: {.cell}\n\n```{.python .cell-code}\nseatbelts_py.head()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n drivers_killed drivers front rear ... van_killed law year month\n0 107 1687 867 269 ... 12 0 1969 Jan\n1 97 1508 825 265 ... 6 0 1969 Feb\n2 102 1507 806 319 ... 12 0 1969 Mar\n3 87 1385 814 407 ... 8 0 1969 Apr\n4 119 1632 991 454 ... 10 0 1969 May\n\n[5 rows x 10 columns]\n```\n\n\n:::\n:::\n\n:::\n\nThe data tracks the number of drivers killed in road traffic accidents, before and after the seat belt law was introduced. 
The information on whether the law was in place is encoded in the `law` column as `0` (law not in place) or `1` (law in place).\n\nThere are many more observations when the law was *not* in place, so we need to keep this in mind when we're interpreting the data.\n\nFirst we have a look at the data comparing no law vs law:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nWe have to convert the `law` column to a factor, otherwise R will see it as numerical.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseatbelts %>% \n ggplot(aes(as_factor(law), drivers_killed)) +\n geom_boxplot()\n```\n\n::: {.cell-output-display}\n![](glm-practical-poisson_files/figure-html/unnamed-chunk-28-1.png){width=672}\n:::\n:::\n\n\nThe data are recorded by month and year, so we can also display the number of drivers killed by year:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseatbelts %>% \n ggplot(aes(year, drivers_killed)) +\n geom_point()\n```\n\n::: {.cell-output-display}\n![](glm-practical-poisson_files/figure-html/unnamed-chunk-29-1.png){width=672}\n:::\n:::\n\n\n:::\n\nThe data look a bit weird. There is quite some variation within years (keeping in mind that the data are aggregated monthly). The data also seems to wave around a bit... with some vague peaks (e.g. 1972 - 1973) and some troughs (e.g. around 1976).\n\nSo my initial thought is that these data are going to be a bit tricky to interpret. But that's OK.\n\n#### Constructing a model\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nglm_stb <- glm(drivers_killed ~ year,\n data = seatbelts, family = \"poisson\")\n```\n:::\n\n\nand we look at the model summary:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(glm_stb)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nglm(formula = drivers_killed ~ year, family = \"poisson\", data = seatbelts)\n\nCoefficients:\n Estimate Std. 
Error z value Pr(>|z|) \n(Intercept) 37.168958 2.796636 13.29 <2e-16 ***\nyear -0.016373 0.001415 -11.57 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for poisson family taken to be 1)\n\n Null deviance: 984.50 on 191 degrees of freedom\nResidual deviance: 850.41 on 190 degrees of freedom\nAIC: 2127.2\n\nNumber of Fisher Scoring iterations: 4\n```\n\n\n:::\n:::\n\n\n```\n(Intercept) 37.168958\nyear -0.016373\n```\n:::\n\nThese are the coefficients of the Poisson model equation and need to be placed in the following formula in order to estimate the expected number of drivers killed as a function of year:\n\n$$ E(drivers\\_killed) = \\exp(37.17 - 0.016 \\times year) $$\n\n#### Assessing significance\n\nIs the model well-specified?\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\n1 - pchisq(850.41, 190)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0\n```\n\n\n:::\n:::\n\n\nThis value indicates that the model is actually rather poor. Remember, this probability lies between $[0, 1]$ and the closer it is to one, the better the model fits: here the residual deviance (850.41) is far larger than the residual degrees of freedom (190), which also points to substantial over-dispersion.\n:::\n\nHow about the overall fit?\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\n1 - pchisq(984.50 - 850.41, 191 - 190)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0\n```\n\n\n:::\n:::\n\n\nThis, at least, indicates that the model is markedly better than the null model.\n\n:::\n\n#### Plotting the regression\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(seatbelts, aes(year, drivers_killed)) +\n geom_point() +\n geom_smooth(method = \"glm\", se = FALSE, fullrange = TRUE, \n method.args = list(family = poisson)) +\n xlim(1970,1985)\n```\n\n::: {.cell-output-display}\n![](glm-practical-poisson_files/figure-html/unnamed-chunk-34-1.png){width=672}\n:::\n:::\n\n\n:::\n\n\n#### Conclusions\n\nThe model we constructed is significantly better than the null model, but the goodness-of-fit check above suggests that it is a poor predictor for the number of fatalities, so any predictions should be treated with caution.\n\n:::\n:::\n\n## Summary\n\n::: {.callout-tip}\n#### Key points\n\n- Poisson regression is useful when dealing with count data\n:::\n", + "markdown": "---\ntitle: \"Count data\"\n---\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n\n\n::: {.callout-tip}\n## Learning outcomes\n\n**Questions**\n\n- How do we analyse count data?\n\n**Objectives**\n\n- Be able to perform a poisson regression on count data\n:::\n\n## Libraries and functions\n\n::: {.callout-note collapse=\"true\"}\n## Click to expand\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n### Libraries\n### Functions\n\n## Python\n\n### Libraries\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# A maths library\nimport math\n# A Python data analysis and manipulation tool\nimport pandas as pd\n\n# Python equivalent of `ggplot2`\nfrom plotnine import *\n\n# Statistical models, conducting tests and statistical data exploration\nimport statsmodels.api as sm\n\n# Convenience interface for specifying models using formula strings and DataFrames\nimport 
statsmodels.formula.api as smf\n\n# Needed for additional probability functionality\nfrom scipy.stats import *\n```\n:::\n\n\n### Functions\n\n:::\n:::\n\nThe examples in this section use the following data sets:\n\n`data/islands.csv`\n\nThis is a data set comprising 35 observations of two variables (one dependent and one predictor). This records the number of species recorded on different small islands along with the area (km2) of the islands. The variables are `species` and `area`.\n\nThe second data set is on seat belts.\n\nThe `seatbelts` data set is a multiple time-series data set that was commissioned by the Department of Transport in 1984 to measure differences in deaths before and after front seat belt legislation was introduced on 31st January 1983. It provides monthly total numerical data on a number of incidents including those related to death and injury in Road Traffic Accidents (RTA's). The data set starts in January 1969 and observations run until December 1984.\n\nYou can find the file in `data/seatbelts.csv`\n\n## Load and visualise the data\n\nFirst we load the data, then we visualise it.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nislands <- read_csv(\"data/islands.csv\")\n```\n:::\n\n\nLet's have a glimpse at the data:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nislands\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 35 × 2\n species area\n \n 1 114 12.1\n 2 130 13.4\n 3 113 13.7\n 4 109 14.5\n 5 118 16.8\n 6 136 19.0\n 7 149 19.6\n 8 162 20.6\n 9 145 20.9\n10 148 21.0\n# ℹ 25 more rows\n```\n\n\n:::\n:::\n\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nislands_py = pd.read_csv(\"data/islands.csv\")\n```\n:::\n\n\nLet's have a glimpse at the data:\n\n\n::: {.cell}\n\n```{.python .cell-code}\nislands_py.head()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n species area\n0 114 12.076133\n1 130 13.405439\n2 113 13.723525\n3 109 14.540359\n4 118 
16.792122\n```\n\n\n:::\n:::\n\n\n:::\n\nLooking at the data, we can see that there are two columns: `species`, which contains the number of species recorded on each island and `area`, which contains the surface area of the island in square kilometers.\n\nWe can plot the data:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(islands, aes(x = area, y = species)) +\n geom_point()\n```\n\n::: {.cell-output-display}\n![](glm-practical-poisson_files/figure-html/unnamed-chunk-8-1.png){width=672}\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(islands_py, aes(x = \"area\", y = \"species\")) +\n geom_point())\n```\n\n::: {.cell-output-display}\n![](glm-practical-poisson_files/figure-html/unnamed-chunk-9-1.png){width=614}\n:::\n:::\n\n:::\n\nIt looks as though `area` may have an effect on the number of species that we observe on each island. We note that the response variable is count data and so we try to construct a Poisson regression.\n\n## Constructing a model\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nglm_isl <- glm(species ~ area,\n data = islands, family = \"poisson\")\n```\n:::\n\n\nand we look at the model summary:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(glm_isl)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nglm(formula = species ~ area, family = \"poisson\", data = islands)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) 4.241129 0.041322 102.64 <2e-16 ***\narea 0.035613 0.001247 28.55 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for poisson family taken to be 1)\n\n Null deviance: 856.899 on 34 degrees of freedom\nResidual deviance: 30.437 on 33 degrees of freedom\nAIC: 282.66\n\nNumber of Fisher Scoring iterations: 3\n```\n\n\n:::\n:::\n\n\nThe output is strikingly similar to the logistic regression models (who’d have guessed, eh?) 
and the main numbers to extract from the output are the two numbers underneath `Estimate` in the `Coefficients` table:\n\n```\n(Intercept) 4.241129\narea 0.035613\n```\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# create a generalised linear model\nmodel = smf.glm(formula = \"species ~ area\",\n family = sm.families.Poisson(),\n data = islands_py)\n# and get the fitted parameters of the model\nglm_isl_py = model.fit()\n```\n:::\n\n\nLet's look at the model output:\n\n\n::: {.cell}\n\n```{.python .cell-code}\nprint(glm_isl_py.summary())\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n Generalized Linear Model Regression Results \n==============================================================================\nDep. Variable: species No. Observations: 35\nModel: GLM Df Residuals: 33\nModel Family: Poisson Df Model: 1\nLink Function: Log Scale: 1.0000\nMethod: IRLS Log-Likelihood: -139.33\nDate: Tue, 06 Feb 2024 Deviance: 30.437\nTime: 16:16:33 Pearson chi2: 30.3\nNo. Iterations: 4 Pseudo R-squ. (CS): 1.000\nCovariance Type: nonrobust \n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept 4.2411 0.041 102.636 0.000 4.160 4.322\narea 0.0356 0.001 28.551 0.000 0.033 0.038\n==============================================================================\n```\n\n\n:::\n:::\n\n\n:::\n\nThese are the coefficients of the Poisson model equation and need to be placed in the following formula in order to estimate the expected number of species as a function of island size:\n\n$$ E(species) = \\exp(4.24 + 0.036 \\times area) $$\n\nInterpreting this requires a bit of thought (not much, but a bit).\nThe intercept coefficient, `4.24`, is related to the number of species we would expect on an island of zero area (this is statistics, not real life. 
You’d do well to remember that before you worry too much about what that even means). But in order to turn this number into something meaningful we have to exponentiate it. Since `exp(4.24) ≈ 70`, we can say that the baseline number of species the model expects on any island is 70. This isn’t actually the interesting bit though.\n\nThe coefficient of `area` is the fun bit. For starters we can see that it is a positive number which does mean that increasing `area` leads to increasing numbers of `species`. Good so far.\n\nBut what does the value `0.036` actually mean? Well, if we exponentiate it as well, we get `exp(0.036) ≈ 1.04`. This means that for every increase in `area` of 1 km^2 (the original units of the area variable), the number of species on the island is multiplied by `1.04`. So, an island of area 1 km^2 will have `1.04 x 70 ≈ 72` species.\n\nSo, in order to interpret Poisson coefficients, you have to exponentiate them.\n\n## Plotting the Poisson regression\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(islands, aes(area, species)) +\n geom_point() +\n geom_smooth(method = \"glm\", se = FALSE, fullrange = TRUE, \n method.args = list(family = poisson)) +\n xlim(10,50)\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\n`geom_smooth()` using formula = 'y ~ x'\n```\n\n\n:::\n\n::: {.cell-output-display}\n![](glm-practical-poisson_files/figure-html/unnamed-chunk-14-1.png){width=672}\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nmodel = pd.DataFrame({'area': list(range(10, 50))})\n\nmodel[\"pred\"] = glm_isl_py.predict(model)\n\nmodel.head()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n area pred\n0 10 99.212463\n1 11 102.809432\n2 12 106.536811\n3 13 110.399326\n4 14 114.401877\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(islands_py,\n aes(x = \"area\",\n y = \"species\")) +\n geom_point() +\n geom_line(model, aes(x = \"area\", y = \"pred\"), colour = 
\"blue\", size = 1))\n```\n\n::: {.cell-output-display}\n![](glm-practical-poisson_files/figure-html/unnamed-chunk-16-1.png){width=614}\n:::\n:::\n\n:::\n\n## Assumptions\n\nAs we mentioned earlier, Poisson regressions require that the variance of the data at any point is the same as the mean of the data at that point. We checked that earlier by looking at the residual deviance values.\n\nWe can look for influential points using the Cook’s distance plot:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nplot(glm_isl , which=4)\n```\n\n::: {.cell-output-display}\n![](glm-practical-poisson_files/figure-html/unnamed-chunk-17-3.png){width=672}\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# extract the Cook's distances\nglm_isl_py_resid = pd.DataFrame(glm_isl_py.\n get_influence().\n summary_frame()[\"cooks_d\"])\n\n# add row index \nglm_isl_py_resid['obs'] = glm_isl_py_resid.reset_index().index\n```\n:::\n\n\nWe can use these to create the plot:\n\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(glm_isl_py_resid,\n aes(x = \"obs\",\n y = \"cooks_d\")) +\n geom_segment(aes(x = \"obs\", y = \"cooks_d\", xend = \"obs\", yend = 0)) +\n geom_point())\n```\n\n::: {.cell-output-display}\n![](glm-practical-poisson_files/figure-html/unnamed-chunk-19-1.png){width=614}\n:::\n:::\n\n\n:::\n\nNone of our points have particularly large Cook’s distances and so life is rosy.\n\n## Assessing significance\n\nWe can ask the same three questions we asked before.\n\n1. Is the model well-specified?\n2. Is the overall model better than the null model?\n3. 
Are any of the individual predictors significant?\n\nAgain, in this case, questions 2 and 3 are effectively asking the same thing because we still only have a single predictor variable.\n\nTo assess if the model is any good we’ll again use the residual deviance and the residual degrees of freedom.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\n1 - pchisq(30.437, 33)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0.5953482\n```\n\n\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nchi2.sf(30.437, 33)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n0.5953481872979622\n```\n\n\n:::\n:::\n\n\n:::\n\nThis gives a probability of `0.60`. This suggests that this model is actually a reasonably decent one and that the data are pretty well supported by the model. For Poisson models this has an extra interpretation. This can be used to assess whether we have significant over-dispersion in our data.\n\nFor a Poisson model to be appropriate we need the variance of the data to be exactly the same as the mean of the data. Visually, this would correspond to the data spreading out more for higher predicted values of `species`. However, we don’t want the data to spread out too much. If that happens then a Poisson model wouldn’t be appropriate.\n\nThe easy way to check this is to look at the ratio of the residual deviance to the residual degrees of freedom (in this case `0.922`). For a Poisson model to be valid, this ratio should be about 1. 
If the ratio is significantly bigger than 1 then we say that we have over-dispersion in the model and we wouldn’t be able to trust any of the significance testing that we are about to do using a Poisson regression.\n\nThankfully the probability we have just created (`0.60`) is exactly the right one we need to look at to assess whether we have significant over-dispersion in our model.\n\nSecondly, to assess whether the overall model, with all of the terms, is better than the null model we’ll look at the difference in deviances and the difference in degrees of freedom:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\n1 - pchisq(856.899 - 30.437, 34 - 33)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0\n```\n\n\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nchi2.sf(856.899 - 30.437, 34 - 33)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n9.524927068555617e-182\n```\n\n\n:::\n:::\n\n:::\n\nThis gives a reported p-value of pretty much zero, which is pretty damn small. So, yes, this model is better than nothing at all and species does appear to change with some of our predictors\n\nFinally, we’ll construct an analysis of deviance table to look at the individual terms:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nanova(glm_isl , test = \"Chisq\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nAnalysis of Deviance Table\n\nModel: poisson, link: log\n\nResponse: species\n\nTerms added sequentially (first to last)\n\n Df Deviance Resid. Df Resid. Dev Pr(>Chi) \nNULL 34 856.90 \narea 1 826.46 33 30.44 < 2.2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n```\n\n\n:::\n:::\n\n\nThe p-value in this table is just as small as we’d expect given our previous result (`<2.2e-16` is pretty close to 0), and we have the nice consistent result that `area` definitely has an effect on `species`.\n\n## Python\n\nAs mentioned before, this is not quite possible in Python.\n:::\n\n## Exercises\n\n### Seat belts {#sec-exr_seatbelts}\n\n:::{.callout-exercise}\n\n\n{{< level 2 >}}\n\n\n\nFor this exercise we'll be using the data from `data/seatbelts.csv`.\n\nI'd like you to do the following:\n\n1. Load the data\n2. Visualise the data and create a poisson regression model\n3. Plot the regression model on top of the data\n4. Assess if the model is a decent predictor for the number of fatalities\n\n::: {.callout-answer collapse=\"true\"}\n\n#### Load and visualise the data\n\nFirst we load the data, then we visualise it.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseatbelts <- read_csv(\"data/seatbelts.csv\")\n```\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nseatbelts_py = pd.read_csv(\"data/seatbelts.csv\")\n```\n:::\n\n\nLet's have a glimpse at the data:\n\n\n::: {.cell}\n\n```{.python .cell-code}\nseatbelts_py.head()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n casualties drivers front rear ... van_killed law year month\n0 107 1687 867 269 ... 12 0 1969 Jan\n1 97 1508 825 265 ... 6 0 1969 Feb\n2 102 1507 806 319 ... 12 0 1969 Mar\n3 87 1385 814 407 ... 8 0 1969 Apr\n4 119 1632 991 454 ... 10 0 1969 May\n\n[5 rows x 10 columns]\n```\n\n\n:::\n:::\n\n:::\n\nThe data tracks the number of drivers killed in road traffic accidents, before and after the seat belt law was introduced. 
The information on whether the law was in place is encoded in the `law` column as `0` (law not in place) or `1` (law in place).\n\nThere are many more observations when the law was *not* in place, so we need to keep this in mind when we're interpreting the data.\n\nFirst we have a look at the data comparing no law vs law:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nWe have to convert the `law` column to a factor, otherwise R will see it as numerical.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseatbelts %>% \n ggplot(aes(as_factor(law), casualties)) +\n geom_boxplot()\n```\n\n::: {.cell-output-display}\n![](glm-practical-poisson_files/figure-html/unnamed-chunk-28-1.png){width=672}\n:::\n:::\n\n\nThe data are recorded by month and year, so we can also display the number of drivers killed by year:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseatbelts %>% \n ggplot(aes(year, casualties)) +\n geom_point()\n```\n\n::: {.cell-output-display}\n![](glm-practical-poisson_files/figure-html/unnamed-chunk-29-1.png){width=672}\n:::\n:::\n\n\n## Python\n\nWe have to convert the `law` column to a factor, otherwise Python will treat it as numerical.\n\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(seatbelts_py,\n aes(x = seatbelts_py.law.astype(object),\n y = \"casualties\")) +\n geom_boxplot())\n```\n\n::: {.cell-output-display}\n![](glm-practical-poisson_files/figure-html/unnamed-chunk-30-1.png){width=614}\n:::\n:::\n\n\nThe data are recorded by month and year, so we can also display the number of casualties by year:\n\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(seatbelts_py,\n aes(x = \"year\",\n y = \"casualties\")) +\n geom_point())\n```\n\n::: {.cell-output-display}\n![](glm-practical-poisson_files/figure-html/unnamed-chunk-31-3.png){width=614}\n:::\n:::\n\n\n:::\n\nThe data look a bit weird. There is quite some variation within years (keeping in mind that the data are aggregated monthly). The data also seem to wave around a bit... with some vague peaks (e.g. 
1972 - 1973) and some troughs (e.g. around 1976).\n\nSo my initial thought is that these data are going to be a bit tricky to interpret. But that's OK.\n\n#### Constructing a model\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nglm_stb <- glm(casualties ~ year,\n data = seatbelts, family = \"poisson\")\n```\n:::\n\n\nand we look at the model summary:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(glm_stb)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nglm(formula = casualties ~ year, family = \"poisson\", data = seatbelts)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) 37.168958 2.796636 13.29 <2e-16 ***\nyear -0.016373 0.001415 -11.57 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for poisson family taken to be 1)\n\n Null deviance: 984.50 on 191 degrees of freedom\nResidual deviance: 850.41 on 190 degrees of freedom\nAIC: 2127.2\n\nNumber of Fisher Scoring iterations: 4\n```\n\n\n:::\n:::\n\n\n```\n(Intercept) 37.168958\nyear -0.016373\n```\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# create a linear model\nmodel = smf.glm(formula = \"casualties ~ year\",\n family = sm.families.Poisson(),\n data = seatbelts_py)\n# and get the fitted parameters of the model\nglm_stb_py = model.fit()\n```\n:::\n\n::: {.cell}\n\n```{.python .cell-code}\nprint(glm_stb_py.summary())\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n Generalized Linear Model Regression Results \n==============================================================================\nDep. Variable: casualties No. Observations: 192\nModel: GLM Df Residuals: 190\nModel Family: Poisson Df Model: 1\nLink Function: Log Scale: 1.0000\nMethod: IRLS Log-Likelihood: -1061.6\nDate: Tue, 06 Feb 2024 Deviance: 850.41\nTime: 16:16:38 Pearson chi2: 862.\nNo. Iterations: 4 Pseudo R-squ. 
(CS): 0.5026\nCovariance Type: nonrobust \n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept 37.1690 2.797 13.291 0.000 31.688 42.650\nyear -0.0164 0.001 -11.569 0.000 -0.019 -0.014\n==============================================================================\n```\n\n\n:::\n:::\n\n\n```\n======================\n coef \n----------------------\nIntercept 37.1690 \nyear -0.0164 \n======================\n```\n:::\n\nThese are the coefficients of the Poisson model equation and need to be placed in the following formula in order to estimate the expected number of casualties as a function of year:\n\n$$ E(casualties) = \\exp(37.17 - 0.0164 \\times year) $$\n\n#### Assessing significance\n\nIs the model well-specified?\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\n1 - pchisq(850.41, 190)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0\n```\n\n\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nchi2.sf(850.41, 190)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n3.1319689119997022e-84\n```\n\n\n:::\n:::\n\n:::\n\nThis value is essentially zero, which tells us that the model is not well-specified. Remember, this probability lies between $[0, 1]$ and the closer it is to 1, the better the data are supported by the model. Here the ratio of residual deviance to residual degrees of freedom ($850.41 / 190 \\approx 4.5$) is well above 1, so we have substantial over-dispersion and should treat the significance tests that follow with caution.\n\nHow about the overall fit?\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\n1 - pchisq(984.50 - 850.41, 191 - 190)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0\n```\n\n\n:::\n:::\n\n\n## Python\n\nFirst we need to define the null model:\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# create a linear model\nmodel = smf.glm(formula = \"casualties ~ 1\",\n family = sm.families.Poisson(),\n data = seatbelts_py)\n# and get the fitted parameters of the model\nglm_stb_null_py = model.fit()\n\nprint(glm_stb_null_py.summary())\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n Generalized Linear Model Regression Results \n==============================================================================\nDep. Variable: casualties No. Observations: 192\nModel: GLM Df Residuals: 191\nModel Family: Poisson Df Model: 0\nLink Function: Log Scale: 1.0000\nMethod: IRLS Log-Likelihood: -1128.6\nDate: Tue, 06 Feb 2024 Deviance: 984.50\nTime: 16:16:38 Pearson chi2: 1.00e+03\nNo. Iterations: 4 Pseudo R-squ. 
(CS): 1.942e-13\nCovariance Type: nonrobust \n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept 4.8106 0.007 738.670 0.000 4.798 4.823\n==============================================================================\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.python .cell-code}\nchi2.sf(984.50 - 850.41, 191 - 190)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n5.2214097202831414e-31\n```\n\n\n:::\n:::\n\n:::\n\nAgain, this indicates that the model is markedly better than the null model.\n\n#### Plotting the regression\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(seatbelts, aes(year, casualties)) +\n geom_point() +\n geom_smooth(method = \"glm\", se = FALSE, fullrange = TRUE, \n method.args = list(family = poisson)) +\n xlim(1970,1985)\n```\n\n::: {.cell-output-display}\n![](glm-practical-poisson_files/figure-html/unnamed-chunk-41-1.png){width=672}\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nmodel = pd.DataFrame({'year': list(range(1968, 1985))})\n\nmodel[\"pred\"] = glm_stb_py.predict(model)\n\nmodel.head()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n year pred\n0 1968 140.737690\n1 1969 138.452153\n2 1970 136.203733\n3 1971 133.991827\n4 1972 131.815842\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(seatbelts_py,\n aes(x = \"year\",\n y = \"casualties\")) +\n geom_point() +\n geom_line(model, aes(x = \"year\", y = \"pred\"), colour = \"blue\", size = 1))\n```\n\n::: {.cell-output-display}\n![](glm-practical-poisson_files/figure-html/unnamed-chunk-43-1.png){width=614}\n:::\n:::\n\n:::\n\n\n#### Conclusions\n\nThe model is markedly better than the null model, but the residual deviance (850.41 on 190 degrees of freedom) points to substantial over-dispersion, so we should be cautious about using it as a predictor for the number of casualties.\n\n:::\n:::\n\n## Summary\n\n::: {.callout-tip}\n#### Key points\n\n- Poisson regression is useful when 
dealing with count data\n:::\n", "supporting": [ "glm-practical-poisson_files" ], diff --git a/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-28-1.png b/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-28-1.png index 7608c9e..e9b3e64 100644 Binary files a/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-28-1.png and b/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-28-1.png differ diff --git a/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-29-1.png b/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-29-1.png index 5c19875..97036f5 100644 Binary files a/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-29-1.png and b/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-29-1.png differ diff --git a/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-30-1.png b/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-30-1.png new file mode 100644 index 0000000..6ea8e6f Binary files /dev/null and b/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-30-1.png differ diff --git a/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-31-3.png b/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-31-3.png new file mode 100644 index 0000000..b7ce4a8 Binary files /dev/null and b/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-31-3.png differ diff --git a/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-41-1.png b/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-41-1.png new file mode 100644 index 0000000..4d61df5 Binary files /dev/null and b/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-41-1.png differ diff --git a/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-43-1.png b/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-43-1.png new file mode 100644 index 0000000..a1b6b49 
Binary files /dev/null and b/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-43-1.png differ diff --git a/_site/search.json b/_site/search.json index f9d6d9a..a949747 100644 --- a/_site/search.json +++ b/_site/search.json @@ -323,7 +323,7 @@ "href": "materials/glm-practical-logistic-proportion.html#model-output", "title": "\n8  Proportional response\n", "section": "\n8.4 Model output", - "text": "8.4 Model output\nThat’s the easy part done! The trickier part is interpreting the output. First of all, we’ll get some summary information.\n\n\nR\nPython\n\n\n\nNext, we can have a closer look at the results:\n\nsummary(glm_chl)\n\n\nCall:\nglm(formula = cbind(damage, intact) ~ temp, family = binomial, \n data = challenger)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) 11.66299 3.29626 3.538 0.000403 ***\ntemp -0.21623 0.05318 -4.066 4.78e-05 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for binomial family taken to be 1)\n\n Null deviance: 38.898 on 22 degrees of freedom\nResidual deviance: 16.912 on 21 degrees of freedom\nAIC: 33.675\n\nNumber of Fisher Scoring iterations: 6\n\n\nWe can see that the p-values of the intercept and temp are significant. We can also use the intercept and temp coefficients to construct the logistic equation, which we can use to sketch the logistic curve.\n\n\n\nprint(glm_chl_py.summary())\n\n Generalized Linear Model Regression Results \n================================================================================\nDep. Variable: ['damage', 'intact'] No. Observations: 23\nModel: GLM Df Residuals: 21\nModel Family: Binomial Df Model: 1\nLink Function: Logit Scale: 1.0000\nMethod: IRLS Log-Likelihood: -14.837\nDate: Tue, 06 Feb 2024 Deviance: 16.912\nTime: 15:43:32 Pearson chi2: 28.1\nNo. Iterations: 7 Pseudo R-squ. 
(CS): 0.6155\nCovariance Type: nonrobust \n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept 11.6630 3.296 3.538 0.000 5.202 18.124\ntemp -0.2162 0.053 -4.066 0.000 -0.320 -0.112\n==============================================================================\n\n\n\n\n\n\\[E(prop \\ failed\\ orings) = \\frac{\\exp{(11.66 - 0.22 \\times temp)}}{1 + \\exp{(11.66 - 0.22 \\times temp)}}\\]\nLet’s see how well our model would have performed if we would have fed it the data from the ill-fated Challenger launch.\n\n\nR\nPython\n\n\n\n\nggplot(challenger, aes(temp, prop_damaged)) +\n geom_point() +\n geom_smooth(method = \"glm\", se = FALSE, fullrange = TRUE, \n method.args = list(family = binomial)) +\n xlim(25,85)\n\nWarning in eval(family$initialize): non-integer #successes in a binomial glm!\n\n\n\n\n\n\n\n\n\n\nWe can get the predicted values for the model as follows:\n\nchallenger_py['predicted_values'] = glm_chl_py.predict()\n\nchallenger_py.head()\n\n temp damage total intact prop_damaged predicted_values\n0 53 5 6 1 0.833333 0.550479\n1 57 1 6 5 0.166667 0.340217\n2 58 1 6 5 0.166667 0.293476\n3 63 1 6 5 0.166667 0.123496\n4 66 0 6 6 0.000000 0.068598\n\n\nThis would only give us the predicted values for the data we already have. Instead we want to extrapolate to what would have been predicted for a wider range of temperatures. 
Here, we use a range of \\([25, 85]\\) degrees Fahrenheit.\n\nmodel = pd.DataFrame({'temp': list(range(25, 86))})\n\nmodel[\"pred\"] = glm_chl_py.predict(model)\n\nmodel.head()\n\n temp pred\n0 25 0.998087\n1 26 0.997626\n2 27 0.997055\n3 28 0.996347\n4 29 0.995469\n\n\n\n(ggplot(challenger_py,\n aes(x = \"temp\",\n y = \"prop_damaged\")) +\n geom_point() +\n geom_line(model, aes(x = \"temp\", y = \"pred\"), colour = \"blue\", size = 1))\n\n\n\n\n\n\n\n\n\n\n\n\n\nGenerating predicted values\n\n\n\n\n\n\n\nR\nPython\n\n\n\nAnother way of doing this it to generate a table with data for a range of temperatures, from 25 to 85 degrees Fahrenheit, in steps of 1. We can then use these data to generate the logistic curve, based on the fitted model.\n\n# create a table with sequential numbers ranging from 25 to 85\nmodel <- tibble(temp = seq(25, 85, by = 1)) %>% \n # add a new column containing the predicted values\n mutate(.pred = predict(glm_chl, newdata = ., type = \"response\"))\n\nggplot(model, aes(temp, .pred)) +\n geom_line()\n\n\n\n\n\n\n\n\n# plot the curve and the original data\nggplot(model, aes(temp, .pred)) +\n geom_line(colour = \"blue\") +\n geom_point(data = challenger, aes(temp, prop_damaged)) +\n # add a vertical line at the disaster launch temperature\n geom_vline(xintercept = 31, linetype = \"dashed\")\n\n\n\n\n\n\n\nIt seems that there was a high probability of both o-rings failing at that launch temperature. One thing that the graph shows is that there is a lot of uncertainty involved in this model. We can tell, because the fit of the line is very poor at the lower temperature range. There is just very little data to work on, with the data point at 53 F having a large influence on the fit.\n\n\nWe already did this above, since this is the most straightforward way of plotting the model in Python.", + "text": "8.4 Model output\nThat’s the easy part done! The trickier part is interpreting the output. 
First of all, we’ll get some summary information.\n\n\nR\nPython\n\n\n\nNext, we can have a closer look at the results:\n\nsummary(glm_chl)\n\n\nCall:\nglm(formula = cbind(damage, intact) ~ temp, family = binomial, \n data = challenger)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) 11.66299 3.29626 3.538 0.000403 ***\ntemp -0.21623 0.05318 -4.066 4.78e-05 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for binomial family taken to be 1)\n\n Null deviance: 38.898 on 22 degrees of freedom\nResidual deviance: 16.912 on 21 degrees of freedom\nAIC: 33.675\n\nNumber of Fisher Scoring iterations: 6\n\n\nWe can see that the p-values of the intercept and temp are significant. We can also use the intercept and temp coefficients to construct the logistic equation, which we can use to sketch the logistic curve.\n\n\n\nprint(glm_chl_py.summary())\n\n Generalized Linear Model Regression Results \n================================================================================\nDep. Variable: ['damage', 'intact'] No. Observations: 23\nModel: GLM Df Residuals: 21\nModel Family: Binomial Df Model: 1\nLink Function: Logit Scale: 1.0000\nMethod: IRLS Log-Likelihood: -14.837\nDate: Tue, 06 Feb 2024 Deviance: 16.912\nTime: 16:12:12 Pearson chi2: 28.1\nNo. Iterations: 7 Pseudo R-squ. 
(CS): 0.6155\nCovariance Type: nonrobust \n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept 11.6630 3.296 3.538 0.000 5.202 18.124\ntemp -0.2162 0.053 -4.066 0.000 -0.320 -0.112\n==============================================================================\n\n\n\n\n\n\\[E(prop \\ failed\\ orings) = \\frac{\\exp{(11.66 - 0.22 \\times temp)}}{1 + \\exp{(11.66 - 0.22 \\times temp)}}\\]\nLet’s see how well our model would have performed if we would have fed it the data from the ill-fated Challenger launch.\n\n\nR\nPython\n\n\n\n\nggplot(challenger, aes(temp, prop_damaged)) +\n geom_point() +\n geom_smooth(method = \"glm\", se = FALSE, fullrange = TRUE, \n method.args = list(family = binomial)) +\n xlim(25,85)\n\nWarning in eval(family$initialize): non-integer #successes in a binomial glm!\n\n\n\n\n\n\n\n\n\n\nWe can get the predicted values for the model as follows:\n\nchallenger_py['predicted_values'] = glm_chl_py.predict()\n\nchallenger_py.head()\n\n temp damage total intact prop_damaged predicted_values\n0 53 5 6 1 0.833333 0.550479\n1 57 1 6 5 0.166667 0.340217\n2 58 1 6 5 0.166667 0.293476\n3 63 1 6 5 0.166667 0.123496\n4 66 0 6 6 0.000000 0.068598\n\n\nThis would only give us the predicted values for the data we already have. Instead we want to extrapolate to what would have been predicted for a wider range of temperatures. 
Here, we use a range of \\([25, 85]\\) degrees Fahrenheit.\n\nmodel = pd.DataFrame({'temp': list(range(25, 86))})\n\nmodel[\"pred\"] = glm_chl_py.predict(model)\n\nmodel.head()\n\n temp pred\n0 25 0.998087\n1 26 0.997626\n2 27 0.997055\n3 28 0.996347\n4 29 0.995469\n\n\n\n(ggplot(challenger_py,\n aes(x = \"temp\",\n y = \"prop_damaged\")) +\n geom_point() +\n geom_line(model, aes(x = \"temp\", y = \"pred\"), colour = \"blue\", size = 1))\n\n\n\n\n\n\n\n\n\n\n\n\n\nGenerating predicted values\n\n\n\n\n\n\n\nR\nPython\n\n\n\nAnother way of doing this is to generate a table with data for a range of temperatures, from 25 to 85 degrees Fahrenheit, in steps of 1. We can then use these data to generate the logistic curve, based on the fitted model.\n\n# create a table with sequential numbers ranging from 25 to 85\nmodel <- tibble(temp = seq(25, 85, by = 1)) %>% \n # add a new column containing the predicted values\n mutate(.pred = predict(glm_chl, newdata = ., type = \"response\"))\n\nggplot(model, aes(temp, .pred)) +\n geom_line()\n\n\n\n\n\n\n\n\n# plot the curve and the original data\nggplot(model, aes(temp, .pred)) +\n geom_line(colour = \"blue\") +\n geom_point(data = challenger, aes(temp, prop_damaged)) +\n # add a vertical line at the disaster launch temperature\n geom_vline(xintercept = 31, linetype = \"dashed\")\n\n\n\n\n\n\n\nIt seems that there was a high probability of both o-rings failing at that launch temperature. One thing that the graph shows is that there is a lot of uncertainty involved in this model. We can tell, because the fit of the line is very poor at the lower temperature range. 
There is just very little data to work with, and the data point at 53 F has a large influence on the fit.\n\n\nWe already did this above, since this is the most straightforward way of plotting the model in Python.",
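As a sanity check, the fitted curve can be evaluated by hand at any temperature by pushing the linear predictor through the inverse logit. A minimal sketch, using the (rounded) coefficients from the model summary; the function name is ours, not part of the course code:

```python
import math

def predict_prop_damaged(temp, intercept=11.6630, slope=-0.2162):
    # inverse logit of the linear predictor from the fitted model
    eta = intercept + slope * temp
    return 1 / (1 + math.exp(-eta))

# temperature on the morning of the Challenger launch
print(round(predict_prop_damaged(31), 3))
```

At 53 F this gives roughly 0.55, in line with the first predicted value shown in the table above.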
0.1 ' ' 1\n\n(Dispersion parameter for binomial family taken to be 1)\n\n Null deviance: 16.375 on 21 degrees of freedom\nResidual deviance: 12.633 on 20 degrees of freedom\nAIC: 27.572\n\nNumber of Fisher Scoring iterations: 5\n\n\n\nggplot(challenger_new, aes(temp, prop_damaged)) +\n geom_point() +\n geom_smooth(method = \"glm\", se = FALSE, fullrange = TRUE, \n method.args = list(family = binomial)) +\n xlim(25,85) +\n # add a vertical line at 53 F temperature\n geom_vline(xintercept = 53, linetype = \"dashed\")\n\nWarning in eval(family$initialize): non-integer #successes in a binomial glm!\n\n\n\n\n\n\n\n\nThe prediction proportion of damaged o-rings is markedly less than what was observed.\nBefore we can make any firm conclusions, though, we need to check our model:\n\n1- pchisq(12.633,20)\n\n[1] 0.8925695\n\n\nWe get quite a high score (around 0.9) for this, which tells us that our goodness of fit is pretty rubbish – our points are not very close to our curve, overall.\nIs the model any better than the null though?\n\n1 - pchisq(16.375 - 12.633, 1)\n\n[1] 0.0530609\n\nanova(glm_chl_new, test = 'Chisq')\n\nAnalysis of Deviance Table\n\nModel: binomial, link: logit\n\nResponse: cbind(damage, intact)\n\nTerms added sequentially (first to last)\n\n Df Deviance Resid. Df Resid. Dev Pr(>Chi) \nNULL 21 16.375 \ntemp 1 3.7421 20 12.633 0.05306 .\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n\n\nHowever, the model is not significantly better than the null in this case, with a p-value here of just over 0.05 for both of these tests (they give a similar result since, yet again, we have just the one predictor variable).\n\n\nFirst, we need to remove the influential data point:\n\nchallenger_new_py = challenger_py.query(\"temp != 53\")\n\nWe can create a new generalised linear model, based on these data:\n\n# create a generalised linear model\nmodel = smf.glm(formula = \"damage + intact ~ temp\",\n family = sm.families.Binomial(),\n data = challenger_new_py)\n# and get the fitted parameters of the model\nglm_chl_new_py = model.fit()\n\nWe can get the model parameters as follows:\n\nprint(glm_chl_new_py.summary())\n\n Generalized Linear Model Regression Results \n================================================================================\nDep. Variable: ['damage', 'intact'] No. Observations: 22\nModel: GLM Df Residuals: 20\nModel Family: Binomial Df Model: 1\nLink Function: Logit Scale: 1.0000\nMethod: IRLS Log-Likelihood: -11.786\nDate: Tue, 06 Feb 2024 Deviance: 12.633\nTime: 15:43:35 Pearson chi2: 16.6\nNo. Iterations: 6 Pseudo R-squ. 
(CS): 0.1564\nCovariance Type: nonrobust \n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept 5.6822 4.431 1.282 0.200 -3.003 14.368\ntemp -0.1282 0.067 -1.914 0.056 -0.259 0.003\n==============================================================================\n\n\nGenerate new model data:\n\nmodel = pd.DataFrame({'temp': list(range(25, 86))})\n\nmodel[\"pred\"] = glm_chl_new_py.predict(model)\n\nmodel.head()\n\n temp pred\n0 25 0.922585\n1 26 0.912920\n2 27 0.902177\n3 28 0.890269\n4 29 0.877107\n\n\n\n(ggplot(challenger_new_py,\n aes(x = \"temp\",\n y = \"prop_damaged\")) +\n geom_point() +\n geom_line(model, aes(x = \"temp\", y = \"pred\"), colour = \"blue\", size = 1) +\n # add a vertical line at 53 F temperature\n geom_vline(xintercept = 53, linetype = \"dashed\"))\n\n\n\n\n\n\n\nThe prediction proportion of damaged o-rings is markedly less than what was observed.\nBefore we can make any firm conclusions, though, we need to check our model:\n\nchi2.sf(12.633, 20)\n\n0.8925694610786307\n\n\nWe get quite a high score (around 0.9) for this, which tells us that our goodness of fit is pretty rubbish – our points are not very close to our curve, overall.\nIs the model any better than the null though?\n\nchi2.sf(16.375 - 12.633, 23 - 22)\n\n0.053060897703157646\n\n\nHowever, the model is not significantly better than the null in this case, with a p-value here of just over 0.05 for both of these tests (they give a similar result since, yet again, we have just the one predictor variable).\n\n\n\nSo, could NASA have predicted what happened? This model is not significantly different from the null, i.e., temperature is not a significant predictor. 
Note that it’s only marginally non-significant, and we do have a high goodness-of-fit value.\nIt is possible that if more data points were available that followed a similar trend, the story might be different). Even if we did use our non-significant model to make a prediction, it doesn’t give us a value anywhere near 5 failures for a temperature of 53 degrees Fahrenheit. So overall, based on the model we’ve fitted with these data, there was no indication that a temperature just a few degrees cooler than previous missions could have been so disastrous for the Challenger.", + "text": "8.5 Exercises\n\n8.5.1 Predicting failure\n\n\n\n\n\n\nExercise\n\n\n\n\n\n\n\nLevel: \nThe data point at 53 degrees Fahrenheit is quite influential for the analysis. Remove this data point and repeat the analysis. Is there still a predicted link between launch temperature and o-ring failure?\n\n\n\n\n\n\nAnswer\n\n\n\n\n\n\n\n\n\nR\nPython\n\n\n\nFirst, we need to remove the influential data point:\n\nchallenger_new <- challenger %>% filter(temp != 53)\n\nWe can create a new generalised linear model, based on these data:\n\nglm_chl_new <- glm(cbind(damage, intact) ~ temp,\n family = binomial,\n data = challenger_new)\n\nWe can get the model parameters as follows:\n\nsummary(glm_chl_new)\n\n\nCall:\nglm(formula = cbind(damage, intact) ~ temp, family = binomial, \n data = challenger_new)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) 5.68223 4.43138 1.282 0.1997 \ntemp -0.12817 0.06697 -1.914 0.0556 .\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n\n(Dispersion parameter for binomial family taken to be 1)\n\n Null deviance: 16.375 on 21 degrees of freedom\nResidual deviance: 12.633 on 20 degrees of freedom\nAIC: 27.572\n\nNumber of Fisher Scoring iterations: 5\n\n\n\nggplot(challenger_new, aes(temp, prop_damaged)) +\n geom_point() +\n geom_smooth(method = \"glm\", se = FALSE, fullrange = TRUE, \n method.args = list(family = binomial)) +\n xlim(25,85) +\n # add a vertical line at 53 F temperature\n geom_vline(xintercept = 53, linetype = \"dashed\")\n\nWarning in eval(family$initialize): non-integer #successes in a binomial glm!\n\n\n\n\n\n\n\n\nThe predicted proportion of damaged o-rings is markedly less than what was observed.\nBefore we can make any firm conclusions, though, we need to check our model:\n\n1 - pchisq(12.633, 20)\n\n[1] 0.8925695\n\n\nWe get quite a high p-value (around 0.9) for this, which tells us that there is no evidence of a significant lack of fit – the residual deviance (12.633) is actually smaller than the residual degrees of freedom (20), so the model describes these sparse data about as well as we could hope.\nIs the model any better than the null though?\n\n1 - pchisq(16.375 - 12.633, 21 - 20)\n\n[1] 0.0530609\n\nanova(glm_chl_new, test = 'Chisq')\n\nAnalysis of Deviance Table\n\nModel: binomial, link: logit\n\nResponse: cbind(damage, intact)\n\nTerms added sequentially (first to last)\n\n Df Deviance Resid. Df Resid. Dev Pr(>Chi) \nNULL 21 16.375 \ntemp 1 3.7421 20 12.633 0.05306 .\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n\n\nHowever, the model is not significantly better than the null in this case, with a p-value here of just over 0.05 for both of these tests (they give a similar result since, yet again, we have just the one predictor variable).\n\n\nFirst, we need to remove the influential data point:\n\nchallenger_new_py = challenger_py.query(\"temp != 53\")\n\nWe can create a new generalised linear model, based on these data:\n\n# create a generalised linear model\nmodel = smf.glm(formula = \"damage + intact ~ temp\",\n family = sm.families.Binomial(),\n data = challenger_new_py)\n# and get the fitted parameters of the model\nglm_chl_new_py = model.fit()\n\nWe can get the model parameters as follows:\n\nprint(glm_chl_new_py.summary())\n\n Generalized Linear Model Regression Results \n================================================================================\nDep. Variable: ['damage', 'intact'] No. Observations: 22\nModel: GLM Df Residuals: 20\nModel Family: Binomial Df Model: 1\nLink Function: Logit Scale: 1.0000\nMethod: IRLS Log-Likelihood: -11.786\nDate: Tue, 06 Feb 2024 Deviance: 12.633\nTime: 16:12:15 Pearson chi2: 16.6\nNo. Iterations: 6 Pseudo R-squ. 
(CS): 0.1564\nCovariance Type: nonrobust \n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept 5.6822 4.431 1.282 0.200 -3.003 14.368\ntemp -0.1282 0.067 -1.914 0.056 -0.259 0.003\n==============================================================================\n\n\nGenerate new model data:\n\nmodel = pd.DataFrame({'temp': list(range(25, 86))})\n\nmodel[\"pred\"] = glm_chl_new_py.predict(model)\n\nmodel.head()\n\n temp pred\n0 25 0.922585\n1 26 0.912920\n2 27 0.902177\n3 28 0.890269\n4 29 0.877107\n\n\n\n(ggplot(challenger_new_py,\n aes(x = \"temp\",\n y = \"prop_damaged\")) +\n geom_point() +\n geom_line(model, aes(x = \"temp\", y = \"pred\"), colour = \"blue\", size = 1) +\n # add a vertical line at 53 F temperature\n geom_vline(xintercept = 53, linetype = \"dashed\"))\n\n\n\n\n\n\n\nThe predicted proportion of damaged o-rings is markedly less than what was observed.\nBefore we can make any firm conclusions, though, we need to check our model:\n\nchi2.sf(12.633, 20)\n\n0.8925694610786307\n\n\nWe get quite a high p-value (around 0.9) for this, which tells us that there is no evidence of a significant lack of fit – the residual deviance (12.633) is actually smaller than the residual degrees of freedom (20), so the model describes these sparse data about as well as we could hope.\nIs the model any better than the null though?\nFirst we need to define the null model:\n\n# create a generalised linear model\nmodel = smf.glm(formula = \"damage + intact ~ 1\",\n family = sm.families.Binomial(),\n data = challenger_new_py)\n# and get the fitted parameters of the model\nglm_chl_new_null_py = model.fit()\n\nprint(glm_chl_new_null_py.summary())\n\n Generalized Linear Model Regression Results \n================================================================================\nDep. Variable: ['damage', 'intact'] No. 
Observations: 22\nModel: GLM Df Residuals: 21\nModel Family: Binomial Df Model: 0\nLink Function: Logit Scale: 1.0000\nMethod: IRLS Log-Likelihood: -13.657\nDate: Tue, 06 Feb 2024 Deviance: 16.375\nTime: 16:12:16 Pearson chi2: 16.8\nNo. Iterations: 6 Pseudo R-squ. (CS): -2.220e-16\nCovariance Type: nonrobust \n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept -3.0445 0.418 -7.286 0.000 -3.864 -2.226\n==============================================================================\n\n\n\nchi2.sf(16.375 - 12.633, 21 - 20)\n\n0.053060897703157646\n\n\nHowever, the model is not significantly better than the null in this case, with a p-value here of just over 0.05 for both of these tests (they give a similar result since, yet again, we have just the one predictor variable).\n\n\n\nSo, could NASA have predicted what happened? This model is not significantly different from the null, i.e., temperature is not a significant predictor. Note that it’s only marginally non-significant, and we do have a high goodness-of-fit value.\nIt is possible that the story might have been different if more data points following a similar trend had been available. Even if we did use our non-significant model to make a prediction, it doesn’t give us a value anywhere near 5 failures for a temperature of 53 degrees Fahrenheit. 
So overall, based on the model we’ve fitted with these data, there was no indication that a temperature just a few degrees cooler than previous missions could have been so disastrous for the Challenger.", "crumbs": [ "Binary and proportional data", "8  Proportional response" @@ -389,7 +389,7 @@ "href": "materials/glm-practical-poisson.html#constructing-a-model", "title": "\n9  Count data\n", "section": "\n9.3 Constructing a model", - "text": "9.3 Constructing a model\n\n\nR\nPython\n\n\n\n\nglm_isl <- glm(species ~ area,\n data = islands, family = \"poisson\")\n\nand we look at the model summary:\n\nsummary(glm_isl)\n\n\nCall:\nglm(formula = species ~ area, family = \"poisson\", data = islands)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) 4.241129 0.041322 102.64 <2e-16 ***\narea 0.035613 0.001247 28.55 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for poisson family taken to be 1)\n\n Null deviance: 856.899 on 34 degrees of freedom\nResidual deviance: 30.437 on 33 degrees of freedom\nAIC: 282.66\n\nNumber of Fisher Scoring iterations: 3\n\n\nThe output is strikingly similar to the logistic regression models (who’d have guessed, eh?) and the main numbers to extract from the output are the two numbers underneath Estimate.Std in the Coefficients table:\n(Intercept) 4.241129\narea 0.035613\n\n\n\n# create a generalised linear model\nmodel = smf.glm(formula = \"species ~ area\",\n family = sm.families.Poisson(),\n data = islands_py)\n# and get the fitted parameters of the model\nglm_isl_py = model.fit()\n\nLet’s look at the model output:\n\nprint(glm_isl_py.summary())\n\n Generalized Linear Model Regression Results \n==============================================================================\nDep. Variable: species No. 
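The null-versus-fitted comparison used throughout this exercise can be wrapped in a small helper, so the same likelihood-ratio logic works for any pair of nested GLMs. A sketch only, using the deviances quoted above; the function name is ours, not part of the course code:

```python
from scipy.stats import chi2

def deviance_test(null_dev, null_df, resid_dev, resid_df):
    # likelihood-ratio test: difference in deviance against
    # difference in degrees of freedom
    return chi2.sf(null_dev - resid_dev, null_df - resid_df)

# deviances for the reduced Challenger model (null: 16.375 on 21 df,
# fitted: 12.633 on 20 df)
p = deviance_test(16.375, 21, 12.633, 20)
print(round(p, 5))
```

This reproduces the p-value of roughly 0.053 obtained in both the R and Python tabs.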
Observations: 35\nModel: GLM Df Residuals: 33\nModel Family: Poisson Df Model: 1\nLink Function: Log Scale: 1.0000\nMethod: IRLS Log-Likelihood: -139.33\nDate: Tue, 06 Feb 2024 Deviance: 30.437\nTime: 14:14:58 Pearson chi2: 30.3\nNo. Iterations: 4 Pseudo R-squ. (CS): 1.000\nCovariance Type: nonrobust \n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept 4.2411 0.041 102.636 0.000 4.160 4.322\narea 0.0356 0.001 28.551 0.000 0.033 0.038\n==============================================================================\n\n\n\n\n\nThese are the coefficients of the Poisson model equation and need to be placed in the following formula in order to estimate the expected number of species as a function of island size:\n\\[ E(species) = \\exp(4.24 + 0.036 \\times area) \\]\nInterpreting this requires a bit of thought (not much, but a bit). The intercept coefficient, 4.24, is related to the number of species we would expect on an island of zero area (this is statistics, not real life. You’d do well to remember that before you worry too much about what that even means). But in order to turn this number into something meaningful we have to exponentiate it. Since exp(4.24) ≈ 70, we can say that the baseline number of species the model expects on any island is 70. This isn’t actually the interesting bit though.\nThe coefficient of area is the fun bit. For starters we can see that it is a positive number which does mean that increasing area leads to increasing numbers of species. Good so far.\nBut what does the value 0.036 actually mean? Well, if we exponentiate it as well, we get exp(0.036) ≈ 1.04. This means that for every increase in area of 1 km^2 (the original units of the area variable), the number of species on the island is multiplied by 1.04. 
So, an island of area 1 km^2 will have 1.04 x 70 ≈ 72 species.\nSo, in order to interpret Poisson coefficients, you have to exponentiate them.", + "text": "9.3 Constructing a model\n\n\nR\nPython\n\n\n\n\nglm_isl <- glm(species ~ area,\n data = islands, family = \"poisson\")\n\nand we look at the model summary:\n\nsummary(glm_isl)\n\n\nCall:\nglm(formula = species ~ area, family = \"poisson\", data = islands)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) 4.241129 0.041322 102.64 <2e-16 ***\narea 0.035613 0.001247 28.55 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for poisson family taken to be 1)\n\n Null deviance: 856.899 on 34 degrees of freedom\nResidual deviance: 30.437 on 33 degrees of freedom\nAIC: 282.66\n\nNumber of Fisher Scoring iterations: 3\n\n\nThe output is strikingly similar to the logistic regression models (who’d have guessed, eh?) and the main numbers to extract from the output are the two numbers underneath Estimate.Std in the Coefficients table:\n(Intercept) 4.241129\narea 0.035613\n\n\n\n# create a generalised linear model\nmodel = smf.glm(formula = \"species ~ area\",\n family = sm.families.Poisson(),\n data = islands_py)\n# and get the fitted parameters of the model\nglm_isl_py = model.fit()\n\nLet’s look at the model output:\n\nprint(glm_isl_py.summary())\n\n Generalized Linear Model Regression Results \n==============================================================================\nDep. Variable: species No. Observations: 35\nModel: GLM Df Residuals: 33\nModel Family: Poisson Df Model: 1\nLink Function: Log Scale: 1.0000\nMethod: IRLS Log-Likelihood: -139.33\nDate: Tue, 06 Feb 2024 Deviance: 30.437\nTime: 16:16:33 Pearson chi2: 30.3\nNo. Iterations: 4 Pseudo R-squ. 
(CS): 1.000\nCovariance Type: nonrobust \n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept 4.2411 0.041 102.636 0.000 4.160 4.322\narea 0.0356 0.001 28.551 0.000 0.033 0.038\n==============================================================================\n\n\n\n\n\nThese are the coefficients of the Poisson model equation and need to be placed in the following formula in order to estimate the expected number of species as a function of island size:\n\\[ E(species) = \\exp(4.24 + 0.036 \\times area) \\]\nInterpreting this requires a bit of thought (not much, but a bit). The intercept coefficient, 4.24, is related to the number of species we would expect on an island of zero area (this is statistics, not real life. You’d do well to remember that before you worry too much about what that even means). But in order to turn this number into something meaningful we have to exponentiate it. Since exp(4.24) ≈ 70, we can say that the baseline number of species the model expects on any island is 70. This isn’t actually the interesting bit though.\nThe coefficient of area is the fun bit. For starters we can see that it is a positive number which does mean that increasing area leads to increasing numbers of species. Good so far.\nBut what does the value 0.036 actually mean? Well, if we exponentiate it as well, we get exp(0.036) ≈ 1.04. This means that for every increase in area of 1 km^2 (the original units of the area variable), the number of species on the island is multiplied by 1.04. 
So, an island of area 1 km^2 will have 1.04 x 70 ≈ 72 species.\nIn short, in order to interpret Poisson coefficients, you have to exponentiate them.",
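The exponentiation step can be checked numerically. A short sketch using the (rounded) coefficients reported in the summary above, so the results are approximate:

```python
import math

# coefficients from the fitted Poisson model for species ~ area
intercept, slope = 4.2411, 0.0356

baseline = math.exp(intercept)   # expected species count at area = 0
rate_ratio = math.exp(slope)     # multiplicative change per extra km^2

print(round(baseline, 1), round(rate_ratio, 3))
```

This gives a baseline of roughly 70 species and a rate ratio of about 1.04, matching the interpretation above.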
The information on whether the law was in place is encoded in the law column as 0 (law not in place) or 1 (law in place).\nThere are many more observations when the law was not in place, so we need to keep this in mind when we’re interpreting the data.\nFirst we have a look at the data comparing no law vs law:\n\nR\n\n\nWe have to convert the law column to a factor, otherwise R will see it as numerical.\n\nseatbelts %>% \n ggplot(aes(as_factor(law), drivers_killed)) +\n geom_boxplot()\n\n\n\n\n\n\n\nThe data are recorded by month and year, so we can also display the number of drivers killed by year:\n\nseatbelts %>% \n ggplot(aes(year, drivers_killed)) +\n geom_point()\n\n\n\n\n\n\n\n\n\n\nThe data look a bit weird. There is quite some variation within years (keeping in mind that the data are aggregated monthly). The data also seems to wave around a bit… with some vague peaks (e.g. 1972 - 1973) and some troughs (e.g. around 1976).\nSo my initial thought is that these data are going to be a bit tricky to interpret. But that’s OK.\nConstructing a model\n\nR\n\n\n\nglm_stb <- glm(drivers_killed ~ year,\n data = seatbelts, family = \"poisson\")\n\nand we look at the model summary:\n\nsummary(glm_stb)\n\n\nCall:\nglm(formula = drivers_killed ~ year, family = \"poisson\", data = seatbelts)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) 37.168958 2.796636 13.29 <2e-16 ***\nyear -0.016373 0.001415 -11.57 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n\n(Dispersion parameter for poisson family taken to be 1)\n\n Null deviance: 984.50 on 191 degrees of freedom\nResidual deviance: 850.41 on 190 degrees of freedom\nAIC: 2127.2\n\nNumber of Fisher Scoring iterations: 4\n\n\n(Intercept) 37.168958\nyear 0.016373\n\n\n\nThese are the coefficients of the Poisson model equation and need to be placed in the following formula in order to estimate the expected number of species as a function of island size:\n\\[ E(drivers\\_killed) = \\exp(37.17 + 0.164 \\times year) \\]\nAssessing significance\nIs the model well-specified?\n\nR\n\n\n\n1 - pchisq(850.41, 190)\n\n[1] 0\n\n\nThis value indicates that the model is actually pretty good. Remember, it is between \\([0, 1]\\) and the closer to zero, the better the model.\n\n\n\nHow about the overall fit?\n\nR\n\n\n\n1 - pchisq(984.50 - 850.41, 191 - 190)\n\n[1] 0\n\n\nAgain, this indicates that the model is markedly better than the null model.\n\n\n\nPlotting the regression\n\nR\n\n\n\nggplot(seatbelts, aes(year, drivers_killed)) +\n geom_point() +\n geom_smooth(method = \"glm\", se = FALSE, fullrange = TRUE, \n method.args = list(family = poisson)) +\n xlim(1970,1985)\n\n\n\n\n\n\n\n\n\n\nConclusions\nThe model we constructed appears to be a decent predictor for the number of fatalities.", + "text": "9.7 Exercises\n\n9.7.1 Seat belts\n\n\n\n\n\n\nExercise\n\n\n\n\n\n\n\nLevel: \nFor this exercise we’ll be using the data from data/seatbelts.csv.\nI’d like you to do the following:\n\nLoad the data\nVisualise the data and create a poisson regression model\nPlot the regression model on top of the data\nAssess if the model is a decent predictor for the number of fatalities\n\n\n\n\n\n\n\nAnswer\n\n\n\n\n\n\n\nLoad and visualise the data\nFirst we load the data, then we visualise it.\n\n\nR\nPython\n\n\n\n\nseatbelts <- read_csv(\"data/seatbelts.csv\")\n\n\n\n\nseatbelts_py = pd.read_csv(\"data/seatbelts.csv\")\n\nLet’s have a glimpse at the data:\n\nseatbelts_py.head()\n\n 
casualties drivers front rear ... van_killed law year month\n0 107 1687 867 269 ... 12 0 1969 Jan\n1 97 1508 825 265 ... 6 0 1969 Feb\n2 102 1507 806 319 ... 12 0 1969 Mar\n3 87 1385 814 407 ... 8 0 1969 Apr\n4 119 1632 991 454 ... 10 0 1969 May\n\n[5 rows x 10 columns]\n\n\n\n\n\nThe data track the number of drivers killed in road traffic accidents, before and after the seat belt law was introduced. The information on whether the law was in place is encoded in the law column as 0 (law not in place) or 1 (law in place).\nThere are many more observations when the law was not in place, so we need to keep this in mind when we’re interpreting the data.\nFirst we have a look at the data comparing no law vs law:\n\n\nR\nPython\n\n\n\nWe have to convert the law column to a factor, otherwise R will see it as numerical.\n\nseatbelts %>% \n ggplot(aes(as_factor(law), casualties)) +\n geom_boxplot()\n\n\n\n\n\n\n\nThe data are recorded by month and year, so we can also display the number of casualties by year:\n\nseatbelts %>% \n ggplot(aes(year, casualties)) +\n geom_point()\n\n\n\n\n\n\n\n\n\nWe have to convert the law column to a categorical variable, otherwise it will be treated as numerical.\n\n(ggplot(seatbelts_py,\n aes(x = seatbelts_py.law.astype(object),\n y = \"casualties\")) +\n geom_boxplot())\n\n\n\n\n\n\n\nThe data are recorded by month and year, so we can also display the number of casualties by year:\n\n(ggplot(seatbelts_py,\n aes(x = \"year\",\n y = \"casualties\")) +\n geom_point())\n\n\n\n\n\n\n\n\n\n\nThe data look a bit weird. There is quite some variation within years (keeping in mind that the data are aggregated monthly). The data also seem to wave around a bit… with some vague peaks (e.g. 1972 - 1973) and some troughs (e.g. around 1976).\nSo my initial thought is that these data are going to be a bit tricky to interpret. 
But that’s OK.\nConstructing a model\n\n\nR\nPython\n\n\n\n\nglm_stb <- glm(casualties ~ year,\n data = seatbelts, family = \"poisson\")\n\nand we look at the model summary:\n\nsummary(glm_stb)\n\n\nCall:\nglm(formula = casualties ~ year, family = \"poisson\", data = seatbelts)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) 37.168958 2.796636 13.29 <2e-16 ***\nyear -0.016373 0.001415 -11.57 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for poisson family taken to be 1)\n\n Null deviance: 984.50 on 191 degrees of freedom\nResidual deviance: 850.41 on 190 degrees of freedom\nAIC: 2127.2\n\nNumber of Fisher Scoring iterations: 4\n\n\n(Intercept) 37.168958\nyear -0.016373\n\n\n\n# create a linear model\nmodel = smf.glm(formula = \"casualties ~ year\",\n family = sm.families.Poisson(),\n data = seatbelts_py)\n# and get the fitted parameters of the model\nglm_stb_py = model.fit()\n\n\nprint(glm_stb_py.summary())\n\n Generalized Linear Model Regression Results \n==============================================================================\nDep. Variable: casualties No. Observations: 192\nModel: GLM Df Residuals: 190\nModel Family: Poisson Df Model: 1\nLink Function: Log Scale: 1.0000\nMethod: IRLS Log-Likelihood: -1061.6\nDate: Tue, 06 Feb 2024 Deviance: 850.41\nTime: 16:16:38 Pearson chi2: 862.\nNo. Iterations: 4 Pseudo R-squ. 
(CS): 0.5026\nCovariance Type: nonrobust \n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept 37.1690 2.797 13.291 0.000 31.688 42.650\nyear -0.0164 0.001 -11.569 0.000 -0.019 -0.014\n==============================================================================\n\n\n======================\n coef \n----------------------\nIntercept 37.1690 \nyear -0.0164 \n======================\n\n\n\nThese are the coefficients of the Poisson model equation and need to be placed in the following formula in order to estimate the expected number of casualties as a function of year:\n\\[ E(casualties) = \\exp(37.17 - 0.016 \\times year) \\]\nAssessing significance\nIs the model well-specified?\n\n\nR\nPython\n\n\n\n\n1 - pchisq(850.41, 190)\n\n[1] 0\n\n\n\n\n\nchi2.sf(850.41, 190)\n\n3.1319689119997022e-84\n\n\n\n\n\nThis p-value is essentially zero, which indicates a significant lack of fit: the residual deviance (850.41) is much larger than the residual degrees of freedom (190), so we should be cautious when interpreting this model.\nIs the model any better than the null model, though?\n\n\nR\nPython\n\n\n\n\n1 - pchisq(984.50 - 850.41, 191 - 190)\n\n[1] 0\n\n\n\n\nFirst we need to define the null model:\n\n# create a generalised linear model\nmodel = smf.glm(formula = \"casualties ~ 1\",\n family = sm.families.Poisson(),\n data = seatbelts_py)\n# and get the fitted parameters of the model\nglm_stb_null_py = model.fit()\n\nprint(glm_stb_null_py.summary())\n\n Generalized Linear Model Regression Results \n==============================================================================\nDep. Variable: casualties No. Observations: 192\nModel: GLM Df Residuals: 191\nModel Family: Poisson Df Model: 0\nLink Function: Log Scale: 1.0000\nMethod: IRLS Log-Likelihood: -1128.6\nDate: Tue, 06 Feb 2024 Deviance: 984.50\nTime: 16:16:38 Pearson chi2: 1.00e+03\nNo. Iterations: 4 Pseudo R-squ. 
(CS): 1.942e-13\nCovariance Type: nonrobust \n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept 4.8106 0.007 738.670 0.000 4.798 4.823\n==============================================================================\n\n\n\nchi2.sf(984.50 - 850.41, 191 - 190)\n\n5.2214097202831414e-31\n\n\n\n\n\nThis indicates that the model is markedly better than the null model.\nPlotting the regression\n\n\nR\nPython\n\n\n\n\nggplot(seatbelts, aes(year, casualties)) +\n geom_point() +\n geom_smooth(method = \"glm\", se = FALSE, fullrange = TRUE, \n method.args = list(family = poisson)) +\n xlim(1970,1985)\n\n\n\n\n\n\n\n\n\n\nmodel = pd.DataFrame({'year': list(range(1968, 1985))})\n\nmodel[\"pred\"] = glm_stb_py.predict(model)\n\nmodel.head()\n\n year pred\n0 1968 140.737690\n1 1969 138.452153\n2 1970 136.203733\n3 1971 133.991827\n4 1972 131.815842\n\n\n\n(ggplot(seatbelts_py,\n aes(x = \"year\",\n y = \"casualties\")) +\n geom_point() +\n geom_line(model, aes(x = \"year\", y = \"pred\"), colour = \"blue\", size = 1))\n\n\n\n\n\n\n\n\n\n\nConclusions\nThe model we constructed captures the overall downward trend in the number of fatalities, although the sizeable residual deviance suggests we should treat its predictions with some caution.",
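One additional quick check, not part of the exercise itself: for a Poisson model the ratio of residual deviance to residual degrees of freedom should be close to 1, and values far above 1 hint at overdispersion. A sketch using the values reported above:

```python
# residual deviance and df reported for the seatbelts model
deviance, resid_df = 850.41, 190

dispersion = deviance / resid_df
print(round(dispersion, 2))
```

A dispersion of around 4.5 is worth bearing in mind when interpreting this model.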
-110,1559,1004,522,13460,0.103773981457839,11,0,1969,Jul -106,1630,1091,536,14055,0.104076403554621,6,0,1969,Aug -107,1579,958,405,12106,0.103773981457839,10,0,1969,Sep -134,1653,850,437,11372,0.103026401330572,16,0,1969,Oct -147,2152,1109,434,9834,0.102730112155946,13,0,1969,Nov -180,2148,1113,437,9267,0.101997191539847,14,0,1969,Dec -125,1752,925,316,9130,0.101274563494893,14,0,1970,Jan -134,1765,903,311,8933,0.10070397563972,6,0,1970,Feb -110,1717,1006,351,11000,0.100139606658898,8,0,1970,Mar -102,1558,892,362,10733,0.0986211043713023,11,0,1970,Apr -103,1575,990,486,12912,0.0983492854059603,7,0,1970,May -111,1520,866,429,12926,0.0980801772105387,13,0,1970,Jun -120,1805,1095,551,13990,0.0972792082183714,13,0,1970,Jul -129,1800,1204,646,14926,0.0974106238350488,11,0,1970,Aug -122,1719,1029,456,12900,0.0974252365245483,11,0,1970,Sep -183,2008,1147,475,12034,0.0963806330037465,14,0,1970,Oct -169,2242,1171,456,10643,0.0957389559626943,16,0,1970,Nov -190,2478,1299,468,10742,0.0951063062359475,14,0,1970,Dec -134,2030,944,356,10266,0.0967359671470176,17,0,1971,Jan -108,1655,874,271,10281,0.096109222487367807,16,0,1971,Feb -104,1693,840,354,11527,0.095367254851379,15,0,1971,Mar -117,1623,893,427,12281,0.0947095915871269,13,0,1971,Apr -157,1805,1007,465,13587,0.0941176202174608,13,0,1971,May -148,1746,973,440,13049,0.0935321548190638,15,0,1971,Jun -130,1795,1097,539,16055,0.0929540494377308,12,0,1971,Jul -140,1926,1194,646,15220,0.0928397862431927,6,0,1971,Aug -136,1619,988,457,13824,0.0927247362539862,9,0,1971,Sep -140,1992,1077,446,12729,0.0922696509793897,13,0,1971,Oct -187,2233,1045,402,11467,0.0917066851479679,14,0,1971,Nov -150,2192,1115,441,11351,0.0912620719433678,15,0,1971,Dec -159,2080,1005,359,10803,0.090711603254936,14,0,1972,Jan -143,1768,857,334,10548,0.090276328119195,3,0,1972,Feb -114,1835,879,312,12368,0.0899519176272147,12,0,1972,Mar -127,1569,887,427,13311,0.0890996386561615,13,0,1972,Apr -159,1976,1075,434,13885,0.0886791925043499,12,0,1972,May 
-156,1853,1121,486,14088,0.0881592888670634,8,0,1972,Jun -138,1965,1190,569,16932,0.0889020568552906,8,0,1972,Jul -120,1689,1058,523,16164,0.0881813314444876,15,0,1972,Aug -117,1778,939,418,14883,0.0889402929599117,8,0,1972,Sep -170,1976,1074,452,13532,0.0877266104275971,5,0,1972,Oct -168,2397,1089,462,12220,0.087428846437772,17,0,1972,Nov -198,2654,1208,497,12025,0.0870354301608856,14,0,1972,Dec -144,2097,903,354,11692,0.0864499193294655,13,0,1973,Jan -146,1963,916,347,11081,0.0858726409121568,5,0,1973,Feb -109,1677,787,276,13745,0.0853982218357345,8,0,1973,Mar -131,1941,1114,472,14382,0.083821981233605,5,0,1973,Apr -151,2003,1014,487,14391,0.0845907801489325,12,0,1973,May -140,1813,1022,505,15597,0.0841369037739444,11,0,1973,Jun -153,2012,1114,619,16834,0.0837784051341314,13,0,1973,Jul -140,1912,1132,640,17282,0.0835107427259604,15,0,1973,Aug -161,2084,1111,559,15779,0.0828063938633846,11,0,1973,Sep -168,2080,1008,453,13946,0.0811788933269884,11,0,1973,Oct -152,2118,916,418,12701,0.0828536069623417,10,0,1973,Nov -136,2150,992,419,10431,0.0941901186933595,13,0,1973,Dec -113,1608,731,262,11616,0.0923998429510411,8,0,1974,Jan -100,1503,665,299,10808,0.108161478199019,6,0,1974,Feb -103,1548,724,303,12421,0.10721168869023,8,0,1974,Mar -103,1382,744,401,13605,0.114042966782082,14,0,1974,Apr -121,1731,910,413,14455,0.112454115810183,12,0,1974,May -134,1798,883,426,15019,0.111316253290611,14,0,1974,Jun -133,1779,900,516,15662,0.11030125221242,13,0,1974,Jul -129,1887,1057,600,16745,0.108197177376865,9,0,1974,Aug -144,2004,1076,459,14717,0.107027443082328,4,0,1974,Sep -154,2077,919,443,13756,0.104946980916917,13,0,1974,Oct -156,2092,920,412,12531,0.119357749193208,6,0,1974,Nov -163,2051,953,400,12568,0.117621904277373,15,0,1974,Dec -122,1577,664,278,11249,0.133027420877451,12,0,1975,Jan -92,1356,607,302,11096,0.130845243689729,16,0,1975,Feb -117,1652,777,381,12637,0.128318477474772,7,0,1975,Mar -95,1382,633,279,13018,0.123547448292297,12,0,1975,Apr 
-96,1519,791,442,15005,0.118586811514179,10,0,1975,May -108,1421,790,409,15235,0.116337480161004,9,0,1975,Jun -108,1442,803,416,15552,0.11516147558196,9,0,1975,Jul -106,1543,884,511,16905,0.114501197216867,6,0,1975,Aug -140,1656,769,393,14776,0.113522979499817,7,0,1975,Sep -114,1561,732,345,14104,0.111930179432996,13,0,1975,Oct -158,1905,859,391,12854,0.110610528503361,14,0,1975,Nov -161,2199,994,470,12956,0.11527438914664,13,0,1975,Dec -102,1473,704,266,12177,0.113793485966034,14,0,1976,Jan -127,1655,684,312,11918,0.112349582098189,11,0,1976,Feb -125,1407,671,300,13517,0.111753469387189,11,0,1976,Mar -101,1395,643,373,14417,0.109642522576533,10,0,1976,Apr -97,1530,771,412,15911,0.10844089510559,4,0,1976,May -112,1309,644,322,15589,0.107884938936114,8,0,1976,Jun -112,1526,828,458,16543,0.109084769191454,9,0,1976,Jul -113,1327,748,427,17925,0.107571450111271,10,0,1976,Aug -108,1627,767,346,15406,0.106164022368002,10,0,1976,Sep -128,1748,825,421,14601,0.106299999323319,5,0,1976,Oct -154,1958,810,344,13107,0.104825313000088,13,0,1976,Nov -162,2274,986,370,12268,0.103451745711815,12,0,1976,Dec -112,1648,714,291,11972,0.101449920129493,10,0,1977,Jan -79,1401,567,224,12028,0.100402316427863,9,0,1977,Feb -82,1411,616,266,14033,0.098862033680192,7,0,1977,Mar -127,1403,678,338,14244,0.102496154313521,5,0,1977,Apr -108,1394,742,298,15287,0.103027431599736,10,0,1977,May -110,1520,840,386,16954,0.102178908220655,5,0,1977,Jun -123,1528,888,479,17361,0.0998366428726473,6,0,1977,Jul -103,1643,852,473,17694,0.0926366895833353,8,0,1977,Aug -97,1515,774,332,16222,0.0918149629077569,6,0,1977,Sep -140,1685,831,391,14969,0.090724303768407,12,0,1977,Oct -165,2000,889,370,13624,0.0900212072768793,15,0,1977,Nov -183,2215,1046,431,13842,0.0893307058230937,7,0,1977,Dec -148,1956,889,366,12387,0.0884427348717763,14,0,1978,Jan -111,1462,626,250,11608,0.0883525692744791,4,0,1978,Feb -116,1563,808,355,15021,0.0867573619308237,10,0,1978,Mar -115,1459,746,304,14834,0.0849952420449752,8,0,1978,Apr 
-100,1446,754,379,16565,0.0845679437213488,7,0,1978,May -106,1622,865,440,16882,0.0844318988774436,11,0,1978,Jun -134,1657,980,500,18012,0.0843508831482932,3,0,1978,Jul -125,1638,959,511,18855,0.0836009830491076,5,0,1978,Aug -117,1643,856,384,17243,0.0834172630524962,11,0,1978,Sep -122,1683,798,366,16045,0.0827451397987249,10,0,1978,Oct -153,2050,942,432,14745,0.0852352669035281,10,0,1978,Nov -178,2262,1010,390,13726,0.0847703028296526,7,0,1978,Dec -114,1813,796,306,11196,0.0844589214084587,10,0,1979,Jan -94,1445,643,232,12105,0.085352119244763,11,0,1979,Feb -128,1762,794,342,14723,0.0875592125175749,9,0,1979,Mar -119,1461,750,329,15582,0.0903829170614837,7,0,1979,Apr -111,1556,809,394,16863,0.0907832937355188,8,0,1979,May -110,1431,716,355,16758,0.108742780219868,13,0,1979,Jun -114,1427,851,385,17434,0.114142227335262,8,0,1979,Jul -118,1554,931,463,18359,0.112992933231466,5,0,1979,Aug -115,1645,834,453,17189,0.111320706029796,8,0,1979,Sep -132,1653,762,373,16909,0.109126229280665,7,0,1979,Oct -153,2016,880,401,15380,0.107698459343112,12,0,1979,Nov -171,2207,1077,466,15161,0.107601574334496,10,0,1979,Dec -115,1665,748,306,14027,0.103775019202843,7,0,1980,Jan -95,1361,593,263,14478,0.107114170431059,4,0,1980,Feb -92,1506,720,323,16155,0.107374774370757,10,0,1980,Mar -100,1360,646,310,16585,0.111695372689559,4,0,1980,Apr -95,1453,765,424,18117,0.110638184592354,8,0,1980,May -114,1522,820,403,17552,0.111855211329895,8,0,1980,Jun -102,1460,807,406,18299,0.109742342683337,7,0,1980,Jul -104,1552,885,466,19361,0.108193931510232,10,0,1980,Aug -132,1548,803,381,17924,0.106255362697951,8,0,1980,Sep -136,1827,860,369,17872,0.104193034427699,14,0,1980,Oct -117,1737,825,378,16058,0.101933972880902,8,0,1980,Nov -137,1941,911,392,15746,0.102793824574291,9,0,1980,Dec -111,1474,704,284,15226,0.10476034144929,8,0,1981,Jan -106,1458,691,316,14932,0.104002535534347,6,0,1981,Feb -98,1542,688,321,16846,0.116655515402424,7,0,1981,Mar -84,1404,714,358,16854,0.11516147558196,6,0,1981,Apr 
-94,1522,814,378,18146,0.112989543494316,5,0,1981,May -105,1385,736,382,17559,0.113860643932406,4,0,1981,Jun -123,1641,876,433,18655,0.119118081064489,5,0,1981,Jul -109,1510,829,506,19453,0.124489986005886,10,0,1981,Aug -130,1681,818,428,17923,0.123222945411622,7,0,1981,Sep -153,1938,942,479,17915,0.12067793212866,10,0,1981,Oct -134,1868,782,370,16496,0.121048982651421,12,0,1981,Nov -99,1726,823,349,13544,0.116968571491487,7,0,1981,Dec -115,1456,595,238,13601,0.112750259392875,4,0,1982,Jan -104,1445,673,285,15667,0.108079306704711,5,0,1982,Feb -131,1456,660,324,17358,0.108838515984019,6,0,1982,Mar -108,1365,676,346,18112,0.111291766408542,4,0,1982,Apr -103,1487,755,410,18581,0.111304009176187,4,0,1982,May -115,1558,815,411,18759,0.115454357532553,8,0,1982,Jun -122,1488,867,496,20668,0.114768296055692,8,0,1982,Jul -122,1684,933,534,21040,0.117207430931122,3,0,1982,Aug -125,1594,798,396,18993,0.119076397031248,7,0,1982,Sep -137,1850,950,470,18668,0.117965862171995,12,0,1982,Oct -138,1998,825,385,16768,0.117449127100423,2,0,1982,Nov -152,2079,911,411,16551,0.116988457838933,7,0,1982,Dec -120,1494,619,281,16231,0.11261053571781,8,0,1983,Jan -95,1057,426,300,15511,0.113657015681422,3,1,1983,Feb -100,1218,475,318,18308,0.113144445252379,2,1,1983,Mar -89,1168,556,391,17793,0.118495534815352,6,1,1983,Apr -82,1236,559,398,19205,0.117969401200945,3,1,1983,May -89,1076,483,337,19162,0.1176866141183,7,1,1983,Jun -60,1174,587,477,20997,0.120059238961094,6,1,1983,Jul -84,1139,615,422,20705,0.119437745680998,8,1,1983,Aug -113,1427,618,495,18759,0.118881271786551,8,1,1983,Sep -126,1487,662,471,19240,0.118462360710195,4,1,1983,Oct -122,1483,519,368,17504,0.118016598400236,3,1,1983,Nov -118,1513,585,345,16591,0.117706622543368,5,1,1983,Dec -92,1357,483,296,16224,0.117776089941536,5,1,1984,Jan -86,1165,434,319,16670,0.114796991716514,3,1,1984,Feb -81,1282,513,349,18539,0.11573525277085,4,1,1984,Mar -84,1110,548,375,19759,0.115356263024722,3,1,1984,Apr 
-87,1297,586,441,19584,0.114815360704668,6,1,1984,May -90,1185,522,465,19976,0.114777477886645,6,1,1984,Jun -79,1222,601,472,21486,0.114935980147534,7,1,1984,Jul -96,1284,644,521,21626,0.114796991716514,5,1,1984,Aug -122,1444,643,429,20195,0.114093156728444,7,1,1984,Sep -120,1575,641,408,19928,0.116465521799171,7,1,1984,Oct -137,1737,711,490,18564,0.116026113132354,4,1,1984,Nov -154,1763,721,491,18149,0.116066729379379,7,1,1984,Dec +casualties,drivers,front,rear,kms,petrol_price,van_killed,law,year,month +107,1687,867,269,9059,0.102971812,12,0,1969,Jan +97,1508,825,265,7685,0.102362996,6,0,1969,Feb +102,1507,806,319,9963,0.102062491,12,0,1969,Mar +87,1385,814,407,10955,0.100873301,8,0,1969,Apr +119,1632,991,454,11823,0.101019673,10,0,1969,May +106,1511,945,427,12391,0.100581192,13,0,1969,Jun +110,1559,1004,522,13460,0.103773981,11,0,1969,Jul +106,1630,1091,536,14055,0.104076404,6,0,1969,Aug +107,1579,958,405,12106,0.103773981,10,0,1969,Sep +134,1653,850,437,11372,0.103026401,16,0,1969,Oct +147,2152,1109,434,9834,0.102730112,13,0,1969,Nov +180,2148,1113,437,9267,0.101997192,14,0,1969,Dec +125,1752,925,316,9130,0.101274563,14,0,1970,Jan +134,1765,903,311,8933,0.100703976,6,0,1970,Feb +110,1717,1006,351,11000,0.100139607,8,0,1970,Mar +102,1558,892,362,10733,0.098621104,11,0,1970,Apr +103,1575,990,486,12912,0.098349285,7,0,1970,May +111,1520,866,429,12926,0.098080177,13,0,1970,Jun +120,1805,1095,551,13990,0.097279208,13,0,1970,Jul +129,1800,1204,646,14926,0.097410624,11,0,1970,Aug +122,1719,1029,456,12900,0.097425237,11,0,1970,Sep +183,2008,1147,475,12034,0.096380633,14,0,1970,Oct +169,2242,1171,456,10643,0.095738956,16,0,1970,Nov +190,2478,1299,468,10742,0.095106306,14,0,1970,Dec +134,2030,944,356,10266,0.096735967,17,0,1971,Jan +108,1655,874,271,10281,0.096109222487367807,16,0,1971,Feb +104,1693,840,354,11527,0.095367255,15,0,1971,Mar +117,1623,893,427,12281,0.094709592,13,0,1971,Apr +157,1805,1007,465,13587,0.09411762,13,0,1971,May 
+148,1746,973,440,13049,0.093532155,15,0,1971,Jun +130,1795,1097,539,16055,0.092954049,12,0,1971,Jul +140,1926,1194,646,15220,0.092839786,6,0,1971,Aug +136,1619,988,457,13824,0.092724736,9,0,1971,Sep +140,1992,1077,446,12729,0.092269651,13,0,1971,Oct +187,2233,1045,402,11467,0.091706685,14,0,1971,Nov +150,2192,1115,441,11351,0.091262072,15,0,1971,Dec +159,2080,1005,359,10803,0.090711603,14,0,1972,Jan +143,1768,857,334,10548,0.090276328,3,0,1972,Feb +114,1835,879,312,12368,0.089951918,12,0,1972,Mar +127,1569,887,427,13311,0.089099639,13,0,1972,Apr +159,1976,1075,434,13885,0.088679193,12,0,1972,May +156,1853,1121,486,14088,0.088159289,8,0,1972,Jun +138,1965,1190,569,16932,0.088902057,8,0,1972,Jul +120,1689,1058,523,16164,0.088181331,15,0,1972,Aug +117,1778,939,418,14883,0.088940293,8,0,1972,Sep +170,1976,1074,452,13532,0.08772661,5,0,1972,Oct +168,2397,1089,462,12220,0.087428846,17,0,1972,Nov +198,2654,1208,497,12025,0.08703543,14,0,1972,Dec +144,2097,903,354,11692,0.086449919,13,0,1973,Jan +146,1963,916,347,11081,0.085872641,5,0,1973,Feb +109,1677,787,276,13745,0.085398222,8,0,1973,Mar +131,1941,1114,472,14382,0.083821981,5,0,1973,Apr +151,2003,1014,487,14391,0.08459078,12,0,1973,May +140,1813,1022,505,15597,0.084136904,11,0,1973,Jun +153,2012,1114,619,16834,0.083778405,13,0,1973,Jul +140,1912,1132,640,17282,0.083510743,15,0,1973,Aug +161,2084,1111,559,15779,0.082806394,11,0,1973,Sep +168,2080,1008,453,13946,0.081178893,11,0,1973,Oct +152,2118,916,418,12701,0.082853607,10,0,1973,Nov +136,2150,992,419,10431,0.094190119,13,0,1973,Dec +113,1608,731,262,11616,0.092399843,8,0,1974,Jan +100,1503,665,299,10808,0.108161478,6,0,1974,Feb +103,1548,724,303,12421,0.107211689,8,0,1974,Mar +103,1382,744,401,13605,0.114042967,14,0,1974,Apr +121,1731,910,413,14455,0.112454116,12,0,1974,May +134,1798,883,426,15019,0.111316253,14,0,1974,Jun +133,1779,900,516,15662,0.110301252,13,0,1974,Jul +129,1887,1057,600,16745,0.108197177,9,0,1974,Aug 
+144,2004,1076,459,14717,0.107027443,4,0,1974,Sep +154,2077,919,443,13756,0.104946981,13,0,1974,Oct +156,2092,920,412,12531,0.119357749,6,0,1974,Nov +163,2051,953,400,12568,0.117621904,15,0,1974,Dec +122,1577,664,278,11249,0.133027421,12,0,1975,Jan +92,1356,607,302,11096,0.130845244,16,0,1975,Feb +117,1652,777,381,12637,0.128318477,7,0,1975,Mar +95,1382,633,279,13018,0.123547448,12,0,1975,Apr +96,1519,791,442,15005,0.118586812,10,0,1975,May +108,1421,790,409,15235,0.11633748,9,0,1975,Jun +108,1442,803,416,15552,0.115161476,9,0,1975,Jul +106,1543,884,511,16905,0.114501197,6,0,1975,Aug +140,1656,769,393,14776,0.113522979,7,0,1975,Sep +114,1561,732,345,14104,0.111930179,13,0,1975,Oct +158,1905,859,391,12854,0.110610529,14,0,1975,Nov +161,2199,994,470,12956,0.115274389,13,0,1975,Dec +102,1473,704,266,12177,0.113793486,14,0,1976,Jan +127,1655,684,312,11918,0.112349582,11,0,1976,Feb +125,1407,671,300,13517,0.111753469,11,0,1976,Mar +101,1395,643,373,14417,0.109642523,10,0,1976,Apr +97,1530,771,412,15911,0.108440895,4,0,1976,May +112,1309,644,322,15589,0.107884939,8,0,1976,Jun +112,1526,828,458,16543,0.109084769,9,0,1976,Jul +113,1327,748,427,17925,0.10757145,10,0,1976,Aug +108,1627,767,346,15406,0.106164022,10,0,1976,Sep +128,1748,825,421,14601,0.106299999,5,0,1976,Oct +154,1958,810,344,13107,0.104825313,13,0,1976,Nov +162,2274,986,370,12268,0.103451746,12,0,1976,Dec +112,1648,714,291,11972,0.10144992,10,0,1977,Jan +79,1401,567,224,12028,0.100402316,9,0,1977,Feb +82,1411,616,266,14033,0.098862034,7,0,1977,Mar +127,1403,678,338,14244,0.102496154,5,0,1977,Apr +108,1394,742,298,15287,0.103027432,10,0,1977,May +110,1520,840,386,16954,0.102178908,5,0,1977,Jun +123,1528,888,479,17361,0.099836643,6,0,1977,Jul +103,1643,852,473,17694,0.09263669,8,0,1977,Aug +97,1515,774,332,16222,0.091814963,6,0,1977,Sep +140,1685,831,391,14969,0.090724304,12,0,1977,Oct +165,2000,889,370,13624,0.090021207,15,0,1977,Nov +183,2215,1046,431,13842,0.089330706,7,0,1977,Dec 
+148,1956,889,366,12387,0.088442735,14,0,1978,Jan +111,1462,626,250,11608,0.088352569,4,0,1978,Feb +116,1563,808,355,15021,0.086757362,10,0,1978,Mar +115,1459,746,304,14834,0.084995242,8,0,1978,Apr +100,1446,754,379,16565,0.084567944,7,0,1978,May +106,1622,865,440,16882,0.084431899,11,0,1978,Jun +134,1657,980,500,18012,0.084350883,3,0,1978,Jul +125,1638,959,511,18855,0.083600983,5,0,1978,Aug +117,1643,856,384,17243,0.083417263,11,0,1978,Sep +122,1683,798,366,16045,0.08274514,10,0,1978,Oct +153,2050,942,432,14745,0.085235267,10,0,1978,Nov +178,2262,1010,390,13726,0.084770303,7,0,1978,Dec +114,1813,796,306,11196,0.084458921,10,0,1979,Jan +94,1445,643,232,12105,0.085352119,11,0,1979,Feb +128,1762,794,342,14723,0.087559213,9,0,1979,Mar +119,1461,750,329,15582,0.090382917,7,0,1979,Apr +111,1556,809,394,16863,0.090783294,8,0,1979,May +110,1431,716,355,16758,0.10874278,13,0,1979,Jun +114,1427,851,385,17434,0.114142227,8,0,1979,Jul +118,1554,931,463,18359,0.112992933,5,0,1979,Aug +115,1645,834,453,17189,0.111320706,8,0,1979,Sep +132,1653,762,373,16909,0.109126229,7,0,1979,Oct +153,2016,880,401,15380,0.107698459,12,0,1979,Nov +171,2207,1077,466,15161,0.107601574,10,0,1979,Dec +115,1665,748,306,14027,0.103775019,7,0,1980,Jan +95,1361,593,263,14478,0.10711417,4,0,1980,Feb +92,1506,720,323,16155,0.107374774,10,0,1980,Mar +100,1360,646,310,16585,0.111695373,4,0,1980,Apr +95,1453,765,424,18117,0.110638185,8,0,1980,May +114,1522,820,403,17552,0.111855211,8,0,1980,Jun +102,1460,807,406,18299,0.109742343,7,0,1980,Jul +104,1552,885,466,19361,0.108193932,10,0,1980,Aug +132,1548,803,381,17924,0.106255363,8,0,1980,Sep +136,1827,860,369,17872,0.104193034,14,0,1980,Oct +117,1737,825,378,16058,0.101933973,8,0,1980,Nov +137,1941,911,392,15746,0.102793825,9,0,1980,Dec +111,1474,704,284,15226,0.104760341,8,0,1981,Jan +106,1458,691,316,14932,0.104002536,6,0,1981,Feb +98,1542,688,321,16846,0.116655515,7,0,1981,Mar +84,1404,714,358,16854,0.115161476,6,0,1981,Apr 
+94,1522,814,378,18146,0.112989543,5,0,1981,May +105,1385,736,382,17559,0.113860644,4,0,1981,Jun +123,1641,876,433,18655,0.119118081,5,0,1981,Jul +109,1510,829,506,19453,0.124489986,10,0,1981,Aug +130,1681,818,428,17923,0.123222945,7,0,1981,Sep +153,1938,942,479,17915,0.120677932,10,0,1981,Oct +134,1868,782,370,16496,0.121048983,12,0,1981,Nov +99,1726,823,349,13544,0.116968571,7,0,1981,Dec +115,1456,595,238,13601,0.112750259,4,0,1982,Jan +104,1445,673,285,15667,0.108079307,5,0,1982,Feb +131,1456,660,324,17358,0.108838516,6,0,1982,Mar +108,1365,676,346,18112,0.111291766,4,0,1982,Apr +103,1487,755,410,18581,0.111304009,4,0,1982,May +115,1558,815,411,18759,0.115454358,8,0,1982,Jun +122,1488,867,496,20668,0.114768296,8,0,1982,Jul +122,1684,933,534,21040,0.117207431,3,0,1982,Aug +125,1594,798,396,18993,0.119076397,7,0,1982,Sep +137,1850,950,470,18668,0.117965862,12,0,1982,Oct +138,1998,825,385,16768,0.117449127,2,0,1982,Nov +152,2079,911,411,16551,0.116988458,7,0,1982,Dec +120,1494,619,281,16231,0.112610536,8,0,1983,Jan +95,1057,426,300,15511,0.113657016,3,1,1983,Feb +100,1218,475,318,18308,0.113144445,2,1,1983,Mar +89,1168,556,391,17793,0.118495535,6,1,1983,Apr +82,1236,559,398,19205,0.117969401,3,1,1983,May +89,1076,483,337,19162,0.117686614,7,1,1983,Jun +60,1174,587,477,20997,0.120059239,6,1,1983,Jul +84,1139,615,422,20705,0.119437746,8,1,1983,Aug +113,1427,618,495,18759,0.118881272,8,1,1983,Sep +126,1487,662,471,19240,0.118462361,4,1,1983,Oct +122,1483,519,368,17504,0.118016598,3,1,1983,Nov +118,1513,585,345,16591,0.117706623,5,1,1983,Dec +92,1357,483,296,16224,0.11777609,5,1,1984,Jan +86,1165,434,319,16670,0.114796992,3,1,1984,Feb +81,1282,513,349,18539,0.115735253,4,1,1984,Mar +84,1110,548,375,19759,0.115356263,3,1,1984,Apr +87,1297,586,441,19584,0.114815361,6,1,1984,May +90,1185,522,465,19976,0.114777478,6,1,1984,Jun +79,1222,601,472,21486,0.11493598,7,1,1984,Jul +96,1284,644,521,21626,0.114796992,5,1,1984,Aug +122,1444,643,429,20195,0.114093157,7,1,1984,Sep 
+120,1575,641,408,19928,0.116465522,7,1,1984,Oct +137,1737,711,490,18564,0.116026113,4,1,1984,Nov +154,1763,721,491,18149,0.116066729,7,1,1984,Dec \ No newline at end of file diff --git a/materials/glm-practical-logistic-proportion.qmd b/materials/glm-practical-logistic-proportion.qmd index 10dc60d..b1e671e 100644 --- a/materials/glm-practical-logistic-proportion.qmd +++ b/materials/glm-practical-logistic-proportion.qmd @@ -347,7 +347,7 @@ We get quite a high score (around 0.9) for this, which tells us that our goodnes Is the model any better than the null though? ```{r} -1 - pchisq(16.375 - 12.633, 1) +1 - pchisq(16.375 - 12.633, 21 - 20) anova(glm_chl_new, test = 'Chisq') ``` @@ -414,8 +414,21 @@ We get quite a high score (around 0.9) for this, which tells us that our goodnes Is the model any better than the null though? +First we need to define the null model: + +```{python} +# create a linear model +model = smf.glm(formula = "damage + intact ~ 1", + family = sm.families.Binomial(), + data = challenger_new_py) +# and get the fitted parameters of the model +glm_chl_new_null_py = model.fit() + +print(glm_chl_new_null_py.summary()) +``` + ```{python} -chi2.sf(16.375 - 12.633, 23 - 22) +chi2.sf(16.375 - 12.633, 21 - 20) ``` However, the model is not significantly better than the null in this case, with a p-value here of just over 0.05 for both of these tests (they give a similar result since, yet again, we have just the one predictor variable). 
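The model-versus-null comparison in the hunk above reduces to a single chi-squared tail probability on the drop in deviance. As a standalone sketch (the deviances 16.375 and 12.633 and their degrees of freedom are taken from the text above; `scipy` is the only dependency), the likelihood-ratio test can be reproduced directly:

```python
from scipy.stats import chi2

# deviances and degrees of freedom quoted in the text above
null_deviance, null_df = 16.375, 21
resid_deviance, resid_df = 12.633, 20

# likelihood-ratio test: the drop in deviance is compared against a
# chi-squared distribution with df equal to the difference in df
lr_stat = null_deviance - resid_deviance
p_value = chi2.sf(lr_stat, null_df - resid_df)

print(f"LR statistic = {lr_stat:.3f}, p = {p_value:.4f}")
```

This gives a p-value just over 0.05, matching both the R (`1 - pchisq`) and Python (`chi2.sf`) calls in the hunk above.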
diff --git a/materials/glm-practical-poisson.qmd b/materials/glm-practical-poisson.qmd index 783b727..56d7366 100644 --- a/materials/glm-practical-poisson.qmd +++ b/materials/glm-practical-poisson.qmd @@ -392,7 +392,7 @@ We have to convert the `law` column to a factor, otherwise R will see it as nume ```{r} seatbelts %>% - ggplot(aes(as_factor(law), drivers_killed)) + + ggplot(aes(as_factor(law), casualties)) + geom_boxplot() ``` @@ -400,10 +400,32 @@ The data are recorded by month and year, so we can also display the number of dr ```{r} seatbelts %>% - ggplot(aes(year, drivers_killed)) + + ggplot(aes(year, casualties)) + geom_point() ``` +## Python + +We have to convert the `law` column to a factor, otherwise Python will see it as numerical. + +```{python} +#| results: hide +(ggplot(seatbelts_py, + aes(x = seatbelts_py.law.astype(object), + y = "casualties")) + + geom_boxplot()) +``` + +The data are recorded by month and year, so we can also display the number of casualties by year: + +```{python} +#| results: hide +(ggplot(seatbelts_py, + aes(x = "year", + y = "casualties")) + + geom_point()) +``` + ::: The data look a bit weird. There is quite some variation within years (keeping in mind that the data are aggregated monthly). The data also seem to wave around a bit... with some vague peaks (e.g. 1972 - 1973) and some troughs (e.g. around 1976).
@@ -416,7 +438,7 @@ So my initial thought is that these data are going to be a bit tricky to interpr ## R ```{r} -glm_stb <- glm(drivers_killed ~ year, +glm_stb <- glm(casualties ~ year, data = seatbelts, family = "poisson") ``` @@ -428,13 +450,37 @@ summary(glm_stb) ``` (Intercept) 37.168958 -year 0.016373 +year -0.016373 +``` + +## Python + +```{python} +# create a linear model +model = smf.glm(formula = "casualties ~ year", + family = sm.families.Poisson(), + data = seatbelts_py) +# and get the fitted parameters of the model +glm_stb_py = model.fit() +``` + +```{python} +print(glm_stb_py.summary()) +``` + +``` +====================== + coef +---------------------- +Intercept 37.1690 +year -0.0164 +====================== ``` ::: These are the coefficients of the Poisson model equation and need to be placed in the following formula in order to estimate the expected number of casualties as a function of year: -$$ E(drivers\_killed) = \exp(37.17 + 0.164 \times year) $$ +$$ E(casualties) = \exp(37.17 - 0.0164 \times year) $$ #### Assessing significance @@ -447,9 +493,15 @@ Is the model well-specified? 1 - pchisq(850.41, 190) ``` -This value indicates that the model is actually pretty good. Remember, it is between $[0, 1]$ and the closer to zero, the better the model. +## Python + +```{python} +chi2.sf(850.41, 190) +``` ::: +This value is essentially zero, which indicates that the model is not particularly well-specified. Remember, it is between $[0, 1]$ and the closer to 1, the better the model fits the data. + How about the overall fit? ::: {.panel-tabset group="language"} @@ -459,10 +511,28 @@ How about the overall fit? 1 - pchisq(984.50 - 850.41, 191 - 190) ``` -Again, this indicates that the model is markedly better than the null model.
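Both checks in this section are tail probabilities of deviances. The sketch below is a summary rather than part of the original materials: the deviances 984.50 and 850.41 and their degrees of freedom come from the model summaries above, while the dispersion ratio at the end is an extra quick check that the text does not perform:

```python
from scipy.stats import chi2

# deviances quoted in the model summaries above
null_deviance, null_df = 984.50, 191    # casualties ~ 1
resid_deviance, resid_df = 850.41, 190  # casualties ~ year

# goodness of fit: upper tail of the residual deviance
p_gof = chi2.sf(resid_deviance, resid_df)

# fitted model vs null: drop in deviance on one degree of freedom
p_lrt = chi2.sf(null_deviance - resid_deviance, null_df - resid_df)

# dispersion ratio: roughly 1 for a well-behaved Poisson model;
# values well above 1 point at overdispersion
dispersion = resid_deviance / resid_df

print(p_gof, p_lrt, round(dispersion, 2))
```

The tiny `p_lrt` is why the model is markedly better than the null, even though the dispersion ratio (around 4.5) suggests the Poisson assumption itself is strained.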
+## Python +First we need to define the null model: + +```{python} +# create a linear model +model = smf.glm(formula = "casualties ~ 1", + family = sm.families.Poisson(), + data = seatbelts_py) +# and get the fitted parameters of the model +glm_stb_null_py = model.fit() + +print(glm_stb_null_py.summary()) +``` + +```{python} +chi2.sf(984.50 - 850.41, 191 - 190) +``` ::: +Again, this indicates that the model is markedly better than the null model. + #### Plotting the regression ::: {.panel-tabset group="language"} @@ -471,13 +541,31 @@ Again, this indicates that the model is markedly better than the null model. ```{r} #| message: false #| warning: false -ggplot(seatbelts, aes(year, drivers_killed)) + +ggplot(seatbelts, aes(year, casualties)) + geom_point() + geom_smooth(method = "glm", se = FALSE, fullrange = TRUE, method.args = list(family = poisson)) + xlim(1970,1985) ``` +## Python + +```{python} +model = pd.DataFrame({'year': list(range(1968, 1985))}) + +model["pred"] = glm_stb_py.predict(model) + +model.head() +``` + +```{python} +#| results: hide +(ggplot(seatbelts_py, + aes(x = "year", + y = "casualties")) + + geom_point() + + geom_line(model, aes(x = "year", y = "pred"), colour = "blue", size = 1)) +``` :::
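As a sanity check on the log link, the predicted values in `model.head()` can be reproduced by hand from the full-precision coefficients reported in the R summary (intercept 37.168958, slope -0.016373). This is a sketch using those reported values, not a replacement for `glm_stb_py.predict`:

```python
import math

# coefficients from the casualties ~ year Poisson model summary
intercept = 37.168958
slope = -0.016373

def predict_casualties(year):
    # Poisson GLM with log link: E(casualties) = exp(intercept + slope * year)
    return math.exp(intercept + slope * year)

# close to the first predicted value shown above (~140.74 for 1968)
print(round(predict_casualties(1968), 2))
```

Working back through the inverse link like this is a quick way to confirm that the coefficients, and not just the fitted curve, are the ones you think they are.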