diff --git a/_freeze/materials/glm-practical-logistic-binary/execute-results/html.json b/_freeze/materials/glm-practical-logistic-binary/execute-results/html.json index 8e9870a..2009dde 100644 --- a/_freeze/materials/glm-practical-logistic-binary/execute-results/html.json +++ b/_freeze/materials/glm-practical-logistic-binary/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "28252b951f9aa43eac78479dce327bea", + "hash": "31c1151d757c28ef559650d26fb3cb04", "result": { "engine": "knitr", - "markdown": "---\ntitle: \"Binary response\"\n---\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n\n\n::: {.callout-tip}\n## Learning outcomes\n\n**Questions**\n\n- How do we analyse data with a binary outcome?\n- Can we test if our model is any good?\n- Be able to perform a logistic regression with a binary outcome\n- Predict outcomes of new data, based on a defined model\n\n**Objectives**\n\n- Be able to analyse binary outcome data\n- Understand different methods of testing model fit\n- Be able to make model predictions\n:::\n\n## Libraries and functions\n\n::: {.callout-note collapse=\"true\"}\n## Click to expand\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n### Libraries\n### Functions\n\n## Python\n\n### Libraries\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# A maths library\nimport math\n# A Python data analysis and manipulation tool\nimport pandas as pd\n\n# Python equivalent of `ggplot2`\nfrom plotnine import *\n\n# Statistical models, conducting tests and statistical data exploration\nimport statsmodels.api as sm\n\n# Convenience interface for specifying models using formula strings and DataFrames\nimport statsmodels.formula.api as smf\n```\n:::\n\n\n### Functions\n:::\n:::\n\nThe example in this section uses the following data set:\n\n`data/finches_early.csv`\n\nThese data come from an analysis of gene flow across two finch species [@lamichhaney2020]. They are slightly adapted here for illustrative purposes.\n\nThe data focus on two species, _Geospiza fortis_ and _G. scandens_. The original measurements are split by a uniquely timed event: a particularly strong El Niño event in 1983. This event changed the vegetation and food supply of the finches, allowing F1 hybrids of the two species to survive, whereas before 1983 they could not. The measurements are classed as `early` (pre-1983) and `late` (1983 onwards).\n\nHere we are looking only at the `early` data. We are specifically focussing on the beak shape classification, which we saw earlier in @fig-beak_shape_glm.\n\n## Load and visualise the data\n\nFirst we load the data, then we visualise it.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nearly_finches <- read_csv(\"data/finches_early.csv\")\n```\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nearly_finches_py = pd.read_csv(\"data/finches_early.csv\")\n```\n:::\n\n\n:::\n\nLooking at the data, we can see that the `pointed_beak` column contains zeros and ones. These are actually yes/no classification outcomes and not numeric representations.\n\nWe'll have to deal with this soon. 
For now, we can plot the data:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(early_finches,\n aes(x = factor(pointed_beak),\n y = blength)) +\n geom_boxplot()\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-binary_files/figure-html/unnamed-chunk-6-1.png){width=672}\n:::\n:::\n\n\n## Python\n\nWe could just give Python the `pointed_beak` data directly, but then it would view the values as numeric. Which doesn't really work, because we have two groups as such: those with a pointed beak (`1`), and those with a blunt one (`0`).\n\nWe can force Python to temporarily covert the data to a factor, by making the `pointed_beak` column an `object` type. We can do this directly inside the `ggplot()` function.\n\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(early_finches_py,\n aes(x = early_finches_py.pointed_beak.astype(object),\n y = \"blength\")) +\n geom_boxplot())\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-binary_files/figure-html/unnamed-chunk-7-1.png){width=614}\n:::\n:::\n\n:::\n\nIt looks as though the finches with blunt beaks generally have shorter beak lengths.\n\nWe can visualise that differently by plotting all the data points as a classic binary response plot:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(early_finches,\n aes(x = blength, y = pointed_beak)) +\n geom_point()\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-binary_files/figure-html/unnamed-chunk-8-3.png){width=672}\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(early_finches_py,\n aes(x = \"blength\",\n y = \"pointed_beak\")) +\n geom_point())\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-binary_files/figure-html/unnamed-chunk-9-1.png){width=614}\n:::\n:::\n\n\n:::\n\nThis presents us with a bit of an issue. We could fit a linear regression model to these data, although we already know that this is a bad idea...\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(early_finches,\n aes(x = blength, y = pointed_beak)) +\n geom_point() +\n geom_smooth(method = \"lm\", se = FALSE)\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-binary_files/figure-html/unnamed-chunk-10-3.png){width=672}\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(early_finches_py,\n aes(x = \"blength\",\n y = \"pointed_beak\")) +\n geom_point() +\n geom_smooth(method = \"lm\",\n colour = \"blue\",\n se = False))\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-binary_files/figure-html/unnamed-chunk-11-1.png){width=614}\n:::\n:::\n\n\n:::\n\nOf course this is rubbish - we can't have a beak classification outside the range of $[0, 1]$. 
It's either blunt (`0`) or pointed (`1`).\n\nBut for the sake of exploration, let's look at the assumptions:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlm_bks <- lm(pointed_beak ~ blength,\n data = early_finches)\n\nresid_panel(lm_bks,\n plots = c(\"resid\", \"qq\", \"ls\", \"cookd\"),\n smoother = TRUE)\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-binary_files/figure-html/unnamed-chunk-12-3.png){width=672}\n:::\n:::\n\n\n## Python\n\nFirst, we create a linear model:\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# create a linear model\nmodel = smf.ols(formula = \"pointed_beak ~ blength\",\n data = early_finches_py)\n# and get the fitted parameters of the model\nlm_bks_py = model.fit()\n```\n:::\n\n\nNext, we can create the diagnostic plots:\n\n::: {.cell}\n\n```{.python .cell-code}\ndgplots(lm_bks_py)\n```\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n![](glm-practical-logistic-binary_files/figure-html/unnamed-chunk-15-1.png){width=96}\n:::\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n![](images/dgplots/2024_01_19-06-05-54_PM_dgplots.png){width=804}\n:::\n:::\n\n\n:::\n\nThey're ~~pretty~~ extremely bad.\n\n- The response is not linear (Residual Plot, binary response plot, common sense).\n- The residuals do not appear to be distributed normally (Q-Q Plot)\n- The variance is not homogeneous across the predicted values (Location-Scale Plot)\n- But - there is always a silver lining - we don't have influential data points.\n\n## Creating a suitable model\n\nSo far we've established that using a simple linear model to describe a potential relationship between beak length and the probability of having a pointed beak is not a good idea. So, what _can_ we do?\n\nOne of the ways we can deal with binary outcome data is by performing a logistic regression. Instead of fitting a straight line to our data, and performing a regression on that, we fit a line that has an S shape. This avoids the model making predictions outside the $[0, 1]$ range.\n\nWe described our standard linear relationship as follows:\n\n$Y = \\beta_0 + \\beta_1X$\n\nWe can now map this to our non-linear relationship via the **logistic link function**:\n\n$Y = \\frac{\\exp(\\beta_0 + \\beta_1X)}{1 + \\exp(\\beta_0 + \\beta_1X)}$\n\nNote that the $\\beta_0 + \\beta_1X$ part is identical to the formula of a straight line.\n\nThe rest of the function is what makes the straight line curve into its characteristic S shape. \n\n:::{.callout-note collapse=true}\n## Euler's number ($\\exp$): would you like to know more?\n\nIn mathematics, $\\rm e$ represents a constant of around 2.718. Another notation is $\\exp$, which is often used when notations become a bit cumbersome. Here, I exclusively use the $\\exp$ notation for consistency.\n:::\n\n::: {.callout-important}\n## The logistic function\n\nThe shape of the logistic function is hugely influenced by the different parameters, in particular $\\beta_1$. 
The plots below show different situations, where $\\beta_0 = 0$ in all cases, but $\\beta_1$ varies.\n\nThe first plot shows the logistic function in its simplest form, with the others showing the effect of varying $\\beta_1$.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](glm-practical-logistic-binary_files/figure-html/unnamed-chunk-17-1.png){width=672}\n:::\n:::\n\n\n* when $\\beta_1 = 1$, this gives the simplest logistic function\n* when $\\beta_1 = 0$ gives a horizontal line, with $Y = \\frac{\\exp(\\beta_0)}{1+\\exp(\\beta_0)}$\n* when $\\beta_1$ is negative flips the curve around, so it slopes down\n* when $\\beta_1$ is very large then the curve becomes extremely steep\n\n:::\n\nWe can fit such an S-shaped curve to our `early_finches` data set, by creating a generalised linear model.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nIn R we have a few options to do this, and by far the most familiar function would be `glm()`. Here we save the model in an object called `glm_bks`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nglm_bks <- glm(pointed_beak ~ blength,\n family = binomial,\n data = early_finches)\n```\n:::\n\n\nThe format of this function is similar to that used by the `lm()` function for linear models. The important difference is that we must specify the _family_ of error distribution to use. For logistic regression we must set the family to **binomial**.\n\nIf you forget to set the `family` argument, then the `glm()` function will perform a standard linear model fit, identical to what the `lm()` function would do.\n\n## Python\n\nIn Python we have a few options to do this, and by far the most familiar function would be `glm()`. Here we save the model in an object called `glm_bks_py`:\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# create a linear model\nmodel = smf.glm(formula = \"pointed_beak ~ blength\",\n family = sm.families.Binomial(),\n data = early_finches_py)\n# and get the fitted parameters of the model\nglm_bks_py = model.fit()\n```\n:::\n\n\nThe format of this function is similar to that used by the `ols()` function for linear models. The important difference is that we must specify the _family_ of error distribution to use. For logistic regression we must set the family to **binomial**. This is buried deep inside the `statsmodels` package and needs to be defined as `sm.families.Binomial()`.\n\n:::\n\n## Model output\n\nThat's the easy part done! The trickier part is interpreting the output. First of all, we'll get some summary information.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(glm_bks)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nglm(formula = pointed_beak ~ blength, family = binomial, data = early_finches)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) -43.410 15.250 -2.847 0.00442 **\nblength 3.387 1.193 2.839 0.00452 **\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for binomial family taken to be 1)\n\n Null deviance: 84.5476 on 60 degrees of freedom\nResidual deviance: 9.1879 on 59 degrees of freedom\nAIC: 13.188\n\nNumber of Fisher Scoring iterations: 8\n```\n\n\n:::\n:::\n\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nprint(glm_bks_py.summary())\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n Generalized Linear Model Regression Results \n==============================================================================\nDep. Variable: pointed_beak No. 
Observations: 61\nModel: GLM Df Residuals: 59\nModel Family: Binomial Df Model: 1\nLink Function: Logit Scale: 1.0000\nMethod: IRLS Log-Likelihood: -4.5939\nDate: Fri, 19 Jan 2024 Deviance: 9.1879\nTime: 18:05:55 Pearson chi2: 15.1\nNo. Iterations: 8 Pseudo R-squ. (CS): 0.7093\nCovariance Type: nonrobust \n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept -43.4096 15.250 -2.847 0.004 -73.298 -13.521\nblength 3.3866 1.193 2.839 0.005 1.049 5.724\n==============================================================================\n```\n\n\n:::\n:::\n\n\n:::\n\nThere’s a lot to unpack here, but let's start with what we're familiar with: coefficients!\n\n## Parameter interpretation\n\n::: {.panel-tabset group=\"language\"}\n## R\nThe coefficients or parameters can be found in the `Coefficients` block. The main numbers to extract from the output are the two numbers underneath `Estimate.Std`:\n\n```\nCoefficients:\n Estimate Std.\n(Intercept) -43.410\nblength 3.387 \n```\n\n## Python\n\nRight at the bottom is a table showing the model coefficients. The main numbers to extract from the output are the two numbers in the `coef` column:\n\n```\n======================\n coef\n----------------------\nIntercept -43.4096\nblength 3.3866\n======================\n```\n\n:::\n\nThese are the coefficients of the logistic model equation and need to be placed in the correct equation if we want to be able to calculate the probability of having a pointed beak for a given beak length.\n\nThe $p$ values at the end of each coefficient row merely show whether that particular coefficient is significantly different from zero. This is similar to the $p$ values obtained in the summary output of a linear model. As with continuous predictors in simple models, these $p$ values can be used to decide whether that predictor is important (so in this case beak length appears to be significant). However, these $p$ values aren’t great to work with when we have multiple predictor variables, or when we have categorical predictors with multiple levels (since the output will give us a $p$ value for each level rather than for the predictor as a whole).\n\nWe can use the coefficients to calculate the probability of having a pointed beak for a given beak length:\n\n$$ P(pointed \\ beak) = \\frac{\\exp(-43.41 + 3.39 \\times blength)}{1 + \\exp(-43.41 + 3.39 \\times blength)} $$\n\nHaving this formula means that we can calculate the probability of having a pointed beak for any beak length. How do we work this out in practice? 
\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nWell, the probability of having a pointed beak if the beak length is large (for example 15 mm) can be calculated as follows:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nexp(-43.41 + 3.39 * 15) / (1 + exp(-43.41 + 3.39 * 15))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0.9994131\n```\n\n\n:::\n:::\n\n\nIf the beak length is small (for example 10 mm), the probability of having a pointed beak is extremely low:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nexp(-43.41 + 3.39 * 10) / (1 + exp(-43.41 + 3.39 * 10))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 7.410155e-05\n```\n\n\n:::\n:::\n\n\n## Python\n\nWell, the probability of having a pointed beak if the beak length is large (for example 15 mm) can be calculated as follows:\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# import the math library\nimport math\n```\n:::\n\n::: {.cell}\n\n```{.python .cell-code}\nmath.exp(-43.41 + 3.39 * 15) / (1 + math.exp(-43.41 + 3.39 * 15))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n0.9994130595039192\n```\n\n\n:::\n:::\n\n\nIf the beak length is small (for example 10 mm), the probability of having a pointed beak is extremely low:\n\n\n::: {.cell}\n\n```{.python .cell-code}\nmath.exp(-43.41 + 3.39 * 10) / (1 + math.exp(-43.41 + 3.39 * 10))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n7.410155028945912e-05\n```\n\n\n:::\n:::\n\n:::\n\nWe can calculate the the probabilities for all our observed values and if we do that then we can see that the larger the beak length is, the higher the probability that a beak shape would be pointed. I'm visualising this together with the logistic curve, where the blue points are the calculated probabilities:\n\n::: {.callout-note collapse=true}\n## Code available here\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nglm_bks %>% \n augment(type.predict = \"response\") %>% \n ggplot() +\n geom_point(aes(x = blength, y = pointed_beak)) +\n geom_line(aes(x = blength, y = .fitted),\n linetype = \"dashed\",\n colour = \"blue\") +\n geom_point(aes(x = blength, y = .fitted),\n colour = \"blue\", alpha = 0.5) +\n labs(x = \"beak length (mm)\",\n y = \"Probability\")\n```\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(early_finches_py) +\n geom_point(aes(x = \"blength\", y = \"pointed_beak\")) +\n geom_line(aes(x = \"blength\", y = glm_bks_py.fittedvalues),\n linetype = \"dashed\",\n colour = \"blue\") +\n geom_point(aes(x = \"blength\", y = glm_bks_py.fittedvalues),\n colour = \"blue\", alpha = 0.5) +\n labs(x = \"beak length (mm)\",\n y = \"Probability\"))\n```\n:::\n\n:::\n:::\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Predicted probabilities for beak classification](glm-practical-logistic-binary_files/figure-html/fig-beak_class_glm_probs-2.png){#fig-beak_class_glm_probs width=672}\n:::\n:::\n\n\nThe graph shows us that, based on the data that we have and the model we used to make predictions about our response variable, the probability of seeing a pointed beak increases with beak length.\n\nShort beaks are more closely associated with the bluntly shaped beaks, whereas long beaks are more closely associated with the pointed shape. 
It's also clear that there is a range of beak lengths (around 13 mm) where the probability of getting one shape or another is much more even.\n\n\n\n\n\n\n\n\n\n\n\n\n\n## Exercises\n\n### Diabetes {#sec-exr_diabetes}\n\n:::{.callout-exercise}\n\n\n{{< level 2 >}}\n\n\n\nFor this exercise we'll be using the data from `data/diabetes.csv`.\n\nThis is a data set comprising 768 observations of three variables (one dependent and two predictor variables). This records the results of a diabetes test result as a binary variable (1 is a positive result, 0 is a negative result), along with the result of a glucose tolerance test and the diastolic blood pressure for each of 768 women. The variables are called `test_result`, `glucose` and `diastolic`.\n\nWe want to see if the `glucose` tolerance is a meaningful predictor for predictions on a diabetes test. To investigate this, do the following:\n\n1. Load and visualise the data\n2. Create a suitable model\n3. Determine if there are any statistically significant predictors\n4. Calculate the probability of a positive diabetes test result for a glucose tolerance test value of `glucose = 150`\n\n::: {.callout-answer collapse=\"true\"}\n\n#### Load and visualise the data\n\nFirst we load the data, then we visualise it.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndiabetes <- read_csv(\"data/diabetes.csv\")\n```\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\ndiabetes_py = pd.read_csv(\"data/diabetes.csv\")\n```\n:::\n\n\n:::\n\nLooking at the data, we can see that the `test_result` column contains zeros and ones. These are yes/no test result outcomes and not actually numeric representations.\n\nWe'll have to deal with this soon. For now, we can plot the data, by outcome:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(diabetes,\n aes(x = factor(test_result),\n y = glucose)) +\n geom_boxplot()\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-binary_files/figure-html/unnamed-chunk-32-1.png){width=672}\n:::\n:::\n\n\n## Python\n\nWe could just give Python the `test_result` data directly, but then it would view the values as numeric. Which doesn't really work, because we have two groups as such: those with a negative diabetes test result, and those with a positive one.\n\nWe can force Python to temporarily covert the data to a factor, by making the `test_result` column an `object` type. 
We can do this directly inside the `ggplot()` function.\n\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(diabetes_py,\n aes(x = diabetes_py.test_result.astype(object),\n y = \"glucose\")) +\n geom_boxplot())\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-binary_files/figure-html/unnamed-chunk-33-1.png){width=614}\n:::\n:::\n\n:::\n\nIt looks as though the patients with a positive diabetes test have slightly higher glucose levels than those with a negative diabetes test.\n\nWe can visualise that differently by plotting all the data points as a classic binary response plot:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(diabetes,\n aes(x = glucose,\n y = test_result)) +\n geom_point()\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-binary_files/figure-html/unnamed-chunk-34-3.png){width=672}\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(diabetes_py,\n aes(x = \"glucose\",\n y = \"test_result\")) +\n geom_point())\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-binary_files/figure-html/unnamed-chunk-35-1.png){width=614}\n:::\n:::\n\n\n:::\n\n#### Create a suitable model\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nWe'll use the `glm()` function to create a generalised linear model. Here we save the model in an object called `glm_dia`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nglm_dia <- glm(test_result ~ glucose,\n family = binomial,\n data = diabetes)\n```\n:::\n\n\nThe format of this function is similar to that used by the `lm()` function for linear models. The important difference is that we must specify the _family_ of error distribution to use. For logistic regression we must set the family to **binomial**.\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# create a linear model\nmodel = smf.glm(formula = \"test_result ~ glucose\",\n family = sm.families.Binomial(),\n data = diabetes_py)\n# and get the fitted parameters of the model\nglm_dia_py = model.fit()\n```\n:::\n\n\n:::\n\nLet's look at the model parameters:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(glm_dia)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nglm(formula = test_result ~ glucose, family = binomial, data = diabetes)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) -5.611732 0.442289 -12.69 <2e-16 ***\nglucose 0.039510 0.003398 11.63 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for binomial family taken to be 1)\n\n Null deviance: 936.6 on 727 degrees of freedom\nResidual deviance: 752.2 on 726 degrees of freedom\nAIC: 756.2\n\nNumber of Fisher Scoring iterations: 4\n```\n\n\n:::\n:::\n\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nprint(glm_dia_py.summary())\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n Generalized Linear Model Regression Results \n==============================================================================\nDep. Variable: test_result No. Observations: 728\nModel: GLM Df Residuals: 726\nModel Family: Binomial Df Model: 1\nLink Function: Logit Scale: 1.0000\nMethod: IRLS Log-Likelihood: -376.10\nDate: Fri, 19 Jan 2024 Deviance: 752.20\nTime: 18:05:59 Pearson chi2: 713.\nNo. Iterations: 4 Pseudo R-squ. 
(CS): 0.2238\nCovariance Type: nonrobust \n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept -5.6117 0.442 -12.688 0.000 -6.479 -4.745\nglucose 0.0395 0.003 11.628 0.000 0.033 0.046\n==============================================================================\n```\n\n\n:::\n:::\n\n:::\n\nWe can see that `glucose` is a significant predictor for the `test_result` (the $p$ value is much smaller than 0.05).\n\nKnowing this, we're interested in the coefficients. We have an intercept of `-5.61` and `0.0395` for `glucose`. We can use these coefficients to write a formula that describes the potential relationship between the probability of having a positive test result, dependent on the glucose tolerance level value:\n\n$$ P(positive \\ test\\ result) = \\frac{\\exp(-5.61 + 0.04 \\times glucose)}{1 + \\exp(-5.61 + 0.04 \\times glucose)} $$\n\n#### Calculating probabilities\n\nUsing the formula above, we can now calculate the probability of having a positive test result, for a given `glucose` value. If we do this for `glucose = 150`, we get the following:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nexp(-5.61 + 0.04 * 150) / (1 + exp(-5.61 + 0.04 * 145))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0.6685441\n```\n\n\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nmath.exp(-5.61 + 0.04 * 150) / (1 + math.exp(-5.61 + 0.04 * 145))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n0.6685441044999503\n```\n\n\n:::\n:::\n\n:::\n\nThis tells us that the probability of having a positive diabetes test result, given a glucose tolerance level of 150 is around 67%.\n\n:::\n:::\n\n## Summary\n\n::: {.callout-tip}\n#### Key points\n\n- We use a logistic regression to model a binary response\n- We can feed new observations into the model and get probabilities for the outcome\n:::\n", + "markdown": "---\ntitle: \"Binary response\"\n---\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n\n\n::: {.callout-tip}\n## Learning outcomes\n\n**Questions**\n\n- How do we analyse data with a binary outcome?\n- Can we test if our model is any good?\n- Be able to perform a logistic regression with a binary outcome\n- Predict outcomes of new data, based on a defined model\n\n**Objectives**\n\n- Be able to analyse binary outcome data\n- Understand different methods of testing model fit\n- Be able to make model predictions\n:::\n\n## Libraries and functions\n\n::: {.callout-note collapse=\"true\"}\n## Click to expand\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n### Libraries\n### Functions\n\n## Python\n\n### Libraries\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# A maths library\nimport math\n# A Python data analysis and manipulation tool\nimport pandas as pd\n\n# Python equivalent of `ggplot2`\nfrom plotnine import *\n\n# Statistical models, conducting tests and statistical data exploration\nimport statsmodels.api as sm\n\n# Convenience interface for specifying models using formula strings and DataFrames\nimport statsmodels.formula.api as smf\n```\n:::\n\n\n### Functions\n:::\n:::\n\nThe example in this section uses the following data set:\n\n`data/finches_early.csv`\n\nThese data come from an analysis of gene flow across two finch species [@lamichhaney2020]. They are slightly adapted here for illustrative purposes.\n\nThe data focus on two species, _Geospiza fortis_ and _G. scandens_. 
The original measurements are split by a uniquely timed event: a particularly strong El Niño event in 1983. This event changed the vegetation and food supply of the finches, allowing F1 hybrids of the two species to survive, whereas before 1983 they could not. The measurements are classed as `early` (pre-1983) and `late` (1983 onwards).\n\nHere we are looking only at the `early` data. We are specifically focussing on the beak shape classification, which we saw earlier in @fig-beak_shape_glm.\n\n## Load and visualise the data\n\nFirst we load the data, then we visualise it.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nearly_finches <- read_csv(\"data/finches_early.csv\")\n```\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nearly_finches_py = pd.read_csv(\"data/finches_early.csv\")\n```\n:::\n\n\n:::\n\nLooking at the data, we can see that the `pointed_beak` column contains zeros and ones. These are actually yes/no classification outcomes and not numeric representations.\n\nWe'll have to deal with this soon. For now, we can plot the data:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(early_finches,\n aes(x = factor(pointed_beak),\n y = blength)) +\n geom_boxplot()\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-binary_files/figure-html/unnamed-chunk-6-1.png){width=672}\n:::\n:::\n\n\n## Python\n\nWe could just give Python the `pointed_beak` data directly, but then it would view the values as numeric. That doesn't really work, because we really have two groups: those with a pointed beak (`1`), and those with a blunt one (`0`).\n\nWe can force Python to temporarily convert the data to a factor, by making the `pointed_beak` column an `object` type. We can do this directly inside the `ggplot()` function.\n\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(early_finches_py,\n aes(x = early_finches_py.pointed_beak.astype(object),\n y = \"blength\")) +\n geom_boxplot())\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-binary_files/figure-html/unnamed-chunk-7-1.png){width=614}\n:::\n:::\n\n:::\n\nIt looks as though the finches with blunt beaks generally have shorter beak lengths.\n\nWe can visualise that differently by plotting all the data points as a classic binary response plot:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(early_finches,\n aes(x = blength, y = pointed_beak)) +\n geom_point()\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-binary_files/figure-html/unnamed-chunk-8-3.png){width=672}\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(early_finches_py,\n aes(x = \"blength\",\n y = \"pointed_beak\")) +\n geom_point())\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-binary_files/figure-html/unnamed-chunk-9-1.png){width=614}\n:::\n:::\n\n\n:::\n\nThis presents us with a bit of an issue. 
We could fit a linear regression model to these data, although we already know that this is a bad idea...\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(early_finches,\n aes(x = blength, y = pointed_beak)) +\n geom_point() +\n geom_smooth(method = \"lm\", se = FALSE)\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-binary_files/figure-html/unnamed-chunk-10-3.png){width=672}\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(early_finches_py,\n aes(x = \"blength\",\n y = \"pointed_beak\")) +\n geom_point() +\n geom_smooth(method = \"lm\",\n colour = \"blue\",\n se = False))\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-binary_files/figure-html/unnamed-chunk-11-1.png){width=614}\n:::\n:::\n\n\n:::\n\nOf course this is rubbish - we can't have a beak classification outside the range of $[0, 1]$. It's either blunt (`0`) or pointed (`1`).\n\nBut for the sake of exploration, let's look at the assumptions:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlm_bks <- lm(pointed_beak ~ blength,\n data = early_finches)\n\nresid_panel(lm_bks,\n plots = c(\"resid\", \"qq\", \"ls\", \"cookd\"),\n smoother = TRUE)\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-binary_files/figure-html/unnamed-chunk-12-3.png){width=672}\n:::\n:::\n\n\n## Python\n\nFirst, we create a linear model:\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# create a linear model\nmodel = smf.ols(formula = \"pointed_beak ~ blength\",\n data = early_finches_py)\n# and get the fitted parameters of the model\nlm_bks_py = model.fit()\n```\n:::\n\n\nNext, we can create the diagnostic plots:\n\n::: {.cell}\n\n```{.python .cell-code}\ndgplots(lm_bks_py)\n```\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n![](glm-practical-logistic-binary_files/figure-html/unnamed-chunk-15-1.png){width=96}\n:::\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n![](images/dgplots/2024_01_22-11-16-01_AM_dgplots.png){width=804}\n:::\n:::\n\n\n:::\n\nThey're ~~pretty~~ extremely bad.\n\n- The response is not linear (Residual Plot, binary response plot, common sense).\n- The residuals do not appear to be distributed normally (Q-Q Plot)\n- The variance is not homogeneous across the predicted values (Location-Scale Plot)\n- But - there is always a silver lining - we don't have influential data points.\n\n## Creating a suitable model\n\nSo far we've established that using a simple linear model to describe a potential relationship between beak length and the probability of having a pointed beak is not a good idea. So, what _can_ we do?\n\nOne of the ways we can deal with binary outcome data is by performing a logistic regression. Instead of fitting a straight line to our data, and performing a regression on that, we fit a line that has an S shape. This avoids the model making predictions outside the $[0, 1]$ range.\n\nWe described our standard linear relationship as follows:\n\n$Y = \\beta_0 + \\beta_1X$\n\nWe can now map this to our non-linear relationship via the **logistic link function**:\n\n$Y = \\frac{\\exp(\\beta_0 + \\beta_1X)}{1 + \\exp(\\beta_0 + \\beta_1X)}$\n\nNote that the $\\beta_0 + \\beta_1X$ part is identical to the formula of a straight line.\n\nThe rest of the function is what makes the straight line curve into its characteristic S shape. 
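\n\nTo get a feel for this, the logistic function is easy to compute yourself. Here is a minimal R sketch (not part of the original analysis; the function name `logistic` is just an illustrative choice) that evaluates it for a few values:\n\n::: {.cell}\n\n```{.r .cell-code}\n# the logistic function maps any real number onto the (0, 1) interval\nlogistic <- function(x, b0 = 0, b1 = 1) {\n exp(b0 + b1 * x) / (1 + exp(b0 + b1 * x))\n}\n\nlogistic(0) # at x = 0 with b0 = 0 this is exactly 0.5\nlogistic(-5) # far left of the curve: close to 0\nlogistic(5) # far right of the curve: close to 1\nlogistic(5, b1 = -1) # a negative slope flips the curve\n```\n:::\n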
\n\n:::{.callout-note collapse=true}\n## Euler's number ($\\exp$): would you like to know more?\n\nIn mathematics, $\\rm e$ represents a constant of around 2.718. Another notation is $\\exp$, which is often used when notations become a bit cumbersome. Here, I exclusively use the $\\exp$ notation for consistency.\n:::\n\n::: {.callout-important}\n## The logistic function\n\nThe shape of the logistic function is hugely influenced by the different parameters, in particular $\\beta_1$. The plots below show different situations, where $\\beta_0 = 0$ in all cases, but $\\beta_1$ varies.\n\nThe first plot shows the logistic function in its simplest form, with the others showing the effect of varying $\\beta_1$.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](glm-practical-logistic-binary_files/figure-html/unnamed-chunk-17-1.png){width=672}\n:::\n:::\n\n\n* when $\\beta_1 = 1$, this gives the simplest logistic function\n* when $\\beta_1 = 0$, we get a horizontal line, with $Y = \\frac{\\exp(\\beta_0)}{1+\\exp(\\beta_0)}$\n* when $\\beta_1$ is negative, the curve is flipped around, so it slopes down\n* when $\\beta_1$ is very large, the curve becomes extremely steep\n\n:::\n\nWe can fit such an S-shaped curve to our `early_finches` data set, by creating a generalised linear model.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nIn R we have a few options to do this, and by far the most familiar function would be `glm()`. Here we save the model in an object called `glm_bks`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nglm_bks <- glm(pointed_beak ~ blength,\n family = binomial,\n data = early_finches)\n```\n:::\n\n\nThe format of this function is similar to that used by the `lm()` function for linear models. The important difference is that we must specify the _family_ of error distribution to use. For logistic regression we must set the family to **binomial**.\n\nIf you forget to set the `family` argument, then the `glm()` function will perform a standard linear model fit, identical to what the `lm()` function would do.\n\n## Python\n\nIn Python we have a few options to do this, and by far the most familiar function would be `glm()`. Here we save the model in an object called `glm_bks_py`:\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# create a generalised linear model\nmodel = smf.glm(formula = \"pointed_beak ~ blength\",\n family = sm.families.Binomial(),\n data = early_finches_py)\n# and get the fitted parameters of the model\nglm_bks_py = model.fit()\n```\n:::\n\n\nThe format of this function is similar to that used by the `ols()` function for linear models. The important difference is that we must specify the _family_ of error distribution to use. For logistic regression we must set the family to **binomial**. This is buried deep inside the `statsmodels` package and needs to be defined as `sm.families.Binomial()`.\n\n:::\n\n## Model output\n\nThat's the easy part done! The trickier part is interpreting the output. First of all, we'll get some summary information.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(glm_bks)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nglm(formula = pointed_beak ~ blength, family = binomial, data = early_finches)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) -43.410 15.250 -2.847 0.00442 **\nblength 3.387 1.193 2.839 0.00452 **\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n\n(Dispersion parameter for binomial family taken to be 1)\n\n Null deviance: 84.5476 on 60 degrees of freedom\nResidual deviance: 9.1879 on 59 degrees of freedom\nAIC: 13.188\n\nNumber of Fisher Scoring iterations: 8\n```\n\n\n:::\n:::\n\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nprint(glm_bks_py.summary())\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n Generalized Linear Model Regression Results \n==============================================================================\nDep. Variable: pointed_beak No. Observations: 61\nModel: GLM Df Residuals: 59\nModel Family: Binomial Df Model: 1\nLink Function: Logit Scale: 1.0000\nMethod: IRLS Log-Likelihood: -4.5939\nDate: Mon, 22 Jan 2024 Deviance: 9.1879\nTime: 11:16:02 Pearson chi2: 15.1\nNo. Iterations: 8 Pseudo R-squ. (CS): 0.7093\nCovariance Type: nonrobust \n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept -43.4096 15.250 -2.847 0.004 -73.298 -13.521\nblength 3.3866 1.193 2.839 0.005 1.049 5.724\n==============================================================================\n```\n\n\n:::\n:::\n\n\n:::\n\nThere’s a lot to unpack here, but let's start with what we're familiar with: coefficients!\n\n## Parameter interpretation\n\n::: {.panel-tabset group=\"language\"}\n## R\nThe coefficients or parameters can be found in the `Coefficients` block. The main numbers to extract from the output are the two numbers in the `Estimate` column:\n\n```\nCoefficients:\n Estimate\n(Intercept) -43.410\nblength 3.387 \n```\n\n## Python\n\nRight at the bottom is a table showing the model coefficients. The main numbers to extract from the output are the two numbers in the `coef` column:\n\n```\n======================\n coef\n----------------------\nIntercept -43.4096\nblength 3.3866\n======================\n```\n\n:::\n\nThese are the coefficients of the logistic model equation and need to be placed in the correct equation if we want to be able to calculate the probability of having a pointed beak for a given beak length.\n\nThe $p$ values at the end of each coefficient row merely show whether that particular coefficient is significantly different from zero. This is similar to the $p$ values obtained in the summary output of a linear model. As with continuous predictors in simple models, these $p$ values can be used to decide whether that predictor is important (so in this case beak length appears to be significant). However, these $p$ values aren’t great to work with when we have multiple predictor variables, or when we have categorical predictors with multiple levels (since the output will give us a $p$ value for each level rather than for the predictor as a whole).\n\nWe can use the coefficients to calculate the probability of having a pointed beak for a given beak length:\n\n$$ P(pointed \\ beak) = \\frac{\\exp(-43.41 + 3.39 \\times blength)}{1 + \\exp(-43.41 + 3.39 \\times blength)} $$\n\nHaving this formula means that we can calculate the probability of having a pointed beak for any beak length. How do we work this out in practice? 
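One way is to plug the (rounded) coefficients into the equation by hand, which is what the tabs below do. As an aside, we can also ask the fitted model to do the arithmetic for us at full precision; here is a minimal R sketch of that approach, assuming the `glm_bks` object created above:\n\n::: {.cell}\n\n```{.r .cell-code}\n# predicted probability of a pointed beak at two example beak lengths;\n# type = \"response\" returns probabilities rather than log-odds\npredict(glm_bks,\n newdata = data.frame(blength = c(15, 10)),\n type = \"response\")\n```\n:::\n\nThe manual version makes the mechanics of the equation more obvious, though, so let's work through that now.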
\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nWell, the probability of having a pointed beak if the beak length is large (for example 15 mm) can be calculated as follows:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nexp(-43.41 + 3.39 * 15) / (1 + exp(-43.41 + 3.39 * 15))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0.9994131\n```\n\n\n:::\n:::\n\n\nIf the beak length is small (for example 10 mm), the probability of having a pointed beak is extremely low:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nexp(-43.41 + 3.39 * 10) / (1 + exp(-43.41 + 3.39 * 10))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 7.410155e-05\n```\n\n\n:::\n:::\n\n\n## Python\n\nWell, the probability of having a pointed beak if the beak length is large (for example 15 mm) can be calculated as follows:\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# import the math library\nimport math\n```\n:::\n\n::: {.cell}\n\n```{.python .cell-code}\nmath.exp(-43.41 + 3.39 * 15) / (1 + math.exp(-43.41 + 3.39 * 15))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n0.9994130595039192\n```\n\n\n:::\n:::\n\n\nIf the beak length is small (for example 10 mm), the probability of having a pointed beak is extremely low:\n\n\n::: {.cell}\n\n```{.python .cell-code}\nmath.exp(-43.41 + 3.39 * 10) / (1 + math.exp(-43.41 + 3.39 * 10))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n7.410155028945912e-05\n```\n\n\n:::\n:::\n\n:::\n\nWe can calculate the probabilities for all our observed values and, if we do that, we can see that the larger the beak length, the higher the probability that the beak shape is pointed. I'm visualising this together with the logistic curve, where the blue points are the calculated probabilities:\n\n::: {.callout-note collapse=true}\n## Code available here\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nglm_bks %>% \n augment(type.predict = \"response\") %>% \n ggplot() +\n geom_point(aes(x = blength, y = pointed_beak)) +\n geom_line(aes(x = blength, y = .fitted),\n linetype = \"dashed\",\n colour = \"blue\") +\n geom_point(aes(x = blength, y = .fitted),\n colour = \"blue\", alpha = 0.5) +\n labs(x = \"beak length (mm)\",\n y = \"Probability\")\n```\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(early_finches_py) +\n geom_point(aes(x = \"blength\", y = \"pointed_beak\")) +\n geom_line(aes(x = \"blength\", y = glm_bks_py.fittedvalues),\n linetype = \"dashed\",\n colour = \"blue\") +\n geom_point(aes(x = \"blength\", y = glm_bks_py.fittedvalues),\n colour = \"blue\", alpha = 0.5) +\n labs(x = \"beak length (mm)\",\n y = \"Probability\"))\n```\n:::\n\n:::\n:::\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Predicted probabilities for beak classification](glm-practical-logistic-binary_files/figure-html/fig-beak_class_glm_probs-2.png){#fig-beak_class_glm_probs width=672}\n:::\n:::\n\n\nThe graph shows us that, based on the data that we have and the model we used to make predictions about our response variable, the probability of seeing a pointed beak increases with beak length.\n\nShort beaks are more closely associated with the bluntly shaped beaks, whereas long beaks are more closely associated with the pointed shape. It's also clear that there is a range of beak lengths (around 13 mm) where the probability of getting one shape or another is much more even.
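\n\nWe can pin down where that \"crossover\" happens: the logistic curve passes through a probability of 0.5 exactly where $\\beta_0 + \\beta_1 \\times blength = 0$, i.e. at $blength = -\\beta_0 / \\beta_1$. A quick sketch of that calculation in R, using the fitted coefficients:\n\n::: {.cell}\n\n```{.r .cell-code}\n# beak length at which P(pointed beak) = 0.5\nb <- coef(glm_bks)\nunname(-b[\"(Intercept)\"] / b[\"blength\"])\n```\n:::\n\nThis works out at roughly 12.8 mm, which matches what we see in the plot.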
\n\n## Assumptions\n\nAs explained in the background chapter, we can't really use the standard diagnostic plots to assess assumptions. We're not going to go into a lot of detail for now, but there is one thing that we can do: look for influential points using the Cook’s distance plot:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nplot(glm_bks, which = 4)\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-binary_files/figure-html/unnamed-chunk-30-1.png){width=672}\n:::\n:::\n\n\n:::\n\nThis shows that there are no very obvious influential points. You could regard point `34` as potentially influential (it's got a Cook's distance of around `0.8`), but I'm not overly worried.\n\n## Assessing significance\n\nWe can ask several questions.\n\n**Is the model well-specified?**\n\nRoughly speaking this asks \"can our model predict our data reasonably well?\"\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nUnfortunately, there isn’t a single command that does this for us, and we have to lift some of the numbers from the summary output ourselves. \n\n\n::: {.cell}\n\n```{.r .cell-code}\n1 - pchisq(9.1879, 59)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 1\n```\n\n\n:::\n:::\n\n\n\nHere, we’ve used the `pchisq` function (which calculates the correct probability for us – ask if you want a hand-wavy explanation). The first argument is the residual deviance value from the summary table, and the second argument is the residual degrees of freedom from the same table.\n\nThis gives us a probability of `1`. We can interpret this as the probability that the model is actually good. There aren’t any strict conventions on how to interpret this value but, for me, a tiny value would indicate a rubbish model.\n\n:::\n\n**Is the overall model better than the null model?**\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\n1 - pchisq(84.5476 - 9.1879, 60 - 59)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0\n```\n\n\n:::\n:::\n\n\nHere we’ve used the `pchisq` function again (if you didn’t ask before, you probably aren’t going to ask now). The first argument is the difference between the null and residual deviances and the second argument is the difference in degrees of freedom between the null and residual models. All of these values can be lifted from the summary table.\n\nThis gives us a reported probability of 0, which is technically not possible: it is really a value too small for R to display. This value is doing a formal test to see whether our fitted model is significantly different from the null model. Here we can treat this as a classical hypothesis test and, since this p-value is less than 0.05, we can say that our fitted model (with `blength` as a predictor variable) is definitely better than the null model (which has no predictor variables). Woohoo!
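\n\nAs a side note: rather than copying these deviance values from the summary by hand, we can pull them straight out of the model object. A small sketch of this (the components used below are standard parts of an R `glm` object):\n\n::: {.cell}\n\n```{.r .cell-code}\n# goodness-of-fit, equivalent to 1 - pchisq(9.1879, 59)\npchisq(glm_bks$deviance, glm_bks$df.residual, lower.tail = FALSE)\n\n# null versus fitted model, equivalent to 1 - pchisq(84.5476 - 9.1879, 60 - 59)\npchisq(glm_bks$null.deviance - glm_bks$deviance,\n glm_bks$df.null - glm_bks$df.residual,\n lower.tail = FALSE)\n```\n:::\n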
\n\n:::\n\n**Are any of the individual predictors significant?**\n\nFinally, we’ll use the `anova()` function from before to determine which predictor variables are important, and specifically in this case whether beak length (`blength`) is a significant predictor.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nanova(glm_bks, test = \"Chisq\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nAnalysis of Deviance Table\n\nModel: binomial, link: logit\n\nResponse: pointed_beak\n\nTerms added sequentially (first to last)\n\n Df Deviance Resid. Df Resid. Dev Pr(>Chi) \nNULL 60 84.548 \nblength 1 75.36 59 9.188 < 2.2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n```\n\n\n:::\n:::\n\n\nThe `anova()` function is a true workhorse within R! This time we’ve used it to create an Analysis of Deviance table. This is exactly equivalent to an ordinary ANOVA table where we have rows corresponding to each predictor variable and a p-value telling us whether that variable is significant or not.\n\nThe p-value for the `blength` predictor is written under the `Pr(>Chi)` column and we can see that it is `< 2.2e-16`. So, beak length is a significant predictor.\n\nThis shouldn’t be surprising since we already saw that our overall model was better than the null model, which in this case is exactly the same as asking whether the beak length term is significant. However, in more complicated models with multiple predictors these two comparisons (and p-values) won’t be the same.\n\n\n:::\n\n\n\n\n## Exercises\n\n### Diabetes {#sec-exr_diabetes}\n\n:::{.callout-exercise}\n\n\n{{< level 2 >}}\n\n\n\nFor this exercise we'll be using the data from `data/diabetes.csv`.\n\nThis is a data set comprising 768 observations of three variables (one dependent and two predictor variables). This records the result of a diabetes test as a binary variable (1 is a positive result, 0 is a negative result), along with the result of a glucose tolerance test and the diastolic blood pressure for each of 768 women. The variables are called `test_result`, `glucose` and `diastolic`.\n\nWe want to see if `glucose` tolerance is a meaningful predictor of the diabetes test result. To investigate this, do the following:\n\n1. Load and visualise the data\n2. Create a suitable model\n3. Determine if there are any statistically significant predictors\n4. Calculate the probability of a positive diabetes test result for a glucose tolerance test value of `glucose = 150`\n\n::: {.callout-answer collapse=\"true\"}\n\n#### Load and visualise the data\n\nFirst we load the data, then we visualise it.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndiabetes <- read_csv(\"data/diabetes.csv\")\n```\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\ndiabetes_py = pd.read_csv(\"data/diabetes.csv\")\n```\n:::\n\n\n:::\n\nLooking at the data, we can see that the `test_result` column contains zeros and ones. These are yes/no test result outcomes and not actually numeric representations.\n\nWe'll have to deal with this soon. 
For now, we can plot the data by outcome:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(diabetes,\n aes(x = factor(test_result),\n y = glucose)) +\n geom_boxplot()\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-binary_files/figure-html/unnamed-chunk-36-1.png){width=672}\n:::\n:::\n\n\n## Python\n\nWe could just give Python the `test_result` data directly, but then it would view the values as numeric. That doesn't really work, because we really have two groups: those with a negative diabetes test result, and those with a positive one.\n\nWe can force Python to temporarily convert the data to a factor, by making the `test_result` column an `object` type. We can do this directly inside the `ggplot()` function.\n\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(diabetes_py,\n aes(x = diabetes_py.test_result.astype(object),\n y = \"glucose\")) +\n geom_boxplot())\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-binary_files/figure-html/unnamed-chunk-37-1.png){width=614}\n:::\n:::\n\n:::\n\nIt looks as though the patients with a positive diabetes test have slightly higher glucose levels than those with a negative diabetes test.\n\nWe can visualise that differently by plotting all the data points as a classic binary response plot:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(diabetes,\n aes(x = glucose,\n y = test_result)) +\n geom_point()\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-binary_files/figure-html/unnamed-chunk-38-3.png){width=672}\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(diabetes_py,\n aes(x = \"glucose\",\n y = \"test_result\")) +\n geom_point())\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-binary_files/figure-html/unnamed-chunk-39-1.png){width=614}\n:::\n:::\n\n\n:::\n\n#### Create a suitable model\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nWe'll use the `glm()` function to create a generalised linear model. Here we save the model in an object called `glm_dia`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nglm_dia <- glm(test_result ~ glucose,\n family = binomial,\n data = diabetes)\n```\n:::\n\n\nThe format of this function is similar to that used by the `lm()` function for linear models. The important difference is that we must specify the _family_ of error distribution to use. For logistic regression we must set the family to **binomial**.\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# create a generalised linear model\nmodel = smf.glm(formula = \"test_result ~ glucose\",\n family = sm.families.Binomial(),\n data = diabetes_py)\n# and get the fitted parameters of the model\nglm_dia_py = model.fit()\n```\n:::\n\n\n:::\n\nLet's look at the model parameters:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(glm_dia)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nglm(formula = test_result ~ glucose, family = binomial, data = diabetes)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) -5.611732 0.442289 -12.69 <2e-16 ***\nglucose 0.039510 0.003398 11.63 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n\n(Dispersion parameter for binomial family taken to be 1)\n\n Null deviance: 936.6 on 727 degrees of freedom\nResidual deviance: 752.2 on 726 degrees of freedom\nAIC: 756.2\n\nNumber of Fisher Scoring iterations: 4\n```\n\n\n:::\n:::\n\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nprint(glm_dia_py.summary())\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n Generalized Linear Model Regression Results \n==============================================================================\nDep. Variable: test_result No. Observations: 728\nModel: GLM Df Residuals: 726\nModel Family: Binomial Df Model: 1\nLink Function: Logit Scale: 1.0000\nMethod: IRLS Log-Likelihood: -376.10\nDate: Mon, 22 Jan 2024 Deviance: 752.20\nTime: 11:16:06 Pearson chi2: 713.\nNo. Iterations: 4 Pseudo R-squ. (CS): 0.2238\nCovariance Type: nonrobust \n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept -5.6117 0.442 -12.688 0.000 -6.479 -4.745\nglucose 0.0395 0.003 11.628 0.000 0.033 0.046\n==============================================================================\n```\n\n\n:::\n:::\n\n:::\n\nWe can see that `glucose` is a significant predictor for the `test_result` (the $p$ value is much smaller than 0.05).\n\nKnowing this, we're interested in the coefficients. We have an intercept of `-5.61` and `0.0395` for `glucose`. We can use these coefficients to write a formula that describes the potential relationship between the probability of having a positive test result, dependent on the glucose tolerance level value:\n\n$$ P(positive \\ test\\ result) = \\frac{\\exp(-5.61 + 0.04 \\times glucose)}{1 + \\exp(-5.61 + 0.04 \\times glucose)} $$\n\n#### Calculating probabilities\n\nUsing the formula above, we can now calculate the probability of having a positive test result, for a given `glucose` value. 
If we do this for `glucose = 150`, we get the following:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nexp(-5.61 + 0.04 * 150) / (1 + exp(-5.61 + 0.04 * 150))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0.5962827\n```\n\n\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nmath.exp(-5.61 + 0.04 * 150) / (1 + math.exp(-5.61 + 0.04 * 150))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n0.5962826993\n```\n\n\n:::\n:::\n\n:::\n\nThis tells us that the probability of having a positive diabetes test result, given a glucose tolerance level of 150, is around 60%.\n\n:::\n:::\n\n## Summary\n\n::: {.callout-tip}\n#### Key points\n\n- We use a logistic regression to model a binary response\n- We can feed new observations into the model and get probabilities for the outcome\n:::\n", "supporting": [ "glm-practical-logistic-binary_files" ], diff --git a/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-30-1.png b/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-30-1.png new file mode 100644 index 0000000..fe8b9b7 Binary files /dev/null and b/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-30-1.png differ diff --git a/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-36-1.png b/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-36-1.png new file mode 100644 index 0000000..68a5fa8 Binary files /dev/null and b/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-36-1.png differ diff --git a/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-37-1.png b/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-37-1.png new file mode 100644 index 0000000..a849e0c Binary files /dev/null and b/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-37-1.png differ diff --git a/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-38-3.png b/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-38-3.png new file mode 100644 index 0000000..0d59157 Binary files /dev/null and b/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-38-3.png differ diff --git a/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-39-1.png b/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-39-1.png new file mode 100644 index 0000000..bb9f4f8 Binary files /dev/null and b/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-39-1.png differ diff --git a/_freeze/materials/glm-practical-poisson/execute-results/html.json b/_freeze/materials/glm-practical-poisson/execute-results/html.json new file mode 100644 index 0000000..a2f8046 --- /dev/null +++ b/_freeze/materials/glm-practical-poisson/execute-results/html.json @@ -0,0 +1,17 @@ +{ + "hash": "e28c115b310dab335e2846c5ee4d47b0", + "result": { + "engine": "knitr", + "markdown": "---\ntitle: \"Count data\"---\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n\n\n::: {.callout-tip}\n## Learning outcomes\n\n**Questions**\n\n- How do we analyse count data?\n\n**Objectives**\n\n- Be able to perform a Poisson regression on count data\n:::\n\n## Libraries and functions\n\n::: {.callout-note collapse=\"true\"}\n## Click to expand\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n### Libraries\n### Functions\n:::\n:::\n\nThe examples in this section use the following data
sets:\n\n`data/islands.csv`\n\nThis is a data set comprising 35 observations of two variables (one dependent and one predictor). This records the number of species recorded on different small islands along with the area (km2) of the islands. The variables are `species` and `area`.\n\nThe second data set is on seat belts.\n\nThe `seatbelts` data set is a multiple time-series data set that was commissioned by the Department of Transport in 1984 to measure differences in deaths before and after front seat belt legislation was introduced on 31st January 1983. It provides monthly total numerical data on a number of incidents including those related to death and injury in Road Traffic Accidents (RTA's). The data set starts in January 1969 and observations run until December 1984.\n\nYou can find the file in `data/seatbelts.csv`\n\n## Load and visualise the data\n\nFirst we load the data, then we visualise it.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nislands <- read_csv(\"data/islands.csv\")\n```\n:::\n\n\nLet's have a glimpse at the data:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nislands\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 35 × 2\n species area\n \n 1 114 12.1\n 2 130 13.4\n 3 113 13.7\n 4 109 14.5\n 5 118 16.8\n 6 136 19.0\n 7 149 19.6\n 8 162 20.6\n 9 145 20.9\n10 148 21.0\n# ℹ 25 more rows\n```\n\n\n:::\n:::\n\n\n:::\n\nLooking at the data, we can see that there are two columns: `species`, which contains the number of species recorded on each island and `area`, which contains the surface area of the island in square kilometers.\n\nWe can plot the data:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(islands, aes(x = area, y = species)) +\n geom_point()\n```\n\n::: {.cell-output-display}\n![](glm-practical-poisson_files/figure-html/unnamed-chunk-5-1.png){width=672}\n:::\n:::\n\n\n:::\n\nIt looks as though `area` may have an effect on the number of species that we observe on each island. We note that the response variable is count data and so we try to construct a Poisson regression.\n\n## Constructing a model\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nglm_isl <- glm(species ~ area,\n data = islands, family = \"poisson\")\n```\n:::\n\n\nand we look at the model summary:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(glm_isl)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nglm(formula = species ~ area, family = \"poisson\", data = islands)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) 4.241129 0.041322 102.64 <2e-16 ***\narea 0.035613 0.001247 28.55 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for poisson family taken to be 1)\n\n Null deviance: 856.899 on 34 degrees of freedom\nResidual deviance: 30.437 on 33 degrees of freedom\nAIC: 282.66\n\nNumber of Fisher Scoring iterations: 3\n```\n\n\n:::\n:::\n\n\nThe output is strikingly similar to the logistic regression models (who’d have guessed, eh?) 
and the main numbers to extract from the output are the two numbers underneath `Estimate.Std` in the `Coefficients` table:\n\n```\n(Intercept) 4.241129\narea 0.035613\n```\n\n:::\n\nThese are the coefficients of the Poisson model equation and need to be placed in the following formula in order to estimate the expected number of species as a function of island size:\n\n$$ E(species) = \\exp(4.24 + 0.036 \\times area) $$\n\nInterpreting this requires a bit of thought (not much, but a bit).\nThe intercept coefficient, `4.24`, is related to the number of species we would expect on an island of zero area (this is statistics, not real life. You’d do well to remember that before you worry too much about what that even means). But in order to turn this number into something meaningful we have to exponentiate it. Since `exp(4.24) ≈ 70`, we can say that the baseline number of species the model expects on any island is 70. This isn’t actually the interesting bit though.\n\nThe coefficient of `area` is the fun bit. For starters we can see that it is a positive number which does mean that increasing `area` leads to increasing numbers of `species`. Good so far.\n\nBut what does the value `0.036` actually mean? Well, if we exponentiate it as well, we get `exp(0.036) ≈ 1.04`. This means that for every increase in `area` of 1 km^2 (the original units of the area variable), the number of species on the island is multiplied by `1.04`. So, an island of area 1 km^2 will have `1.04 x 70 ≈ 72` species.\n\nSo, in order to interpret Poisson coefficients, you have to exponentiate them.\n\n## Plotting the Poisson regression\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(islands, aes(area, species)) +\n geom_point() +\n geom_smooth(method = \"glm\", se = FALSE, fullrange = TRUE, \n method.args = list(family = poisson)) +\n xlim(10,50)\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\n`geom_smooth()` using formula = 'y ~ x'\n```\n\n\n:::\n\n::: {.cell-output-display}\n![](glm-practical-poisson_files/figure-html/unnamed-chunk-8-1.png){width=672}\n:::\n:::\n\n\n:::\n\n## Assumptions\n\nAs we mentioned earlier, Poisson regressions require that the variance of the data at any point is the same as the mean of the data at that point. We checked that earlier by looking at the residual deviance values.\n\nWe can look for influential points using the Cook’s distance plot:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nplot(glm_isl , which=4)\n```\n\n::: {.cell-output-display}\n![](glm-practical-poisson_files/figure-html/unnamed-chunk-9-1.png){width=672}\n:::\n:::\n\n\n:::\n\nNone of our points have particularly large Cook’s distances and so life is rosy.\n\n## Assessing significance\n\nWe can ask the same three questions we asked before.\n\n1. Is the model well-specified?\n2. Is the overall model better than the null model?\n3. Are any of the individual predictors significant?\n\nAgain, in this case, questions 2 and 3 are effectively asking the same thing because we still only have a single predictor variable.\n\nTo assess if the model is any good we’ll again use the residual deviance and the residual degrees of freedom.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\n1 - pchisq(30.437, 33)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0.5953482\n```\n\n\n:::\n:::\n\n\n:::\n\nThis gives a probability of `0.60`. 
This suggests that this model is actually a good one and that the data are pretty well supported by the model. For Poisson models this has an extra interpretation. This can be used to assess whether we have significant over-dispersion in our data.\n\nFor a Poisson model to be appropriate we need that the variance of the data to be exactly the same as the mean of the data. Visually, this would correspond to the data spreading out more for higher predicted values of `species.` However, we don’t want the data to spread out too much. If that happens then a Poisson model wouldn’t be appropriate.\n\nThe easy way to check this is to look at the ratio of the residual deviance to the residual degrees of freedom (in this case `0.922`). For a Poisson model to be valid, this ratio should be about 1. If the ratio is significantly bigger than 1 then we say that we have over-dispersion in the model and we wouldn’t be able to trust any of the significance testing that we are about to do using a Poisson regression.\n\nThankfully the probability we have just created (`0.60`) is exactly the right one we need to look at to assess whether we have significant over-dispersion in our model.\n\nSecondly, to assess whether the overall model, with all of the terms, is better than the null model we’ll look at the difference in deviances and the difference in degrees of freedom:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\n1 - pchisq(856.899 - 30.437, 34 - 33)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0\n```\n\n\n:::\n:::\n\n\n:::\n\nThis gives a reported p-value of 0, which is pretty damn small. So, yes, this model is better than nothing at all and species does appear to change with some of our predictors\n\nFinally, we’ll construct an analysis of deviance table to look at the individual terms:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nanova(glm_isl , test = \"Chisq\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nAnalysis of Deviance Table\n\nModel: poisson, link: log\n\nResponse: species\n\nTerms added sequentially (first to last)\n\n Df Deviance Resid. Df Resid. Dev Pr(>Chi) \nNULL 34 856.90 \narea 1 826.46 33 30.44 < 2.2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n```\n\n\n:::\n:::\n\n\nThe p-value in this table is just as small as we’d expect given our previous result (`<2.2e-16` is pretty close to 0), and we have the nice consistent result that `area` definitely has an effect on `species`.\n\n:::\n\n## Exercises\n\n### Seat belts {#sec-exr_seatbelts}\n\n:::{.callout-exercise}\n\n\n{{< level 2 >}}\n\n\n\nFor this exercise we'll be using the data from `data/seatbelts.csv`.\n\nI'd like you to do the following:\n\n1. Load the data\n2. Visualise the data and create a poisson regression model\n3. Plot the regression model on top of the data\n4. Assess if the model is a decent predictor for the number of fatalities\n\n::: {.callout-answer collapse=\"true\"}\n\n#### Load and visualise the data\n\nFirst we load the data, then we visualise it.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseatbelts <- read_csv(\"data/seatbelts.csv\")\n```\n:::\n\n\n:::\n\nThe data tracks the number of drivers killed in road traffic accidents, before and after the seat belt law was introduced. 
The information on whether the law was in place is encoded in the `law` column as `0` (law not in place) or `1` (law in place).\n\nThere are many more observations when the law was *not* in place, so we need to keep this in mind when we're interpreting the data.\n\nFirst we have a look at the data comparing no law vs law:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nWe have to convert the `law` column to a factor, otherwise R will see it as numerical.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseatbelts %>% \n ggplot(aes(as_factor(law), drivers_killed)) +\n geom_boxplot()\n```\n\n::: {.cell-output-display}\n![](glm-practical-poisson_files/figure-html/unnamed-chunk-14-1.png){width=672}\n:::\n:::\n\n\nThe data are recorded by month and year, so we can also display the number of drivers killed by year:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseatbelts %>% \n ggplot(aes(year, drivers_killed)) +\n geom_point()\n```\n\n::: {.cell-output-display}\n![](glm-practical-poisson_files/figure-html/unnamed-chunk-15-1.png){width=672}\n:::\n:::\n\n\n:::\n\nThe data look a bit weird. There is quite some variation within years (keeping in mind that the data are aggregated monthly). The data also seems to wave around a bit... with some vague peaks (e.g. 1972 - 1973) and some troughs (e.g. around 1976).\n\nSo my initial thought is that these data are going to be a bit tricky to interpret. But that's OK.\n\n#### Constructing a model\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nglm_stb <- glm(drivers_killed ~ year,\n data = seatbelts, family = \"poisson\")\n```\n:::\n\n\nand we look at the model summary:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(glm_stb)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nglm(formula = drivers_killed ~ year, family = \"poisson\", data = seatbelts)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) 37.168958 2.796636 13.29 <2e-16 ***\nyear -0.016373 0.001415 -11.57 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n\n(Dispersion parameter for poisson family taken to be 1)\n\n Null deviance: 984.50 on 191 degrees of freedom\nResidual deviance: 850.41 on 190 degrees of freedom\nAIC: 2127.2\n\nNumber of Fisher Scoring iterations: 4\n```\n\n\n:::\n:::\n\n\n```\n(Intercept) 37.168958\nyear 0.016373\n```\n:::\n\nThese are the coefficients of the Poisson model equation and need to be placed in the following formula in order to estimate the expected number of species as a function of island size:\n\n$$ E(drivers\\_killed) = \\exp(37.17 + 0.164 \\times year) $$\n\n#### Assessing significance\n\nIs the model well-specified?\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\n1 - pchisq(850.41, 190)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0\n```\n\n\n:::\n:::\n\n\n:::\n\nHow about the overall fit?\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\n1 - pchisq(984.50 - 850.41, 191 - 190)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0\n```\n\n\n:::\n:::\n\n\n:::\n\n#### Plotting the regression\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(seatbelts, aes(year, drivers_killed)) +\n geom_point() +\n geom_smooth(method = \"glm\", se = FALSE, fullrange = TRUE, \n method.args = list(family = poisson)) +\n xlim(1970,1985)\n```\n\n::: {.cell-output-display}\n![](glm-practical-poisson_files/figure-html/unnamed-chunk-20-1.png){width=672}\n:::\n:::\n\n\n:::\n\n\n#### Conclusions\n\nThe model we constructed appears to be a decent predictor for the number of fatalities.\n\n:::\n:::\n\n## Summary\n\n::: {.callout-tip}\n#### Key points\n\n- Poisson regression is useful when dealing with count data\n:::\n", + "supporting": [ + "glm-practical-poisson_files" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": {}, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-11-1.png b/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-11-1.png new file mode 100644 index 0000000..54e74a3 Binary files /dev/null and b/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-11-1.png differ diff --git a/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-12-1.png b/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-12-1.png new file mode 100644 index 0000000..6b94180 Binary files /dev/null and b/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-12-1.png differ diff --git a/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-14-1.png b/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-14-1.png new file mode 100644 index 0000000..7608c9e Binary files /dev/null and b/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-14-1.png differ diff --git a/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-15-1.png b/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-15-1.png new file mode 100644 index 0000000..5c19875 Binary files /dev/null and b/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-15-1.png differ diff --git a/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-20-1.png b/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-20-1.png new file mode 100644 index 0000000..6a23d16 Binary files /dev/null and 
b/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-20-1.png differ diff --git a/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-4-1.png b/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-4-1.png new file mode 100644 index 0000000..9c0d82f Binary files /dev/null and b/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-4-1.png differ diff --git a/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-5-1.png b/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-5-1.png new file mode 100644 index 0000000..9c0d82f Binary files /dev/null and b/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-5-1.png differ diff --git a/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-6-1.png b/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-6-1.png new file mode 100644 index 0000000..7608c9e Binary files /dev/null and b/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-6-1.png differ diff --git a/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-7-1.png b/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-7-1.png new file mode 100644 index 0000000..5c19875 Binary files /dev/null and b/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-7-1.png differ diff --git a/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-8-1.png b/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-8-1.png new file mode 100644 index 0000000..54e74a3 Binary files /dev/null and b/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-8-1.png differ diff --git a/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-9-1.png b/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-9-1.png new file mode 100644 index 0000000..6b94180 Binary files /dev/null and b/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-9-1.png differ diff --git a/_site/index.html b/_site/index.html index 037a579..ff094d9 100644 --- a/_site/index.html +++ b/_site/index.html @@ -208,6 +208,23 @@ 8  Proportional response + + + + diff --git a/_site/materials/glm-practical-logistic-binary.html b/_site/materials/glm-practical-logistic-binary.html index 9228cb9..b9673ac 100644 --- a/_site/materials/glm-practical-logistic-binary.html +++ b/_site/materials/glm-practical-logistic-binary.html @@ -273,6 +273,23 @@ + + @@ -287,13 +304,15 @@
  • 7.3 Creating a suitable model
  • 7.4 Model output
  • 7.5 Parameter interpretation
  • +
  • 7.6 Assumptions
  • +
  • 7.7 Assessing significance
  • -7.6 Exercises +7.8 Exercises
  • -
  • 7.7 Summary
  • +
  • 7.9 Summary
  • @@ -584,7 +603,7 @@

    -

    +

    @@ -729,8 +748,8 @@

    Model Family: Binomial Df Model: 1 Link Function: Logit Scale: 1.0000 Method: IRLS Log-Likelihood: -4.5939 -Date: Fri, 19 Jan 2024 Deviance: 9.1879 -Time: 18:05:55 Pearson chi2: 15.1 +Date: Mon, 22 Jan 2024 Deviance: 9.1879 +Time: 11:16:02 Pearson chi2: 15.1 No. Iterations: 8 Pseudo R-squ. (CS): 0.7093 Covariance Type: nonrobust ============================================================================== @@ -893,10 +912,97 @@

    -

    -7.6 Exercises

    -

    -7.6.1 Diabetes

    +

    +7.6 Assumptions

    +

    As explained in the background chapter, we can’t really use the standard diagnostic plots to assess assumptions. We’re not going to go into a lot of detail for now, but there is one thing that we can do: look for influential points using the Cook’s distance plot:

    +
    + +
    +
    +
    +
    plot(glm_bks , which=4)
    +
    +
    +

    +
    +
    +
    +
    +
    +
    +
    +

    This shows that there are no very obvious influential points. You could regard point 34 as potentially influential (it’s got a Cook’s distance of around 0.8), but I’m not overly worried.

    +

    +7.7 Assessing significance

    +

    We can ask several questions.

    +

    Is the model well-specified?

    +

    Roughly speaking this asks “can our model predict our data reasonably well?”

    +
    + +
    +
    +

    Unfortunately, there isn’t a single command that does this for us, and we have to lift some of the numbers from the summary output ourselves.

    +
    +
    1 - pchisq(9.1879, 59)
    +
    +
    [1] 1
    +
    +
    +

    Here, we’ve used the pchisq function (which calculates the correct probability for us – ask if you want a hand-wavy explanation). The first argument to it is the residual deviance value from the summary table, the second argument to is the residual degrees of freedom argument from the same table.

    +

    This gives us a probability of 1. We can interpret this as the probability that the model is actually good. There aren’t any strict conventions on how to interpret this value but, for me, a tiny value would indicate a rubbish model.

    +
    +
    +
    +

    Is the overall model better than the null model?

    +
    + +
    +
    +
    +
    1 - pchisq(84.5476 - 9.1879, 60 - 59)
    +
    +
    [1] 0
    +
    +
    +

    Here we’ve used the pchisq function again (if you didn’t ask before, you probably aren’t going to ask now). The first argument is the difference between the null and residual deviances and the second argument is the difference in degrees of freedom between the null and residual models. All of these values can be lifted from the summary table.

    +

    This gives us a probability of 0, which is technically not possible. This value is doing a formal test to see whether our fitted model is significantly different from the null model. Here we can treat this a classical hypothesis test and since this p-value is less than 0.05 then we can say that our fitted model (with blength as a predictor variable) is definitely better than the null model (which has no predictor variables). Woohoo!

    +
    +
    +
    +

    Are any of the individual predictors significant?

    +

    Finally, we’ll use the anova function from before to determine which predictor variables are important, and specifically in this case whether the glucose predictor is significant.

    +
    + +
    +
    +
    +
    anova(glm_bks , test = "Chisq")
    +
    +
    Analysis of Deviance Table
    +
    +Model: binomial, link: logit
    +
    +Response: pointed_beak
    +
    +Terms added sequentially (first to last)
    +
    +        Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
    +NULL                       60     84.548              
    +blength  1    75.36        59      9.188 < 2.2e-16 ***
    +---
    +Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    +
    +
    +

    The anova() function is a true workhorse within R! This time we’ve used it to create an Analysis of Deviance table. This is exactly equivalent to an ordinary ANOVA table where we have rows corresponding to each predictor variable and a p-value telling us whether that variable is significant or not.

    +

    The p-value for the blength predictor is written under then Pr(>Chi) column and we can see that it is less than < 2.2e-16. So, beak length is a significant predictor.

    +

    This shouldn’t be surprising since we already saw that our overall model was better than the null model, which in this case is exactly the same as asking whether the beak length term is significant. However, in more complicated models with multiple predictors these two comparisons (and p-values) won’t be the same.

    +
    +
    +
    +

    +7.8 Exercises

    +

    +7.8.1 Diabetes

    @@ -940,18 +1046,18 @@

    First we load the data, then we visualise it.

    -
    +
    -
    diabetes <- read_csv("data/diabetes.csv")
    +
    diabetes <- read_csv("data/diabetes.csv")
    -
    +
    -
    diabetes_py = pd.read_csv("data/diabetes.csv")
    +
    diabetes_py = pd.read_csv("data/diabetes.csv")
    @@ -960,35 +1066,35 @@

    We’ll have to deal with this soon. For now, we can plot the data, by outcome:

    -
    +
    -
    ggplot(diabetes,
    +
    ggplot(diabetes,
            aes(x = factor(test_result),
                y = glucose)) +
       geom_boxplot()
    -

    +

    -
    +

    We could just give Python the test_result data directly, but then it would view the values as numeric. Which doesn’t really work, because we have two groups as such: those with a negative diabetes test result, and those with a positive one.

    We can force Python to temporarily covert the data to a factor, by making the test_result column an object type. We can do this directly inside the ggplot() function.

    -
    (ggplot(diabetes_py,
    -         aes(x = diabetes_py.test_result.astype(object),
    -             y = "glucose")) +
    -     geom_boxplot())
    +
    (ggplot(diabetes_py,
    +         aes(x = diabetes_py.test_result.astype(object),
    +             y = "glucose")) +
    +     geom_boxplot())
    -

    +

    @@ -1000,33 +1106,33 @@

    We can visualise that differently by plotting all the data points as a classic binary response plot:

    -
    +
    -
    ggplot(diabetes,
    +
    ggplot(diabetes,
            aes(x = glucose,
                y = test_result)) +
       geom_point()
    -

    +

    -
    +
    -
    (ggplot(diabetes_py,
    -         aes(x = "glucose",
    -             y = "test_result")) +
    -     geom_point())
    +
    (ggplot(diabetes_py,
    +         aes(x = "glucose",
    +             y = "test_result")) +
    +     geom_point())
    -

    +

    @@ -1037,27 +1143,27 @@

    Create a suitable model

    -
    +

    We’ll use the glm() function to create a generalised linear model. Here we save the model in an object called glm_dia:

    -
    glm_dia <- glm(test_result ~ glucose,
    +
    glm_dia <- glm(test_result ~ glucose,
                    family = binomial,
                    data = diabetes)

    The format of this function is similar to that used by the lm() function for linear models. The important difference is that we must specify the family of error distribution to use. For logistic regression we must set the family to binomial.

    -
    +
    -
    # create a linear model
    -model = smf.glm(formula = "test_result ~ glucose",
    -                family = sm.families.Binomial(),
    -                data = diabetes_py)
    -# and get the fitted parameters of the model
    -glm_dia_py = model.fit()
    +
    # create a linear model
    +model = smf.glm(formula = "test_result ~ glucose",
    +                family = sm.families.Binomial(),
    +                data = diabetes_py)
    +# and get the fitted parameters of the model
    +glm_dia_py = model.fit()
    @@ -1065,13 +1171,13 @@

    Let’s look at the model parameters:

    -
    +
    -
    summary(glm_dia)
    +
    summary(glm_dia)
    
     Call:
    @@ -1094,9 +1200,9 @@ 

    -
    +
    -
    print(glm_dia_py.summary())
    +
    print(glm_dia_py.summary())
                     Generalized Linear Model Regression Results                  
     ==============================================================================
    @@ -1105,8 +1211,8 @@ 

    Model Family: Binomial Df Model: 1 Link Function: Logit Scale: 1.0000 Method: IRLS Log-Likelihood: -376.10 -Date: Fri, 19 Jan 2024 Deviance: 752.20 -Time: 18:05:59 Pearson chi2: 713. +Date: Mon, 22 Jan 2024 Deviance: 752.20 +Time: 11:16:06 Pearson chi2: 713. No. Iterations: 4 Pseudo R-squ. (CS): 0.2238 Covariance Type: nonrobust ============================================================================== @@ -1127,21 +1233,21 @@

    Using the formula above, we can now calculate the probability of having a positive test result, for a given glucose value. If we do this for glucose = 150, we get the following:

    -
    +
    -
    exp(-5.61 + 0.04 * 150) / (1 + exp(-5.61 + 0.04 * 145))
    +
    exp(-5.61 + 0.04 * 150) / (1 + exp(-5.61 + 0.04 * 145))
    [1] 0.6685441
    -
    +
    -
    math.exp(-5.61 + 0.04 * 150) / (1 + math.exp(-5.61 + 0.04 * 145))
    +
    math.exp(-5.61 + 0.04 * 150) / (1 + math.exp(-5.61 + 0.04 * 145))
    0.6685441044999503
    @@ -1161,8 +1267,8 @@

    -

    -7.7 Summary

    +

    +7.9 Summary

    diff --git a/_site/search.json b/_site/search.json index c33ca7f..c6ff993 100644 --- a/_site/search.json +++ b/_site/search.json @@ -213,7 +213,7 @@ "href": "materials/glm-practical-logistic-binary.html#model-output", "title": "\n7  Binary response\n", "section": "\n7.4 Model output", - "text": "7.4 Model output\nThat’s the easy part done! The trickier part is interpreting the output. First of all, we’ll get some summary information.\n\n\nR\nPython\n\n\n\n\nsummary(glm_bks)\n\n\nCall:\nglm(formula = pointed_beak ~ blength, family = binomial, data = early_finches)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) -43.410 15.250 -2.847 0.00442 **\nblength 3.387 1.193 2.839 0.00452 **\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for binomial family taken to be 1)\n\n Null deviance: 84.5476 on 60 degrees of freedom\nResidual deviance: 9.1879 on 59 degrees of freedom\nAIC: 13.188\n\nNumber of Fisher Scoring iterations: 8\n\n\n\n\n\nprint(glm_bks_py.summary())\n\n Generalized Linear Model Regression Results \n==============================================================================\nDep. Variable: pointed_beak No. Observations: 61\nModel: GLM Df Residuals: 59\nModel Family: Binomial Df Model: 1\nLink Function: Logit Scale: 1.0000\nMethod: IRLS Log-Likelihood: -4.5939\nDate: Fri, 19 Jan 2024 Deviance: 9.1879\nTime: 18:05:55 Pearson chi2: 15.1\nNo. Iterations: 8 Pseudo R-squ. (CS): 0.7093\nCovariance Type: nonrobust \n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept -43.4096 15.250 -2.847 0.004 -73.298 -13.521\nblength 3.3866 1.193 2.839 0.005 1.049 5.724\n==============================================================================\n\n\n\n\n\nThere’s a lot to unpack here, but let’s start with what we’re familiar with: coefficients!", + "text": "7.4 Model output\nThat’s the easy part done! The trickier part is interpreting the output. First of all, we’ll get some summary information.\n\n\nR\nPython\n\n\n\n\nsummary(glm_bks)\n\n\nCall:\nglm(formula = pointed_beak ~ blength, family = binomial, data = early_finches)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) -43.410 15.250 -2.847 0.00442 **\nblength 3.387 1.193 2.839 0.00452 **\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for binomial family taken to be 1)\n\n Null deviance: 84.5476 on 60 degrees of freedom\nResidual deviance: 9.1879 on 59 degrees of freedom\nAIC: 13.188\n\nNumber of Fisher Scoring iterations: 8\n\n\n\n\n\nprint(glm_bks_py.summary())\n\n Generalized Linear Model Regression Results \n==============================================================================\nDep. Variable: pointed_beak No. Observations: 61\nModel: GLM Df Residuals: 59\nModel Family: Binomial Df Model: 1\nLink Function: Logit Scale: 1.0000\nMethod: IRLS Log-Likelihood: -4.5939\nDate: Mon, 22 Jan 2024 Deviance: 9.1879\nTime: 11:16:02 Pearson chi2: 15.1\nNo. Iterations: 8 Pseudo R-squ. 
(CS): 0.7093\nCovariance Type: nonrobust \n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept -43.4096 15.250 -2.847 0.004 -73.298 -13.521\nblength 3.3866 1.193 2.839 0.005 1.049 5.724\n==============================================================================\n\n\n\n\n\nThere’s a lot to unpack here, but let’s start with what we’re familiar with: coefficients!", "crumbs": [ "Binary and proportional data", "7  Binary response" @@ -230,12 +230,34 @@ "7  Binary response" ] }, + { + "objectID": "materials/glm-practical-logistic-binary.html#assumptions", + "href": "materials/glm-practical-logistic-binary.html#assumptions", + "title": "\n7  Binary response\n", + "section": "\n7.6 Assumptions", + "text": "7.6 Assumptions\nAs explained in the background chapter, we can’t really use the standard diagnostic plots to assess assumptions. We’re not going to go into a lot of detail for now, but there is one thing that we can do: look for influential points using the Cook’s distance plot:\n\nR\n\n\n\nplot(glm_bks , which=4)\n\n\n\n\n\n\n\n\n\n\nThis shows that there are no very obvious influential points. You could regard point 34 as potentially influential (it’s got a Cook’s distance of around 0.8), but I’m not overly worried.", + "crumbs": [ + "Binary and proportional data", + "7  Binary response" + ] + }, + { + "objectID": "materials/glm-practical-logistic-binary.html#assessing-significance", + "href": "materials/glm-practical-logistic-binary.html#assessing-significance", + "title": "\n7  Binary response\n", + "section": "\n7.7 Assessing significance", + "text": "7.7 Assessing significance\nWe can ask several questions.\nIs the model well-specified?\nRoughly speaking this asks “can our model predict our data reasonably well?”\n\nR\n\n\nUnfortunately, there isn’t a single command that does this for us, and we have to lift some of the numbers from the summary output ourselves.\n\n1 - pchisq(9.1879, 59)\n\n[1] 1\n\n\nHere, we’ve used the pchisq function (which calculates the correct probability for us – ask if you want a hand-wavy explanation). The first argument to it is the residual deviance value from the summary table, the second argument to is the residual degrees of freedom argument from the same table.\nThis gives us a probability of 1. We can interpret this as the probability that the model is actually good. There aren’t any strict conventions on how to interpret this value but, for me, a tiny value would indicate a rubbish model.\n\n\n\nIs the overall model better than the null model?\n\nR\n\n\n\n1 - pchisq(84.5476 - 9.1879, 60 - 59)\n\n[1] 0\n\n\nHere we’ve used the pchisq function again (if you didn’t ask before, you probably aren’t going to ask now). The first argument is the difference between the null and residual deviances and the second argument is the difference in degrees of freedom between the null and residual models. All of these values can be lifted from the summary table.\nThis gives us a probability of 0, which is technically not possible. This value is doing a formal test to see whether our fitted model is significantly different from the null model. Here we can treat this a classical hypothesis test and since this p-value is less than 0.05 then we can say that our fitted model (with blength as a predictor variable) is definitely better than the null model (which has no predictor variables). 
Woohoo!\n\n\n\nAre any of the individual predictors significant?\nFinally, we’ll use the anova function from before to determine which predictor variables are important, and specifically in this case whether the glucose predictor is significant.\n\nR\n\n\n\nanova(glm_bks , test = \"Chisq\")\n\nAnalysis of Deviance Table\n\nModel: binomial, link: logit\n\nResponse: pointed_beak\n\nTerms added sequentially (first to last)\n\n Df Deviance Resid. Df Resid. Dev Pr(>Chi) \nNULL 60 84.548 \nblength 1 75.36 59 9.188 < 2.2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n\nThe anova() function is a true workhorse within R! This time we’ve used it to create an Analysis of Deviance table. This is exactly equivalent to an ordinary ANOVA table where we have rows corresponding to each predictor variable and a p-value telling us whether that variable is significant or not.\nThe p-value for the blength predictor is written under then Pr(>Chi) column and we can see that it is less than < 2.2e-16. So, beak length is a significant predictor.\nThis shouldn’t be surprising since we already saw that our overall model was better than the null model, which in this case is exactly the same as asking whether the beak length term is significant. However, in more complicated models with multiple predictors these two comparisons (and p-values) won’t be the same.", + "crumbs": [ + "Binary and proportional data", + "7  Binary response" + ] + }, { "objectID": "materials/glm-practical-logistic-binary.html#exercises", "href": "materials/glm-practical-logistic-binary.html#exercises", "title": "\n7  Binary response\n", - "section": "\n7.6 Exercises", - "text": "7.6 Exercises\n\n7.6.1 Diabetes\n\n\n\n\n\n\nExercise\n\n\n\n\n\n\n\nLevel: \nFor this exercise we’ll be using the data from data/diabetes.csv.\nThis is a data set comprising 768 observations of three variables (one dependent and two predictor variables). This records the results of a diabetes test result as a binary variable (1 is a positive result, 0 is a negative result), along with the result of a glucose tolerance test and the diastolic blood pressure for each of 768 women. The variables are called test_result, glucose and diastolic.\nWe want to see if the glucose tolerance is a meaningful predictor for predictions on a diabetes test. To investigate this, do the following:\n\nLoad and visualise the data\nCreate a suitable model\nDetermine if there are any statistically significant predictors\nCalculate the probability of a positive diabetes test result for a glucose tolerance test value of glucose = 150\n\n\n\n\n\n\n\n\nAnswer\n\n\n\n\n\n\n\nLoad and visualise the data\nFirst we load the data, then we visualise it.\n\n\nR\nPython\n\n\n\n\ndiabetes <- read_csv(\"data/diabetes.csv\")\n\n\n\n\ndiabetes_py = pd.read_csv(\"data/diabetes.csv\")\n\n\n\n\nLooking at the data, we can see that the test_result column contains zeros and ones. These are yes/no test result outcomes and not actually numeric representations.\nWe’ll have to deal with this soon. For now, we can plot the data, by outcome:\n\n\nR\nPython\n\n\n\n\nggplot(diabetes,\n aes(x = factor(test_result),\n y = glucose)) +\n geom_boxplot()\n\n\n\n\n\n\n\n\n\nWe could just give Python the test_result data directly, but then it would view the values as numeric. 
Which doesn’t really work, because we have two groups as such: those with a negative diabetes test result, and those with a positive one.\nWe can force Python to temporarily covert the data to a factor, by making the test_result column an object type. We can do this directly inside the ggplot() function.\n\n(ggplot(diabetes_py,\n aes(x = diabetes_py.test_result.astype(object),\n y = \"glucose\")) +\n geom_boxplot())\n\n\n\n\n\n\n\n\n\n\nIt looks as though the patients with a positive diabetes test have slightly higher glucose levels than those with a negative diabetes test.\nWe can visualise that differently by plotting all the data points as a classic binary response plot:\n\n\nR\nPython\n\n\n\n\nggplot(diabetes,\n aes(x = glucose,\n y = test_result)) +\n geom_point()\n\n\n\n\n\n\n\n\n\n\n(ggplot(diabetes_py,\n aes(x = \"glucose\",\n y = \"test_result\")) +\n geom_point())\n\n\n\n\n\n\n\n\n\n\nCreate a suitable model\n\n\nR\nPython\n\n\n\nWe’ll use the glm() function to create a generalised linear model. Here we save the model in an object called glm_dia:\n\nglm_dia <- glm(test_result ~ glucose,\n family = binomial,\n data = diabetes)\n\nThe format of this function is similar to that used by the lm() function for linear models. The important difference is that we must specify the family of error distribution to use. For logistic regression we must set the family to binomial.\n\n\n\n# create a linear model\nmodel = smf.glm(formula = \"test_result ~ glucose\",\n family = sm.families.Binomial(),\n data = diabetes_py)\n# and get the fitted parameters of the model\nglm_dia_py = model.fit()\n\n\n\n\nLet’s look at the model parameters:\n\n\nR\nPython\n\n\n\n\nsummary(glm_dia)\n\n\nCall:\nglm(formula = test_result ~ glucose, family = binomial, data = diabetes)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) -5.611732 0.442289 -12.69 <2e-16 ***\nglucose 0.039510 0.003398 11.63 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for binomial family taken to be 1)\n\n Null deviance: 936.6 on 727 degrees of freedom\nResidual deviance: 752.2 on 726 degrees of freedom\nAIC: 756.2\n\nNumber of Fisher Scoring iterations: 4\n\n\n\n\n\nprint(glm_dia_py.summary())\n\n Generalized Linear Model Regression Results \n==============================================================================\nDep. Variable: test_result No. Observations: 728\nModel: GLM Df Residuals: 726\nModel Family: Binomial Df Model: 1\nLink Function: Logit Scale: 1.0000\nMethod: IRLS Log-Likelihood: -376.10\nDate: Fri, 19 Jan 2024 Deviance: 752.20\nTime: 18:05:59 Pearson chi2: 713.\nNo. Iterations: 4 Pseudo R-squ. (CS): 0.2238\nCovariance Type: nonrobust \n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept -5.6117 0.442 -12.688 0.000 -6.479 -4.745\nglucose 0.0395 0.003 11.628 0.000 0.033 0.046\n==============================================================================\n\n\n\n\n\nWe can see that glucose is a significant predictor for the test_result (the \\(p\\) value is much smaller than 0.05).\nKnowing this, we’re interested in the coefficients. We have an intercept of -5.61 and 0.0395 for glucose. 
We can use these coefficients to write a formula that describes the potential relationship between the probability of having a positive test result, dependent on the glucose tolerance level value:\n\\[ P(positive \\ test\\ result) = \\frac{\\exp(-5.61 + 0.04 \\times glucose)}{1 + \\exp(-5.61 + 0.04 \\times glucose)} \\]\nCalculating probabilities\nUsing the formula above, we can now calculate the probability of having a positive test result, for a given glucose value. If we do this for glucose = 150, we get the following:\n\n\nR\nPython\n\n\n\n\nexp(-5.61 + 0.04 * 150) / (1 + exp(-5.61 + 0.04 * 145))\n\n[1] 0.6685441\n\n\n\n\n\nmath.exp(-5.61 + 0.04 * 150) / (1 + math.exp(-5.61 + 0.04 * 145))\n\n0.6685441044999503\n\n\n\n\n\nThis tells us that the probability of having a positive diabetes test result, given a glucose tolerance level of 150 is around 67%.", + "section": "\n7.8 Exercises", + "text": "7.8 Exercises\n\n7.8.1 Diabetes\n\n\n\n\n\n\nExercise\n\n\n\n\n\n\n\nLevel: \nFor this exercise we’ll be using the data from data/diabetes.csv.\nThis is a data set comprising 768 observations of three variables (one dependent and two predictor variables). This records the results of a diabetes test result as a binary variable (1 is a positive result, 0 is a negative result), along with the result of a glucose tolerance test and the diastolic blood pressure for each of 768 women. The variables are called test_result, glucose and diastolic.\nWe want to see if the glucose tolerance is a meaningful predictor for predictions on a diabetes test. To investigate this, do the following:\n\nLoad and visualise the data\nCreate a suitable model\nDetermine if there are any statistically significant predictors\nCalculate the probability of a positive diabetes test result for a glucose tolerance test value of glucose = 150\n\n\n\n\n\n\n\n\nAnswer\n\n\n\n\n\n\n\nLoad and visualise the data\nFirst we load the data, then we visualise it.\n\n\nR\nPython\n\n\n\n\ndiabetes <- read_csv(\"data/diabetes.csv\")\n\n\n\n\ndiabetes_py = pd.read_csv(\"data/diabetes.csv\")\n\n\n\n\nLooking at the data, we can see that the test_result column contains zeros and ones. These are yes/no test result outcomes and not actually numeric representations.\nWe’ll have to deal with this soon. For now, we can plot the data, by outcome:\n\n\nR\nPython\n\n\n\n\nggplot(diabetes,\n aes(x = factor(test_result),\n y = glucose)) +\n geom_boxplot()\n\n\n\n\n\n\n\n\n\nWe could just give Python the test_result data directly, but then it would view the values as numeric. Which doesn’t really work, because we have two groups as such: those with a negative diabetes test result, and those with a positive one.\nWe can force Python to temporarily covert the data to a factor, by making the test_result column an object type. 
We can do this directly inside the ggplot() function.\n\n(ggplot(diabetes_py,\n aes(x = diabetes_py.test_result.astype(object),\n y = \"glucose\")) +\n geom_boxplot())\n\n\n\n\n\n\n\n\n\n\nIt looks as though the patients with a positive diabetes test have slightly higher glucose levels than those with a negative diabetes test.\nWe can visualise that differently by plotting all the data points as a classic binary response plot:\n\n\nR\nPython\n\n\n\n\nggplot(diabetes,\n aes(x = glucose,\n y = test_result)) +\n geom_point()\n\n\n\n\n\n\n\n\n\n\n(ggplot(diabetes_py,\n aes(x = \"glucose\",\n y = \"test_result\")) +\n geom_point())\n\n\n\n\n\n\n\n\n\n\nCreate a suitable model\n\n\nR\nPython\n\n\n\nWe’ll use the glm() function to create a generalised linear model. Here we save the model in an object called glm_dia:\n\nglm_dia <- glm(test_result ~ glucose,\n family = binomial,\n data = diabetes)\n\nThe format of this function is similar to that used by the lm() function for linear models. The important difference is that we must specify the family of error distribution to use. For logistic regression we must set the family to binomial.\n\n\n\n# create a linear model\nmodel = smf.glm(formula = \"test_result ~ glucose\",\n family = sm.families.Binomial(),\n data = diabetes_py)\n# and get the fitted parameters of the model\nglm_dia_py = model.fit()\n\n\n\n\nLet’s look at the model parameters:\n\n\nR\nPython\n\n\n\n\nsummary(glm_dia)\n\n\nCall:\nglm(formula = test_result ~ glucose, family = binomial, data = diabetes)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) -5.611732 0.442289 -12.69 <2e-16 ***\nglucose 0.039510 0.003398 11.63 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for binomial family taken to be 1)\n\n Null deviance: 936.6 on 727 degrees of freedom\nResidual deviance: 752.2 on 726 degrees of freedom\nAIC: 756.2\n\nNumber of Fisher Scoring iterations: 4\n\n\n\n\n\nprint(glm_dia_py.summary())\n\n Generalized Linear Model Regression Results \n==============================================================================\nDep. Variable: test_result No. Observations: 728\nModel: GLM Df Residuals: 726\nModel Family: Binomial Df Model: 1\nLink Function: Logit Scale: 1.0000\nMethod: IRLS Log-Likelihood: -376.10\nDate: Mon, 22 Jan 2024 Deviance: 752.20\nTime: 11:16:06 Pearson chi2: 713.\nNo. Iterations: 4 Pseudo R-squ. (CS): 0.2238\nCovariance Type: nonrobust \n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept -5.6117 0.442 -12.688 0.000 -6.479 -4.745\nglucose 0.0395 0.003 11.628 0.000 0.033 0.046\n==============================================================================\n\n\n\n\n\nWe can see that glucose is a significant predictor for the test_result (the \\(p\\) value is much smaller than 0.05).\nKnowing this, we’re interested in the coefficients. We have an intercept of -5.61 and 0.0395 for glucose. 
We can use these coefficients to write a formula that describes the potential relationship between the probability of having a positive test result, dependent on the glucose tolerance level value:\n\\[ P(positive \\ test\\ result) = \\frac{\\exp(-5.61 + 0.04 \\times glucose)}{1 + \\exp(-5.61 + 0.04 \\times glucose)} \\]\nCalculating probabilities\nUsing the formula above, we can now calculate the probability of having a positive test result, for a given glucose value. If we do this for glucose = 150, we get the following:\n\n\nR\nPython\n\n\n\n\nexp(-5.61 + 0.04 * 150) / (1 + exp(-5.61 + 0.04 * 145))\n\n[1] 0.6685441\n\n\n\n\n\nmath.exp(-5.61 + 0.04 * 150) / (1 + math.exp(-5.61 + 0.04 * 145))\n\n0.6685441044999503\n\n\n\n\n\nThis tells us that the probability of having a positive diabetes test result, given a glucose tolerance level of 150 is around 67%.", "crumbs": [ "Binary and proportional data", "7  Binary response" @@ -245,8 +267,8 @@ "objectID": "materials/glm-practical-logistic-binary.html#summary", "href": "materials/glm-practical-logistic-binary.html#summary", "title": "\n7  Binary response\n", - "section": "\n7.7 Summary", - "text": "7.7 Summary\n\n\n\n\n\n\nKey points\n\n\n\n\nWe use a logistic regression to model a binary response\nWe can feed new observations into the model and get probabilities for the outcome\n\n\n\n\n\n\n\nLamichhaney, Sangeet, Fan Han, Matthew T. Webster, B. Rosemary Grant, Peter R. Grant, and Leif Andersson. 2020. “Female-Biased Gene Flow Between Two Species of Darwin’s Finches.” Nature Ecology & Evolution 4 (7): 979–86. https://doi.org/10.1038/s41559-020-1183-9.", + "section": "\n7.9 Summary", + "text": "7.9 Summary\n\n\n\n\n\n\nKey points\n\n\n\n\nWe use a logistic regression to model a binary response\nWe can feed new observations into the model and get probabilities for the outcome\n\n\n\n\n\n\n\nLamichhaney, Sangeet, Fan Han, Matthew T. Webster, B. Rosemary Grant, Peter R. Grant, and Leif Andersson. 2020. “Female-Biased Gene Flow Between Two Species of Darwin’s Finches.” Nature Ecology & Evolution 4 (7): 979–86. https://doi.org/10.1038/s41559-020-1183-9.", "crumbs": [ "Binary and proportional data", "7  Binary response" @@ -328,5 +350,104 @@ "Binary and proportional data", "8  Proportional response" ] + }, + { + "objectID": "materials/glm-practical-poisson.html", + "href": "materials/glm-practical-poisson.html", + "title": "\n9  Count data\n", + "section": "", + "text": "9.1 Libraries and functions\nThe examples in this section use the following data sets:\ndata/islands.csv\nThis is a data set comprising 35 observations of two variables (one dependent and one predictor). This records the number of species recorded on different small islands along with the area (km2) of the islands. The variables are species and area.\nThe second data set is on seat belts.\nThe seatbelts data set is a multiple time-series data set that was commissioned by the Department of Transport in 1984 to measure differences in deaths before and after front seat belt legislation was introduced on 31st January 1983. It provides monthly total numerical data on a number of incidents including those related to death and injury in Road Traffic Accidents (RTA’s). 
The data set starts in January 1969 and observations run until December 1984.\nYou can find the file in data/seatbelts.csv", + "crumbs": [ + "Count data", + "9  Count data" + ] + }, + { + "objectID": "materials/glm-practical-poisson.html#libraries-and-functions", + "href": "materials/glm-practical-poisson.html#libraries-and-functions", + "title": "\n9  Count data\n", + "section": "", + "text": "Click to expand\n\n\n\n\n\n\nR\n\n\n\n9.1.1 Libraries\n\n9.1.2 Functions", + "crumbs": [ + "Count data", + "9  Count data" + ] + }, + { + "objectID": "materials/glm-practical-poisson.html#load-and-visualise-the-data", + "href": "materials/glm-practical-poisson.html#load-and-visualise-the-data", + "title": "\n9  Count data\n", + "section": "\n9.2 Load and visualise the data", + "text": "9.2 Load and visualise the data\nFirst we load the data, then we visualise it.\n\nR\n\n\n\nislands <- read_csv(\"data/islands.csv\")\n\nLet’s have a glimpse at the data:\n\nislands\n\n# A tibble: 35 × 2\n species area\n <dbl> <dbl>\n 1 114 12.1\n 2 130 13.4\n 3 113 13.7\n 4 109 14.5\n 5 118 16.8\n 6 136 19.0\n 7 149 19.6\n 8 162 20.6\n 9 145 20.9\n10 148 21.0\n# ℹ 25 more rows\n\n\n\n\n\nLooking at the data, we can see that there are two columns: species, which contains the number of species recorded on each island and area, which contains the surface area of the island in square kilometers.\nWe can plot the data:\n\nR\n\n\n\nggplot(islands, aes(x = area, y = species)) +\n geom_point()\n\n\n\n\n\n\n\n\n\n\nIt looks as though area may have an effect on the number of species that we observe on each island. We note that the response variable is count data and so we try to construct a Poisson regression.", + "crumbs": [ + "Count data", + "9  Count data" + ] + }, + { + "objectID": "materials/glm-practical-poisson.html#constructing-a-model", + "href": "materials/glm-practical-poisson.html#constructing-a-model", + "title": "\n9  Count data\n", + "section": "\n9.3 Constructing a model", + "text": "9.3 Constructing a model\n\nR\n\n\n\nglm_isl <- glm(species ~ area,\n data = islands, family = \"poisson\")\n\nand we look at the model summary:\n\nsummary(glm_isl)\n\n\nCall:\nglm(formula = species ~ area, family = \"poisson\", data = islands)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) 4.241129 0.041322 102.64 <2e-16 ***\narea 0.035613 0.001247 28.55 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for poisson family taken to be 1)\n\n Null deviance: 856.899 on 34 degrees of freedom\nResidual deviance: 30.437 on 33 degrees of freedom\nAIC: 282.66\n\nNumber of Fisher Scoring iterations: 3\n\n\nThe output is strikingly similar to the logistic regression models (who’d have guessed, eh?) and the main numbers to extract from the output are the two numbers underneath Estimate.Std in the Coefficients table:\n(Intercept) 4.241129\narea 0.035613\n\n\n\nThese are the coefficients of the Poisson model equation and need to be placed in the following formula in order to estimate the expected number of species as a function of island size:\n\\[ E(species) = \\exp(4.24 + 0.036 \\times area) \\]\nInterpreting this requires a bit of thought (not much, but a bit). The intercept coefficient, 4.24, is related to the number of species we would expect on an island of zero area (this is statistics, not real life. You’d do well to remember that before you worry too much about what that even means). 
But in order to turn this number into something meaningful we have to exponentiate it. Since exp(4.24) ≈ 70, we can say that the baseline number of species the model expects on any island is 70. This isn’t actually the interesting bit though.\nThe coefficient of area is the fun bit. For starters we can see that it is a positive number which does mean that increasing area leads to increasing numbers of species. Good so far.\nBut what does the value 0.036 actually mean? Well, if we exponentiate it as well, we get exp(0.036) ≈ 1.04. This means that for every increase in area of 1 km^2 (the original units of the area variable), the number of species on the island is multiplied by 1.04. So, an island of area 1 km^2 will have 1.04 x 70 ≈ 72 species.\nSo, in order to interpret Poisson coefficients, you have to exponentiate them.", + "crumbs": [ + "Count data", + "9  Count data" + ] + }, + { + "objectID": "materials/glm-practical-poisson.html#plotting-the-poisson-regression", + "href": "materials/glm-practical-poisson.html#plotting-the-poisson-regression", + "title": "\n9  Count data\n", + "section": "\n9.4 Plotting the Poisson regression", + "text": "9.4 Plotting the Poisson regression\n\nR\n\n\n\nggplot(islands, aes(area, species)) +\n geom_point() +\n geom_smooth(method = \"glm\", se = FALSE, fullrange = TRUE, \n method.args = list(family = poisson)) +\n xlim(10,50)\n\n`geom_smooth()` using formula = 'y ~ x'", + "crumbs": [ + "Count data", + "9  Count data" + ] + }, + { + "objectID": "materials/glm-practical-poisson.html#assumptions", + "href": "materials/glm-practical-poisson.html#assumptions", + "title": "\n9  Count data\n", + "section": "\n9.5 Assumptions", + "text": "9.5 Assumptions\nAs we mentioned earlier, Poisson regressions require that the variance of the data at any point is the same as the mean of the data at that point. We checked that earlier by looking at the residual deviance values.\nWe can look for influential points using the Cook’s distance plot:\n\nR\n\n\n\nplot(glm_isl , which=4)\n\n\n\n\n\n\n\n\n\n\nNone of our points have particularly large Cook’s distances and so life is rosy.", + "crumbs": [ + "Count data", + "9  Count data" + ] + }, + { + "objectID": "materials/glm-practical-poisson.html#assessing-significance", + "href": "materials/glm-practical-poisson.html#assessing-significance", + "title": "\n9  Count data\n", + "section": "\n9.6 Assessing significance", + "text": "9.6 Assessing significance\nWe can ask the same three questions we asked before.\n\nIs the model well-specified?\nIs the overall model better than the null model?\nAre any of the individual predictors significant?\n\nAgain, in this case, questions 2 and 3 are effectively asking the same thing because we still only have a single predictor variable.\nTo assess if the model is any good we’ll again use the residual deviance and the residual degrees of freedom.\n\nR\n\n\n\n1 - pchisq(30.437, 33)\n\n[1] 0.5953482\n\n\n\n\n\nThis gives a probability of 0.60. This suggests that this model is actually a good one and that the data are pretty well supported by the model. For Poisson models this has an extra interpretation. This can be used to assess whether we have significant over-dispersion in our data.\nFor a Poisson model to be appropriate we need that the variance of the data to be exactly the same as the mean of the data. Visually, this would correspond to the data spreading out more for higher predicted values of species. However, we don’t want the data to spread out too much. 
If that happens then a Poisson model wouldn’t be appropriate.\nThe easy way to check this is to look at the ratio of the residual deviance to the residual degrees of freedom (in this case 0.922). For a Poisson model to be valid, this ratio should be about 1. If the ratio is significantly bigger than 1 then we say that we have over-dispersion in the model and we wouldn’t be able to trust any of the significance testing that we are about to do using a Poisson regression.\nThankfully the probability we have just created (0.60) is exactly the right one we need to look at to assess whether we have significant over-dispersion in our model.\nSecondly, to assess whether the overall model, with all of the terms, is better than the null model we’ll look at the difference in deviances and the difference in degrees of freedom:\n\nR\n\n\n\n1 - pchisq(856.899 - 30.437, 34 - 33)\n\n[1] 0\n\n\n\n\n\nThis gives a reported p-value of 0, which is pretty damn small. So, yes, this model is better than nothing at all and species does appear to change with some of our predictors\nFinally, we’ll construct an analysis of deviance table to look at the individual terms:\n\nR\n\n\n\nanova(glm_isl , test = \"Chisq\")\n\nAnalysis of Deviance Table\n\nModel: poisson, link: log\n\nResponse: species\n\nTerms added sequentially (first to last)\n\n Df Deviance Resid. Df Resid. Dev Pr(>Chi) \nNULL 34 856.90 \narea 1 826.46 33 30.44 < 2.2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n\nThe p-value in this table is just as small as we’d expect given our previous result (<2.2e-16 is pretty close to 0), and we have the nice consistent result that area definitely has an effect on species.", + "crumbs": [ + "Count data", + "9  Count data" + ] + }, + { + "objectID": "materials/glm-practical-poisson.html#exercises", + "href": "materials/glm-practical-poisson.html#exercises", + "title": "\n9  Count data\n", + "section": "\n9.7 Exercises", + "text": "9.7 Exercises\n\n9.7.1 Seat belts\n\n\n\n\n\n\nExercise\n\n\n\n\n\n\n\nLevel: \nFor this exercise we’ll be using the data from data/seatbelts.csv.\nI’d like you to do the following:\n\nLoad the data\nVisualise the data and create a poisson regression model\nPlot the regression model on top of the data\nAssess if the model is a decent predictor for the number of fatalities\n\n\n\n\n\n\n\nAnswer\n\n\n\n\n\n\n\nLoad and visualise the data\nFirst we load the data, then we visualise it.\n\nR\n\n\n\nseatbelts <- read_csv(\"data/seatbelts.csv\")\n\n\n\n\nThe data tracks the number of drivers killed in road traffic accidents, before and after the seat belt law was introduced. The information on whether the law was in place is encoded in the law column as 0 (law not in place) or 1 (law in place).\nThere are many more observations when the law was not in place, so we need to keep this in mind when we’re interpreting the data.\nFirst we have a look at the data comparing no law vs law:\n\nR\n\n\nWe have to convert the law column to a factor, otherwise R will see it as numerical.\n\nseatbelts %>% \n ggplot(aes(as_factor(law), drivers_killed)) +\n geom_boxplot()\n\n\n\n\n\n\n\nThe data are recorded by month and year, so we can also display the number of drivers killed by year:\n\nseatbelts %>% \n ggplot(aes(year, drivers_killed)) +\n geom_point()\n\n\n\n\n\n\n\n\n\n\nThe data look a bit weird. There is quite some variation within years (keeping in mind that the data are aggregated monthly). The data also seems to wave around a bit… with some vague peaks (e.g. 
1972 - 1973) and some troughs (e.g. around 1976).\nSo my initial thought is that these data are going to be a bit tricky to interpret. But that’s OK.\nConstructing a model\n\nR\n\n\n\nglm_stb <- glm(drivers_killed ~ year,\n data = seatbelts, family = \"poisson\")\n\nand we look at the model summary:\n\nsummary(glm_stb)\n\n\nCall:\nglm(formula = drivers_killed ~ year, family = \"poisson\", data = seatbelts)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) 37.168958 2.796636 13.29 <2e-16 ***\nyear -0.016373 0.001415 -11.57 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for poisson family taken to be 1)\n\n Null deviance: 984.50 on 191 degrees of freedom\nResidual deviance: 850.41 on 190 degrees of freedom\nAIC: 2127.2\n\nNumber of Fisher Scoring iterations: 4\n\n\n(Intercept) 37.168958\nyear -0.016373\n\n\n\nThese are the coefficients of the Poisson model equation and need to be placed in the following formula in order to estimate the expected number of drivers killed as a function of year:\n\\[ E(drivers\\_killed) = \\exp(37.17 - 0.0164 \\times year) \\]\nAssessing significance\nIs the model well-specified?\n\nR\n\n\n\n1 - pchisq(850.41, 190)\n\n[1] 0\n\n\n\n\n\nHow about the overall fit?\n\nR\n\n\n\n1 - pchisq(984.50 - 850.41, 191 - 190)\n\n[1] 0\n\n\n\n\n\nPlotting the regression\n\nR\n\n\n\nggplot(seatbelts, aes(year, drivers_killed)) +\n geom_point() +\n geom_smooth(method = \"glm\", se = FALSE, fullrange = TRUE, \n method.args = list(family = poisson)) +\n xlim(1970,1985)\n\n`geom_smooth()` using formula = 'y ~ x'\n\n\nConclusions\nThe model is significantly better than the null model, but it is not well-specified: the residual deviance (850.41) is much larger than the residual degrees of freedom (190), so the data are over-dispersed. We should therefore be cautious about using it to predict the number of fatalities; it looks like we are missing some important predictors.", + "crumbs": [ + "Count data", + "9  Count data" + ] + }, + { + "objectID": "materials/glm-practical-poisson.html#summary", + "href": "materials/glm-practical-poisson.html#summary", + "title": "\n9  Count data\n", + "section": "\n9.8 Summary", + "text": "9.8 Summary\n\n\n\n\n\n\nKey points\n\n\n\n\nPoisson regression is useful when dealing with count data", + "crumbs": [ + "Count data", + "9  Count data" + ] } ] \ No newline at end of file diff --git a/_site/setup.html b/_site/setup.html index b3dec5b..742ab7f 100644 --- a/_site/setup.html +++ b/_site/setup.html @@ -209,6 +209,23 @@ 8  Proportional response
+ [17 added sidebar navigation lines for the new "Count data" chapter] diff --git a/materials/_chapters.yml b/materials/_chapters.yml index 0b8f38c..6a7d5b1 100644 --- a/materials/_chapters.yml +++ b/materials/_chapters.yml @@ -8,6 +8,6 @@ book: chapters: - materials/glm-practical-logistic-binary.qmd - materials/glm-practical-logistic-proportion.qmd - #- part: "Count data" - #chapters: - #- materials/glm-practical-poisson.qmd + - part: "Count data" + chapters: + - materials/glm-practical-poisson.qmd diff --git a/materials/data/island.csv b/materials/data/islands.csv similarity index 100% rename from materials/data/island.csv rename to materials/data/islands.csv diff --git a/materials/glm-practical-logistic-binary.qmd b/materials/glm-practical-logistic-binary.qmd index 14fd64c..25a351f 100644 --- a/materials/glm-practical-logistic-binary.qmd +++ b/materials/glm-practical-logistic-binary.qmd @@ -513,6 +513,84 @@ Short beaks are more closely associated with the bluntly shaped beaks, whereas l + +## Assumptions + +As explained in the background chapter, we can't really use the standard diagnostic plots to assess assumptions. We're not going to go into a lot of detail for now, but there is one thing that we can do: look for influential points using the Cook’s distance plot: + +::: {.panel-tabset group="language"} +## R + +```{r} +plot(glm_bks, which = 4) +``` + +::: + +This shows that there are no very obvious influential points. You could regard point `34` as potentially influential (it's got a Cook's distance of around `0.8`), but I'm not overly worried. + +## Assessing significance + +We can ask several questions. + +**Is the model well-specified?** + +Roughly speaking this asks "can our model predict our data reasonably well?" + +::: {.panel-tabset group="language"} +## R + +Unfortunately, there isn’t a single command that does this for us, and we have to lift some of the numbers from the summary output ourselves. + +```{r} +1 - pchisq(9.1879, 59) +``` + + +Here, we’ve used the `pchisq` function (which calculates the correct probability for us – ask if you want a hand-wavy explanation). The first argument is the residual deviance value from the summary table, and the second argument is the residual degrees of freedom, also from the summary table. + +This gives us a probability of `1`. We can interpret this as the probability that the model is actually good. There aren’t any strict conventions on how to interpret this value but, for me, a tiny value would indicate a rubbish model. + +::: + +**Is the overall model better than the null model?** + +::: {.panel-tabset group="language"} +## R + +```{r} +1 - pchisq(84.5476 - 9.1879, 60 - 59) +``` + +Here we’ve used the `pchisq` function again (if you didn’t ask before, you probably aren’t going to ask now). The first argument is the difference between the null and residual deviances and the second argument is the difference in degrees of freedom between the null and residual models. All of these values can be lifted from the summary table. + +This gives us a reported probability of 0, which is really a very small number that has been rounded down. This value is doing a formal test to see whether our fitted model is significantly different from the null model. Here we can treat this as a classical hypothesis test and since this p-value is less than 0.05 then we can say that our fitted model (with `blength` as a predictor variable) is definitely better than the null model (which has no predictor variables). Woohoo!
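+
+If you'd rather not copy the numbers across by hand, the same test can be run straight from the fitted model object. This is a small sketch relying on the standard components that `glm()` stores in its output (`null.deviance`, `deviance`, `df.null` and `df.residual`):
+
+```{r}
+# difference in deviance and degrees of freedom, lifted from the model itself
+pchisq(glm_bks$null.deviance - glm_bks$deviance,
+       glm_bks$df.null - glm_bks$df.residual,
+       lower.tail = FALSE)
+```
+
+Setting `lower.tail = FALSE` asks for the upper tail probability directly, so this is equivalent to the `1 - pchisq()` construction above.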
+ +::: + +**Are any of the individual predictors significant?** + +Finally, we’ll use the `anova()` function from before to determine which predictor variables are important, and specifically in this case whether the beak length (`blength`) predictor is significant. + +::: {.panel-tabset group="language"} +## R + +```{r} +anova(glm_bks, test = "Chisq") +``` + +The `anova()` function is a true workhorse within R! This time we’ve used it to create an Analysis of Deviance table. This is exactly equivalent to an ordinary ANOVA table where we have rows corresponding to each predictor variable and a p-value telling us whether that variable is significant or not. + +The p-value for the `blength` predictor is written under the `Pr(>Chi)` column and we can see that it is `< 2.2e-16`. So, beak length is a significant predictor. + +This shouldn’t be surprising since we already saw that our overall model was better than the null model, which in this case is exactly the same as asking whether the beak length term is significant. However, in more complicated models with multiple predictors these two comparisons (and p-values) won’t be the same. + + +::: + + + + ## Exercises ### Diabetes {#sec-exr_diabetes} diff --git a/materials/glm-practical-poisson-original.qmd b/materials/glm-practical-poisson-original.qmd new file mode 100644 index 0000000..bdf470e --- /dev/null +++ b/materials/glm-practical-poisson-original.qmd @@ -0,0 +1,473 @@ +--- +title: "Poisson regression" +--- + +```{r, echo=FALSE} +# adjust and load as needed +source(file = "setup_files/setup.R") +``` + +::: callout-note +## Aims & objectives +- How do we analyse count data? +- Be able to perform a poisson regression on count data +::: + +## Libraries and functions + +::: panel-tabset +## tidyverse + +| Library | Description | |:-----------------------------------|:-----------------------------------| | `tidyverse` | A collection of R packages designed for data science | | `tidymodels` | A collection of packages for modelling and machine learning using tidyverse principles | | `poissonreg` | Enables the `parsnip` package to fit various types of Poisson regression models | ::: ## Datasets ::: panel-tabset ## Islands The example in this section uses the following data set: `data/islands.csv` This is a data set comprising 35 observations of two variables (one dependent and one predictor). This records the number of species recorded on different small islands along with the area (km2) of the islands. The variables are `species` and `area`. ## Seatbelts The `seatbelts` data set is a multiple time-series data set that was commissioned by the Department of Transport in 1984 to measure differences in deaths before and after front seatbelt legislation was introduced on 31st January 1983. It provides monthly total numerical data on a number of incidents including those related to death and injury in Road Traffic Accidents (RTA's). The data set starts in January 1969 and observations run until December 1984. You can find the file in `data/seatbelts.csv` ::: ## Visualise the data A good first step is always to explore your data prior to any further analysis. 
+ +::: panel-tabset +## tidyverse + +First, we load and inspect the data: + +```{r, message=FALSE, warning=FALSE} +islands <- read_csv("data/islands.csv") + +islands +``` + +Looking at the data, we can see that there are two columns: `species`, which contains the number of species recorded on each island and `area`, which contains the surface area of the island in square kilometers. + +We can plot the data: + +```{r} +islands %>% + ggplot(aes(x = area, y = species)) + + geom_point() +``` +::: + +It looks as though `area` may have an effect on the number of species that we observe on each island. We note that the response variable is count data and so we try to construct a Poisson regression. + +## Model building + +To create a poisson regression we do the following: + +::: panel-tabset +## tidyverse + +Again, similar to what we've done for the logistic models, we will use the `parsnip` package from the `tidymodels` library. Yes, the workflow still seems a bit faffy, but it provides a common syntax for a whole range of modelling libraries. This means that the syntax will stay the same as you do different kinds of model comparisons. + +If you haven't loaded `tidymodels` yet, now is a really good time. We also need to load `poissonreg`, which adds extra functionality to `parsnip`. + +```{r, eval=FALSE} +# install.packages("tidymodels") +library(tidymodels) +# install.packages("poissonreg") +library(poissonreg) +``` + +Remember that the workflow in `parsnip` is a bit different to what we're used to so far. Using `parsnip` we approach things in a more systematic manner. We specify a model in three steps: + +1. **Specify the type of model based on its mathematical structure** (e.g., linear regression, logistic regression, poisson regression etc). + +For example: + +- `linear_reg()` for linear regression - `logistic_reg()` for logistic regression - `poisson_reg()` for poisson regression (we'll be using that here) + +2. **When required, declare the mode of the model.** The mode reflects the type of prediction outcome. For numeric outcomes, the mode is *regression*; for qualitative outcomes, it is *classification.* If a model can only produce one type of outcome, as is the case for poisson regression, the mode is already set (in this case to `mode = "regression"`). + +3. **Specify the engine for fitting the model.** This usually is the software package or library that should be used. + +For example, + +- `"lm"` for linear models - `"glm"` for generalised linear models - `"stan"` for Bayesian inference + +You can find out which engines can be used with the `show_engines()` function. The command `show_engines("poisson_reg")` will give you the available engines for the `poisson_reg()` function. + +So, we can create the model as follows: + +```{r} +isl_mod <- poisson_reg() %>% + set_mode("regression") %>% + set_engine("glm") +``` + +Note that we are not actually specifying any of the variables just yet! All we've done is tell R what kind of model we're planning to use. If we want to see how `parsnip` converts this code to the package syntax, we can check this with `translate()`: + +```{r} +isl_mod %>% translate() +``` + +This shows that we have a poisson regression model, where the outcome is going to be a regression. 
+ +The model fit template tells us that we'll be using the `glm()` function from the `stats` package (`stats::glm`). This function has several arguments: + +1. a `formula`, which we'll specify later +2. `data`, which we'll provide in a bit +3. `weights`, if we want to add prior weights to our variable - we don't have to concern ourselves with this - and +4. a `family` argument, which is already set to `poisson` + +::: callout-important +## The `family` argument + +The `family` argument gives us a description of the error distribution and link function that will be used in the model. For the `islands` data set we are looking at count outcome - which we can model using a *poisson* regression model. + +If we'd want to specify it manually, then we'd use + +`set_engine("glm", family = stats::poisson(link = "log"))` + +which sets the family to poisson, using a log link function. +::: + +Now we've specified what kind of model we're planning to use, we can fit our data to it, using the `fit()` function: + +```{r} +isl_fit <- isl_mod %>% + fit(species ~ area, + data = islands) +``` + +We can look at the output directly, but I prefer to tidy the data up using the `tidy()` function from `broom` package: + +```{r} +isl_fit %>% tidy() +``` + +The output is strikingly similar to the binomial models (who'd have guessed, eh?) and the main numbers to extract from the output are the two numbers in the `estimate` column. +::: + +These are the coefficients of the Poisson model equation and need to be placed in the following formula in order to estimate the expected number of species as a function of island size: + +$$ E(species) = \exp(4.24 + 0.036 \times area) $$ + +Interpreting this requires a bit of thought (not much, but a bit). + +The intercept coefficient ($\beta_0$), 4.24, is related to the number of species we would expect on an island of zero area (this is statistics, not real life. You'd do well to remember that before you worry too much about what that even means). But in order to turn this number into something meaningful we have to exponentiate it. Since $\exp(4.24) \approx 70$, we can say that the baseline number of species the model expects on any island is 70. This isn't actually the interesting bit though. + +The coefficient of `area` ($\beta_1$) is the fun bit. For starters we can see that it is a positive number which does mean that increasing area leads to increasing numbers of species. Good so far - since this matches what we saw when we plotted our data. But what does the value 0.036 actually mean? + +Well, if we exponentiate it too we get $\exp(0.036) \approx 1.04$. This means that for every increase in area of 1 km2 (the original units of the `area` variable) the number of species on the island is multiplied by 1.04. So, an island of area 1 km2 is predicted to have $1.04 \times 70 \approx 72$ species. + +So, in order to interpret Poisson coefficients, you have to exponentiate them. + +## Model predictions + +Now that we can interpret the Poisson coefficients, it would be good to see if using a poisson regression to describe these data is actually a good idea. + +Visualisation is always useful, so in order to get an idea of how our data fits a Poisson regression, we'll plot the Poisson regression curve. Next, we overlay our original data. + +::: panel-tabset +## tidyverse + +First, we create a table that contains data for the curve, starting for an `area` with value 1 to 50, in steps of 1. 
+ +```{r} +model <- tibble(area = seq(1, 50, 1)) +``` + +Next, we feed our model these data: + +```{r} +curve <- isl_fit %>% augment(new_data = model) +``` + +This gives the predicted number of `species` for each given value of `area`. If we have a closer look at these data we can see that, for example, for an `area` with a surface area of 4 km2 the predicted number of species is around 80. Nice. + +```{r} +head(curve) +``` + +Using these data, we can now plot all the *predicted* number of species and overlay our original *measured* data. + +```{r} +ggplot(curve, aes(area, .pred)) + + geom_line(colour = "red") + + geom_point(data = islands, aes(area, species)) +``` +::: + +That looks like a pretty decent fit, really. But of course we want to have a (slightly) less hand-wavy conclusion than that. + +## Goodness-of-fit + +We can use the model's residual deviance to assess how much the predicted values differ from the observed. This gives us an idea of how well-specified the model is. When a model is "true", *i.e.* the model makes pretty accurate predictions, then we expect the residual deviance to be distributed as a $\chi^2$ random variable with degrees of freedom equal to the model's residual degrees of freedom. + +::: panel-tabset +## tidyverse + +We can get these parameters as follows and we'll store them in a new object, so we can extract them in a bit. + +```{r} +isl_fit %>% glance() + +isl_parameters <- isl_fit %>% glance() +``` + +The values we are interested in are in the `deviance` and `df.residual` columns, respectively. + +Next, we use the `pchisq()` function to calculate the correct probability. + +```{r} +pchisq(isl_parameters$deviance, + isl_parameters$df.residual, + lower.tail = FALSE) +``` + +This gives us a value of around 0.60. This suggests that this model is actually a pretty good one (if it wasn't then the value would be close to zero) and that the data are pretty well supported by the model. + +::: callout-important +The `pchisq()` function gives the lower tail probability that $\chi^2 \le x$ by default. We're actually interested in the probability that $\chi^2 \ge x$. These two probabilities must sum to one, so we get the upper tail probability by setting the argument `lower.tail = FALSE`. An alternative way would be to use the default, but do `1 - pchisq()`. +::: + +For Poisson models this has an extra interpretation. This can be used to assess whether we have significant overdispersion in our data. For a Poisson model to be appropriate we need that the variance of the data to be exactly the same as the mean of the data. If there is overdispersion then the data would spread out more for higher predicted values of species than for lower ones. Our visualisation shows that this isn't really happening. The spread is unlikely to be perfectly homogeneous, but we don't want the data to spread out too much. + +The easy way to check this is to look at the ratio of the residual deviance to the residual degrees of freedom (in this case 0.922). For a Poisson model to be valid, this ratio should be about 1. If the ratio is significantly bigger than 1 then we say that we have over-dispersion in the model and we wouldn't be able to trust any of the significance testing using a Poisson regression. +::: + +## Confidence intervals + +We can also assess how reliable our model is by looking at the confidence intervals of the estimated parameters. 
+ +::: panel-tabset +## tidyverse + +We extracted the parameters of the model by using + +```{r} +isl_fit %>% tidy() +``` + +Although we focussed on the `estimate` column, we can see that the associated standard errors for each estimate is also given in the `std.error` column. We can use these values to calculate the 95% confidence intervals. + +We can either do this by hand through multiplying the standard errors by 1.96. We can then subtract from (giving the lower confidence estimate) or add to (giving the higher confidence estimate) the estimated parameter. This gives a pretty decent approximation. + +But then again, life is short, so we can just use the additional argument that is available for the `tidy()` function. You can look at what columns are returned, but I'm selecting the relevant ones here: + +```{r} +isl_fit %>% tidy(conf.int = TRUE, # default is FALSE + conf.level = 0.95) %>% # is the default + select(term, estimate, conf.low, conf.high) +``` + +What we're interested in here is the confidence intervals for the `area` parameter. Before we delve into that, I'm also going to calculate the exponent for these confidence intervals. We can do this using the `exp()` function. + +```{r} +isl_fit %>% tidy(conf.int = TRUE, # default is FALSE + conf.level = 0.95) %>% # is the default + select(term, estimate, conf.low, conf.high) %>% + mutate(conf.low_exp = exp(conf.low), + conf.high_exp = exp(conf.high)) +``` + +These values are a bit familiar, since we've previously determined based on the `area` coefficient that for each increase in square kilometer, the number of species increases by approximately 1.04. + +What these values tell us is that we can be 95% certain that for every increase in square kilometer of island area size, the number of species increases by a factor of somewhere between 1.034 and 1.039. + +::: callout-note +If there was no association between `area` and `species`, then the $\beta_1$ coefficient would be zero. That would mean that the exponent would be ${e}^{\beta_1}=1$. The interval that we calculated for ${e}^{\beta_1}$ lies between 1.034 and 1.039 and therefore does not include 1, so the model with `area` is preferred over the model without `area`. + +Equivalently, the interval for $\beta_1$ (0.033 - 0.038) does not include 0, again showing the significance of `area` as a predictor of `species`. +::: +::: + +## Exercise: Seatbelts + +I'd like you to do the following: + +1. Load the data +2. Visualise the data and create a poisson regression model +3. Plot the regression model on top of the data +4. Assess if the model is a decent predictor for the number of fatalities + +::: {.callout-caution collapse="true"} +## Answer + +### Load the data + +First, we load the data. + +::: panel-tabset +## tidyverse + +```{r} +seatbelts <- read_csv("data/seatbelts.csv") +``` +::: + +### Visualise the data + +The data tracks the number of drivers killed in road traffic accidents, before and after the seatbelt law was introduced. The information on whether the law was in place is encoded in the `law` column as `0` (law not in place) or `1` (law in place). + +There are many more observations when the law was *not* in place, so we need to keep this in mind when we're interpreting the data. + +First we have a look at the data comparing no law vs law: + +::: panel-tabset +## tidyverse + +We have to convert the `law` column to a factor, otherwise R will see it as numerical. 
+ +```{r} +seatbelts %>% + ggplot(aes(as_factor(law), drivers_killed)) + + geom_boxplot() +``` + +The data are recorded by month and year, so we can also display the number of drivers killed by year: + +```{r} +seatbelts %>% + ggplot(aes(year, drivers_killed)) + + geom_point() +``` +::: + +The data look a bit weird. There is quite some variation within years (keeping in mind that the data are aggregated monthly). The data also seems to wave around a bit... with some vague peaks (e.g. 1972 - 1973) and some troughs (e.g. around 1976). + +So my initial thought is that these data are going to be a bit tricky to interpret. But that's OK. + +### Model building + +We're dealing with count data, so we're going to use a poisson regression. + +::: panel-tabset +## tidyverse + +As before, we first define the model type. + +```{r} +stb_mod <- poisson_reg() %>% + set_mode("regression") %>% + set_engine("glm") +``` + +And check that everything is in order. + +```{r} +stb_mod %>% translate() +``` + +Next, we fit our data to the model we just specified: + +```{r} +stb_fit <- stb_mod %>% + fit(drivers_killed ~ year, + data = seatbelts) +``` + +And we can extract the estimated coefficients from these fitted data: + +```{r} +stb_fit %>% tidy() +``` +::: + +### Visualise the model + +To see if the model is actually a decent prediction for our data, we can plot it. + +::: panel-tabset +## tidyverse + +To do this, we create modelled values for our predictor variable `year`. Because we have monthly data, we create a sequence of "years" in 1/12th intervals - one for each month. + +Next, we ask the model to predict the number of drivers killed based on these values. + +Lastly, we can plot those predicted values against the observed values in our data set. + +```{r} +# create the sequence of values that are used +# in predicting number of deaths +model <- tibble(year = seq(1969, 1984, (1/12))) + +# fit these data to the model +stb_curve <- stb_fit %>% augment(new_data = model) + +# plot the predicted values +# and overlay the observed values +ggplot(seatbelts, aes(year, drivers_killed)) + + geom_point() + + geom_point(data = stb_curve, aes(x = year, y = .pred), + colour = "red") +``` +::: + +That does not look like a very good fit, but then again the data look rather messy as well. + +### Goodness-of-fit + +Let's check the goodness-of-fit. + +::: panel-tabset +## tidyverse + +First we store the parameter estimates in an object. Then we use the `pchisq()` function to calculate the probability that the residual deviance is actually distributed as a $\chi^2$ random variable with degrees of freedom equal to the model's residual degrees of freedom. + +Yes, you can read that sentence three times and still wonder what that really means. Suffice to say is that the outcome gives us a measure of how well-specified the model is. + +```{r} +stb_parameter <- stb_fit %>% glance() + +stb_parameter + +pchisq(stb_parameter$deviance, + stb_parameter$df.residual, + lower.tail = FALSE) +``` +::: + +Well, that's a bit of a blow. The probability value is extremely low, suggesting that the model is not very well-specified. Which kind of matches what we already saw in the plot. It might still be better than the null model ("the data can be modelled as the average across all the observations"), but we seem to be missing some parameters here. + +### Confidence intervals + +Similar to the `islands` example, we can also calculate the confidence intervals associated with our estimated parameters. 
+ +::: panel-tabset +## tidyverse + +```{r} +stb_fit %>% tidy(conf.int = TRUE, # default is FALSE + conf.level = 0.95) %>% # is the default + select(term, estimate, conf.low, conf.high) %>% + mutate(conf.low_exp = exp(conf.low), + conf.high_exp = exp(conf.high)) +``` + +We're interested in the confidence interval for the `year` variable. Remember that if there is no association at all between `year` and `drivers_killed` then the parameter $e^{\beta_1} = 1$. + +In our case the interval we calculated for $e^{\beta_1}$ lies between 0.981 and 0.986. This does not include 1, so it seems that the model that takes `year` into account is still preferred over a model that doesn't. +::: +::: + +## Key points + +::: callout-note +- Poisson regression is useful when dealing with count data +::: diff --git a/materials/glm-practical-poisson.qmd b/materials/glm-practical-poisson.qmd index bdf470e..2dc79eb 100644 --- a/materials/glm-practical-poisson.qmd +++ b/materials/glm-practical-poisson.qmd @@ -1,166 +1,118 @@ --- -title: "Poisson regression" +title: "Count data" --- -```{r, echo=FALSE} -# adjust and load as needed +```{r} +#| echo: false +#| message: false +#| results: hide source(file = "setup_files/setup.R") ``` -::: callout-note -## Aims & objectives +```{python} +#| echo: false +#| message: false +import shutup;shutup.please() +exec(open('setup_files/setup.py').read()) +``` + +::: {.callout-tip} +## Learning outcomes + +**Questions** + - How do we analyse count data? + +**Objectives** + - Be able to perform a poisson regression on count data ::: ## Libraries and functions -::: panel-tabset -## tidyverse +::: {.callout-note collapse="true"} +## Click to expand -| Library | Description | -|:-----------------------------------|:-----------------------------------| -| `tidyverse` | A collection of R packages designed for data science | -| `tidymodels` | A collection of packages for modelling and machine learning using tidyverse principles | -| `poissonreg` | Enables the `parsnip` package to fit various types of Poisson regression models | -::: - -## Datasets +::: {.panel-tabset group="language"} +## R -::: panel-tabset -## Islands +### Libraries +### Functions +::: +::: -The example in this section uses the following data set: +The examples in this section use the following data sets: `data/islands.csv` This is a data set comprising 35 observations of two variables (one dependent and one predictor). This records the number of species recorded on different small islands along with the area (km2) of the islands. The variables are `species` and `area`. -## Seatbelts +The second data set is on seat belts. -The `seatbelts` data set is a multiple time-series data set that was commissioned by the Department of Transport in 1984 to measure differences in deaths before and after front seatbelt legislation was introduced on 31st January 1983. It provides monthly total numerical data on a number of incidents including those related to death and injury in Road Traffic Accidents (RTA's). The data set starts in January 1969 and observations run until December 1984. +The `seatbelts` data set is a multiple time-series data set that was commissioned by the Department of Transport in 1984 to measure differences in deaths before and after front seat belt legislation was introduced on 31st January 1983. It provides monthly total numerical data on a number of incidents including those related to death and injury in Road Traffic Accidents (RTA's). 
The data set starts in January 1969 and observations run until December 1984. You can find the file in `data/seatbelts.csv` -::: -## Visualise the data +## Load and visualise the data -A good first step is always to explore your data prior to any further analysis. +First we load the data, then we visualise it. -::: panel-tabset -## tidyverse +::: {.panel-tabset group="language"} +## R -First, we load and inspect the data: +```{r} +#| message: false +#| warning: false +islands <- read_csv("data/islands.csv") +``` -```{r, message=FALSE, warning=FALSE} -islands <- read_csv("data/island.csv") +Let's have a glimpse at the data: +```{r} islands ``` +::: + Looking at the data, we can see that there are two columns: `species`, which contains the number of species recorded on each island and `area`, which contains the surface area of the island in square kilometers. We can plot the data: +::: {.panel-tabset group="language"} +## R + ```{r} -islands %>% - ggplot(aes(x = area, y = species)) + +ggplot(islands, aes(x = area, y = species)) + geom_point() ``` + ::: It looks as though `area` may have an effect on the number of species that we observe on each island. We note that the response variable is count data and so we try to construct a Poisson regression. -## Model building - -To create a poisson regression we do the following: - -::: panel-tabset -## tidyverse - -Again, similar to what we've done for the logistic models, we will use the `parsnip` package from the `tidymodels` library. Yes, the workflow still seems a bit faffy, but it provides a common syntax for a whole range of modelling libraries. This means that the syntax will stay the same as you do different kind of model comparisons. - -If you haven't loaded `tidymodels` yet, now is a really good time. We also need to load `poissonreg`, which adds extra functionality to `parsnip`. - -```{r, eval=FALSE} -# install.packages("tidymodels") -library(tidymodels) -# install.packages("poissonreg") -library(poissonreg) -``` - -Remember that the workflow in `parsnip` is a bit different to what we're used to so far. Using `parsnip` we approach things in a more systematic manner. We specify a model in three steps: +## Constructing a model -1. **Specify the type of model based on its mathematical structure** (e.g., linear regression, logistic regression, poisson regression etc). - -For example: - -- `linear_reg()` for linear regression -- `logistic_reg()` for logistic regression -- `poisson_reg()` for poisson regression (we'll be using that here) - -2. **When required, declare the mode of the model.** The mode reflects the type of prediction outcome. For numeric outcomes, the mode is *regression*; for qualitative outcomes, it is *classification.* If a model can only create one type of model, such as poisson regression, the mode is already set to, in this case, `mode = "regression"`. - -3. **Specify the engine for fitting the model.** This usually is the software package or library that should be used. - -For example, - -- `"lm"` for linear models -- `"glm"` for generalised linear models -- `"stan"` for Bayesian inference - -You can find out which engines can be used with the `show_engines()` function. The command `show_engines("poisson_reg")` will give you the available engines for the `poisson_reg()` function. 
- -So, we can create the model as follows: +## Constructing a model -```{r} -isl_mod <- poisson_reg() %>% - set_mode("regression") %>% - set_engine("glm") -``` +::: {.panel-tabset group="language"} +## R -Note that we are not actually specifying any of the variables just yet! All we've done is tell R what kind of model we're planning to use. If we want to see how `parsnip` converts this code to the package syntax, we can check this with `translate()`: +```{r} +glm_isl <- glm(species ~ area, + data = islands, family = "poisson") +``` -```{r} -isl_mod %>% translate() +and we look at the model summary: + +```{r} +summary(glm_isl) ``` -This shows that we have a poisson regression model, where the outcome is going to be a regression. The model fit template tells us that we'll be using the `glm()` function from the `stats` package, which can take a `formula`, `data`, `weights` and `family` argument. The `family` argument is already set to poisson. - -The model fit template tells us that we'll be using the `glm()` function from the `stats` package (`stats::glm`). This function has several arguments: - -1. a `formula`, which we'll specify later -2. `data`, which we'll provide in a bit -3. `weights`, if we want to add prior weights to our variable - we don't have to concern ourselves with this - and -4. a `family` argument, which is already set to `poisson` - -::: callout-important -## The `family` argument - -The `family` argument gives us a description of the error distribution and link function that will be used in the model. For the `islands` data set we are looking at count outcome - which we can model using a *poisson* regression model. - -If we'd want to specify it manually, then we'd use +The output is strikingly similar to the logistic regression models (who’d have guessed, eh?) and the main numbers to extract from the output are the two numbers underneath `Estimate` in the `Coefficients` table: -`set_engine("glm", family = stats::poisson(link = "log"))` - -which sets the family to poisson, using a log link function. -::: - -Now we've specified what kind of model we're planning to use, we can fit our data to it, using the `fit()` function: - -```{r} -isl_fit <- isl_mod %>% - fit(species ~ area, - data = islands) +``` -We can look at the output directly, but I prefer to tidy the data up using the `tidy()` function from `broom` package: - -```{r} -isl_fit %>% tidy() +(Intercept) 4.241129 +area 0.035613 ``` -The output is strikingly similar to the binomial models (who'd have guessed, eh?) and the main numbers to extract from the output are the two numbers in the `estimate` column. ::: These are the coefficients of the Poisson model equation and need to be placed in the following formula in order to estimate the expected number of species as a function of island size: @@ -168,136 +120,110 @@ These are the coefficients of the Poisson model equation and need to be placed i $$ E(species) = \exp(4.24 + 0.036 \times area) $$ Interpreting this requires a bit of thought (not much, but a bit). +The intercept coefficient, `4.24`, is related to the number of species we would expect on an island of zero area (this is statistics, not real life. You’d do well to remember that before you worry too much about what that even means). But in order to turn this number into something meaningful we have to exponentiate it. Since `exp(4.24) ≈ 70`, we can say that the baseline number of species the model expects on any island is 70. This isn’t actually the interesting bit though. 
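+
+As a quick sketch (an addition, not in the original text), the back-transformation of the intercept can be done straight from the fitted model, using `coef()` to pull out the coefficients:
+
+```{r}
+# baseline expected species count: exponentiate the intercept
+exp(coef(glm_isl))[1]
+```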
-The intercept coefficient ($\beta_0$), 4.24, is related to the number of species we would expect on an island of zero area (this is statistics, not real life. You'd do well to remember that before you worry too much about what that even means). But in order to turn this number into something meaningful we have to exponentiate it. Since $\exp(4.24) \approx 70$, we can say that the baseline number of species the model expects on any island is 70. This isn't actually the interesting bit though. - -The coefficient of `area` ($\beta_1$) is the fun bit. For starters we can see that it is a positive number which does mean that increasing area leads to increasing numbers of species. Good so far - since this matches what we saw when we plotted our data. But what does the value 0.036 actually mean? +The coefficient of `area` is the fun bit. For starters we can see that it is a positive number which does mean that increasing `area` leads to increasing numbers of `species`. Good so far. -Well, if we exponentiate it too we get $\exp(0.036) \approx 1.04$. This means that for every increase in area of 1 km2 (the original units of the `area` variable) the number of species on the island is multiplied by 1.04. So, an island of area 1 km2 is predicted to have $1.04 \times 70 \approx 72$ species. +But what does the value `0.036` actually mean? Well, if we exponentiate it as well, we get `exp(0.036) ≈ 1.04`. This means that for every increase in `area` of 1 km^2 (the original units of the area variable), the number of species on the island is multiplied by `1.04`. So, an island of area 1 km^2 will have `1.04 x 70 ≈ 72` species. So, in order to interpret Poisson coefficients, you have to exponentiate them. -## Model predictions - -Now that we can interpret the Poisson coefficients, it would be good to see if using a poisson regression to describe these data is actually a good idea. - -Visualisation is always useful, so in order to get an idea of how our data fits a Poisson regression, we'll plot the Poisson regression curve. Next, we overlay our original data. - -::: panel-tabset -## tidyverse +## Plotting the Poisson regression -First, we create a table that contains data for the curve, starting for an `area` with value 1 to 50, in steps of 1. +::: {.panel-tabset group="language"} +## R ```{r} -model <- tibble(area = seq(1, 50, 1)) +ggplot(islands, aes(area, species)) + + geom_point() + + geom_smooth(method = "glm", se = FALSE, fullrange = TRUE, + method.args = list(family = poisson)) + + xlim(10,50) ``` -Next, we feed our model these data: +::: -```{r} -curve <- isl_fit %>% augment(new_data = model) -``` +## Assumptions -This gives the predicted number of `species` for each given value of `area`. If we have a closer look at these data we can see that, for example, for an `area` with a surface area of 4 km2 the predicted number of species is around 80. Nice. +As we mentioned earlier, Poisson regressions require that the variance of the data at any point is the same as the mean of the data at that point. We checked that earlier by looking at the residual deviance values. -```{r} -head(curve) -``` +We can look for influential points using the Cook’s distance plot: -Using these data, we can now plot all the *predicted* number of species and overlay our original *measured* data. 
+::: {.panel-tabset group="language"} +## R ```{r} -ggplot(curve, aes(area, .pred)) + - geom_line(colour = "red") + - geom_point(data = islands, aes(area, species)) +plot(glm_isl , which=4) ``` -::: -That looks like a pretty decent fit, really. But of course we want to have a (slightly) less hand-wavy conclusion than that. - -## Goodness-of-fit +::: -We can use the model's residual deviance to assess how much the predicted values differ from the observed. This gives us an idea of how well-specified the model is. When a model is "true", *i.e.* the model makes pretty accurate predictions, then we expect the residual deviance to be distributed as a $\chi^2$ random variable with degrees of freedom equal to the model's residual degrees of freedom. +None of our points have particularly large Cook’s distances and so life is rosy. -::: panel-tabset -## tidyverse +## Assessing significance -We can get these parameters as follows and we'll store them in a new object, so we can extract them in a bit. +We can ask the same three questions we asked before. -```{r} -isl_fit %>% glance() +1. Is the model well-specified? +2. Is the overall model better than the null model? +3. Are any of the individual predictors significant? -isl_parameters <- isl_fit %>% glance() -``` +Again, in this case, questions 2 and 3 are effectively asking the same thing because we still only have a single predictor variable. -The values we are interested in are in the `deviance` and `df.residual` columns, respectively. +To assess if the model is any good we’ll again use the residual deviance and the residual degrees of freedom. -Next, we use the `pchisq()` function to calculate the correct probability. +::: {.panel-tabset group="language"} +## R ```{r} -pchisq(isl_parameters$deviance, - isl_parameters$df.residual, - lower.tail = FALSE) +1 - pchisq(30.437, 33) ``` -This gives us a value of around 0.60. This suggests that this model is actually a pretty good one (if it wasn't then the value would be close to zero) and that the data are pretty well supported by the model. - -::: callout-important -The `pchisq()` function gives the lower tail probability that $\chi^2 \le x$ by default. We're actually interested in the probability that $\chi^2 \ge x$. These two probabilities must sum to one, so we get the upper tail probability by setting the argument `lower.tail = FALSE`. An alternative way would be to use the default, but do `1 - pchisq()`. ::: -For Poisson models this has an extra interpretation. This can be used to assess whether we have significant overdispersion in our data. For a Poisson model to be appropriate we need that the variance of the data to be exactly the same as the mean of the data. If there is overdispersion then the data would spread out more for higher predicted values of species than for lower ones. Our visualisation shows that this isn't really happening. The spread is unlikely to be perfectly homogeneous, but we don't want the data to spread out too much. +This gives a probability of `0.60`. This suggests that this model is actually a good one and that the data are pretty well supported by the model. For Poisson models this has an extra interpretation. This can be used to assess whether we have significant over-dispersion in our data. -The easy way to check this is to look at the ratio of the residual deviance to the residual degrees of freedom (in this case 0.922). For a Poisson model to be valid, this ratio should be about 1. 
If the ratio is significantly bigger than 1 then we say that we have over-dispersion in the model and we wouldn't be able to trust any of the significance testing using a Poisson regression. -::: +For a Poisson model to be appropriate we need the variance of the data to be exactly the same as the mean of the data. Visually, this would correspond to the data spreading out more for higher predicted values of `species`. However, we don’t want the data to spread out too much. If that happens then a Poisson model wouldn’t be appropriate. -## Confidence intervals +The easy way to check this is to look at the ratio of the residual deviance to the residual degrees of freedom (in this case `0.922`). For a Poisson model to be valid, this ratio should be about 1. If the ratio is significantly bigger than 1 then we say that we have over-dispersion in the model and we wouldn’t be able to trust any of the significance testing that we are about to do using a Poisson regression. -We can also assess how reliable our model is by looking at the confidence intervals of the estimated parameters. +Thankfully the probability we have just created (`0.60`) is exactly the right one we need to look at to assess whether we have significant over-dispersion in our model. -::: panel-tabset -## tidyverse +Secondly, to assess whether the overall model, with all of the terms, is better than the null model we’ll look at the difference in deviances and the difference in degrees of freedom: -We extracted the parameters of the model by using +::: {.panel-tabset group="language"} +## R ```{r} -isl_fit %>% tidy() +1 - pchisq(856.899 - 30.437, 34 - 33) ``` -Although we focussed on the `estimate` column, we can see that the associated standard errors for each estimate is also given in the `std.error` column. We can use these values to calculate the 95% confidence intervals. +::: + +This gives a reported p-value of 0, which is pretty damn small. So, yes, this model is better than nothing at all and `species` does appear to change with our predictor, `area`. -We can either do this by hand through multiplying the standard errors by 1.96. We can then subtract from (giving the lower confidence estimate) or add to (giving the higher confidence estimate) the estimated parameter. This gives a pretty decent approximation. +Finally, we’ll construct an analysis of deviance table to look at the individual terms: -But then again, life is short, so we can just use the additional argument that is available for the `tidy()` function. You can look at what columns are returned, but I'm selecting the relevant ones here: +::: {.panel-tabset group="language"} +## R ```{r} -isl_fit %>% tidy(conf.int = TRUE, # default is FALSE - conf.level = 0.95) %>% # is the default - select(term, estimate, conf.low, conf.high) +anova(glm_isl, test = "Chisq") ``` -What we're interested in here is the confidence intervals for the `area` parameter. Before we delve into that, I'm also going to calculate the exponent for these confidence intervals. We can do this using the `exp()` function. +The p-value in this table is just as small as we’d expect given our previous result (`<2.2e-16` is pretty close to 0), and we have the nice consistent result that `area` definitely has an effect on `species`. 
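+
+As a quick add-on (not part of the original walkthrough), both of these checks can be done without copying numbers from the summary, because `glm()` stores them in the fitted object:
+
+```{r}
+# over-dispersion check: residual deviance / residual degrees of freedom
+# (this should be roughly 1 for a valid Poisson model; here it is ~0.922)
+glm_isl$deviance / glm_isl$df.residual
+```
+
+And once we're happy with the model, we can query it directly. The `predict()` function with `type = "response"` applies the inverse of the log link for us; the 25 km^2 island below is just a made-up example value:
+
+```{r}
+# expected number of species for a hypothetical island of 25 km^2
+predict(glm_isl, newdata = data.frame(area = 25), type = "response")
+```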
-```{r} -isl_fit %>% tidy(conf.int = TRUE, # default is FALSE - conf.level = 0.95) %>% # is the default - select(term, estimate, conf.low, conf.high) %>% - mutate(conf.low_exp = exp(conf.low), - conf.high_exp = exp(conf.high)) -``` +::: -These values are a bit familiar, since we've previously determined based on the `area` coefficient that for each increase in square kilometer, the number of species increases by approximately 1.04. +## Exercises -What these values tell us is that we can be 95% certain that for every increase in square kilometer of island area size, the number of species increases by a factor of somewhere between 1.034 and 1.039. +### Seat belts {#sec-exr_seatbelts} -::: callout-note -If there was no association between `area` and `species`, then the $\beta_1$ coefficient would be zero. That would mean that the exponent would be ${e}^{\beta_1}=1$. The interval that we calculated for ${e}^{\beta_1}$ lies between 1.034 and 1.039 and therefore does not include 1, so the model with `area` is preferred over the model without `area`. +:::{.callout-exercise} -Equivalently, the interval for $\beta_1$ (0.033 - 0.038) does not include 0, again showing the significance of `area` as a predictor of `species`. -::: -::: +{{< level 2 >}} -## Exercise: Seatbelts +For this exercise we'll be using the data from `data/seatbelts.csv`. I'd like you to do the following: @@ -306,31 +232,31 @@ I'd like you to do the following: 3. Plot the regression model on top of the data 4. Assess if the model is a decent predictor for the number of fatalities -::: {.callout-caution collapse="true"} -## Answer +::: {.callout-answer collapse="true"} -### Load the data +#### Load and visualise the data -First, we load the data. +First we load the data, then we visualise it. -::: panel-tabset -## tidyverse +::: {.panel-tabset group="language"} +## R ```{r} +#| message: false +#| warning: false seatbelts <- read_csv("data/seatbelts.csv") ``` -::: -### Visualise the data +::: -The data tracks the number of drivers killed in road traffic accidents, before and after the seatbelt law was introduced. The information on whether the law was in place is encoded in the `law` column as `0` (law not in place) or `1` (law in place). +The data tracks the number of drivers killed in road traffic accidents, before and after the seat belt law was introduced. The information on whether the law was in place is encoded in the `law` column as `0` (law not in place) or `1` (law in place). There are many more observations when the law was *not* in place, so we need to keep this in mind when we're interpreting the data. First we have a look at the data comparing no law vs law: -::: panel-tabset -## tidyverse +::: {.panel-tabset group="language"} +## R We have to convert the `law` column to a factor, otherwise R will see it as numerical. @@ -347,127 +273,92 @@ seatbelts %>% ggplot(aes(year, drivers_killed)) + geom_point() ``` + ::: The data look a bit weird. There is quite some variation within years (keeping in mind that the data are aggregated monthly). The data also seems to wave around a bit... with some vague peaks (e.g. 1972 - 1973) and some troughs (e.g. around 1976). So my initial thought is that these data are going to be a bit tricky to interpret. But that's OK. -### Model building +#### Constructing a model -We're dealing with count data, so we're going to use a poisson regression. - -::: panel-tabset -## tidyverse - -As before, we first define the model type. 
+::: {.panel-tabset group="language"} +## R ```{r} -stb_mod <- poisson_reg() %>% - set_mode("regression") %>% - set_engine("glm") +glm_stb <- glm(drivers_killed ~ year, + data = seatbelts, family = "poisson") ``` -And check that everything is in order. +and we look at the model summary: ```{r} -stb_mod %>% translate() +summary(glm_stb) ``` -Next, we fit our data to the model we just specified: - -```{r} -stb_fit <- stb_mod %>% - fit(drivers_killed ~ year, - data = seatbelts) ``` - -And we can extract the estimated coefficients from these fitted data: - -```{r} -stb_fit %>% tidy() +(Intercept) 37.168958 +year -0.016373 ``` ::: -### Visualise the model - -To see if the model is actually a decent prediction for our data, we can plot it. +These are the coefficients of the Poisson model equation and need to be placed in the following formula in order to estimate the expected number of drivers killed as a function of year: -::: panel-tabset -## tidyverse +$$ E(drivers\_killed) = \exp(37.17 - 0.0164 \times year) $$ -To do this, we create modelled values for our predictor variable `year`. Because we have monthly data, we create a sequence of "years" in 1/12th intervals - one for each month. +#### Assessing significance -Next, we ask the model to predict the number of drivers killed based on these values. +Is the model well-specified? -Lastly, we can plot those predicted values against the observed values in our data set. +::: {.panel-tabset group="language"} +## R ```{r} -# create the sequence of values that are used -# in predicting number of deaths -model <- tibble(year = seq(1969, 1984, (1/12))) - -# fit these data to the model -stb_curve <- stb_fit %>% augment(new_data = model) - -# plot the predicted values -# and overlay the observed values -ggplot(seatbelts, aes(year, drivers_killed)) + - geom_point() + - geom_point(data = stb_curve, aes(x = year, y = .pred), - colour = "red") +1 - pchisq(850.41, 190) ``` +::: + +This gives a probability of 0, so the model is not well-specified: the residual deviance (850.41) is much bigger than the residual degrees of freedom (190), which points at over-dispersion. -That does not look like a very good fit, but then again the data look rather messy as well. +How about the overall fit? -### Goodness-of-fit +::: {.panel-tabset group="language"} +## R -Let's check the goodness-of-fit. +```{r} +1 - pchisq(984.50 - 850.41, 191 - 190) ``` -::: panel-tabset -## tidyverse +::: + +This is also reported as 0, so the model is at least significantly better than the null model. -First we store the parameter estimates in an object. Then we use the `pchisq()` function to calculate the probability that the residual deviance is actually distributed as a $\chi^2$ random variable with degrees of freedom equal to the model's residual degrees of freedom. +#### Plotting the regression -Yes, you can read that sentence three times and still wonder what that really means. Suffice to say is that the outcome gives us a measure of how well-specified the model is. +::: {.panel-tabset group="language"} +## R ```{r} -stb_parameter <- stb_fit %>% glance() - -stb_parameter - -pchisq(stb_parameter$deviance, - stb_parameter$df.residual, - lower.tail = FALSE) -``` +#| message: false +#| warning: false +ggplot(seatbelts, aes(year, drivers_killed)) + + geom_point() + + geom_smooth(method = "glm", se = FALSE, fullrange = TRUE, + method.args = list(family = poisson)) + + xlim(1970,1985) ``` -::: -Well, that's a bit of a blow. The probability value is extremely low, suggesting that the model is not very well-specified. Which kind of matches what we already saw in the plot. It might still be better than the null model ("the data can be modelled as the average across all the observations"), but we seem to be missing some parameters here. 
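+
+As a sketch (this check is not in the original answer), we can compute the dispersion ratio for this model the same way we did for the islands data:
+
+```{r}
+# residual deviance / residual degrees of freedom; values much larger
+# than 1 indicate over-dispersion (here 850.41 / 190 is roughly 4.5)
+glm_stb$deviance / glm_stb$df.residual
+```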
- -### Confidence intervals - -Similar to the `islands` example, we can also calculate the confidence intervals associated with our estimated parameters. +::: -::: panel-tabset -## tidyverse ```{r} -stb_fit %>% tidy(conf.int = TRUE, # default is FALSE - conf.level = 0.95) %>% # is the default - select(term, estimate, conf.low, conf.high) %>% - mutate(conf.low_exp = exp(conf.low), - conf.high_exp = exp(conf.high)) ``` +#### Conclusions -We're interested in the confidence interval for the `year` variable. Remember that if there is no association at all between `year` and `drivers_killed` then the parameter $e^{\beta_1} = 1$. +The model is significantly better than the null model, but it is not well-specified: the residual deviance (850.41) is much larger than the residual degrees of freedom (190), so the data are over-dispersed. We should therefore be cautious about using it to predict the number of fatalities; it looks like we are missing some important predictors (the `law` column, for instance, which we haven't used at all). -In our case the interval we calculated for $e^{\beta_1}$ lies between 0.981 and 0.986. This does not include 1, so it seems that the model that takes `year` into account is still preferred over a model that doesn't. ::: ::: -## Key points +## Summary + +::: {.callout-tip} +#### Key points -::: callout-note - Poisson regression is useful when dealing with count data ::: diff --git a/materials/images/dgplots/2024_01_22-11-16-01_AM_dgplots.png b/materials/images/dgplots/2024_01_22-11-16-01_AM_dgplots.png new file mode 100644 index 0000000..17c3aa5 Binary files /dev/null and b/materials/images/dgplots/2024_01_22-11-16-01_AM_dgplots.png differ