poisson

cambiotraining · Jan 22, 2024 · a99589d
1 parent 34f8d31
commit a99589d

Showing 28 changed files with 1,068 additions and 348 deletions.
diff --git a/_freeze/materials/glm-practical-logistic-binary/execute-results/html.json b/_freeze/materials/glm-practical-logistic-binary/execute-results/html.json
diff --git a/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-30-1.png b/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-30-1.png
diff --git a/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-36-1.png b/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-36-1.png
diff --git a/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-37-1.png b/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-37-1.png
diff --git a/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-38-3.png b/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-38-3.png
diff --git a/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-39-1.png b/_freeze/materials/glm-practical-logistic-binary/figure-html/unnamed-chunk-39-1.png
diff --git a/_freeze/materials/glm-practical-poisson/execute-results/html.json b/_freeze/materials/glm-practical-poisson/execute-results/html.json
@@ -0,0 +1,17 @@
+{
+  "hash": "e28c115b310dab335e2846c5ee4d47b0",
+  "result": {
+    "engine": "knitr",
+    "markdown": "---\ntitle: \"Count data\"\n---\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n\n\n::: {.callout-tip}\n## Learning outcomes\n\n**Questions**\n\n-   How do we analyse count data?\n\n**Objectives**\n\n-   Be able to perform a poisson regression on count data\n:::\n\n## Libraries and functions\n\n::: {.callout-note collapse=\"true\"}\n## Click to expand\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n### Libraries\n### Functions\n:::\n:::\n\nThe examples in this section use the following data sets:\n\n`data/islands.csv`\n\nThis is a data set comprising 35 observations of two variables (one dependent and one predictor). This records the number of species recorded on different small islands along with the area (km<sup>2</sup>) of the islands. The variables are `species` and `area`.\n\nThe second data set is on seat belts.\n\nThe `seatbelts` data set is a multiple time-series data set that was commissioned by the Department of Transport in 1984 to measure differences in deaths before and after front seat belt legislation was introduced on 31st January 1983. It provides monthly total numerical data on a number of incidents including those related to death and injury in Road Traffic Accidents (RTA's). 
The data set starts in January 1969 and observations run until December 1984.\n\nYou can find the file in `data/seatbelts.csv`\n\n## Load and visualise the data\n\nFirst we load the data, then we visualise it.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nislands <- read_csv(\"data/islands.csv\")\n```\n:::\n\n\nLet's have a glimpse at the data:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nislands\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 35 × 2\n   species  area\n     <dbl> <dbl>\n 1     114  12.1\n 2     130  13.4\n 3     113  13.7\n 4     109  14.5\n 5     118  16.8\n 6     136  19.0\n 7     149  19.6\n 8     162  20.6\n 9     145  20.9\n10     148  21.0\n# ℹ 25 more rows\n```\n\n\n:::\n:::\n\n\n:::\n\nLooking at the data, we can see that there are two columns: `species`, which contains the number of species recorded on each island and `area`, which contains the surface area of the island in square kilometers.\n\nWe can plot the data:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(islands, aes(x = area, y = species)) +\n  geom_point()\n```\n\n::: {.cell-output-display}\n![](glm-practical-poisson_files/figure-html/unnamed-chunk-5-1.png){width=672}\n:::\n:::\n\n\n:::\n\nIt looks as though `area` may have an effect on the number of species that we observe on each island. We note that the response variable is count data and so we try to construct a Poisson regression.\n\n## Constructing a model\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nglm_isl <- glm(species ~ area,\n               data = islands, family = \"poisson\")\n```\n:::\n\n\nand we look at the model summary:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(glm_isl)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nglm(formula = species ~ area, family = \"poisson\", data = islands)\n\nCoefficients:\n            Estimate Std. 
Error z value Pr(>|z|)    \n(Intercept) 4.241129   0.041322  102.64   <2e-16 ***\narea        0.035613   0.001247   28.55   <2e-16 ***\n---\nSignif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for poisson family taken to be 1)\n\n    Null deviance: 856.899  on 34  degrees of freedom\nResidual deviance:  30.437  on 33  degrees of freedom\nAIC: 282.66\n\nNumber of Fisher Scoring iterations: 3\n```\n\n\n:::\n:::\n\n\nThe output is strikingly similar to the logistic regression models (who’d have guessed, eh?) and the main numbers to extract from the output are the two numbers underneath `Estimate` in the `Coefficients` table:\n\n```\n(Intercept)    4.241129\narea           0.035613\n```\n\n:::\n\nThese are the coefficients of the Poisson model equation and need to be placed in the following formula in order to estimate the expected number of species as a function of island size:\n\n$$ E(species) = \\exp(4.24 + 0.036 \\times area) $$\n\nInterpreting this requires a bit of thought (not much, but a bit).\nThe intercept coefficient, `4.24`, is related to the number of species we would expect on an island of zero area (this is statistics, not real life. You’d do well to remember that before you worry too much about what that even means). But in order to turn this number into something meaningful we have to exponentiate it. Since `exp(4.24) ≈ 70`, we can say that the baseline number of species the model expects on an island of zero area is 70. This isn’t actually the interesting bit though.\n\nThe coefficient of `area` is the fun bit. For starters we can see that it is a positive number, which means that increasing `area` leads to increasing numbers of `species`. Good so far.\n\nBut what does the value `0.036` actually mean? Well, if we exponentiate it as well, we get `exp(0.036) ≈ 1.04`. This means that for every increase in `area` of 1 km^2 (the original units of the area variable), the number of species on the island is multiplied by `1.04`. 
So, an island of area 1 km^2 will have `exp(4.24 + 0.036) ≈ 72` species.\n\nSo, in order to interpret Poisson coefficients, you have to exponentiate them.\n\n## Plotting the Poisson regression\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(islands, aes(area, species)) +\n  geom_point() +\n  geom_smooth(method = \"glm\", se = FALSE, fullrange = TRUE, \n              method.args = list(family = poisson)) +\n  xlim(10,50)\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\n`geom_smooth()` using formula = 'y ~ x'\n```\n\n\n:::\n\n::: {.cell-output-display}\n![](glm-practical-poisson_files/figure-html/unnamed-chunk-8-1.png){width=672}\n:::\n:::\n\n\n:::\n\n## Assumptions\n\nAs we mentioned earlier, Poisson regressions require that the variance of the data at any point is the same as the mean of the data at that point. We check this in the next section by looking at the residual deviance.\n\nWe can look for influential points using the Cook’s distance plot:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nplot(glm_isl , which=4)\n```\n\n::: {.cell-output-display}\n![](glm-practical-poisson_files/figure-html/unnamed-chunk-9-1.png){width=672}\n:::\n:::\n\n\n:::\n\nNone of our points have particularly large Cook’s distances and so life is rosy.\n\n## Assessing significance\n\nWe can ask the same three questions we asked before.\n\n1. Is the model well-specified?\n2. Is the overall model better than the null model?\n3. 
Are any of the individual predictors significant?\n\nAgain, in this case, questions 2 and 3 are effectively asking the same thing because we still only have a single predictor variable.\n\nTo assess if the model is any good we’ll again use the residual deviance and the residual degrees of freedom.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\n1 - pchisq(30.437, 33)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0.5953482\n```\n\n\n:::\n:::\n\n\n:::\n\nThis gives a probability of `0.60`. This suggests that this model is actually a good one and that the data are pretty well supported by the model. For Poisson models this has an extra interpretation: it can be used to assess whether we have significant over-dispersion in our data.\n\nFor a Poisson model to be appropriate we need the variance of the data to be the same as the mean of the data. Visually, this would correspond to the data spreading out more for higher predicted values of `species`. However, we don’t want the data to spread out too much. If that happens then a Poisson model wouldn’t be appropriate.\n\nThe easy way to check this is to look at the ratio of the residual deviance to the residual degrees of freedom (in this case `0.922`). For a Poisson model to be valid, this ratio should be about 1. 
If the ratio is significantly bigger than 1 then we say that we have over-dispersion in the model and we wouldn’t be able to trust any of the significance testing that we are about to do using a Poisson regression.\n\nThankfully the probability we have just calculated (`0.60`) is exactly the one we need to look at to assess whether we have significant over-dispersion in our model.\n\nSecondly, to assess whether the overall model, with all of the terms, is better than the null model we’ll look at the difference in deviances and the difference in degrees of freedom:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\n1 - pchisq(856.899 - 30.437, 34 - 33)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0\n```\n\n\n:::\n:::\n\n\n:::\n\nThis gives a reported p-value of 0, which is pretty damn small. So, yes, this model is better than nothing at all and `species` does appear to change with our predictor, `area`.\n\nFinally, we’ll construct an analysis of deviance table to look at the individual terms:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nanova(glm_isl , test = \"Chisq\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nAnalysis of Deviance Table\n\nModel: poisson, link: log\n\nResponse: species\n\nTerms added sequentially (first to last)\n\n     Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    \nNULL                    34     856.90              \narea  1   826.46        33      30.44 < 2.2e-16 ***\n---\nSignif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n```\n\n\n:::\n:::\n\n\nThe p-value in this table is just as small as we’d expect given our previous result (`<2.2e-16` is pretty close to 0), and we have the nice consistent result that `area` definitely has an effect on `species`.\n\n:::\n\n## Exercises\n\n### Seat belts {#sec-exr_seatbelts}\n\n:::{.callout-exercise}\n\n\n{{< level 2 >}}\n\n\n\nFor this exercise we'll be using the data from `data/seatbelts.csv`.\n\nI'd like you to do the following:\n\n1.  Load the data\n2.  Visualise the data and create a poisson regression model\n3.  Plot the regression model on top of the data\n4.  Assess if the model is a decent predictor for the number of fatalities\n\n::: {.callout-answer collapse=\"true\"}\n\n#### Load and visualise the data\n\nFirst we load the data, then we visualise it.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseatbelts <- read_csv(\"data/seatbelts.csv\")\n```\n:::\n\n\n:::\n\nThe data tracks the number of drivers killed in road traffic accidents, before and after the seat belt law was introduced. 
The information on whether the law was in place is encoded in the `law` column as `0` (law not in place) or `1` (law in place).\n\nThere are many more observations when the law was *not* in place, so we need to keep this in mind when we're interpreting the data.\n\nFirst we have a look at the data comparing no law vs law:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nWe have to convert the `law` column to a factor, otherwise R will see it as numerical.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseatbelts %>% \n  ggplot(aes(as_factor(law), drivers_killed)) +\n   geom_boxplot()\n```\n\n::: {.cell-output-display}\n![](glm-practical-poisson_files/figure-html/unnamed-chunk-14-1.png){width=672}\n:::\n:::\n\n\nThe data are recorded by month and year, so we can also display the number of drivers killed by year:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseatbelts %>% \n  ggplot(aes(year, drivers_killed)) +\n  geom_point()\n```\n\n::: {.cell-output-display}\n![](glm-practical-poisson_files/figure-html/unnamed-chunk-15-1.png){width=672}\n:::\n:::\n\n\n:::\n\nThe data look a bit weird. There is quite some variation within years (keeping in mind that the data are aggregated monthly). The data also seems to wave around a bit... with some vague peaks (e.g. 1972 - 1973) and some troughs (e.g. around 1976).\n\nSo my initial thought is that these data are going to be a bit tricky to interpret. But that's OK.\n\n#### Constructing a model\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nglm_stb <- glm(drivers_killed ~ year,\n               data = seatbelts, family = \"poisson\")\n```\n:::\n\n\nand we look at the model summary:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(glm_stb)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nglm(formula = drivers_killed ~ year, family = \"poisson\", data = seatbelts)\n\nCoefficients:\n             Estimate Std. 
Error z value Pr(>|z|)    \n(Intercept) 37.168958   2.796636   13.29   <2e-16 ***\nyear        -0.016373   0.001415  -11.57   <2e-16 ***\n---\nSignif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for poisson family taken to be 1)\n\n    Null deviance: 984.50  on 191  degrees of freedom\nResidual deviance: 850.41  on 190  degrees of freedom\nAIC: 2127.2\n\nNumber of Fisher Scoring iterations: 4\n```\n\n\n:::\n:::\n\n\n```\n(Intercept)    37.168958\nyear           -0.016373\n```\n:::\n\nThese are the coefficients of the Poisson model equation and need to be placed in the following formula in order to estimate the expected number of drivers killed as a function of year:\n\n$$ E(drivers\\_killed) = \\exp(37.17 - 0.016 \\times year) $$\n\n#### Assessing significance\n\nIs the model well-specified?\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\n1 - pchisq(850.41, 190)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0\n```\n\n\n:::\n:::\n\n\n:::\n\nHow about the overall fit?\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\n1 - pchisq(984.50 - 850.41, 191 - 190)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0\n```\n\n\n:::\n:::\n\n\n:::\n\n#### Plotting the regression\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(seatbelts, aes(year, drivers_killed)) +\n  geom_point() +\n  geom_smooth(method = \"glm\", se = FALSE, fullrange = TRUE, \n              method.args = list(family = poisson)) +\n  xlim(1970,1985)\n```\n\n::: {.cell-output-display}\n![](glm-practical-poisson_files/figure-html/unnamed-chunk-20-1.png){width=672}\n:::\n:::\n\n\n:::\n\n\n#### Conclusions\n\nThe model we constructed is better than the null model, but it does not appear to be a decent predictor for the number of fatalities. The goodness-of-fit test gives a reported p-value of 0, and the ratio of the residual deviance to the residual degrees of freedom (`850.41 / 190 ≈ 4.5`) is much bigger than 1, which points to over-dispersion.\n\n:::\n:::\n\n## Summary\n\n::: {.callout-tip}\n#### Key points\n\n-   Poisson regression is useful when dealing with count data\n:::\n",
+    "supporting": [
+      "glm-practical-poisson_files"
+    ],
+    "filters": [
+      "rmarkdown/pagebreak.lua"
+    ],
+    "includes": {},
+    "engineDependencies": {},
+    "preserve": {},
+    "postProcess": true
+  }
+}
diff --git a/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-11-1.png b/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-11-1.png
diff --git a/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-12-1.png b/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-12-1.png
diff --git a/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-14-1.png b/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-14-1.png
diff --git a/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-15-1.png b/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-15-1.png
diff --git a/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-20-1.png b/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-20-1.png
diff --git a/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-4-1.png b/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-4-1.png
diff --git a/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-5-1.png b/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-5-1.png
diff --git a/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-6-1.png b/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-6-1.png
diff --git a/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-7-1.png b/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-7-1.png
diff --git a/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-8-1.png b/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-8-1.png
diff --git a/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-9-1.png b/_freeze/materials/glm-practical-poisson/figure-html/unnamed-chunk-9-1.png
diff --git a/_site/index.html b/_site/index.html
@@ -208,6 +208,23 @@
   <a href="./materials/glm-practical-logistic-proportion.html" class="sidebar-item-text sidebar-link">
  <span class="menu-text"><span class="chapter-number">8</span>&nbsp; <span class="chapter-title">Proportional response</span></span></a>
   </div>
+</li>
+      </ul>
+  </li>
+        <li class="sidebar-item sidebar-item-section">
+      <div class="sidebar-item-container"> 
+            <a class="sidebar-item-text sidebar-link text-start" data-bs-toggle="collapse" data-bs-target="#quarto-sidebar-section-4" aria-expanded="true">
+ <span class="menu-text">Count data</span></a>
+          <a class="sidebar-item-toggle text-start" data-bs-toggle="collapse" data-bs-target="#quarto-sidebar-section-4" aria-expanded="true" aria-label="Toggle section">
+            <i class="bi bi-chevron-right ms-2"></i>
+          </a> 
+      </div>
+      <ul id="quarto-sidebar-section-4" class="collapse list-unstyled sidebar-section depth1 show">  
+          <li class="sidebar-item">
+  <div class="sidebar-item-container"> 
+  <a href="./materials/glm-practical-poisson.html" class="sidebar-item-text sidebar-link">
+ <span class="menu-text"><span class="chapter-number">9</span>&nbsp; <span class="chapter-title">Count data</span></span></a>
+  </div>
 </li>
       </ul>
   </li>