proportional

cambiotraining · Jan 19, 2024 · a3eff41 · a3eff41
1 parent 0f83bd9
commit a3eff41
Show file tree

Hide file tree

Showing 23 changed files with 411 additions and 160 deletions.
diff --git a/.quarto/idx/index.qmd.json b/.quarto/idx/index.qmd.json
diff --git a/_freeze/materials/glm-practical-logistic-binary/execute-results/html.json b/_freeze/materials/glm-practical-logistic-binary/execute-results/html.json
diff --git a/_freeze/materials/glm-practical-logistic-proportion/execute-results/html.json b/_freeze/materials/glm-practical-logistic-proportion/execute-results/html.json
@@ -0,0 +1,17 @@
+{
+  "hash": "afed90465b1cade91182440ceb6c776d",
+  "result": {
+    "engine": "knitr",
+    "markdown": "---\ntitle: \"Proportional response\"\n---\n\n::: {.cell}\n\n:::\n\n\n::: {.callout-tip}\n## Learning outcomes\n\n-   How do I analyse proportion responses?\n-   Be able to create a logistic model to test proportion response variables\n-   Be able to plot the data and fitted curve\n-   Assess the significance of the fit\n:::\n\n## Libraries and functions\n\n::: {.callout-note collapse=\"true\"}\n## Click to expand\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n### Libraries\n### Functions\n\n## Python\n\n### Libraries\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# A maths library\nimport math\n# A Python data analysis and manipulation tool\nimport pandas as pd\n\n# Python equivalent of `ggplot2`\nfrom plotnine import *\n\n# Statistical models, conducting tests and statistical data exploration\nimport statsmodels.api as sm\n\n# Convenience interface for specifying models using formula strings and DataFrames\nimport statsmodels.formula.api as smf\n```\n:::\n\n\n### Functions\n:::\n:::\n\nThe example in this section uses the following data set:\n\n`data/challenger.csv`\n\nThese data, obtained from the [faraway package](https://www.rdocumentation.org/packages/faraway/versions/1.0.7), contain information related to the explosion of the USA Space Shuttle Challenger on 28 January, 1986. An investigation after the disaster traced back to certain joints on one of the two solid booster rockets, each containing O-rings that ensured no exhaust gases could escape from the booster.\n\nThe night before the launch was unusually cold, with temperatures below freezing. The final report suggested that the cold snap during the night made the o-rings stiff, and unable to adjust to changes in pressure. As a result, exhaust gases leaked away from the solid booster rockets, causing one of them to break loose and rupture the main fuel tank, leading to the final explosion.\n\nThe question we're trying to answer in this session is: based on the data from the previous flights, would it have been possible to predict the failure of most o-rings on the Challenger flight?\n\n## Load and visualise the data\n\nFirst we load the data, then we visualise it.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchallenger <- read_csv(\"data/challenger.csv\")\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nRows: 23 Columns: 2\n── Column specification ────────────────────────────────────────────────────────\nDelimiter: \",\"\ndbl (2): temp, damage\n\nℹ Use `spec()` to retrieve the full column specification for this data.\nℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\n```\n\n\n:::\n:::\n\n:::\n\nThe data set contains several columns:\n\n1.  `temp`, the launch temperature in degrees Fahrenheit\n2.  `damage`, the number of o-rings that showed erosion\n\nBefore we have a further look at the data, let's calculate the proportion of damaged o-rings (`prop_damaged`) and the total number of o-rings (`total`) and update our data set.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchallenger <-\nchallenger %>%\n  mutate(total = 6,                     # total number of o-rings\n         intact = 6 - damage,           # number of undamaged o-rings\n         prop_damaged = damage / total) # proportion damaged o-rings\n\nchallenger\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 23 × 5\n    temp damage total intact prop_damaged\n   <dbl>  <dbl> <dbl>  <dbl>        <dbl>\n 1    53      5     6      1        0.833\n 2    57      1     6      5        0.167\n 3    58      1     6      5        0.167\n 4    63      1     6      5        0.167\n 5    66      0     6      6        0    \n 6    67      0     6      6        0    \n 7    67      0     6      6        0    \n 8    67      0     6      6        0    \n 9    68      0     6      6        0    \n10    69      0     6      6        0    \n# ℹ 13 more rows\n```\n\n\n:::\n:::\n\n:::\n\nPlotting the proportion of damaged o-rings against the launch temperature shows the following picture:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(challenger, aes(x = temp, y = prop_damaged)) +\n  geom_point()\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-proportion_files/figure-html/unnamed-chunk-5-1.png){width=672}\n:::\n:::\n\n:::\n\nThe point on the left is the data point corresponding to the coldest flight experienced before the disaster, where five damaged o-rings were found. Fortunately, this did not result in a disaster.\n\nHere we'll explore if we could have predicted the failure of both o-rings on the Challenger flight, where the launch temperature was 31 degrees Fahrenheit.\n\n## Creating a suitable model\n\nWe only have 23 data points in total. So we're building a model on not that much data - we should keep this in mind when we draw our conclusions!\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nWe are using a logistic regression for a proportion response in this case, since we're interested in the proportion of o-rings that are damaged.\n\nWe can define this as follows:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nglm_chl <- glm(cbind(damage, intact) ~ temp,\n               family = binomial,\n               data = challenger)\n```\n:::\n\n\nDefining the relationship for proportion responses is a bit annoying, where you have to give the `glm` model a two-column matrix to specify the response variable.\n\nHere, the first column corresponds to the number of damaged o-rings, whereas the second column refers to the number of intact o-rings. We use the `cbind()` function to bind these two together into a matrix.\n\n:::\n\n## Model output\n\nThat's the easy part done! The trickier part is interpreting the output. First of all, we'll get some summary information.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nNext, we can have a closer look at the results:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(glm_chl)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nglm(formula = cbind(damage, intact) ~ temp, family = binomial, \n    data = challenger)\n\nCoefficients:\n            Estimate Std. Error z value Pr(>|z|)    \n(Intercept) 11.66299    3.29626   3.538 0.000403 ***\ntemp        -0.21623    0.05318  -4.066 4.78e-05 ***\n---\nSignif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for binomial family taken to be 1)\n\n    Null deviance: 38.898  on 22  degrees of freedom\nResidual deviance: 16.912  on 21  degrees of freedom\nAIC: 33.675\n\nNumber of Fisher Scoring iterations: 6\n```\n\n\n:::\n:::\n\n\nWe can see that the p-values of the `intercept` and `temp` are significant. We can also use the intercept and `temp` coefficients to construct the logistic equation, which we can use to sketch the logistic curve.\n\n:::\n\n$$E(prop \\ failed\\ orings) = \\frac{\\exp{(11.66 -  0.22 \\times temp)}}{1 + \\exp{(11.66 -  0.22 \\times temp)}}$$\n\nLet's see how well our model would have performed if we would have fed it the data from the ill-fated Challenger launch.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(challenger, aes(temp, prop_damaged)) +\n  geom_point() +\n  geom_smooth(method = \"glm\", se = FALSE, fullrange = TRUE, \n              method.args = list(family = binomial)) +\n  xlim(25,85)\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in eval(family$initialize): non-integer #successes in a binomial glm!\n```\n\n\n:::\n\n::: {.cell-output-display}\n![](glm-practical-logistic-proportion_files/figure-html/unnamed-chunk-8-1.png){width=672}\n:::\n:::\n\n\n::: {.callout-note collapse=true}\n## Generating predicted values\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nAnother way of doing this it to generate a table with data for a range of temperatures, from 25 to 85 degrees Fahrenheit, in steps of 1. We can then use these data to generate the logistic curve, based on the fitted model.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# create a table with sequential numbers ranging from 25 to 85\nmodel <- tibble(temp = seq(25, 85, by = 1)) %>% \n  # add a new column containing the predicted values\n  mutate(.pred = predict(glm_chl, newdata = ., type = \"response\"))\n\nggplot(model, aes(temp, .pred)) +\n  geom_line()\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-proportion_files/figure-html/unnamed-chunk-9-1.png){width=672}\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\n# plot the curve and the original data\nggplot(model, aes(temp, .pred)) +\n  geom_line(colour = \"blue\") +\n  geom_point(data = challenger, aes(temp, prop_damaged)) +\n  # add a vertical line at the disaster launch temperature\n  geom_vline(xintercept = 31, linetype = \"dashed\")\n```\n\n::: {.cell-output-display}\n![](glm-practical-logistic-proportion_files/figure-html/unnamed-chunk-10-1.png){width=672}\n:::\n:::\n\n\nIt seems that there was a high probability of both o-rings failing at that launch temperature. One thing that the graph shows is that there is a lot of uncertainty involved in this model. We can tell, because the fit of the line is very poor at the lower temperature range. There is just very little data to work on, with the data point at 53 F having a large influence on the fit.\n:::\n:::\n:::\n\n## Exercises\n\n### Predicting failure {#sec-exr_failure}\n\n:::{.callout-exercise}\n\n\n{{< level 2 >}}\n\n\n\nThe data point at 53 degrees Fahrenheit is quite influential for the analysis. Remove this data point and repeat the analysis. Is there still a predicted link between launch temperature and o-ring failure?\n\n::: {.callout-answer collapse=\"true\"}\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nFirst, we need to remove the influential data point:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchallenger_new <- challenger %>% filter(temp != 53)\n```\n:::\n\n\nWe can create a new generalised linear model, based on these data:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nglm_chl_new <- glm(cbind(damage, intact) ~ temp,\n               family = binomial,\n               data = challenger_new)\n```\n:::\n\n\nWe can get the model parameters as follows:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(glm_chl_new)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nglm(formula = cbind(damage, intact) ~ temp, family = binomial, \n    data = challenger_new)\n\nCoefficients:\n            Estimate Std. Error z value Pr(>|z|)  \n(Intercept)  5.68223    4.43138   1.282   0.1997  \ntemp        -0.12817    0.06697  -1.914   0.0556 .\n---\nSignif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for binomial family taken to be 1)\n\n    Null deviance: 16.375  on 21  degrees of freedom\nResidual deviance: 12.633  on 20  degrees of freedom\nAIC: 27.572\n\nNumber of Fisher Scoring iterations: 5\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(challenger_new, aes(temp, prop_damaged)) +\n  geom_point() +\n  geom_smooth(method = \"glm\", se = FALSE, fullrange = TRUE, \n              method.args = list(family = binomial)) +\n  xlim(25,85) +\n  # add a vertical line at 53 F temperature\n  geom_vline(xintercept = 53, linetype = \"dashed\")\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in eval(family$initialize): non-integer #successes in a binomial glm!\n```\n\n\n:::\n\n::: {.cell-output-display}\n![](glm-practical-logistic-proportion_files/figure-html/unnamed-chunk-14-1.png){width=672}\n:::\n:::\n\n\nThe prediction proportion of damaged o-rings is markedly less than what was observed.\n\nBefore we can make any firm conclusions, though, we need to check our model:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n1- pchisq(12.633,20)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0.8925695\n```\n\n\n:::\n:::\n\n\nWe get quite a high score (around 0.9) for this, which tells us that our goodness of fit is pretty good – our points are quite close to our curve, overall.\n\nIs the model any better than the null though?\n\n\n::: {.cell}\n\n```{.r .cell-code}\n1 - pchisq(16.375 - 12.633, 1)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0.0530609\n```\n\n\n:::\n\n```{.r .cell-code}\nanova(glm_chl_new, test = 'Chisq')\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nAnalysis of Deviance Table\n\nModel: binomial, link: logit\n\nResponse: cbind(damage, intact)\n\nTerms added sequentially (first to last)\n\n     Df Deviance Resid. Df Resid. Dev Pr(>Chi)  \nNULL                    21     16.375           \ntemp  1   3.7421        20     12.633  0.05306 .\n---\nSignif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n```\n\n\n:::\n:::\n\n\nHowever, the model is not significantly better than the null in this case, with a p-value here of just over 0.05 for both of these tests (they give a similar result since, yet again, we have just the one predictor variable).\n:::\n\nSo, could NASA have predicted what happened? This model is not significantly different from the null, i.e., temperature is not a significant predictor. Note that it’s only marginally non-significant, and we do have a high goodness-of-fit value.\n\nIt is possible that if more data points were available that followed a similar trend, the story might be different). Even if we did use our non-significant model to make a prediction, it doesn’t give us a value anywhere near 5 failures for a temperature of 53 degrees Fahrenheit. So overall, based on the model we’ve fitted with these data, there was no indication that a temperature just a few degrees cooler than previous missions could have been so disastrous for the Challenger.\n:::\n:::\n\n## Summary\n\n::: {.callout-tip}\n#### Key points\n\n-   We can use a logistic model for proportion response variables\n\n:::\n",
+    "supporting": [
+      "glm-practical-logistic-proportion_files"
+    ],
+    "filters": [
+      "rmarkdown/pagebreak.lua"
+    ],
+    "includes": {},
+    "engineDependencies": {},
+    "preserve": {},
+    "postProcess": true
+  }
+}
diff --git a/.../materials/glm-practical-logistic-proportion/figure-html/unnamed-chunk-10-1.png b/.../materials/glm-practical-logistic-proportion/figure-html/unnamed-chunk-10-1.png
diff --git a/.../materials/glm-practical-logistic-proportion/figure-html/unnamed-chunk-14-1.png b/.../materials/glm-practical-logistic-proportion/figure-html/unnamed-chunk-14-1.png
diff --git a/...n_files/figure-html/unnamed-chunk-4-1.png → ...portion/figure-html/unnamed-chunk-4-1.png b/...n_files/figure-html/unnamed-chunk-4-1.png → ...portion/figure-html/unnamed-chunk-4-1.png
diff --git a/...e/materials/glm-practical-logistic-proportion/figure-html/unnamed-chunk-5-1.png b/...e/materials/glm-practical-logistic-proportion/figure-html/unnamed-chunk-5-1.png
diff --git a/...e/materials/glm-practical-logistic-proportion/figure-html/unnamed-chunk-8-1.png b/...e/materials/glm-practical-logistic-proportion/figure-html/unnamed-chunk-8-1.png
diff --git a/...e/materials/glm-practical-logistic-proportion/figure-html/unnamed-chunk-9-1.png b/...e/materials/glm-practical-logistic-proportion/figure-html/unnamed-chunk-9-1.png
diff --git a/_quarto.yml b/_quarto.yml
@@ -16,6 +16,9 @@ project:
 
 format: courseformat-html
 
+filters:
+  - courseformat
+
 bibliography: references.bib
 
 execute:

diff --git a/_site/index.html b/_site/index.html
@@ -2,7 +2,7 @@
 <html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en"><head>
 
 <meta charset="utf-8">
-<meta name="generator" content="quarto-1.4.531">
+<meta name="generator" content="quarto-1.4.546">
 
 <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes">
 
@@ -202,6 +202,12 @@
   <a href="./materials/glm-practical-logistic-binary.html" class="sidebar-item-text sidebar-link">
  <span class="menu-text"><span class="chapter-number">7</span>&nbsp; <span class="chapter-title">Binary response</span></span></a>
   </div>
+</li>
+          <li class="sidebar-item">
+  <div class="sidebar-item-container"> 
+  <a href="./materials/glm-practical-logistic-proportion.html" class="sidebar-item-text sidebar-link">
+ <span class="menu-text"><span class="chapter-number">8</span>&nbsp; <span class="chapter-title">Proportional response</span></span></a>
+  </div>
 </li>
       </ul>
   </li>
@@ -439,7 +445,12 @@ <h2 class="anchored" data-anchor-id="core-aims">Core aims</h2>
       if (window.Quarto?.typesetMath) {
         window.Quarto.typesetMath(note);
       }
-      return note.innerHTML;
+      // TODO in 1.5, we should make sure this works without a callout special case
+      if (note.classList.contains("callout")) {
+        return note.outerHTML;
+      } else {
+        return note.innerHTML;
+      }
     }
   }
   for (var i=0; i<xrefs.length; i++) {