diff --git a/_freeze/materials/cs5_practical_model-comparisons/execute-results/html.json b/_freeze/materials/cs5_practical_model-comparisons/execute-results/html.json index c9845ef..26396ba 100644 --- a/_freeze/materials/cs5_practical_model-comparisons/execute-results/html.json +++ b/_freeze/materials/cs5_practical_model-comparisons/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "7b0974de604b85246fd1ccb22fafb1ea", + "hash": "a2a4e1c753bbaeab658737ef09621db9", "result": { "engine": "knitr", - "markdown": "---\ntitle: \"Model comparisons\"\n---\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n\n\n::: {.callout-tip}\n#### Learning outcomes\n\n**Questions**\n\n- How do I compare linear models?\n- How do decide which one is the \"best\" model?\n\n**Objectives**\n\n- Be able to compare models using the Akaike Information Criterion (AIC)\n- Use AIC in the context of Backwards Stepwise Elimination\n\n:::\n\n## Libraries and functions\n\n::: {.callout-note collapse=\"true\"}\n## Click to expand\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n### Libraries\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# A collection of R packages designed for data science\nlibrary(tidyverse)\n\n# Converts stats functions to a tidyverse-friendly format\nlibrary(rstatix)\n\n# Creates diagnostic plots using ggplot2\nlibrary(ggResidpanel)\n\n# Helper functions for tidying data\nlibrary(broom)\n```\n:::\n\n\n### Functions\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Calculates the Akaike Information Criterion\nstats::AIC()\n\n# Performs a backwards step-wise elimination process\nstats::step()\n```\n:::\n\n\n## Python\n\n### Libraries\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# A fundamental package for scientific computing in Python\nimport numpy as np\n\n# A Python data analysis and manipulation tool\nimport pandas as pd\n\n# Simple yet exhaustive stats functions.\nimport pingouin as pg\n\n# Python equivalent of `ggplot2`\nfrom plotnine import *\n\n# Statistical models, conducting tests and statistical data exploration\nimport statsmodels.api as sm\n\n# Convenience interface for specifying models using formula strings and DataFrames\nimport statsmodels.formula.api as smf\n```\n:::\n\n\n### Functions\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# Reads in a .csv file\npandas.read_csv()\n\n# Creates a model from a formula and data frame\nstatsmodels.formula.api.ols()\n```\n:::\n\n\n:::\n:::\n:::\n\n## Purpose and aim\n\nIn the previous example we used a single data set and fitted five linear models to it depending on which predictor variables we used. Whilst this was fun (seriously, what else would you be doing right now?) it seems that there should be a \"better way\". Well, thankfully there is! In fact there a several methods that can be used to compare different models in order to help identify \"the best\" model. More specifically, we can determine if a full model (which uses all available predictor variables and interactions) is necessary to appropriately describe the dependent variable, or whether we can throw away some of the terms (e.g. an interaction term) because they don’t offer any useful predictive power.\n\nHere we will use the **Akaike Information Criterion** in order to compare different models.\n\n## Data and hypotheses\n\nThis section uses the `data/CS5-ladybird.csv` data set. This data set comprises of 20 observations of three variables (one dependent and two predictor). 
This records the clutch size (`eggs`) in a species of ladybird, alongside two potential predictor variables; the mass of the female (`weight`), and the colour of the male (`male`) which is a categorical variable.\n\n## Backwards Stepwise Elimination\n\nFirst, we load the data and visualise it:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nladybird <- read_csv(\"data/CS5-ladybird.csv\")\n\nhead(ladybird)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 6 × 4\n id male weight eggs\n \n1 1 Wild 10.6 27\n2 2 Wild 10.6 26\n3 3 Wild 12.5 28\n4 4 Wild 11.4 24\n5 5 Wild 14.9 30\n6 6 Wild 10.4 21\n```\n\n\n:::\n:::\n\n\nWe can visualise the data by `male`, so we can see if the eggs clutch size differs a lot between the two groups:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(ladybird,\n aes(x = male, y = eggs)) +\n geom_boxplot() +\n geom_jitter(width = 0.05)\n```\n\n::: {.cell-output-display}\n![](cs5_practical_model-comparisons_files/figure-html/unnamed-chunk-8-1.png){width=672}\n:::\n:::\n\n\nWe can also plot the egg clutch size against the weight, again for each male colour group:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# visualise the data\nggplot(ladybird,\n aes(x = weight, y = eggs,\n colour = male)) +\n geom_point() +\n scale_color_brewer(palette = \"Dark2\")\n```\n\n::: {.cell-output-display}\n![](cs5_practical_model-comparisons_files/figure-html/unnamed-chunk-9-1.png){width=672}\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nladybird_py = pd.read_csv(\"data/CS5-ladybird.csv\")\n```\n:::\n\n\nWe can visualise the data by `male`, so we can see if the eggs clutch size differs a lot between the two groups:\n\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(ladybird_py,\n aes(x = \"male\", y = \"eggs\")) +\n geom_boxplot() +\n geom_jitter(width = 0.05))\n```\n\n::: {.cell-output-display}\n![](cs5_practical_model-comparisons_files/figure-html/unnamed-chunk-11-1.png){width=614}\n:::\n:::\n\n\nWe can also plot the egg clutch size against the weight, again for each male colour group:\n\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(ladybird_py,\n aes(x = \"weight\", y = \"eggs\",\n colour = \"male\")) +\n geom_point() +\n scale_color_brewer(type = \"qual\",\n palette = \"Dark2\"))\n```\n\n::: {.cell-output-display}\n![](cs5_practical_model-comparisons_files/figure-html/unnamed-chunk-12-3.png){width=614}\n:::\n:::\n\n\n:::\n\nWe can see a few things already:\n\n1. There aren't a huge number of data points in each group, so we need to be a bit cautious with drawing any firm conclusions.\n2. There is quite some spread in the egg clutch sizes, with two observations in the `Melanic` group being rather low.\n3. From the box plot, there does not seem to be much difference in the egg clutch size between the two male colour groups.\n4. The scatter plot suggests that egg clutch size seems to increase somewhat linearly as the weight of the female goes up. There does not seem to be much difference between the two male colour groups in this respect.\n\n### Comparing models with AIC (step 1)\n\nWe start with the complete or _full_ model, that takes into account any possible interaction between `weight` and `male`.\n\nNext, we define the reduced model. This is the next, most simple model. 
In this case we're removing the interaction and constructing an additive model.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# define the full model\nlm_full <- lm(eggs ~ weight * male,\n data = ladybird)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\n# define the additive model\nlm_add <- lm(eggs ~ weight + male,\n data = ladybird)\n```\n:::\n\n\nWe then extract the AIC values for each of the models:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nAIC(lm_full)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 100.0421\n```\n\n\n:::\n\n```{.r .cell-code}\nAIC(lm_add)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 99.19573\n```\n\n\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# create the model\nmodel = smf.ols(formula= \"eggs ~ weight * C(male)\", data = ladybird_py)\n# and get the fitted parameters of the model\nlm_full_py = model.fit()\n```\n:::\n\n::: {.cell}\n\n```{.python .cell-code}\n# create the additive linear model\nmodel = smf.ols(formula= \"eggs ~ weight + C(male)\", data = ladybird_py)\n# and get the fitted parameters of the model\nlm_add_py = model.fit()\n```\n:::\n\n\nWe then extract the AIC values for each of the models:\n\n\n::: {.cell}\n\n```{.python .cell-code}\nlm_full_py.aic\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n98.04206256815984\n```\n\n\n:::\n\n```{.python .cell-code}\nlm_add_py.aic\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n97.19572958073036\n```\n\n\n:::\n:::\n\n:::\n\nEach line tells you the AIC score for that model. The full model has 4 parameters (the intercept, the coefficient for the continuous variable `weight`, the coefficient for the categorical variable `male` and a coefficient for the interaction term `weight:male`). The additive model has a lower AIC score with only 3 parameters (since we’ve dropped the interaction term). There are different ways of interpreting AIC scores but the most widely used interpretation says that:\n\n* if the difference between two AIC scores is **greater than 2**, then the model with the **smallest AIC score is more supported** than the model with the higher AIC score\n* if the difference between the two models’ AIC scores is **less than 2** then both models are **equally well supported**\n\nThis choice of language (supported vs significant) is deliberate and there are areas of statistics where AIC scores are used differently from the way we are going to use them here (ask if you want a bit of philosophical ramble from me). However, in this situation we will use the AIC scores to decide whether our reduced model is at least as good as the full model. Here since the difference in AIC scores is less than 2, we can say that dropping the interaction term has left us with a model that is both simpler (fewer terms) and as least as good (AIC score) as the full model. As such our additive model `eggs ~ weight + male` is designated our current _working minimal model_.\n\n### Comparing models with AIC (step 2)\n\nNext, we see which of the remaining terms can be dropped. We will look at the models where we have dropped both `male` and `weight` (i.e. `eggs ~ weight` and `eggs ~ male`) and compare their AIC values with the AIC of our current minimal model (`eggs ~ weight + male`). If the AIC values of at least one of our new reduced models is lower (or at least no more than 2 greater) than the AIC of our current minimal model, then we can drop the relevant term and get ourselves a new minimal model. 
If we find ourselves in a situation where we can drop more than one term we will drop the term that gives us the model with the lowest AIC.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nDrop the variable `weight` and examine the AIC:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# define the model\nlm_male <- lm(eggs ~ male,\n data = ladybird)\n\n# extract the AIC\nAIC(lm_male)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 118.7093\n```\n\n\n:::\n:::\n\n\nDrop the variable `male` and examine the AIC:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# define the model\nlm_weight <- lm(eggs ~ weight,\n data = ladybird)\n\n# extract the AIC\nAIC(lm_weight)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 97.52601\n```\n\n\n:::\n:::\n\n\n## Python\n\nDrop the variable `weight` and examine the AIC:\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# create the model\nmodel = smf.ols(formula= \"eggs ~ C(male)\", data = ladybird_py)\n# and get the fitted parameters of the model\nlm_male_py = model.fit()\n\n# extract the AIC\nlm_male_py.aic\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n116.7092646564482\n```\n\n\n:::\n:::\n\n\nDrop the variable `male` and examine the AIC:\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# create the model\nmodel = smf.ols(formula= \"eggs ~ weight\", data = ladybird_py)\n# and get the fitted parameters of the model\nlm_weight_py = model.fit()\n\n# extract the AIC\nlm_weight_py.aic\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n95.52601286248304\n```\n\n\n:::\n:::\n\n:::\n\nConsidering both outputs together and comparing with the AIC of our current minimal model, we can see that dropping `male` has decreased the AIC further, whereas dropping `weight` has actually increased the AIC and thus worsened the model quality.\n\nHence we can drop `male` and our new minimal model is `eggs ~ weight`.\n\n### Comparing models with AIC (step 3)\n\nOur final comparison is to drop the variable `weight` and compare this simple model with a null model (`eggs ~ 1`), which assumes that the clutch size is constant across all parameters.\n\nDrop the variable `weight` and see if that has an effect:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# define the model\nlm_null <- lm(eggs ~ 1,\n data = ladybird)\n\n# extract the AIC\nAIC(lm_null)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 117.2178\n```\n\n\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# create the model\nmodel = smf.ols(formula= \"eggs ~ 1\", data = ladybird_py)\n# and get the fitted parameters of the model\nlm_null_py = model.fit()\n\n# extract the AIC\nlm_null_py.aic\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n115.21783038624153\n```\n\n\n:::\n:::\n\n\n:::\n\nThe AIC of our null model is quite a bit larger than that of our current minimal model `eggs ~ weight` and so we conclude that `weight` is important. As such our minimal model is `eggs ~ weight`.\n\nSo, in summary, we could conclude that:\n\n> Female size is a useful predictor of clutch size, but male type is not so important.\n\nAt this point we can then continue analysing this minimal model, by checking the diagnostic plots and checking the assumptions. If they all pan out, then we can continue with an ANOVA.\n\n## Notes on Backwards Stepwise Elimination\n\nThis method of finding a minimal model by starting with a full model and removing variables is called backward stepwise elimination. 
Although regularly practised in data analysis, there is increasing criticism of this approach, with calls for it to be avoided entirely.\n\nWhy have we made you work through this procedure then? Given their prevalence in academic papers, it is very useful to be aware of these procedures and to know that there are issues with them. In other situations, using AIC for model comparisons are justified and you will come across them regularly. Additionally, there may be situations where you feel there are good reasons to drop a parameter from your model – using this technique you can justify that this doesn’t affect the model fit. Taken together, using backwards stepwise elimination for model comparison is still a useful technique.\n\n::: {.callout-note}\n## Automatic BSE in R (but not Python)\n\nPerforming backwards stepwise elimination manually can be quite tedious. Thankfully R acknowledges this and there is a single inbuilt function called `step()` that can perform all of the necessary steps for you using AIC.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# define the full model\nlm_full <- lm(eggs ~ weight * male,\n data = ladybird)\n\n# perform backwards stepwise elimination\nstep(lm_full)\n```\n:::\n\n\nThis will perform a full backwards stepwise elimination process and will find the minimal model for you.\n\nYes, I could have told you this earlier, but where’s the fun in that? (it is also useful for you to understand the steps behind the technique I suppose...)\n\nWhen doing this in Python, you are a bit stuck. There does not seem to be an equivalent function. If you want to cobble something together yourself, then use [this link](https://stackoverflow.com/questions/22428625/does-statsmodels-or-another-python-package-offer-an-equivalent-to-rs-step-f) as a starting point.\n\n:::\n\n## Exercises\n\nWe are going to practice the backwards stepwise elimination technique on some of the data sets we analysed previously.\n\nFor each of the following data sets I would like you to:\n\n1. Define the response variable\n2. Define the relevant predictor variables\n3. Define relevant interactions\n4. Perform a backwards stepwise elimination and discover the minimal model using AIC\n\nNB: if an interaction term is significant then any main factor that is part of the interaction term cannot be dropped from the model.\n\nPerform a BSE on the following data sets:\n\n* `data/CS5-trees.csv`\n* `data/CS5-pm2_5.csv`\n\n### BSE: Trees {#sec-exr_bsetrees}\n\n:::{.callout-exercise}\n\n\n{{< level 2 >}}\n\n\n\nBSE on `trees`:\n\n::: {.callout-answer collapse=\"true\"}\n## Answer\n\nLet's start by reading in the data and checking which variables are in the data set.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntrees <- read_csv(\"data/CS5-trees.csv\")\n\nhead(trees)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 6 × 3\n girth height volume\n \n1 8.3 70 10.3\n2 8.6 65 10.3\n3 8.8 63 10.2\n4 10.5 72 16.4\n5 10.7 81 18.8\n6 10.8 83 19.7\n```\n\n\n:::\n:::\n\n\n## Python\n\nFirst, we read in the data (if needed) and have a look at the variables:\n\n\n::: {.cell}\n\n```{.python .cell-code}\ntrees_py = pd.read_csv(\"data/CS5-trees.csv\")\n\ntrees_py.head()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n girth height volume\n0 8.3 70 10.3\n1 8.6 65 10.3\n2 8.8 63 10.2\n3 10.5 72 16.4\n4 10.7 81 18.8\n```\n\n\n:::\n:::\n\n\n:::\n\n1. The response variable is `volume`\n2. The predictor variables are `girth` and `height`\n3. 
The only possible interaction term is `girth:height`\n4. We perform a BSE on the model using the `step()` function\n\nThe full model is `volume ~ girth * height`.\n\nWe perform the BSE as follows:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nDefine the model and use the `step()` function:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# define the full model\nlm_trees <- lm(volume ~ girth * height,\n data = trees)\n\n# perform BSE\nstep(lm_trees)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nStart: AIC=65.49\nvolume ~ girth * height\n\n Df Sum of Sq RSS AIC\n 198.08 65.495\n- girth:height 1 223.84 421.92 86.936\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nlm(formula = volume ~ girth * height, data = trees)\n\nCoefficients:\n (Intercept) girth height girth:height \n 69.3963 -5.8558 -1.2971 0.1347 \n```\n\n\n:::\n:::\n\n\n## Python\n\nWe first define the full model, then the model without the interaction (`model_1`).\n\nWe extract the AICs for both models. Because we do not have an automated way of performing the BSE, we're stringing together a few operations to make the code a bit more concise (getting the `.fit()` of the model and immediately extracting the `aic` value):\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# define the models\nmodel_full = smf.ols(formula= \"volume ~ girth * height\",\n data = trees_py)\n\nmodel_1 = smf.ols(formula= \"volume ~ girth + height\",\n data = trees_py)\n\n# get the AIC of the model\nmodel_full.fit().aic\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n153.46916240903528\n```\n\n\n:::\n\n```{.python .cell-code}\nmodel_1.fit().aic\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n174.9099729872703\n```\n\n\n:::\n:::\n\n\n:::\n\nThis BSE approach only gets as far as the first step (trying to drop the interaction term). We see immediately that dropping the interaction term makes the model worse. This means that the best model is still the full model.\n:::\n:::\n\n### BSE: Air pollution {#sec-exr_bsepollution}\n\n:::{.callout-exercise}\n\n\n{{< level 2 >}}\n\n\n\nPerform a BSE on `pm2_5`. Let's start by reading in the data and checking which variables are in the data set.\n\n::: {.callout-answer collapse=\"true\"}\n## Answer\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\npm2_5 <- read_csv(\"data/CS5-pm2_5.csv\")\n\nhead(pm2_5)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 6 × 6\n avg_temp date location pm2_5 rain_mm wind_m_s\n \n1 4.5 01/01/2019 inner 17.1 2.3 3.87\n2 4.9 01/01/2019 outer 10.8 2.3 5.84\n3 4.3 02/01/2019 inner 14.9 2.3 3.76\n4 4.8 02/01/2019 outer 11.4 2.3 6 \n5 4 03/01/2019 inner 18.5 1.4 2.13\n6 4.5 03/01/2019 outer 15.0 1.4 2.57\n```\n\n\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\npm2_5_py = pd.read_csv(\"data/CS5-pm2_5.csv\")\n\npm2_5_py.head()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n avg_temp date location pm2_5 rain_mm wind_m_s\n0 4.5 01/01/2019 inner 17.126 2.3 3.87\n1 4.9 01/01/2019 outer 10.821 2.3 5.84\n2 4.3 02/01/2019 inner 14.884 2.3 3.76\n3 4.8 02/01/2019 outer 11.416 2.3 6.00\n4 4.0 03/01/2019 inner 18.471 1.4 2.13\n```\n\n\n:::\n:::\n\n:::\n\n1. The response variable is `pm2_5`\n2. The predictor variables are all the variables, apart from `date` and `pm2_5`. It would be strange to try and create a model that relies on each individual measurement!\n3. 
We can add the `wind_m_s:location` interaction, since it appeared that there is a difference between `inner` and `outer` London pollution levels, in relation to wind speed\n4. We can start the backwards stepwise elimination with the following model:\n\n`pm2_5 ~ avg_temp + rain_mm + wind_m_s * location`\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# define the model\nlm_pm2_5 <- lm(pm2_5 ~ avg_temp + rain_mm + wind_m_s * location,\n data = pm2_5)\n\n# perform BSE\nstep(lm_pm2_5)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nStart: AIC=41.08\npm2_5 ~ avg_temp + rain_mm + wind_m_s * location\n\n Df Sum of Sq RSS AIC\n- rain_mm 1 0.351 760.01 39.412\n- avg_temp 1 1.804 761.47 40.806\n 759.66 41.075\n- wind_m_s:location 1 134.820 894.48 158.336\n\nStep: AIC=39.41\npm2_5 ~ avg_temp + wind_m_s + location + wind_m_s:location\n\n Df Sum of Sq RSS AIC\n- avg_temp 1 1.758 761.77 39.098\n 760.01 39.412\n- wind_m_s:location 1 134.530 894.54 156.385\n\nStep: AIC=39.1\npm2_5 ~ wind_m_s + location + wind_m_s:location\n\n Df Sum of Sq RSS AIC\n 761.77 39.098\n- wind_m_s:location 1 136.95 898.72 157.788\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nlm(formula = pm2_5 ~ wind_m_s + location + wind_m_s:location, \n data = pm2_5)\n\nCoefficients:\n (Intercept) wind_m_s locationouter \n 18.2422 -0.2851 -2.0597 \nwind_m_s:locationouter \n -0.4318 \n```\n\n\n:::\n:::\n\n\n## Python\n\nWe first define the full model, again stringing together a few operations to be more concise.\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# define the model\nmodel_full = smf.ols(formula= \"pm2_5 ~ avg_temp + rain_mm + wind_m_s * C(location)\", data = pm2_5_py)\n\n# get the AIC of the model\nmodel_full.fit().aic\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n2112.725435984824\n```\n\n\n:::\n:::\n\n\nCan we drop the interaction term or any of the remaining main effects?\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# define the model\nmodel_1 = smf.ols(\n formula= \"pm2_5 ~ avg_temp + rain_mm + wind_m_s + C(location)\",\n data = pm2_5_py)\n \nmodel_2 = smf.ols(\n formula= \"pm2_5 ~ avg_temp + wind_m_s * C(location)\",\n data = pm2_5_py)\n \nmodel_3 = smf.ols(\n formula= \"pm2_5 ~ rain_mm + wind_m_s * C(location)\",\n data = pm2_5_py)\n\n# get the AIC of the models\nmodel_1.fit().aic\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n2229.9865962235162\n```\n\n\n:::\n\n```{.python .cell-code}\nmodel_2.fit().aic\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n2111.06230638372\n```\n\n\n:::\n\n```{.python .cell-code}\nmodel_3.fit().aic\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n2112.456639492829\n```\n\n\n:::\n\n```{.python .cell-code}\n# compare to the full model\nmodel_full.fit().aic\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n2112.725435984824\n```\n\n\n:::\n:::\n\n\nThe AIC goes up quite a bit if we drop the interaction term. This means that we cannot drop the interaction term, _nor any of the main effects that are included in the interaction_. 
These are `wind_m_s` and `location`.\n\nThe model with the lowest AIC is the one without the `rain_mm` term, so our working model is:\n\n\n::: {.cell}\n\n```{.python .cell-code}\nworking_model = smf.ols(\n formula= \"pm2_5 ~ avg_temp + wind_m_s * C(location)\",\n data = pm2_5_py)\n```\n:::\n\n\nNow we can again check if dropping the `avg_temp` term or the `wind_m_s:location` interaction has any effect on the model performance:\n\n\n::: {.cell}\n\n```{.python .cell-code}\nmodel_1 = smf.ols(\n formula= \"pm2_5 ~ wind_m_s * C(location)\",\n data = pm2_5_py)\n \nmodel_2 = smf.ols(\n formula= \"pm2_5 ~ avg_temp + wind_m_s + C(location)\",\n data = pm2_5_py)\n\n# get the AIC of the models\nmodel_1.fit().aic\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n2110.748543452394\n```\n\n\n:::\n\n```{.python .cell-code}\nmodel_2.fit().aic\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n2228.035595215867\n```\n\n\n:::\n\n```{.python .cell-code}\nworking_model.fit().aic\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n2111.06230638372\n```\n\n\n:::\n:::\n\n\nThis shows that dropping the `avg_temp` term lowers the AIC, whereas dropping the interaction term makes the model markedly worse.\n\nSo, our new working model is:\n\n\n::: {.cell}\n\n```{.python .cell-code}\nworking_model = smf.ols(\n formula= \"pm2_5 ~ wind_m_s * C(location)\",\n data = pm2_5_py)\n```\n:::\n\n\nLastly, now we've dropped the `avg_temp` term we can do one final check on the interaction term:\n\n\n::: {.cell}\n\n```{.python .cell-code}\nmodel_1 = smf.ols(\n formula= \"pm2_5 ~ wind_m_s + C(location) + wind_m_s + C(location)\",\n data = pm2_5_py)\n\nmodel_1.fit().aic\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n2229.438631394105\n```\n\n\n:::\n\n```{.python .cell-code}\nworking_model.fit().aic\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n2110.748543452394\n```\n\n\n:::\n:::\n\n\nDropping the interaction _still_ makes the model worse.\n:::\n\nOur minimal model is thus:\n\n`pm2_5 ~ wind_m_s + location + wind_m_s:location`\n\n:::\n:::\n\n## Summary\n\n::: {.callout-tip}\n#### Key points\n\n- We can use Backwards Stepwise Elimination (BSE) on a full model to see if certain terms add to the predictive power of the model or not\n- The AIC allows us to compare different models - if there is a difference in AIC of more than 2 between two models, then the smallest AIC score is more supported\n:::\n",
+ "markdown": "---\ntitle: \"Model comparisons\"\n---\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n\n\n::: {.callout-tip}\n#### Learning outcomes\n\n**Questions**\n\n- How do I compare linear models?\n- How do I decide which one is the \"best\" model?\n\n**Objectives**\n\n- Be able to compare models using the Akaike Information Criterion (AIC)\n- Use AIC in the context of Backwards Stepwise Elimination\n\n:::\n\n## Libraries and functions\n\n::: {.callout-note collapse=\"true\"}\n## Click to expand\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n### Libraries\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# A collection of R packages designed for data science\nlibrary(tidyverse)\n\n# Converts stats functions to a tidyverse-friendly format\nlibrary(rstatix)\n\n# Creates diagnostic plots using ggplot2\nlibrary(ggResidpanel)\n\n# Helper functions for tidying data\nlibrary(broom)\n```\n:::\n\n\n### Functions\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Calculates the Akaike Information Criterion\nstats::AIC()\n\n# Performs a backwards step-wise elimination process\nstats::step()\n```\n:::\n\n\n## Python\n\n### 
Libraries\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# A fundamental package for scientific computing in Python\nimport numpy as np\n\n# A Python data analysis and manipulation tool\nimport pandas as pd\n\n# Simple yet exhaustive stats functions.\nimport pingouin as pg\n\n# Python equivalent of `ggplot2`\nfrom plotnine import *\n\n# Statistical models, conducting tests and statistical data exploration\nimport statsmodels.api as sm\n\n# Convenience interface for specifying models using formula strings and DataFrames\nimport statsmodels.formula.api as smf\n```\n:::\n\n\n### Functions\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# Reads in a .csv file\npandas.read_csv()\n\n# Creates a model from a formula and data frame\nstatsmodels.formula.api.ols()\n```\n:::\n\n\n:::\n:::\n\n## Purpose and aim\n\nIn the previous example we used a single data set and fitted five linear models to it, depending on which predictor variables we used. Whilst this was fun (seriously, what else would you be doing right now?), it seems that there should be a \"better way\". Well, thankfully there is! In fact, there are several methods that can be used to compare different models in order to help identify \"the best\" model. More specifically, we can determine if a full model (which uses all available predictor variables and interactions) is necessary to appropriately describe the dependent variable, or whether we can throw away some of the terms (e.g. an interaction term) because they don’t offer any useful predictive power.\n\nHere we will use the **Akaike Information Criterion** in order to compare different models.\n\n## Data and hypotheses\n\nThis section uses the `data/CS5-ladybird.csv` data set. This data set comprises 20 observations of three variables (one dependent and two predictor). 
This records the clutch size (`eggs`) in a species of ladybird, alongside two potential predictor variables: the mass of the female (`weight`) and the colour of the male (`male`), which is a categorical variable.\n\n## Backwards Stepwise Elimination\n\nFirst, we load the data and visualise it:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nladybird <- read_csv(\"data/CS5-ladybird.csv\")\n\nhead(ladybird)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 6 × 4\n id male weight eggs\n \n1 1 Wild 10.6 27\n2 2 Wild 10.6 26\n3 3 Wild 12.5 28\n4 4 Wild 11.4 24\n5 5 Wild 14.9 30\n6 6 Wild 10.4 21\n```\n\n\n:::\n:::\n\n\nWe can visualise the data by `male`, so we can see if the egg clutch size differs a lot between the two groups:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(ladybird,\n aes(x = male, y = eggs)) +\n geom_boxplot() +\n geom_jitter(width = 0.05)\n```\n\n::: {.cell-output-display}\n![](cs5_practical_model-comparisons_files/figure-html/unnamed-chunk-8-1.png){width=672}\n:::\n:::\n\n\nWe can also plot the egg clutch size against the weight, again for each male colour group:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# visualise the data\nggplot(ladybird,\n aes(x = weight, y = eggs,\n colour = male)) +\n geom_point() +\n scale_color_brewer(palette = \"Dark2\")\n```\n\n::: {.cell-output-display}\n![](cs5_practical_model-comparisons_files/figure-html/unnamed-chunk-9-1.png){width=672}\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nladybird_py = pd.read_csv(\"data/CS5-ladybird.csv\")\n```\n:::\n\n\nWe can visualise the data by `male`, so we can see if the egg clutch size differs a lot between the two groups:\n\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(ladybird_py,\n aes(x = \"male\", y = \"eggs\")) +\n geom_boxplot() +\n geom_jitter(width = 0.05))\n```\n\n::: {.cell-output-display}\n![](cs5_practical_model-comparisons_files/figure-html/unnamed-chunk-11-1.png){width=614}\n:::\n:::\n\n\nWe can also plot the egg clutch size against the weight, again for each male colour group:\n\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(ladybird_py,\n aes(x = \"weight\", y = \"eggs\",\n colour = \"male\")) +\n geom_point() +\n scale_color_brewer(type = \"qual\",\n palette = \"Dark2\"))\n```\n\n::: {.cell-output-display}\n![](cs5_practical_model-comparisons_files/figure-html/unnamed-chunk-12-3.png){width=614}\n:::\n:::\n\n\n:::\n\nWe can see a few things already:\n\n1. There aren't a huge number of data points in each group, so we need to be a bit cautious with drawing any firm conclusions.\n2. There is quite some spread in the egg clutch sizes, with two observations in the `Melanic` group being rather low.\n3. From the box plot, there does not seem to be much difference in the egg clutch size between the two male colour groups.\n4. The scatter plot suggests that egg clutch size seems to increase somewhat linearly as the weight of the female goes up. There does not seem to be much difference between the two male colour groups in this respect.\n\n### Comparing models with AIC (step 1)\n\nWe start with the complete or _full_ model, which takes into account any possible interaction between `weight` and `male`.\n\nNext, we define the reduced model. This is the next simplest model. In this case we're removing the interaction and constructing an additive model.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# define the full model\nlm_full <- lm(eggs ~ weight * male,\n data = ladybird)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\n# define the additive model\nlm_add <- lm(eggs ~ weight + male,\n data = ladybird)\n```\n:::\n\n\nWe then extract the AIC values for each of the models:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nAIC(lm_full)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 100.0421\n```\n\n\n:::\n\n```{.r .cell-code}\nAIC(lm_add)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 99.19573\n```\n\n\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# create the model\nmodel = smf.ols(formula= \"eggs ~ weight * C(male)\", data = ladybird_py)\n# and get the fitted parameters of the model\nlm_full_py = model.fit()\n```\n:::\n\n::: {.cell}\n\n```{.python .cell-code}\n# create the additive linear model\nmodel = smf.ols(formula= \"eggs ~ weight + C(male)\", data = ladybird_py)\n# and get the fitted parameters of the model\nlm_add_py = model.fit()\n```\n:::\n\n\nWe then extract the AIC values for each of the models:\n\n\n::: {.cell}\n\n```{.python .cell-code}\nlm_full_py.aic\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n98.04206256815984\n```\n\n\n:::\n\n```{.python .cell-code}\nlm_add_py.aic\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n97.19572958073036\n```\n\n\n:::\n:::\n\n:::\n\nEach line tells you the AIC score for that model. The full model has 4 parameters (the intercept, the coefficient for the continuous variable `weight`, the coefficient for the categorical variable `male` and a coefficient for the interaction term `weight:male`). The additive model has a lower AIC score with only 3 parameters (since we’ve dropped the interaction term). There are different ways of interpreting AIC scores but the most widely used interpretation says that:\n\n* if the difference between two AIC scores is **greater than 2**, then the model with the **smaller AIC score is better supported** than the model with the higher AIC score\n* if the difference between the two models’ AIC scores is **less than 2**, then both models are **equally well supported**\n\nThis choice of language (supported vs significant) is deliberate and there are areas of statistics where AIC scores are used differently from the way we are going to use them here (ask if you want a bit of philosophical ramble from me). However, in this situation we will use the AIC scores to decide whether our reduced model is at least as good as the full model. Here, since the difference in AIC scores is less than 2, we can say that dropping the interaction term has left us with a model that is both simpler (fewer terms) and at least as good (AIC score) as the full model. As such our additive model `eggs ~ weight + male` is designated our current _working minimal model_.\n\n
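If you want to make this comparison explicit, you can simply subtract one AIC score from the other. This is a quick sanity check using the Python objects defined above (the R equivalent would be `AIC(lm_full) - AIC(lm_add)`):\n\n::: {.cell}\n\n```{.python .cell-code}\n# difference in AIC between the full and the additive model;\n# a difference of less than 2 means both models are equally well supported\nlm_full_py.aic - lm_add_py.aic\n```\n:::\n\nThis works out at roughly 0.85, comfortably below the threshold of 2.\n\n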
### Comparing models with AIC (step 2)\n\nNext, we see which of the remaining terms can be dropped. We will look at the models where we have dropped each of `male` and `weight` in turn (i.e. `eggs ~ weight` and `eggs ~ male`) and compare their AIC values with the AIC of our current minimal model (`eggs ~ weight + male`). If the AIC value of at least one of our new reduced models is lower (or at least no more than 2 greater) than the AIC of our current minimal model, then we can drop the relevant term and get ourselves a new minimal model. If we find ourselves in a situation where we can drop more than one term, we will drop the term that gives us the model with the lowest AIC.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nDrop the variable `weight` and examine the AIC:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# define the model\nlm_male <- lm(eggs ~ male,\n data = ladybird)\n\n# extract the AIC\nAIC(lm_male)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 118.7093\n```\n\n\n:::\n:::\n\n\nDrop the variable `male` and examine the AIC:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# define the model\nlm_weight <- lm(eggs ~ weight,\n data = ladybird)\n\n# extract the AIC\nAIC(lm_weight)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 97.52601\n```\n\n\n:::\n:::\n\n\n## Python\n\nDrop the variable `weight` and examine the AIC:\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# create the model\nmodel = smf.ols(formula= \"eggs ~ C(male)\", data = ladybird_py)\n# and get the fitted parameters of the model\nlm_male_py = model.fit()\n\n# extract the AIC\nlm_male_py.aic\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n116.7092646564482\n```\n\n\n:::\n:::\n\n\nDrop the variable `male` and examine the AIC:\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# create the model\nmodel = smf.ols(formula= \"eggs ~ weight\", data = ladybird_py)\n# and get the fitted parameters of the model\nlm_weight_py = model.fit()\n\n# extract the AIC\nlm_weight_py.aic\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n95.52601286248304\n```\n\n\n:::\n:::\n\n:::\n\nConsidering both outputs together and comparing with the AIC of our current minimal model, we can see that dropping `male` has decreased the AIC further, whereas dropping `weight` has actually increased the AIC and thus worsened the model quality.\n\nHence we can drop `male` and our new minimal model is `eggs ~ weight`.\n\n### Comparing models with AIC (step 3)\n\nOur final comparison is to drop the variable `weight` and compare this simple model with a null model (`eggs ~ 1`), which assumes that the clutch size is constant across all observations (the null model simply fits the overall mean clutch size).\n\nDrop the variable `weight` and see if that has an effect:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# define the model\nlm_null <- lm(eggs ~ 1,\n data = ladybird)\n\n# extract the AIC\nAIC(lm_null)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 117.2178\n```\n\n\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# create the model\nmodel = smf.ols(formula= \"eggs ~ 1\", data = ladybird_py)\n# and get the fitted parameters of the model\nlm_null_py = model.fit()\n\n# extract the AIC\nlm_null_py.aic\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n115.21783038624153\n```\n\n\n:::\n:::\n\n\n:::\n\nThe AIC of our null model is quite a bit larger than that of our current minimal model `eggs ~ weight` and so we conclude that `weight` is important. As such our minimal model is `eggs ~ weight`.\n\nSo, in summary, we could conclude that:\n\n> Female size is a useful predictor of clutch size, but male type is not so important.\n\nAt this point we can then continue analysing this minimal model, by checking the diagnostic plots and the underlying assumptions. If they all pan out, then we can continue with an ANOVA.\n\n
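As a quick example of such a check, we can pull the fitted values and residuals straight from the fitted model object and plot them against each other. This is only a minimal sketch (the data frame name `diag_df` is arbitrary; it uses the `lm_weight_py` object and `plotnine` from above) - in R, `resid_panel()` from `ggResidpanel` would give you a full panel of diagnostic plots:\n\n::: {.cell}\n\n```{.python .cell-code}\n# collect the fitted values and residuals of the minimal model\ndiag_df = pd.DataFrame({\"fitted\": lm_weight_py.fittedvalues,\n                        \"residuals\": lm_weight_py.resid})\n\n# the residuals should scatter evenly around zero across the fitted range\n(ggplot(diag_df,\n    aes(x = \"fitted\", y = \"residuals\")) +\n    geom_point() +\n    geom_hline(yintercept = 0, linetype = \"dashed\"))\n```\n:::\n\n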
## Notes on Backwards Stepwise Elimination\n\nThis method of finding a minimal model by starting with a full model and removing variables is called backward stepwise elimination. Although it is regularly practised in data analysis, this approach has come under increasing criticism, with some calling for it to be avoided entirely.\n\nWhy have we made you work through this procedure then? Given their prevalence in academic papers, it is very useful to be aware of these procedures and to know that there are issues with them. In other situations, using AIC for model comparisons is justified and you will come across it regularly. Additionally, there may be situations where you feel there are good reasons to drop a parameter from your model – using this technique you can justify that this doesn’t affect the model fit. Taken together, using backwards stepwise elimination for model comparison is still a useful technique.\n\n::: {.callout-note}\n## Automatic BSE in R (but not Python)\n\nPerforming backwards stepwise elimination manually can be quite tedious. Thankfully R acknowledges this and there is a single inbuilt function called `step()` that can perform all of the necessary steps for you using AIC.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# define the full model\nlm_full <- lm(eggs ~ weight * male,\n data = ladybird)\n\n# perform backwards stepwise elimination\nstep(lm_full)\n```\n:::\n\n\nThis will perform a full backwards stepwise elimination process and will find the minimal model for you.\n\nYes, I could have told you this earlier, but where’s the fun in that? (it is also useful for you to understand the steps behind the technique I suppose...)\n\nWhen doing this in Python, you are a bit stuck. There does not seem to be an equivalent function. If you want to cobble something together yourself, then use [this link](https://stackoverflow.com/questions/22428625/does-statsmodels-or-another-python-package-offer-an-equivalent-to-rs-step-f) as a starting point.\n\n
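To give you a flavour of what that could look like, here is a minimal sketch of AIC-based backwards elimination. Everything about it is our own construction (the function name `backward_eliminate`, its interface and the `aic_of` helper are not part of any library); it drops one term at a time, only handles main effects, and does not enforce the rule that terms involved in an interaction must stay in the model - so treat it as a starting point rather than a substitute for `step()`:\n\n::: {.cell}\n\n```{.python .cell-code}\n# a rough Python stand-in for R's step(), using the smf alias from above\ndef backward_eliminate(response, terms, data):\n    current = list(terms)\n\n    def aic_of(term_list):\n        # build \"response ~ term1 + term2 + ...\", or the null model \"response ~ 1\"\n        rhs = \" + \".join(term_list) if term_list else \"1\"\n        return smf.ols(response + \" ~ \" + rhs, data = data).fit().aic\n\n    best_aic = aic_of(current)\n    while current:\n        # AIC of every model with exactly one term dropped\n        candidates = {t: aic_of([x for x in current if x != t])\n                      for t in current}\n        drop = min(candidates, key = candidates.get)\n        if candidates[drop] >= best_aic:\n            break  # no single drop lowers the AIC, so stop\n        current.remove(drop)\n        best_aic = candidates[drop]\n    return current, best_aic\n```\n:::\n\nOn the ladybird data, `backward_eliminate(\"eggs\", [\"weight\", \"C(male)\"], ladybird_py)` ends up at `eggs ~ weight`, matching the manual steps we worked through above.\n\n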
:::\n\n## Exercises\n\nWe are going to practice the backwards stepwise elimination technique on some of the data sets we analysed previously.\n\nFor each of the following data sets I would like you to:\n\n1. Define the response variable\n2. Define the relevant predictor variables\n3. Define relevant interactions\n4. Perform a backwards stepwise elimination and discover the minimal model using AIC\n\nNB: if an interaction term is significant, then any main factor that is part of the interaction term cannot be dropped from the model.\n\nPerform a BSE on the following data sets:\n\n* `data/CS5-trees.csv`\n* `data/CS5-pm2_5.csv`\n\n### BSE: Trees {#sec-exr_bsetrees}\n\n:::{.callout-exercise}\n\n\n{{< level 2 >}}\n\n\n\nBSE on `trees`:\n\n::: {.callout-answer collapse=\"true\"}\n## Answer\n\nLet's start by reading in the data and checking which variables are in the data set.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntrees <- read_csv(\"data/CS5-trees.csv\")\n\nhead(trees)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 6 × 3\n girth height volume\n \n1 8.3 70 10.3\n2 8.6 65 10.3\n3 8.8 63 10.2\n4 10.5 72 16.4\n5 10.7 81 18.8\n6 10.8 83 19.7\n```\n\n\n:::\n:::\n\n\n## Python\n\nFirst, we read in the data (if needed) and have a look at the variables:\n\n\n::: {.cell}\n\n```{.python .cell-code}\ntrees_py = pd.read_csv(\"data/CS5-trees.csv\")\n\ntrees_py.head()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n girth height volume\n0 8.3 70 10.3\n1 8.6 65 10.3\n2 8.8 63 10.2\n3 10.5 72 16.4\n4 10.7 81 18.8\n```\n\n\n:::\n:::\n\n\n:::\n\n1. The response variable is `volume`\n2. The predictor variables are `girth` and `height`\n3. 
The only possible interaction term is `girth:height`\n4. We perform a BSE on the model using the `step()` function\n\nThe full model is `volume ~ girth * height`.\n\nWe perform the BSE as follows:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nDefine the model and use the `step()` function:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# define the full model\nlm_trees <- lm(volume ~ girth * height,\n data = trees)\n\n# perform BSE\nstep(lm_trees)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nStart: AIC=65.49\nvolume ~ girth * height\n\n Df Sum of Sq RSS AIC\n 198.08 65.495\n- girth:height 1 223.84 421.92 86.936\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nlm(formula = volume ~ girth * height, data = trees)\n\nCoefficients:\n (Intercept) girth height girth:height \n 69.3963 -5.8558 -1.2971 0.1347 \n```\n\n\n:::\n:::\n\n\n## Python\n\nWe first define the full model, then the model without the interaction (`model_1`).\n\nWe extract the AICs for both models. Because we do not have an automated way of performing the BSE, we're stringing together a few operations to make the code a bit more concise (getting the `.fit()` of the model and immediately extracting the `aic` value):\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# define the models\nmodel_full = smf.ols(formula= \"volume ~ girth * height\",\n data = trees_py)\n\nmodel_1 = smf.ols(formula= \"volume ~ girth + height\",\n data = trees_py)\n\n# get the AIC of the model\nmodel_full.fit().aic\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n153.46916240903528\n```\n\n\n:::\n\n```{.python .cell-code}\nmodel_1.fit().aic\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n174.9099729872703\n```\n\n\n:::\n:::\n\n\n:::\n\nThis BSE approach only gets as far as the first step (trying to drop the interaction term). We see immediately that dropping the interaction term makes the model worse. This means that the best model is still the full model.\n:::\n:::\n\n### BSE: Air pollution {#sec-exr_bsepollution}\n\n:::{.callout-exercise}\n\n\n{{< level 2 >}}\n\n\n\nPerform a BSE on `pm2_5`. Let's start by reading in the data and checking which variables are in the data set.\n\n::: {.callout-answer collapse=\"true\"}\n## Answer\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\npm2_5 <- read_csv(\"data/CS5-pm2_5.csv\")\n\nhead(pm2_5)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 6 × 6\n avg_temp date location pm2_5 rain_mm wind_m_s\n \n1 4.5 01/01/2019 inner 17.1 2.3 3.87\n2 4.9 01/01/2019 outer 10.8 2.3 5.84\n3 4.3 02/01/2019 inner 14.9 2.3 3.76\n4 4.8 02/01/2019 outer 11.4 2.3 6 \n5 4 03/01/2019 inner 18.5 1.4 2.13\n6 4.5 03/01/2019 outer 15.0 1.4 2.57\n```\n\n\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\npm2_5_py = pd.read_csv(\"data/CS5-pm2_5.csv\")\n\npm2_5_py.head()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n avg_temp date location pm2_5 rain_mm wind_m_s\n0 4.5 01/01/2019 inner 17.126 2.3 3.87\n1 4.9 01/01/2019 outer 10.821 2.3 5.84\n2 4.3 02/01/2019 inner 14.884 2.3 3.76\n3 4.8 02/01/2019 outer 11.416 2.3 6.00\n4 4.0 03/01/2019 inner 18.471 1.4 2.13\n```\n\n\n:::\n:::\n\n:::\n\n1. The response variable is `pm2_5`\n2. The predictor variables are all the variables, apart from `date` and `pm2_5`. It would be strange to try and create a model that relies on each individual measurement!\n3. 
We can add the `wind_m_s:location` interaction, since it appeared that there is a difference between `inner` and `outer` London pollution levels, in relation to wind speed\n4. We can start the backwards stepwise elimination with the following model:\n\n`pm2_5 ~ avg_temp + rain_mm + wind_m_s * location`\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# define the model\nlm_pm2_5 <- lm(pm2_5 ~ avg_temp + rain_mm + wind_m_s * location,\n data = pm2_5)\n\n# perform BSE\nstep(lm_pm2_5)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nStart: AIC=41.08\npm2_5 ~ avg_temp + rain_mm + wind_m_s * location\n\n Df Sum of Sq RSS AIC\n- rain_mm 1 0.351 760.01 39.412\n- avg_temp 1 1.804 761.47 40.806\n 759.66 41.075\n- wind_m_s:location 1 134.820 894.48 158.336\n\nStep: AIC=39.41\npm2_5 ~ avg_temp + wind_m_s + location + wind_m_s:location\n\n Df Sum of Sq RSS AIC\n- avg_temp 1 1.758 761.77 39.098\n 760.01 39.412\n- wind_m_s:location 1 134.530 894.54 156.385\n\nStep: AIC=39.1\npm2_5 ~ wind_m_s + location + wind_m_s:location\n\n Df Sum of Sq RSS AIC\n 761.77 39.098\n- wind_m_s:location 1 136.95 898.72 157.788\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nlm(formula = pm2_5 ~ wind_m_s + location + wind_m_s:location, \n data = pm2_5)\n\nCoefficients:\n (Intercept) wind_m_s locationouter \n 18.2422 -0.2851 -2.0597 \nwind_m_s:locationouter \n -0.4318 \n```\n\n\n:::\n:::\n\n\n## Python\n\nWe first define the full model, again stringing together a few operations to be more concise.\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# define the model\nmodel_full = smf.ols(formula= \"pm2_5 ~ avg_temp + rain_mm + wind_m_s * C(location)\", data = pm2_5_py)\n\n# get the AIC of the model\nmodel_full.fit().aic\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n2112.725435984824\n```\n\n\n:::\n:::\n\n\nCan we drop the interaction term or any of the remaining main effects?\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# define the model\nmodel_1 = smf.ols(\n formula= \"pm2_5 ~ avg_temp + rain_mm + wind_m_s + C(location)\",\n data = pm2_5_py)\n \nmodel_2 = smf.ols(\n formula= \"pm2_5 ~ avg_temp + wind_m_s * C(location)\",\n data = pm2_5_py)\n \nmodel_3 = smf.ols(\n formula= \"pm2_5 ~ rain_mm + wind_m_s * C(location)\",\n data = pm2_5_py)\n\n# get the AIC of the models\nmodel_1.fit().aic\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n2229.9865962235162\n```\n\n\n:::\n\n```{.python .cell-code}\nmodel_2.fit().aic\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n2111.06230638372\n```\n\n\n:::\n\n```{.python .cell-code}\nmodel_3.fit().aic\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n2112.456639492829\n```\n\n\n:::\n\n```{.python .cell-code}\n# compare to the full model\nmodel_full.fit().aic\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n2112.725435984824\n```\n\n\n:::\n:::\n\n\nThe AIC goes up quite a bit if we drop the interaction term. This means that we cannot drop the interaction term, _nor any of the main effects that are included in the interaction_. 
These are `wind_m_s` and `location`.\n\nThe model with the lowest AIC is the one without the `rain_mm` term, so our working model is:\n\n\n::: {.cell}\n\n```{.python .cell-code}\nworking_model = smf.ols(\n formula= \"pm2_5 ~ avg_temp + wind_m_s * C(location)\",\n data = pm2_5_py)\n```\n:::\n\n\nNow we can again check if dropping the `avg_temp` term or the `wind_m_s:location` interaction has any effect on the model performance:\n\n\n::: {.cell}\n\n```{.python .cell-code}\nmodel_1 = smf.ols(\n formula= \"pm2_5 ~ wind_m_s * C(location)\",\n data = pm2_5_py)\n \nmodel_2 = smf.ols(\n formula= \"pm2_5 ~ avg_temp + wind_m_s + C(location)\",\n data = pm2_5_py)\n\n# get the AIC of the models\nmodel_1.fit().aic\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n2110.748543452394\n```\n\n\n:::\n\n```{.python .cell-code}\nmodel_2.fit().aic\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n2228.035595215867\n```\n\n\n:::\n\n```{.python .cell-code}\nworking_model.fit().aic\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n2111.06230638372\n```\n\n\n:::\n:::\n\n\nThis shows that dropping the `avg_temp` term lowers the AIC, whereas dropping the interaction term makes the model markedly worse.\n\nSo, our new working model is:\n\n\n::: {.cell}\n\n```{.python .cell-code}\nworking_model = smf.ols(\n formula= \"pm2_5 ~ wind_m_s * C(location)\",\n data = pm2_5_py)\n```\n:::\n\n\nLastly, now that we've dropped the `avg_temp` term, we can do one final check on the interaction term:\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# define the additive model, without the interaction\nmodel_1 = smf.ols(\n formula= \"pm2_5 ~ wind_m_s + C(location)\",\n data = pm2_5_py)\n\nmodel_1.fit().aic\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n2229.438631394105\n```\n\n\n:::\n\n```{.python .cell-code}\nworking_model.fit().aic\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n2110.748543452394\n```\n\n\n:::\n:::\n\n\nDropping the interaction _still_ makes the model worse.\n:::\n\nOur minimal model is thus:\n\n`pm2_5 ~ wind_m_s + location + wind_m_s:location`\n\n:::\n:::\n\n## Summary\n\n::: {.callout-tip}\n#### Key points\n\n- We can use Backwards Stepwise Elimination (BSE) on a full model to see if certain terms add to the predictive power of the model or not\n- The AIC allows us to compare different models: if the difference in AIC between two models is more than 2, then the model with the smaller AIC score is the better supported one\n:::\n",
 "supporting": [
 "cs5_practical_model-comparisons_files"
 ],
diff --git a/_freeze/materials/cs5_practical_model-comparisons/figure-html/unnamed-chunk-11-1.png b/_freeze/materials/cs5_practical_model-comparisons/figure-html/unnamed-chunk-11-1.png
index 0e9a8c8..0f041fd 100644
Binary files a/_freeze/materials/cs5_practical_model-comparisons/figure-html/unnamed-chunk-11-1.png and b/_freeze/materials/cs5_practical_model-comparisons/figure-html/unnamed-chunk-11-1.png differ
diff --git a/_freeze/materials/cs5_practical_model-comparisons/figure-html/unnamed-chunk-8-1.png b/_freeze/materials/cs5_practical_model-comparisons/figure-html/unnamed-chunk-8-1.png
index 1eb7fbe..81955cb 100644
Binary files a/_freeze/materials/cs5_practical_model-comparisons/figure-html/unnamed-chunk-8-1.png and b/_freeze/materials/cs5_practical_model-comparisons/figure-html/unnamed-chunk-8-1.png differ
diff --git a/materials/cs5_practical_model-comparisons.qmd b/materials/cs5_practical_model-comparisons.qmd
index 95454cd..cabf1dd 100644
--- a/materials/cs5_practical_model-comparisons.qmd
+++ b/materials/cs5_practical_model-comparisons.qmd
@@ -105,7 +105,6 @@ pandas.read_csv()
 statsmodels.formula.api.ols()
 ```
 
-:::
 :::
 :::
 