Commit

glm python

mvanrongen committed Jan 31, 2024
1 parent 0e6b1c5 commit 6257504
Showing 19 changed files with 301 additions and 67 deletions.


247 changes: 196 additions & 51 deletions _site/materials/glm-practical-logistic-binary.html

Large diffs are not rendered by default.

10 changes: 5 additions & 5 deletions _site/search.json

Large diffs are not rendered by default.

107 changes: 98 additions & 9 deletions materials/glm-practical-logistic-binary.qmd
@@ -63,6 +63,9 @@
import statsmodels.api as sm
# Convenience interface for specifying models using formula strings and DataFrames
import statsmodels.formula.api as smf
# Needed for additional probability functionality
from scipy.stats import *
```

### Functions
@@ -365,7 +368,9 @@
There’s a lot to unpack here, but let's start with what we're familiar with: coefficients.
## Parameter interpretation

::: {.panel-tabset group="language"}

## R

The coefficients or parameters can be found in the `Coefficients` block. The main numbers to extract from the output are the two numbers underneath `Estimate`:

```
@@ -401,6 +406,7 @@
$$ P(pointed \ beak) = \frac{\exp(-43.41 + 3.39 \times blength)}{1 + \exp(-43.41 + 3.39 \times blength)} $$
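For instance, plugging in a beak length of 15 mm (keeping the rounded coefficients above) gives:

$$ P(pointed \ beak) = \frac{\exp(-43.41 + 3.39 \times 15)}{1 + \exp(-43.41 + 3.39 \times 15)} = \frac{\exp(7.44)}{1 + \exp(7.44)} \approx 0.999 $$

so a bird with a 15 mm beak is almost certain to have a pointed one.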
Having this formula means that we can calculate the probability of having a pointed beak for any beak length. How do we work this out in practice?

::: {.panel-tabset group="language"}

## R

Well, the probability of having a pointed beak if the beak length is large (for example 15 mm) can be calculated as follows:
@@ -516,19 +522,68 @@
Short beaks are more closely associated with the bluntly shaped beaks, whereas long beaks are more closely associated with the pointed shape.

## Assumptions

As explained in the background chapter, we can't really use the standard diagnostic plots to assess assumptions. We're not going to go into a lot of detail for now, but there is one thing that we can do: look for influential points using the Cook’s distance plot.

::: {.panel-tabset group="language"}
## R

```{r}
plot(glm_bks, which = 4)
```

:::{.callout-note collapse=true}
## Extracting the Cook's distances from the `glm` object

Instead of using the `plot()` function, we can also extract the values directly from the `glm` object. We can use the `augment()` function to do this and create a lollipop or stem plot:

```{r}
glm_bks %>%
  augment() %>%                  # get underlying data
  select(.cooksd) %>%            # select the Cook's d
  mutate(obs = 1:n()) %>%        # create an index column
  ggplot(aes(x = obs, y = .cooksd)) +
  geom_point() +
  geom_segment(aes(x = obs, y = .cooksd, xend = obs, yend = 0))
```
:::

## Python

As always, there are different ways of doing this. Here we extract the Cook's d values from the `glm` object and put them in a Pandas DataFrame. We can then use that to plot them in a lollipop or stem plot.

```{python}
# extract the Cook's distances
glm_bks_py_resid = pd.DataFrame(
    glm_bks_py.get_influence().summary_frame()["cooks_d"]
)

# add row index
glm_bks_py_resid['obs'] = glm_bks_py_resid.reset_index().index
```

We now have two columns:

```{python}
glm_bks_py_resid
```

We can use these to create the plot:

```{python}
#| results: hide
(ggplot(glm_bks_py_resid,
        aes(x = "obs", y = "cooks_d")) +
    geom_segment(aes(x = "obs", y = "cooks_d", xend = "obs", yend = 0)) +
    geom_point())
```

:::

This shows that there are no very obvious influential points. You could regard point `34` as potentially influential (it's got a Cook's distance of around `0.8`), but I'm not overly worried.

If we were worried, we'd remove the troublesome data point, re-run the analysis and see if that changes the statistical outcome. If so, then our entire (statistical) conclusion hinges on one data point, which is not a very robust bit of research. If it *doesn't* change our significance, then all is well, even though that data point is influential.
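In Python, a minimal sketch of that check might look like this (assuming the `early_finches_py` DataFrame with its default integer row index and the `pointed_beak ~ blength` model from earlier; the `_sub` names are just for illustration):

```{python}
# drop the potentially influential observation (row label 34 in the
# default integer index) and refit the model on the remaining data
early_finches_sub = early_finches_py.drop(index = 34)

model_sub = smf.glm(formula = "pointed_beak ~ blength",
                    family = sm.families.Binomial(),
                    data = early_finches_sub)

glm_bks_sub_py = model_sub.fit()
print(glm_bks_sub_py.summary())
```

If the coefficients and p-values barely move, then the conclusion does not hinge on that one point.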

## Assessing significance

We can ask several questions.
@@ -543,40 +598,71 @@
Roughly speaking this asks "can our model predict our data reasonably well?"
Unfortunately, there isn’t a single command that does this for us, and we have to lift some of the numbers from the summary output ourselves.

```{r}
pchisq(9.1879, 59, lower.tail = FALSE)
```


Here, we’ve used the `pchisq()` function (which calculates the correct probability for us – ask if you want a hand-wavy explanation). The first argument is the residual deviance value from the summary table and the second argument is the residual degrees of freedom from the same table.

## Python

```{python}
from scipy.stats import chi2
chi2.sf(9.1879, 59)
```
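Incidentally, rather than copying the numbers out of the summary output, we could read them straight off the fitted model object. A quick sketch using the `deviance` and `df_resid` attributes that statsmodels provides:

```{python}
# residual deviance and residual degrees of freedom,
# taken directly from the fitted model object
chi2.sf(glm_bks_py.deviance, glm_bks_py.df_resid)
```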

:::

This gives us a probability of `1`. Formally, this is the p-value of a goodness-of-fit test on the residual deviance; informally, we can read it as how plausible it is that the model describes the data well. There aren’t any strict conventions on how to interpret this value but, for me, a tiny value would indicate a rubbish model.

**Is the overall model better than the null model?**

::: {.panel-tabset group="language"}
## R

```{r}
pchisq(84.5476 - 9.1879, 60 - 59, lower.tail = FALSE)
```

Here we’ve used the `pchisq` function again (if you didn’t ask before, you probably aren’t going to ask now).

## Python

First we need to define the null model:

```{python}
# create a generalised linear model (the null model, intercept only)
model = smf.glm(formula = "pointed_beak ~ 1",
family = sm.families.Binomial(),
data = early_finches_py)
# and get the fitted parameters of the model
glm_bks_null_py = model.fit()
print(glm_bks_null_py.summary())
```

In order to compare our original fitted model to the null model we need to know the deviances of both models and the residual degrees of freedom of both models, which we could get from the summary method.

```{python}
chi2.sf(84.5476 - 9.1879, 60 - 59)
```
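As before, we could also pull these values from the fitted model objects instead of the summary tables (a sketch, reusing `glm_bks_py` and `glm_bks_null_py`):

```{python}
# difference in deviance and in residual degrees of freedom
# between the null model and our fitted model
diff_deviance = glm_bks_null_py.deviance - glm_bks_py.deviance
diff_df = glm_bks_null_py.df_resid - glm_bks_py.df_resid

chi2.sf(diff_deviance, diff_df)
```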

:::

The first argument is the difference between the null and residual deviances and the second argument is the difference in degrees of freedom between the null and residual models. All of these values can be lifted from the summary table.

This gives us a probability of pretty much zero. This value is doing a formal test to see whether our fitted model is significantly different from the null model. Here we can treat this as a classical hypothesis test and, since this p-value is less than 0.05, we can say that our fitted model (with `blength` as a predictor variable) is definitely better than the null model (which has no predictor variables). Woohoo!

**Are any of the individual predictors significant?**

Finally, we’ll use the `anova()` function from before to determine which predictor variables are important, and specifically in this case whether the beak length predictor is significant.

::: {.panel-tabset group="language"}
## R

```{r}
anova(glm_bks, test = "Chisq")
```

The `anova()` function is a true workhorse within R! This time we’ve used it to create an Analysis of Deviance table. This is exactly equivalent to an ordinary ANOVA table where we have rows corresponding to each predictor variable and a p-value telling us whether that variable is significant or not.
@@ -585,6 +671,9 @@
The p-value for the `blength` predictor is written under the `Pr(>Chi)` column and you can see that it is significant.

This shouldn’t be surprising since we already saw that our overall model was better than the null model, which in this case is exactly the same as asking whether the beak length term is significant. However, in more complicated models with multiple predictors these two comparisons (and p-values) won’t be the same.

## Python

Alas, for some inexplicable reason there is no direct equivalent of this in Python (yet?). At least, not as far as I'm aware...
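That said, an analysis of deviance boils down to likelihood-ratio tests between nested models, so for a single predictor we can reproduce the `blength` row by hand; it is exactly the null-model comparison from above (a sketch, reusing the fitted objects):

```{python}
# likelihood-ratio test for the blength term: twice the difference in
# log-likelihood between the nested models, compared against a
# chi-squared distribution
lrt_stat = 2 * (glm_bks_py.llf - glm_bks_null_py.llf)
lrt_df = glm_bks_null_py.df_resid - glm_bks_py.df_resid

chi2.sf(lrt_stat, lrt_df)
```

For models with several predictors you would fit each nested pair and repeat this comparison term by term.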

:::
