Commit

glm python

mvanrongen committed Jan 31, 2024
1 parent 0e6b1c5 commit 6257504
Showing 19 changed files with 301 additions and 67 deletions.


247 changes: 196 additions & 51 deletions _site/materials/glm-practical-logistic-binary.html

Large diffs are not rendered by default.

10 changes: 5 additions & 5 deletions _site/search.json

Large diffs are not rendered by default.

107 changes: 98 additions & 9 deletions materials/glm-practical-logistic-binary.qmd
@@ -63,6 +63,9 @@
import statsmodels.api as sm
# Convenience interface for specifying models using formula strings and DataFrames
import statsmodels.formula.api as smf
# Needed for additional probability functionality
from scipy.stats import *
```

### Functions
@@ -365,7 +368,9 @@
There’s a lot to unpack here, but let's start with what we're familiar with: coefficients.
## Parameter interpretation

::: {.panel-tabset group="language"}

## R

The coefficients or parameters can be found in the `Coefficients` block. The main numbers to extract from the output are the two numbers underneath `Estimate`:

```
@@ -401,6 +406,7 @@
$$ P(pointed \ beak) = \frac{\exp(-43.41 + 3.39 \times blength)}{1 + \exp(-43.41 + 3.39 \times blength)} $$
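For instance, plugging in a beak length of 15 mm (keeping the rounded coefficients above) gives:

$$ P(pointed \ beak) = \frac{\exp(-43.41 + 3.39 \times 15)}{1 + \exp(-43.41 + 3.39 \times 15)} = \frac{\exp(7.44)}{1 + \exp(7.44)} \approx 0.999 $$

so a bird with a 15 mm beak is almost certain to have a pointed one.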
Having this formula means that we can calculate the probability of having a pointed beak for any beak length. How do we work this out in practice?

::: {.panel-tabset group="language"}

## R

Well, the probability of having a pointed beak if the beak length is large (for example 15 mm) can be calculated as follows:
@@ -516,19 +522,68 @@
Short beaks are more closely associated with the bluntly shaped beaks, whereas long beaks are more closely associated with the pointed shape.

## Assumptions

As explained in the background chapter, we can't really use the standard diagnostic plots to assess assumptions. We're not going to go into a lot of detail for now, but there is one thing that we can do: look for influential points using the Cook’s distance plot.

::: {.panel-tabset group="language"}
## R

```{r}
plot(glm_bks, which = 4)
```

:::{.callout-note collapse=true}
## Extracting the Cook's distances from the `glm` object

Instead of using the `plot()` function, we can also extract the values directly from the `glm` object. We can use the `augment()` function to do this and create a lollipop or stem plot:

```{r}
glm_bks %>%
  augment() %>%                  # get underlying data
  select(.cooksd) %>%            # select the Cook's d
  mutate(obs = 1:n()) %>%        # create an index column
  ggplot(aes(x = obs, y = .cooksd)) +
  geom_point() +
  geom_segment(aes(x = obs, y = .cooksd, xend = obs, yend = 0))
```
:::

## Python

As always, there are different ways of doing this. Here we extract the Cook's d values from the `glm` object and put them in a Pandas DataFrame. We can then use that to plot them in a lollipop or stem plot.

```{python}
# extract the Cook's distances
glm_bks_py_resid = pd.DataFrame(
    glm_bks_py.get_influence().summary_frame()["cooks_d"]
)

# add row index
glm_bks_py_resid['obs'] = glm_bks_py_resid.reset_index().index
```

We now have two columns:

```{python}
glm_bks_py_resid
```

We can use these to create the plot:

```{python}
#| results: hide
(ggplot(glm_bks_py_resid,
        aes(x = "obs", y = "cooks_d")) +
    geom_segment(aes(x = "obs", y = "cooks_d", xend = "obs", yend = 0)) +
    geom_point())
```

:::

This shows that there are no very obvious influential points. You could regard point `34` as potentially influential (it's got a Cook's distance of around `0.8`), but I'm not overly worried.

If we were worried, we'd remove the troublesome data point, re-run the analysis and see if that changes the statistical outcome. If so, then our entire (statistical) conclusion hinges on one data point, which is not a very robust bit of research. If it *doesn't* change our significance, then all is well, even though that data point is influential.
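In Python, a minimal sketch of that check might look like this (assuming the `early_finches_py` DataFrame with its default integer row index and the `pointed_beak ~ blength` model from earlier; the `_sub` names are just for illustration):

```{python}
# drop the potentially influential observation (row label 34 in the
# default integer index) and refit the model on the remaining data
early_finches_sub = early_finches_py.drop(index = 34)

model_sub = smf.glm(formula = "pointed_beak ~ blength",
                    family = sm.families.Binomial(),
                    data = early_finches_sub)

glm_bks_sub_py = model_sub.fit()
print(glm_bks_sub_py.summary())
```

If the coefficients and p-values barely move, then the conclusion does not hinge on that one point.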

## Assessing significance

We can ask several questions.
@@ -543,40 +598,71 @@
Roughly speaking this asks "can our model predict our data reasonably well?"
Unfortunately, there isn’t a single command that does this for us, and we have to lift some of the numbers from the summary output ourselves.

```{r}
pchisq(9.1879, 59, lower.tail = FALSE)
```


Here, we’ve used the `pchisq()` function (which calculates the correct probability for us – ask if you want a hand-wavy explanation). The first argument is the residual deviance value from the summary table and the second argument is the residual degrees of freedom from the same table.

## Python

```{python}
from scipy.stats import chi2
chi2.sf(9.1879, 59)
```
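Incidentally, rather than copying the numbers out of the summary output, we could read them straight off the fitted model object. A quick sketch using the `deviance` and `df_resid` attributes that statsmodels provides:

```{python}
# residual deviance and residual degrees of freedom,
# taken directly from the fitted model object
chi2.sf(glm_bks_py.deviance, glm_bks_py.df_resid)
```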

:::

This gives us a probability of `1`. Formally, this is the p-value of a goodness-of-fit test on the residual deviance; informally, we can read it as how plausible it is that the model describes the data well. There aren’t any strict conventions on how to interpret this value but, for me, a tiny value would indicate a rubbish model.

**Is the overall model better than the null model?**

::: {.panel-tabset group="language"}
## R

```{r}
pchisq(84.5476 - 9.1879, 60 - 59, lower.tail = FALSE)
```

Here we’ve used the `pchisq` function again (if you didn’t ask before, you probably aren’t going to ask now).

## Python

First we need to define the null model:

```{python}
# create a generalised linear model (the null model, intercept only)
model = smf.glm(formula = "pointed_beak ~ 1",
family = sm.families.Binomial(),
data = early_finches_py)
# and get the fitted parameters of the model
glm_bks_null_py = model.fit()
print(glm_bks_null_py.summary())
```

In order to compare our original fitted model to the null model we need to know the deviances of both models and the residual degrees of freedom of both models, which we could get from the summary method.

```{python}
chi2.sf(84.5476 - 9.1879, 60 - 59)
```
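As before, we could also pull these values from the fitted model objects instead of the summary tables (a sketch, reusing `glm_bks_py` and `glm_bks_null_py`):

```{python}
# difference in deviance and in residual degrees of freedom
# between the null model and our fitted model
diff_deviance = glm_bks_null_py.deviance - glm_bks_py.deviance
diff_df = glm_bks_null_py.df_resid - glm_bks_py.df_resid

chi2.sf(diff_deviance, diff_df)
```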

:::

The first argument is the difference between the null and residual deviances and the second argument is the difference in degrees of freedom between the null and residual models. All of these values can be lifted from the summary table.

This gives us a probability of pretty much zero. This value is doing a formal test to see whether our fitted model is significantly different from the null model. Here we can treat this as a classical hypothesis test and, since this p-value is less than 0.05, we can say that our fitted model (with `blength` as a predictor variable) is definitely better than the null model (which has no predictor variables). Woohoo!

**Are any of the individual predictors significant?**

Finally, we’ll use the `anova()` function from before to determine which predictor variables are important, and specifically in this case whether the beak length predictor is significant.

::: {.panel-tabset group="language"}
## R

```{r}
anova(glm_bks, test = "Chisq")
```

The `anova()` function is a true workhorse within R! This time we’ve used it to create an Analysis of Deviance table. This is exactly equivalent to an ordinary ANOVA table where we have rows corresponding to each predictor variable and a p-value telling us whether that variable is significant or not.
@@ -585,6 +671,9 @@
The p-value for the `blength` predictor is written under the `Pr(>Chi)` column and you can see that it is significant.

This shouldn’t be surprising since we already saw that our overall model was better than the null model, which in this case is exactly the same as asking whether the beak length term is significant. However, in more complicated models with multiple predictors these two comparisons (and p-values) won’t be the same.

## Python

Alas, for some inexplicable reason there is no direct equivalent of this in Python (yet?). At least, not as far as I'm aware...
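That said, an analysis of deviance boils down to likelihood-ratio tests between nested models, so for a single predictor we can reproduce the `blength` row by hand; it is exactly the null-model comparison from above (a sketch, reusing the fitted objects):

```{python}
# likelihood-ratio test for the blength term: twice the difference in
# log-likelihood between the nested models, compared against a
# chi-squared distribution
lrt_stat = 2 * (glm_bks_py.llf - glm_bks_null_py.llf)
lrt_df = glm_bks_null_py.df_resid - glm_bks_py.df_resid

chi2.sf(lrt_stat, lrt_df)
```

For models with several predictors you would fit each nested pair and repeat this comparison term by term.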

:::
