python poisson
mvanrongen committed Feb 6, 2024
1 parent 5eb10df commit 6d93e36
Showing 12 changed files with 312 additions and 211 deletions.

8 changes: 4 additions & 4 deletions _site/search.json

Large diffs are not rendered by default.

386 changes: 193 additions & 193 deletions materials/data/seatbelts.csv

Large diffs are not rendered by default.

17 changes: 15 additions & 2 deletions materials/glm-practical-logistic-proportion.qmd
@@ -347,7 +347,7 @@ We get quite a high score (around 0.9) for this, which tells us that our goodnes
Is the model any better than the null though?

```{r}
- 1 - pchisq(16.375 - 12.633, 1)
+ 1 - pchisq(16.375 - 12.633, 21 - 20)
anova(glm_chl_new, test = 'Chisq')
```
@@ -414,8 +414,21 @@ We get quite a high score (around 0.9) for this, which tells us that our goodnes

Is the model any better than the null though?

First we need to define the null model:

```{python}
# create a linear model
model = smf.glm(formula = "damage + intact ~ 1",
family = sm.families.Binomial(),
data = challenger_new_py)
# and get the fitted parameters of the model
glm_chl_new_null_py = model.fit()
print(glm_chl_new_null_py.summary())
```

```{python}
- chi2.sf(16.375 - 12.633, 23 - 22)
+ chi2.sf(16.375 - 12.633, 21 - 20)
```

However, the model is not significantly better than the null in this case, with a p-value here of just over 0.05 for both of these tests (they give a similar result since, yet again, we have just the one predictor variable).
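
As a cross-check, the same test can be computed from the fitted model objects rather than typing the deviances and degrees of freedom in by hand. A minimal sketch, assuming the full model from earlier in this section is stored as `glm_chl_new_py` and the null model as `glm_chl_new_null_py` (defined above):

```{python}
from scipy.stats import chi2   # already loaded earlier in the materials

# likelihood ratio test taken straight from the fitted objects:
# null deviance minus residual deviance, on the difference in
# residual degrees of freedom
lr_stat = glm_chl_new_null_py.deviance - glm_chl_new_py.deviance
df_diff = glm_chl_new_null_py.df_resid - glm_chl_new_py.df_resid

chi2.sf(lr_stat, df_diff)
```
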
104 changes: 96 additions & 8 deletions materials/glm-practical-poisson.qmd
@@ -392,18 +392,40 @@ We have to convert the `law` column to a factor, otherwise R will see it as nume

```{r}
seatbelts %>%
- ggplot(aes(as_factor(law), drivers_killed)) +
+ ggplot(aes(as_factor(law), casualties)) +
geom_boxplot()
```

The data are recorded by month and year, so we can also display the number of casualties by year:

```{r}
seatbelts %>%
- ggplot(aes(year, drivers_killed)) +
+ ggplot(aes(year, casualties)) +
geom_point()
```

## Python

We have to treat the `law` column as a categorical variable (here by casting it with `.astype(object)`), otherwise Python will see it as numerical.

```{python}
#| results: hide
(ggplot(seatbelts_py,
aes(x = seatbelts_py.law.astype(object),
y = "casualties")) +
geom_boxplot())
```

The data are recorded by month and year, so we can also display the number of casualties by year:

```{python}
#| results: hide
(ggplot(seatbelts_py,
aes(x = "year",
y = "casualties")) +
geom_point())
```

:::

The data look a bit weird. There is quite some variation within years (keeping in mind that the data are aggregated monthly). The data also seem to wave around a bit... with some vague peaks (e.g. 1972 - 1973) and some troughs (e.g. around 1976).
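
One way to put a number on that within-year variation is to look at the mean and variance of the monthly counts for each year. A quick sketch, assuming the `seatbelts_py` data frame used in the Python tab above:

```{python}
# mean and variance of the monthly casualty counts within each year
seatbelts_py.groupby("year")["casualties"].agg(["mean", "var"])
```

For a Poisson model we would expect the variance to be roughly equal to the mean, so this is a useful thing to keep an eye on.
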
@@ -416,7 +438,7 @@ So my initial thought is that these data are going to be a bit tricky to interpr
## R

```{r}
- glm_stb <- glm(drivers_killed ~ year,
+ glm_stb <- glm(casualties ~ year,
data = seatbelts, family = "poisson")
```

@@ -428,13 +450,37 @@ summary(glm_stb)

```
(Intercept) 37.168958
- year 0.016373
+ year -0.016373
```

## Python

```{python}
# create a linear model
model = smf.glm(formula = "casualties ~ year",
family = sm.families.Poisson(),
data = seatbelts_py)
# and get the fitted parameters of the model
glm_stb_py = model.fit()
```

```{python}
print(glm_stb_py.summary())
```

```
======================
coef
----------------------
Intercept 37.1690
year -0.0164
======================
```
:::

These are the coefficients of the Poisson model equation and need to be placed in the following formula in order to estimate the expected number of casualties as a function of year:

- $$ E(drivers\_killed) = \exp(37.17 + 0.164 \times year) $$
+ $$ E(casualties) = \exp(37.17 - 0.0164 \times year) $$
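
As a quick sanity check of this formula, we can compare a hand-calculated prediction against the model's own `predict()` method. A sketch, assuming the fitted `glm_stb_py` object from above; note that the full-precision coefficients are used, because rounding them matters once they go through `exp()`:

```{python}
import numpy as np
import pandas as pd

# full-precision coefficients from the fitted model (assumed to be glm_stb_py)
coefs = glm_stb_py.params

# expected number of casualties in 1980, calculated by hand from the formula
print(np.exp(coefs["Intercept"] + coefs["year"] * 1980))

# the model's own prediction for the same year, which should match
print(glm_stb_py.predict(pd.DataFrame({"year": [1980]})))
```
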

#### Assessing significance

@@ -447,9 +493,15 @@ Is the model well-specified?
1 - pchisq(850.41, 190)
```

## Python

```{python}
chi2.sf(850.41, 190)
```
:::

This value is essentially zero, which suggests that the model is not particularly well-specified. Remember, it is between $[0, 1]$ and the closer to 1 (or at least above 0.05), the better the fit.
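
The two numbers in this check are not magic: 850.41 should be the residual deviance of the fitted model and 190 its residual degrees of freedom, so the same value can be computed directly from the model object. A sketch, assuming `glm_stb_py` as fitted above:

```{python}
# the same check, pulling the residual deviance and residual degrees of
# freedom straight from the fitted model instead of typing them in
chi2.sf(glm_stb_py.deviance, glm_stb_py.df_resid)
```
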

How about the overall fit?

::: {.panel-tabset group="language"}
@@ -459,10 +511,28 @@ How about the overall fit?
1 - pchisq(984.50 - 850.41, 191 - 190)
```

## Python

First we need to define the null model:

```{python}
# create a linear model
model = smf.glm(formula = "casualties ~ 1",
family = sm.families.Poisson(),
data = seatbelts_py)
# and get the fitted parameters of the model
glm_stb_null_py = model.fit()
print(glm_stb_null_py.summary())
```

```{python}
chi2.sf(984.50 - 850.41, 191 - 190)
```
:::

Again, this indicates that the model is markedly better than the null model.
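
The same comparison can also be taken straight from the two fitted objects (`glm_stb_py` and `glm_stb_null_py`, both defined above), which avoids copying the deviances across by hand:

```{python}
# likelihood ratio test of the full model against the null model, using the
# deviances and degrees of freedom stored in the fitted objects
chi2.sf(glm_stb_null_py.deviance - glm_stb_py.deviance,
        glm_stb_null_py.df_resid - glm_stb_py.df_resid)
```
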

#### Plotting the regression

::: {.panel-tabset group="language"}
@@ -471,13 +541,31 @@
```{r}
#| message: false
#| warning: false
- ggplot(seatbelts, aes(year, drivers_killed)) +
+ ggplot(seatbelts, aes(year, casualties)) +
geom_point() +
geom_smooth(method = "glm", se = FALSE, fullrange = TRUE,
method.args = list(family = poisson)) +
xlim(1970,1985)
```

## Python

```{python}
# create a data frame with the years we want to predict for
model = pd.DataFrame({'year': list(range(1968, 1985))})
# add the model's predicted number of casualties for each of those years
model["pred"] = glm_stb_py.predict(model)
model.head()
```

```{python}
#| results: hide
(ggplot(seatbelts_py,
aes(x = "year",
y = "casualties")) +
geom_point() +
geom_line(model, aes(x = "year", y = "pred"), colour = "blue", size = 1))
```
:::


