python poisson
mvanrongen committed Feb 6, 2024
1 parent 5eb10df commit 6d93e36
Showing 12 changed files with 312 additions and 211 deletions.

8 changes: 4 additions & 4 deletions _site/search.json

Large diffs are not rendered by default.

386 changes: 193 additions & 193 deletions materials/data/seatbelts.csv

Large diffs are not rendered by default.

17 changes: 15 additions & 2 deletions materials/glm-practical-logistic-proportion.qmd
@@ -347,7 +347,7 @@ We get quite a high score (around 0.9) for this, which tells us that our goodnes
Is the model any better than the null though?

```{r}
- 1 - pchisq(16.375 - 12.633, 1)
+ 1 - pchisq(16.375 - 12.633, 21 - 20)
anova(glm_chl_new, test = 'Chisq')
```
@@ -414,8 +414,21 @@ We get quite a high score (around 0.9) for this, which tells us that our goodnes

Is the model any better than the null though?

First we need to define the null model:

```{python}
# create a linear model
model = smf.glm(formula = "damage + intact ~ 1",
family = sm.families.Binomial(),
data = challenger_new_py)
# and get the fitted parameters of the model
glm_chl_new_null_py = model.fit()
print(glm_chl_new_null_py.summary())
```

```{python}
- chi2.sf(16.375 - 12.633, 23 - 22)
+ chi2.sf(16.375 - 12.633, 21 - 20)
```

However, the model is not significantly better than the null in this case, with a p-value here of just over 0.05 for both of these tests (they give a similar result since, yet again, we have just the one predictor variable).
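
As a cross-check, the same test can be computed from the fitted model objects rather than typing the deviances and degrees of freedom in by hand. A minimal sketch, assuming the full model from earlier in this section is stored as `glm_chl_new_py` and the null model as `glm_chl_new_null_py` (defined above):

```{python}
from scipy.stats import chi2   # already loaded earlier in the materials

# likelihood ratio test taken straight from the fitted objects:
# null deviance minus residual deviance, on the difference in
# residual degrees of freedom
lr_stat = glm_chl_new_null_py.deviance - glm_chl_new_py.deviance
df_diff = glm_chl_new_null_py.df_resid - glm_chl_new_py.df_resid

chi2.sf(lr_stat, df_diff)
```
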
104 changes: 96 additions & 8 deletions materials/glm-practical-poisson.qmd
@@ -392,18 +392,40 @@ We have to convert the `law` column to a factor, otherwise R will see it as nume

```{r}
seatbelts %>%
- ggplot(aes(as_factor(law), drivers_killed)) +
+ ggplot(aes(as_factor(law), casualties)) +
geom_boxplot()
```

The data are recorded by month and year, so we can also display the number of casualties by year:

```{r}
seatbelts %>%
- ggplot(aes(year, drivers_killed)) +
+ ggplot(aes(year, casualties)) +
geom_point()
```

## Python

We have to treat the `law` column as a categorical variable (here by casting it with `.astype(object)`), otherwise Python will see it as numerical.

```{python}
#| results: hide
(ggplot(seatbelts_py,
aes(x = seatbelts_py.law.astype(object),
y = "casualties")) +
geom_boxplot())
```

The data are recorded by month and year, so we can also display the number of casualties by year:

```{python}
#| results: hide
(ggplot(seatbelts_py,
aes(x = "year",
y = "casualties")) +
geom_point())
```

:::

The data look a bit weird. There is quite some variation within years (keeping in mind that the data are aggregated monthly). The data also seem to wave around a bit... with some vague peaks (e.g. 1972 - 1973) and some troughs (e.g. around 1976).
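
One way to put a number on that within-year variation is to look at the mean and variance of the monthly counts for each year. A quick sketch, assuming the `seatbelts_py` data frame used in the Python tab above:

```{python}
# mean and variance of the monthly casualty counts within each year
seatbelts_py.groupby("year")["casualties"].agg(["mean", "var"])
```

For a Poisson model we would expect the variance to be roughly equal to the mean, so this is a useful thing to keep an eye on.
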
@@ -416,7 +438,7 @@ So my initial thought is that these data are going to be a bit tricky to interpr
## R

```{r}
- glm_stb <- glm(drivers_killed ~ year,
+ glm_stb <- glm(casualties ~ year,
data = seatbelts, family = "poisson")
```

@@ -428,13 +450,37 @@ summary(glm_stb)

```
(Intercept) 37.168958
- year 0.016373
+ year -0.016373
```

## Python

```{python}
# create a linear model
model = smf.glm(formula = "casualties ~ year",
family = sm.families.Poisson(),
data = seatbelts_py)
# and get the fitted parameters of the model
glm_stb_py = model.fit()
```

```{python}
print(glm_stb_py.summary())
```

```
======================
coef
----------------------
Intercept 37.1690
year -0.0164
======================
```
:::

These are the coefficients of the Poisson model equation and need to be placed in the following formula in order to estimate the expected number of casualties as a function of year:

- $$ E(drivers\_killed) = \exp(37.17 + 0.164 \times year) $$
+ $$ E(casualties) = \exp(37.17 - 0.0164 \times year) $$
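
As a quick sanity check of this formula, we can compare a hand-calculated prediction against the model's own `predict()` method. A sketch, assuming the fitted `glm_stb_py` object from above; note that the full-precision coefficients are used, because rounding them matters once they go through `exp()`:

```{python}
import numpy as np
import pandas as pd

# full-precision coefficients from the fitted model (assumed to be glm_stb_py)
coefs = glm_stb_py.params

# expected number of casualties in 1980, calculated by hand from the formula
print(np.exp(coefs["Intercept"] + coefs["year"] * 1980))

# the model's own prediction for the same year, which should match
print(glm_stb_py.predict(pd.DataFrame({"year": [1980]})))
```
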

#### Assessing significance

@@ -447,9 +493,15 @@ Is the model well-specified?
1 - pchisq(850.41, 190)
```

## Python

```{python}
chi2.sf(850.41, 190)
```
:::

This value is essentially zero, which suggests that the model is not particularly well-specified. Remember, it is between $[0, 1]$ and the closer to 1 (or at least above 0.05), the better the fit.
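
The two numbers in this check are not magic: 850.41 should be the residual deviance of the fitted model and 190 its residual degrees of freedom, so the same value can be computed directly from the model object. A sketch, assuming `glm_stb_py` as fitted above:

```{python}
# the same check, pulling the residual deviance and residual degrees of
# freedom straight from the fitted model instead of typing them in
chi2.sf(glm_stb_py.deviance, glm_stb_py.df_resid)
```
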

How about the overall fit?

::: {.panel-tabset group="language"}
@@ -459,10 +511,28 @@ How about the overall fit?
1 - pchisq(984.50 - 850.41, 191 - 190)
```

## Python

First we need to define the null model:

```{python}
# create a linear model
model = smf.glm(formula = "casualties ~ 1",
family = sm.families.Poisson(),
data = seatbelts_py)
# and get the fitted parameters of the model
glm_stb_null_py = model.fit()
print(glm_stb_null_py.summary())
```

```{python}
chi2.sf(984.50 - 850.41, 191 - 190)
```
:::

Again, this indicates that the model is markedly better than the null model.
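
The same comparison can also be taken straight from the two fitted objects (`glm_stb_py` and `glm_stb_null_py`, both defined above), which avoids copying the deviances across by hand:

```{python}
# likelihood ratio test of the full model against the null model, using the
# deviances and degrees of freedom stored in the fitted objects
chi2.sf(glm_stb_null_py.deviance - glm_stb_py.deviance,
        glm_stb_null_py.df_resid - glm_stb_py.df_resid)
```
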

#### Plotting the regression

::: {.panel-tabset group="language"}
@@ -471,13 +541,31 @@
```{r}
#| message: false
#| warning: false
- ggplot(seatbelts, aes(year, drivers_killed)) +
+ ggplot(seatbelts, aes(year, casualties)) +
geom_point() +
geom_smooth(method = "glm", se = FALSE, fullrange = TRUE,
method.args = list(family = poisson)) +
xlim(1970,1985)
```

## Python

```{python}
# create a data frame with the years we want to predict for
model = pd.DataFrame({'year': list(range(1968, 1985))})
# add the model's predicted number of casualties for each of those years
model["pred"] = glm_stb_py.predict(model)
model.head()
```

```{python}
#| results: hide
(ggplot(seatbelts_py,
aes(x = "year",
y = "casualties")) +
geom_point() +
geom_line(model, aes(x = "year", y = "pred"), colour = "blue", size = 1))
```
:::


