@mvanrongen and @Vicki-H here is some feedback on the materials as run on the June 2024 course iteration (feedback on slides sent separately):
here - you say we are testing for beta1 = 0, but then mention ANOVA. The ANOVA compares the residuals from the linear model vs the null model (your horizontal line), which is not quite the same thing. I know (and agree) you want to keep things simple here, but it might be worth rephrasing it slightly so it's not misleading.
here - the outcome variable can also be categorical. E.g. in multinomial logistic regression, which is a generalisation of binary logistic regression for when we have more than 2 categories. In terms of phrasing, we're referring to things as "the outcome is distributed as X", but that's not quite right; phrasing it this way might make people think that if they look at the distribution of their data and it doesn't look normal/Poisson/whatever, then the model assumptions are violated.
here - I think this is not the link function but its inverse. The logit link is log(p/(1-p)), which I actually think is quite a bit simpler to look at. So you could rewrite this to say: "as we said, we model our outcome variable as the probability of having a pointed beak and use the link function in the equation, which becomes $log(\frac{p}{1-p}) = \beta_0 + \beta_1X$".
Later on you manually calculate the inverse of the logit to get back your response values, which I guess is fine. But you could also use the plogis(x) function to get those values. You could explain this by saying "the outcome curve we get from our model is known as a logistic curve, and to get the probability at a given point on that curve we can use the plogis() function. For example, plogis(-43.41 + 3.39 * 15)".
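To illustrate (reusing the coefficient values already quoted above), plogis() and the manual inverse-logit calculation give identical answers:

```r
# plogis() is the inverse logit: plogis(x) = 1 / (1 + exp(-x)).
# Coefficients are the ones quoted in the materials (beta0 = -43.41, beta1 = 3.39);
# here we get the predicted probability of a pointed beak at a beak length of 15.
manual     <- 1 / (1 + exp(-(-43.41 + 3.39 * 15)))
via_plogis <- plogis(-43.41 + 3.39 * 15)
all.equal(manual, via_plogis)  # TRUE: the two approaches are identical
```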
Generally, it might be important to include something about log odds. It can be confusing, but probably important in particular clinical audiences where those terms are used a lot in the context of logistic regressions. At least one participant asked me about this, as they were trying to understand some work their collaborators had done.
here - What you are plotting is the outcome of the model glm(prop_damaged ~ temp), which is not quite right (and the warning is indicating this - the model doesn't expect non-integer values; it expects either a single binary 0/1 outcome or a 2-column matrix of counts). The curve is roughly right in this case, but I don't know if that will always be the case. In general, I'm not sure it's a good idea to rely on geom_smooth() to plot these curves; as the models become more complex, it might be harder to specify them through this interface. It's a bit of work, but extracting predictions from the model and then adding them to the plot is a more general approach; also see "add confidence intervals to model outcomes" #1 for a related point on visualising these.
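As a sketch of that predict()-based alternative (simulated data and made-up column names, since I don't have the course dataset to hand):

```r
library(ggplot2)

# Simulated stand-in for the damage data (names are placeholders)
set.seed(1)
damage <- data.frame(temp = 10:30, total = 6)
damage$damaged <- rbinom(nrow(damage), size = damage$total,
                         prob = plogis(-5 + 0.3 * damage$temp))
damage$prop_damaged <- damage$damaged / damage$total

# Fit with a proper 2-column (successes, failures) outcome, avoiding the warning
fit <- glm(cbind(damaged, total - damaged) ~ temp,
           family = binomial, data = damage)

# Predict on a fine grid, on the response (probability) scale,
# then overlay the predictions on the raw proportions
new_data <- data.frame(temp = seq(10, 30, length.out = 100))
new_data$pred <- predict(fit, newdata = new_data, type = "response")

ggplot(damage, aes(temp, prop_damaged)) +
  geom_point() +
  geom_line(data = new_data, aes(temp, pred))
```

This generalises straightforwardly to models with several predictors, which geom_smooth() can't easily express.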
here - it could be nice to have an illustration of "saturated model", just for visual purposes.
here - can you point us to a link in case we don't remember what this is
here - when you say "spread out as we expect", you mean for a Poisson model; there is no reason we expect that in general. Because dispersion is calculated as variance/mean (which you could mention to make it clearer), maybe it could be rephrased to say "a dispersion of 1 signifies that the variance = mean; values <1 are referred to as underdispersion (the variance is smaller than the mean); values >1 as overdispersion (the variance is greater than the mean)".
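A quick simulated illustration of that variance/mean definition:

```r
set.seed(1)
# Poisson: variance equals the mean, so dispersion is ~1
x <- rpois(1e5, lambda = 4)
var(x) / mean(x)

# Negative binomial: variance = mu + mu^2/size > mu, so dispersion is > 1
# (i.e. overdispersed relative to Poisson)
y <- rnbinom(1e5, mu = 4, size = 1)
var(y) / mean(y)
```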
here - this is not quite right. Zero-inflation is a problem in and of itself and negative binomial cannot solve it. You can have Poisson models with zero-inflation as well as NB models with zero-inflation. In single-cell transcriptomics this was actually an issue a few years ago where there were a lot of "dropouts" in gene quantification leading people to develop zero-inflated NB models to capture that feature in the data. I think it would be good to remove zero inflation from here, as it's a different concept (which might deserve its own section in the future). If you want an example of overdispersion, I can give you some examples from transcriptomics.
here - the exp() comes a bit out of nowhere. I think it would have been good to start the section mentioning the link function, to relate it to previous chapters. In this case the link is log(Y) and so we then use its inverse, which is exp(). In general, to convey the idea that we are always doing more or less the same thing, I think it would be helpful to keep the materials a bit "repetitive" across sections. We look at the kind of data we have, we find a suitable distribution to capture it, we find what the link and its inverse is, we fit the model, we do the model checks. Hopefully the repetition conveys to people that if they want to do a different kind of model in the future, the same concepts still apply.
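For instance, in a Poisson model the link is log(), so exp() is the inverse link that takes us back to the response scale; predict(..., type = "response") does the same back-transformation (simulated data here, just to show the equivalence):

```r
# Simulated count data with a log-linear relationship
set.seed(1)
d <- data.frame(x = 1:20)
d$y <- rpois(20, lambda = exp(0.1 + 0.15 * d$x))

fit <- glm(y ~ x, family = poisson, data = d)

# Fitted value at x = 10, back-transformed manually via the inverse link...
manual <- exp(coef(fit)[1] + coef(fit)[2] * 10)
# ...which is exactly what predict() on the response scale returns
via_predict <- predict(fit, newdata = data.frame(x = 10), type = "response")
```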
here - the "zero area island" gives a nice opportunity for (somewhere) talking about centering predictor variables on their mean, i.e. area_centered = area - mean(area). That way the interpretation of the intercept would be meaningful as the number of species in an island of average size. I know it would be a distraction here, but perhaps the parenthesis could point to a box somewhere where you demonstrate this?
here - the seatbelt example: technically everything looks fine, I guess. But this could also be an opportunity to discuss that, despite all the p-values being "what we want", the model is clearly not capturing important nuances in the data and so should be interpreted with care. As said earlier, there are fluctuations across years, policy may have changed over these decades, there might be seasonal factors (e.g. more accidents in the holiday season, when people travel and drink more), etc. You could also use this as an opportunity to discuss the assumption of independence, which is probably not met here.
here - similar to comment above, I think rather than saying "the model equation" you should bring back the concept of link function and say the link for NB is also log(), therefore we use the same inverse link function exp() to get our predicted values.
here - this is a more general comment also for core stats. I think stepwise model selection is generally discouraged these days, as there are several issues with it. Either you specify the model that makes sense from your knowledge of the system and that's the model you work with (if some coefficients are very small, that's fine, it was part of your initial question and so you can report "this coefficient was very small"). Or you are interested in doing variable selection, maybe because you have too many variables and not enough data, in which case regularisation methods like lasso are more suitable, not stepwise regression.
here - the negative binomial exercise I think needs a hint, the question feels a bit open-ended. Remember we have a .callout-hint box you could use here.
Other content that seems useful to include:
I think we need more on model visualisation. I'm not sure relying on geom_smooth() is the best as it doesn't generalise for slightly more complex models (e.g. with two predictors).
Although I appreciate the intention of keeping things simple, I guess most people will be fitting models with more than 1 predictor, so I think it would be useful to show how to visualise/explore model outcomes in those scenarios (i.e. how to stratify predictions for different predictor levels). For example, a participant in the course was looking at a potential marker for early dementia (binary outcome variable). They included as predictors in the model the marker (continuous), adjusting for age (continuous) and sex (categorical). My suggestion was for them to visualise a scatterplot of marker vs outcome and overlay the model prediction for "male" and "female" separately with 3 lines each for median, lower and upper quartile of age. We did this by creating new data.frames and feeding that into the newdata argument of predict().
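For reference, this is roughly what we did, with all variable names made up for illustration (binary outcome ~ continuous marker, adjusting for age and sex):

```r
# Simulated stand-in for the participant's data (all names hypothetical)
set.seed(1)
d <- data.frame(marker = rnorm(200),
                age    = runif(200, 50, 90),
                sex    = sample(c("male", "female"), 200, replace = TRUE))
d$dementia <- rbinom(200, 1, plogis(-1 + 1.5 * d$marker + 0.02 * (d$age - 70)))

fit <- glm(dementia ~ marker + age + sex, family = binomial, data = d)

# Predictions over the marker range, stratified by sex,
# at the lower quartile, median, and upper quartile of age:
new_data <- expand.grid(
  marker = seq(min(d$marker), max(d$marker), length.out = 50),
  age    = quantile(d$age, c(0.25, 0.5, 0.75)),
  sex    = c("male", "female"))
new_data$pred <- predict(fit, newdata = new_data, type = "response")
```

Each (sex, age) combination in new_data then gives one prediction curve to overlay on the scatterplot of marker vs outcome.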