add diff diagnostics #300
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Conversation
I think this is a good idea. It's possible that this doesn't actually break any packages that depend on loo. I can check by running reverse dependency checks once the errors in the code are fixed (see individual comments).

- The tests need to be updated, but we can wait to do that until we check that it doesn't break any reverse dependencies.
- We should put some information about this in the main loo_compare documentation, not only in the print method.
- I actually think it would be good to print these probabilities by default. What was the reason you turned it off by default? (Actually, right now the printing will error by default, but once fixed it will be off by default.) I don't think the printing should affect backwards compatibility (or am I missing something?). Adding more columns could potentially affect something, though, right? Or was there a different reason you turned off printing the probabilities by default?
Ok, let's change them to be printed by default. I had the column name as p_worse so that it would also tell the direction. Then I considered switching to pnorm, as it tells that it's based on the normal approximation, and then got stuck between the two.
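For readers of the thread, here is a minimal sketch of the kind of normal-approximation probability being discussed; the exact computation and the final column name in this PR may differ.

```r
# Normal approximation to the probability that a model is worse than the
# best model, i.e. P(true elpd difference < 0), given the estimated
# difference and its standard error. Illustrative only, not the PR's code.
p_worse <- function(elpd_diff, se_diff) {
  pnorm(0, mean = elpd_diff, sd = se_diff)
}

p_worse(elpd_diff = -6.2, se_diff = 3.1)  # approximately 0.98
```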
Setting …

I would love to get rid of …
R/loo_compare.R (outdated):

```r
diag_pnorm[khat_diff > 0.5] <- paste0("khat_diff > 0.5")
}
rownames(comp) <- rnms
comp <- cbind(data.frame(elpd_diff = elpd_diff, se_diff = se_diff,
```
This also changes the returned object to a data frame when it used to be a matrix. That could potentially cause reverse dependency issues, but we'll see when I run the checks. This is a necessary change, though, since we're mixing numeric columns with text columns for the diagnostic.
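To illustrate the matrix vs. data frame point with made-up values: a matrix can only hold one type, so adding a character diagnostic column would coerce the numeric elpd columns to character, whereas a data frame keeps each column's type.

```r
# A matrix coerces everything to character once a text column is added:
m <- cbind(elpd_diff = c(0, -4.3), se_diff = c(0, 1.9),
           diag = c("", "khat_diff > 0.5"))
class(m[, "elpd_diff"])  # "character"

# A data frame keeps the numeric columns numeric:
d <- data.frame(elpd_diff = c(0, -4.3), se_diff = c(0, 1.9),
                diag = c("", "khat_diff > 0.5"))
class(d$elpd_diff)  # "numeric"
```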
Codecov Report (coverage diff and impacted files):

```
@@            Coverage Diff             @@
##           master     #300      +/-   ##
==========================================
- Coverage   92.78%   92.52%   -0.26%
==========================================
  Files          31       31
  Lines        2992     3036      +44
==========================================
+ Hits         2776     2809      +33
- Misses        216      227      +11
```
I fixed the tests so this is passing now. I will run reverse dependency checks over the weekend, hopefully. I just put one question for you @avehtari in the review comments about how to label the diagnostic.
Starting a comment to keep track of reverse dependency issues. The reverse dependency checks are still running (they will take a while). I will update here as I discover new issues.
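For anyone curious what running the reverse dependency checks typically involves, a common approach uses the revdepcheck package; this is a generic sketch, not necessarily the exact commands being run for loo here.

```r
# revdepcheck is GitHub-only (r-lib/revdepcheck), not on CRAN:
# remotes::install_github("r-lib/revdepcheck")
library(revdepcheck)

# Run from the package source directory; checks the CRAN packages that
# depend on the package, in parallel.
revdep_check(num_workers = 4)

# Write summary reports under revdep/ once the checks finish.
revdep_report()
```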
@avehtari I just added that. I will make a new list of the reverse dependencies that need to be fixed, since it's not identical to the list in my previous comment. Also, @avehtari will work on improved documentation, and I will work on updating loo_compare for subsampling to use a data frame and add the new diagnostic columns.
Here's an example of how rstanarm output looks using this branch. There are two methods. The first just uses regular loo_compare() on the loo objects:

```r
library(rstanarm)

fit_wt <- stan_glm(mpg ~ wt, data = mtcars, refresh = 0)
fit_wt_cyl <- stan_glm(mpg ~ wt + cyl, data = mtcars, refresh = 0)
fit_wt_cyl_gear <- stan_glm(mpg ~ wt + cyl + gear, data = mtcars, refresh = 0)

loo_wt <- loo(fit_wt)
loo_wt_cyl <- loo(fit_wt_cyl)
loo_wt_cyl_gear <- loo(fit_wt_cyl_gear)

loo_compare(loo_wt, loo_wt_cyl, loo_wt_cyl_gear)
```

The second method adds the loo objects to the fitted model objects and compares them via stanreg_list():

```r
fit_wt$loo <- loo_wt
fit_wt_cyl$loo <- loo_wt_cyl
fit_wt_cyl_gear$loo <- loo_wt_cyl_gear

loo_compare(stanreg_list(fit_wt, fit_wt_cyl, fit_wt_cyl_gear), detail = TRUE)
```
I have tested model as the column name, and it works well. I added yet another column (diag_elpd).

The roaches example has an interesting case where in the end two models are compared, and the khat for the pointwise elpds of both models is < 0.5, but the khat for the pointwise differences is > 0.5. This is possible due to the dependency arising from both models predicting the same target. This, however, goes slightly beyond what is said in the LOO uncertainty paper.

You can see what the comparison tables look like with tinytable in the linked case study.
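As a side note for anyone following along, one way to look at the tail shape of the pointwise elpd differences directly is pareto_khat() from the posterior package; this is only a sketch of the idea (using hypothetical loo1 and loo2 objects), not the code used in this PR.

```r
library(posterior)

# loo1 and loo2 are assumed to be loo objects for two models fit to the
# same data, so their pointwise elpd values can be subtracted row by row.
elpd_diff_i <- loo1$pointwise[, "elpd_loo"] - loo2$pointwise[, "elpd_loo"]

# The khat of the pointwise differences can exceed 0.5 even when the khat
# of each model's own pointwise elpd values stays below 0.5.
pareto_khat(elpd_diff_i)
pareto_khat(loo1$pointwise[, "elpd_loo"])
pareto_khat(loo2$pointwise[, "elpd_loo"])
```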
It looks good, although none of the examples in this one have anything in the diag_elpd column.
That's why I also linked the roaches case study, which has stuff in the diag_elpd column.
Oops, sorry I forgot to look at that one!
Maybe change …
Ok yeah that sounds good.
I made some small doc edits, fixed the tests, and moved the computation of …
Hi,

First of all, thank you for everything. I really like printing the diagnostics in the output of loo_compare.

I think I would prefer the reporting of all diagnostic warnings, instead of only the most important one. A lot of analyzed datasets are smaller than they should be, and therefore the warning about small sample size alone is often not that helpful.

I also think …

Can I use it to compare model comparisons of the same models fitted with different datasets?

Would you consider renaming p_worse (e.g. to something starting with prob)?

Again, thank you for your work and all the best,
@ndevln, thanks for the feedback! And thanks for commenting here on GitHub; as you can see, my answer is quite long for bsky.
The challenge is to make them all fit. By default the output is a printed data frame, and if all the printed columns are too wide, the output is split and looks messy. We have also discussed an option of showing only footnote numbers for the diagnostics and then printing footnotes with longer text, but that is also a bit complicated. Another option would be to add an option to compute and show all the diagnostics. The logic of showing only one of them is the following: …
Yes, because the models have been ordered.
In the examples with N < 100, diff_se is underestimated and the probabilities are estimated to be too close to 1. Maybe this is what you are seeing? In addition, in the examples with N > 100 and |elpd_diff| < 4, diff_se is also likely to be too small and the estimated probabilities too close to 1. If we pick examples which illustrate the different diagnostics, we are likely to end up with estimated probabilities that are often too close to 1.
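To make that concrete with made-up numbers (not taken from any example above): the normal-approximation probability depends on the ratio elpd_diff / se_diff, so an underestimated se_diff pushes the probability toward 1.

```r
elpd_diff <- -4   # hypothetical difference to the best model
se_ok     <- 4    # hypothetical well-estimated SE of the difference
se_small  <- 2    # hypothetical underestimated SE (e.g. small N)

pnorm(0, mean = elpd_diff, sd = se_ok)     # P(true diff < 0) ~ 0.84
pnorm(0, mean = elpd_diff, sd = se_small)  # ~ 0.98, looks more decisive
```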
No. We expect values above 0.5. Your question did make me think that we could add an additional correction for using the ordered elpd_diffs, but that is not easy unless we have a large number of models to compare.
It's p because the equations use p(). I've not seen anyone propose before that we should give up using p() in equations. Instead of prob, Pr would be more logical, as it is sometimes used in equations that evaluate only to a probability (while p() is used for both density and probability).
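As a purely notational illustration of what such a probability denotes under the normal approximation (the exact definition used in the PR may differ):

$$
\Pr(\mathrm{elpd}_{\mathrm{diff}} < 0) \;\approx\; \Phi\!\left(\frac{0 - \widehat{\mathrm{elpd}}_{\mathrm{diff}}}{\widehat{\mathrm{se}}_{\mathrm{diff}}}\right)
$$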
We could consider …
Hi, thanks for entertaining my input. Just as context: I teach epidemiologists with differing degrees of statistical background, and because of this I am maybe a little bit too cautious. I think this package should also be used by these people, and I think your package is fantastic.

**Diagnostics**

For the diagnostics, I think an option to show all warnings would be nice, even if the value of the warnings is limited. In infectious disease surveillance you want your datasets to be small, because if you have large datasets you failed in containing the disease. A small sample size warning is therefore often not helpful, since your data often still contains useful information. And this data has to be published to be used in a meta-analysis at a later time. If you consider adding an option, maybe a …

**p_worse**

When thinking about this a little bit longer, I realized my worries with p_worse are maybe more about the interpretation of the probability. While I think …

My argument with the AUC was way simpler. When looking at the examples, I was unsure how to interpret p_worse. The main tradeoff is often between elpd and parsimony. And since p_worse is often above 0.9, I would want p_worse to be close to 1 to consider a more complex model superior (everything else being equal). This is what is unclear to me. Did I get the right idea?

Another thought in this regard: …

I hope you find my thoughts helpful, and sorry for the rambling.

Greetings,
@ndevln, thanks for the additional comments! I've been busy with a grant proposal, but now that the deadline has passed I will soon modify one of the case studies to provide more information about the diagnostics and the interpretation of the results. I'll comment here and tag you when it's ready, and we'll see if it helps to clarify some things. Your feedback will be useful for us to see whether we are able to explain things better.
@ndevln, and don't be sorry. It's great that you rambled and wrote what you think even when being uncertain (not everyone has the courage), as that also helps us to see where we need to improve the documentation to reduce such uncertainty.
The corresponding issue: #299
This PR is not ready to merge and still needs some thought.