
Conversation

@avehtari
Member

The corresponding issue #299

This PR is not ready for merge; it needs some thought

@avehtari avehtari requested review from jgabry and n-kall August 31, 2025 17:00
@avehtari
Member Author

@jgabry , we discussed this with @n-kall, and I added an option to the print function so that the new information is not shown by default. Would you have time to look at this?

Member

@jgabry jgabry left a comment

  • I think this is a good idea. It's possible that this doesn't actually break any packages that depend on loo. I can check by running reverse dependency checks once the errors in the code are fixed (see individual comments).

  • The tests need to be updated, but we can wait to do that until we check that it doesn't break any reverse dependencies

  • We should put some information about this in the main loo_compare documentation, not only in the print method

  • I actually think it would be good to print these probabilities by default. What was the reason you turned it off by default? (Right now the printing errors by default, but once fixed it will be off by default.) I don't think the printing should affect backwards compatibility (or am I missing something?). Adding more columns could potentially affect something, right? Or was there a different reason you turned off printing the probabilities by default?

@avehtari
Member Author

Ok, let's change them to be printed by default.

I had the column name as p_worse so that it also tells the direction. Then I considered switching to pnorm, as it tells that it's based on the normal approximation, and ended up somewhere in between.

@jgabry
Member

jgabry commented Oct 17, 2025

Setting simplify=FALSE when calling the print method also errors due to a weird interaction with the p_worse. But I think we can get rid of the simplify argument. I've never seen anyone use it and I doubt it will break any packages (we can find out by reverse dependency checks). I will push a commit fixing this.

@avehtari
Member Author

I would love to get rid of the simplify argument to make it more difficult to print looic, but I have seen people using it. A quick search shows e.g. Doing Bayesian Data Analysis in brms and the tidyverse, https://www.statistical-rethinking.robitalec.ca/week-8.html, https://www.andreashandel.com/posts/2022-02-24-longitudinal-multilevel-bayes-3/, and https://dirmeier.github.io/rstansequential/index.html. But if it's not used by packages, we can just drop it (and always simplify).
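For context, this is the kind of usage being discussed; a minimal sketch assuming two placeholder loo objects loo1 and loo2, with simplify controlling whether the extra columns (including looic) are printed:

comp <- loo_compare(loo1, loo2)
print(comp)                    # default simplify = TRUE: elpd_diff and se_diff only
print(comp, simplify = FALSE)  # also prints elpd_loo, p_loo, looic and their SEs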

R/loo_compare.R Outdated
diag_pnorm[khat_diff > 0.5] <- paste0("khat_diff > 0.5")
}
rownames(comp) <- rnms
comp <- cbind(data.frame(elpd_diff = elpd_diff, se_diff = se_diff,
Member

This also changes the returned object to a data frame when it used to be a matrix. That could potentially cause reverse dependency issues, but we'll see when I run the checks. This is a necessary change, though, since we're mixing numeric columns with text columns for the diagnostic.
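A minimal illustration (not code from the PR) of why a data frame is needed once text columns are added: cbind() on numerics and a character value coerces everything to character, whereas a data frame keeps the column types.

cbind(matrix(c(-1.0, 0.8), nrow = 1), "N < 100")                    # result is an all-character matrix
data.frame(elpd_diff = -1.0, se_diff = 0.8, diag_diff = "N < 100")  # numeric columns stay numeric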

@codecov-commenter

codecov-commenter commented Oct 17, 2025

Codecov Report

❌ Patch coverage is 73.33333% with 16 lines in your changes missing coverage. Please review.
✅ Project coverage is 92.52%. Comparing base (d014407) to head (3d008e9).
⚠️ Report is 3 commits behind head on master.

Files with missing lines    Patch %    Lines
R/loo_compare.R             72.88%     16 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #300      +/-   ##
==========================================
- Coverage   92.78%   92.52%   -0.26%     
==========================================
  Files          31       31              
  Lines        2992     3036      +44     
==========================================
+ Hits         2776     2809      +33     
- Misses        216      227      +11     


@jgabry
Member

jgabry commented Oct 17, 2025

I fixed the tests so this is passing now. I will run reverse dependency checks over the weekend hopefully. I just put one question for you @avehtari in the review comments about how to label the diagnostic.

@jgabry
Member

jgabry commented Oct 17, 2025

Starting a comment to keep track of reverse dependency issues. The reverse dependency checks are still running (they will take a while). I will update here as I discover new issues.

  • BAMBI
    • Error in main function due to new return type being a data frame instead of a matrix
    • For this particular error I was able to put a fix in loo_compare (it had to do with row/col names) so no PR is needed.
  • BayesERtools
    • Test error due to new return type being a data frame instead of a matrix
    • I forked the repo and fixed the test. Will wait to make a PR until we merge this PR into loo.
  • flocker
    • Test error due to new return type being a data frame instead of a matrix (and with additional columns)
    • I made a fork and fixed the test. Will wait to make a PR.
  • hbamr
    • uses simplify=FALSE in a vignette but not in any package functions
    • I made a fork and removed the use of simplify. Will wait to make a PR.
  • rstanarm
    • has some custom print stuff that assumed a matrix
    • I was actually able to get around this in loo_compare itself so no PR is necessary

@avehtari
Member Author

  1. About the diagnostic messages
    Should we use
    N<100
    |elpd_diff|<4
    khat_diff > 0.5
    or
    N<100, small data
    |elpd_diff|<4, similar predictions
    khat_diff > 0.5, possible outliers
    ...
    Maybe the shorter ones are better?

  2. If one column name is p_worse, is diag_pnorm ok? Or should it be diag_p_worse, or just diag, diag_diff, or diagnostic?

  3. Since we are switching to using a data.frame, can we have the model names as a column and not as row names? I have been using tinytable for pretty printing, but then I need the model name as a column (see the sketch after this list).

  4. Illustration of how the output looks now (+ tinytable) at https://users.aalto.fi/~ave/casestudies/LOO_uncertainty/loo_uncertainty_diag.html
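A minimal sketch of the tinytable use case from item 3, assuming placeholder loo objects loo1 and loo2 and that the comparison can be coerced to a data frame whose model names are row names:

library(tinytable)
comp <- as.data.frame(loo_compare(loo1, loo2))
comp <- cbind(model = rownames(comp), comp)      # workaround until model is a proper column
tt(comp[, c("model", "elpd_diff", "se_diff")])   # tinytable::tt() renders the table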

@jgabry
Member

jgabry commented Oct 21, 2025

@avehtari I just added the model column. Can you try it when you have time?

I will make a new list of the reverse dependencies that need to be fixed, since it's not identical to the list in my previous comment.

Also, @avehtari will work on improved documentation and I will work on updating loo_compare for subsampling to use a data frame and add the new model column.

@jgabry
Member

jgabry commented Oct 21, 2025

Here's an example of how rstanarm output looks using this branch.

There are two methods. The first just uses regular loo_compare() and so it's no different than just using the loo package:

fit_wt <- stan_glm(mpg ~ wt, data = mtcars, refresh = 0)
fit_wt_cyl <- stan_glm(mpg ~ wt + cyl, data = mtcars, refresh = 0)
fit_wt_cyl_gear <- stan_glm(mpg ~ wt + cyl + gear, data = mtcars, refresh = 0)

loo_wt <- loo(fit_wt)
loo_wt_cyl <- loo(fit_wt_cyl)
loo_wt_cyl_gear <- loo(fit_wt_cyl_gear)
loo_compare(loo_wt, loo_wt_cyl, loo_wt_cyl_gear)
           model elpd_diff se_diff p_worse diag_diff
      fit_wt_cyl       0.0     0.0      NA          
 fit_wt_cyl_gear      -1.0     0.8    0.90   N < 100
          fit_wt      -5.1     2.8    0.97   N < 100

The second method adds loo objects to the fitted model objects and then puts them in a list:

fit_wt$loo <- loo_wt
fit_wt_cyl$loo <- loo_wt_cyl
fit_wt_cyl_gear$loo <- loo_wt_cyl_gear

loo_compare(stanreg_list(fit_wt, fit_wt_cyl, fit_wt_cyl_gear), detail = TRUE)
Model formulas: 
 fit_wt:  mpg ~ wt
 fit_wt_cyl:  mpg ~ wt + cyl
 fit_wt_cyl_gear:  mpg ~ wt + cyl + gear

Model comparison based on LOO-CV: 
           model elpd_diff se_diff p_worse diag_diff
      fit_wt_cyl       0.0     0.0      NA          
 fit_wt_cyl_gear      -1.0     0.8    0.90   N < 100
          fit_wt      -5.1     2.8    0.97   N < 100
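A hedged check, not the package internals: the p_worse values above are consistent with a normal approximation of the probability that the true elpd difference is negative, i.e. pnorm(-elpd_diff / se_diff), keeping in mind that the displayed estimates are rounded:

pnorm(-(-1.0) / 0.8)  # ~0.894, close to the displayed 0.90 (presumably computed from unrounded values)
pnorm(-(-5.1) / 2.8)  # ~0.966, matching the displayed 0.97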

@avehtari
Member Author

I have tested model as the column name, and it works well.

I added yet another column, diag_elpd, to show whether there were high Pareto k values in the PSIS-LOO computation. We had discussed this with @n-kall some time ago. Sometimes people look at the loo_compare() output without checking whether there were high khats. How this looks and how to interpret the results is illustrated in https://users.aalto.fi/~ave/modelselection/roaches_diag.html

The roaches example has an interesting case where, in the end, two models are compared, and the khat for the pointwise elpds of both models is < 0.5, but the khat for the pointwise differences is > 0.5. This is possible due to the dependency arising from both models predicting the same target. This, however, goes slightly beyond what is said in the LOO uncertainty paper.

You can see how the comparison tables look with tinytable at
https://users.aalto.fi/~ave/casestudies/LOO_uncertainty/loo_uncertainty_diag.html
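A minimal sketch of the check that diag_elpd is meant to surface, not the PR implementation, using loo's existing pareto_k_values() helper and the loo_wt object from the rstanarm example above:

k <- loo::pareto_k_values(loo_wt)  # pointwise Pareto khat values from PSIS-LOO
any(k > 0.7)                       # TRUE would warrant a diag_elpd flag such as "k_psis > 0.7"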

@jgabry
Member

jgabry commented Oct 22, 2025

You can see how the comparison tables look with tinytable at https://users.aalto.fi/~ave/casestudies/LOO_uncertainty/loo_uncertainty_diag.html

It looks good, although none of the examples in this have anything in the diag_elpd column, so it doesn't show what that looks like.

@avehtari
Member Author

You can see how the comparison tables look with tinytable at https://users.aalto.fi/~ave/casestudies/LOO_uncertainty/loo_uncertainty_diag.html

It looks good, although none of the examples in this have anything in the diag_elpd column, so it doesn't show what that looks like.

That's why I also linked the roaches case study, which has stuff in diag_elpd

@jgabry
Member

jgabry commented Oct 22, 2025

You can see how the comparison tables look with tinytable at https://users.aalto.fi/~ave/casestudies/LOO_uncertainty/loo_uncertainty_diag.html
It looks good, although none of the examples in this have anything in the diag_elpd column, so it doesn't show what that looks like.

That's why I also linked the roaches case study, which has stuff in diag_elpd

Oops, sorry I forgot to look at that one!

@avehtari
Member Author

Maybe change khat_diff > 0.5 to k_diff > 0.5 and khat_psis > 0.7 to k_psis > 0.7 to make them more distinct from each other and also shorter. Also, the current print.loo says "Pareto k diagnostic values", that is, there is no hat.

@jgabry
Member

jgabry commented Oct 22, 2025

Maybe change khat_diff > 0.5 to k_diff > 0.5 and khat_psis > 0.7 to k_psis > 0.7 to make them more distinct from each other and also shorter. Also, the current print.loo says "Pareto k diagnostic values", that is, there is no hat.

Ok yeah that sounds good.

@jgabry
Member

jgabry commented Oct 22, 2025

I made some small doc edits, fixed the tests, and moved the computation of diag_diff and diag_elpd to separate internal functions (defined at the end of the same R file). This way loo_compare() itself is a little less messy.

@ndevln

ndevln commented Oct 29, 2025

Hi,

First of all, thank you for everything.

I really like printing the diagnostics in the output of loo_compare(). This is really helpful.

I think I would prefer the reporting of all diagnostic warnings, instead of only the most important one. A lot of analyzed datasets are smaller than they should be, and therefore the warning N<100 would always take precedence. I think the |elpd_diff|<4 warning is really helpful, since it took me quite a while to realize this and find the corresponding explanation in the FAQ the first time I used loo. For the khat diagnostics, I know there are other functions, but it would be convenient to show them in the model comparison too.

I also think p_worse is a helpful parameter. But I am unsure how to interpret it. Since these are probabilities, 0.5 is neutral (correct?). And all models in the output (row 2 and below) will have a p_worse > 0.5? In a lot of your examples p_worse is >0.9 even if model differences are still quite small, is this only due to the type of examples or is the interpretation of p_worse similar to the AUC and you expect values above 0.95?

Can I use it to compare model comparisons of the same models fitted with different datasets?

Would you consider renaming p_worse to prob_worse? I see that you prefer shorter names, but the p is a little bit worn out by frequentist approaches. Also, when googling (in German) p_worse gets autocorrected to p-Wert (p-value). I hope stan adoption increases and this would make it more distinct for people switching from frequentist methods.

Again, thank you for your work and all the best,
Alex

@avehtari
Member Author

@ndevln, thanks for the feedback! And thanks for commenting here on GitHub; as you can see, my answer is quite long for bsky.

I think I would prefer the reporting of all diagnostic warnings,

The challenge is to make them all fit. By default the output is a printed data frame, and if the printed columns together are too wide, the output is split and looks messy. We have also discussed an option of showing only footnote numbers for the diagnostics and then printing footnotes with longer text, but that is also a bit complicated. Another option would be to add an option to compute and show all the diagnostics.

The logic of showing only one of them is the following (a rough sketch in code follows the list):

  • If N<100, we can't reliably compute khat, so there is no sense in showing that diagnostic
  • If N<100, diff_se is likely to be underestimated and there is not much additional benefit in warning if |elpd_diff|<4
  • If N>=100 and |elpd_diff|<4, the normal approximation is not good and diff_se is not a useful summary, and thus reporting khat_diff does not add much useful information
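A rough sketch of that precedence, illustrative only and not the PR code; diag_for_model is a hypothetical helper, and n, elpd_diff, and khat_diff are assumed to be already computed for each model:

diag_for_model <- function(n, elpd_diff, khat_diff) {
  if (n < 100) {
    "N < 100"               # khat unreliable and se_diff likely underestimated
  } else if (abs(elpd_diff) < 4) {
    "|elpd_diff| < 4"       # normal approximation not good, se_diff not a useful summary
  } else if (khat_diff > 0.5) {
    "khat_diff > 0.5"       # possible outliers in the pointwise differences
  } else {
    ""                      # no diagnostic message needed
  }
}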

I also think p_worse is a helpful parameter. But I am unsure how to interpret it. Since these are probabilities, 0.5 is neutral (correct?). And all models in the output (row 2 and below) will have a p_worse > 0.5?

Yes, because the models have been ordered.

In a lot of your examples p_worse is >0.9 even if model differences are still quite small, is this only due to the type of examples

In all the examples with N<100, diff_se is underestimated and the probabilities are estimated to be too close to 1. Maybe this is what you are seeing?

In addition, in the examples with N>100 and |elpd_diff|<4, diff_se is also likely to be too small and the estimated probabilities are too close to 1.

If we pick examples which illustrate the different diagnostics, we are likely to have estimated probabilities that are often too close to 1.

or is the interpretation of p_worse similar to the AUC and you expect values above 0.95?

No. We expect values above 0.5.

Your question did make me think that we could add an additional correction for using the ordered elpd_diffs, but that is not easy unless we have a large number of models to be compared.

Would you consider renaming p_worse to prob_worse? I see that you prefer shorter names, but the p is a little bit worn out by frequentist approaches. Also, when googling (in German) p_worse gets autocorrected to p-Wert (p-value). I hope stan adoption increases and this would make it more distinct for people switching from frequentist methods.

It's p because the equations use p(). I've not seen anyone propose before that we should give up using p() in equations. Instead of prob, the more logical choice would be Pr, as that is sometimes used in equations that evaluate only to a probability (while p() is used for both densities and probabilities).

@jgabry
Member

jgabry commented Oct 29, 2025

It's p because the equations use p(). I've not seen anyone propose before that we should give up using p() in equations. Instead of prob, the more logical choice would be Pr, as that is sometimes used in equations that evaluate only to a probability (while p() is used for both densities and probabilities).

We could consider Pr_worse instead of p_worse. Although all of the other columns in the data frame have lowercase names.

@ndevln

ndevln commented Nov 5, 2025

Hi,

thanks for entertaining my input. Just as context: I teach epidemiologists with differing degrees of statistical background, and because of this I am maybe a little bit too cautious. I think this package should also be usable by these people; I think your package is fantastic.

Diagnostics

For the diagnostics, I think an option to show all warnings would be nice, even if the value of the warnings is limited. In infectious disease surveillance you want your datasets to be small, because if you have large datasets you have failed to contain the disease. A small-sample-size warning is therefore often not helpful, since your data often still contains useful information. And this data has to be published to be used in a meta-analysis at a later time.

If you consider adding an option, maybe a warning() could be given if some of the diagnostic warnings are hidden, with an explanation of how to show them. I know this complicates the function interface a little bit, but maybe it's ok.

p_worse

When thinking about this a little bit longer, I realized my worries with p_worse are maybe more about the interpretation of the probability. While I think Pr_worse is nice and maybe adds clarity, I understand that p is the mathematically correct naming and the user should be able to differentiate this from a p-value.

My argument with the AUC was way simpler. When looking at the examples, I was unsure how to interpret p_worse since it often assumes values above 0.9 in the examples. For model selection, I have to consider multiple factors like:

  • elpd
  • known causal and biological relationships and the study design
  • parsimony
  • other model diagnostics...

The main tradeoff is often between elpd and parsimony. And since p_worse is often above 0.9, I would want p_worse to be close to 1 before considering a more complex model superior (everything else being equal). This is what is unclear to me. Did I get the right idea?

Other thought in this regard:
There is no fixed decision threshold for p_worse, and it is more beneficial to interpret p_worse continuously. But since easy rules feel good, a p-value-like decision threshold of 0.95 could be applied. But would this be too low and benefit more complex models?

I hope you find my thoughts helpful, and sorry for the rambling.

Greetings,
Alex

@avehtari
Member Author

avehtari commented Nov 5, 2025

@ndevln, thanks for the additional comments! I've been busy with a grant proposal, but now that the deadline has passed, I will soon modify one of the case studies to provide more information about the diagnostics and the interpretation of the results. I'll comment here and tag you when it's ready; we'll see whether it helps clarify some things, and your feedback will be useful for seeing whether we are able to explain things better.

@avehtari
Member Author

avehtari commented Nov 5, 2025

@ndevln, and don't be sorry; it's great that you rambled and wrote what you think even while being uncertain (not everyone has the courage), as that also helps us see where we need to improve the documentation to reduce such uncertainty.
