
Conversation

@avehtari
Member

The corresponding issue #299

This PR is not ready for merge; it needs some thought

@avehtari avehtari requested review from jgabry and n-kall August 31, 2025 17:00
@avehtari
Member Author

@jgabry , we discussed this with @n-kall, and I added an option to the print function so that the new information is not shown by default. Would you have time to look at this?

Member

@jgabry jgabry left a comment

  • I think this is a good idea. It's possible that this doesn't actually break any packages that depend on loo. I can check by running reverse dependency checks once the errors in the code are fixed (see individual comments).

  • The tests need to be updated, but we can wait to do that until we check that it doesn't break any reverse dependencies

  • We should put some information about this in the main loo_compare documentation, not only in the print method

  • I actually think it would be good to print these probabilities by default. What was the reason you turned it off by default? (Right now the printing errors by default, but once fixed it will be off by default.) I don't think the printing should affect backwards compatibility (or am I missing something?). Adding more columns could potentially affect something, right? Or was there a different reason you turned off printing the probabilities by default?

@avehtari
Member Author

Ok, let's change them to be printed by default.

I had the column name as p_worse so that it also tells the direction. Then I considered switching to pnorm, as it tells that it's based on the normal approximation, and ended up somewhere in between.

@jgabry
Member

jgabry commented Oct 17, 2025

Setting simplify=FALSE when calling the print method also errors due to a weird interaction with the p_worse. But I think we can get rid of the simplify argument. I've never seen anyone use it and I doubt it will break any packages (we can find out by reverse dependency checks). I will push a commit fixing this.

@avehtari
Member Author

I would love to get rid of the simplify argument to make it more difficult to print looic, but I have seen people using it. A quick search shows e.g. Doing Bayesian Data Analysis in brms and the tidyverse, https://www.statistical-rethinking.robitalec.ca/week-8.html, https://www.andreashandel.com/posts/2022-02-24-longitudinal-multilevel-bayes-3/, and https://dirmeier.github.io/rstansequential/index.html. But if it's not used by packages, we can just drop it (and always simplify).
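For context, this is the kind of usage being discussed; a minimal sketch assuming two placeholder loo objects loo1 and loo2, with simplify controlling whether the extra columns (including looic) are printed:

comp <- loo_compare(loo1, loo2)
print(comp)                    # default simplify = TRUE: elpd_diff and se_diff only
print(comp, simplify = FALSE)  # also prints elpd_loo, p_loo, looic and their SEs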

R/loo_compare.R Outdated
diag_pnorm[khat_diff > 0.5] <- paste0("khat_diff > 0.5")
}
rownames(comp) <- rnms
comp <- cbind(data.frame(elpd_diff = elpd_diff, se_diff = se_diff,
Member

This also changes the returned object to a data frame when it used to be a matrix. That could potentially cause reverse dependency issues, but we'll see when I run the checks. This is a necessary change, though, since we're mixing numeric columns with text columns for the diagnostic.
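A minimal illustration (not code from the PR) of why a data frame is needed once text columns are added: cbind() on numerics and a character value coerces everything to character, whereas a data frame keeps the column types.

cbind(matrix(c(-1.0, 0.8), nrow = 1), "N < 100")                    # result is an all-character matrix
data.frame(elpd_diff = -1.0, se_diff = 0.8, diag_diff = "N < 100")  # numeric columns stay numeric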

@codecov-commenter

codecov-commenter commented Oct 17, 2025

Codecov Report

❌ Patch coverage is 73.33333% with 16 lines in your changes missing coverage. Please review.
✅ Project coverage is 92.52%. Comparing base (d014407) to head (3d008e9).
⚠️ Report is 3 commits behind head on master.

Files with missing lines    Patch %    Lines
R/loo_compare.R             72.88%     16 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #300      +/-   ##
==========================================
- Coverage   92.78%   92.52%   -0.26%     
==========================================
  Files          31       31              
  Lines        2992     3036      +44     
==========================================
+ Hits         2776     2809      +33     
- Misses        216      227      +11     


@jgabry
Member

jgabry commented Oct 17, 2025

I fixed the tests so this is passing now. I will run reverse dependency checks over the weekend hopefully. I just put one question for you @avehtari in the review comments about how to label the diagnostic.

@jgabry
Member

jgabry commented Oct 17, 2025

Starting a comment to keep track of reverse dependency issues. The reverse dependency checks are still running (they will take a while). I will update here as I discover new issues.

  • BAMBI
    • Error in main function due to new return type being a data frame instead of a matrix
    • For this particular error I was able to put a fix in loo_compare (it had to do with row/col names) so no PR is needed.
  • BayesERtools
    • Test error due to new return type being a data frame instead of a matrix
    • I forked the repo and fixed the test. Will wait to make a PR until we merge this PR into loo.
  • flocker
    • Test error due to new return type being a data frame instead of a matrix (and with additional columns)
    • I made a fork and fixed the test. Will wait to make a PR.
  • hbamr
    • uses simplify=FALSE in a vignette but not in any package functions
    • I made a fork and removed the use of simplify. Will wait to make a PR.
  • rstanarm
    • has some custom print stuff that assumed a matrix
    • I was actually able to get around this in loo_compare itself so no PR is necessary

@avehtari
Member Author

  1. About the diagnostic messages
    Should we use
    N<100
    |elpd_diff|<4
    khat_diff > 0.5
    or
    N<100, small data
    |elpd_diff|<4, similar predictions
    khat_diff > 0.5, possible outliers
    ...
    Maybe the shorter ones are better?

  2. If one column name is p_worse, is diag_pnorm ok? Or should it be diag_p_worse, or just diag, diag_diff, or diagnostic?

  3. Since we are switching to using a data.frame, can we have the model names as a column and not as row names? I have been using tinytable for pretty printing, but then I need the model name as a column (see the sketch after this list).

  4. Illustration of how the output looks now (+ tinytable) at https://users.aalto.fi/~ave/casestudies/LOO_uncertainty/loo_uncertainty_diag.html
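A minimal sketch of the tinytable use case from item 3, assuming placeholder loo objects loo1 and loo2 and that the comparison can be coerced to a data frame whose model names are row names:

library(tinytable)
comp <- as.data.frame(loo_compare(loo1, loo2))
comp <- cbind(model = rownames(comp), comp)      # workaround until model is a proper column
tt(comp[, c("model", "elpd_diff", "se_diff")])   # tinytable::tt() renders the table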

@jgabry
Member

jgabry commented Oct 21, 2025

@avehtari I just added the model column. Can you try it when you have time?

I will make a new list of the reverse dependencies that need to be fixed, since it's not identical to the list in my previous comment.

Also, @avehtari will work on improved documentation and I will work on updating loo_compare for subsampling to use a data frame and add the new model column.

@jgabry
Member

jgabry commented Oct 21, 2025

Here's an example of how rstanarm output looks using this branch.

There are two methods. The first just uses regular loo_compare() and so it's no different than just using the loo package:

fit_wt <- stan_glm(mpg ~ wt, data = mtcars, refresh = 0)
fit_wt_cyl <- stan_glm(mpg ~ wt + cyl, data = mtcars, refresh = 0)
fit_wt_cyl_gear <- stan_glm(mpg ~ wt + cyl + gear, data = mtcars, refresh = 0)

loo_wt <- loo(fit_wt)
loo_wt_cyl <- loo(fit_wt_cyl)
loo_wt_cyl_gear <- loo(fit_wt_cyl_gear)
loo_compare(loo_wt, loo_wt_cyl, loo_wt_cyl_gear)
           model elpd_diff se_diff p_worse diag_diff
      fit_wt_cyl       0.0     0.0      NA          
 fit_wt_cyl_gear      -1.0     0.8    0.90   N < 100
          fit_wt      -5.1     2.8    0.97   N < 100

The second method adds loo objects to the fitted model objects and then puts them in a list:

fit_wt$loo <- loo_wt
fit_wt_cyl$loo <- loo_wt_cyl
fit_wt_cyl_gear$loo <- loo_wt_cyl_gear

loo_compare(stanreg_list(fit_wt, fit_wt_cyl, fit_wt_cyl_gear), detail = TRUE)
Model formulas: 
 fit_wt:  mpg ~ wt
 fit_wt_cyl:  mpg ~ wt + cyl
 fit_wt_cyl_gear:  mpg ~ wt + cyl + gear

Model comparison based on LOO-CV: 
           model elpd_diff se_diff p_worse diag_diff
      fit_wt_cyl       0.0     0.0      NA          
 fit_wt_cyl_gear      -1.0     0.8    0.90   N < 100
          fit_wt      -5.1     2.8    0.97   N < 100
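A hedged check, not the package internals: the p_worse values above are consistent with a normal approximation of the probability that the true elpd difference is negative, i.e. pnorm(-elpd_diff / se_diff), keeping in mind that the displayed estimates are rounded:

pnorm(-(-1.0) / 0.8)  # ~0.894, close to the displayed 0.90 (presumably computed from unrounded values)
pnorm(-(-5.1) / 2.8)  # ~0.966, matching the displayed 0.97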

@avehtari
Member Author

I have tested model as the column name, and it works well.

I added yet another column, diag_elpd, to show whether there were high Pareto k values in the PSIS-LOO computation. We had discussed this with @n-kall some time ago. Sometimes people look at the loo_compare() output without checking whether there were high khats. How this looks and how to interpret the results is illustrated in https://users.aalto.fi/~ave/modelselection/roaches_diag.html

The roaches example has an interesting case where, in the end, two models are compared, and the khat for the pointwise elpds of both models is < 0.5, but the khat for the pointwise differences is > 0.5. This is possible due to the dependency arising from both models predicting the same target. This, however, goes slightly beyond what is said in the LOO uncertainty paper.

You can see how the comparison tables look with tinytable at
https://users.aalto.fi/~ave/casestudies/LOO_uncertainty/loo_uncertainty_diag.html
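A minimal sketch of the check that diag_elpd is meant to surface, not the PR implementation, using loo's existing pareto_k_values() helper and the loo_wt object from the rstanarm example above:

k <- loo::pareto_k_values(loo_wt)  # pointwise Pareto khat values from PSIS-LOO
any(k > 0.7)                       # TRUE would warrant a diag_elpd flag such as "k_psis > 0.7"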

@jgabry
Member

jgabry commented Oct 22, 2025

You can see how the comparison tables look with tinytable at https://users.aalto.fi/~ave/casestudies/LOO_uncertainty/loo_uncertainty_diag.html

It looks good, although none of the examples in this have anything in the diag_elpd column, so it doesn't show what that looks like.

@avehtari
Member Author

You can see how the comparison tables look with tinytable at https://users.aalto.fi/~ave/casestudies/LOO_uncertainty/loo_uncertainty_diag.html

It looks good, although none of the examples in this have anything in the diag_elpd column, so it doesn't show what that looks like.

That's why I also linked the roaches case study, which has stuff in diag_elpd

@jgabry
Member

jgabry commented Oct 22, 2025

You can see how the comparison tables look with tinytable at https://users.aalto.fi/~ave/casestudies/LOO_uncertainty/loo_uncertainty_diag.html
It looks good, although none of the examples in this have anything in the diag_elpd column, so it doesn't show what that looks like.

That's why I also linked the roaches case study, which has stuff in diag_elpd

Oops, sorry I forgot to look at that one!

@avehtari
Member Author

Maybe change khat_diff > 0.5 to k_diff > 0.5 and khat_psis > 0.7 to k_psis > 0.7 to make them more distinct from each other and also shorter. Also, the current print.loo says "Pareto k diagnostic values", that is, there is no hat.

@jgabry
Member

jgabry commented Oct 22, 2025

Maybe change khat_diff > 0.5 to k_diff > 0.5 and khat_psis > 0.7 to k_psis > 0.7 to make them more distinct from each other and also shorter. Also, the current print.loo says "Pareto k diagnostic values", that is, there is no hat.

Ok yeah that sounds good.

@jgabry
Member

jgabry commented Oct 22, 2025

I made some small doc edits, fixed the tests, and moved the computation of diag_diff and diag_elpd to separate internal functions (defined at the end of the same R file). This way loo_compare() itself is a little less messy.

@ndevln

ndevln commented Oct 29, 2025

Hi,

First of all, thank you for everything.

I really like printing the diagnostics in the output of loo_compare(). This is really helpful.

I think I would prefer the reporting of all diagnostic warnings, instead of only the most important one. A lot of analyzed datasets are smaller than they should be, and therefore the warning N<100 would always take precedence. I think the |elpd_diff|<4 warning is really helpful, since it took me quite a while to realize this and find the corresponding explanation in the FAQ the first time I used loo. For the khat diagnostics, I know there are other functions, but it would be convenient to show them in the model comparison too.

I also think p_worse is a helpful parameter. But I am unsure how to interpret it. Since these are probabilities, 0.5 is neutral (correct?). And all models in the output (row 2 and below) will have a p_worse > 0.5? In a lot of your examples p_worse is >0.9 even if model differences are still quite small, is this only due to the type of examples or is the interpretation of p_worse similar to the AUC and you expect values above 0.95?

Can I use it to compare model comparisons of the same models fitted with different datasets?

Would you consider renaming p_worse to prob_worse? I see that you prefer shorter names, but the p is a little bit worn out by frequentist approaches. Also, when googling (in German) p_worse gets autocorrected to p-Wert (p-value). I hope stan adoption increases and this would make it more distinct for people switching from frequentist methods.

Again, thank you for your work and all the best,
Alex

@avehtari
Member Author

@ndevln, thanks for the feedback! And thanks for commenting here on GitHub; as you can see, my answer is quite long for bsky.

I think I would prefer the reporting of all diagnostic warnings,

The challenge is to make them all fit. By default the output is a printed data frame, and if the printed columns together are too wide, the output is split and looks messy. We have also discussed an option of showing only footnote numbers for the diagnostics and then printing footnotes with longer text, but that is also a bit complicated. Another option would be to add an option to compute and show all the diagnostics.

The logic of showing only one of them is the following (a rough sketch in code follows the list):

  • If N<100, we can't reliably compute khat, so there is no sense in showing that diagnostic
  • If N<100, diff_se is likely to be underestimated and there is not much additional benefit in warning if |elpd_diff|<4
  • If N>=100 and |elpd_diff|<4, the normal approximation is not good and diff_se is not a useful summary, and thus reporting khat_diff does not add much useful information
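A rough sketch of that precedence, illustrative only and not the PR code; diag_for_model is a hypothetical helper, and n, elpd_diff, and khat_diff are assumed to be already computed for each model:

diag_for_model <- function(n, elpd_diff, khat_diff) {
  if (n < 100) {
    "N < 100"               # khat unreliable and se_diff likely underestimated
  } else if (abs(elpd_diff) < 4) {
    "|elpd_diff| < 4"       # normal approximation not good, se_diff not a useful summary
  } else if (khat_diff > 0.5) {
    "khat_diff > 0.5"       # possible outliers in the pointwise differences
  } else {
    ""                      # no diagnostic message needed
  }
}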

I also think p_worse is a helpful parameter. But I am unsure how to interpret it. Since these are probabilities, 0.5 is neutral (correct?). And all models in the output (row 2 and below) will have a p_worse > 0.5?

Yes, because the models have been ordered.

In a lot of your examples p_worse is >0.9 even if model differences are still quite small, is this only due to the type of examples

In all the examples with N<100, diff_se is underestimated and the probabilities are estimated to be too close to 1. Maybe this is what you are seeing?

In addition, in the examples with N>100 and |elpd_diff|<4, diff_se is also likely to be too small and the estimated probabilities are too close to 1.

If we pick examples which illustrate the different diagnostics, we are likely to have estimated probabilities that are often too close to 1.

or is the interpretation of p_worse similar to the AUC and you expect values above 0.95?

No. We expect values above 0.5.

Your question did make me think that we could add an additional correction for using the ordered elpd_diffs, but that is not easy unless we have a large number of models to be compared.

Would you consider renaming p_worse to prob_worse? I see that you prefer shorter names, but the p is a little bit worn out by frequentist approaches. Also, when googling (in German) p_worse gets autocorrected to p-Wert (p-value). I hope stan adoption increases and this would make it more distinct for people switching from frequentist methods.

It's p because the equations use p(). I've not seen anyone propose before that we should give up using p() in equations. Instead of prob, the more logical choice would be Pr, as that is sometimes used in equations that evaluate only to a probability (while p() is used for both densities and probabilities).

@jgabry
Member

jgabry commented Oct 29, 2025

It's p because the equations use p(). I've not seen anyone propose before that we should give up using p() in equations. Instead of prob, the more logical choice would be Pr, as that is sometimes used in equations that evaluate only to a probability (while p() is used for both densities and probabilities).

We could consider Pr_worse instead of p_worse. Although all of the other columns in the data frame have lowercase names.

@ndevln

ndevln commented Nov 5, 2025

Hi,

thanks for entertaining my input. Just as context: I teach epidemiologists with differing degrees of statistical background, and because of this I am maybe a little bit too cautious. I think this package should also be usable by these people; I think your package is fantastic.

Diagnostics

For the diagnostics, I think an option to show all warnings would be nice, even if the value of the warnings is limited. In infectious disease surveillance you want your datasets to be small, because if you have large datasets you have failed to contain the disease. A small-sample-size warning is therefore often not helpful, since your data often still contains useful information. And this data has to be published to be used in a meta-analysis at a later time.

If you consider adding an option, maybe a warning() could be given if some of the diagnostic warnings are hidden, with an explanation of how to show them. I know this complicates the function interface a little bit, but maybe it's ok.

p_worse

When thinking about this a little bit longer, I realized my worries with p_worse are maybe more about the interpretation of the probability. While I think Pr_worse is nice and maybe adds clarity, I understand that p is the mathematically correct naming and the user should be able to differentiate this from a p-value.

My argument with the AUC was way simpler. When looking at the examples, I was unsure how to interpret p_worse since it often assumes values above 0.9 in the examples. For model selection, I have to consider multiple factors like:

  • elpd
  • known causal and biological relationships and the study design
  • parsimony
  • other model diagnostics...

The main tradeoff is often between elpd and parsimony. And since p_worse is often above 0.9, I would want p_worse to be close to 1 before considering a more complex model superior (everything else being equal). This is what is unclear to me. Did I get the right idea?

Other thought in this regard:
There is no fixed decision threshold for p_worse, and it is more beneficial to interpret p_worse continuously. But since easy rules feel good, a p-value-like decision threshold of 0.95 could be applied. But would this be too low and benefit more complex models?

I hope you find my thoughts helpful, and sorry for the rambling.

Greetings,
Alex

@avehtari
Member Author

avehtari commented Nov 5, 2025

@ndevln, thanks for the additional comments! I've been busy with a grant proposal, but now that the deadline has passed, I will soon modify one of the case studies to provide more information about the diagnostics and the interpretation of the results. I'll comment here and tag you when it's ready; we'll see whether it helps clarify some things, and your feedback will be useful for seeing whether we are able to explain things better.

@avehtari
Member Author

avehtari commented Nov 5, 2025

@ndevln, and don't be sorry; it's great that you rambled and wrote what you think even while being uncertain (not everyone has the courage), as that also helps us see where we need to improve the documentation to reduce such uncertainty.
