Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Request: make predictions possible after using pool() #82

Open
oren0e opened this issue Apr 8, 2018 · 28 comments
Open

Request: make predictions possible after using pool() #82

oren0e opened this issue Apr 8, 2018 · 28 comments

Comments

@oren0e
Copy link

oren0e commented Apr 8, 2018

When using mice multiple imputation to generate multiple imputed sets and then running a model on each one of them (like elastic-net from the glmnet package) and then pooling the estimates to a single set - its not possible to use predict.glmnet() (or predict()) on the object that comes out of mice::pool().
The prediction functions expect a fitted model object while mice::pool() returns

An object of class mipo, which stands for 'multiple imputation pooled outcome'.

Is it possible to add the ability to make predictions from the return object of mice::pool()?

@stefvanbuuren
Copy link
Member

There are actually multiple ways to create predictions from multiply imputed data. See Obtaining Predictions from Models Fit to Multiply Imputed Data for a comparison of two methods. I tend to deviate from Miles' recommendation, and would generally prefer Predict then combine because that corresponds to Rubin's rules and accounts for nonlinearities at the proper place.

Also see the discussion by Wood et al., who highlight the decisions to be made when accounting for model optimism.

It should be relatively straightforward to implement any of these approaches in mice.

@mnarayan
Copy link

Just want to add that this feature would be quite useful. I'm a bit surprised by the recommendations in the Miles paper, it doesn't really show the regimes where combine and predict would break down substantially.

@SolomonMg
Copy link

Any word on this feature? Would be very helpful!

@stefvanbuuren
Copy link
Member

I have done some thinking on mice prediction workflows. Here are two possibilities, one involving only a train datasets, another involving both a train and test dataset.

A

One data set: train, $n$ rows, $m$ imputations:

  • Impute the missing data;
  • Fit the complete-data model to each imputed data set;
  • Obtain predictions per imputed data set;
  • Pool predictions and prediction errors
  • Bind the result to train.
1. Standard mice: imp <- mice(train, …)
2. Standard complete-data analysis: fit <- with(imp, mymodel, …)
3. Calculate predictions: predlist <- lapply(fit$analysis, broom::augment.xxx, se.fit = TRUE, …) - We may need the data argument for some procedures. Perhaps use map().
4. Apply Rubin's rules to pool predictions: predmat <- pool(predlist, …), which results in a n * 3 matrix: .fitted, .residuals, .se.fit
5. Bind to data: ready <- bind_cols(train, predmat)
  • Supports all broom::augment.xxx() functions.
  • Steps 3 and 4 can jointly form an augment.mira() function, a kind of meta-augment.

B

Two data sets: train ($n_0$ rows) and test ($n_1$ rows), $m$ imputations.

  • Estimate the imputation model on train only;
  • Impute the missing data in both train and test;
  • Fit the complete-data model to each imputed data set using train only;
  • Obtain predictions per imputed data set in test only;
  • Pool predictions and prediction errors;
  • Bind the result to test.
1. Standard mice: imp.train <- mice(train, …)
2. Impute test data: imp.test <_ mice.mids(imp.train, newdata = test)
3. fit <- with(imp.test, mymodel, …)
4. Apply steps A3 and A4
5. Bind to data: ready <- bind_cols(test, predmat)

A and B are extreme scenarios, and obviously, we can mix aspects of both.

What I would like to know is whether A or B as described would be useful for applications. Suggestions for alternatives and tweaks are welcome.

@cryanking
Copy link

I came to request exactly option B. My application is to use MICE as part of a processing pipeline for predictive models, where the standard is to have a true hold-out set. In deployment I also don't want to update the estimation - just apply it (including the imputation model) to the example at hand. It's kind of complicated with predictive mean matching because (I think) the complete training dataset is needed at inference time to select the replacements at each step. Exporting the model to something light weight would be huge.

@JonNDCN
Copy link

JonNDCN commented May 28, 2021

I'd be very happy with A.

@angelrodriguez2020
Copy link

Dear all,
I'd like to know the current status of this issue.
In fact, I only need the pooled predicted values, not their standard errors, so I suppose averaging the predictions would be enough. If mice does not provide that yet, would this script be sensible (13 imputations)?:

fit <- with(impsppb, glm(compo ~ edad+sexo+fumarw2,family = quasibinomial))
for (i in 1:13) {
assign(paste0("fit",i), data.frame(fit[["analyses"]][[i]][["data"]][["hi1"]],fit[["analyses"]][[i]][["fitted.values"]]))
assign(paste0("fit",i), names<-(get(paste0("fit",i)), c("hi1", "pred")))
}
fittot<-Reduce(function(x,y) merge(x = x, y = y, by = "hi1"),
list(fit1,fit2,fit3,fit4,fit5,fit6,fit7,fit8,fit9,fit10,fit11,fit12,fit13))
fittot$predef <- rowMeans(fittot[,c(2:14)], na.rm=TRUE)
fittot <- fittot[c(1,15)]

Thank you.

@ASKurz
Copy link

ASKurz commented Oct 20, 2021

Option A would be spectacular. It's exactly what I'd want to teach to my students.

@stjeandenise
Copy link

Another vote for Option A!! I came looking for exactly that.

@ASKurz
Copy link

ASKurz commented Nov 12, 2021

For all other fans of Option A, here's a blog post showing how you can do it by hand: https://solomonkurz.netlify.app/post/2021-10-21-if-you-fit-a-model-with-multiply-imputed-data-you-can-still-plot-the-line/

@stjeandenise
Copy link

For all other fans of Option A, here's a blog post showing how you can do it by hand: https://solomonkurz.netlify.app/post/2021-10-21-if-you-fit-a-model-with-multiply-imputed-data-you-can-still-plot-the-line/

🤗 you are my favorite person @ASKurz

@ASKurz
Copy link

ASKurz commented Nov 12, 2021

Cheers! 🍻

@JonNDCN
Copy link

JonNDCN commented Nov 12, 2021

Thank you so much @ASKurz, just what I needed

@stefvanbuuren
Copy link
Member

stefvanbuuren commented Nov 14, 2021

The script below uses the @ASKurz example in a more micy way. It is essentially Rubin's rules applied to model predictions instead of model parameters.

library(mice)
library(dplyr)

# generate data
set.seed(201)
b_small <-
  brandsma %>%
  filter(!complete.cases(.)) %>%
  slice_sample(n = 50) %>%
  select(ses, iqv, iqp)

# impute & analyse as usual
imp <- mice(b_small, seed = 540, m = 10, method = "norm", print = FALSE)
fit <- with(imp, lm(iqv ~ 1 + ses))

# obtain predictions Q and prediction variance U
predm <- lapply(getfit(fit), predict, se.fit = TRUE)
Q <- sapply(predm, `[[`, "fit")
U <- sapply(predm, `[[`, "se.fit")^2
dfcom <- predm[[1]]$df

# pool predictions
pred <- matrix(NA, nrow = nrow(Q), ncol = 3,
                 dimnames = list(NULL, c("fit", "se.fit", "df")))
for(i in 1:nrow(Q)) {
  pi <- pool.scalar(Q[i, ], U[i, ], n = dfcom + 1)
  pred[i, 1] <- pi[["qbar"]]
  pred[i, 2] <- sqrt(pi[["t"]])
  pred[i, 3] <- pi[["df"]]
}

Perhaps we could evolve this into a predict.mira() function.

@markdanese
Copy link

FWIW, predm[[1]] doesn't have a df object for glm, so dfcom ends up being NULL. It probably makes sense to get it from getfit(fit)[[1]]$df.null assuming the null df is the correct one. Apologies if I am missing something obvious.

@stefvanbuuren
Copy link
Member

@markdanese Thanks for flagging this. The code is an example. It works for lm(), but needs tweaking for other models.

@RStellato
Copy link

Perhaps we could evolve this into a predict.mira() function.

Would really appreciate that implementation!

@reetakankaanpaa
Copy link

Hi, the option A above looks tempting, I'm trying to plot predicted values from lme model using multilevel multiple imputed longitudinal data.

However, when I run the code on the third step,
predlist <- lapply(fit$analysis, broom::augment.xxx, se.fit = TRUE)
I repeatedly get an error: 'augment.xxx' is not an exported object from 'namespace:broom', regardless of whether I use broom.mixed::augment.lme.

I tried it with nlme Orthodent data too, using
fit <- with(imp, lm(distance~age*Sex, data = Orthodont))
predlist <- lapply(fit$analysis, broom::augment, se.fit = TRUE, data = Orthodont)
predmat <- pool(predlist)
Here, I get an error: Error in object$analyses[[1]] : subscript out of bounds

Could someone please give advice, how to get fitted values for plotting?

I also found the instructions from Haile, Sarah, 2019, Multiple Imputation using R, and noticed that "However van Buuren appears to disagree with this approach, which is perhaps why it’s not part of mice yet."

@stefvanbuuren
Copy link
Member

Your code says fit$analysis. Does it help to say fit$analyses?

I don't know Haile.

@ne-jos
Copy link

ne-jos commented Apr 14, 2023

I am trying to calculate predicted probabilities from a multiply imputed mixed effect logistic regression model:

df_select <- df_collapsed %>% select("op.day","age.admission.nmds", "gender", "ethnicity2", "asa2", "C3.score.categories", "extent.nzcr", "site", "dep13.nmds.quint", "op.year", "mortality.90d", "dhb.nmds", "days.to.surg")
imp <- mice(df_select, m=10)
fit <- with(imp, glmer(mortality.90d ~ op.day + age.admission.nmds + gender + ethnicity2 + asa2 + C3.score.categories + extent.nzcr + site + dep13.nmds.quint + op.year + days.to.surg + (1 | dhb.nmds), family = binomial()))

I have tried option A but I am running into the same problem as mentioned above, swtiching 'analysis' for 'analyses' does not seem to help. Any suggestions would be greatly appreciated.

@thomvolker
Copy link
Member

Perhaps using broom.mixed:::augment.merMod will give you the correct predictions. However, pool() will not work, because pooling currently only works on the level of the analyses (i.e., pooling the estimated parameters), not on the level of the data (i.e., pooling the predictions). If you want to pool the predictions, you will have to average the predictions and calculate the variances manually.

@stefvanbuuren
Copy link
Member

stefvanbuuren commented Aug 22, 2023 via email

@fthielen
Copy link

@fthielen
Copy link

The script below uses the @ASKurz example in a more micy way. It is essentially Rubin's rules applied to model predictions instead of model parameters.

library(mice)
library(dplyr)

# generate data
set.seed(201)
b_small <-
  brandsma %>%
  filter(!complete.cases(.)) %>%
  slice_sample(n = 50) %>%
  select(ses, iqv, iqp)

# impute & analyse as usual
imp <- mice(b_small, seed = 540, m = 10, method = "norm", print = FALSE)
fit <- with(imp, lm(iqv ~ 1 + ses))

# obtain predictions Q and prediction variance U
predm <- lapply(getfit(fit), predict, se.fit = TRUE)
Q <- sapply(predm, `[[`, "fit")
U <- sapply(predm, `[[`, "se.fit")^2
dfcom <- predm[[1]]$df

# pool predictions
pred <- matrix(NA, nrow = nrow(Q), ncol = 3,
                 dimnames = list(NULL, c("fit", "se.fit", "df")))
for(i in 1:nrow(Q)) {
  pi <- pool.scalar(Q[i, ], U[i, ], n = dfcom + 1)
  pred[i, 1] <- pi[["qbar"]]
  pred[i, 2] <- sqrt(pi[["t"]])
  pred[i, 3] <- pi[["df"]]
}

Perhaps we could evolve this into a predict.mira() function.

Is this the most recent solution for lm() models? predict.mira() does not seem to be implemented (yet).

@gerkovink
Copy link
Member

Voilà. Will need to think about the default choices that are made in predict.lm() before we implement this further.

library(purrr)
library(magrittr)
library(dplyr)
library(mice)

# generate data
set.seed(201)
b_small <-
  brandsma %>%
  filter(!complete.cases(.)) %>%
  slice_sample(n = 50) %>%
  select(ses, iqv, iqp)

imp <- b_small %>% 
  mice(seed = 540, m = 10, method = "norm", print = FALSE)

# fit with map() workflow
fit1 <- imp %>% 
  complete("all") %>%
  map(~.x %$% lm(iqv ~ 1 + ses))

# fit with with.mids() workflow
fit2 <- with(imp, lm(iqv ~ 1 + ses))

# function to pool individual values cf. Rubin's rules
predict.mira <- function(fit, tiny = TRUE){
  # preprocess to allow for flexible workflows
  if("mira" %in% class(fit)){
    fit <- fit %>% 
      .$analyses # extract the fitted analyses
  } 
  # create predictions and intervals
  pred <- fit %>% 
    map(~.x %>%  
          predict(se.fit = TRUE) %>% # create predictions with corresponding SE
          as.data.frame() %>% 
          select(fit, se.fit, df)) # purge irrelevant
  
  # pool predictions
  # convert predictions to array and call pool.scalar over dimensions 1 and 2
  pooled <- array(unlist(pred), dim = c(dim(pred[[1]]), length(pred))) %>% 
    apply(., c(1), function(x) pool.scalar(Q = x[1,], 
                                           U = x[2,]^2, 
                                           n = x[3,1] + 1)) %>% 
    sapply(unlist) %>% 
    t() %>% 
    as_tibble() %>% 
    mutate(se.fit = sqrt(t))
  
  if(!tiny){
    return(pooled)
  } else 
    return(pooled %>% select(qbar, se.fit, df))
}

#tiny output
predict.mira(fit1)
#> # A tibble: 50 × 3
#>       qbar se.fit    df
#>      <dbl>  <dbl> <dbl>
#>  1 -1.57    0.489 27.2 
#>  2 -1.12    1.29   3.45
#>  3 -0.578   0.335 29.3 
#>  4  2.40    0.719 31.1 
#>  5  0.450   0.865  4.94
#>  6  0.239   1.10   3.36
#>  7  0.0420  0.315 32.0 
#>  8 -0.578   0.335 29.3 
#>  9  0.0420  0.315 32.0 
#> 10  1.03    0.430 32.7 
#> # ℹ 40 more rows
predict.mira(fit2)
#> # A tibble: 50 × 3
#>       qbar se.fit    df
#>      <dbl>  <dbl> <dbl>
#>  1 -1.57    0.489 27.2 
#>  2 -1.12    1.29   3.45
#>  3 -0.578   0.335 29.3 
#>  4  2.40    0.719 31.1 
#>  5  0.450   0.865  4.94
#>  6  0.239   1.10   3.36
#>  7  0.0420  0.315 32.0 
#>  8 -0.578   0.335 29.3 
#>  9  0.0420  0.315 32.0 
#> 10  1.03    0.430 32.7 
#> # ℹ 40 more rows
# full output
predict.mira(fit1, tiny = FALSE)
#> # A tibble: 50 × 29
#>        m    qhat1    qhat2  qhat3   qhat4   qhat5  qhat6  qhat7   qhat8   qhat9
#>    <dbl>    <dbl>    <dbl>  <dbl>   <dbl>   <dbl>  <dbl>  <dbl>   <dbl>   <dbl>
#>  1    10 -1.34    -1.73    -1.90  -1.81   -1.58   -1.46  -1.18  -1.41   -1.47  
#>  2    10  0.135    0.0911  -0.663 -0.291  -2.49   -3.18  -2.03  -1.06   -0.122 
#>  3    10 -0.512   -0.665   -0.821 -0.661  -0.674  -0.439 -0.266 -0.520  -0.573 
#>  4    10  1.97     2.54     2.42   2.79    2.05    2.61   2.49   2.14    2.12  
#>  5    10 -0.794   -0.130    1.18   0.263   0.818  -0.448  0.973  1.57    0.275 
#>  6    10 -1.25     1.64     1.14   0.552   0.0164 -1.01   0.966 -0.796   0.149 
#>  7    10  0.00583  0.00304 -0.146  0.0569 -0.106   0.196  0.308  0.0340 -0.0114
#>  8    10 -0.512   -0.665   -0.821 -0.661  -0.674  -0.439 -0.266 -0.520  -0.573 
#>  9    10  0.00583  0.00304 -0.146  0.0569 -0.106   0.196  0.308  0.0340 -0.0114
#> 10    10  0.835    1.07     0.933  1.21    0.803   1.21   1.23   0.921   0.886 
#> # ℹ 40 more rows
#> # ℹ 19 more variables: qhat10 <dbl>, u1 <dbl>, u2 <dbl>, u3 <dbl>, u4 <dbl>,
#> #   u5 <dbl>, u6 <dbl>, u7 <dbl>, u8 <dbl>, u9 <dbl>, u10 <dbl>, qbar <dbl>,
#> #   ubar <dbl>, b <dbl>, t <dbl>, df <dbl>, r <dbl>, fmi <dbl>, se.fit <dbl>
predict.mira(fit2, tiny = FALSE)
#> # A tibble: 50 × 29
#>        m    qhat1    qhat2  qhat3   qhat4   qhat5  qhat6  qhat7   qhat8   qhat9
#>    <dbl>    <dbl>    <dbl>  <dbl>   <dbl>   <dbl>  <dbl>  <dbl>   <dbl>   <dbl>
#>  1    10 -1.34    -1.73    -1.90  -1.81   -1.58   -1.46  -1.18  -1.41   -1.47  
#>  2    10  0.135    0.0911  -0.663 -0.291  -2.49   -3.18  -2.03  -1.06   -0.122 
#>  3    10 -0.512   -0.665   -0.821 -0.661  -0.674  -0.439 -0.266 -0.520  -0.573 
#>  4    10  1.97     2.54     2.42   2.79    2.05    2.61   2.49   2.14    2.12  
#>  5    10 -0.794   -0.130    1.18   0.263   0.818  -0.448  0.973  1.57    0.275 
#>  6    10 -1.25     1.64     1.14   0.552   0.0164 -1.01   0.966 -0.796   0.149 
#>  7    10  0.00583  0.00304 -0.146  0.0569 -0.106   0.196  0.308  0.0340 -0.0114
#>  8    10 -0.512   -0.665   -0.821 -0.661  -0.674  -0.439 -0.266 -0.520  -0.573 
#>  9    10  0.00583  0.00304 -0.146  0.0569 -0.106   0.196  0.308  0.0340 -0.0114
#> 10    10  0.835    1.07     0.933  1.21    0.803   1.21   1.23   0.921   0.886 
#> # ℹ 40 more rows
#> # ℹ 19 more variables: qhat10 <dbl>, u1 <dbl>, u2 <dbl>, u3 <dbl>, u4 <dbl>,
#> #   u5 <dbl>, u6 <dbl>, u7 <dbl>, u8 <dbl>, u9 <dbl>, u10 <dbl>, qbar <dbl>,
#> #   ubar <dbl>, b <dbl>, t <dbl>, df <dbl>, r <dbl>, fmi <dbl>, se.fit <dbl>

Created on 2024-06-10 with reprex v2.1.0

@stefvanbuuren
Copy link
Member

Since predict() is pretty well adopted and implemented, I assume this might work more generally, i.e., for any object for which there is a predict method. Right?

Would there be value in using broom::augment for this feature?

@gerkovink
Copy link
Member

Yes; should work universally. Don't know if broom::augment() would apply to all of those.

@darentsai
Copy link

darentsai commented Aug 30, 2024

Dear @stefvanbuuren

Thank you for the example code for model prediction using imputed data. In that example, what you actually calculated were the "fitted values", not the "predicted values", because you didn't input a independent test dataset into predict() function.

If there is a separate test dataset without any missing values, assumed to be collected in the future,

b_small_test <- brandsma %>%
  filter( complete.cases(.) ) %>%  # This line preserves all the complete records
  slice_sample(n = 20) %>%
  select(ses, iqv, iqp)

we can just add newdata = b_small_test into predict(), and the remaining part works perfectly.

# obtain predictions Q and prediction variance U
predm <- lapply(getfit(fit), predict, newdata = b_small_test, se.fit = TRUE)

However, more realistically, if the test dataset also includes missing values,

b_small_test_with_na <- mice::ampute(b_small_test)$amp

then newdata = b_small_test_with_na will lead to the predictions also being missing values. How can the example code be modified to handle this situation?


I think we can use the imputed dataset from the model training phase to fill in missing values in the test set.

imp_test <- mice.mids(imp, newdata = b_small_test_with_na, maxit = 5)

Unfortunately, imp_test is still a multiply imputed object with m=10 imputed results, and I cannot figure out how to accommodate it into lapply(getfit(fit), predict, se.fit = TRUE).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests