Request: make predictions possible after using pool() #82
Comments
There are actually multiple ways to create predictions from multiply imputed data. See Obtaining Predictions from Models Fit to Multiply Imputed Data for a comparison of two methods. I tend to deviate from Miles' recommendation. Also see the discussion by Wood et al., who highlight the decisions to be made when accounting for model optimism. It should be relatively straightforward to implement any of these approaches in mice. |
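For concreteness, here is a minimal sketch of the two routes for a simple linear model, using the nhanes toy data shipped with mice. It is not the code from either paper, and the new-data values are illustrative only.
library(mice)
imp <- mice(nhanes, m = 5, print = FALSE, seed = 123)
fit <- with(imp, lm(chl ~ age + bmi))
newdat <- data.frame(age = c(1, 2, 3), bmi = c(22, 27, 32))
# Route 1: "combine, then predict" -- apply the pooled coefficients to new data
beta <- summary(pool(fit))$estimate
X <- model.matrix(~ age + bmi, data = newdat)
pred_combine_then_predict <- as.vector(X %*% beta)
# Route 2: "predict, then combine" -- predict from each fitted analysis, then average
pred_per_imputation <- sapply(getfit(fit), predict, newdata = newdat)
pred_predict_then_combine <- rowMeans(pred_per_imputation)
For a linear model the two point predictions coincide; for nonlinear links they can differ, which is where the comparison in the Miles paper becomes relevant.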
Just want to add that this feature would be quite useful. I'm a bit surprised by the recommendations in the Miles paper; it doesn't really show the regimes where combine and predict would break down substantially. |
Any word on this feature? Would be very helpful! |
I have done some thinking on mice prediction workflows. Here are two possibilities: one involving only a train data set, another involving both a train and a test data set.
A. One data set: train
1. Standard mice: imp <- mice(train, …)
2. Standard complete-data analysis: fit <- with(imp, mymodel, …)
3. Calculate predictions: predlist <- lapply(fit$analyses, broom::augment.xxx, se.fit = TRUE, …) - We may need the data argument for some procedures. Perhaps use map().
4. Apply Rubin's rules to pool predictions: predmat <- pool(predlist, …), which results in an n × 3 matrix: .fitted, .residuals, .se.fit
5. Bind to data: ready <- bind_cols(train, predmat)
B. Two data sets: train and test
1. Standard mice: imp.train <- mice(train, …)
2. Impute test data: imp.test <- mice.mids(imp.train, newdata = test)
3. fit <- with(imp.test, mymodel, …)
4. Apply steps A3 and A4
5. Bind to data: ready <- bind_cols(test, predmat)
A and B are extreme scenarios, and obviously we can mix aspects of both. What I would like to know is whether A or B as described would be useful for applications. Suggestions for alternatives and tweaks are welcome. A rough sketch of workflow A with current tools follows below. |
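A rough sketch of workflow A with the tools available today. Step A4 is not yet implemented in pool(), so the pooling below is done by hand (Rubin's rules for the point estimate and total variance). The nhanes toy data stands in for train, broom must be installed, and all object names are illustrative.
library(mice)
library(dplyr)
imp <- mice(nhanes, m = 5, print = FALSE, seed = 1)             # A1: impute
fit <- with(imp, lm(chl ~ age + bmi))                           # A2: complete-data analysis
predlist <- lapply(getfit(fit), broom::augment, se_fit = TRUE)  # A3: per-imputation predictions
Qhat <- sapply(predlist, `[[`, ".fitted")                       # n x m matrix of predictions
Uhat <- sapply(predlist, `[[`, ".se.fit")^2                     # within-imputation variances
m <- ncol(Qhat)
predmat <- data.frame(                                          # A4: pool by hand
  .fitted = rowMeans(Qhat),
  .se.fit = sqrt(rowMeans(Uhat) + (1 + 1 / m) * apply(Qhat, 1, var))
)
ready <- bind_cols(nhanes, predmat)                             # A5: bind to the data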
I came to request exactly option B. My application is to use MICE as part of a processing pipeline for predictive models, where the standard is to have a true hold-out set. In deployment I also don't want to update the estimation, just apply it (including the imputation model) to the example at hand. It's kind of complicated with predictive mean matching because (I think) the complete training dataset is needed at inference time to select the replacements at each step. Exporting the model to something lightweight would be huge. |
I'd be very happy with A. |
Dear all, fit <- with(impsppb, glm(compo ~ edad+sexo+fumarw2,family = quasibinomial)) Thank you. |
Option A would be spectacular. It's exactly what I'd want to teach to my students. |
Another vote for Option A!! I came looking for exactly that. |
For all other fans of Option A, here's a blog post showing how you can do it by hand: https://solomonkurz.netlify.app/post/2021-10-21-if-you-fit-a-model-with-multiply-imputed-data-you-can-still-plot-the-line/ |
🤗 you are my favorite person @ASKurz |
Cheers! 🍻 |
Thank you so much @ASKurz, just what I needed |
The script below uses the @ASKurz example in a more micy way. It is essentially Rubin's rules applied to model predictions instead of model parameters.
library(mice)
library(dplyr)
# generate data
set.seed(201)
b_small <-
brandsma %>%
filter(!complete.cases(.)) %>%
slice_sample(n = 50) %>%
select(ses, iqv, iqp)
# impute & analyse as usual
imp <- mice(b_small, seed = 540, m = 10, method = "norm", print = FALSE)
fit <- with(imp, lm(iqv ~ 1 + ses))
# obtain predictions Q and prediction variance U
predm <- lapply(getfit(fit), predict, se.fit = TRUE)
Q <- sapply(predm, `[[`, "fit")
U <- sapply(predm, `[[`, "se.fit")^2
dfcom <- predm[[1]]$df
# pool predictions
pred <- matrix(NA, nrow = nrow(Q), ncol = 3,
dimnames = list(NULL, c("fit", "se.fit", "df")))
for(i in 1:nrow(Q)) {
pi <- pool.scalar(Q[i, ], U[i, ], n = dfcom + 1)
pred[i, 1] <- pi[["qbar"]]
pred[i, 2] <- sqrt(pi[["t"]])
pred[i, 3] <- pi[["df"]]
}
Perhaps we could evolve this into a predict() method. |
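Continuing the script above (same objects b_small and pred; an illustration, not part of mice), the pooled fit, se.fit and df can be turned into interval estimates and bound back to the data, as in step A5.
# 95% confidence intervals from the pooled predictions
alpha <- 0.05
crit  <- qt(1 - alpha / 2, pred[, "df"])
ready <- cbind(
  b_small,
  fit = pred[, "fit"],
  lwr = pred[, "fit"] - crit * pred[, "se.fit"],
  upr = pred[, "fit"] + crit * pred[, "se.fit"]
)
head(ready)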
FWIW, |
@markdanese Thanks for flagging this. The code is an example. It works for |
Would really appreciate that implementation! |
Hi, option A above looks tempting. I'm trying to plot predicted values from an lme model fitted to multilevel, multiply imputed longitudinal data. However, I run into a problem at the third step of the code; I tried it with the nlme Orthodont data too. Could someone please give advice on how to get fitted values for plotting? I also found the instructions in Haile, Sarah, 2019, Multiple Imputation using R, and noticed that "However van Buuren appears to disagree with this approach, which is perhaps why it's not part of mice yet." |
Your code says I don't know Haile. |
I am trying to calculate predicted probabilities from a multiply imputed mixed-effects logistic regression model:
df_select <- df_collapsed %>% select("op.day", "age.admission.nmds", "gender", "ethnicity2", "asa2", "C3.score.categories", "extent.nzcr", "site", "dep13.nmds.quint", "op.year", "mortality.90d", "dhb.nmds", "days.to.surg")
I have tried option A but I am running into the same problem as mentioned above; switching 'analysis' for 'analyses' does not seem to help. Any suggestions would be greatly appreciated. |
Perhaps using |
Please look at #32, #299 and the `ignore` argument in `mice()`
Quoting jeandigitale:
* Estimate the imputation model on train only;
* Impute the missing data in both train and test;
Like @cryanking<https://github.com/cryanking>, I'm trying to use imputation in a real-time prediction pipeline and am looking for a way to estimate imputation models with training data and then apply those models to test data without having to re-estimate them. I have been searching for a package that has this feature with no luck yet!
|
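A minimal sketch of the train/test idea with the `ignore` argument in `mice()`: rows flagged as ignored are still imputed, but do not contribute to estimating the imputation model. The nhanes toy data and the 10-row "test" split are illustrative only.
library(mice)
ignore <- rep(FALSE, nrow(nhanes))
ignore[1:10] <- TRUE                     # pretend the first 10 rows are the test set
imp <- mice(nhanes, ignore = ignore, m = 5, print = FALSE, seed = 1)
# completed data sets: the ignored rows were imputed with a model
# estimated on the non-ignored ("training") rows only
test_completed  <- lapply(complete(imp, "all"), function(d) d[ignore, ])
train_completed <- lapply(complete(imp, "all"), function(d) d[!ignore, ])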
This link has changed and is now: https://solomonkurz.netlify.app/blog/2021-10-21-if-you-fit-a-model-with-multiply-imputed-data-you-can-still-plot-the-line/ |
Is this the most recent solution for lm() models? |
Voilà. Will need to think about the default choices that are made in the predict.mira() sketch below.
library(purrr)
library(magrittr)
library(dplyr)
library(mice)
# generate data
set.seed(201)
b_small <-
brandsma %>%
filter(!complete.cases(.)) %>%
slice_sample(n = 50) %>%
select(ses, iqv, iqp)
imp <- b_small %>%
mice(seed = 540, m = 10, method = "norm", print = FALSE)
# fit with map() workflow
fit1 <- imp %>%
complete("all") %>%
map(~.x %$% lm(iqv ~ 1 + ses))
# fit with with.mids() workflow
fit2 <- with(imp, lm(iqv ~ 1 + ses))
# function to pool individual values cf. Rubin's rules
predict.mira <- function(fit, tiny = TRUE){
# preprocess to allow for flexible workflows
if("mira" %in% class(fit)){
fit <- fit %>%
.$analyses # extract the fitted analyses
}
# create predictions and intervals
pred <- fit %>%
map(~.x %>%
predict(se.fit = TRUE) %>% # create predictions with corresponding SE
as.data.frame() %>%
select(fit, se.fit, df)) # purge irrelevant
# pool predictions
# convert predictions to array and call pool.scalar over dimensions 1 and 2
pooled <- array(unlist(pred), dim = c(dim(pred[[1]]), length(pred))) %>%
apply(., c(1), function(x) pool.scalar(Q = x[1,],
U = x[2,]^2,
n = x[3,1] + 1)) %>%
sapply(unlist) %>%
t() %>%
as_tibble() %>%
mutate(se.fit = sqrt(t))
if(!tiny){
return(pooled)
} else
return(pooled %>% select(qbar, se.fit, df))
}
#tiny output
predict.mira(fit1)
#> # A tibble: 50 × 3
#> qbar se.fit df
#> <dbl> <dbl> <dbl>
#> 1 -1.57 0.489 27.2
#> 2 -1.12 1.29 3.45
#> 3 -0.578 0.335 29.3
#> 4 2.40 0.719 31.1
#> 5 0.450 0.865 4.94
#> 6 0.239 1.10 3.36
#> 7 0.0420 0.315 32.0
#> 8 -0.578 0.335 29.3
#> 9 0.0420 0.315 32.0
#> 10 1.03 0.430 32.7
#> # ℹ 40 more rows
predict.mira(fit2)
#> # A tibble: 50 × 3
#> qbar se.fit df
#> <dbl> <dbl> <dbl>
#> 1 -1.57 0.489 27.2
#> 2 -1.12 1.29 3.45
#> 3 -0.578 0.335 29.3
#> 4 2.40 0.719 31.1
#> 5 0.450 0.865 4.94
#> 6 0.239 1.10 3.36
#> 7 0.0420 0.315 32.0
#> 8 -0.578 0.335 29.3
#> 9 0.0420 0.315 32.0
#> 10 1.03 0.430 32.7
#> # ℹ 40 more rows
# full output
predict.mira(fit1, tiny = FALSE)
#> # A tibble: 50 × 29
#> m qhat1 qhat2 qhat3 qhat4 qhat5 qhat6 qhat7 qhat8 qhat9
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 10 -1.34 -1.73 -1.90 -1.81 -1.58 -1.46 -1.18 -1.41 -1.47
#> 2 10 0.135 0.0911 -0.663 -0.291 -2.49 -3.18 -2.03 -1.06 -0.122
#> 3 10 -0.512 -0.665 -0.821 -0.661 -0.674 -0.439 -0.266 -0.520 -0.573
#> 4 10 1.97 2.54 2.42 2.79 2.05 2.61 2.49 2.14 2.12
#> 5 10 -0.794 -0.130 1.18 0.263 0.818 -0.448 0.973 1.57 0.275
#> 6 10 -1.25 1.64 1.14 0.552 0.0164 -1.01 0.966 -0.796 0.149
#> 7 10 0.00583 0.00304 -0.146 0.0569 -0.106 0.196 0.308 0.0340 -0.0114
#> 8 10 -0.512 -0.665 -0.821 -0.661 -0.674 -0.439 -0.266 -0.520 -0.573
#> 9 10 0.00583 0.00304 -0.146 0.0569 -0.106 0.196 0.308 0.0340 -0.0114
#> 10 10 0.835 1.07 0.933 1.21 0.803 1.21 1.23 0.921 0.886
#> # ℹ 40 more rows
#> # ℹ 19 more variables: qhat10 <dbl>, u1 <dbl>, u2 <dbl>, u3 <dbl>, u4 <dbl>,
#> # u5 <dbl>, u6 <dbl>, u7 <dbl>, u8 <dbl>, u9 <dbl>, u10 <dbl>, qbar <dbl>,
#> # ubar <dbl>, b <dbl>, t <dbl>, df <dbl>, r <dbl>, fmi <dbl>, se.fit <dbl>
predict.mira(fit2, tiny = FALSE)
#> # A tibble: 50 × 29
#> m qhat1 qhat2 qhat3 qhat4 qhat5 qhat6 qhat7 qhat8 qhat9
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 10 -1.34 -1.73 -1.90 -1.81 -1.58 -1.46 -1.18 -1.41 -1.47
#> 2 10 0.135 0.0911 -0.663 -0.291 -2.49 -3.18 -2.03 -1.06 -0.122
#> 3 10 -0.512 -0.665 -0.821 -0.661 -0.674 -0.439 -0.266 -0.520 -0.573
#> 4 10 1.97 2.54 2.42 2.79 2.05 2.61 2.49 2.14 2.12
#> 5 10 -0.794 -0.130 1.18 0.263 0.818 -0.448 0.973 1.57 0.275
#> 6 10 -1.25 1.64 1.14 0.552 0.0164 -1.01 0.966 -0.796 0.149
#> 7 10 0.00583 0.00304 -0.146 0.0569 -0.106 0.196 0.308 0.0340 -0.0114
#> 8 10 -0.512 -0.665 -0.821 -0.661 -0.674 -0.439 -0.266 -0.520 -0.573
#> 9 10 0.00583 0.00304 -0.146 0.0569 -0.106 0.196 0.308 0.0340 -0.0114
#> 10 10 0.835 1.07 0.933 1.21 0.803 1.21 1.23 0.921 0.886
#> # ℹ 40 more rows
#> # ℹ 19 more variables: qhat10 <dbl>, u1 <dbl>, u2 <dbl>, u3 <dbl>, u4 <dbl>,
#> # u5 <dbl>, u6 <dbl>, u7 <dbl>, u8 <dbl>, u9 <dbl>, u10 <dbl>, qbar <dbl>,
#> # ubar <dbl>, b <dbl>, t <dbl>, df <dbl>, r <dbl>, fmi <dbl>, se.fit <dbl>
Created on 2024-06-10 with reprex v2.1.0 |
Would there be value in using broom::augment for this feature? |
Yes; should work universally. Don't know if |
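A rough sketch of what the augment-based route could look like for the lm example above. This is not an implemented mice feature; se_fit is the argument name used by current broom for lm objects.
library(mice)
library(dplyr)
library(broom)
set.seed(201)
b_small <- brandsma %>%
  filter(!complete.cases(.)) %>%
  slice_sample(n = 50) %>%
  select(ses, iqv, iqp)
imp <- mice(b_small, seed = 540, m = 10, method = "norm", print = FALSE)
fit <- with(imp, lm(iqv ~ 1 + ses))
# per-imputation augment() output instead of predict()
aug <- lapply(getfit(fit), augment, se_fit = TRUE)
Q <- sapply(aug, `[[`, ".fitted")
U <- sapply(aug, `[[`, ".se.fit")^2
dfcom <- df.residual(getfit(fit)[[1]])
# pool row by row with Rubin's rules, as in the scripts above
pooled <- t(sapply(seq_len(nrow(Q)), function(i) {
  p <- pool.scalar(Q[i, ], U[i, ], n = dfcom + 1)
  c(fit = p$qbar, se.fit = sqrt(p$t), df = p$df)
}))
head(pooled)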
Dear @stefvanbuuren Thank you for the example code for model prediction using imputed data. In that example, what you actually calculated were the "fitted values", not the "predicted values", because you didn't pass an independent test dataset to predict(). If there is a separate test dataset without any missing values, assumed to be collected in the future,
b_small_test <- brandsma %>%
filter( complete.cases(.) ) %>% # This line preserves all the complete records
slice_sample(n = 20) %>%
select(ses, iqv, iqp)
we can just add the newdata argument to the predict() call:
# obtain predictions Q and prediction variance U
predm <- lapply(getfit(fit), predict, newdata = b_small_test, se.fit = TRUE)
However, more realistically, if the test dataset also includes missing values,
b_small_test_with_na <- mice::ampute(b_small_test)$amp
then I think we can use the imputed dataset from the model training phase to fill in missing values in the test set.
imp_test <- mice.mids(imp, newdata = b_small_test_with_na, maxit = 5)
Unfortunately, |
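Continuing the snippet above (predm as just defined, and pool.scalar from mice), the newdata predictions can be pooled row by row exactly as in the earlier script. This is a sketch, not an implemented feature.
Q <- sapply(predm, `[[`, "fit")
U <- sapply(predm, `[[`, "se.fit")^2
dfcom <- predm[[1]]$df
pred_test <- matrix(NA, nrow = nrow(Q), ncol = 3,
                    dimnames = list(NULL, c("fit", "se.fit", "df")))
for (i in 1:nrow(Q)) {
  pi <- pool.scalar(Q[i, ], U[i, ], n = dfcom + 1)
  pred_test[i, ] <- c(pi[["qbar"]], sqrt(pi[["t"]]), pi[["df"]])
}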
When using mice multiple imputation to generate multiple imputed sets, then running a model on each one of them (like elastic-net from the glmnet package), and then pooling the estimates to a single set, it's not possible to use predict.glmnet() (or predict()) on the object that comes out of mice::pool(). The prediction functions expect a fitted model object, while mice::pool() returns an object of class mipo. Is it possible to add the ability to make predictions from the return object of mice::pool()?
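A rough per-imputation sketch for the glmnet case described in this issue: fit the elastic net on each completed data set, predict from each fit, and combine the predictions (simple averaging here; Rubin's rules for the prediction uncertainty are illustrated in the comments above). The nhanes2 toy data and all object names are illustrative.
library(mice)
library(glmnet)
imp <- mice(nhanes2, m = 5, print = FALSE, seed = 1)
preds <- sapply(complete(imp, "all"), function(d) {
  x <- model.matrix(chl ~ age + bmi + hyp, data = d)[, -1]  # predictor matrix without intercept
  fit <- cv.glmnet(x, d$chl, alpha = 0.5, nfolds = 5)       # elastic net per imputation
  as.vector(predict(fit, newx = x, s = "lambda.min"))
})
# combine: average the m sets of predictions
yhat <- rowMeans(preds)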