-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reference for computing bootstrap estimates of confidence intervals? #67
Comments
Hi vadori, I'm glad the package was helpful. Regarding your questions:
|
Dear @Thie1e This is amazing, thank you so much for your reply! I am now experimenting. I will let you know how it goes. Best, EDIT: Everything is going great, I love your package (and your paper, which allowed be to understand lots of details). I have some questions:
Thank you!!! |
Hi, unfortunately, there is no built-in parameter for customizing the number of digits in the output of the bootstrap summary. I am going to try to add this to an upcoming version. There are two solutions: First, you could calculate the summary manually by extracting the bootstrap results from library(cutpointr)
library(tidyverse)
cp <- cutpointr(data = suicide,
x = dsi,
class = suicide,
method = maximize_metric,
metric = sum_sens_spec,
boot_runs = 300)
#> Assuming the positive class is yes
#> Assuming the positive class has higher x values
#> Running bootstrap...
# New parameter boot_digits
my_summary_func <- function (x, digits = 4, boot_digits = 3)
{
cat(paste("Method:", x$cutpointr[[1]]$method, "\n"))
cat(paste("Predictor:", x$cutpointr[[1]]$predictor, "\n"))
cat(paste("Outcome:", x$cutpointr[[1]]$outcome, "\n"))
cat(paste("Direction:", x$cutpointr[[1]]$direction, "\n"))
if (cutpointr:::has_column(x$cutpointr[[1]], "subgroup")) {
cat(c("Subgroups:", paste(purrr::map(x$cutpointr, ~.$subgroup),
collapse = ", "), "\n"))
}
if (cutpointr:::has_boot_results(x)) {
cat(paste("Nr. of bootstraps:", x$boot_runs[1], "\n"))
}
for (i in 1:nrow(x)) {
cat("\n")
if (cutpointr:::has_column(x$cutpointr[[i]], "subgroup")) {
cat(paste("Subgroup:", x$cutpointr[[i]]$subgroup,
"\n"))
cat(paste0(rep("-", getOption("width")), collapse = ""),
"\n")
}
tempdat <- x$cutpointr[[i]] %>%
dplyr::select(.data$AUC) %>%
round(digits = digits) %>%
dplyr::mutate(n = x$n_obs[i],
n_pos = x$n_pos[i], n_neg = x$n_neg[i]) %>%
as.data.frame()
cutpointr:::print_df_nodat(tempdat, row.names = FALSE)
cat("\n")
purrr::map_df(1:length(x$cutpointr[[i]]$optimal_cutpoint[[1]]),
function(j) {
x$cutpointr[[i]] %>%
dplyr::select(
!tidyselect::any_of(c("direction",
"subgroup",
"method",
"AUC",
"pos_class",
"neg_class",
"prevalence",
"outcome",
"predictor",
"grouping",
"data",
"roc_curve",
"boot"))) %>%
purrr::map_df(cutpointr:::get_fnth, n = j)}) %>%
as.data.frame %>%
dplyr::left_join(y = x$confusion_matrix[[i]],
by = c(optimal_cutpoint = "cutpoint")) %>%
round(digits = digits) %>%
cutpointr:::print_df_nodat(row.names = FALSE)
cat("\n")
cat(paste("Predictor summary:", "\n"))
cutpointr:::print_df_nodat(rbind(x$desc[[i]], x$desc_by_class[[i]]))
if (cutpointr:::has_boot_results(x[i, ])) {
cat("\n")
cat(paste("Bootstrap summary:", "\n"))
cutpointr:::print_df_nodat(x[["boot"]][[i]] %>%
dplyr::mutate_if(
is.numeric,
round,
digits = boot_digits),
row.names = rep("", nrow(x[["boot"]][[i]])))
}
}
return(invisible(x))
}
my_summary_func(summary(cp))
#> Method: maximize_metric
#> Predictor: dsi
#> Outcome: suicide
#> Direction: >=
#> Nr. of bootstraps: 300
#> Warning: Use of .data in tidyselect expressions was deprecated in tidyselect 1.2.0.
#> ℹ Please use `"AUC"` instead of `.data$AUC`
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
#> AUC n n_pos n_neg
#> 0.9238 532 36 496
#>
#> optimal_cutpoint sum_sens_spec acc sensitivity specificity tp fn fp tn
#> 2 1.7518 0.8647 0.8889 0.8629 32 4 68 428
#>
#> Predictor summary:
#> Data Min. 5% 1st Qu. Median Mean 3rd Qu. 95% Max. SD NAs
#> Overall 0 0.00 0 0 0.9210526 1 5.00 11 1.852714 0
#> no 0 0.00 0 0 0.6330645 0 4.00 10 1.412225 0
#> yes 0 0.75 4 5 4.8888889 6 9.25 11 2.549821 0
#>
#> Bootstrap summary:
#> Variable Min. 5% 1st Qu. Median Mean 3rd Qu. 95% Max. SD
#> optimal_cutpoint 1.000 1.000 2.000 2.000 2.077 2.000 4.000 4.000 0.673
#> AUC_b 0.825 0.875 0.908 0.928 0.924 0.942 0.961 0.977 0.025
#> AUC_oob 0.764 0.872 0.903 0.924 0.925 0.954 0.972 0.994 0.035
#> sum_sens_spec_b 1.589 1.672 1.723 1.759 1.758 1.796 1.836 1.881 0.051
#> sum_sens_spec_oob 1.393 1.564 1.672 1.724 1.719 1.778 1.852 1.891 0.087
#> acc_b 0.748 0.773 0.850 0.867 0.859 0.878 0.904 0.938 0.036
#> acc_oob 0.708 0.769 0.843 0.859 0.854 0.878 0.900 0.924 0.038
#> sensitivity_b 0.727 0.806 0.868 0.909 0.901 0.935 0.977 1.000 0.052
#> sensitivity_oob 0.500 0.667 0.818 0.875 0.867 0.929 1.000 1.000 0.096
#> specificity_b 0.731 0.761 0.848 0.866 0.856 0.877 0.906 0.945 0.039
#> specificity_oob 0.698 0.754 0.841 0.858 0.853 0.878 0.907 0.934 0.043
#> cohens_kappa_b 0.178 0.263 0.370 0.415 0.409 0.462 0.519 0.659 0.075
#> cohens_kappa_oob 0.139 0.248 0.328 0.393 0.387 0.443 0.526 0.609 0.084
#> NAs
#> 0
#> 0
#> 0
#> 0
#> 0
#> 0
#> 0
#> 0
#> 0
#> 0
#> 0
#> 0
#> 0 Created on 2024-12-08 with reprex v2.1.1 If you have further questions, just add more comments. I don't get notified if you edit a comment. |
Regarding your question about customizing plots, take a look at issue #68 . If that does not solve your problem, leave a comment here. |
Hi @Thie1e, Thank you again for your responses! Regarding the additional digits, thank you for taking the time to write the code. I’ll definitely try it next week. I’m not very proficient in R at the moment, so I really appreciate your detailed instructions. The additional digits are for publication purposes, so it would be great if they could be added as an enhancement! As for the graphs, I’d like to combine the first two plots into a single one, similar to the example below. Specifically, I’d like to have the two density curves displayed with different colors and transparency, along with the cutoff represented as a vertical bar. I saw the post you recommended, and based on my understanding, it seems I can use the code to customize the appearance of the plots. However, I’m unsure if the same approach can be used to create a combined plot like the one I’m aiming for. What do you think? Apologies for editing instead of commenting! 😊 |
If you need a lot of customizability, it is probably easier to plot the predictor data the usual way with ggplot and just add the optimal cutpoint value. For example: library(cutpointr)
library(tidyverse)
cp <- cutpointr(data = suicide,
x = dsi,
class = suicide,
method = maximize_metric,
metric = sum_sens_spec)
#> Assuming the positive class is yes
#> Assuming the positive class has higher x values
ggplot(suicide, aes(x = dsi, color = suicide, fill = suicide)) +
geom_density(alpha = 0.5) +
geom_vline(xintercept = cp$optimal_cutpoint) Created on 2024-12-08 with reprex v2.1.1 You probably need to customize that plot further, but this might get you started. By the way, the above summary function is simply a modification of the built-in print function for summary objects, which is |
Hi @Thie1e, Great, thank you! Will be trying this today. |
Hi @Thie1e , The code you provided for adding digits works perfectly, thank you! For the graph, I extended your code as: ggplot(dataall, aes(x = !!predictor_modified, fill = !!sym(outcome))) +
geom_histogram(
aes(y = ..density..), # Use density for histogram so both match the density scale
position = "identity", # Overlap the histograms
alpha = 0.6, # Transparency for histogram bars
bins = 50 # Number of bins for finer resolution
) +
geom_density(
aes(y = ..density..), # Density curve on the same scale as the histogram
alpha = 0.2, # Transparency for histogram bars
size = 0.2 # Line width for density line
) +
geom_vline(
xintercept = cp$optimal_cutpoint,
linetype = "dashed",
color = "darkred",
size = 0.8
) +
scale_fill_manual(values = c("gray", "steelblue2")) + # Custom colors for histogram fill
scale_color_manual(values = c("gray", "steelblue2")) + # Custom colors for density line
scale_x_continuous(
limits = c(NA, 400), # Limit x-axis to a maximum of 400
breaks = seq(0, 400, by = 50) # Set x-axis labels every 50 units
) +
labs(
x = "Predictor",
y = "Probability",
color = "Outcome",
fill = "Outcome"
) +
theme_minimal() which gives me this: The last thing (hopefully) I'd like to do based on the output is to plot the ROC curve with oob sensitivity and specificity - is this possible? |
You can plot the ROC curve using
I am not sure how you would like to plot the oob sensitivity and specificity. The oob sensitivities and specificities are recalculated in every bootstrap sample and of course the ROC curves and thus optimal cutpoints as well as these metrics differ between the bootstrap samples. You can plot a confidence interval around the ROC curve based on the bootstrap samples, see #43 . The oob sensitivities and specificities are of course distributions that you could plot like this. Their distributions depend on the chosen metric and optimization method.
|
Hi @Thie1e, I am calculating optimal cutoffs using bootstrapping and averaging over the bootstrap samples. Now, I’m wondering what the ROC curve generated by the current implementation represents. Isn't it the case that the specificity and sensitivity plotted in the ROC curve are averages across the bootstrap samples? |
No, the ROC curve is always the usual one calculated over the full data. The bootstrapping only adds the cross-validated estimates of the various metrics. |
You could generate some kind of bootstrapped ROC curve, if you need that, by summarizing the bootstrapped ROC curves manually. The in-bag and out-of-bag ROC curves for every bootstrap sample are being returned: |
Alright, I’d prefer not to overcomplicate things. I’ll consider the ROC curve computed on the original dataset. Just to clarify:
Thank you! |
library(cutpointr)
cp <- cutpointr(data = suicide,
x = dsi,
class = suicide,
method = maximize_metric,
metric = sum_sens_spec,
boot_runs = 400)
#> Assuming the positive class is yes
#> Assuming the positive class has higher x values
#> Running bootstrap...
summary(cp)
#> Method: maximize_metric
#> Predictor: dsi
#> Outcome: suicide
#> Direction: >=
#> Nr. of bootstraps: 400
#>
#> AUC n n_pos n_neg
#> 0.9238 532 36 496
#>
#> optimal_cutpoint sum_sens_spec acc sensitivity specificity tp fn fp tn
#> 2 1.7518 0.8647 0.8889 0.8629 32 4 68 428
#>
#> Predictor summary:
#> Data Min. 5% 1st Qu. Median Mean 3rd Qu. 95% Max. SD NAs
#> Overall 0 0.00 0 0 0.9210526 1 5.00 11 1.852714 0
#> no 0 0.00 0 0 0.6330645 0 4.00 10 1.412225 0
#> yes 0 0.75 4 5 4.8888889 6 9.25 11 2.549821 0
#>
#> Bootstrap summary:
#> Variable Min. 5% 1st Qu. Median Mean 3rd Qu. 95% Max. SD NAs
#> optimal_cutpoint 1.00 1.00 2.00 2.00 2.12 2.00 4.00 5.00 0.70 0
#> AUC_b 0.84 0.88 0.91 0.93 0.92 0.94 0.96 0.98 0.02 0
#> AUC_oob 0.82 0.87 0.90 0.93 0.93 0.95 0.97 0.99 0.03 0
#> sum_sens_spec_b 1.58 1.67 1.72 1.76 1.76 1.79 1.84 1.87 0.05 0
#> sum_sens_spec_oob 1.17 1.54 1.67 1.73 1.72 1.78 1.85 1.90 0.09 0
#> acc_b 0.74 0.78 0.85 0.87 0.86 0.88 0.90 0.96 0.03 0
#> acc_oob 0.72 0.77 0.85 0.86 0.86 0.88 0.91 0.93 0.04 0
#> sensitivity_b 0.68 0.80 0.87 0.90 0.90 0.94 0.98 1.00 0.05 0
#> sensitivity_oob 0.20 0.67 0.81 0.88 0.86 0.92 1.00 1.00 0.11 0
#> specificity_b 0.72 0.76 0.85 0.86 0.86 0.88 0.91 0.97 0.04 0
#> specificity_oob 0.70 0.76 0.85 0.86 0.86 0.88 0.92 0.97 0.04 0
#> cohens_kappa_b 0.21 0.27 0.37 0.42 0.41 0.46 0.53 0.70 0.07 0
#> cohens_kappa_oob 0.18 0.25 0.34 0.39 0.39 0.45 0.52 0.63 0.08 0
# This returns the 5th and 95th percentiles for specificity_b above
boot_ci(cp, variable = specificity, in_bag = T, alpha = 0.1)
#> # A tibble: 2 × 2
#> quantile values
#> <dbl> <dbl>
#> 1 0.05 0.764
#> 2 0.95 0.907 Created on 2024-12-10 with reprex v2.1.0 |
Hi @Thie1e , I meant the CIs on the ROC curve, like the ones in the post you shared above. Thank you! |
Hi!
Thank you so much for sharing this package. It’s so helpful. I have a couple of questions :)
Thanks!
The text was updated successfully, but these errors were encountered: