
Reference for computing bootstrap estimates of confidence intervals? #67

Open
vadori opened this issue Nov 11, 2024 · 15 comments


@vadori

vadori commented Nov 11, 2024

Hi!

Thank you so much for sharing this package. It’s so helpful. I have a couple of questions :)

  1. Could you please confirm whether the bootstrap summary table's 5% and 95% columns display the 95% confidence intervals for accuracy, sensitivity, and specificity?
  2. Additionally, could you suggest any references for understanding how to calculate in-bag and out-of-bag confidence intervals using bootstrapping? My current understanding is that out-of-bag estimates involve resampling the dataset N times, recalculating the optimal cutoff each time, and evaluating performance on the samples excluded in each round. Would it be correct to say that this approach does not yield a confidence interval for the metric specifically at the optimal cutoff but rather for the metric when "applying a cutoff on the given biomarker," as the cutoff may vary across bootstrap samples?

Thanks!

@Thie1e
Owner

Thie1e commented Nov 24, 2024

Hi vadori,

I'm glad the package was helpful. Regarding your questions:

  1. No, those are the 5th and 95th percentiles of the bootstrap results, so it's a 90% confidence interval from a nonparametric bootstrap. You can generate the 95% (or any other) confidence interval using the boot_ci function. For example, boot_ci(my_result, optimal_cutpoint, alpha = 0.05) returns the 2.5th and 97.5th percentiles.
  2. Yes, the out-of-bag metrics are a way to estimate the performance on unseen data, as the 'optimal' cutpoint is determined only using the in-bag data. So this is a type of cross-validation. Your interpretation is correct, I would phrase it as "an estimate of the expected performance when optimizing a cutoff on the given biomarker and given a specific optimization method". My suggestion for more details on validation using the bootstrap would be 'Regression Modeling Strategies' by Frank Harrell.
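For instance, a minimal sketch using the suicide example data that ships with the package (same boot_ci signature as mentioned above; the interval is read off the nonparametric bootstrap distribution):

```r
library(cutpointr)

cp <- cutpointr(data = suicide, x = dsi, class = suicide,
                method = maximize_metric, metric = sum_sens_spec,
                boot_runs = 200)

# 95% percentile CI (2.5th and 97.5th percentiles) for the
# out-of-bag sensitivity across the bootstrap samples
boot_ci(cp, variable = sensitivity, in_bag = FALSE, alpha = 0.05)
```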

@vadori
Author

vadori commented Nov 28, 2024

Dear @Thie1e

This is amazing, thank you so much for your reply! I am now experimenting. I will let you know how it goes.

Best,
VV

EDIT: Everything is going great, I love your package (and your paper, which allowed me to understand lots of details). I have some questions:

  1. If I wanted to customize the plots (for example, to display the distributions of the two classes on the same plot with different colors, along with a vertical bar indicating the cutoff value), what would be the best approach? Custom code, right?

  2. VERY IMPORTANT FOR ME: is there a way to increase the number of decimals in the bootstrap summary that I obtained for example with

cp <- cutpointr(dataall, !!sym(predictor),
                direction = "<=", pos_class = 1, !!sym(outcome),
                method = maximize_metric, metric = youden, boot_runs = 100)

Thank you!!!

@Thie1e
Owner

Thie1e commented Dec 8, 2024

Hi,

unfortunately, there is no built-in parameter for customizing the number of digits in the output of the bootstrap summary. I am going to try to add this to an upcoming version. There are two solutions: First, you could calculate the summary manually by extracting the bootstrap results from cp$boot. Second, you could write your own function for printing the cutpointr summary like in the following example. It has an additional boot_digits parameter that you can set as desired. If you increase it, the output may be too wide to print nicely in the R console, but you get the additional digits:
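A minimal sketch of the first approach (the manual summary), assuming `cp` is a cutpointr result that was run with `boot_runs` set; the available column names can be checked with `glimpse(cp$boot[[1]])`:

```r
# First approach: extract the bootstrap results and summarize manually.
# cp$boot is a list column; cp$boot[[1]] holds one row per bootstrap run.
library(dplyr)

cp$boot[[1]] %>%
    summarise(across(where(is.numeric),
                     list(`5%` = ~ quantile(.x, 0.05),
                          Median = ~ median(.x),
                          `95%` = ~ quantile(.x, 0.95)))) %>%
    print(width = Inf)  # show all columns without truncation
```

Because nothing is rounded here, you keep the full precision of the bootstrap estimates.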

library(cutpointr)
library(tidyverse)

cp <- cutpointr(data = suicide,
                x = dsi,
                class = suicide,
                method = maximize_metric,
                metric = sum_sens_spec,
                boot_runs = 300)
#> Assuming the positive class is yes
#> Assuming the positive class has higher x values
#> Running bootstrap...

# New parameter boot_digits
my_summary_func <- function (x, digits = 4, boot_digits = 3) 
{
     cat(paste("Method:", x$cutpointr[[1]]$method, "\n"))
     cat(paste("Predictor:", x$cutpointr[[1]]$predictor, "\n"))
     cat(paste("Outcome:", x$cutpointr[[1]]$outcome, "\n"))
     cat(paste("Direction:", x$cutpointr[[1]]$direction, "\n"))
     if (cutpointr:::has_column(x$cutpointr[[1]], "subgroup")) {
          cat(c("Subgroups:", paste(purrr::map(x$cutpointr, ~.$subgroup), 
                                    collapse = ", "), "\n"))
     }
     if (cutpointr:::has_boot_results(x)) {
          cat(paste("Nr. of bootstraps:", x$boot_runs[1], "\n"))
     }
     for (i in 1:nrow(x)) {
          cat("\n")
          if (cutpointr:::has_column(x$cutpointr[[i]], "subgroup")) {
               cat(paste("Subgroup:", x$cutpointr[[i]]$subgroup, 
                         "\n"))
               cat(paste0(rep("-", getOption("width")), collapse = ""), 
                   "\n")
          }
          tempdat <- x$cutpointr[[i]] %>% 
               dplyr::select(.data$AUC) %>% 
               round(digits = digits) %>% 
               dplyr::mutate(n = x$n_obs[i], 
                             n_pos = x$n_pos[i], n_neg = x$n_neg[i]) %>% 
               as.data.frame()
          cutpointr:::print_df_nodat(tempdat, row.names = FALSE)
          cat("\n")
          purrr::map_df(1:length(x$cutpointr[[i]]$optimal_cutpoint[[1]]), 
                        function(j) {
                             x$cutpointr[[i]] %>% 
                                  dplyr::select(
                                       !tidyselect::any_of(c("direction", 
                                                             "subgroup", 
                                                             "method", 
                                                             "AUC", 
                                                             "pos_class",
                                                             "neg_class", 
                                                             "prevalence", 
                                                             "outcome", 
                                                             "predictor",
                                                             "grouping",
                                                             "data", 
                                                             "roc_curve",
                                                             "boot"))) %>%
                                  purrr::map_df(cutpointr:::get_fnth, n = j)}) %>% 
               as.data.frame %>%
               dplyr::left_join(y = x$confusion_matrix[[i]], 
                                by = c(optimal_cutpoint = "cutpoint")) %>% 
               round(digits = digits) %>% 
               cutpointr:::print_df_nodat(row.names = FALSE)
          cat("\n")
          cat(paste("Predictor summary:", "\n"))
          cutpointr:::print_df_nodat(rbind(x$desc[[i]], x$desc_by_class[[i]]))
          if (cutpointr:::has_boot_results(x[i, ])) {
               cat("\n")
               cat(paste("Bootstrap summary:", "\n"))
               cutpointr:::print_df_nodat(x[["boot"]][[i]] %>%
                                               dplyr::mutate_if(
                                                    is.numeric, 
                                                    round, 
                                                    digits = boot_digits), 
                                          row.names = rep("", nrow(x[["boot"]][[i]])))
          }
     }
     return(invisible(x))
}

my_summary_func(summary(cp))
#> Method: maximize_metric 
#> Predictor: dsi 
#> Outcome: suicide 
#> Direction: >= 
#> Nr. of bootstraps: 300
#> Warning: Use of .data in tidyselect expressions was deprecated in tidyselect 1.2.0.
#> ℹ Please use `"AUC"` instead of `.data$AUC`
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
#>     AUC   n n_pos n_neg
#>  0.9238 532    36   496
#> 
#>  optimal_cutpoint sum_sens_spec    acc sensitivity specificity tp fn fp  tn
#>                 2        1.7518 0.8647      0.8889      0.8629 32  4 68 428
#> 
#> Predictor summary: 
#>     Data Min.   5% 1st Qu. Median      Mean 3rd Qu.  95% Max.       SD NAs
#>  Overall    0 0.00       0      0 0.9210526       1 5.00   11 1.852714   0
#>       no    0 0.00       0      0 0.6330645       0 4.00   10 1.412225   0
#>      yes    0 0.75       4      5 4.8888889       6 9.25   11 2.549821   0
#> 
#> Bootstrap summary: 
#>           Variable  Min.    5% 1st Qu. Median  Mean 3rd Qu.   95%  Max.    SD
#>   optimal_cutpoint 1.000 1.000   2.000  2.000 2.077   2.000 4.000 4.000 0.673
#>              AUC_b 0.825 0.875   0.908  0.928 0.924   0.942 0.961 0.977 0.025
#>            AUC_oob 0.764 0.872   0.903  0.924 0.925   0.954 0.972 0.994 0.035
#>    sum_sens_spec_b 1.589 1.672   1.723  1.759 1.758   1.796 1.836 1.881 0.051
#>  sum_sens_spec_oob 1.393 1.564   1.672  1.724 1.719   1.778 1.852 1.891 0.087
#>              acc_b 0.748 0.773   0.850  0.867 0.859   0.878 0.904 0.938 0.036
#>            acc_oob 0.708 0.769   0.843  0.859 0.854   0.878 0.900 0.924 0.038
#>      sensitivity_b 0.727 0.806   0.868  0.909 0.901   0.935 0.977 1.000 0.052
#>    sensitivity_oob 0.500 0.667   0.818  0.875 0.867   0.929 1.000 1.000 0.096
#>      specificity_b 0.731 0.761   0.848  0.866 0.856   0.877 0.906 0.945 0.039
#>    specificity_oob 0.698 0.754   0.841  0.858 0.853   0.878 0.907 0.934 0.043
#>     cohens_kappa_b 0.178 0.263   0.370  0.415 0.409   0.462 0.519 0.659 0.075
#>   cohens_kappa_oob 0.139 0.248   0.328  0.393 0.387   0.443 0.526 0.609 0.084
#>  NAs
#>    0
#>    0
#>    0
#>    0
#>    0
#>    0
#>    0
#>    0
#>    0
#>    0
#>    0
#>    0
#>    0

Created on 2024-12-08 with reprex v2.1.1

If you have further questions, just add more comments. I don't get notified if you edit a comment.

@Thie1e
Owner

Thie1e commented Dec 8, 2024

Regarding your question about customizing plots, take a look at issue #68 .

If that does not solve your problem, leave a comment here.

@vadori
Author

vadori commented Dec 8, 2024

Hi @Thie1e,

Thank you again for your responses! Regarding the additional digits, thank you for taking the time to write the code. I’ll definitely try it next week. I’m not very proficient in R at the moment, so I really appreciate your detailed instructions. The additional digits are for publication purposes, so it would be great if they could be added as an enhancement!

As for the graphs, I’d like to combine the first two plots into a single one, similar to the example below. Specifically, I’d like to have the two density curves displayed with different colors and transparency, along with the cutoff represented as a vertical bar. I saw the post you recommended, and based on my understanding, it seems I can use the code to customize the appearance of the plots. However, I’m unsure if the same approach can be used to create a combined plot like the one I’m aiming for. What do you think?

[image: mockup of the desired combined plot, with both class density curves overlaid in different colors and a vertical line at the cutoff]

Apologies for editing instead of commenting! 😊

@Thie1e
Owner

Thie1e commented Dec 8, 2024

If you need a lot of customizability, it is probably easier to plot the predictor data the usual way with ggplot and just add the optimal cutpoint value. For example:

library(cutpointr)
library(tidyverse)
cp <- cutpointr(data = suicide,
                x = dsi,
                class = suicide,
                method = maximize_metric,
                metric = sum_sens_spec)
#> Assuming the positive class is yes
#> Assuming the positive class has higher x values

ggplot(suicide, aes(x = dsi, color = suicide, fill = suicide)) + 
     geom_density(alpha = 0.5) +
     geom_vline(xintercept = cp$optimal_cutpoint)

Created on 2024-12-08 with reprex v2.1.1

You probably need to customize that plot further, but this might get you started.

By the way, the above summary function is simply a modification of the built-in print function for summary objects, which is cutpointr:::print.summary_cutpointr.

@vadori
Author

vadori commented Dec 9, 2024

Hi @Thie1e,

Great, thank you! Will be trying this today.

@vadori
Author

vadori commented Dec 10, 2024

Hi @Thie1e ,

The code you provided for adding digits works perfectly, thank you! For the graph, I extended your code as:

ggplot(dataall, aes(x = !!predictor_modified, fill = !!sym(outcome))) +
      geom_histogram(
        aes(y = ..density..),  # Use density for histogram so both match the density scale
        position = "identity", # Overlap the histograms
        alpha = 0.6,           # Transparency for histogram bars
        bins = 50              # Number of bins for finer resolution
      ) +
      geom_density(
        aes(y = ..density..),  # Density curve on the same scale as the histogram
        alpha = 0.2,           # Transparency for density fill
        size = 0.2               # Line width for density line
      ) +
      geom_vline(
        xintercept = cp$optimal_cutpoint,
        linetype = "dashed",
        color = "darkred",
        size = 0.8
      ) +
      scale_fill_manual(values = c("gray", "steelblue2")) +   # Custom colors for histogram fill
      scale_color_manual(values = c("gray", "steelblue2")) +  # Custom colors for density line
      scale_x_continuous(
        limits = c(NA, 400),   # Limit x-axis to a maximum of 400
        breaks = seq(0, 400, by = 50)  # Set x-axis labels every 50 units
      ) +
      labs(
        x = "Predictor",  
        y = "Probability",
        color = "Outcome",
        fill = "Outcome"
      ) +
      theme_minimal()

which gives me this:

[image: resulting plot, overlaid histograms and density curves per class with a dashed dark red vertical line at the cutoff]

The last thing (hopefully) I'd like to do based on the output is to plot the ROC curve with oob sensitivity and specificity - is this possible?

@Thie1e
Owner

Thie1e commented Dec 10, 2024

You can plot the ROC curve using plot_roc, which again returns a ggplot object that you can customize further.

cp <- cutpointr(data = suicide,
                x = dsi,
                class = suicide,
                method = maximize_metric,
                metric = sum_sens_spec)

plot_roc(cp, 
         display_cutpoint = T, 
         type = "step") + # step or line
   coord_equal() +
   theme_bw()

I am not sure how you would like to plot the oob sensitivity and specificity. The oob sensitivities and specificities are recalculated in every bootstrap sample and of course the ROC curves and thus optimal cutpoints as well as these metrics differ between the bootstrap samples.

You can plot a confidence interval around the ROC curve based on the bootstrap samples, see #43 .

The oob sensitivities and specificities are of course distributions that you could plot like this. Their distributions depend on the chosen metric and optimization method.

cp |> 
   select(boot) |> 
   unnest(boot) |> 
   ggplot(aes(x = sensitivity_oob)) + 
   geom_density()

cp |> 
   select(boot) |> 
   unnest(boot) |> 
   ggplot(aes(x = specificity_oob)) + 
   geom_density()

@vadori
Author

vadori commented Dec 10, 2024

Hi @Thie1e,

I am calculating optimal cutoffs using bootstrapping and averaging over the bootstrap samples. Now, I’m wondering what the ROC curve generated by the current implementation represents. Isn't it the case that the specificity and sensitivity plotted in the ROC curve are averages across the bootstrap samples?

@Thie1e
Owner

Thie1e commented Dec 10, 2024

No, the ROC curve is always the usual one calculated over the full data. The bootstrapping only adds the cross-validated estimates of the various metrics.

@Thie1e
Owner

Thie1e commented Dec 10, 2024

You could generate some kind of bootstrapped ROC curve, if you need that, by summarizing the bootstrapped ROC curves manually. The in-bag and out-of-bag ROC curves for every bootstrap sample are being returned: cp$boot[[1]] |> glimpse()
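As a rough sketch, assuming the boot tibble stores the per-sample ROC curves in list columns named roc_curve_oob / roc_curve_b with tpr and fpr columns inside (verify the exact names with the glimpse() call above):

```r
library(tidyverse)

# Overlay the out-of-bag ROC curves from all bootstrap samples
cp$boot[[1]] %>%
    mutate(run = row_number()) %>%       # tag each bootstrap run
    select(run, roc_curve_oob) %>%
    unnest(roc_curve_oob) %>%            # one row per cutpoint per run
    ggplot(aes(x = fpr, y = tpr, group = run)) +
    geom_step(alpha = 0.15) +            # faint line per bootstrap sample
    coord_equal()
```

From the same unnested data you could also compute pointwise quantiles of tpr over a grid of fpr values to get a summarized band instead of overlaid curves.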

@vadori
Author

vadori commented Dec 10, 2024

Alright, I’d prefer not to overcomplicate things. I’ll consider the ROC curve computed on the original dataset. Just to clarify:

  1. The optimal cutoff displayed on the ROC curve is the one computed using the chosen method/metric, isn't it?
  2. The confidence intervals are based on bootstrap samples and are not related to those computed, for example, with boot_ci(cp, sensitivity, in_bag = FALSE, alpha = 0.05).

Thank you!

@Thie1e
Owner

Thie1e commented Dec 10, 2024

  1. Yes
  2. You mean the confidence intervals returned by summary after running cutpointr with bootstrapping? They are related. boot_ci simply returns percentiles from the bootstrap distribution of a specified metric. Thus, the CIs from the summary can be reproduced using boot_ci if you set that function to the same percentiles. For example:
library(cutpointr)
cp <- cutpointr(data = suicide,
                x = dsi,
                class = suicide,
                method = maximize_metric,
                metric = sum_sens_spec,
                boot_runs = 400)
#> Assuming the positive class is yes
#> Assuming the positive class has higher x values
#> Running bootstrap...

summary(cp)
#> Method: maximize_metric 
#> Predictor: dsi 
#> Outcome: suicide 
#> Direction: >= 
#> Nr. of bootstraps: 400 
#> 
#>     AUC   n n_pos n_neg
#>  0.9238 532    36   496
#> 
#>  optimal_cutpoint sum_sens_spec    acc sensitivity specificity tp fn fp  tn
#>                 2        1.7518 0.8647      0.8889      0.8629 32  4 68 428
#> 
#> Predictor summary: 
#>     Data Min.   5% 1st Qu. Median      Mean 3rd Qu.  95% Max.       SD NAs
#>  Overall    0 0.00       0      0 0.9210526       1 5.00   11 1.852714   0
#>       no    0 0.00       0      0 0.6330645       0 4.00   10 1.412225   0
#>      yes    0 0.75       4      5 4.8888889       6 9.25   11 2.549821   0
#> 
#> Bootstrap summary: 
#>           Variable Min.   5% 1st Qu. Median Mean 3rd Qu.  95% Max.   SD NAs
#>   optimal_cutpoint 1.00 1.00    2.00   2.00 2.12    2.00 4.00 5.00 0.70   0
#>              AUC_b 0.84 0.88    0.91   0.93 0.92    0.94 0.96 0.98 0.02   0
#>            AUC_oob 0.82 0.87    0.90   0.93 0.93    0.95 0.97 0.99 0.03   0
#>    sum_sens_spec_b 1.58 1.67    1.72   1.76 1.76    1.79 1.84 1.87 0.05   0
#>  sum_sens_spec_oob 1.17 1.54    1.67   1.73 1.72    1.78 1.85 1.90 0.09   0
#>              acc_b 0.74 0.78    0.85   0.87 0.86    0.88 0.90 0.96 0.03   0
#>            acc_oob 0.72 0.77    0.85   0.86 0.86    0.88 0.91 0.93 0.04   0
#>      sensitivity_b 0.68 0.80    0.87   0.90 0.90    0.94 0.98 1.00 0.05   0
#>    sensitivity_oob 0.20 0.67    0.81   0.88 0.86    0.92 1.00 1.00 0.11   0
#>      specificity_b 0.72 0.76    0.85   0.86 0.86    0.88 0.91 0.97 0.04   0
#>    specificity_oob 0.70 0.76    0.85   0.86 0.86    0.88 0.92 0.97 0.04   0
#>     cohens_kappa_b 0.21 0.27    0.37   0.42 0.41    0.46 0.53 0.70 0.07   0
#>   cohens_kappa_oob 0.18 0.25    0.34   0.39 0.39    0.45 0.52 0.63 0.08   0

# This returns the 5th and 95th percentiles for specificity_b above
boot_ci(cp, variable = specificity, in_bag = T, alpha = 0.1)
#> # A tibble: 2 × 2
#>   quantile values
#>      <dbl>  <dbl>
#> 1     0.05  0.764
#> 2     0.95  0.907

Created on 2024-12-10 with reprex v2.1.0

@vadori
Author

vadori commented Dec 11, 2024

Hi @Thie1e ,

I meant the CIs on the ROC curve, like the ones in the post you shared above.

Thank you!
