---
output:
github_document:
toc: true
toc_depth: 2
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-"
)
```
```{r, include = FALSE, echo = FALSE}
library(ggplot2)
library(dplyr)
library(tidyr)
library(purrr)
```
# cutpointr
[![AppVeyor Build Status](https://ci.appveyor.com/api/projects/status/github/Thie1e/cutpointr?branch=master&svg=true)](https://ci.appveyor.com/project/Thie1e/cutpointr)
[![Project Status: Inactive – The project has reached a stable, usable state but is no longer being actively developed; support/maintenance will be provided as time allows.](https://www.repostatus.org/badges/latest/inactive.svg)](https://www.repostatus.org/#inactive)
[![codecov](https://codecov.io/github/thie1e/cutpointr/branch/master/graphs/badge.svg)](https://codecov.io/github/thie1e/cutpointr)
[![CRAN_Release_Badge](http://www.r-pkg.org/badges/version-ago/cutpointr)](https://CRAN.R-project.org/package=cutpointr)
**cutpointr** is an R package for tidy calculation of "optimal" cutpoints. It
supports several methods for calculating cutpoints and includes several
metrics that can be maximized or minimized by selecting a cutpoint. Some of these
methods are designed to be more robust than the simple empirical optimization
of a metric. Additionally,
**cutpointr** can automatically bootstrap the variability of the optimal
cutpoints and return out-of-bag estimates of various performance metrics.
## Installation
You can install **cutpointr** from CRAN using the menu in RStudio or simply:
```{r CRAN, eval = FALSE}
install.packages("cutpointr")
```
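The development version can be installed from GitHub; a minimal sketch, assuming the **remotes** package is available:
```{r github, eval = FALSE}
# install.packages("remotes")
remotes::install_github("Thie1e/cutpointr")
```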
## Example
For example, the optimal cutpoint for the included data set is 2 when maximizing the sum of sensitivity and specificity.
```{r}
library(cutpointr)
data(suicide)
head(suicide)
cp <- cutpointr(suicide, dsi, suicide,
method = maximize_metric, metric = sum_sens_spec)
```
```{r}
summary(cp)
```
```{r}
plot(cp)
```
When considering the optimality of a cutpoint, we can only make a judgement based
on the sample at hand. Thus, the estimated cutpoint may not be optimal within
the population or on unseen data, which is why we sometimes put the "optimal" in
quotation marks.
`cutpointr` makes assumptions about the direction of the dependency between
`class` and `x` if `direction` and / or `pos_class` or `neg_class` are not
specified. The same result as above can be achieved by manually defining `direction` and
the positive / negative classes, which is slightly faster since the classes and direction
don't have to be determined:
```{r}
opt_cut <- cutpointr(suicide, dsi, suicide, direction = ">=", pos_class = "yes",
neg_class = "no", method = maximize_metric, metric = youden)
```
`opt_cut` is a data frame that contains the input data and the ROC curve
(and optionally the bootstrap results) in
nested tibbles. Methods for summarizing and plotting the data and
results are included (e.g. `summary`, `plot`, `plot_roc`, `plot_metric`).
To inspect the optimization, the function of metric values per cutpoint can be
plotted using `plot_metric`, if an optimization function was used that returns
a metric column in the `roc_curve` column. For example, the `maximize_metric`
and `minimize_metric` functions do so:
```{r}
plot_metric(opt_cut)
```
Predictions for new data can be made using `predict`:
```{r}
predict(opt_cut, newdata = data.frame(dsi = 0:5))
```
## Features
- Calculation of optimal cutpoints in binary classification tasks
- Tidy output, integrates well with functions from the tidyverse
- Functions for plotting ROC curves, metric distributions and more
- Bootstrapping for simulating the cutpoint variability and for obtaining
out-of-bag estimates of various metrics (as a form of internal validation)
with optional parallelisation
- Multiple methods for calculating cutpoints
- Multiple metrics can be chosen for maximization / minimization
- Tidyeval
# Calculating cutpoints
## Method functions for cutpoint estimation
The included methods for calculating cutpoints are:
- `maximize_metric`: Maximize the metric function
- `minimize_metric`: Minimize the metric function
- `maximize_loess_metric`: Maximize the metric function after LOESS smoothing
- `minimize_loess_metric`: Minimize the metric function after LOESS smoothing
- `maximize_gam_metric`: Maximize the metric function after smoothing via Generalized Additive Models
- `minimize_gam_metric`: Minimize the metric function after smoothing via Generalized Additive Models
- `maximize_boot_metric`: Bootstrap the optimal cutpoint when maximizing a metric
- `minimize_boot_metric`: Bootstrap the optimal cutpoint when minimizing a metric
- `oc_manual`: Specify the cutoff value manually
- `oc_mean`: Use the sample mean as the "optimal" cutpoint
- `oc_median`: Use the sample median as the "optimal" cutpoint
- `oc_youden_kernel`: Maximize the Youden-Index after kernel smoothing
the distributions of the two classes
- `oc_youden_normal`: Maximize the Youden-Index parametrically
assuming normally distributed data in both classes
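Any of these functions can be supplied directly to the `method` argument of `cutpointr`. As a minimal sketch (not run here), using the sample median as the cutpoint:
```{r, eval = FALSE}
# Use the sample median of dsi as the "optimal" cutpoint
cutpointr(suicide, dsi, suicide, method = oc_median)
```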
## Metric functions
The included metrics to be used with the minimization and maximization methods
are:
- `accuracy`: Fraction correctly classified
- `abs_d_sens_spec`: The absolute difference of sensitivity and specificity
- `abs_d_ppv_npv`: The absolute difference between positive predictive
value (PPV) and negative predictive value (NPV)
- `roc01`: Distance to the point (0,1) on ROC space
- `cohens_kappa`: Cohen's Kappa
- `sum_sens_spec`: sensitivity + specificity
- `sum_ppv_npv`: The sum of positive predictive value (PPV) and negative
predictive value (NPV)
- `prod_sens_spec`: sensitivity * specificity
- `prod_ppv_npv`: The product of positive predictive value (PPV) and
negative predictive value (NPV)
- `youden`: Youden- or J-Index = sensitivity + specificity - 1
- `odds_ratio`: (Diagnostic) odds ratio
- `risk_ratio`: risk ratio (relative risk)
- `p_chisquared`: The p-value of a chi-squared test on the confusion
matrix
- `misclassification_cost`: The sum of the misclassification cost of
false positives and false negatives. Additional arguments: cost_fp, cost_fn
- `total_utility`: The total utility of true / false positives / negatives.
Additional arguments: utility_tp, utility_tn, cost_fp, cost_fn
- `F1_score`: The F1-score (2 * TP) / (2 * TP + FP + FN)
- `metric_constrain`: Maximize a selected metric given a minimal value of
another selected metric
- `sens_constrain`: Maximize sensitivity given a minimal value of specificity
- `spec_constrain`: Maximize specificity given a minimal value of sensitivity
- `acc_constrain`: Maximize accuracy given a minimal value of sensitivity
Furthermore, the following functions are included which can be used as metric
functions but are more useful for plotting purposes, for example in
`plot_cutpointr`, or for defining new metric functions:
`tp`, `fp`, `tn`, `fn`, `tpr`, `fpr`, `tnr`, `fnr`, `false_omission_rate`,
`false_discovery_rate`, `ppv`, `npv`, `precision`, `recall`, `sensitivity`, and `specificity`.
The inputs to the arguments
`method` and `metric` are functions so that user-defined functions can easily
be supplied instead of the built-in ones.
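Additional arguments of a metric, such as the utilities and costs of `total_utility`, are passed through `cutpointr` via `...`. A brief sketch (not run; the utility and cost values are arbitrary examples):
```{r, eval = FALSE}
# Maximize the total utility with example utilities and costs
cutpointr(suicide, dsi, suicide,
          method = maximize_metric, metric = total_utility,
          utility_tp = 3, utility_tn = 1, cost_fp = 1, cost_fn = 10)
```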
## Separate subgroups and bootstrapping
Cutpoints can be separately estimated on subgroups that are defined by a third variable,
`gender` in this case. Additionally,
if `boot_runs` is larger than zero, `cutpointr` will carry out the usual cutpoint
calculation on the full sample, just as before, and additionally on
`boot_runs` bootstrap samples. This offers a way of gauging the out-of-sample
performance of the cutpoint estimation method. If a subgroup is given,
the bootstrapping is carried out separately for every
subgroup, which is also reflected in the plots and output.
```{r, cache=TRUE}
set.seed(12)
opt_cut <- cutpointr(suicide, dsi, suicide, boot_runs = 1000)
opt_cut
```
The returned object has the additional column `boot` which is a nested tibble that
includes the cutpoints per bootstrap sample along with the metric calculated using
the function in `metric` and
various default metrics. The
metrics are suffixed by `_b` to indicate in-bag results or `_oob` to indicate
out-of-bag results:
```{r}
opt_cut$boot
```
The summary and plots include additional elements that summarize or display the
bootstrap results:
```{r}
summary(opt_cut)
plot(opt_cut)
```
### Parallelized bootstrapping
Using `foreach` and `doRNG`, the bootstrapping can be parallelized easily. The
**doRNG** package is used to make the bootstrap sampling reproducible.
```{r, cache=TRUE}
if (suppressPackageStartupMessages(require(doParallel) & require(doRNG))) {
cl <- makeCluster(2) # 2 cores
registerDoParallel(cl)
registerDoRNG(12) # Reproducible parallel loops using doRNG
opt_cut <- cutpointr(suicide, dsi, suicide, gender, pos_class = "yes",
direction = ">=", boot_runs = 1000, allowParallel = TRUE)
stopCluster(cl)
opt_cut
}
```
# More robust cutpoint estimation methods
## Bootstrapped cutpoints
It has been shown that bagging can substantially improve the performance of a wide range of model types in regression as well as in classification tasks. This method is available for cutpoint estimation via the `maximize_boot_metric` and `minimize_boot_metric` functions. If one of these functions is used as `method`, `boot_cut` bootstrap samples are drawn, the cutpoint optimization is carried out in each one, and a summary (e.g. the mean) of the resulting optimal cutpoints on the bootstrap samples is returned as the optimal cutpoint in `cutpointr`. Note that if bootstrap validation is run, i.e. if `boot_runs` is larger than zero, an outer bootstrap will be executed. In the bootstrap validation routine `boot_runs` bootstrap samples are generated and each one is again bootstrapped `boot_cut` times. This may lead to long run times, so activating the built-in parallelization may be advisable.
The advantages of bootstrapping the optimal cutpoint are that the procedure doesn't possess parameters that have to be tuned, unlike the LOESS smoothing, that it doesn't rely on assumptions, unlike the Normal method, and that it is applicable to any metric that can be used with `minimize_metric` or `maximize_metric`, unlike the Kernel method. Furthermore, just as Random Forests cannot be overfit by increasing the number of trees, the bootstrapped cutpoints cannot be overfit by running an excessive number of `boot_cut` repetitions.
```{r, cache=TRUE}
set.seed(100)
cutpointr(suicide, dsi, suicide, gender,
method = maximize_boot_metric,
boot_cut = 200, summary_func = mean,
metric = accuracy, silent = TRUE)
```
## LOESS smoothing for selecting a cutpoint
When using `maximize_metric` and `minimize_metric` the optimal cutpoint is
selected by searching the maximum or minimum of the metric function. For
example, we may want to minimize the misclassification cost. Since false
negatives (a suicide attempt was not anticipated) can be regarded as much more
severe than false positives we can set the cost of a false negative `cost_fn`
for example to ten times the cost of a false positive.
```{r}
opt_cut <- cutpointr(suicide, dsi, suicide, gender, method = minimize_metric,
metric = misclassification_cost, cost_fp = 1, cost_fn = 10)
```
```{r}
plot_metric(opt_cut)
```
As this "optimal" cutpoint may depend on minor differences between the
possible cutoffs, smoothing of the function of metric values by
cutpoint value might be desirable, especially in small samples. The
`minimize_loess_metric` and `maximize_loess_metric` functions can be used
to smooth the function so that the optimal cutpoint is selected based on the
smoothed metric values. Options to modify the smoothing, which is implemented using
`loess.as` from the **fANCOVA** package, include:
- `criterion`: the criterion for automatic smoothing parameter selection: "aicc" denotes bias-corrected AIC criterion, "gcv" denotes generalized cross-validation.
- `degree`: the degree of the local polynomials to be used. It can be 0, 1 or 2.
- `family`: if "gaussian" fitting is by least-squares, and if "symmetric" a re-descending M estimator is used with Tukey's biweight function.
- `user.span`: the user-defined parameter which controls the degree of smoothing.
Using parameters for the LOESS smoothing of `criterion = "aicc"`, `degree = 2`,
`family = "symmetric"`, and `user.span = 0.7` we get the following smoothed
versions of the above metrics:
```{r}
opt_cut <- cutpointr(suicide, dsi, suicide, gender,
method = minimize_loess_metric,
criterion = "aicc", family = "symmetric",
degree = 2, user.span = 0.7,
metric = misclassification_cost, cost_fp = 1, cost_fn = 10)
```
```{r}
plot_metric(opt_cut)
```
The optimal cutpoint for the female subgroup changes to 3. Note, though, that there
are no reliable rules for selecting the "best" smoothing parameters. Notably,
the LOESS smoothing is sensitive to the number of unique cutpoints. A large
number of unique cutpoints generally leads to a more volatile curve of
metric values by cutpoint value, even after smoothing. Thus, the curve
tends to be undersmoothed in that scenario. The unsmoothed metric
values are returned in `opt_cut$roc_curve` in the column
`m_unsmoothed`.
## Smoothing via Generalized Additive Models for selecting a cutpoint
In a similar fashion, the function of metric values per cutpoint can be smoothed
using Generalized Additive Models with smooth terms. Internally, `mgcv::gam`
carries out the smoothing which can be customized via the arguments
`formula` and `optimizer`, see `help("gam", package = "mgcv")`. Most importantly,
the GAM can be specified by altering the default formula, for example the
smoothing function could be configured to apply cubic regression splines (`"cr"`)
as the smooth term. As the `suicide` data has only very few unique cutpoints,
it is not very suitable for showcasing the GAM smoothing, so we will use two
classes of the `iris` data here. In this case, the purely empirical method and
the GAM smoothing lead to identical cutpoints, but in practice the GAM smoothing
tends to be more robust, especially with larger data. An attractive feature of
the GAM smoothing is that the default values tend to work quite well and
usually require no tuning, eliminating researcher degrees of freedom.
```{r}
library(ggplot2)
exdat <- iris
exdat <- exdat[exdat$Species != "setosa", ]
opt_cut <- cutpointr(exdat, Petal.Length, Species,
method = minimize_gam_metric,
formula = m ~ s(x.sorted, bs = "cr"),
metric = abs_d_sens_spec)
plot_metric(opt_cut)
```
### Parametric method assuming normality
The Normal method in `oc_youden_normal` is a parametric method for maximizing the Youden-Index or equivalently the sum of $Se$ and $Sp$. It relies on the assumption that the predictor for both the negative and positive observations is normally distributed. In that case it can be shown that
$$c^* = \frac{(\mu_P \sigma_N^2 - \mu_N \sigma_P^2) - \sigma_N \sigma_P \sqrt{(\mu_N - \mu_P)^2 + (\sigma_N^2 - \sigma_P^2) \log(\sigma_N^2 / \sigma_P^2)}}{\sigma_N^2 - \sigma_P^2}$$
where the negative class is normally distributed with $\sim N(\mu_N, \sigma_N^2)$ and the positive class independently normally distributed with $\sim N(\mu_P, \sigma_P^2)$ provides the optimal cutpoint $c^*$ that maximizes the Youden-Index. If $\sigma_N$ and $\sigma_P$ are equal, the expression can be simplified to $c^* = \frac{\mu_N + \mu_P}{2}$. However, the `oc_youden_normal` method in cutpointr always assumes unequal standard deviations. Since this method does not select a cutpoint from the observed predictor values, it is questionable which values for $Se$ and $Sp$ should be reported. Here, the Youden-Index can be calculated as
$$J = \Phi(\frac{c^* - \mu_N}{\sigma_N}) - \Phi(\frac{c^* - \mu_P}{\sigma_P})$$
if the assumption of normality holds. However, since there exist several methods that do not select cutpoints from the available observations and to unify the reporting of metrics for these methods, **cutpointr** reports all metrics, e.g. $Se$ and $Sp$, based on the empirical observations.
```{r}
cutpointr(suicide, dsi, suicide, gender, method = oc_youden_normal)
```
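For illustration, the closed-form cutpoint from the formula above can also be computed directly from the class means and standard deviations. This is only a sketch of the calculation, not the exact internal implementation of `oc_youden_normal`:
```{r, eval = FALSE}
# Normal-theory cutpoint for the suicide data, following the formula above
x_neg <- suicide$dsi[suicide$suicide == "no"]
x_pos <- suicide$dsi[suicide$suicide == "yes"]
mu_n <- mean(x_neg); sd_n <- sd(x_neg)
mu_p <- mean(x_pos); sd_p <- sd(x_pos)
(mu_p * sd_n^2 - mu_n * sd_p^2 -
    sd_n * sd_p * sqrt((mu_n - mu_p)^2 + (sd_n^2 - sd_p^2) * log(sd_n^2 / sd_p^2))) /
  (sd_n^2 - sd_p^2)
```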
### Nonparametric kernel method
A nonparametric alternative is the Kernel method [@fluss_estimation_2005]. Here, the empirical distribution functions are smoothed using the Gaussian kernel functions $\hat{F}_N(t) = \frac{1}{n} \sum^n_{i=1} \Phi(\frac{t - y_i}{h_y})$ and $\hat{G}_P(t) = \frac{1}{m} \sum^m_{i=1} \Phi(\frac{t - x_i}{h_x})$ for the negative and positive classes respectively. Following Silverman's plug-in "rule of thumb" the bandwidths are selected as $h_y = 0.9 * \min\{s_y, iqr_y/1.34\} * n^{-0.2}$ and $h_x = 0.9 * \min\{s_x, iqr_x/1.34\} * m^{-0.2}$ where $s$ is the sample standard deviation and $iqr$ is the interquartile range. It has been demonstrated that AUC estimation is rather insensitive to the choice of the bandwidth procedure [@faraggi_estimation_2002] and thus the plug-in bandwidth estimator has also been recommended for cutpoint estimation. The `oc_youden_kernel` function in **cutpointr** uses a Gaussian kernel and the direct plug-in method for selecting the bandwidths. The kernel smoothing is done via the `bkde` function from the **KernSmooth** package [@wand_kernsmooth:_2013].
Again, there is a way to calculate the Youden-Index from the results of this method [@fluss_estimation_2005] which is
$$\hat{J} = \max_c \{\hat{F}_N(c) - \hat{G}_P(c) \}$$
but as before we prefer to report all metrics based on applying the cutpoint that was estimated using the Kernel method to the empirical observations.
```{r}
cutpointr(suicide, dsi, suicide, gender, method = oc_youden_kernel)
```
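As a side note, the rule-of-thumb bandwidth given above can be computed directly; a sketch for the negative class of the `suicide` data, illustrating only the formula and not the exact bandwidth selection in `oc_youden_kernel`:
```{r, eval = FALSE}
# Silverman's rule-of-thumb bandwidth for the negative class
y <- suicide$dsi[suicide$suicide == "no"]
0.9 * min(sd(y), IQR(y) / 1.34) * length(y)^(-0.2)
```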
# Additional features
## Calculating only the ROC curve
When running `cutpointr`, a ROC curve is by default returned in the column `roc_curve`.
This ROC curve can be plotted using `plot_roc`. Alternatively, if only the
ROC curve is desired and no cutpoint needs to be calculated, the ROC curve
can be created using `roc()` and plotted using `plot_cutpointr`.
The `roc` function, unlike `cutpointr`, does not determine `direction`, `pos_class` or `neg_class`
automatically.
```{r, fig.width=4, fig.height=3}
roc_curve <- roc(data = suicide, x = dsi, class = suicide,
pos_class = "yes", neg_class = "no", direction = ">=")
auc(roc_curve)
head(roc_curve)
plot_roc(roc_curve)
```
## Midpoints
So far - which is the default in `cutpointr` - we have considered all unique values of the predictor as possible cutpoints. An alternative could be to use a sequence of equidistant values instead, for example in the case of the `suicide` data all integers in $[0, 10]$. However, with very sparse data and small intervals between the candidate cutpoints (i.e. a 'dense' sequence like `seq(0, 10, by = 0.01)`) this leads to the uninformative evaluation of large ranges of cutpoints that all result in the same metric value. A more elegant alternative, not only for the case of sparse data, that is supported by **cutpointr** is the use of a mean value of the optimal cutpoint and the next lowest (if `direction = ">="`) or the next highest (if `direction = "<="`) predictor value in the data. The result is an optimal cutpoint that is equal to the cutpoint that would be obtained using an infinitely dense sequence of candidate cutpoints, while being more efficient computationally. This behavior can be activated by setting `use_midpoints = TRUE`, which is not the default. If we use this setting, we obtain an optimal cutpoint of 1.5 for the complete sample of the `suicide` data instead of 2 when maximizing the sum of sensitivity and specificity.
Assume the following small data set:
```{r}
dat <- data.frame(outcome = c("neg", "neg", "neg", "pos", "pos", "pos", "pos"),
pred = c(1, 2, 3, 8, 11, 11, 12))
```
Since the distance of the optimal cutpoint (8) to the next lowest
observation (3) is rather large, we arrive at a range of possible cutpoints that
all maximize the metric. In the case of this kind of sparseness it might for example be
desirable to classify a new observation with a predictor value of 4 as belonging
to the negative class. If `use_midpoints` is set to `TRUE`, the mean of the
optimal cutpoint and the next lowest observation is returned as the optimal
cutpoint, if direction is `>=`. The mean of the optimal cutpoint and the next
highest observation is returned as the optimal cutpoint, if `direction = "<="`.
```{r}
opt_cut <- cutpointr(dat, x = pred, class = outcome, use_midpoints = TRUE)
plot_x(opt_cut)
```
A simulation demonstrates more clearly that setting `use_midpoints = TRUE` avoids
biasing the cutpoints. To simulate the bias of the metric functions, the
predictor values of both classes were drawn from normal distributions with
constant standard deviations of 10, a constant mean of the negative class of 100
and higher mean values of the positive class that are selected in such a way
that optimal Youden-Index values of 0.2, 0.4, 0.6, and 0.8 result in the population.
Samples of 9 different sizes were drawn and the cutpoints that maximize the
Youden-Index were estimated. The simulation was repeated 10000 times. As can be
seen by the mean error, `use_midpoints = TRUE` eliminates the bias that is
introduced by otherwise selecting the value of an observation as the optimal
cutpoint. If `direction = ">="`, as in this case, the observation that
represents the optimal cutpoint is the highest possible cutpoint that leads to the
optimal metric value and thus the biases are positive. The methods `oc_youden_normal`
and `oc_youden_kernel` are always unbiased, as they don't select a cutpoint
based on the ROC curve or the function of metric values per cutpoint.
```{r, echo = FALSE}
plotdat_nomidpoints <- structure(list(sim_nr = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 6L,
6L, 6L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 8L,
8L, 8L, 8L, 8L, 8L, 8L, 8L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 10L,
10L, 10L, 10L, 10L, 10L, 10L, 10L, 11L, 11L, 11L, 11L, 11L, 11L,
11L, 11L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 13L, 13L, 13L,
13L, 13L, 13L, 13L, 13L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 14L,
15L, 15L, 15L, 15L, 15L, 15L, 15L, 15L, 16L, 16L, 16L, 16L, 16L,
16L, 16L, 16L, 17L, 17L, 17L, 17L, 17L, 17L, 17L, 17L, 18L, 18L,
18L, 18L, 18L, 18L, 18L, 18L, 19L, 19L, 19L, 19L, 19L, 19L, 19L,
19L, 20L, 20L, 20L, 20L, 20L, 20L, 20L, 20L, 21L, 21L, 21L, 21L,
21L, 21L, 21L, 21L, 22L, 22L, 22L, 22L, 22L, 22L, 22L, 22L, 23L,
23L, 23L, 23L, 23L, 23L, 23L, 23L, 24L, 24L, 24L, 24L, 24L, 24L,
24L, 24L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 26L, 26L, 26L,
26L, 26L, 26L, 26L, 26L, 27L, 27L, 27L, 27L, 27L, 27L, 27L, 27L,
28L, 28L, 28L, 28L, 28L, 28L, 28L, 28L, 29L, 29L, 29L, 29L, 29L,
29L, 29L, 29L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 31L, 31L,
31L, 31L, 31L, 31L, 31L, 31L, 32L, 32L, 32L, 32L, 32L, 32L, 32L,
32L, 33L, 33L, 33L, 33L, 33L, 33L, 33L, 33L, 34L, 34L, 34L, 34L,
34L, 34L, 34L, 34L, 35L, 35L, 35L, 35L, 35L, 35L, 35L, 35L, 36L,
36L, 36L, 36L, 36L, 36L, 36L, 36L), method = structure(c(1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L), .Label = c("emp",
"normal", "loess", "boot", "spline", "spline_20", "kernel", "gam"
), class = "factor"), n = c(30, 30, 30, 30, 30, 30, 30, 30, 50,
50, 50, 50, 50, 50, 50, 50, 75, 75, 75, 75, 75, 75, 75, 75, 100,
100, 100, 100, 100, 100, 100, 100, 150, 150, 150, 150, 150, 150,
150, 150, 250, 250, 250, 250, 250, 250, 250, 250, 500, 500, 500,
500, 500, 500, 500, 500, 750, 750, 750, 750, 750, 750, 750, 750,
1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 30, 30, 30, 30,
30, 30, 30, 30, 50, 50, 50, 50, 50, 50, 50, 50, 75, 75, 75, 75,
75, 75, 75, 75, 100, 100, 100, 100, 100, 100, 100, 100, 150,
150, 150, 150, 150, 150, 150, 150, 250, 250, 250, 250, 250, 250,
250, 250, 500, 500, 500, 500, 500, 500, 500, 500, 750, 750, 750,
750, 750, 750, 750, 750, 1000, 1000, 1000, 1000, 1000, 1000,
1000, 1000, 30, 30, 30, 30, 30, 30, 30, 30, 50, 50, 50, 50, 50,
50, 50, 50, 75, 75, 75, 75, 75, 75, 75, 75, 100, 100, 100, 100,
100, 100, 100, 100, 150, 150, 150, 150, 150, 150, 150, 150, 250,
250, 250, 250, 250, 250, 250, 250, 500, 500, 500, 500, 500, 500,
500, 500, 750, 750, 750, 750, 750, 750, 750, 750, 1000, 1000,
1000, 1000, 1000, 1000, 1000, 1000, 30, 30, 30, 30, 30, 30, 30,
30, 50, 50, 50, 50, 50, 50, 50, 50, 75, 75, 75, 75, 75, 75, 75,
75, 100, 100, 100, 100, 100, 100, 100, 100, 150, 150, 150, 150,
150, 150, 150, 150, 250, 250, 250, 250, 250, 250, 250, 250, 500,
500, 500, 500, 500, 500, 500, 500, 750, 750, 750, 750, 750, 750,
750, 750, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000), mean_err = c(0.532157164015659,
0.0344907054484091, 1.09430750651166, 0.847845162156675, 1.72337372126503,
0.893756658507988, 0.0430309247027736, 0.785821459035346, 0.368063404388512,
0.0256197760404459, 0.54480529648463, 0.54385597929651, 0.657325657699579,
0.578611116865437, 0.0400491342691897, 0.515688005217413, 0.256713912589642,
0.0444582875885996, 0.326975493112402, 0.371128780921122, 0.473515115741104,
0.389519558405289, 0.105044360789378, 0.301924717299333, 0.207750921776918,
-0.00318128936770314, 0.215170156089776, 0.27218780048926, 0.260519564021842,
0.236792923882582, 0.0209319074923902, 0.232957055204834, 0.0726605917614469,
-0.00282823355849125, 0.0753216783313991, 0.147121931849656,
0.0986417724955371, 0.10048009778446, -0.0117861260923649, 0.0650845904350442,
0.0985144485083747, 0.00601003227249322, 0.107439979908118, 0.120777421732797,
0.098470489820427, 0.0940946984227826, 0.0340166854141625, 0.107851118082414,
0.0249685210781582, -0.00275219600614378, 0.0258069390207201,
0.0303381972274654, -0.000994602151869198, 0.00196854833683764,
-0.0172319489562159, 0.0230957871932473, 0.00787486424680835,
-0.018438041997315, -0.000808033567394628, -0.00151904153864496,
-0.0258118523805697, -0.020984156892953, -0.0411584927473141,
-0.0462075435919094, 0.0481149217843661, -0.0115241085997692,
0.0278045708419731, 0.0358588316426625, 0.0424473909450939, 0.0379773233328197,
0.0298772985879321, 0.0494939492036379, 0.561286668337778, 0.0210874502384648,
0.607711822769155, 0.944733906256477, 1.32069801051061, 0.623604782428556,
0.0138075109769806, 0.640859854412358, 0.284873604303057, -0.0170357985365701,
0.273426417633118, 0.524432737895336, 0.355003110979807, 0.312837607951434,
-0.0316296929553873, 0.270109834098986, 0.174110335581819, 0.0253101719615279,
0.199956222742702, 0.375485416120771, 0.278956745944806, 0.244245525945888,
0.0325126314233263, 0.253352659868514, 0.154840760004461, 0.00231589709639472,
0.154412480165179, 0.264847742842386, 0.196572744185608, 0.182934783774992,
0.0207139021497755, 0.17351041376412, 0.14190910348156, -0.000766834484010096,
0.159975205214477, 0.191222128926019, 0.119768252112669, 0.12033372914036,
-0.00429047209392067, 0.120982527821078, 0.0756304869484406,
-0.00890884219048113, 0.0727782693168392, 0.118690444738942,
0.0814898789647033, 0.0799724348001957, 0.0182926240912726, 0.0887155007804252,
0.00799604720502299, 0.000599148388616836, -0.00567769035990384,
0.0358412514670032, 0.0308474979074875, 0.0341668723768997, -0.000180318451026095,
0.0180733341290925, 0.00456876236626807, 0.00150574966485876,
0.011152095953916, 0.0176039119729626, 0.00608274255434991, 0.0146257828313115,
-0.0108877417404102, 0.00341198000323035, -0.00198459880370283,
0.0026551895445694, 0.00199040664534129, 0.0150165794544221,
0.00646287144368147, 0.00999205240904708, -0.00850278571195971,
0.000833666619266177, 0.714730067273087, -0.00916546079360956,
0.662799490366986, 1.18552468844156, 1.25901933062308, 0.672701515532179,
0.0311066140197676, 0.699068058809396, 0.451908043813962, 0.0131716226592205,
0.429551369887738, 0.697928133235757, 0.445367768988423, 0.408463448982185,
0.0318707241721211, 0.406284953982951, 0.257736307754364, -0.00525924423719458,
0.236977000055322, 0.429144726596141, 0.291381752107184, 0.267557606428613,
0.0103657879176852, 0.254728590646094, 0.187783398487578, -0.00216064381479362,
0.209025103860707, 0.318293592390017, 0.216751610346408, 0.195630579126633,
-0.00355723971644246, 0.174111826408428, 0.151010324964235, 0.0152409223899092,
0.159002511320467, 0.214643583389694, 0.136211731513269, 0.138948149207635,
0.00736196817594524, 0.115637867729083, 0.0491348055596302, -0.00133957946235943,
0.0507437758212659, 0.103956325245849, 0.0641182216839426, 0.0721933081297794,
-0.0124376134651938, 0.0632317888879588, 0.0322195712438111,
0.00170122889182022, 0.0287526766624194, 0.0589662164030242,
0.0348535721709848, 0.039527944642463, -0.00617539706415593,
0.0274246010641889, 0.0325877909680824, 0.00530528253248245,
0.0221776555499961, 0.0389702052631117, 0.0221602091288215, 0.0254478639695596,
-0.0016189234058987, 0.0197144417326668, -0.00632485262604172,
-0.00364979854195596, -0.00276076468388984, 0.0126267527301874,
0.0123592498266038, 0.0154921632247644, -0.00591512196680815,
0.0098685016547149, 1.19276916750486, -0.0296831640401583, 0.99406393593888,
1.95758669445116, 1.45842790978446, 0.916899913902239, -0.0240222410217233,
1.00771193034927, 0.748151091428865, 0.001671855025917, 0.665180535306263,
1.1777049634557, 0.578603609273264, 0.546625362141714, 0.0292152981607387,
0.615230814912951, 0.417886753756131, 0.00324593885807739, 0.406076310942717,
0.732191741449251, 0.352684769616612, 0.326901376027897, -0.000759357576337989,
0.350075431324921, 0.310927617707656, -0.0107255472998434, 0.28102101085112,
0.514683023356017, 0.24913510139508, 0.235155452507568, -0.0220885572014814,
0.243370611433649, 0.209652330609093, -0.00502865663759991, 0.2172246261346,
0.356540958804122, 0.172121720418057, 0.17487914828986, 0.00365942442127361,
0.176594681455494, 0.126956927057327, -0.00270525933073803, 0.120116234221594,
0.210827536708082, 0.101520409193932, 0.101379097920023, 0.00238043252144371,
0.113027315928011, 0.0598624378953727, -0.00538838415690431,
0.0568400730102315, 0.0978115258288965, 0.0454207684906316, 0.0473140143579152,
-0.00165813015281622, 0.0521772135812508, 0.0530224090961669,
-0.000416993405198653, 0.0353236911458531, 0.0605493601241619,
0.0316204159297213, 0.0344789374555544, -0.00446984887315054,
0.0328807595695966, 0.0396438546423947, -0.00331466369719113,
0.0379029847219126, 0.0572435100638761, 0.0253269328104989, 0.0235663211070417,
0.00220241478536399, 0.0307132312422208), youden = c(0.2, 0.2,
0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2,
0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2,
0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2,
0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2,
0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2,
0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4,
0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4,
0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4,
0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4,
0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4,
0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.6,
0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6,
0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6,
0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6,
0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6,
0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6,
0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8,
0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8,
0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8,
0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8,
0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8,
0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8
)), row.names = c(NA, -288L), group_sizes = c(8L, 8L, 8L, 8L,
8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L,
8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L
), biggest_group_size = 8L, class = c("grouped_df", "tbl_df",
"tbl", "data.frame"), groups = structure(list(sim_nr = 1:36,
.rows = list(1:8, 9:16, 17:24, 25:32, 33:40, 41:48, 49:56,
57:64, 65:72, 73:80, 81:88, 89:96, 97:104, 105:112, 113:120,
121:128, 129:136, 137:144, 145:152, 153:160, 161:168,
169:176, 177:184, 185:192, 193:200, 201:208, 209:216,
217:224, 225:232, 233:240, 241:248, 249:256, 257:264,
265:272, 273:280, 281:288)), row.names = c(NA, -36L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE))
```
```{r, echo = FALSE}
library(dplyr)
ggplot(plotdat_nomidpoints,
aes(x = n, y = mean_err, color = method, shape = method)) +
geom_line() + geom_point() +
facet_wrap(~ youden, scales = "fixed") +
scale_shape_manual(values = 1:nlevels(plotdat_nomidpoints$method)) +
scale_x_log10(breaks = c(30, 50, 75, 100, 150, 250, 500, 750, 1000)) +
ggtitle("Bias of all methods when use_midpoints = FALSE",
"normally distributed data, 10000 repetitions of simulation")
```
## Finding all cutpoints with acceptable performance
By default, most packages only return the "best" cutpoint and disregard other cutpoints with quite similar performance, even if the performance differences are minuscule. **cutpointr** makes this process more explicit via the `tol_metric` argument. For example, if all cutpoints are of interest that achieve an accuracy within `0.05` of the optimally achievable accuracy, `tol_metric` can be set to `0.05` and those cutpoints will be returned as well.
In the case of the `suicide` data and when maximizing the sum of sensitivity and specificity, empirically the cutpoints 2 and 3 lead to quite similar performances. If `tol_metric` is set to `0.05`, both will be returned.
```{r}
opt_cut <- cutpointr(suicide, dsi, suicide, metric = sum_sens_spec,
tol_metric = 0.05, break_ties = c)
library(tidyr)
opt_cut %>%
select(optimal_cutpoint, sum_sens_spec) %>%
unnest(cols = c(optimal_cutpoint, sum_sens_spec))
```
## Manual and mean / median cutpoints
Using the `oc_manual` function the optimal cutpoint will not be determined
based on, for example, a metric but is instead set manually using the
`cutpoint` argument. This is useful for supplying and evaluating cutpoints that were found
in the literature or in other external sources.
The `oc_manual` function could also be used to set the cutpoint to the sample
mean using `cutpoint = mean(data$x)`. However, this may introduce a bias into the
bootstrap validation procedure, since the actual mean of the population is
not known and thus the mean to be used as the cutpoint should be automatically determined in every resample.
To do so, the `oc_mean` and `oc_median` functions can be used.
```{r}
set.seed(100)
opt_cut_manual <- cutpointr(suicide, dsi, suicide, method = oc_manual,
cutpoint = mean(suicide$dsi), boot_runs = 30)
set.seed(100)
opt_cut_mean <- cutpointr(suicide, dsi, suicide, method = oc_mean, boot_runs = 30)
```
## Nonstandard evaluation via tidyeval
The arguments to `cutpointr` do not need to be enclosed in quotes. This is
possible thanks to nonstandard evaluation of the arguments, which are
evaluated on `data`.
Functions that use nonstandard evaluation are often not suitable for
programming with. The use of nonstandard evaluation may lead to scoping
problems and subsequent obvious as well as possibly subtle errors.
**cutpointr** uses tidyeval internally and accordingly the same rules as
for programming with `dplyr` apply. Arguments can be unquoted with `!!`:
```{r, eval = FALSE}
myvar <- "dsi"
cutpointr(suicide, !!myvar, suicide)
```
## ROC curve and optimal cutpoint for multiple variables
Alternatively, we can map `cutpointr` to
multiple predictor columns. If `direction` and / or `pos_class` and `neg_class` are unspecified, these parameters
will automatically be determined by **cutpointr** so that the AUC values for all
variables will be $> 0.5$.
We could do this manually, e.g. using `purrr::map`, but to make this task more convenient
`multi_cutpointr` can be used
to achieve the same result. It maps multiple predictor columns to
`cutpointr`, by default all numeric columns except for the class column.
```{r}
mcp <- multi_cutpointr(suicide, class = suicide, pos_class = "yes",
use_midpoints = TRUE, silent = TRUE)
summary(mcp)
```
## Accessing `data`, `roc_curve`, and `boot`
The object returned by `cutpointr` is of the classes `cutpointr`, `tbl_df`,
`tbl`, and `data.frame`. Thus, it can be handled like a usual data frame. The
columns `data`, `roc_curve`, and `boot` consist of nested data frames, which means that
these are list columns whose elements are data frames. They can either be accessed
using `[` or by using functions from the tidyverse. If subgroups were given,
the output contains one row per subgroup and the function
that accesses the data should be mapped to every row or the data should be
grouped by subgroup.
```{r, cache=TRUE}
# Extracting the bootstrap results
set.seed(123)
opt_cut <- cutpointr(suicide, dsi, suicide, gender, boot_runs = 1000)
# Using base R to summarise the result of the bootstrap
summary(opt_cut$boot[[1]]$optimal_cutpoint)
summary(opt_cut$boot[[2]]$optimal_cutpoint)
# Using dplyr and tidyr
library(tidyr)
opt_cut %>%
group_by(subgroup) %>%
select(boot) %>%
unnest(boot) %>%
summarise(sd_oc_boot = sd(optimal_cutpoint),
m_oc_boot = mean(optimal_cutpoint),
m_acc_oob = mean(acc_oob))
```
## Adding metrics to the result of cutpointr() or roc()
By default, the output of `cutpointr` includes the optimized metric and several
other metrics. The `add_metric` function adds further metrics.
Here, we're adding the negative predictive value (NPV) and
the positive predictive value (PPV) at the optimal cutpoint per subgroup:
```{r}
cutpointr(suicide, dsi, suicide, gender, metric = youden, silent = TRUE) %>%
add_metric(list(ppv, npv)) %>%
select(subgroup, optimal_cutpoint, youden, ppv, npv)
```
In the same fashion, additional metric columns can be added to a `roc_cutpointr`
object:
```{r}
roc(data = suicide, x = dsi, class = suicide, pos_class = "yes",
neg_class = "no", direction = ">=") %>%
add_metric(list(cohens_kappa, F1_score)) %>%
select(x.sorted, tp, fp, tn, fn, cohens_kappa, F1_score) %>%
head()
```
## User-defined functions
### method
User-defined functions can be supplied to `method`, which is the function that
is responsible for returning the optimal cutpoint.
To define a new method function, create a function that may take
as input(s):
- `data`: A `data.frame` or `tbl_df`
- `x`: (character) The name of the predictor variable
- `class`: (character) The name of the class variable
- `metric_func`: A function for calculating a metric, e.g. accuracy. Note
that the method function does not necessarily have to accept this argument
- `pos_class`: The positive class
- `neg_class`: The negative class
- `direction`: `">="` if the positive class has higher x values, `"<="` otherwise
- `tol_metric`: (numeric) In the built-in methods, all cutpoints will be returned that lead to a metric
value in the interval [m_max - tol_metric, m_max + tol_metric] where
m_max is the maximum achievable metric value. This can be used to return
multiple decent cutpoints and to avoid floating-point problems.
- `use_midpoints`: (logical) In the built-in methods, if TRUE (default FALSE) the returned optimal
cutpoint will be the mean of the optimal cutpoint and the next lowest
observation (for `direction = ">="`) or the next highest observation
(for `direction = "<="`), which avoids biasing the optimal cutpoint.
- `...`: Further arguments that are passed to `metric` or that can be captured
inside of `method`
The function should return a data frame or tibble with
one row, the column `optimal_cutpoint`, and an optional column with an arbitrary name
with the metric value at the optimal cutpoint.
For example, a function for choosing the cutpoint as the mean of the independent
variable could look like this:
```{r, eval = FALSE}
mean_cut <- function(data, x, ...) {
oc <- mean(data[[x]])
return(data.frame(optimal_cutpoint = oc))
}
```
If a `method` function does not return a metric column, the default `sum_sens_spec`, the sum of sensitivity and
specificity, is returned as the extra metric column in addition to accuracy,
sensitivity and specificity.
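As a slightly fuller sketch, a hypothetical method function can also return a metric column by applying the supplied `metric_func` at the chosen cutpoint. The following relies only on the arguments documented above and is not part of the package:
```{r, eval = FALSE}
median_cut <- function(data, x, class, metric_func = accuracy,
                       pos_class, direction, ...) {
  oc <- median(data[[x]])
  # Confusion matrix at the candidate cutpoint
  pred_pos <- if (direction == ">=") data[[x]] >= oc else data[[x]] <= oc
  obs_pos <- data[[class]] == pos_class
  tp <- sum(pred_pos & obs_pos)
  fp <- sum(pred_pos & !obs_pos)
  tn <- sum(!pred_pos & !obs_pos)
  fn <- sum(!pred_pos & obs_pos)
  m <- metric_func(tp = tp, fp = fp, tn = tn, fn = fn)
  data.frame(optimal_cutpoint = oc, metric_value = as.numeric(m))
}
```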
Some `method` functions that make use of the additional arguments (that are
captured by `...`) are already included in **cutpointr**, see
the list at the top. Since these functions are arguments to `cutpointr`
their code can be accessed by simply typing their name, see for example
`oc_youden_normal`.
### metric
User-defined `metric` functions can be used as well. They are mainly useful in
conjunction with `method = maximize_metric`, `method = minimize_metric`, or one of
the other minimization and maximization functions.
In case of a different `method` function `metric` will only be used as the main
out-of-bag metric when plotting the result. The `metric` function should
accept the following inputs as vectors:
- `tp`: Vector of true positives
- `fp`: Vector of false positives
- `tn`: Vector of true negatives
- `fn`: Vector of false negatives
- `...`: Further arguments
The function should return a numeric vector, a matrix, or a `data.frame` with one column. If the column is named, the name will be included in the output and plots. Avoid using names that are identical to the column names that are by default returned by **cutpointr**, as such names will be prefixed by `metric_` in the output. The inputs (`tp`, `fp`, `tn`, and `fn`) are vectors with one element per evaluated cutpoint.
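For example, a user-defined metric that follows these rules could be the geometric mean of sensitivity and specificity; a sketch that is not part of the package:
```{r, eval = FALSE}
# Geometric mean of sensitivity and specificity as a user-defined metric
geom_mean_sens_spec <- function(tp, fp, tn, fn, ...) {
  sens <- tp / (tp + fn)
  spec <- tn / (tn + fp)
  res <- matrix(sqrt(sens * spec), ncol = 1)
  colnames(res) <- "geom_mean_sens_spec"
  res
}
cutpointr(suicide, dsi, suicide, metric = geom_mean_sens_spec)
```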
The code of the included metric functions can be accessed by simply typing their name.
For example, this is the `misclassification_cost` metric function:
```{r}
misclassification_cost
```
# Plotting
**cutpointr** includes several convenience functions for plotting data from a
`cutpointr` object. These include:
- `plot_cutpointr`: General purpose plotting function for cutpointr or roc_cutpointr objects
- `plot_cut_boot`: Plot the bootstrapped distribution of optimal cutpoints
- `plot_metric`: If `maximize_metric` or `minimize_metric` was used this function
plots all possible cutoffs on the x-axis vs. the respective metric values on
the y-axis. If bootstrapping was run, a confidence interval based on the
bootstrapped distribution of metric values at each cutpoint can be displayed.
To display no confidence interval set `conf_lvl = 0`.
- `plot_metric_boot`: Plot the distribution of out-of-bag metric values
- `plot_precision_recall`: Plot the precision recall curve
- `plot_sensitivity_specificity`: Plot all cutpoints vs. sensitivity and specificity
- `plot_roc`: Plot the ROC curve
- `plot_x`: Plot the distribution of the predictor variable
```{r, fig.width=4, fig.height=3}
set.seed(102)
opt_cut <- cutpointr(suicide, dsi, suicide, gender, method = minimize_metric,
metric = abs_d_sens_spec, boot_runs = 200, silent = TRUE)
opt_cut
plot_cut_boot(opt_cut)
plot_metric(opt_cut, conf_lvl = 0.9)
plot_metric_boot(opt_cut)
plot_precision_recall(opt_cut)
plot_sensitivity_specificity(opt_cut)
plot_roc(opt_cut)
```
All plot functions, except for the standard plot method that returns
a composed plot, return `ggplot` objects
that can be further modified. For example, changing labels, title, and the theme
can be achieved this way:
```{r, fig.width=4, fig.height=3}
p <- plot_x(opt_cut)
p + ggtitle("Distribution of dsi") + theme_minimal() + xlab("Depression score")
```
## Flexible plotting function
Using `plot_cutpointr` any metric can be chosen to be plotted on the x- or
y-axis and results of `cutpointr()` as well as `roc()` can be plotted.
If a `cutpointr` object is to be plotted, it is thus irrelevant which `metric`
function was chosen for cutpoint estimation. Any metric that can be calculated
based on the ROC curve can be subsequently plotted as only the true / false
positives / negatives over all cutpoints are needed.
That way, not only the above plots can be produced, but also any
combination of two metrics (or metric functions) and / or cutpoints. The built-in
metric functions as well as user-defined functions or anonymous functions can
be supplied to `xvar` and `yvar`. If bootstrapping was run, confidence intervals
can be plotted around the y-variable. This is especially useful if the cutpoints,
available in the `cutpoints` function, are placed on the x-axis.
Note that confidence intervals can only be correctly plotted if the values of
`xvar` are constant across bootstrap samples. For example, confidence intervals
for TPR by FPR (a ROC curve) cannot be plotted easily, as the values of the false
positive rate vary per bootstrap sample.
```{r, fig.width=4, fig.height=3, cache=TRUE}
set.seed(500)
oc <- cutpointr(suicide, dsi, suicide, boot_runs = 1000,
metric = sum_ppv_npv) # metric irrelevant for plot_cutpointr
plot_cutpointr(oc, xvar = cutpoints, yvar = sum_sens_spec, conf_lvl = 0.9)
plot_cutpointr(oc, xvar = fpr, yvar = tpr, aspect_ratio = 1, conf_lvl = 0)
plot_cutpointr(oc, xvar = cutpoint, yvar = tp, conf_lvl = 0.9) + geom_point()
```
## Manual plotting
Since `cutpointr` returns a `data.frame` with the original data, bootstrap
results, and the ROC curve in nested tibbles, these data can be conveniently
extracted and plotted manually. The relevant
nested tibbles are in the columns `data`, `roc_curve` and `boot`. The following
is an example of accessing and plotting the grouped data.
```{r, fig.width=4, fig.height=3, cache=TRUE}
set.seed(123)
opt_cut <- cutpointr(suicide, dsi, suicide, gender, boot_runs = 1000)
opt_cut %>%
select(data, subgroup) %>%
unnest(cols = data) %>%
ggplot(aes(x = suicide, y = dsi)) +
geom_boxplot(alpha = 0.3) + facet_grid(~subgroup)
```
# Benchmarks
To offer a comparison to established solutions,
**cutpointr** 1.0.0 will be benchmarked against `optimal.cutpoints`
from **OptimalCutpoints** 1.1-4, **ThresholdROC** 2.7 and custom functions based on
**ROCR** 1.0-7 and **pROC** 1.15.0. By generating data of different sizes,
the benchmarks will offer a comparison of the scalability of the different
solutions.
Using `prediction` and `performance` from the **ROCR** package and `roc` from the
**pROC** package, we can write functions for computing the cutpoint that maximizes the sum of sensitivity and
specificity. **pROC** has a built-in function to optimize a few metrics:
```{r, eval = FALSE}
# Return cutpoint that maximizes the sum of sensitivity and specificity
# ROCR package
rocr_sensspec <- function(x, class) {
pred <- ROCR::prediction(x, class)
perf <- ROCR::performance(pred, "sens", "spec")
sens <- slot(perf, "y.values")[[1]]
spec <- slot(perf, "x.values")[[1]]
cut <- slot(perf, "alpha.values")[[1]]
cut[which.max(sens + spec)]
}
# pROC package
proc_sensspec <- function(x, class) {
r <- pROC::roc(class, x, algorithm = 2, levels = c(0, 1), direction = "<")
pROC::coords(r, "best", ret="threshold", transpose = FALSE)[1]
}
```
The benchmarking will be carried out using the **microbenchmark** package and randomly
generated data. The values of the `x` predictor variable are drawn from a normal distribution
which leads to a lot more unique values than were encountered before in the
`suicide` data. Accordingly, the search for an optimal cutpoint is much more
demanding, if all possible cutpoints are evaluated.
Benchmarks are run for sample sizes of 100, 1000, 1e4, 1e5, 1e6, and 1e7.
For low sample sizes **cutpointr** is slower than the other
solutions. While this should be of low practical importance, **cutpointr** scales
more favorably with increasing sample size. The speed disadvantage in small
samples that leads to the lower limit of around 25ms is mainly due to the nesting
of the original data and the results, which makes the compact output of `cutpointr`
possible. This observation is supported by the fact that `cutpointr::roc` is
also quite fast in small samples. For sample sizes > 1e5 **cutpointr**
is a little faster than the functions based on **ROCR** and **pROC**. Both of these
solutions are generally faster than **OptimalCutpoints** and **ThresholdROC** with the exception of
small samples. **OptimalCutpoints** and **ThresholdROC** had to be excluded from benchmarks with
more than 1e4 observations due to high memory requirements and/or excessive run times, rendering
the use of these packages in larger samples impractical.
```{r, eval = FALSE, echo = FALSE}
library(OptimalCutpoints)
library(ThresholdROC)
n <- 100
set.seed(123)
dat <- data.frame(x = rnorm(n), y = sample(c(0:1), size = n, replace = TRUE))
x_pos <- dat$x[dat$y == 1]
x_neg <- dat$x[dat$y == 0]
bench_100 <- microbenchmark::microbenchmark(
cutpointr(dat, x, y, pos_class = 1, neg_class = 0,
direction = ">=", metric = youden, break_ties = mean),
rocr_sensspec(dat$x, dat$y),
proc_sensspec(dat$x, dat$y),
optimal.cutpoints(X = "x", status = "y", tag.healthy = 0, methods = "Youden",
data = dat),
thres2(k1 = x_neg, k2 = x_pos, rho = 0.5,
method = "empirical", ci = FALSE),
times = 100, unit = "ms"
)
n <- 1000