0.632 Bootstrap #828
Replies: 5 comments
-
I forgot to attach the code:
-
Thanks so much for this detailed note and experiments. Based on your experiments (and of course to adhere to the original definition), it makes absolute sense to update the current implementation. It's been a busy summer so far (which is also why my answer came with such a big delay), but I am planning to get to it in the upcoming weeks.
-
Dear Sebastian,
Thank you for your e-mail. I can imagine things are hectic, no worries. If I can be of assistance with the update, please let me know.
Also, I would like to ask whether you would like me to contribute a bolstered resubstitution estimator module to mlxtend. I could work with one of my students to make a pull request. We don't have a popular implementation of it yet (just some C code I distribute with the paper), and of course mlxtend is very popular.
Best Regards
Ulisses.
--
Ulisses M. Braga-Neto, Ph.D.
Professor of Electrical and Computer Engineering
Director of the TAMIDS Scientific Machine Learning Lab
Texas A&M University
https://sciml.tamids.tamu.edu/
https://braganeto.engr.tamu.edu/
"Only the educated are free." -- Epictetus
-
Finally got some time to address the issue, and it is merged into the main branch now via #844. I will also make a new release later today. Thanks again for pointing out the issue with internal training set use. Regarding the bolstered resubstitution estimator, this sounds like a useful method, and I would welcome a PR if it is feasible. I realize though that converting the C code into a form that follows the mlxtend/scikit-learn API might involve some extra work. But in case you decide to make a PR, I am happy to give feedback and help with the process.
-
Thanks Sebastian! I will ask my student to look into submitting a PR.
Best Regards
Ulisses.
-
First of all, congratulations to Sebastian on leading this important initiative. I came to learn about it when I was searching for a Python implementation of the 0.632 bootstrap classification error estimator, which is provided by mlxtend.
The purpose of this post is to discuss the proper definition of the 0.632 bootstrap. In Efron's original 1983 paper, the "zero bootstrap" error estimator is defined as

$$\hat{\epsilon}_0 \;=\; \frac{\sum_{b=1}^{B}\sum_{i=1}^{n} Q(x_i, \mathbf{x}^{*b})\, I\!\left[P_i^{*b} = 0\right]}{\sum_{b=1}^{B}\sum_{i=1}^{n} I\!\left[P_i^{*b} = 0\right]},$$

where Q(x_i, x^{*b}) is 1 if the class prediction for x_i made by the classifier trained on the bootstrap sample x^{*b} disagrees with the label of x_i, and P_i^{*b} = 0 indicates that x_i does not appear in the bootstrap sample x^{*b}. Then the 0.632 bootstrap estimator is defined by

$$\hat{\epsilon}_{.632} \;=\; 0.368\,\overline{\mathrm{err}} \;+\; 0.632\,\hat{\epsilon}_0,$$

where $\overline{\mathrm{err}}$ is the resubstitution error (i.e., the training error) made by the classifier designed on the entire original training data.
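To make the definition above concrete, here is a minimal sketch of the "external" 0.632 bootstrap exactly as written, with the resubstitution error computed once on the full training set and only the out-of-bag points (those with P_i^{*b} = 0) entering the zero bootstrap. This is not the mlxtend implementation; the function name `external_632_bootstrap` and its parameters are made up for illustration, and it assumes a scikit-learn-style classifier:

```python
import numpy as np
from sklearn.base import clone

def external_632_bootstrap(estimator, X, y, n_bootstrap=200, random_state=0):
    """Illustrative sketch of Efron's 0.632 bootstrap (not mlxtend's API)."""
    rng = np.random.RandomState(random_state)
    n = X.shape[0]

    # Resubstitution (apparent) error: classifier trained on the entire training data.
    est_full = clone(estimator).fit(X, y)
    err_resub = np.mean(est_full.predict(X) != y)

    # Zero bootstrap: pool errors over all bootstrap samples, counting only
    # the points left out of each bootstrap sample (P_i^{*b} = 0).
    n_wrong, n_oob = 0, 0
    for _ in range(n_bootstrap):
        idx = rng.randint(0, n, size=n)        # bootstrap sample x^{*b}
        oob = np.setdiff1d(np.arange(n), idx)  # indices with P_i^{*b} = 0
        if oob.size == 0:
            continue
        est_b = clone(estimator).fit(X[idx], y[idx])
        n_wrong += np.sum(est_b.predict(X[oob]) != y[oob])
        n_oob += oob.size
    eps0 = n_wrong / n_oob

    # 0.632 combination of resubstitution and zero bootstrap errors.
    return 0.368 * err_resub + 0.632 * eps0
```

Called as, e.g., `external_632_bootstrap(SVC(kernel="linear"), X, y)`, this returns an error estimate (not an accuracy score), matching the definitions above.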
In mlxtend, `bootstrap_point632_score()` computes equation (6.12) internally to each bootstrap sample, i.e., the training error is computed with respect to the bootstrap sample, not the entire training data.

I ran a comparison experiment with Gaussian synthetic data, in which I computed the RMS (which combines bias and variance; see the definitions below) at different sample sizes for the "external" (Efron's definition) and the "internal" (mlxtend's implementation) 0.632 bootstrap error estimators, along with the plain zero bootstrap, the plain resubstitution error on the original training data, 2-fold cross-validation (which Efron claims in the 1983 paper would be essentially the same as the zero bootstrap; denoted "hcv" here), and bolstered resubstitution (more on this one later). I used a linear SVM classification rule and 5-nearest neighbors, to see the behavior with linear and nonlinear decision boundaries.

If e and ê denote the true error (estimated here with a large independent test sample of size M = 400) and the estimated error, respectively, the definitions are

$$\mathrm{Bias} = E[\hat{e} - e], \qquad \mathrm{Var} = \mathrm{Var}(\hat{e} - e), \qquad \mathrm{RMS} = \sqrt{E\!\left[(\hat{e} - e)^2\right]},$$

so that RMS² = Bias² + Var. The expectations were approximated by the usual sample estimator using a large number N of independent training data sets at each sample size (here, N = 500). A small RMS indicates a superior compromise between bias and variance. I also display an estimate of the deviation distribution, which is the probability density of ê − e, obtained by fitting a beta density to the vector of N differences ê − e. The deviation distribution should be centered around zero, for low bias, and tall and thin, for low variance.
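As a rough illustration of the simulation protocol, the sketch below approximates the bias, deviation variance, and RMS of a given error estimator; the two shifted Gaussian classes and all parameter values are placeholders rather than the exact settings of my experiment, and `simulate_rms` is a made-up name. Any callable with the signature `(estimator, X, y) -> error estimate` can be plugged in, e.g., the `external_632_bootstrap` sketch above, or a small wrapper around mlxtend's `bootstrap_point632_score` (which, if I recall correctly, returns accuracy scores, so one would convert them to errors first).

```python
import numpy as np
from sklearn.svm import SVC

def simulate_rms(error_estimator, n_train=40, n_test=400, n_repeats=500,
                 dim=2, delta=1.0, seed=0):
    """Approximate Bias, Var, and RMS of an error estimator by simulation."""
    rng = np.random.RandomState(seed)

    def sample(n):
        # Two equally likely Gaussian classes with means 0 and delta (placeholder model).
        labels = rng.randint(0, 2, size=n)
        points = rng.randn(n, dim) + delta * labels[:, None]
        return points, labels

    deviations = []
    for _ in range(n_repeats):
        X_tr, y_tr = sample(n_train)
        X_te, y_te = sample(n_test)

        clf = SVC(kernel="linear").fit(X_tr, y_tr)
        true_err = np.mean(clf.predict(X_te) != y_te)   # "true" error via large test set
        est_err = error_estimator(SVC(kernel="linear"), X_tr, y_tr)
        deviations.append(est_err - true_err)

    d = np.asarray(deviations)
    return d.mean(), d.var(), np.sqrt(np.mean(d ** 2))  # Bias, Var, RMS
```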
These results show that the internal and external bootstrap error estimators are comparable in performance for the 5NN classification rule, but the external one performs better for the linear SVM: it is less biased and less variable. Also, for both classification rules, the internal bootstrap seems to be essentially equal, on average, to the plain zero bootstrap, and hence positively biased, as Efron pointed out in the 1983 paper. I have not tried to work out why they agree in detail, but it does not seem to be a coincidence, since I have seen the same behavior in other experiments I ran with other classification rules. Contrary to what Efron claims in his paper, 2-fold CV is not similar to the zero bootstrap: it has low bias but a very large variance, as is typical of all cross-validation estimators.
The bolstered resubstitution estimator does not strictly belong to the current discussion, but I have added it for context. It is a modified form of resubstitution that has smaller bias and variance. Like resubstitution, it does not require training any additional classifiers and is thus much faster to compute than all cross-validation and bootstrap error estimators. We can see that it did not work well in the 5NN case, essentially because resubstitution is too biased there (even the external bootstrap suffers from this), but it has the best performance in the linear SVM case, despite being much simpler to compute. Empirically, we have consistently observed that bolstered resubstitution tends to perform well with linear or piecewise-linear decision boundaries (the latter include CART decision trees and ReLU neural networks). The original bolstered resubstitution paper is: https://www.sciencedirect.com/science/article/abs/pii/S0031320303003327
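To make the idea concrete, here is a minimal Monte Carlo sketch of bolstered resubstitution under simplifying assumptions: spherical Gaussian bolstering kernels with a single user-chosen width `sigma`, and the misclassified kernel mass estimated by sampling. In the paper, the kernel widths are instead estimated from the data (roughly, from nearest-neighbor distances within each class), and for linear classifiers the Gaussian integrals can be computed exactly, so `bolstered_resub_mc` and its parameters are illustrative only, not the estimator we distribute with the paper.

```python
import numpy as np

def bolstered_resub_mc(fitted_clf, X, y, sigma=0.5, n_mc=100, seed=0):
    """Monte Carlo sketch of bolstered resubstitution (illustrative only)."""
    rng = np.random.RandomState(seed)
    n, d = X.shape
    errors = np.empty(n)
    for i in range(n):
        # Spread x_i with a spherical Gaussian bolstering kernel and measure
        # the fraction of the kernel mass that the classifier mislabels.
        samples = X[i] + sigma * rng.randn(n_mc, d)
        errors[i] = np.mean(fitted_clf.predict(samples) != y[i])
    return errors.mean()
```

Note that the classifier is fitted once on the training data and is never retrained, which is what makes the estimator essentially as cheap as plain resubstitution.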