SLEP025: Losing Accuracy in Scikit-Learn Score #96
The SLEP index gains an entry for the new proposal::

    @@ -23,6 +23,7 @@
       :caption: Under review

       slep017/proposal
    +  slep025/proposal

    .. toctree::
       :maxdepth: 1
New file ``slep025/proposal.rst`` (``@@ -0,0 +1,147 @@``):
.. _slep_025:

==============================================
SLEP025: Losing Accuracy in Scikit-Learn Score
==============================================

:Author: Christian Lorentzen
:Status: Draft
:Type: Standards Track
:Created: 2025-12-07
:Resolution: TODO <url> (required for Accepted | Rejected | Withdrawn)
Abstract
--------

This SLEP proposes to rectify the default ``score`` method for scikit-learn
classifiers. Currently, the convenience of ``classifier.score(X, y)`` favors
the use of *accuracy*, which has many well-known deficiencies. This SLEP
changes the default scoring method.
Motivation
----------

As it stands, *accuracy* is the most used metric for classifiers in
scikit-learn. This is manifest in ``classifier.score(..)``, which applies
accuracy. While the original goal might have been to provide a score method
that works for all classifiers, the actual consequence has been blind usage of
the accuracy score, without critical thinking. This has misled many researchers
and users, because accuracy is well known for its severe deficiencies: it is
not a *strictly proper scoring rule*, and scikit-learn's implementation
hard-codes a probability threshold of 50% into it by relying on ``predict``.
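To make this concrete, the current default classifier ``score`` is equivalent
to accuracy on the hard ``predict`` output; the predicted probabilities never
enter the score:

.. code-block:: python

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    X, y = make_classification(random_state=0)
    clf = LogisticRegression().fit(X, y)

    # ``ClassifierMixin.score`` thresholds via ``predict`` and computes accuracy.
    assert clf.score(X, y) == accuracy_score(y, clf.predict(X))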
This situation calls for a correction. Ideally, scikit-learn provides good
defaults or fosters a conscious decision by users, e.g. by forcing engagement
with the subject; see [2]_, subsection "Which scoring function should I use?".
Solution
--------

The solution is a multi-step approach (a usage sketch follows the list):

1. Introduce the new keyword ``scoring`` to the ``score`` method. The default
   for classifiers is ``scoring="accuracy"``, for regressors ``scoring="r2"``.
2. Deprecate the default ``"accuracy"`` for classifiers.
3. After the release cycle, set a new default for classifiers:
   ``"d2_brier_score"``.
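As a sketch of the proposed API (the ``scoring`` keyword and the
``"d2_brier_score"`` name do not exist yet; they are what this SLEP proposes):

.. code-block:: python

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = make_classification(random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    clf = LogisticRegression().fit(X_train, y_train)

    # Steps 1-2: the bare call warns about the upcoming change of default;
    # passing ``scoring`` explicitly opts in and silences the warning.
    clf.score(X_test, y_test, scoring="accuracy")

    # Step 3 (after the deprecation period): the new default, equivalent to
    clf.score(X_test, y_test, scoring="d2_brier_score")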
There are two main questions with this approach:

a. The time frame of the deprecation period. Should it be longer than the
   usual 2 minor releases? Should steps 1 and 2 happen in the same minor
   release?
[Review thread on lines +49 to +50]

Member: I think that if we can't agree on a new default, we should not start
the process: agreeing on the new default is the hard part of this task, and
once we start we are on the hook for finishing the transition, which will be
impossible without agreement. Otherwise I agree with Thomas that we can do 1
and 2 at the same time.

Author: I included it as a proposal.
b. What is the new default scoring parameter in step 3?
   The fact that different scoring metrics focus on different things, i.e.
   ``predict`` vs. ``predict_proba``, and that not all classifiers provide
   ``predict_proba``, complicates a unified choice.
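For instance, some classifiers expose only hard predictions; the attribute
check below is ordinary Python introspection on real scikit-learn estimators:

.. code-block:: python

    from sklearn.linear_model import SGDClassifier
    from sklearn.svm import LinearSVC

    # Neither model exposes probability predictions out of the box.
    hasattr(LinearSVC(), "predict_proba")                   # False
    hasattr(SGDClassifier(loss="hinge"), "predict_proba")   # False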
[Review thread]

Member: Do we need to choose the same metric for all classifiers? I think the
answer is yes, because people will use the results of ``score`` to compare
models. (This is almost a rhetorical question, but I wanted to double-check my
thinking.)

Author: Given your assumption that users will continue to compare ``score``
results [...]. My currently best choice for a general classifier metric is the
skill score (R2) variant of the Brier score. Classifiers and regressors would
then have the same metric, which is nice.

Member: I'm not sure I am ready to remove ``score``.

Author: It is listed as an alternative, not as the proposal!
Possibilities are:

- the D2 Brier score, ``"d2_brier_score"``, which is basically the same as R2
  for regressors;
- the objective function of the estimator, i.e. the penalized log loss for
  ``LogisticRegression``.
Proposals:

a. Use a deprecation period of 4 instead of 2 minor releases, which amounts to
   2 years, and do steps 1 and 2 at the same time (in the same release).
   Reasoning: This deprecation is doable within the current deprecation habit
   of minor releases. It should be longer than the usual 2 minor releases
   because of its big impact. A major release just for such a deprecation is
   not very attractive (or marketable).
b. Use the D2 Brier score.
   Reasoning: Scores will be compared among different models, so a
   model-specific loss is not suitable.
   Note that the Brier score, and hence also the D2 Brier score, is a strictly
   proper scoring rule (or strictly consistent scoring function) for the
   probability predictions of ``predict_proba``. At the same time, the Brier
   score returns a valid score even for ``predict`` (in case a classifier has
   no ``predict_proba``), in contrast to the log loss (which returns infinity
   for false certainty). On top of that, classifiers and regressors would then
   have the same score (just under a different name), with values at most 1
   (negative values indicate a model worse than the constant baseline); see
   the sketch below.
[Review thread]

Member: For the multi-class case, will the Brier score be rescaled?
Specifically, do we set ``scale_by_half``?

Author: Yes, I would do that.

Author: On second thought, it does not matter. The R^2 version is invariant to
scaling: Brier(model) / Brier(mean of data) = MSE(model) / MSE(mean of data).

Member: Do you think it would be useful to include this information in the
proposal? (Less for voters to think about.)

Author: Good idea.
Note that the D2 Brier score, as a skill score (a score relative to a
baseline), is invariant under a multiplicative factor such as the one specified
by ``scale_by_half``: the factor cancels in the ratio
``Brier(model predictions) / Brier(mean of data)``, which equals
``MSE(model predictions) / MSE(mean of data)``.
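Since ``d2_brier_score`` is the proposed name, here is a minimal sketch of what
it would compute, built from the existing ``brier_score_loss`` (binary case;
the baseline is the observed class frequency, and any common rescaling of the
Brier score cancels in the ratio, illustrating the invariance noted above):

.. code-block:: python

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import brier_score_loss
    from sklearn.model_selection import train_test_split

    X, y = make_classification(random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    clf = LogisticRegression().fit(X_train, y_train)

    proba = clf.predict_proba(X_test)[:, 1]
    # Baseline: the constant prediction given by the observed class frequency.
    baseline = np.full_like(proba, y_train.mean())

    # Skill score: 1 - Brier(model) / Brier(baseline). A common factor such as
    # the one controlled by ``scale_by_half`` cancels in the ratio.
    d2_brier = 1.0 - (
        brier_score_loss(y_test, proba) / brier_score_loss(y_test, baseline)
    )
    print(d2_brier)  # at most 1; negative if worse than the baseline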
Backward compatibility
----------------------

The outlined solution would be feasible within the usual deprecation strategy
of scikit-learn releases.
Alternatives
------------

Removing
^^^^^^^^

An alternative is to remove the ``score`` method altogether. Scoring metrics
are readily available in scikit-learn, see the ``sklearn.metrics`` module and
[2]_. The advantages of removing ``score`` are:

- An active choice by the user is triggered, as there is no default anymore.
[Review thread]

Member: My assumption is that most people who blindly use ``score`` [...] It
doesn't stop us from extending our documentation and educational material to
increase the chances of people reading it (e.g. we could have a blog post about
this topic when the deprecation starts, to explain why this is a much bigger
deal than it might seem) and hopefully making a better decision than the
default argument to ``score``.

Author: Let's go through the options: [...] Quintessence: the situation can
only improve, never worsen.
- Defaults for ``score`` are tricky anyway. Different estimators estimate
  different things, and the output of their ``score`` methods is most likely
  not comparable, e.g. consider a hinge-loss based SVM vs. a log-loss based
  logistic regression.
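Under this alternative, users would call metrics explicitly instead of
``my_estimator.score(X, y)``; for instance, using only metrics that exist
today:

.. code-block:: python

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, log_loss
    from sklearn.model_selection import train_test_split

    X, y = make_classification(random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    clf = LogisticRegression().fit(X_train, y_train)

    # Explicit, conscious choices instead of an implicit default:
    acc = accuracy_score(y_test, clf.predict(X_test))
    ll = log_loss(y_test, clf.predict_proba(X_test))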
Disadvantages:

- Disruption of the API.
- Very likely a major release for something not very marketable.
- More imports required and somewhat longer code, compared to just
  ``my_estimator.score(X, y)``.
Keep status quo
^^^^^^^^^^^^^^^

Advantages:

- No changes and nothing breaks for users.
- No resources bound.

Disadvantages:

- No change for users.
- Bad practice is continued.
- Bad signal: the scikit-learn community is unable to rectify a serious
  grievance.
Discussion
----------

The following issues contain discussions on this subject:

- https://github.com/scikit-learn/scikit-learn/issues/28995
References and Footnotes
------------------------

.. [1] Each SLEP must either be explicitly labeled as placed in the public
   domain (see this SLEP as an example) or licensed under the `Open
   Publication License`_.

.. _Open Publication License: https://www.opencontent.org/openpub/

.. [2] Scikit-Learn User Guide on "Metrics and Scoring",
   https://scikit-learn.org/stable/modules/model_evaluation.html
Copyright
---------

This document has been placed in the public domain. [1]_

[Review comment]

Member: I feel like this needs a longer deprecation period than the usual 2
minor releases. Personally, I'd be okay with 3 minor releases or 1 major
release.