index.rst (1 change: 1 addition, 0 deletions)
@@ -23,6 +23,7 @@
:caption: Under review

slep017/proposal
slep025/proposal

.. toctree::
:maxdepth: 1
slep025/proposal.rst (147 changes: 147 additions, 0 deletions)
@@ -0,0 +1,147 @@
.. _slep_025:

==============================================
SLEP025: Losing Accuracy in Scikit-Learn Score
==============================================

:Author: Christian Lorentzen
:Status: Draft
:Type: Standards Track
:Created: 2025-12-07
:Resolution: TODO <url> (required for Accepted | Rejected | Withdrawn)

Abstract
--------

This SLEP proposes to rectify the default ``score`` method of scikit-learn
classifiers. Currently, the convenience of ``classifier.score(X, y)`` favors the use
of *accuracy*, which has well-known deficiencies. This SLEP therefore changes the
default scoring metric.

Motivation
----------

As it stands, *accuracy* is the most used metric for classifiers in scikit-learn. This
is manifested in ``classifier.score(..)``, which applies accuracy. While the original
goal might have been to provide a score method that works for all classifiers, the
practical consequence has been blind use of the accuracy score, without critical
thinking. This has misled many researchers and users, because accuracy is well known
for its severe deficiencies: most importantly, it is not a *strictly proper scoring
rule*, and scikit-learn's implementation hard-codes a probability threshold of 50% by
relying on ``predict``.
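
The current behavior can be made explicit with a short snippet (a minimal
illustration; the dataset and estimator are arbitrary choices):

.. code-block:: python

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    X, y = make_classification(random_state=0)
    clf = LogisticRegression().fit(X, y)

    # ClassifierMixin.score is accuracy on the hard predictions of predict(),
    # i.e. a fixed 50% probability threshold for binary problems.
    assert clf.score(X, y) == accuracy_score(y, clf.predict(X))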

This situation calls for a correction. Ideally, scikit-learn provides good defaults
or fosters a conscious decision by users, e.g. by forcing engagement with the subject;
see the subsection "Which scoring function should I use?" in [2]_.

Solution
--------

The solution is a multi-step approach:

1. Introduce the new keyword ``scoring`` to the ``score`` method. The default for
classifiers is ``scoring="accuracy"``, for regressors ``scoring="r2"``.
2. Deprecate the default ``"accuracy"`` for classifiers.
3. After the deprecation period, set a new default for classifiers:
``"d2_brier_score"`` (see the sketch below).
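
As a sketch of the intended usage after step 3 (the ``scoring`` keyword of ``score``
and the ``"d2_brier_score"`` scorer name are part of this proposal and do not exist in
scikit-learn yet):

.. code-block:: python

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(random_state=0)
    clf = LogisticRegression().fit(X, y)

    # Hypothetical calls once the proposal is implemented:
    clf.score(X, y, scoring="accuracy")        # explicit opt-in to the old default
    clf.score(X, y, scoring="d2_brier_score")  # the proposed new default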

There are two main questions with this approach:

a. The time frame of the deprecation period. Should it be longer than the usual 2 minor
releases? Should step 1 and 2 happen in the same minor release?

Member: I feel like this needs longer than the usual 2 minor releases. Personally, I'll
be okay with 3 minor releases or 1 major release.

Member:

- I think we need to decide on a default by step 2, so we can tell users what to set to
  emulate the new default.
- I am okay with 1 and 2 happening at the same time as long as we choose a default. If
  we have not chosen a new default, then only 1 can happen.

Member: I think if we can't agree on a new default we should not start the process. I
think this because agreeing on the new default is the hard part of this task, and if we
start it we are on the hook for finishing the transition, which will be impossible
without agreement. Otherwise I agree with Thomas that we can do 1 and 2 at the same
time.

Member Author: I included it as a proposal.

b. What is the new default scoring parameter in step 3?
The fact that different scoring metrics consume different outputs, i.e. ``predict``
vs. ``predict_proba``, and that not all classifiers provide ``predict_proba``,
complicates a unified choice.

Member: Do we need to choose the same metric for all classifiers?

I think the answer is yes, because people will use the results of ``est1.score(X, y)``
and ``est2.score(X, y)`` to evaluate which one is the better estimator. It seems very
hard to educate people that they can't compare scores from different estimators.

(This is almost a rhetorical question, but I wanted to double-check my thinking.)

Member Author (@lorentzenchr, Dec 8, 2025): Given your assumption that users will
continue to compare score results of different estimators, and given that a generally
satisfying metric does not exist, the conclusion is to remove the ``score`` method.

My current best choice for a general classifier metric is the skill score (R2) variant
of the Brier score. Classifiers and regressors would then have the same metric, which
is nice.

Member: I'm not sure I am ready to remove ``score()``.

Member Author: It is listed as an alternative, not as a proposal!

Possibilities are:

- the D2 Brier score, ``"d2_brier_score"``, which is essentially the same as R2 for
regressors,
- the objective function of the estimator, e.g. the penalized log loss for
``LogisticRegression``.

Proposals:

a. Use a deprecation period of 4 instead of 2 minor releases, which amounts to 2 years,
and do steps 1 and 2 at the same time (in the same release).
Reasoning: It is a deprecation that is doable within the current deprecation practice
of minor releases. It should be longer than the usual 2 minor releases because of its
big impact.
A major release just because of such a deprecation is not very attractive (or
marketable).
b. Use the D2 Brier score.
Reasoning: Scores will be compared among different models. Therefore, a model-specific
loss is not suitable.
Note that the Brier score and hence also the D2 Brier score are strictly proper
scoring rules (or strictly consistent scoring functions) for the probability
predictions of ``predict_proba``. At the same time, the Brier score returns a valid
score even for ``predict`` (in case a classifier has no ``predict_proba``), in
contrast to the log loss (which returns infinity for false certainty). On top of that,
classifiers and regressors would have the same score (just under a different name),
with values of at most 1 (1 for a perfect model, 0 for the constant baseline).
Member (@thomasjpfan, Jan 16, 2026): For the multi-class case, will the
``brier_score_loss`` be normalized to be in the [0, 1] range? Specifically, do we set
``scale_by_half=True`` in ``brier_score_loss``?

Member Author: Yes, I would do that.

Member Author: On second thought: it does not matter. The R^2 version is invariant to
scaling: Brier(model) / Brier(mean of data) = MSE(model) / MSE(mean of data).

Member: Do you think it'll be useful to include this information in the proposal?
(Less for voters to think about)

Member Author: Good idea.

Note that the D2 Brier score, as a skill score (a score relative to a baseline), is
invariant under a multiplicative factor such as the one controlled by
``scale_by_half``: it equals one minus the ratio
``Brier(model predictions) / Brier(mean of data)``, and a common scale factor cancels
in this ratio.
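
As an illustration of the candidate default, the D2 Brier score can be computed by hand
as follows (a sketch only; the metric name ``"d2_brier_score"`` and its eventual
implementation in ``sklearn.metrics`` are part of this proposal, not existing API):

.. code-block:: python

    import numpy as np

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import LabelBinarizer

    X, y = make_classification(n_classes=3, n_informative=4, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X, y)

    # Brier score of the model: mean squared error of predict_proba against the
    # one-hot encoded labels.
    Y = LabelBinarizer().fit_transform(y)
    brier_model = np.mean((Y - clf.predict_proba(X)) ** 2)

    # Brier score of the constant baseline that always predicts the empirical
    # class frequencies ("mean of the data").
    brier_baseline = np.mean((Y - Y.mean(axis=0)) ** 2)

    # D2 Brier score: 1 for a perfect model, 0 for the baseline; any common
    # factor such as scale_by_half cancels in the ratio.
    d2_brier = 1.0 - brier_model / brier_baseline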

Backward compatibility
----------------------

The outlined solution would be feasible within the usual deprecation strategy of
scikit-learn releases.

Alternatives
------------

Removing
^^^^^^^^
An alternative is to remove the ``score`` method altogether. Scoring metrics are
readily available in scikit-learn, see the ``sklearn.metrics`` module and [2]_. The
advantages of removing ``score`` are:

- An active choice by the user is triggered, as there is no longer a default.
Member: My assumption is that most people who blindly use ``score()`` do not know
better. It is unclear to me if forcing them to make a decision is going to improve the
quality of the decision they make. scikit-learn is about "machine learning without the
learning curve", so we are on the hook for making a "not unreasonable" decision for the
beginner user.

It doesn't stop us from extending our documentation and educational material to
increase the chances of people reading it (e.g. we could have a blog post about this
topic when the deprecation starts to explain why this is a much bigger deal than it
might seem) and hopefully making a better decision than the default argument to
``score()``.

Member Author: Regarding "if forcing them to make a decision is going to improve the
quality of the decision they make": let's go through the options.

- A lucky user chooses a strictly consistent scoring function like log loss or Brier
  score: situation improved.
- An informed user chooses a metric close to a business/application metric they have:
  situation improved.
- A stubborn (or unlucky) user chooses accuracy or something similar (balanced
  accuracy, F2, you name it): situation is not worse.

Quintessence: The situation can only improve, but never worsens.

- Defaults for ``score`` are tricky anyway. Different estimators estimate different
things, and the output of their ``score`` method is most likely not comparable, e.g.
consider a hinge-loss-based SVM vs. a log-loss-based logistic regression.

Disadvantages:

- Disruption of the API.
- Very likely a major release for something not very marketable.
- More imports required and slightly more code compared to just
``my_estimator.score(X, y)``.
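
To make the last point concrete, this is roughly what the explicit workflow looks like
without ``score`` (the choice of log loss here is only illustrative):

.. code-block:: python

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import log_loss

    X, y = load_iris(return_X_y=True)
    clf = LogisticRegression(max_iter=1000).fit(X, y)

    # Instead of clf.score(X, y): pick a metric explicitly and feed it the
    # appropriate predictions.
    score = log_loss(y, clf.predict_proba(X))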

Keep status quo
^^^^^^^^^^^^^^^

Advantages:

- No change and no breaking of users' code
- No development resources bound

Disadvantages:

- No change for users
- Bad practice is continued
- Bad signal: the scikit-learn community is unable to rectify a serious grievance

Discussion
----------

The following issues contain discussions on this subject:

- https://github.com/scikit-learn/scikit-learn/issues/28995


References and Footnotes
------------------------

.. [1] Each SLEP must either be explicitly labeled as placed in the public
domain (see this SLEP as an example) or licensed under the `Open
Publication License`_.

.. _Open Publication License: https://www.opencontent.org/openpub/

.. [2] Scikit-Learn User Guide on "Metrics and Scoring"
https://scikit-learn.org/stable/modules/model_evaluation.html

Copyright
---------

This document has been placed in the public domain. [1]_