
Conversation

@cakedev0 (Owner) commented Sep 24, 2025

WIP: see TODO at the end

What does this implement? Explain your changes.

Categorical support for compatible criteria (gini, log_loss, squared_error, friedman_mse?, poisson?) using the Breiman shortcut (and only it).

My implementation choices are aimed at delivering initial support with the smallest possible set of changes (especially in the deep logic of criterion/splitter/partitioner). So:

  • no support with sparse X
  • no support for splitter="random"
  • support for a maximum of 64 categories: allows using a simple uint64 as a bitset, rather than a more complex structure (see the sketch right after this list).
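
A minimal sketch of the bitset idea, in pure Python with hypothetical helper names (the actual PR code is Cython; this is illustrative only):

import numpy as np

def make_bitset(categories_going_left):
    # Hypothetical helper: pack category codes (0..63) into a single
    # uint64, one bit per category.
    bitset = np.uint64(0)
    for cat in categories_going_left:
        bitset |= np.uint64(1) << np.uint64(cat)
    return bitset

def goes_left(bitset, category):
    # At predict time, membership is a single shift-and-mask.
    return bool((bitset >> np.uint64(category)) & np.uint64(1))

bitset = make_bitset([0, 3, 7])
assert goes_left(bitset, 3) and not goes_left(bitset, 5)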

The central idea is to replace the call to sort in DensePartitioner.sort_samples_and_feature_values with a call to _breiman_sort_categories:

        if self.n_categories <= 0:
            # not a categorical feature
            sort(&feature_values[self.start], &samples[self.start], self.end - self.start - n_missing)
        else:
            self._breiman_sort_categories(self.n_categories)

_breiman_sort_categories has complexity O(n + n_cat log n_cat), so it's really good (usually much better than the O(n log n) of the sort). But if n_cat >> n, it can be a performance issue; for now this isn't addressed. See the benchmarks below for more details.
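
For intuition, here is a rough NumPy sketch of what a Breiman ordering computes (the PR implements this in Cython inside the partitioner; the function name and signature here are illustrative, not the actual API):

import numpy as np

def breiman_order(x_cat, y, n_categories):
    # Per-category sums and counts in O(n).
    sums = np.bincount(x_cat, weights=y, minlength=n_categories)
    counts = np.bincount(x_cat, minlength=n_categories)
    # Empty categories get +inf so they sort last.
    means = np.divide(sums, counts, out=np.full(n_categories, np.inf),
                      where=counts > 0)
    # Sorting the categories costs O(n_cat log n_cat).
    return np.argsort(means)

For binary classification and squared-error regression, the optimal binary partition of the categories is guaranteed to be one of the n_cat - 1 prefix/suffix cuts along this ordering (Breiman et al., 1984), rather than an arbitrary one of the 2^(n_cat - 1) - 1 possible partitions. Samples can then be grouped by the rank of their category in O(n), which is how sorting the feature values themselves can be avoided (cf. the discussion below).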

Follow-up work to achieve real, widely-usable support for categorical features in trees/forests:

  • optimize (edit: probably not needed actually): relabel categories while traversing down the trees to ensure n_categories <= n_samples (and hence ensure that categorical features are always processed as fast as, or almost as fast as, continuous features)
  • support for ExtraTree* (random split in _splitter.pyx)
  • support for more than just contiguous integers in X; take inspiration from _preprocess_X in gradient_boosting.py, for instance
  • support for forests (and iforest?)
    • add some notes in the user-guide (like for GB)
  • Auxiliary: add support in plot_tree for categorical splits (and missing values)

Note: support for criterion="absolute_error" is computationally intractable, I think. I have counter-examples for the Breiman shortcut, and brute force is impractical because a median can't be computed from the medians of each category (contrary to the mean; see the example below), so the complexity of a single split search would be O(n_cat^2 * n_samples * log(n_samples)).
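
To illustrate why the median doesn't decompose over categories (while the mean is always the count-weighted average of the per-category means), a small self-contained example:

import numpy as np

# Two pairs of groups with identical per-group medians (5 and 6) and
# identical sizes, yet different union medians: the union median is not
# a function of the per-group medians and counts.
a1, b1 = np.array([5, 5, 5]), np.array([0, 6, 100])
a2, b2 = np.array([0, 5, 100]), np.array([6, 6, 6])
assert np.median(a1) == np.median(a2) == 5
assert np.median(b1) == np.median(b2) == 6
print(np.median(np.concatenate([a1, b1])))  # 5.0
print(np.median(np.concatenate([a2, b2])))  # 6.0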

Reference Issues/PRs

Highly inspired by scikit-learn#29437

But it differs slightly in the crux of the approach, allowing for simpler (and probably more efficient) code.

Benchmark

Looking good.

TODO: publish detailed benchmarks.

Still TODO for this PR:

  • Take some counter-measures against over-fitting? See comment [WIP] FEA: Add categorical support for DecisionTree* #2 (comment)
  • tests
  • docstring for new inner functions: WIP
  • docstring for public API
  • test different endianness (probably broken for now. Edit: actually wasn't broken)
  • clean up the intp_t * used for is_categorical? / investigate it and maybe open an issue on the Cython repo
  • clean up asserts on X (in _fit)
  • Add a warning (or an error?) in plot_tree for categorical splits (and maybe missing values too)
  • Assert for compatible criterion & n_classes=2
  • Support for missing values? The added complexity might be very small, so it's probably worth doing in this PR too. Edit: no, left for a follow-up PR.
  • Do some benchmarks with different depths: WIP => looking very good/fast
  • Think about how to make the current code a bit more adaptable to future work (while keeping it as simple as possible)
    • support for missing values: should be ok
    • relabelling optimization if needed: should be ok
    • support in random split: should be ok
    • support for sparse X: might require moving some code around, but shouldn't be too hard
    • support for more than 64 categories: unsure (and unsure whether this is really needed)
  • Update the commented BasePartitioner class
  • User guide about best split finding: add something about all 2^n_cat possible category partitions being candidate splits (but in practice only n_cat "relevant" partitions are evaluated, making it usually faster than for numerical features).
  • Publish some experiments about the over-fitting risk when there are many categories.

github-actions bot commented Sep 24, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 4a59a7f.

@cakedev0 requested a review from Copilot, September 25, 2025 22:22
Copilot AI left a comment

Pull Request Overview

This PR implements categorical feature support for decision tree classifiers and regressors. The main goal is to handle categorical features using the Breiman shortcut, which makes splitting on a categorical feature computationally efficient by ordering its categories by their mean target value.

Key changes:

  • Added categorical_features parameter to tree classes to specify which features should be treated as categorical
  • Implemented Breiman shortcut algorithm for efficient categorical splits
  • Modified tree building and prediction logic to handle both numerical and categorical splits
  • Added comprehensive tests for categorical functionality

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 7 comments.

Summary per file:

  • sklearn/tree/_classes.py: Added categorical_features parameter and validation logic
  • sklearn/tree/_tree.pyx: Modified Tree class to support categorical splits with bitsets
  • sklearn/tree/_splitter.pyx: Updated splitters to handle categorical features
  • sklearn/tree/_partitioner.pyx: Implemented Breiman shortcut for categorical sorting
  • sklearn/tree/_utils.pxd: Added SplitValue union and updated Node struct
  • sklearn/tree/tests/test_tree.py: Added tests for categorical features and endianness
  • sklearn/tree/tests/test_split.py: Added comprehensive split optimality tests


node.right_child = _TREE_LEAF
node.feature = _TREE_UNDEFINED
node.threshold = _TREE_UNDEFINED
# node.categorical_bitset = _TREE_UNDEFINED
Copilot AI commented Sep 25, 2025

These commented lines should either be removed or uncommented with proper implementation. Leaving commented code creates confusion about intended behavior.

Suggested change
# node.categorical_bitset = _TREE_UNDEFINED

node.right_child = _TREE_LEAF
node.feature = _TREE_UNDEFINED
node.threshold = _TREE_UNDEFINED
# node.categorical_bitset = _TREE_UNDEFINED
Copilot AI commented Sep 25, 2025

These commented lines should either be removed or uncommented with proper implementation. Leaving commented code creates confusion about intended behavior.

Suggested change
# node.categorical_bitset = _TREE_UNDEFINED
node.categorical_bitset = _TREE_UNDEFINED


if is_target_feature:
# In this case, we push left or right child on stack
# TODO: handle categorical (and missing?)
Copilot AI commented Sep 25, 2025

This TODO comment indicates incomplete implementation for categorical features in this code path. Either implement the handling or create a proper issue to track this work.

@cakedev0 (Owner, Author) commented Sep 28, 2025

Small note about how trees can overfit with categorical features when you don't have many samples per category:

from sklearn.tree import DecisionTreeRegressor
import numpy as np

importances = []
for _ in range(1000):
    n = 1000
    X = np.random.rand(n, 2)
    # Feature 1 is pure noise: 64 categories, ~16 samples per category.
    X[:, 1] = np.random.randint(64, size=n)
    # The target depends (weakly) on feature 0 only.
    y = X[:, 0] * 0.2 + np.random.rand(n)

    reg = DecisionTreeRegressor(
        max_depth=1,
        categorical_features=[1]
    ).fit(X, y)
    importances.append(reg.feature_importances_)
np.sum(importances, axis=0)
# array([432., 568.])
# while it's array([1000., 0.]) if you remove the categorical feature

EDIT: While reading the code from HistGradientBoosting, I saw this:

            # Reduces the effect of noises in categorical features,
            # especially for categories with few data. Called cat_smooth in
            # LightGBM. TODO: Make this user adjustable?
            Y_DTYPE_C MIN_CAT_SUPPORT = 10.

And another comment: "we exclude categories that don't respect MIN_CAT_SUPPORT from this sorted array".
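
A possible counter-measure, sketched loosely after what HistGradientBoosting does (illustrative only, not something this PR implements yet; it extends the hypothetical breiman_order sketch from the description above):

import numpy as np

MIN_CAT_SUPPORT = 10  # same default as HistGradientBoosting

def breiman_order_with_support(x_cat, y, n_categories):
    # Per-category statistics in O(n).
    sums = np.bincount(x_cat, weights=y, minlength=n_categories)
    counts = np.bincount(x_cat, minlength=n_categories)
    # Exclude rare categories from the candidate orderings; they could,
    # e.g., be routed with the majority side at predict time.
    supported = np.flatnonzero(counts >= MIN_CAT_SUPPORT)
    means = sums[supported] / counts[supported]
    return supported[np.argsort(means)]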

@adam2392 commented Oct 8, 2025

Apologies for the delay. I'm getting around to cleaning out my notifications.

How far along are you in this PR? I had intended to finish my original PR. Sklearn has a (perhaps unspoken) policy of pinging the existing PR to see if the original author intends to finish it before superseding it.

I know I was a bit lagging, and the PR is somewhat buried, so I understand the lapse :p

With that said, I don't want to hinder work. If the status is similar to what I had, then I'm down to finish it and can probably do so in the next few months. Otherwise, I'm also open to collaborating on finishing it together.

@cakedev0 (Owner, Author) commented Oct 9, 2025

Don't worry, I'm probably the cause of many notifications ^^

Sorry I didn't ping you on your PR. I thought it was more or less abandoned, plus I wasn't sure what I would be able to produce. I had some ideas that seemed promising, but I wasn't sure they would work as expected, so I had to give it a shot first.

In the end, I'm fairly happy with how it came together, and I think this PR has a few benefits compared to yours:

  • the scope/ambition is a bit smaller, but still functional, so it's simpler to review
  • my implementation of the Breiman shortcut avoids sorting the feature values, so it's likely faster (though I haven't verified this)
  • I barely change the flow of the current implementation, just calling _breiman_sort_categories instead of sort in one place => also simpler to review.

That being said, all the things I did differently from you, I thought about while reading your PR. And there are plenty of things I more or less copy-pasted first, then modified a bit. So it would be more than fair to consider you a co-author of this PR.

How far along are you in this PR?

Everything hard is done, unless the review reveals some big gaps (which can happen, in which case we might want to abandon this PR and continue yours, if the gap is only in this one). Testing is mostly done too. What remains is in the TODO list in the PR description. The hardest item is probably the first one, about over-fitting.

I can finish everything in a few days if we decide to move forward with this PR. But until then, I'll pause here. There are also a few other PRs I'd like to see merged before re-focusing on this one.

Otherwise, I'm also open to collaborating on finishing it together.

I would love that! See the section "Follow-up work to achieve real, widely-usable support for categorical features in trees/forests" in the description above: after this PR there is still plenty of work to do!
