
Conversation

@cakedev0 (Owner) commented Sep 24, 2025

WIP: see TODO at the end

What does this implement? Explain your changes.

Categorical support for compatible criteria (gini, log_loss, squared_error, friedman_mse?, poisson?) using the Breiman shortcut (and only it).

My implementation choices are aimed at delivering initial support with the smallest possible set of changes (especially in the deep logic of criterion/splitter/partitioner). So:

  • no support with sparse X
  • no support for splitter="random"
  • support for a maximum of 64 categories: allows using a simple uint64 as a bitset, rather than a more complex structure (see the sketch right after this list).
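
A minimal sketch of the bitset idea, in pure Python with hypothetical helper names (the actual PR code is Cython; this is illustrative only):

import numpy as np

def make_bitset(categories_going_left):
    # Hypothetical helper: pack category codes (0..63) into a single
    # uint64, one bit per category.
    bitset = np.uint64(0)
    for cat in categories_going_left:
        bitset |= np.uint64(1) << np.uint64(cat)
    return bitset

def goes_left(bitset, category):
    # At predict time, membership is a single shift-and-mask.
    return bool((bitset >> np.uint64(category)) & np.uint64(1))

bitset = make_bitset([0, 3, 7])
assert goes_left(bitset, 3) and not goes_left(bitset, 5)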

The central idea is to replace the call to sort in DensePartitioner.sort_samples_and_feature_values with a call to _breiman_sort_categories:

        if self.n_categories <= 0:
            # not a categorical feature
            sort(&feature_values[self.start], &samples[self.start], self.end - self.start - n_missing)
        else:
            self._breiman_sort_categories(self.n_categories)

_breiman_sort_categories has complexity O(n + n_cat log n_cat), so it's really good (usually much better than the O(n log n) of the sort). But if n_cat >> n, it can be a performance issue; for now this isn't addressed. See the benchmarks below for more details.
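
For intuition, here is a rough NumPy sketch of what a Breiman ordering computes (the PR implements this in Cython inside the partitioner; the function name and signature here are illustrative, not the actual API):

import numpy as np

def breiman_order(x_cat, y, n_categories):
    # Per-category sums and counts in O(n).
    sums = np.bincount(x_cat, weights=y, minlength=n_categories)
    counts = np.bincount(x_cat, minlength=n_categories)
    # Empty categories get +inf so they sort last.
    means = np.divide(sums, counts, out=np.full(n_categories, np.inf),
                      where=counts > 0)
    # Sorting the categories costs O(n_cat log n_cat).
    return np.argsort(means)

For binary classification and squared-error regression, the optimal binary partition of the categories is guaranteed to be one of the n_cat - 1 prefix/suffix cuts along this ordering (Breiman et al., 1984), rather than an arbitrary one of the 2^(n_cat - 1) - 1 possible partitions. Samples can then be grouped by the rank of their category in O(n), which is how sorting the feature values themselves can be avoided (cf. the discussion below).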

Follow-up work to achieve real, widely-usable support for categorical features in trees/forests:

  • optimize (edit: probably not needed actually): relabel categories while traversing down the trees to ensure n_categories <= n_samples (and hence ensure that categorical features are always processed as fast as, or almost as fast as, continuous features)
  • support for ExtraTree* (random split in _splitter.pyx)
  • support for more than just contiguous integers in X; take inspiration from _preprocess_X in gradient_boosting.py, for instance
  • support for forests (and iforest?)
    • add some notes in the user-guide (like for GB)
  • Auxiliary: add support in plot_tree for categorical splits (and missing values)

Note: support for criterion="absolute_error" is computationally intractable, I think. I have counter-examples for the Breiman shortcut, and brute force is impractical because a median can't be computed from the medians of each category (contrary to the mean; see the example below), so the complexity of a single split search would be O(n_cat^2 * n_samples * log(n_samples)).
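
To illustrate why the median doesn't decompose over categories (while the mean is always the count-weighted average of the per-category means), a small self-contained example:

import numpy as np

# Two pairs of groups with identical per-group medians (5 and 6) and
# identical sizes, yet different union medians: the union median is not
# a function of the per-group medians and counts.
a1, b1 = np.array([5, 5, 5]), np.array([0, 6, 100])
a2, b2 = np.array([0, 5, 100]), np.array([6, 6, 6])
assert np.median(a1) == np.median(a2) == 5
assert np.median(b1) == np.median(b2) == 6
print(np.median(np.concatenate([a1, b1])))  # 5.0
print(np.median(np.concatenate([a2, b2])))  # 6.0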

Reference Issues/PRs

Highly inspired by scikit-learn#29437

But it differs slightly in the crux of the approach, allowing for simpler (and probably more efficient) code.

Benchmark

Looking good.

TODO: publish detailed benchmarks.

Still TODO for this PR:

  • Take some counter-measures against over-fitting? See comment [WIP] FEA: Add categorical support for DecisionTree* #2 (comment)
  • tests
  • docstring for new inner functions: WIP
  • docstring for public API
  • test different endianness (probably broken for now. Edit: actually wasn't broken)
  • clean up the intp_t * used for is_categorical? / investigate it and maybe open an issue on the Cython repo
  • clean up asserts on X (in _fit)
  • Add a warning (or an error?) in plot_tree for categorical splits (and maybe missing values too)
  • Assert for compatible criterion & n_classes=2
  • Support for missing values? The added complexity might be very small, so it's probably worth doing in this PR too. Edit: no, left for a follow-up PR.
  • Do some benchmarks with different depths: WIP => looking very good/fast
  • Think about how to make the current code a bit more adaptable to future work (while keeping it as simple as possible)
    • support for missing values: should be ok
    • relabelling optimization if needed: should be ok
    • support in random split: should be ok
    • support for sparse X: might require moving some code around, but shouldn't be too hard
    • support for more than 64 categories: unsure (and unsure whether this is really needed)
  • Update the commented BasePartitioner class
  • User guide about best split finding: add something about all 2^n_cat possible category partitions being candidate splits (but in practice only n_cat "relevant" partitions are evaluated, making it usually faster than for numerical features).
  • Publish some experiments about the over-fitting risk when there are many categories.

github-actions bot commented Sep 24, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 4a59a7f.

@cakedev0 requested a review from Copilot, September 25, 2025 22:22
Copilot AI left a comment

Pull Request Overview

This PR implements categorical feature support for decision tree classifiers and regressors. The main goal is to handle categorical features using the Breiman shortcut, which makes splitting on a categorical feature computationally efficient by ordering its categories by their mean target value.

Key changes:

  • Added categorical_features parameter to tree classes to specify which features should be treated as categorical
  • Implemented Breiman shortcut algorithm for efficient categorical splits
  • Modified tree building and prediction logic to handle both numerical and categorical splits
  • Added comprehensive tests for categorical functionality

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 7 comments.

Summary per file:

  • sklearn/tree/_classes.py: Added categorical_features parameter and validation logic
  • sklearn/tree/_tree.pyx: Modified Tree class to support categorical splits with bitsets
  • sklearn/tree/_splitter.pyx: Updated splitters to handle categorical features
  • sklearn/tree/_partitioner.pyx: Implemented Breiman shortcut for categorical sorting
  • sklearn/tree/_utils.pxd: Added SplitValue union and updated Node struct
  • sklearn/tree/tests/test_tree.py: Added tests for categorical features and endianness
  • sklearn/tree/tests/test_split.py: Added comprehensive split optimality tests


node.right_child = _TREE_LEAF
node.feature = _TREE_UNDEFINED
node.threshold = _TREE_UNDEFINED
# node.categorical_bitset = _TREE_UNDEFINED
Copilot AI commented Sep 25, 2025

These commented lines should either be removed or uncommented with proper implementation. Leaving commented code creates confusion about intended behavior.

Suggested change
# node.categorical_bitset = _TREE_UNDEFINED

node.right_child = _TREE_LEAF
node.feature = _TREE_UNDEFINED
node.threshold = _TREE_UNDEFINED
# node.categorical_bitset = _TREE_UNDEFINED
Copilot AI commented Sep 25, 2025

These commented lines should either be removed or uncommented with proper implementation. Leaving commented code creates confusion about intended behavior.

Suggested change
# node.categorical_bitset = _TREE_UNDEFINED
node.categorical_bitset = _TREE_UNDEFINED


if is_target_feature:
# In this case, we push left or right child on stack
# TODO: handle categorical (and missing?)
Copilot AI commented Sep 25, 2025

This TODO comment indicates incomplete implementation for categorical features in this code path. Either implement the handling or create a proper issue to track this work.

@cakedev0 (Owner, Author) commented Sep 28, 2025

Small note about how trees can overfit with categorical features when you don't have many samples per category:

from sklearn.tree import DecisionTreeRegressor
import numpy as np

importances = []
for _ in range(1000):
    n = 1000
    X = np.random.rand(n, 2)
    # Feature 1 is pure noise: 64 categories, ~16 samples per category.
    X[:, 1] = np.random.randint(64, size=n)
    # The target depends (weakly) on feature 0 only.
    y = X[:, 0] * 0.2 + np.random.rand(n)

    reg = DecisionTreeRegressor(
        max_depth=1,
        categorical_features=[1]
    ).fit(X, y)
    importances.append(reg.feature_importances_)
np.sum(importances, axis=0)
# array([432., 568.])
# while it's array([1000., 0.]) if you remove the categorical feature

EDIT: While reading the code from HistGradientBoosting, I saw this:

            # Reduces the effect of noises in categorical features,
            # especially for categories with few data. Called cat_smooth in
            # LightGBM. TODO: Make this user adjustable?
            Y_DTYPE_C MIN_CAT_SUPPORT = 10.

And another comment: "we exclude categories that don't respect MIN_CAT_SUPPORT from this sorted array".
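
A possible counter-measure, sketched loosely after what HistGradientBoosting does (illustrative only, not something this PR implements yet; it extends the hypothetical breiman_order sketch from the description above):

import numpy as np

MIN_CAT_SUPPORT = 10  # same default as HistGradientBoosting

def breiman_order_with_support(x_cat, y, n_categories):
    # Per-category statistics in O(n).
    sums = np.bincount(x_cat, weights=y, minlength=n_categories)
    counts = np.bincount(x_cat, minlength=n_categories)
    # Exclude rare categories from the candidate orderings; they could,
    # e.g., be routed with the majority side at predict time.
    supported = np.flatnonzero(counts >= MIN_CAT_SUPPORT)
    means = sums[supported] / counts[supported]
    return supported[np.argsort(means)]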

@adam2392 commented Oct 8, 2025

Apologies for the delay. I'm getting around to cleaning out my notifications.

How far along are you in this PR? I had intended to finish my original PR. Sklearn has a (perhaps unspoken) policy of pinging the existing PR to see if the original author intends to finish it before superseding it.

I know I was a bit lagging, and the PR is somewhat buried, so I understand the lapse :p

With that said, I don't want to hinder work. If the status is similar to what I had, then I'm down to finish it and can probably do so in the next few months. Otherwise, I'm also open to collaborating on finishing it together.

@cakedev0 (Owner, Author) commented Oct 9, 2025

Don't worry, I'm probably the cause of many notifications ^^

Sorry I didn't ping you on your PR. I thought it was more or less abandoned, plus I wasn't sure what I would be able to produce. I had some ideas that seemed promising, but I wasn't sure they would work as expected, so I had to give it a shot first.

In the end, I'm fairly happy with how it came together, and I think this PR has a few benefits compared to yours:

  • the scope/ambition is a bit smaller, but still functional, so it's simpler to review
  • my implementation of the Breiman shortcut avoids sorting the feature values, so it's likely faster (though I haven't verified this)
  • I barely change the flow of the current implementation, just calling _breiman_sort_categories instead of sort in one place => also simpler to review.

That being said, all the things I did differently from you, I thought about while reading your PR. And there are plenty of things I more or less copy-pasted first, then modified a bit. So it would be more than fair to consider you a co-author of this PR.

How far along are you in this PR?

Everything hard is done, unless the review reveals some big gaps (which can happen, in which case we might want to abandon this PR and continue yours, if the gap is only in this one). Testing is mostly done too. What remains is in the TODO list in the PR description. The hardest item is probably the first one, about over-fitting.

I can finish everything in a few days if we decide to move forward with this PR. But until then, I'll pause here. There are also a few other PRs I'd like to see merged before re-focusing on this one.

Otherwise, I'm also open to collaborating on finishing it together.

I would love that! See the section "Follow-up work to achieve real, widely-usable support for categorical features in trees/forests" in the description above: after this PR there is still plenty of work to do!
