[WIP] FEA: Add categorical support for DecisionTree* #2
base: tree-simpler-missing
Conversation
… trigger the categorical split path and test it
…l tests pass; TODO: test with categories
…es work in notebook: logic seems to be correct
…ealed by the newly added tests
Pull Request Overview
This PR implements categorical feature support for decision tree classifiers and regressors. The main goal is to add the ability to handle categorical features using the Breiman shortcut algorithm, which is computationally efficient for splitting on categorical features by ordering categories based on their target values.
Key changes:
- Added `categorical_features` parameter to tree classes to specify which features should be treated as categorical
- Implemented the Breiman shortcut algorithm for efficient categorical splits (see the sketch after this list)
- Modified tree building and prediction logic to handle both numerical and categorical splits
- Added comprehensive tests for categorical functionality
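For intuition, here is a minimal standalone sketch of the Breiman shortcut (a hypothetical NumPy illustration, not the PR's Cython code; the function name and signature are made up). For squared-error regression, ordering categories by their mean target value guarantees that the optimal binary partition of categories is a prefix of that ordering, so only `n_cat - 1` candidate splits need scoring instead of `2^(n_cat - 1) - 1` subsets:

```python
import numpy as np

def breiman_best_split(cats, y):
    """Find the best left-category set for a squared-error split.

    Hypothetical illustration: sort categories by mean target, then
    evaluate only the n_cat - 1 "prefix" partitions of that ordering.
    """
    uniq = np.unique(cats)
    means = np.array([y[cats == c].mean() for c in uniq])
    order = uniq[np.argsort(means)]  # categories sorted by mean target

    best_sse, best_k = np.inf, None
    for k in range(1, len(order)):
        left = np.isin(cats, order[:k])           # samples going left
        sse = (y[left].var() * left.sum()         # within-child sums of
               + y[~left].var() * (~left).sum())  # squared errors
        if sse < best_sse:
            best_sse, best_k = sse, k
    return set(order[:best_k])
```

The PR's actual implementation works on the splitter's internal sample buffers rather than boolean masks, but the ordering idea is the same.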
Reviewed Changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| sklearn/tree/_classes.py | Added categorical_features parameter and validation logic |
| sklearn/tree/_tree.pyx | Modified Tree class to support categorical splits with bitsets |
| sklearn/tree/_splitter.pyx | Updated splitters to handle categorical features |
| sklearn/tree/_partitioner.pyx | Implemented Breiman shortcut for categorical sorting |
| sklearn/tree/_utils.pxd | Added SplitValue union and updated Node struct |
| sklearn/tree/tests/test_tree.py | Added tests for categorical features and endianness |
| sklearn/tree/tests/test_split.py | Added comprehensive split optimality tests |
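One note on the bitset mechanics mentioned for `sklearn/tree/_tree.pyx`: a categorical split can be encoded as a bitmask in which bit `c` indicates whether category `c` is routed to the left child. A tiny hypothetical illustration (the names below are assumptions, not the PR's actual struct fields):

```python
# Hypothetical illustration of a categorical split stored as a bitset:
# bit c of the mask is 1 iff category c goes to the left child.
def goes_left(category: int, bitset: int) -> bool:
    return (bitset >> category) & 1 == 1

# Example: send categories {0, 3, 5} left, everything else right.
bitset = (1 << 0) | (1 << 3) | (1 << 5)
assert goes_left(3, bitset)
assert not goes_left(2, bitset)
```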
```cython
node.right_child = _TREE_LEAF
node.feature = _TREE_UNDEFINED
node.threshold = _TREE_UNDEFINED
# node.categorical_bitset = _TREE_UNDEFINED
```

Copilot AI · Sep 25, 2025

These commented lines should either be removed or uncommented with proper implementation. Leaving commented code creates confusion about intended behavior.
```cython
node.right_child = _TREE_LEAF
node.feature = _TREE_UNDEFINED
node.threshold = _TREE_UNDEFINED
# node.categorical_bitset = _TREE_UNDEFINED
```

Copilot AI · Sep 25, 2025

These commented lines should either be removed or uncommented with proper implementation. Leaving commented code creates confusion about intended behavior.

Suggested change:

```diff
- # node.categorical_bitset = _TREE_UNDEFINED
+ node.categorical_bitset = _TREE_UNDEFINED
```
```cython
if is_target_feature:
    # In this case, we push left or right child on stack
    # TODO: handle categorical (and missing?)
```

Copilot AI · Sep 25, 2025

This TODO comment indicates incomplete implementation for categorical features in this code path. Either implement the handling or create a proper issue to track this work.
Small note about how trees can overfit with categorical features when you don't have many samples per category:

```python
from sklearn.tree import DecisionTreeRegressor
import numpy as np

importances = []
for _ in range(1000):
    n = 1000
    X = np.random.rand(n, 2)
    X[:, 1] = np.random.randint(64, size=n)
    y = X[:, 0] * 0.2 + np.random.rand(n)
    reg = DecisionTreeRegressor(
        max_depth=1,
        categorical_features=[1],
    ).fit(X, y)
    importances.append(reg.feature_importances_)

np.sum(importances, axis=0)
# array([432., 568.])
# while it's array([1000., 0.]) if you remove the categorical feature
```

EDIT: While reading the code from HistGradientBoosting, I saw this:

```cython
# Reduces the effect of noises in categorical features,
# especially for categories with few data. Called cat_smooth in
# LightGBM. TODO: Make this user adjustable?
Y_DTYPE_C MIN_CAT_SUPPORT = 10.
```

And another comment: "we exclude categories that don't respect MIN_CAT_SUPPORT from this sorted array".
Apologies for the delay; I'm getting around to cleaning out my notifications. How far along are you in this PR? I had intended to finish my original PR. Sklearn has a (perhaps unspoken) policy of pinging the existing PR to see if the original author intends to finish it before overwriting it. I know I was a bit lagging, and the PR is somewhat buried, so I understand the lapse :p. With that said, I don't want to hinder work. If the status is similar to what I had, then I'm down to finish it and can probably do so in the next few months. Otherwise, I'm also open to collaborating on finishing it together.
Don't worry, I'm probably the cause of many notifications ^^ Sorry I didn't ping you on your PR. I thought it was more or less abandoned, plus I wasn't sure what I would be able to produce. I had some ideas that seemed promising, but I wasn't sure they would work as expected, so I had to give it a shot first. In the end, I'm fairly happy with how it came together, and I think this PR has a few benefits compared to yours:

That being said, all the things I did differently from you, I thought about while reading your PR. And there are plenty of things I more or less copy-pasted first, then modified a bit. So it would be more than fair to consider you a co-author of this PR.

Everything hard is done, unless the review reveals some big gaps (which can happen, in which case we might want to abandon this PR and continue yours, if the gap is only in this one). Testing is mostly done too. What remains is in the TODO in the PR description. The hardest one is probably the first, about over-fitting. I can finish everything in a few days if we decide to move forward with this PR. But until then, I'll pause here. There are also a few PRs I'd like to see merged before re-focusing on this one.

I would love that! See the section "Follow-up work to achieve a real widely-usable support of categorical features in trees/forests" in the description above: after this PR there is still plenty of work to do!
WIP: see TODO at the end
What does this implement? Explain your changes.
Categorical support for compatible criteria (gini, log_loss, squared_error, friedman_mse?, poisson?) using the Breiman shortcut (and only it).
My implementation choices are aimed at delivering first-pass support with the smallest possible set of changes (esp. changes in the deep logic of criterion/splitter/partitioner). So:
The central idea is to replace the call to `sort` in `DensePartitioner.sort_samples_and_feature_values` by a call to `_breiman_sort_categories`. `_breiman_sort_categories` has complexity O(n + n_cat log n_cat), so it's really good (usually much better than the O(n log n) of the sort). But if n_cat >> n, it can be a performance issue; for now this isn't addressed. See benchmarks below for more details.
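To make that concrete, here is a rough NumPy sketch of such a reordering (assumed names and pure-Python form; the actual PR works inside the Cython partitioner's sample buffers):

```python
import numpy as np

def breiman_sort_categories(cats, y):
    """Return a sample ordering that groups categories by mean target.

    O(n) accumulation + O(n_cat log n_cat) category sort. (The final
    argsort below is O(n log n) for brevity; a counting sort by rank
    would keep the whole routine at O(n + n_cat log n_cat).)
    """
    n_cat = cats.max() + 1
    sums = np.bincount(cats, weights=y, minlength=n_cat)  # O(n)
    counts = np.bincount(cats, minlength=n_cat)           # O(n)
    means = np.divide(sums, counts,
                      out=np.full(n_cat, np.inf), where=counts > 0)
    rank = np.empty(n_cat, dtype=np.intp)
    rank[np.argsort(means)] = np.arange(n_cat)  # O(n_cat log n_cat)
    return np.argsort(rank[cats], kind="stable")
```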
Follow-up work to achieve a real widely-usable support of categorical features in trees/forests:

- `ExtraTree*` (random split in `_splitter.pyx`)
- `_preprocess_X` in `gradient_boosting.py`, for instance
- `plot_tree` for categorical splits (and missing values)

Note: support for `criterion="absolute_error"` is computationally intractable, I think. I have counter-examples for the Breiman shortcut, and brute force is impossible because the median can't be computed from the medians of each category (contrary to the mean), hence the complexity of one split would be in O(n_cat^2 * n_samples * log(n_samples)).
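A quick numeric illustration (made-up data) of why the mean decomposes over categories while the median does not, which is what blocks a cheap brute force for `absolute_error`:

```python
import numpy as np

# Two versions of category A with the same median (1) and count (3):
a1 = np.array([1, 1, 100])
a2 = np.array([0, 1, 2])
b = np.array([2, 3, 4])  # category B: median 3, count 3

# The union's median differs even though per-category medians and
# counts are identical, so it can't be computed from them alone:
print(np.median(np.r_[a1, b]))  # 2.5
print(np.median(np.r_[a2, b]))  # 2.0

# The mean, by contrast, is always the count-weighted mean of means:
assert np.isclose(np.r_[a2, b].mean(), (a2.mean() * 3 + b.mean() * 3) / 6)
```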
Reference Issues/PRs

Highly inspired by scikit-learn#29437, but differs slightly in the crux part, allowing for simpler (and probably more efficient) code.
Benchmark
Looking good.
TODO: publish detailed benchmarks.
Still TODO for this PR:

- Over-fitting on categories with few samples (see the MIN_CAT_SUPPORT discussion above)
- `intp_t *` for `is_categorical`? / Investigate it to maybe create an issue on the Cython repo
- … (`_fit`)
- `plot_tree` for categorical splits (and maybe missings too)
intp_t *foris_categorical? / Investigate it to maybe create an issue on the Cython repo_fit)plot_treefor categorical splits (and maybe missings too)