Leaves with no samples from original distribution #39

Open
arogozhnikov opened this issue Nov 11, 2016 · 0 comments
arogozhnikov commented Nov 11, 2016

This issue was observed and reported by Jack Wimberley.

If there is a region with very few original samples, a decision tree can build a leaf that contains only samples from the target distribution (more than min_samples_leaf of them) and exactly zero samples from the original distribution.

As a result, the 'corrections' made by such a tree do not affect the train weights (no original train sample falls into the leaf), but they blow up the weights of original samples that fall into that leaf on the test set.
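A toy illustration of the mechanism, as a minimal sketch. It assumes that a leaf multiplies the weights of original samples falling into it by the ratio of target weight to original weight in that leaf; the exact update rule is not stated in this issue, and `leaf_factor`, `eps` and the data are hypothetical.

```python
def leaf_factor(w_target, w_original, eps=1e-9):
    """Multiplicative correction applied to original samples in a leaf.

    Assumption: the leaf pushes original weights toward the
    target/original weight ratio; eps only avoids division by zero.
    """
    return (w_target + eps) / (w_original + eps)


def in_leaf(x):
    # Hypothetical leaf covering the sparsely populated region x >= 0.9.
    return x >= 0.9


train_original = [0.1, 0.2, 0.35, 0.5, 0.6, 0.7]    # none reach the leaf
train_target = [0.3, 0.65, 0.91, 0.93, 0.95, 0.97]  # four fall into the leaf
test_original = [0.15, 0.55, 0.92]                  # one falls into the leaf

# Unit weights: total target and original weight inside the leaf.
w_target = sum(1.0 for x in train_target if in_leaf(x))
w_original = sum(1.0 for x in train_original if in_leaf(x))
factor = leaf_factor(w_target, w_original)

# Train weights are untouched, so nothing looks wrong during training.
train_weights = [factor if in_leaf(x) else 1.0 for x in train_original]
# The single test sample inside the leaf receives an enormous weight.
test_weights = [factor if in_leaf(x) else 1.0 for x in test_original]
```

Because no train sample from the original distribution is in the leaf, nothing computed on the train set can reveal the problem; it only shows up when the reweighter is applied to new data.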

Workarounds

Almost any of the following works:

  • increase min_samples_leaf
  • set subsample=0.5
  • increase regularization (available in the develop version)

Any combination of the above also works well and resolves the problem in practice.
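To see why regularization helps, here is a small sketch. It assumes that regularization adds the same constant to the target and original weight of a leaf before the ratio is taken; this form, the function name `regularized_leaf_factor` and the numbers are assumptions for illustration, not the library's confirmed behaviour.

```python
def regularized_leaf_factor(w_target, w_original, regularization):
    """Leaf correction with the same constant added to both weights.

    Assumption: this is how regularization enters; it caps the
    correction when the leaf holds little or no original weight.
    """
    return (w_target + regularization) / (w_original + regularization)


# Leaf with 4 target samples and no original samples.
almost_unregularized = regularized_leaf_factor(4.0, 0.0, 1e-9)
regularized = regularized_leaf_factor(4.0, 0.0, 5.0)  # (4 + 5) / (0 + 5)

# A well-populated leaf is barely affected by the same regularization.
well_populated = regularized_leaf_factor(400.0, 200.0, 5.0)
```

Under the same assumption, a larger min_samples_leaf acts in a similar direction: bigger leaves are less likely to contain no original samples at all.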

Proper solution (not available now)

A good, correct solution would be to introduce a parameter 'minimal number of samples from the original distribution in a leaf', but this isn't supported by the decision trees of scikit-learn (or, as far as I know, of any other library).
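What such a constraint could look like, as a one-dimensional sketch only. This is not a patch for scikit-learn; `allowed_splits` and `min_original_leaf` are hypothetical names, and a real tree would apply the same check inside its split search for every feature.

```python
def allowed_splits(xs, is_original, min_samples_leaf, min_original_leaf):
    """Return thresholds whose two children satisfy both constraints.

    xs: feature values of all samples (original and target together).
    is_original: parallel list of bools, True for original samples.
    """
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    n_total = len(xs)
    n_original_total = sum(is_original)
    thresholds = []
    n_left = 0
    n_original_left = 0
    for pos in range(n_total - 1):
        i, j = order[pos], order[pos + 1]
        n_left += 1
        n_original_left += 1 if is_original[i] else 0
        if xs[i] == xs[j]:
            continue  # no threshold can separate equal values
        n_right = n_total - n_left
        n_original_right = n_original_total - n_original_left
        enough_samples = min(n_left, n_right) >= min_samples_leaf
        enough_original = (
            min(n_original_left, n_original_right) >= min_original_leaf
        )
        if enough_samples and enough_original:
            thresholds.append((xs[i] + xs[j]) / 2)
    return thresholds


original = [0.1, 0.2, 0.35, 0.5, 0.6, 0.7]
target = [0.3, 0.65, 0.91, 0.93, 0.95, 0.97]
xs = original + target
is_original = [True] * len(original) + [False] * len(target)

# Only the usual constraint: splits isolating the target-only tail pass.
unconstrained = allowed_splits(xs, is_original, 2, 0)
# With the extra constraint those splits are rejected.
constrained = allowed_splits(xs, is_original, 2, 2)
```

In this toy data the unconstrained search accepts thresholds above 0.8, which create a leaf with no original samples, while the constrained search keeps only thresholds that leave at least two original samples on each side.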
