
Negative sWeights #51

Open
marthaisabelhilton opened this issue May 29, 2018 · 8 comments
@marthaisabelhilton

Hi,

I am trying to use the BoostingToUniformity notebook, in particular the uBoost classifier. I am getting the error message 'the weights should be non-negative'. I have tried removing this from the source code and tried to run uBoost without this line. When I use the 'predict' function I get an array of all zeros and when I try to plot the ROC curve I get nans as the output. I am wondering if there is a way of dealing with negative weights?

Many thanks,

Martha

@arogozhnikov
Owner

arogozhnikov commented May 29, 2018

Hi Martha,
negative weights aren't friendly towards ML: they turn training into a non-convex, unbounded optimization problem, so you should not expect ML models to handle them correctly (sometimes they do, however).

@tlikhomanenko prepared an overview of strategies for dealing with negative weights some time ago, but the first thing to try is simply removing samples with negative weights from training (but not from testing, that's important).
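A minimal sketch of that first strategy, with hypothetical data and variable names (assuming scikit-learn; the original issue uses uBoost, but the filtering step is the same):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

# Toy stand-ins for sWeighted training data (hypothetical, not Martha's data).
X_train = rng.normal(size=(1000, 3))
y_train = rng.integers(0, 2, size=1000)
sweights = rng.normal(loc=0.9, scale=0.5, size=1000)  # a few are negative

# Train only on samples with non-negative weights...
mask = sweights >= 0
clf = GradientBoostingClassifier(n_estimators=50)
clf.fit(X_train[mask], y_train[mask], sample_weight=sweights[mask])

# ...but keep the test set (and its weights) unfiltered when evaluating.
proba = clf.predict_proba(X_train)[:, 1]
```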

@tlikhomanenko
Collaborator

Hi Martha,

Please have a look at this notebook, prepared for a summer school: https://github.com/yandexdataschool/mlhep2015/blob/master/day2/advanced_seminars/sPlot.ipynb. There is a section called "Training on sPlot data" where you will find several approaches to training a classifier on data with both negative and positive weights. I hope you'll find them useful.

@alexpearce
Contributor

For classifiers that only compute statistics on ensembles of events whilst fitting, like decision trees, I would hope that an implementation would accept negative weights, rather than doing `assert (weights < 0).sum() == 0`.

Where it should fail is when the sum of weights in the ensemble currently under study is negative.
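A sketch of the check being proposed here, as a hypothetical helper (not part of any existing library): fail only when the total weight of the ensemble is negative, not when any individual weight is.

```python
import numpy as np

def check_ensemble_weights(weights):
    """Hypothetical validation in the spirit suggested above: reject an
    ensemble only if its total weight is negative; individual negative
    weights are allowed."""
    weights = np.asarray(weights, dtype=float)
    if weights.sum() < 0:
        raise ValueError("sum of sample weights in this ensemble is negative")
    return weights

# Individual negative weights pass as long as the total is non-negative:
check_ensemble_weights([-0.3, 1.0, 0.5])
# check_ensemble_weights([-2.0, 1.0])  # would raise ValueError
```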

@marthaisabelhilton
Author

Thanks for your responses. I have tried removing the negative weights from my training sample, and `classifier.predict(X_train)` is giving me an array of all ones. Do you know why this is happening?

I am using a similar method to the 'Add events two times in training' section in the notes above.

@arogozhnikov
Owner

@alexpearce

Hey Alex,
I don't think it is so different for trees. Things can go arbitrarily wrong in very simple situations:

```python
import numpy
from sklearn.ensemble import GradientBoostingRegressor

reg = GradientBoostingRegressor(n_estimators=100, max_depth=1).fit(
    numpy.arange(2)[:, None], numpy.arange(2), sample_weight=[-0.9999999999, 1]
)
reg.predict(numpy.arange(2)[:, None])
# outputs: array([9.99999917e+09, 9.99999917e+09])
```

@marthaisabelhilton

No idea, but try `clf.predict_proba` to see whether the probabilities provide meaningful separation.
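The distinction matters for ROC curves: a toy illustration (hypothetical data, not the original sample) of how `predict` collapses scores to hard labels while `predict_proba` keeps a continuous ranking.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Two slightly overlapping classes (hypothetical stand-in data).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-0.3, 1, (500, 2)), rng.normal(0.3, 1, (500, 2))])
y = np.repeat([0, 1], 500)

clf = GradientBoostingClassifier(n_estimators=20).fit(X, y)
labels = clf.predict(X)               # hard labels, thresholded at 0.5
scores = clf.predict_proba(X)[:, 1]   # continuous scores in [0, 1], for ROC
```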

@alexpearce
Contributor

Yes, negative weights can certainly make things go bad, but with very small sample sizes sWeights don't make much sense either; they only give 'reasonable' results for 'large' ensembles (all poorly defined terms, of course). That's why I was suggesting that algorithms not check for negative weights immediately, but only when actually computing quantities used in the fitting.

@arogozhnikov
Owner

@alexpearce
Well, in that case you should check the sum in each particular leaf of the tree (since we are aggregating over the samples in a leaf).

I can see potential complaints like "it just worked with two trees, what's the problem with the third one?" (in a huge ensemble like uBoost this check will almost surely be triggered), but I don't mind if anyone decides to open a PR with such checks.

@alexpearce
Contributor

Well, in that case you should check the sum in each particular leaf of the tree (since we are aggregating over the samples in a leaf).

Yes, exactly. The check should be made at that point, rather than when the training data is first fed into the tree.

And you're right, I should just open a PR if I think this is useful behaviour. I'll look into it.

(You're also right, for the third time, that I might be underestimating how often an ensemble containing negative weights will have a negative sum, but I would leave that problem up to the users, to be addressed by tuning the hyperparameters.)
