
Exporting/saving/reusing the reweighting formula #33

Open
bifani opened this issue May 11, 2016 · 4 comments


bifani commented May 11, 2016

Sometimes one would like to use a control sample, e.g. because it is more abundant, to determine MC weights that are then applied to other, e.g. rarer, samples.

For this reason it would be very useful if hep_ml.reweight could export the "reweighting formula" in some format, e.g. ROOT, so that it can also be reused from other programming languages.

Thanks


arogozhnikov commented May 12, 2016

This is a frequent question (or family of questions) from physicists who are interested in applying a trained reweighter to one more data sample. Below I give solutions for different situations.

Working from the same script

A frequently applicable solution, though for some reason often ignored by physicists (ROOT influence?), is to read the new file inside the same script/notebook and apply the reweighter there.

You can store the weights column using the recipe from this issue.

When you need to store the formula

Possible reasons:

  • data is not available
  • need to transfer formula to different machine
  • keep formula for future comparison / reproducing results.

You can use cPickle. It works as follows:

import cPickle as pickle  # on Python 3: import pickle

# saving the formula
with open('reweighter.pkl', 'wb') as f:
    pickle.dump(reweighter, f)

# loading the formula
with open('reweighter.pkl', 'rb') as f:
    reweighter = pickle.load(f)
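On Python 3 the same recipe works with the built-in pickle module; note that the file must be opened in binary mode. A minimal round-trip sketch (a plain dict stands in for the trained reweighter so it runs without hep_ml):

```python
import os
import pickle
import tempfile

# stand-in for a trained reweighter object
reweighter = {'n_estimators': 40, 'learning_rate': 0.2}

path = os.path.join(tempfile.mkdtemp(), 'reweighter.pkl')

# saving the formula (binary mode is required)
with open(path, 'wb') as f:
    pickle.dump(reweighter, f)

# loading the formula
with open(path, 'rb') as f:
    restored = pickle.load(f)
```

Remember that unpickling requires the same (or a compatible) hep_ml version to be installed on the loading side.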

Exporting to TMVA

(needed when you have to use the formula inside some production script / experiment framework)

When applying the formula, a reweighter is not much different from simple gradient boosting / a random forest (see how predict_weights works).

hep_ml uses its own BDT implementation, but it is easily converted from/to sklearn.

There are projects that convert sklearn's trees to TMVA format: koza4ok and sklearn-pmml.

Warning: I haven't tried either of those, since I am not using TMVA, so I expect many caveats along the way. If someone has tried and succeeded with exporting to TMVA, let me know.

@arogozhnikov arogozhnikov changed the title Exporting the reweighting formula Exporting/saving/reusing the reweighting formula May 12, 2016

bifani commented May 12, 2016

Hi Alex,

thanks a lot for the quick feedback!
cPickle looks like what I need, I'll give it a go.

Regards,
s.


kpedro88 commented Oct 3, 2018

I have a question about converting from hep_ml BDTs to sklearn BDTs. I am trying to use the "exporting to TMVA" method via koza4ok, and it works with a few tweaks:

classifiers['uGBFL'].loss_ = classifiers['uGBFL'].loss
classifiers['uGBFL'].loss_.K = 1
classifiers['uGBFL'].estimators_ = np.empty(
    (classifiers['uGBFL'].n_estimators, classifiers['uGBFL'].loss_.K), dtype=np.object)
for i, est in enumerate(classifiers['uGBFL'].estimators):
    classifiers['uGBFL'].estimators_[i] = est[0]

However, I am not sure the last line gives the correct output. In UGradientBoostingClassifier, the estimators_ member is a list of [tree, leaf_values]. The leaf_values first come from the tree, but then get updated:

# update tree leaves
leaf_values = tree.get_leaf_values()
if self.update_tree:
    terminal_regions = tree.transform(X)
    leaf_values = self.loss.prepare_new_leaves_values(terminal_regions, leaf_values=leaf_values,
                                                      y_pred=y_pred)
y_pred += self.learning_rate * self._estimate_tree(tree, leaf_values=leaf_values, X=X)
self.estimators.append([tree, leaf_values])

At the end, get_leaf_values() returns a different array than the leaf_values stored in the estimators_ list:

>>> print classifiers['uGBFL'].estimators[0][0].get_leaf_values()
[ 0.01252273 -1.72148748 -2.77744433 -1.07583091  0.29113487  0.16071584
  0.05392691  1.75249969  2.29887652]
>>> print classifiers['uGBFL'].estimators[0][1]                  
[ 0.          0.         -2.6523975  -1.15883605  0.          0.
  0.08844491  1.44762732  2.12097526]

Should I export the array from get_leaf_values(), or use the leaf_values from the list?

@arogozhnikov

Hi @kpedro88
Your analysis is correct: only the leaf id predicted by the tree is important, not the leaf values stored inside it; the leaf values stored separately in (tree, leaf_values) are the ones actually used. So the leaf values stored inside the tree are ignored completely.

For the conversion, almost surely you'll need to do the following (not tested, maybe needs corrections):

import copy

for tree, leaf_values in estimators:
    new_tree = copy.deepcopy(tree)
    assert new_tree.tree_.value.shape == (len(leaf_values), 1, 1)
    new_tree.tree_.value[:, 0, 0] = leaf_values
    # <save new tree to the ensemble>

Don't forget to verify that you get the same predictions before / after the conversion.
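To illustrate just the overwrite step with plain sklearn (a toy sketch: a single DecisionTreeRegressor and made-up leaf values, not hep_ml's actual ensemble; in sklearn a value is stored for every node, and leaves are the nodes whose children_left is -1):

```python
import copy
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(size=(200, 3))
y = rng.normal(size=200)

tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

# overwrite the leaf values of a deep copy, leaving the original tree intact
new_tree = copy.deepcopy(tree)
is_leaf = new_tree.tree_.children_left == -1
new_tree.tree_.value[is_leaf, 0, 0] = 0.5  # made-up replacement leaf values

# predictions now come from the overwritten leaves
preds = new_tree.predict(X)
```

Comparing tree.predict(X) and new_tree.predict(X) on the same inputs is exactly the before/after check suggested above.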
