
Convert parameters of logistic regression model from glmnet to sklearn #6

lingling93 opened this issue Apr 28, 2019 · 5 comments

lingling93 commented Apr 28, 2019

code in predictr.ipynb:

fit = hetior::glmnet_train(
  X = X_train, y = y_train, alpha = 0.2, s = lambda,
  cores = 10, seed = 0, penalty.factor = penalty,
  lambda.min.ratio = 1e-8, nlambda = 150, standardize = TRUE
)

How do these parameters map to sklearn's logistic regression parameters?

@dhimmel dhimmel changed the title from "Reverse parameters of logistic regression model" to "Convert parameters of logistic regression model from glmnet to sklearn" on Apr 30, 2019

dhimmel commented Apr 30, 2019

Thanks @lingling93 for the interest.

Looking at the source code for the hetior::glmnet_train function, it fits the logistic regression model with the following:

  # cross-validated elastic net logistic regression
  fit$cv_model <- glmnet::cv.glmnet(x = X, y = y, weights = w, family = 'binomial',
    alpha = alpha, parallel = TRUE, ...)
  # select lambda according to s (e.g. "lambda.min" or "lambda.1se")
  fit$lambda <- fit$cv_model[[s]]

I believe the closest thing in sklearn is sklearn.linear_model.SGDClassifier, which can fit an elastic net logistic regression model. Note that the alpha parameter in glmnet is equivalent to the l1_ratio parameter in sklearn (the elastic-net mixing parameter), while the lambda parameter in glmnet corresponds to the alpha parameter in sklearn (the regularization strength).
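
To make the mapping concrete, here is a minimal sketch (my addition, not from the original thread) of an elastic net logistic regression in sklearn; chosen_lambda is a hypothetical placeholder for whatever λ cross-validation selects:

    # Sketch of the glmnet -> sklearn parameter mapping. Assumes X_train and
    # y_train are a feature matrix and binary labels; chosen_lambda is a
    # hypothetical stand-in for glmnet's selected lambda.
    from sklearn.linear_model import SGDClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    chosen_lambda = 0.01  # placeholder for glmnet's selected lambda

    model = make_pipeline(
        StandardScaler(),  # plays the role of glmnet's standardize=TRUE
        SGDClassifier(
            loss="log_loss",      # logistic regression ("log" in sklearn < 1.1)
            penalty="elasticnet",
            l1_ratio=0.2,         # glmnet's alpha: elastic-net mixing parameter
            alpha=chosen_lambda,  # glmnet's lambda: regularization strength
            random_state=0,
        ),
    )
    # model.fit(X_train, y_train)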

If I remember correctly, glmnet::cv.glmnet is more efficient at evaluating many regularization parameters than implementations that use sklearn.model_selection.GridSearchCV with SGDClassifier. I believe that is why I did the model fitting in R rather than Python. It may be worth checking out python-glmnet, which claims to follow the sklearn API while providing access to the underlying glmnet Fortran code.
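
For reference, a hedged sketch of what an equivalent fit with python-glmnet might look like, assuming its LogitNet estimator; the parameter names below are my reading of that package and should be verified against its documentation:

    # Sketch using python-glmnet's LogitNet (assumed API; verify against
    # https://github.com/civisanalytics/python-glmnet before relying on it).
    from glmnet import LogitNet

    model = LogitNet(
        alpha=0.2,              # elastic-net mixing, as in the R call above
        n_lambda=150,           # nlambda=150
        min_lambda_ratio=1e-8,  # lambda.min.ratio=1e-8
        standardize=True,       # standardize=TRUE
        random_state=0,         # seed=0
    )
    # model.fit(X_train, y_train)  # penalty.factor may correspond to the
    #                              # relative_penalties argument of fit()
    # model.lambda_best_           # lambda selected by cross-validation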



lingling93 commented May 9, 2019

@dhimmel Thank you for your quick and informative answer.
I think I have a clue now. One more question about the code in predictr.ipynb: I didn't find the s parameter in glmnet. What is s?


dhimmel commented May 9, 2019

The s parameter specifies which λ (lambda regularization parameter) value to use based on the cross-validation results. From the glmnet vignette:

lambda.min is the value of λ that gives minimum mean cross-validated error. The other λ saved is lambda.1se, which gives the most regularized model such that error is within one standard error of the minimum. To use that, we only need to replace lambda.min with lambda.1se above.

For Project Rephetio, we use s="lambda.1se". We also used lambda.1se in our previous work, where we wrote the following:

Regularized logistic regression requires a parameter, λ, setting the strength of regularization. We optimized λ separately for each model fit. Using 10-fold cross-validation and the “one-standard-error” rule to choose the optimal λ from deviance, we adopted a conservative approach designed to prevent overfitting [80].

The “one-standard-error” rule is further described in Regularization Paths for Generalized Linear Models via Coordinate Descent:

We often use the “one-standard-error” rule when selecting the best model; this acknowledges the fact that the risk curves are estimated with error, so errs on the side of parsimony (Hastie et al. 2009). Cross-validation can be used to select α as well, although it is often viewed as a higher-level parameter and chosen on more subjective grounds.
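
The rule itself is simple to state in code. Here is an illustrative sketch (my own, with made-up numbers) of how the one-standard-error λ is chosen from cross-validation results:

    # One-standard-error rule with made-up CV results: pick the largest
    # (most regularizing) lambda whose CV error is within one standard
    # error of the minimum CV error.
    import numpy as np

    lambdas  = np.array([1.0, 0.5, 0.1, 0.05, 0.01])     # descending, as in glmnet
    cv_error = np.array([0.70, 0.55, 0.40, 0.39, 0.41])  # mean cross-validated error
    cv_se    = np.array([0.04, 0.04, 0.03, 0.03, 0.04])  # standard error of each mean

    i_min = cv_error.argmin()                    # index of lambda.min
    threshold = cv_error[i_min] + cv_se[i_min]   # minimum error + one standard error
    lambda_1se = lambdas[cv_error <= threshold].max()
    print(lambda_1se)  # 0.1 for these made-up numbers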

In Project Rephetio discussion, I made the following comment related to our use of lambda.1se:

Such a small number of positive coefficients is a bit disappointing. Our feature assessment ... shows that a broad range of metapaths are informative. The origin of our model's selectivity appears to lie with the “one-standard-error” rule [2] we use to identify the optimal regularization strength (λ). Our model had high cross-validated standard error leading to substantial regularization on top of the deviance minimizing model. While it's tempting to relax our λ selection, I'd rather be more confident in a minimalist model than risk a less coherent but more complex model.

lingling93 commented

@dhimmel Hi Daniel, problem solved. I used python-glmnet to reproduce your work, training the logistic regression model and matching every parameter. Then I checked lambda_best: with different seeds it fluctuates a little, and with a certain seed it gives a result close to yours. The coefficient of prior_prob is steadier, around 0.7. So I think this is enough to show that I can use python-glmnet.
Thank you for all your help.


dhimmel commented May 12, 2019

Cool! I'm looking forward to trying out python-glmnet myself. Yeah, there is a random seed, and I'm guessing it won't be possible to achieve exactly the same results in Python versus R because the randomness will be different.
