importance = "permutation", what is this doing? #237
From our paper "Do little interactions get lost in dark random forests?"
I'm getting a little hung up on the nomenclature used in this Cross Validated post. Is the permutation importance method different from a "conditional random forest"?
Yes, in a conditional inference forest, a permutation test is used for splitting, while the permutation importance uses permutations to calculate variable importance.
Thank you for the clarification. Is it possible to incorporate this splitting rule within ranger?
If you have a regression, survival, or binary classification problem, you could use splitrule = "maxstat".
Thanks for the info! I'm able to access your paper as well.
Strobl et al. [68] show that the default option (in ranger) of sampling with replacement leads to a variable importance bias in favor of predictors with many categories, even if the trees are built using an unbiased criterion. Two questions:
Citation: C. Strobl, A.-L. Boulesteix, A. Zeileis, and T. Hothorn. Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics, 8:25, 2007.
The choice of 0.632 * n (i.e. (1 - exp(-1)) * n) is quite common because this is the expected number of unique observations when sampling n times with replacement. It is a reasonable default but might be improved by tuning. See also Section 2.1.2 of our recent paper on hyperparameters* and Philipp et al.'s paper on tunability. *Free preprint: https://arxiv.org/pdf/1605.03391
In ranger 0.11.2, I'm getting "Error: maxstat splitrule applicable to regression or survival data only" when I try to use it for classification.
You'll have to grow a regression forest on 0/1 outcomes. For two classes, regression with the default variance splitting is equivalent to classification with the Gini index.
A follow-up question: which prediction error is the permutation importance based on for regression forests?
It's the mean squared error.
I found my way to this issue while trying to build a fast, unbiased random forest on a binary outcome, hoping eventually to do some model inference using a variable importance metric that is robust to correlations among predictors. (For posterity: the Sobol-MDA approach to dealing with correlated predictors led me to this other {ranger} issue, which led me to "conditional predictive impact" and the {cpi} package.) I learned a lot! Thanks for all the links and suggestions.

In trying to implement this, I am hoping to account for class imbalance too. When I switch from a classification problem on the binary outcome to a regression problem on a 0/1 outcome, to effectively grow a probability-like random forest, I lose the ability to include class.weights.

Is there a way that you would recommend accounting for class imbalance in a 0/1 outcome regression problem? Particularly a way that can be implemented in {mlr3}, so that I can use {cpi}?
I think you are looking for probability trees/forests. In ranger, define it as a classification outcome (i.e. a factor) and set probability = TRUE.
Thanks so much for your reply! You are right that I am essentially looking for probability trees, but I learned from this issue that I can only use the maxstat splitrule for regression and survival forests.

The ultimate goal is to try to get unbiased variable selection, since I'm interested in inference about important features, which is what led me to conditional inference trees, and then to your Wright et al., 2017 paper. So is there an appropriate way to mimic probability = TRUE while using the maxstat splitrule?

For posterity, I did learn that I can implement the maxstat splitrule in {mlr3}.
No; the probability trees don't support it (see ranger/src/TreeProbability.cpp, line 290 in commit 5f71872).

However, you can use case.weights, which works for all tree types. And again, be careful with the maxstat splitrule on 0/1 outcomes.
This is so helpful; thank you very much for your time! And thanks also for your caveat about using maxstat on 0/1 outcomes.

You've likely seen that work, but for posterity: they performed simulations to identify bias in variable importance metrics that arises from bias in the splitting rule. They use 0/1 outcomes to investigate this in the classification scenario, and continuous outcomes in the regression scenario.

From my read, the results don't directly cover the 0/1 outcome {ranger} regression using maxstat. Is there different testing that would make us feel more confident of this?
Yes, that looks good! I would also check the prediction performance, but I'm sure you are already doing that. |
Thanks @mnwright for looping me in. We tried maxstat with binary outcomes only for completeness, but we never used it for prediction, for instance. Now, I would like to clarify and demystify a couple of things.
This is great; thanks so much for providing some of this intuition here. I'm convinced about using the computational power of {ranger} with a maxstat split rule.

I think the only outstanding question in my mind is about dealing with correlated predictors; that seems to be a factor that also plays into which variable importance measure you might pick. In a different issue, it was you who pointed me to CPI for exactly this use case. Understanding that there is no such thing as totally "unbiased" variable selection, it sounds like more research is needed before using these importance measures with correlated predictors.
Unbiasedness is unachievable for recursive partitioning when you have predictors of different types.

Correlation is problematic in general within any statistical framework. CPI would give a higher ranking to uncorrelated predictors, which I think makes sense for most problems, but variable importance is a vague concept and it really depends on what you want to get from the data. Correlated variables compete and share some of the same information, so for prediction I'd rather pick predictors that are not correlated, but that's a personal choice.

The total reduction in heterogeneity is fine because it measures exactly that; the problem is when you use it on a forest that splits to purity. Splitting criteria, variable importance, and prediction are three different beasts with different goals. "Unbiased" in this case is a misnomer used in machine learning, because there is no parameter.

The variable importance we presented in 2018 has the purpose of mimicking the permutation importance, which can be extremely slow for some datasets (even days), and it can be computed at basically no cost while the forest is generated.

Hope this helps.
Hi,
I am quite new to running RF and to using ranger. I have ~3000 SNPs, 265 individuals, and a quantitative phenotype. My aim is to use RF (ranger) to rank the SNPs by how much each one explains the phenotype variation in the population (inter-individual variation).
Anyway, I have read that "The most reliable measurement is the (unscaled) permutation importance" (Szymczak et al., 2016). This refers to the most reliable measurement for assessing the predictive power of each variable (or SNP), which is what I am after. Therefore, I have to use importance = "permutation", which is the only way to compute permutation importance (right?). But I don't understand what is being permuted. I suppose the samples? Could someone explain in a bit more detail how permutation in random forests works? I really would like to understand!
Thanks in advance for the help,
Regards,
Angela Parody-Merino