use of case.weights versus class.weights in the case of a binary response? #12

Open
mikoontz opened this issue Aug 1, 2022 · 0 comments

mikoontz commented Aug 1, 2022

tl;dr

I noticed some unexpected behavior in the permutation importance when a binary response variable is modeled with a regression random forest. Variables that were highly important according to other "importance" metrics (e.g., mean minimum tree depth, large differences in predicted value across a gradient of that variable, the number of times the variable is the root of a tree, and the cross-validated importance value I get from spatialRF::rf_importance()) were showing up as strongly negative in the standard $variable.importance score.
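
For context on what a negative score means: permutation importance is, roughly, the change in prediction error after shuffling one predictor, so a negative value says the error went down when that variable was scrambled. Below is a minimal by-hand sketch of that idea using a linear model on simulated data. It is not how ranger computes its scores (as I understand it, ranger works per tree on out-of-bag samples); the `perm_importance()` helper, the toy data, and every name in it are assumptions made up for illustration.

```r
set.seed(1)

# Simulated data (hypothetical): one informative predictor, one pure-noise predictor
n <- 200
X <- data.frame(signal = rnorm(n), noise = rnorm(n))
y <- 2 * X$signal + rnorm(n, sd = 0.5)

fit <- lm(y ~ signal + noise, data = cbind(X, y = y))

# Mean squared error, the loss used for this regression-style example
mse <- function(obs, pred) mean((obs - pred)^2)

# Hypothetical helper: mean increase in loss after permuting each predictor in turn
perm_importance <- function(fit, X, y, loss, n_rep = 20) {
  baseline <- loss(y, predict(fit, newdata = X))
  vapply(names(X), function(v) {
    increases <- replicate(n_rep, {
      X_perm <- X
      X_perm[[v]] <- sample(X_perm[[v]])  # break the link between v and y
      loss(y, predict(fit, newdata = X_perm)) - baseline
    })
    mean(increases)
  }, numeric(1))
}

imp <- perm_importance(fit, X, y, mse)
sort(imp)
```

Under this definition, a predictor the model relies on should score clearly above a noise predictor, and scores near zero (including slightly negative ones) are what uninformative predictors tend to produce. A strongly negative score for a variable that looks important by other measures is what makes the behavior described here surprising.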

Some details

I built some {ranger} models directly to try to suss this out and think I've identified that this arises when treating a binary response as a regression problem.

My (naive) understanding is that the class.weights argument of ranger() is the best way to account for class imbalance given a binary (or other categorical) response. I believe that the {spatialRF} machinery (e.g., using spatialRF::case_weights()) passes that information along to case.weights instead of class.weights.
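
One detail worth checking when moving between the two arguments: the ranger documentation describes class.weights as being given in the order of the factor levels, while case.weights has one entry per observation. A minimal base-R sketch of keeping the two consistent follows. The inverse-frequency weighting and the toy response are assumptions for illustration, and not necessarily what spatialRF::case_weights() returns.

```r
# Toy imbalanced binary response (hypothetical); note the first rows are class "1"
y <- factor(c(1, 1, 0, 0, 0, 0, 0, 0))

# One weight per class, named and ordered by factor level ("0", then "1")
class_counts <- table(y)
class_wgts <- setNames(1 / as.numeric(class_counts), levels(y))

# One weight per observation, looked up from the class weights
case_wgts <- unname(class_wgts[as.character(y)])

# unique() returns values in order of first appearance in the data,
# which here is class "1" first, i.e. NOT factor-level order
by_first_appearance <- unique(case_wgts)

class_wgts
by_first_appearance
```

With a vector like `class_wgts`, the weights can be passed as `class.weights = unname(class_wgts)` without depending on how the rows happen to be sorted.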

I am having a hard time understanding how case.weights and class.weights are used in ranger(). However, when I build a {ranger} model directly with a binary response and treat it as a classification problem (rather than regression), the permutation importance seems to track much better with the other measures of variable importance I listed above. That makes me suspect this is a fundamental issue that comes up when (inappropriately??) treating a binary response as a regression problem and using case.weights to try to account for class imbalance.
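
To make "tracks much better" a bit more concrete, one option is to compare importance rankings between two models with a rank correlation. The sketch below assumes each model yields a named numeric vector like `$variable.importance`. The `compare_importance()` helper and the scores are made up for illustration and are not results from the models in this issue.

```r
# Made-up importance scores (illustration only, not output from any model here)
imp_regression     <- c(var_a = -0.02, var_b = 0.01, var_c = 0.05, var_d = 0.00)
imp_classification <- c(var_a =  0.08, var_b = 0.01, var_c = 0.03, var_d = 0.00)

# Hypothetical helper: rank variables within each model (1 = most important)
compare_importance <- function(imp_x, imp_y) {
  shared <- intersect(names(imp_x), names(imp_y))
  data.frame(
    variable = shared,
    rank_x = unname(rank(-imp_x[shared])),
    rank_y = unname(rank(-imp_y[shared]))
  )
}

ranks <- compare_importance(imp_regression, imp_classification)

# Spearman correlation of the two rankings: near 1 means the models agree
rho <- cor(ranks$rank_x, ranks$rank_y, method = "spearman")

ranks
rho
```

With real models, the two inputs would be something like `fm1$variable.importance` and `fm2$variable.importance` from the code below.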

Anyway, I'm still trying to read more to better understand the implications for building the model but I thought I'd flag it for now!

[edit: I'm pasting in some of my investigation code in case that's useful...]

library(spatialRF)
library(ranger)

plant_richness_df$response_binomial <- ifelse(
  plant_richness_df$richness_species_vascular > 5000,
  1,
  0
)

case.wgts <- spatialRF::case_weights(data = plant_richness_df, 
                                    dependent.variable.name = "response_binomial")

predictor.variable.names <- colnames(plant_richness_df)[5:21]

# Regression problem with binary response and using case.weights
fm1 <- ranger::ranger(x = plant_richness_df[, predictor.variable.names],
                      y = plant_richness_df[["response_binomial"]], 
                      data = plant_richness_df,
                      classification = FALSE,
                      probability = FALSE,
                      case.weights = case.wgts,
                      importance = "permutation",
                      seed = 1)

as.data.frame(sort(fm1$variable.importance))

# Classification problem with a factor as response variable, and using case.weights
fm2 <- ranger::ranger(x = plant_richness_df[, predictor.variable.names],
                      y = as.factor(plant_richness_df[["response_binomial"]]), 
                      data = plant_richness_df,
                      classification = TRUE,
                      probability = FALSE,
                      case.weights = case.wgts,
                      importance = "permutation",
                      seed = 1)

as.data.frame(sort(fm2$variable.importance))

# Probability estimation problem with a factor as response variable, and using case.weights
fm3 <- ranger::ranger(x = plant_richness_df[, predictor.variable.names],
                      y = as.factor(plant_richness_df[["response_binomial"]]), 
                      data = plant_richness_df,
                      classification = FALSE,
                      probability = TRUE,
                      case.weights = case.wgts,
                      importance = "permutation",
                      seed = 1)

as.data.frame(sort(fm3$variable.importance))

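# Probability estimation with a factor as response variable, and using class.weights
# (note: ranger expects class.weights in factor-level order, whereas unique(case.wgts)
# is in order of first appearance in the data; the same applies to fm5 below)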
fm4 <- ranger::ranger(x = plant_richness_df[, predictor.variable.names],
                      y = as.factor(plant_richness_df[["response_binomial"]]), 
                      data = plant_richness_df,
                      classification = FALSE,
                      probability = TRUE,
                      class.weights = unique(case.wgts),
                      importance = "permutation",
                      seed = 1)

as.data.frame(sort(fm4$variable.importance))

# Probability estimation with a factor as response variable, and using both class.weights and case.weights
fm5 <- ranger::ranger(x = plant_richness_df[, predictor.variable.names],
                      y = as.factor(plant_richness_df[["response_binomial"]]), 
                      data = plant_richness_df,
                      classification = FALSE,
                      probability = TRUE,
                      case.weights = case.wgts,
                      class.weights = unique(case.wgts),
                      importance = "permutation",
                      seed = 1)

as.data.frame(sort(fm5$variable.importance))

# spatialRF: the {spatialRF} version creates the same model as fm1
fm6 <- spatialRF::rf(data = plant_richness_df, 
                     dependent.variable.name = "response_binomial", 
                     predictor.variable.names = predictor.variable.names, 
                     seed = 1)

as.data.frame(sort(fm6$variable.importance))
as.data.frame(sort(fm1$variable.importance))