Can not specify the classes of a prediction outcome #654

Open
HaloCollider opened this issue Feb 18, 2023 · 4 comments

Comments

@HaloCollider

I'm working on a binary classification task where the dependent variable y is numeric (0 and 1) rather than a factor, which is convenient for the numeric calculations that follow. My problem is this:

The prediction returned by the model is an n-by-2 data frame (or some similar data type), with each column holding the probability of one class, but the columns have no names. More importantly, the order of the columns does not necessarily match the "0 and 1" order, so I cannot simply take the second column as the probability of y = 1. I haven't figured out the logic behind this, so the order seems to be more or less random.

Therefore, I want to ask whether there is a way to specify which class (0 or 1) each column of a prediction corresponds to in a classification setting. It would be great if we didn't have to convert y into a factor, because we do a lot of numeric calculations after predicting. Thanks.

@HaloCollider
Author

I think this can be a serious problem for classification. Luckily, our sample is very unbalanced, so it was easy to see that the column order differed between models: some of them would have produced exactly the opposite predictions if the order had stayed the same. It still took me a long time to find out, though.

@mnwright
Member

mnwright commented Mar 3, 2023

Could you please give a reproducible example of the problem?

@stephematician
Contributor

If the response is not a factor (assuming the R interface is used), the columns are ordered by the first appearance of each value in the data (by row).

Using the R interface, the columns should have the correct names; however, this won't be obvious when using the C++ interface. I also don't believe this is documented.
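To make the "order of first appearance" rule concrete, here is a small base R sketch. It assumes that the ordering described above matches what `unique()` returns; it does not call ranger itself, and the two vectors are made-up illustrations.

```r
# Two numeric 0/1 responses that differ only in which value occurs first.
y_a <- c(0, 0, 0, 1, 1)
y_b <- c(1, 0, 0, 0, 0)

# unique() keeps values in order of first appearance, which is the
# ordering assumed here for a non-factor response.
order_a <- unique(y_a)  # 0 comes first
order_b <- unique(y_b)  # 1 comes first

# A factor, by contrast, sorts its levels regardless of row order.
levels_b <- levels(factor(y_b))

print(order_a)
print(order_b)
print(levels_b)
```

If that assumption holds, two data sets with the same classes can yield differently ordered probability columns simply because their first rows differ.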

@krzyzinskim

I encountered the same problem. @HaloCollider, this is probably too late to help you, but the order of the classes in the matrix of predicted probabilities can be found in your.model$forest$class.values (I think it's always in the right order).
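As a rough sketch of how that could be used: look up the position of the class you care about in class.values and take that column. The helper `prob_of_class` is hypothetical (not part of ranger), and the small matrix below is a stand-in for `predict(model, X)$predictions` so the snippet runs without the package.

```r
# Hypothetical helper: pick the probability column for one class,
# given the class order reported by the model.
prob_of_class <- function(probs, class_values, target) {
  idx <- match(target, class_values)
  if (is.na(idx)) stop("class ", target, " not found in class_values")
  probs[, idx]
}

# Stand-ins (illustrative values) for predict(model, X)$predictions
# and model$forest$class.values when class 1 happens to come first.
probs <- matrix(c(0.96, 0.04,
                  0.04, 0.96), ncol = 2, byrow = TRUE)
class_values <- c(1, 0)

# Probability of y = 1 for each row, independent of column order.
p1 <- prob_of_class(probs, class_values, 1)
print(p1)
```

With a real model, the call would presumably look like `prob_of_class(predict(model, X)$predictions, model$forest$class.values, 1)`.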

And @mnwright, here is a small reproducible example:

library(ranger)

## 0 is first 
set.seed(123)
p <- 4
n <- 1000
X <- data.frame(matrix(rnorm(n*p), nrow = n))
y <- as.numeric(rowSums(X) > 0)

y[1:5] # [1] 0 0 0 1 1

model <- ranger(x=X,
                y=y,
                probability=TRUE)

prediction_probs <- predict(model, X)$predictions
prediction_probs[1:5, ]
#           [,1]        [,2]
# [1,] 0.9956444 0.004355556
# [2,] 0.9906111 0.009388889
# [3,] 0.8179349 0.182065079
# [4,] 0.0780381 0.921961905
# [5,] 0.3289381 0.671061905

model$forest$class.values # [1] 0 1

#### 

## 1 is first 
set.seed(42)
X <- data.frame(matrix(rnorm(n*p), nrow = n))
y <- as.numeric(rowSums(X) > 0)

y[1:5] # [1] 1 0 0 0 0

model <- ranger(x=X,
                y=y, 
                probability=TRUE)

prediction_probs <- predict(model, X)$predictions
prediction_probs[1:5, ]
#            [,1]       [,2]
# [1,] 0.96184603 0.03815397
# [2,] 0.04116032 0.95883968
# [3,] 0.12405714 0.87594286
# [4,] 0.03781984 0.96218016
# [5,] 0.18086905 0.81913095

model$forest$class.values # [1] 1 0

I've found here that the matrix is only given column names when forest$levels is not NULL (and it is NULL for a non-factor response; see the related resolved issue). Perhaps it's worth naming the columns based on forest$class.values, which is always non-empty?
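Until something like that exists in the package, a user-side workaround along the same lines might look like the sketch below. `name_prob_columns` is a hypothetical helper, not a ranger function, and the matrix is again an illustrative stand-in for the prediction output.

```r
# Hypothetical helper: label the probability columns with the class
# values and put them in sorted class order (0 first, then 1).
name_prob_columns <- function(probs, class_values) {
  colnames(probs) <- as.character(class_values)
  probs[, order(class_values), drop = FALSE]
}

# Stand-ins (illustrative values) for predict(model, X)$predictions
# and model$forest$class.values, with class 1 in the first column.
probs <- matrix(c(0.96, 0.04,
                  0.04, 0.96), ncol = 2, byrow = TRUE)
class_values <- c(1, 0)

named <- name_prob_columns(probs, class_values)
print(named)

# Columns can now be addressed by class label instead of position.
p1 <- named[, "1"]
```

Indexing by name keeps the downstream numeric calculations independent of the row order of the training data.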
