imbs-hl/pranger

Proximities with ranger

Cesaire J. K. Fouodo

Introduction

Pranger uses unsupervised random forests (URF) grown with ranger to compute proximities between individuals. Two approaches are supported. The first, from Shi and Horvath (2006), counts a dissimilarity of one between two individuals whenever they do not fall into the same terminal node. The second, from Fouodo et al. (2021), takes the tree depth into account and estimates the dissimilarity between two individuals from the length of the shortest path between the terminal nodes they belong to. The package also provides a new approach to computing URF variable importance that is suitable for the high-dimensional testing procedure of Janitza et al. (2018).
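To make the first approach concrete, the following base R sketch computes, for each pair of individuals, the fraction of trees in which they land in different terminal nodes. The `nodes` matrix and the helper `node_dissimilarity` are hypothetical and not part of pranger; how pranger aggregates or rescales these counts may differ.

```r
# Hypothetical helper, not part of pranger.
# nodes: matrix with one row per individual and one column per tree,
#        holding the terminal node ID each individual falls into.
node_dissimilarity <- function(nodes) {
  n <- nrow(nodes)
  diss <- matrix(0, n, n)
  for (t in seq_len(ncol(nodes))) {
    # TRUE (counted as 1) where two individuals sit in different terminal nodes
    diss <- diss + outer(nodes[, t], nodes[, t], FUN = "!=")
  }
  # Average over trees so that values lie between 0 and 1
  diss / ncol(nodes)
}

# Toy terminal node IDs for 4 individuals and 3 trees (made-up values)
nodes <- cbind(tree1 = c(3, 3, 4, 4),
               tree2 = c(2, 5, 5, 5),
               tree3 = c(7, 7, 7, 8))
toy_diss <- node_dissimilarity(nodes)
print(toy_diss)
```

Individuals 1 and 4 never share a terminal node in this toy input, so their dissimilarity is the largest in the matrix.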

Installation

Installation from GitHub:

devtools::install_github("imbs-hl/pranger")

CRAN release coming soon.

Usage

For usage, see ?pranger in R, and in particular its Examples section. The following example is a good starting point.

An illustrative example with the iris dataset

library(pranger)
library(ggplot2)

Compute dissimilarities with the URF-based method of Shi and Horvath (2006). To parallelize over the trees yourself, use the function shi_ranger_one_tree. In that case, you have to synthesize the two-class classification problem by resampling, grow the ranger model, and predict the terminal nodes for the original observations. See ?shi_ranger_one_tree for more details.

## URF dissimilarities with the Shi distance
iris_diss_shi <- pranger(data = iris[ , -5],
                    strategy = "boostaggr", num.trees = 100,
                    min.node.size = 15, approach = "shi", seed = 123)
## MDS on the Shi and Horvath (2006) dissimilarities
mds_iris_shi <- data.frame(cmdscale(iris_diss_shi, k = 2))
mds_iris_shi$class <- iris[ , 5]
names(mds_iris_shi) <- c("PC1", "PC2", "Species")
mds_iris_shi$Method <- "URF Shi"
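The manual steps mentioned for the Shi and Horvath approach (synthesize a two-class problem, grow the forest, predict terminal nodes) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code behind shi_ranger_one_tree: the synthetic class is drawn by resampling each variable independently with replacement, the column name `target` and the class labels are made up for the example, and the ranger step only runs if the ranger package is installed.

```r
set.seed(1)
observed <- iris[, -5]

# Assumption: build the synthetic class by resampling every variable
# independently, which breaks the dependence between variables.
synthetic <- as.data.frame(lapply(observed, function(x) sample(x, replace = TRUE)))

# Two-class problem: observed rows versus synthetic rows
two_class <- rbind(cbind(observed, target = "observed"),
                   cbind(synthetic, target = "synthetic"))
two_class$target <- factor(two_class$target)

if (requireNamespace("ranger", quietly = TRUE)) {
  # Grow a classification forest that separates observed from synthetic data
  rf <- ranger::ranger(target ~ ., data = two_class,
                       num.trees = 100, min.node.size = 15, seed = 123)
  # Terminal node ID of every original observation in every tree
  nodes <- predict(rf, data = observed, type = "terminalNodes")$predictions
  print(dim(nodes))
}
```

The resulting `nodes` matrix has one row per original observation and one column per tree, which is the input a per-tree dissimilarity needs.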

Compute dissimilarities with the URF-based method of Fouodo et al. (2021). To parallelize over the trees yourself, use the function predicted_tree_distance.

## URF dissimilarities with the deep distance
iris_diss_deep <- pranger(data = iris[ , -5],
                    strategy = "boostaggr",
                    num.trees = 100,
                    min.node.size = 15, approach = "deep", seed = 123)
## MDS on the deep dissimilarities
mds_iris_deep <- data.frame(cmdscale(iris_diss_deep, k = 2))
mds_iris_deep$class <- iris[ , 5]
names(mds_iris_deep) <- c("PC1", "PC2", "Species")
mds_iris_deep$Method <- "URF Deep"
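The idea behind the deep distance, the length of the shortest path between two terminal nodes of a tree, can be illustrated in base R on a toy tree. The tree, the `parent` vector, and the helpers `path_to_root` and `path_length` are hypothetical and only serve the illustration; how pranger turns path lengths into dissimilarities is not shown here.

```r
# Hypothetical toy tree: parent[i] is the parent of node i, the root has NA.
#          1
#        /   \
#       2     3
#      / \   / \
#     4   5 6   7
parent <- c(NA, 1, 1, 2, 2, 3, 3)

# Nodes visited when walking from a node up to the root (node itself first)
path_to_root <- function(node, parent) {
  path <- node
  while (!is.na(parent[node])) {
    node <- parent[node]
    path <- c(path, node)
  }
  path
}

# Number of edges on the shortest path between two nodes of the tree
path_length <- function(a, b, parent) {
  pa <- path_to_root(a, parent)
  pb <- path_to_root(b, parent)
  # Shared ancestors in the order met from a; the first is the closest one
  lca <- intersect(pa, pb)[1]
  (match(lca, pa) - 1) + (match(lca, pb) - 1)
}

print(path_length(4, 5, parent))  # siblings under node 2
print(path_length(4, 7, parent))  # terminal nodes in different subtrees
```

Two terminal nodes that split late in the tree are separated by a short path, while nodes that split at the root are separated by a long one, which is the information the first approach discards.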

Use PCA to capture latent effects, for comparison with the two URF methods.

## PCA
pca_iris <- princomp(iris[ , -5])$scores[ , 1:2]
pca_iris <- data.frame(pca_iris)
pca_iris$class <- iris[ , 5]
names(pca_iris) <- c("PC1", "PC2", "Species")
pca_iris$Method <- "PCA"

Visualize the results of the three methods.

## Plot
mds_iris_dist <- data.frame(rbind(pca_iris, mds_iris_shi, mds_iris_deep))
iris_plot <- ggplot(data = mds_iris_dist,
                            aes(x = PC1, y = PC2,
                            colour = Species)) +
              geom_point() +
              facet_wrap( ~ Method, scales = "free", nrow = 2)
print(iris_plot)

Variable importance and testing procedure with UNAIR

library(MASS)
set.seed(321)
## Simulate data with 3 clusters and 10 relevant variables
test_data <- mvrnorm(n = 300, mu = rep(0, 250), Sigma = diag(250))
test_data[1:100, 1:10] <- test_data[1:100, 1:10] - 5
test_data[201:300, 1:10] <- test_data[201:300, 1:10] + 5
## Variable importance and testing procedure
testing_res <- urf_test(data = data.frame(test_data),
                        target = "target",
                        resampling_seed = 123,
                        num.trees = 10e3)
print(testing_res)
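For readers unfamiliar with the testing procedure of Janitza et al. (2018), the sketch below shows its core idea in base R: the non-positive importance scores are mirrored around zero to approximate the null distribution, and each variable's p-value is the share of that null distribution lying above its score. The helper `janitza_pvalues` and the toy scores are hypothetical; this is not the implementation used by urf_test.

```r
# Hypothetical helper illustrating the idea; not the urf_test() implementation.
janitza_pvalues <- function(importance) {
  negative <- importance[importance < 0]
  zero <- importance[importance == 0]
  # Mirror the negative scores around zero to approximate the null distribution
  null_scores <- c(negative, -negative, zero)
  null_cdf <- ecdf(null_scores)
  # p-value: proportion of the null distribution above the observed score
  1 - null_cdf(importance)
}

# Made-up importance scores for six variables
toy_importance <- c(-0.2, -0.1, 0, 0.05, 0.5, 1.2)
toy_pvalues <- janitza_pvalues(toy_importance)
print(round(toy_pvalues, 2))
```

The approach needs enough variables with non-positive importance to estimate the null distribution, which is why it is aimed at high-dimensional data.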

References

  • Shi, T. and Horvath, S. (2006). Unsupervised Learning with Random Forest Predictors. Journal of Computational and Graphical Statistics 15 (1): 118–138.
  • Janitza, S., Celik, E. and Boulesteix, A.-L. (2018). A computationally fast variable importance test for random forests for high-dimensional data. Advances in Data Analysis and Classification. doi: 10.1007/s11634-016-0276-4
  • Fouodo, C. J. K., König, I. R. and Szymczak, S. (2022). Computing variable importance with unsupervised random forests. In review process.
  • Fouodo, C. J. K., Szymczak, S., Wright, M. N. and König, I. R. (2021). Improving unsupervised random forests with new proximity measure and a modified resampling strategy. xxxx xx(xx): xx-xx.
