R package vignette (tutorial) (#461)

h2oai · Mar 23, 2018 · b5b5c1b
1 parent 6310658
commit b5b5c1b
Showing 2 changed files with 262 additions and 1 deletion.
diff --git a/src/interface_r/DESCRIPTION b/src/interface_r/DESCRIPTION
@@ -1,7 +1,7 @@
 Package: h2o4gpu
 Type: Package
 Title: R Interface to 'H2O4GPU'
-Version: 0.0.0.9000
+Version: 0.2.0
 Authors@R: c(
   person("Yuan", "Tang", role = c("aut", "cre"), email = "[email protected]"),
   person("Navdeep", "Gill", role = c("aut"), email = "[email protected]"),

diff --git a/src/interface_r/vignettes/getting_started.Rmd b/src/interface_r/vignettes/getting_started.Rmd
@@ -0,0 +1,261 @@
+---
+title: "H2O4GPU: Machine Learning with GPUs in R"
+author: "Navdeep Gill, Erin LeDell, Yuan Tang"
+date: "`r Sys.Date()`"
+output: rmarkdown::html_vignette
+vignette: >
+  %\VignetteIndexEntry{H2O4GPU: Machine Learning with GPUs in R}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+
+
+
+**H2O4GPU** is a collection of GPU solvers by [H2O.ai](https://www.h2o.ai/) with APIs in Python and R.  The Python API builds upon the easy-to-use [scikit-learn](http://scikit-learn.org) API and its well-tested CPU-based algorithms.  The **h2o4gpu** R package is a wrapper around the **h2o4gpu** Python package, and the interface follows standard R conventions for modeling.  
+
+The R package makes use of RStudio's [reticulate](https://rstudio.github.io/reticulate/) package for facilitating access to Python libraries through R.  Reticulate embeds a Python session within your R session, enabling seamless, high-performance interoperability.
+
+**H2O4GPU** is a new project under active development and we are looking for contributors!  If you find a bug, please check that we have not already fixed the issue in the bleeding edge version and then check that we do not already have an issue opened for this topic.  If not, then please file a new issue with a reproducible example. 
+
+- Here is the main [GitHub repo](https://github.com/h2oai/h2o4gpu).  If you like the package, please 🌟 the repo on GitHub! 
+- If you're looking to contribute, check out the [CONTRIBUTING.md](https://github.com/h2oai/h2o4gpu/blob/master/CONTRIBUTING.md) file.
+- All open issues that are specific to the R package are [here](https://github.com/h2oai/h2o4gpu/labels/R).
+- All open issues are [here](https://github.com/h2oai/h2o4gpu/issues?utf8=%E2%9C%93&q=is%3Aopen).
+
+## Installation
+
+The Python package is a prerequisite for the R package. So first, follow the instructions [here](https://github.com/h2oai/h2o4gpu#installation) to install the **h2o4gpu** Python package (either at the system level or in a Python virtual environment). The easiest thing to do is to `pip install` either the stable or bleeding edge `whl` file. To ensure compatibility, the Python package version number should match the R package version number.
+
+The R package can be installed from CRAN using `install.packages("h2o4gpu")`.  To install the development version of the **h2o4gpu** R package, you can install directly from GitHub as follows:
+
+```{r, eval = FALSE}
+library(devtools)
+devtools::install_github("h2oai/h2o4gpu", subdir = "src/interface_r")
+```
+
+If the Python package was installed into a virtual environment, you may have to add these two lines of code to the top of your script.  The path you use will be the path of your virtual environment:
+
+```{r, eval = FALSE}
+library(reticulate)
+use_virtualenv("/home/ledell/venv/h2o4gpu")  # set this to the path of your venv
+```
+
+However, if you installed the **h2o4gpu** Python package into the main Python installation on your machine, then these two lines of code will not be necessary.
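+
+Before loading **h2o4gpu**, it can help to confirm that reticulate is bound to the Python installation you expect and that the module is importable from it.  The following is a minimal sketch using only reticulate functions; it makes no calls into **h2o4gpu** itself, and the result depends entirely on your local setup:
+
+```{r, eval = FALSE}
+library(reticulate)
+
+# Show which Python binary and library paths reticulate has selected
+py_config()
+
+# TRUE if the h2o4gpu Python module can be found by that Python, FALSE otherwise
+h2o4gpu_found <- py_module_available("h2o4gpu")
+h2o4gpu_found
+```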
+
+
+## Quickstart
+
+Here's a quick demo of how to train and evaluate a GPU-based Random Forest classifier model.  We will use the classic Iris dataset, which is a three-class classification problem, and evaluate the performance of the model using classification error.
+
+```{r, eval = FALSE}
+library(h2o4gpu)
+library(reticulate)  # only needed if using a virtual Python environment
+use_virtualenv("/home/ledell/venv/h2o4gpu")  # set this to the path of your venv
+
+# Prepare data
+x <- iris[1:4]
+y <- as.integer(iris$Species) # all columns, including the response, must be numeric
+
+# Initialize and train the classifier
+model <- h2o4gpu.random_forest_classifier() %>% fit(x, y)
+
+# Make predictions
+pred <- model %>% predict(x)
+
+# Compute classification error using the Metrics package (note this is training error)
+library(Metrics)
+ce(actual = y, predicted = pred)
+```
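+
+Classification error is simply the fraction of predictions that differ from the true labels.  As a sketch of what `Metrics::ce()` reports, the same quantity can be computed in base R; the vectors below are made-up toy values, not output from the model above:
+
+```{r, eval = FALSE}
+# Toy labels for illustration only (hypothetical values)
+toy_actual    <- c(1L, 1L, 2L, 2L, 3L, 3L)
+toy_predicted <- c(1L, 2L, 2L, 2L, 3L, 1L)
+
+# Fraction of mismatches between predicted and actual labels
+toy_error <- mean(toy_actual != toy_predicted)
+toy_error  # 2 of the 6 labels differ
+```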
+
+
+## Supervised Learning
+
+**H2O4GPU** contains a collection of popular algorithms for supervised learning: Random Forest, Gradient Boosting Machine (GBM) and Generalized Linear Models (GLMs) with Elastic Net regularization.  There are methods for regression and classification for each of these algorithms.  Both Random Forest and GBM support multiclass classification; however, the GLM currently only supports binomial classification (a ticket for multinomial support is open [here](https://github.com/h2oai/h2o4gpu/issues/505)).
+
+The tree-based models (Random Forest and GBM) are built on top of the very powerful [XGBoost](https://xgboost.readthedocs.io/en/latest/) library, and the Elastic Net GLM has been built upon the POGS solver.  [Proximal Graph Solver (POGS)](http://stanford.edu/%7Eboyd/papers/pogs.html) is a solver for convex optimization problems in graph form using the Alternating Direction Method of Multipliers (ADMM).  We have found that this method is not as fast as we'd like it to be, so we are working on implementing an entirely new GLM from scratch (follow progress [here](https://github.com/h2oai/h2o4gpu/issues/356)).
+
+The **h2o4gpu** R package does not include a suite of internal model metrics functions, therefore we encourage users to use a third-party model metrics package of their choice.  For all the examples below, we will use the [Metrics](https://cran.r-project.org/web/packages/Metrics/index.html) R package.  This package has a large number of model metrics functions, all with a very simple, unified API.
+
+### Binary Classification
+
+In this example, we will train and test three different models on a subset of the [HIGGS](https://archive.ics.uci.edu/ml/datasets/HIGGS) dataset.  The goal in this dataset is to distinguish between signal "1" and background "0", so this is a binary classification problem.  The features are all numeric.
+
+**H2O4GPU** requires all feature and response columns to be numeric, so in this case, we don't have to do any pre-processing of the data.  If your response column is a factor, then you can simply convert the levels to integer values using `as.integer()`.  If you have categorical/factor columns among your features, you must apply an encoding method to convert the columns into numeric data.  Some options are label encoding (simply convert the levels to integers) or one-hot encoding (binary indicator columns, one for each categorical level).  For simplicity, in this tutorial, we will always use label encoding; you can read more about different types of encodings [here](https://dzone.com/articles/handling-character-data-for-machine-learning).
+
+```{r, eval = FALSE}
+# Load a sample dataset for binary classification
+# Source: https://archive.ics.uci.edu/ml/datasets/HIGGS
+train <- read.csv("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
+test <- read.csv("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")
+
+# Create train & test sets (column 1 is the response)
+x_train <- train[, -1]
+y_train <- train[, 1]
+x_test <- test[, -1]
+y_test <- test[, 1]
+```
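+
+To make the two encoding options described above concrete, here is a small base R sketch on a made-up data frame (the `toy` data and its `color` column are hypothetical and unrelated to HIGGS, which needs no encoding):
+
+```{r, eval = FALSE}
+# Hypothetical data frame with one categorical feature
+toy <- data.frame(color = factor(c("red", "green", "blue", "green")),
+                  size  = c(1.5, 2.0, 0.7, 1.1))
+
+# Label encoding: replace each level with its integer code
+# (levels are sorted alphabetically: blue = 1, green = 2, red = 3)
+toy_label <- toy
+toy_label$color <- as.integer(toy$color)
+
+# One-hot encoding: one 0/1 indicator column per level ("- 1" drops the intercept)
+toy_onehot <- cbind(model.matrix(~ color - 1, data = toy), size = toy$size)
+
+toy_label
+toy_onehot
+```
+
+Label encoding keeps a single column but imposes an arbitrary order on the levels, whereas one-hot encoding avoids that order at the cost of one column per level.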
+
+Below we see that the **h2o4gpu** modeling functions follow a two-phased functional approach.  The two-phased approach to modeling (first initialize the model, then train it) is more common in Python, and we borrow that paradigm here.  We blend this with the functional pipe syntax in R.
+
+First, we define the model with its hyperparameters, for example, `h2o4gpu.gradient_boosting_classifier(n_estimators = 500, subsample = 0.8)`.  Then we pipe the initialized model object to the `fit(x, y)` function to train the model, and save the resulting object.
+
+```{r, eval = FALSE}
+# Train three different binary classification models
+model_gbc <- h2o4gpu.gradient_boosting_classifier() %>% fit(x_train, y_train)
+model_rfc <- h2o4gpu.random_forest_classifier() %>% fit(x_train, y_train)
+model_enc <- h2o4gpu.elastic_net_classifier() %>% fit(x_train, y_train) 
+```
+
+
+We pipe our trained models to the familiar `predict()` method.  In binary classification, we are often more interested in the numeric predicted values than in the predicted class labels.  We follow the same design as the `predict()` function in the popular **caret** package, which lets the user choose the type of predictions to return via the `type` argument.  This defaults to `"raw"`, which in classification yields predicted class labels.  When set to `"prob"`, it returns the (uncalibrated) class probabilities.  This is not mentioned often in modeling software documentation, but you should note that despite the term "probabilities", these predicted values are not guaranteed to represent actual probabilities unless a calibration method such as [Platt scaling](https://en.wikipedia.org/wiki/Platt_scaling) is applied.  This caveat applies across machine learning packages, including **caret**, **h2o**, and **h2o4gpu** (though we do offer the option to perform Platt scaling inside the **h2o** R package).
+
+```{r, eval = FALSE}
+# Generate predictions (type "prob" gives predicted values instead of predicted label)
+pred_gbc <- model_gbc %>% predict(x_test, type = "prob")
+pred_rfc <- model_rfc %>% predict(x_test, type = "prob")
+pred_enc <- model_enc %>% predict(x_test, type = "prob")
+```
+
+Let's take a look at what the output of the `predict()` function looks like in binary classification. It will be a two-column matrix with the column names set to the names of the classes.
+
+```{r, eval = FALSE}
+head(pred_rfc)
+```
+
+To compute AUC of a binary classification model, we use the predicted values of the second column (the "positive" class) and pass that to the `Metrics::auc()` function.
+
+
+```{r, eval = FALSE}
+# Compare test set performance using AUC
+auc(actual = y_test, predicted = pred_gbc[, 2])
+auc(actual = y_test, predicted = pred_rfc[, 2])
+auc(actual = y_test, predicted = pred_enc[, 2])
+```
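+
+For intuition about the metric, AUC equals the probability that a randomly chosen positive case receives a higher predicted value than a randomly chosen negative case.  The sketch below computes it from ranks in base R on made-up values; `toy_auc()` is a hypothetical helper for illustration, not part of **h2o4gpu** or **Metrics**:
+
+```{r, eval = FALSE}
+# Rank-based (Mann-Whitney) AUC; `actual` must contain only 0s and 1s
+toy_auc <- function(actual, predicted) {
+  r     <- rank(predicted)             # average ranks handle ties
+  n_pos <- sum(actual == 1)
+  n_neg <- sum(actual == 0)
+  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
+}
+
+# Toy example: 3 of the 4 positive/negative pairs are ordered correctly
+toy_auc(actual = c(0, 0, 1, 1), predicted = c(0.1, 0.4, 0.35, 0.8))  # 0.75
+```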
+
+
+### Multiclass Classification
+
+Now that we are familiar with binary classification, there is not much more to say about multiclass classification.  The predict output will have the same format as binary classification, except that if you use `type = "prob"`, the number of columns will match the number of classes.  Often in multiclass classification, you may be interested in the predicted class label and misclassification error, which we've demonstrated already in the Quickstart section.
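+
+If you request `type = "prob"` but also want class labels, one option is to pick the column with the largest value in each row.  The matrix below is made up for illustration and only assumes the layout described above (one column per class, named after the class):
+
+```{r, eval = FALSE}
+# Hypothetical predicted values for 4 observations and 3 classes
+toy_prob <- matrix(c(0.7, 0.2, 0.1,
+                     0.1, 0.6, 0.3,
+                     0.2, 0.3, 0.5,
+                     0.4, 0.5, 0.1),
+                   ncol = 3, byrow = TRUE,
+                   dimnames = list(NULL, c("1", "2", "3")))
+
+# Index of the largest value in each row, mapped back to the class names
+toy_class <- as.integer(colnames(toy_prob)[max.col(toy_prob, ties.method = "first")])
+toy_class  # 1 2 3 2
+```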
+
+
+### Regression
+
+In this next exercise, we will compare a GBM and GLM regression model.  Until [this issue](https://github.com/h2oai/h2o4gpu/issues/493) is resolved, we don't recommend that you use the Random Forest regressor, as there are some bugs that are severely affecting model performance.
+
+We will predict the age of abalone from physical measurements, using the [Abalone](https://archive.ics.uci.edu/ml/datasets/Abalone) dataset.
+
+```{r, eval = FALSE}
+# Load a sample dataset for regression
+# Source: https://archive.ics.uci.edu/ml/datasets/Abalone
+df <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data", header = FALSE)
+str(df)
+```
+
+There is one categorical/factor column in this dataset, so we will first convert those values to integers (label encoding).  Recall that label encoding is just one way of encoding the categorical column and that there may be other ways that produce better results in terms of model performance.
+
+```{r, eval = FALSE}
+df[, 1] <- as.integer(as.factor(df[, 1]))  # label encode the one categorical column (also works if it was read as character)
+```
+
+
+In this case, we started with a single data frame, so we should break the data into train and test splits at random.  We can do that easily in R by sampling 80% of the row indices and subsetting the data frame by row.
+
+```{r, eval = FALSE}
+# Randomly sample 80% of the rows for the training set
+set.seed(1)
+train_idx <- sample(seq_len(nrow(df)), floor(0.8 * nrow(df)))
+
+# Create train & test sets (column 9 is the response)
+x_train <- df[train_idx, -9]
+y_train <- df[train_idx, 9]
+x_test <- df[-train_idx, -9]
+y_test <- df[-train_idx, 9]
+```
+
+
+```{r, eval = FALSE}
+# Train two different regression models
+model_gbr <- h2o4gpu.gradient_boosting_regressor() %>% fit(x_train, y_train)
+model_enr <- h2o4gpu.elastic_net_regressor() %>% fit(x_train, y_train)
+
+# Generate predictions 
+pred_gbr <- model_gbr %>% predict(x_test)
+pred_enr <- model_enr %>% predict(x_test)
+```
+
+In regression, the `predict()` function always returns a vector of predictions (not a data frame).
+
+```{r, eval = FALSE}
+head(pred_gbr)
+```
+
+In regression problems, Mean Squared Error (MSE) is a common metric for model evaluation.  We will use test set MSE to evaluate and compare our two models.
+
+```{r, eval = FALSE}
+# Compare test set performance using MSE
+mse(actual = y_test, predicted = pred_gbr)
+mse(actual = y_test, predicted = pred_enr)
+```
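+
+MSE is the average of the squared differences between the actual and predicted values.  A base R sketch on made-up numbers (not the abalone predictions) shows the computation that `Metrics::mse()` performs, along with RMSE, which is on the same scale as the response:
+
+```{r, eval = FALSE}
+# Toy values for illustration only (hypothetical)
+toy_y    <- c(10, 12, 9, 15)
+toy_yhat <- c(11, 12, 7, 14)
+
+toy_mse  <- mean((toy_y - toy_yhat)^2)  # (1 + 0 + 4 + 1) / 4 = 1.5
+toy_rmse <- sqrt(toy_mse)
+```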
+
+In this case, which is not usual, the GBM drastically outperforms the GLM.
+
+
+## Unsupervised Learning
+
+The unsupervised learning algorithms in **h2o4gpu** include K-Means, Principal Component Analysis (PCA), and Truncated Singular Value Decomposition (SVD).
+
+
+### K-Means Clustering
+
+First we will train a K-Means model.  Let's create a train and test set from the iris dataset.
+
+```{r, eval = FALSE}
+# Prepare data
+iris$Species <- as.integer(iris$Species) # convert to numeric data
+
+# Randomly sample 80% of the rows for the training set
+set.seed(1)
+train_idx <- sample(1:nrow(iris), 0.8*nrow(iris)) 
+train <- iris[train_idx, ]
+test <- iris[-train_idx, ]
+```
+
+Train a K-Means model with three clusters.
+
+```{r, eval = FALSE}
+model_km <- h2o4gpu.kmeans(n_clusters = 3L) %>% fit(train)
+```
+
+Once you have trained a K-Means model, applying the `transform()` function to a dataset transforms your points into distances from each centroid.  So your `n`x`p` matrix becomes `n`x`k` (`n` is the number of observations, `p` the number of features and `k` the number of clusters).
+
+```{r, eval = FALSE}
+test_dist <- model_km %>% transform(test)
+head(test_dist)
+```
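+
+To illustrate the `n`x`p` to `n`x`k` mapping without a GPU, the sketch below uses the CPU-based `stats::kmeans()` from base R (not **h2o4gpu**) and computes the distances by hand.  It assumes Euclidean distance to each centroid, as in scikit-learn's `KMeans.transform()`:
+
+```{r, eval = FALSE}
+# CPU stand-in for illustration: base R k-means on the four numeric iris columns
+x_demo <- as.matrix(iris[, 1:4])
+set.seed(1)
+km_demo <- kmeans(x_demo, centers = 3)
+
+# For each centroid, compute the Euclidean distance from every row of x_demo
+dist_demo <- sapply(seq_len(nrow(km_demo$centers)), function(j) {
+  sqrt(rowSums(sweep(x_demo, 2, km_demo$centers[j, ])^2))
+})
+
+dim(x_demo)     # 150 x 4 (n x p)
+dim(dist_demo)  # 150 x 3 (n x k)
+```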
+
+
+### Principal Component Analysis (PCA)
+
+Let's use the HIGGS train and test datasets again for demonstration.
+
+```{r, eval = FALSE}
+# Load a sample dataset for binary classification
+# Source: https://archive.ics.uci.edu/ml/datasets/HIGGS
+train <- read.csv("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
+test <- read.csv("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")
+```
+
+Train a PCA model with 4 components and apply the transformation onto a dataset.  Once you have created a projection model from a dataset, you can apply that transformation to a new dataset (such as a test set) using the `transform()` function.
+
+```{r, eval = FALSE}
+model_pca <- h2o4gpu.pca(n_components = 4) %>% fit(train)
+test_transformed <- model_pca %>% transform(test)
+```
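+
+For readers without a GPU at hand, the same fit-then-transform pattern can be sketched with the CPU-based `stats::prcomp()` from base R (a stand-in, not **h2o4gpu**).  The `demo_*` names are hypothetical, and the numeric iris columns are used so the example is self-contained:
+
+```{r, eval = FALSE}
+# Split the four numeric iris columns into train and test rows
+demo_x <- as.matrix(iris[, 1:4])
+set.seed(1)
+demo_idx   <- sample(seq_len(nrow(demo_x)), 120)
+demo_train <- demo_x[demo_idx, ]
+demo_test  <- demo_x[-demo_idx, ]
+
+# Learn the projection from the training rows only (keep 2 components)
+demo_pca <- prcomp(demo_train, center = TRUE, scale. = FALSE, rank. = 2)
+
+# Apply the training-set centering and rotation to the unseen test rows
+demo_test_pc <- predict(demo_pca, newdata = demo_test)
+dim(demo_test_pc)  # 30 x 2
+```
+
+Fitting on the training rows and only projecting the test rows keeps information from the test set out of the learned components.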
+
+
+### Truncated Singular Value Decomposition (SVD)
+
+Train a truncated SVD model with 4 components and apply the transformation on a test set.
+
+```{r, eval = FALSE}
+model_tsvd <- h2o4gpu.truncated_svd(n_components = 4) %>% fit(train)
+test_transformed <- model_tsvd %>% transform(test)
+```
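+
+In scikit-learn's formulation, truncated SVD differs from PCA in that it does not center the data before decomposing it.  As a CPU-only sketch with base R's `svd()` (a stand-in, not **h2o4gpu**), the transformation amounts to multiplying the data by the first `k` right singular vectors; the `tsvd_*` names are hypothetical:
+
+```{r, eval = FALSE}
+tsvd_x <- as.matrix(iris[, 1:4])
+tsvd_k <- 2
+
+# Keep only the first k right singular vectors (no centering is applied)
+tsvd_fit <- svd(tsvd_x, nu = 0, nv = tsvd_k)
+
+# Project the data onto those k directions: n x p becomes n x k
+tsvd_scores <- tsvd_x %*% tsvd_fit$v
+dim(tsvd_scores)  # 150 x 2
+```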