Random forests, or random decision forests, are a supervised ensemble learning method used for classification, regression and other tasks. Random forest is a bagging technique, not a boosting technique.
- It runs efficiently on large databases.
- It gives estimates of which variables are important in the classification (see the sketch after this list).
- It is one of the most accurate learning algorithms available. For many data sets, it produces a highly accurate classifier.
- It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing.
- Random forests have been observed to overfit for some datasets with noisy classification/regression tasks.
- For data including categorical variables with different numbers of levels, random forests are biased in favor of attributes with more levels. Therefore, the variable importance scores from a random forest are not reliable for this type of data.
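To make the variable-importance point concrete, here is a minimal sketch using scikit-learn's RandomForestRegressor; the synthetic data from make_regression and every value in it are purely illustrative and are not part of this tutorial's dataset.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, purely for illustration
X, y = make_regression(n_samples=200, n_features=5, n_informative=2, random_state=0)

# An ensemble of 100 trees, each trained on a bootstrap sample (bagging)
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, y)

# One importance score per feature; higher means the feature mattered more
print(forest.feature_importances_)
```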
The given dataset has 3 columns:
- Position
- Level
- Salary
The Salary is determined by the Position and the Level. For example, if an employee is a Manager, he/she falls in Level 4 and should get around $80,000.
Libraries consist of pre-built functions that take user input and produce the desired output.
Four essential libraries:
- numpy: For performing mathematical functions.
- matplotlib: For creating charts for visualization.
- pandas: To import, manage and manipulate data and files.
- sklearn: To split the dataset and later perform scaling of test and training data.
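A typical import block for this workflow might look as follows; train_test_split and StandardScaler correspond to the splitting and scaling steps described later, and the exact module paths assume a recent scikit-learn release.

```python
import numpy as np                                     # mathematical operations
import pandas as pd                                    # importing and managing the data
import matplotlib.pyplot as plt                        # charts for visualization
from sklearn.model_selection import train_test_split  # splitting the dataset
from sklearn.preprocessing import StandardScaler       # scaling the features
from sklearn.ensemble import RandomForestRegressor     # the random forest model
```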
In R, packages can also be installed via the GUI package explorer in RStudio. Generally, all the basic libraries we need are loaded automatically.
Three essential libraries:
- caTools: To split the dataset into Training set and Test set.
- randomForest: To fit Random Forest Regression to the dataset.
- ggplot2: To visualize the Random Forest Regression result.
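The three packages can be installed and loaded from the R console as shown below; the install lines only need to be run once per machine.

```r
# Install once per machine
install.packages("caTools")       # splitting the dataset
install.packages("randomForest")  # Random Forest Regression
install.packages("ggplot2")       # visualization

# Load in every session
library(caTools)
library(randomForest)
library(ggplot2)
```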
Steps to build the model in Python:
- Import the required libraries.
- Import and print the dataset.
- Split the data into a training set and a test set.
- Scale the data using StandardScaler.
- Select all rows of column 1 (Level) from the dataset as x and all rows of column 2 (Salary) as y:
  x = data.iloc[:, 1:2].values
  print(x)
  y = data.iloc[:, 2].values
- Fit the Random Forest regressor to the dataset:
  regressor = RandomForestRegressor(n_estimators = 100, random_state = 0)
  regressor.fit(x, y)
- Predict a new result. Note that predict expects a 2-D array, so the level is wrapped in double brackets:
  y_pred = regressor.predict([[6.5]])
- Visualise the result using pyplot (see the sketch after this list).
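Putting the Python steps together, here is a minimal end-to-end sketch. The CSV filename Position_Salaries.csv is an assumption; the splitting and scaling steps listed above are skipped here because the snippet in this section fits the regressor on the full x and y, and tree ensembles do not require feature scaling.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor

# Import and print the dataset (filename assumed)
data = pd.read_csv("Position_Salaries.csv")
print(data)

# Level as the feature matrix x, Salary as the target y
x = data.iloc[:, 1:2].values
y = data.iloc[:, 2].values

# Fit the Random Forest regressor to the dataset
regressor = RandomForestRegressor(n_estimators=100, random_state=0)
regressor.fit(x, y)

# Predict the salary for Level 6.5 (predict expects a 2-D array)
y_pred = regressor.predict([[6.5]])
print(y_pred)

# Visualise the result on a fine grid of levels
x_grid = np.arange(x.min(), x.max(), 0.01).reshape(-1, 1)
plt.scatter(x, y, color="red")
plt.plot(x_grid, regressor.predict(x_grid), color="blue")
plt.title("Random Forest Regression")
plt.xlabel("Level")
plt.ylabel("Salary")
plt.show()
```

Predicting on a fine grid of levels (x_grid) rather than only the original points makes the step-like shape of the random forest prediction visible in the plot.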
Steps to build the model in R:
- Load the dataset. Looking at the data, we need to predict the salary for an employee who falls under Level 6.5, so we do not really need the first column, "Position":
  dataset = dataset[2:3]
- Split the data into a training set and a test set. Install the caTools library to split the data.
- Scale the training_set and test_set.
- Fit Random Forest Regression to the dataset. Install the randomForest package.
- Predict the new result.
- Visualise the Random Forest Regression result in high quality (see the sketch after this list).
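The same workflow in R, again as a sketch: the CSV filename is an assumption, and the regressor is fitted on the whole dataset, mirroring the Python version.

```r
library(randomForest)
library(ggplot2)

# Load the dataset and drop the Position column (keep Level and Salary)
dataset = read.csv("Position_Salaries.csv")
dataset = dataset[2:3]

# Fit Random Forest Regression to the dataset
set.seed(1234)
regressor = randomForest(x = dataset[1],
                         y = dataset$Salary,
                         ntree = 100)

# Predict the salary for Level 6.5
y_pred = predict(regressor, data.frame(Level = 6.5))
print(y_pred)

# Visualise the result in high resolution
x_grid = seq(min(dataset$Level), max(dataset$Level), 0.01)
ggplot() +
  geom_point(aes(x = dataset$Level, y = dataset$Salary), colour = "red") +
  geom_line(aes(x = x_grid,
                y = predict(regressor, newdata = data.frame(Level = x_grid))),
            colour = "blue") +
  ggtitle("Random Forest Regression") +
  xlab("Level") +
  ylab("Salary")
```

Note that predict() is called on a data frame whose column name (Level) matches the column the model was trained on, both for the single value 6.5 and for the fine x_grid used in the plot.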