Monday, May 2, 2016

Introduction to Machine Learning with Random Forest

The purpose of this tutorial is to serve as a introduction to the randomForest package in R and some common analysis in machine learning.

Part 1. Getting Started

First step, we will load the package and iris data set. The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other.

Part 2. Fit Model

Now that we know what our data set contains, let fit our first model. We will be fitting 500 trees in our forest and trying to classify the Species of each iris in the data set. For the randomForest() function, "~." means use all the variables in the data frame.

Note: a common mistake, made by beginners, is trying to classify a categorical variable that R sees as a character. To fix this, convert the variable to a factor like this randomForest(as.factor(Species) ~ ., iris, ntree=500)

fit <- randomForest(Species ~ ., iris, ntree=500)

The next step is to use the newly create model in the fit variable and predict the label.

results <- predict(fit, iris)

After you have the predicted labels in a vector (results), the predict and actual labels must be compared. This can be done with a confusion matrix. A confusion matrix is a table of the actual vs the predicted with the diagonal numbers being correctly classified elements while all others are incorrect.

# Confusion Matrix
 table(results, iris$Species) 

Now we can take the diagonal points in the table and sum them, this will give us the total correctly classified instances. Then dividing this number by the total number of instances will calculate the percentage of prediction correctly classified. <- -="" 1="" accuracy="" correctly_classified="" div="" error="" iris="" length="" pecies="" results="" style="overflow: auto;" table="" total_classified="">

# Calculate the accuracy  
correctly_classified <- table(results, iris$Species)[1,1] + table(results, iris$Species)[2,2] + table(results, iris$Species)[3,3] 
total_classified <- length(results)  
# Accuracy   
correctly_classified / total_classified   
# Error  
1 - (correctly_classified / total_classified)

Part 3. Validate Model 

The next step is to validate the prediction model. Validation requires splitting your data into two sections. First, the training set, which will be used to create the model. The second will be the test set and will test the accuracy of the prediction model. The reasoning for splitting the data is to allow a model to be created using one data set and then reserving some data, where the output is already known, to "test" the model accuracy. This more effectively estimates the accuracy of the model by not using the same data used to create the model and predict the accuracy.

# How to split into a training set   
rows <- nrow(iris) col_count <- c(1:rows)  
Row_ID <- sample(col_count, rows, replace = FALSE)   
iris$Row_ID <- Row_ID   
# Choose the percent of the data to be used in the training 
data training_set_size = .80   
#Now to split the data into training and test 
 index_percentile <- rows*training_set_size   
# If the Row ID is smaller then the index percentile, it will be assigned into the training set   
train <- iris[iris$Row_ID <= index_percentile,]   
# If the Row ID is larger then the index percentile, it will be assigned into the training set   
test <- iris[iris$Row_ID > index_percentile,]   
train_data_rows <- nrow(train)   
test_data_rows <- nrow(test)   
total_data_rows <- (nrow(train)+nrow(test)) train_data_rows / total_data_rows     
# Now we have 80% of the data in the training set  test_data_rows / total_data_rows     
# Now we have 20% of the data in the training set   
# Now lets build the randomforest using the train data set  
fit <- randomForest(Species ~ ., train, ntree=500) 

After the test set is predicted, a confusion matrix and accuracy must be calculated.

# Use the new model to predict the test set   
results <- predict(fit, test, type="response")   
# Confusion Matrix  
table(results, test$Species)  
# Calculate the accuracy   
correctly_classified <- table(results, test$Species)[1,1] + table(results, test$Species)[2,2] + table(results, test$Species)[3,3] total_classified <- length(results)   
# Accuracy   
correctly_classified / total_classified  
# Error  
1 - (correctly_classified / total_classified)

Part 4. Model Analysis 

After the model is created, understanding the relationship between variables and number of trees is important. R makes it easy to plot the errors of the model as the number of trees increase. This allows users to trade off between more trees and accuracy or fewer trees and lower computational time.

fit <- randomForest(Species ~ ., train, ntree=500)   
results <- predict(fit, test, type="response")  
# Rank the input variables based on their effectiveness as predictors   
# To understand the error rate lets plot the model's error as the number of trees increases  

Part 5. Handling Missing Values 

The last section of this tutorial involves one of the most time consuming and important parts of the data analysis process, missing variables. Very few machine learning algorithms can handle missing data in the data. However the randomForest package contains one of the most useful functions of all time, na.roughfix(). Na.roughfix() takes the most common factor in that column and replaces all the NAs with it. For this section we will first create some NAs in this data set and then replace them and run the prediction algorithm.

# Create some NA in the data. <- iris for (i in 1:4)[sample(150, sample(20)), i] <- NA   
# Now we have a dataframe with NAs   
#Adding na.action=na.roughfix   
#For numeric variables, NAs are replaced with column medians.   
#For factor variables, NAs are replaced with the most frequent levels (breaking ties at random) 
iris.narf <- randomForest(Species ~ .,, na.action=na.roughfix) 
results <- predict(iris.narf, train, type="response")

Congratulations! You now know how to create machine learning models, fit data using those models, test the model’s accuracy and display it in a confusion matrix, how to validate the model, and quickly replace missing variables. All of these are the basic fundamental skills in machine learning!
Hallway Mathlete Data Scientist

I am a PhD student in Industrial Engineering at Penn State University. I did my undergrad at Iowa State in Industrial Engineering and Economics. My academic website can be found here.