Part 1. Getting Started
The first step is to load the package and the iris data set. The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other.
install.packages('randomForest')
library(randomForest)
data(iris)
head(iris)
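As a quick check of the description above, this short sketch uses only base R to confirm the column types and the number of rows per class. The variable name class_counts is just an illustrative choice.

```r
# Load the built-in iris data set
data(iris)

# Column types: four numeric measurements and one factor (Species)
str(iris)

# Number of rows in each of the three classes
class_counts <- table(iris$Species)
print(class_counts)
```

Each class should show 50 rows, matching the description of the data set.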
Part 2. Fit Model
Now that we know what our data set contains, let's fit our first model. We will fit 500 trees in our forest and try to classify the Species of each iris in the data set. For the randomForest() function, "~ ." means use all the other variables in the data frame as predictors.
Note: a common beginner mistake is trying to classify a categorical variable that R sees as a character. To fix this, convert the variable to a factor, like this: randomForest(as.factor(Species) ~ ., iris, ntree=500)
fit <- randomForest(Species ~ ., iris, ntree=500)
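The factor conversion mentioned in the note can also be done once on the data frame instead of inside the formula. The sketch below is a base R illustration; iris_chr is a hypothetical copy of the data in which Species has been stored as text, as might happen after reading a file.

```r
data(iris)

# Hypothetical copy in which Species is stored as text instead of a factor
iris_chr <- iris
iris_chr$Species <- as.character(iris_chr$Species)
class(iris_chr$Species)   # "character"

# Convert the label column to a factor once, up front
iris_chr$Species <- as.factor(iris_chr$Species)
class(iris_chr$Species)   # "factor"
levels(iris_chr$Species)
```

With the column stored as a factor, the formula can stay as Species ~ . when the data frame is passed to randomForest().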
The next step is to use the newly created model, stored in the fit variable, to predict the labels.
results <- predict(fit, iris)
summary(results)
After you have the predicted labels in a vector (results), the predicted and actual labels must be compared. This can be done with a confusion matrix: a table of actual vs. predicted labels in which the diagonal cells count correctly classified elements and every other cell counts a misclassification.
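To make the layout of a confusion matrix concrete, here is a tiny base R example. The two label vectors are invented for illustration and are not output from the model; the names actual, predicted and toy_matrix are likewise only placeholders.

```r
# Hand-made labels for illustration only (not model output)
classes <- c("setosa", "versicolor", "virginica")
actual <- factor(
  c("setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"),
  levels = classes
)
predicted <- factor(
  c("setosa", "setosa", "versicolor", "virginica", "virginica", "virginica"),
  levels = classes
)

# Rows are predicted labels, columns are actual labels
toy_matrix <- table(Predicted = predicted, Actual = actual)
print(toy_matrix)

# Diagonal cells are correct predictions; every other cell is an error
correct <- sum(diag(toy_matrix))
incorrect <- sum(toy_matrix) - correct
```

Here one versicolor flower was predicted as virginica, so it lands off the diagonal while the other five predictions sit on it.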
Now we can sum the diagonal cells of the table, which gives the total number of correctly classified instances. Dividing this number by the total number of instances gives the proportion of predictions that were correct.
# Confusion matrix of predicted vs. actual labels
conf_matrix <- table(results, iris$Species)
conf_matrix
# Calculate the accuracy: the diagonal holds the correctly classified counts
correctly_classified <- sum(diag(conf_matrix))
total_classified <- length(results)
# Accuracy
correctly_classified / total_classified
# Error
1 - (correctly_classified / total_classified)
Part 3. Validate Model
The next step is to validate the prediction model. Validation requires splitting your data into two sections. The first is the training set, which is used to create the model. The second is the test set, which is used to measure the accuracy of the model. The reasoning is that the model is built on one part of the data while the rest, where the output is already known, is held back to "test" it. Because the test rows are never seen during training, the accuracy measured on them is a better estimate than one computed on the same data used to create the model.
# How to split into a training set and a test set
rows <- nrow(iris)
row_numbers <- c(1:rows)
# Give every row a random ID between 1 and rows. The IDs are kept in their own
# vector rather than as a column of iris, so that "Species ~ ." does not
# pick up Row_ID as a predictor.
Row_ID <- sample(row_numbers, rows, replace = FALSE)
# Choose the percent of the data to be used in the training data
training_set_size <- 0.80
# Now split the data into training and test
index_percentile <- rows * training_set_size
# If the Row ID is smaller than or equal to the index percentile, the row is assigned to the training set
train <- iris[Row_ID <= index_percentile, ]
# If the Row ID is larger than the index percentile, the row is assigned to the test set
test <- iris[Row_ID > index_percentile, ]
train_data_rows <- nrow(train)
test_data_rows <- nrow(test)
total_data_rows <- nrow(train) + nrow(test)
train_data_rows / total_data_rows # Now we have 80% of the data in the training set
test_data_rows / total_data_rows # Now we have 20% of the data in the test set
# Now let's build the random forest using the train data set
fit <- randomForest(Species ~ ., train, ntree=500)
# Use the new model to predict the test set
results <- predict(fit, test, type="response")
# Confusion matrix
conf_matrix <- table(results, test$Species)
conf_matrix
# Calculate the accuracy: the diagonal holds the correctly classified counts
correctly_classified <- sum(diag(conf_matrix))
total_classified <- length(results)
# Accuracy
correctly_classified / total_classified
# Error
1 - (correctly_classified / total_classified)
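The same 80/20 split can be written more compactly by sampling row indices directly. This is a sketch of one alternative, not a replacement for the steps above; the names train_idx, train_alt and test_alt are hypothetical, and the seed value 42 is arbitrary and only there so the split is the same on every run.

```r
data(iris)

# Fix the random seed so the same rows are selected on every run
set.seed(42)

training_set_size <- 0.80
n_rows <- nrow(iris)

# Draw 80% of the row numbers at random, without replacement
train_idx <- sample(seq_len(n_rows), size = floor(training_set_size * n_rows))

train_alt <- iris[train_idx, ]   # rows that were drawn
test_alt  <- iris[-train_idx, ]  # all remaining rows

nrow(train_alt)
nrow(test_alt)
```

Because the test rows are simply the ones that were not drawn, the two sets cannot overlap.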
Part 4. Model Analysis
After the model is created, it is important to understand how much each variable contributes and how the error changes with the number of trees. R makes it easy to plot the errors of the model as the number of trees increases. This allows users to trade off between more trees and accuracy or fewer trees and lower computational time.
fit <- randomForest(Species ~ ., train, ntree=500)
results <- predict(fit, test, type="response")
# Rank the input variables based on their effectiveness as predictors
varImpPlot(fit)
# To understand the error rate, let's plot the model's error as the number of trees increases
plot(fit)
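The numbers behind the two plots can also be read directly from the fitted object. The sketch below assumes the randomForest package is installed and, to stay self-contained, fits a separate model on the full iris data under the hypothetical name fit_demo with an arbitrary seed. It reads the out-of-bag (OOB) error for each forest size from err.rate and prints the importance scores.

```r
library(randomForest)
data(iris)
set.seed(42)

# Separate demo model so the tutorial's fit object is left untouched
fit_demo <- randomForest(Species ~ ., iris, ntree = 500)

# err.rate has one row per forest size; the "OOB" column is the out-of-bag error
oob_error <- fit_demo$err.rate[, "OOB"]

# Smallest number of trees that reaches the lowest OOB error
best_ntree <- which.min(oob_error)
best_ntree
oob_error[best_ntree]

# Importance scores of the kind shown by varImpPlot()
importance(fit_demo)
```

If the error is already flat well before 500 trees, a smaller ntree may give similar accuracy with less computation.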
Part 5. Handling Missing Values
The last section of this tutorial covers one of the most time-consuming and important parts of the data analysis process: missing values. Very few machine learning algorithms can handle missing values in the data. However, the randomForest package contains one of the most useful functions of all time, na.roughfix(). na.roughfix() replaces the NAs in each numeric column with that column's median and the NAs in each factor column with its most frequent level. For this section we will first create some NAs in the data set, then replace them and run the prediction algorithm.
# Create some NAs in the data.
iris.na <- iris
# In each of the four measurement columns, set a random number (1 to 20) of rows to NA
for (i in 1:4) iris.na[sample(150, sample(20, 1)), i] <- NA
# Now we have a data frame with NAs
View(iris.na)
# Adding na.action=na.roughfix
# For numeric variables, NAs are replaced with column medians.
# For factor variables, NAs are replaced with the most frequent levels (breaking ties at random)
iris.narf <- randomForest(Species ~ ., iris.na, na.action=na.roughfix)
results <- predict(iris.narf, train, type="response")
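To see what the median rule amounts to for a numeric column, here is a base R sketch that performs the replacement by hand on a single column. It illustrates the behavior described above and is not the package's actual implementation; iris_demo and col_median are names invented for this example, and the seed is arbitrary.

```r
data(iris)
set.seed(42)

# Set ten measurements in one numeric column to NA
iris_demo <- iris
iris_demo$Sepal.Length[sample(nrow(iris_demo), 10)] <- NA
sum(is.na(iris_demo$Sepal.Length))   # 10 missing values

# Median of the observed values, ignoring the NAs
col_median <- median(iris_demo$Sepal.Length, na.rm = TRUE)

# Fill every NA in the column with that median
iris_demo$Sepal.Length[is.na(iris_demo$Sepal.Length)] <- col_median
sum(is.na(iris_demo$Sepal.Length))   # 0 missing values
```

This kind of fill is quick but crude, since every missing entry in a column receives the same value.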
Congratulations! You now know how to create machine learning models, fit data using those models, test a model’s accuracy and display it in a confusion matrix, validate the model, and quickly replace missing values. These are fundamental skills in machine learning!