Sunday, February 28, 2016

Introduction to R: For Beginners Who Want to be Intermediate



This semester, I am managing an undergraduate on our research team. She has a basic engineering statistics background with a little experience doing analysis in Excel and understands some basic programming concepts. For her project with our group, she will have to learn R. 

I tried to compile a list of high-quality videos and tutorials. While searching, I was surprised by two things: how many more resources there are now than when I started learning R, and the giant gap between beginner tutorials and the skills needed to be proficient in R. Many beginner tutorials explain how to assign or print a variable but offer no introduction to real analysis or real-world application of these skills. Also, many of the tutorials only cover base R functions, for example plot() instead of ggplot2's ggplot(), which any good Hadley Wickham worshipper knows is akin to sin (for those of you who are just starting out, Hadley Wickham created ggplot2 and some of the most used R packages of all time). So I decided to post my compilation of tutorials and videos that will take you smoothly from knowing nothing about R to being an intermediate R programmer.


Step 1: Download and Setup
Download R  -   LINK
This is the basic program to run R on your computer.

Download RStudio  -   LINK
This is an IDE, or integrated development environment, for R. Always open your R files in RStudio rather than in basic R, because RStudio tracks variables, stores past graphs, and has many other features that you will use heavily when analyzing data.

Step 2: Understand the Basics
Follow this tutorial for R  -  LINK
This is all online but it gets the basics of the language down.

Do the Introduction to R tutorial and Data Manipulation in R with dplyr. If you have trouble finding them on the site, the links to each class are here.
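To give a flavor of what the dplyr course covers, here is a minimal sketch using the built-in mtcars data set. The data set, the hp > 100 cutoff, and the column names mean_mpg and n_cars are my own choices for illustration, not taken from the course.

```r
library(dplyr)

# Keep cars with more than 100 horsepower, then summarise fuel economy
# for each cylinder count and sort from most to least efficient.
summary_by_cyl <- mtcars %>%
  filter(hp > 100) %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), n_cars = n()) %>%
  arrange(desc(mean_mpg))

print(summary_by_cyl)
```

Each verb (filter, group_by, summarise, arrange) does one small job, and the %>% pipe passes the result of one step to the next.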

Step 2.5: Really a tip more than a step
The rest of the steps take place inside RStudio. To run a line of code, click anywhere on that line and press Ctrl + Enter on Windows or Command + Enter on Mac. The line is sent to the console window and run. This will make more sense once you watch the later videos.
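If you want something to practice the shortcut on, here are a few lines of base R. The variable names and the numbers are made up for illustration.

```r
# Put the cursor on each line in turn and press Ctrl + Enter
# (Command + Enter on Mac) to send it to the console.
heights_cm <- c(160, 172, 181)    # assign a numeric vector to a variable
mean_height <- mean(heights_cm)   # compute the average of the vector
print(mean_height)                # show the result in the console
```

After running the first line, heights_cm should also show up in RStudio's Environment pane.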

Step 3: Introduction to R and RStudio

Step 4: Data Visualization and Generating Graphs
R's functionality is organized into collections of functions called “packages”. There are packages for everything in R, from accessing databases to creating websites to building predictive models to generating plots.
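A package has to be installed once and then loaded in every new R session. The sketch below wraps that pattern in a helper; ensure_package is a hypothetical name of my own, not a function that ships with R.

```r
# Hypothetical helper: install a package only if it is missing, then load it.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # downloads from CRAN; needs an internet connection
  }
  library(pkg, character.only = TRUE)  # attach the package to the session
  invisible(TRUE)
}

# "stats" ships with R, so nothing is downloaded in this example.
ensure_package("stats")
```

In your own scripts you would call it with the package you need, for example ensure_package("ggplot2").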

The most common package for data visualization is ggplot2. ggplot2 allows R users to quickly plot data stored in data frames. The following video steps you through using qplot (which stands for quick plot and is one of ggplot2's most popular functions).

Introduction to R Programming - Module 8 (qplot)  -  LINK
The data for this video can be downloaded here and is called “bank.zip”: http://mlr.cs.umass.edu/ml/machine-learning-databases/00222/
The file you want for this video is bank-full.csv.

In the video, replace the line:
attach(bank.marketing)

with the following code:
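# Install ggplot2 (only needed the first time), then load it
install.packages("ggplot2")
library(ggplot2)
# Pick bank-full.csv in the file dialog, then read the ";"-separated file
file_path <- file.choose()
bank.marketing <- read.csv2(file_path)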

This is how you read a CSV data file whose fields are separated by “;” into R. Note that read.csv2() also treats “,” as the decimal mark; if the file were separated by commas, the command would be read.csv(). The data from the selected CSV file is now in the data frame “bank.marketing”.
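To see the difference between the two readers without downloading anything, the sketch below writes two tiny files to a temporary folder and reads each with the matching function. The column names and values are made up for illustration.

```r
# Write two tiny made-up files to a temporary location.
semi_file  <- tempfile(fileext = ".csv")
comma_file <- tempfile(fileext = ".csv")
writeLines(c("age;balance", "30;1,5", "45;2,25"), semi_file)
writeLines(c("age,balance", "30,1.5", "45,2.25"), comma_file)

semi_data  <- read.csv2(semi_file)   # fields split on ";", decimals use ","
comma_data <- read.csv(comma_file)   # fields split on ",", decimals use "."

# Both calls should produce the same data frame.
print(all.equal(semi_data, comma_data))
str(semi_data)
```

If a file mixes the two conventions (";" between fields but "." for decimals), read.csv(file, sep = ";") is one way to handle it.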

The command file.choose() is useful when the path to the file is unknown: it pops up a file browser so you can select the desired file.
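Once the data is in a data frame, plotting it takes only a few lines. The video uses qplot; as far as I know, qplot has been marked as deprecated in newer ggplot2 releases (3.4.0 onward), so this sketch uses the ggplot() form instead. It plots the built-in mtcars data set rather than the bank data so that it runs without any download; the choice of columns and labels is mine.

```r
library(ggplot2)

# Scatter plot of weight vs. fuel economy, colored by number of cylinders.
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders")

print(p)  # in RStudio the plot appears in the Plots pane
```

The same pattern should work for the bank data: pass bank.marketing as the first argument and name two of its columns inside aes().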

Step 5: Intro to Data Science
This is a great video for learning data analysis in R. It has three parts, and I recommend watching all three.


Step 6: Intro to Web scraping
R Web Page Scraping Example Video

Web scraping is super easy in R and lets you grab data that others cannot. It is often left out of intro R tutorials, which is a shame.
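Here is a minimal sketch of what scraping looks like, assuming the rvest package (the video may well use a different one). To keep it runnable offline, it parses a small made-up page instead of a real website; the heading, names, and numbers are invented for illustration.

```r
library(rvest)

# A tiny made-up page, so the example runs without an internet connection.
page <- minimal_html('
  <h1>Team roster</h1>
  <table>
    <tr><th>name</th><th>points</th></tr>
    <tr><td>Ann</td><td>12</td></tr>
    <tr><td>Bob</td><td>7</td></tr>
  </table>
')

heading <- html_text2(html_element(page, "h1"))     # text of the first <h1>
roster  <- html_table(html_element(page, "table"))  # table as a data frame

print(heading)
print(roster)
```

For a real site you would replace minimal_html(...) with read_html() and the page's address; the html_element() and html_table() calls stay the same.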


Step 7: Introduction to Machine Learning with Random Forest
A quick introductory tutorial on machine learning algorithms and best practices.
http://www.hallwaymathlete.com/2016/05/introduction-to-machine-learning-with.html
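As a taste of what a random forest looks like in code, here is a minimal sketch assuming the randomForest package and the built-in iris data set. The linked tutorial may use a different package, data set, and settings; the 100-row training split, the seed, and ntree = 200 are my own choices.

```r
library(randomForest)

set.seed(42)  # make the random split and the forest reproducible
train_rows <- sample(nrow(iris), size = 100)  # 100 of 150 rows for training
train <- iris[train_rows, ]
test  <- iris[-train_rows, ]

# Predict the species from the four flower measurements.
model <- randomForest(Species ~ ., data = train, ntree = 200)

# Check the model on rows it has never seen.
predictions <- predict(model, newdata = test)
accuracy <- mean(predictions == test$Species)
print(accuracy)
```

Holding back a test set, as done here, is the simplest way to check that the model has learned something general rather than memorized the training rows.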

Step 8: Machine Learning Course
Hands down the best machine learning course on the planet. 

At this point, you are a full data scientist and you can gather and analyze any data on the web!
Hallway Mathlete Data Scientist

I am a PhD student in Industrial Engineering at Penn State University. I did my undergrad at Iowa State in Industrial Engineering and Economics. My academic website can be found here.