Part 1 Before diving in
Let’s make sure we load up our packages for data exploration and reading an Excel file.
# Loading necessities
library(tidyverse)
library(readxl)
#library(dplyr) # for coding with data
#library(ggplot2) # for graphing
And import the data…
#rm(list=ls())
nshap <- read_excel("P:/5707/5707B/Administrative/Research Assistant Essentials/NSHAP Stata Training Materials/R tutorial winter 2019/nshap_w3_core.xlsx") # make sure to use quotes!
#head(nshap)
You can uncomment #head(nshap) to preview the dataframe object and make sure it loaded correctly.
1.1 General syntax for R
Entering commands in R is similar to other languages, with its own quirks. The lines below show the general form of an R function call and the variations that R understands:
function(argument1 = value1, argument2 = value2) # original command
function(value1, value2) # correct
function(argument2 = value2, argument1 = value1) # correct, safer option
function(value2, value1) # wrong
# Think of the above this way:
regress(data = nshap, y = income, x = age)
regress(x = age, data = nshap, y = income)
# Note: function "regress" does not exist in R -- look into "lm" which stands for linear model
This can be easier than Stata. Stata requires some arguments in a fixed order (for example, the y variable must come first), and all options (for example, robust) must come at the end. R accepts arguments in any order as long as you name them. Naming also makes the code easier to share, since other people can see exactly which value went to which argument. For readability's sake, and as a standard in R, we format the code as follows instead:
regress(data = nshap,
        y = income,
        x = age)
regress(x = age,
        data = nshap,
        y = income)
Though some people do it the following way and line up the equal signs:
regress(data = nshap,
        y    = income,
        x    = age)
regress(x    = age,
        data = nshap,
        y    = income)
The second way may matter more if you are sharing your code with someone else, but lining up the equal signs is not common practice. Either way, RStudio makes this kind of formatting easy through its code reformatting tools.
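The argument-matching rules in this section can be tried with real functions. The following is a minimal sketch using only base R. The toy data frame and its values are made up for illustration and are not taken from NSHAP.

```r
# seq() takes the arguments from, to, and by, in that order
s1 <- seq(from = 1, to = 10, by = 3)  # every argument named
s2 <- seq(1, 10, 3)                   # positional, in the documented order
s3 <- seq(by = 3, to = 10, from = 1)  # named, so any order works
s4 <- seq(3, 10, 1)                   # positional, wrong order: runs, but means something else

identical(s1, s2)  # same sequence
identical(s1, s3)  # same sequence
identical(s1, s4)  # different sequence

# lm() is the real counterpart of the made-up regress()
toy <- data.frame(age    = c(60, 65, 70, 75, 80),
                  income = c(52, 48, 45, 40, 38))  # made-up values

fit1 <- lm(formula = income ~ age, data = toy)
fit2 <- lm(data = toy, formula = income ~ age)  # same model, arguments swapped
all.equal(coef(fit1), coef(fit2))
```

Note that the wrong-order call does not raise an error. R matches unnamed values by position and returns a different result, which is why naming arguments is the safer habit.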
1.2 The dplyr way (or magrittr way)
This is where things get a bit tricky: using “pipes”, written %>%. To type a pipe in RStudio, press Ctrl + Shift + M. The two commands below are equivalent:
regress(data = nshap,
        y = income,
        x = age)
nshap %>%
  regress(y = income, x = age)
# or the alternative format for the above code:
nshap %>%
  regress(y = income,
          x = age)
As long as nshap is a dataframe object, the pipe removes the need to type data = nshap. The pipe takes whatever is on its left (here, nshap) and passes it as the first argument of the function on its right (regress()). This works for any function whose first argument is the data, which is how the functions in the tidyverse library are designed. Think of the left side as being imported or loaded. We will see other shortcuts from dplyr later on.
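Since regress() does not exist, here is a minimal sketch of the same idea with real functions. It assumes the dplyr package is installed (loading it also makes %>% available), and the toy data frame is made up for illustration. Because lm() takes the formula first and the data second, the sketch uses the magrittr "." placeholder to say where the left side should go.

```r
library(dplyr)  # also makes the %>% pipe available

# Made-up values for illustration; not NSHAP data
toy <- data.frame(age    = c(60, 65, 70, 75, 80),
                  income = c(52, 48, 45, 40, 38))

# The pipe passes the left side as the first argument of the right side
direct <- head(toy, n = 2)
piped  <- toy %>% head(n = 2)
identical(direct, piped)

# lm() expects the formula first, so mark the data slot with "."
fit_direct <- lm(income ~ age, data = toy)
fit_piped  <- toy %>% lm(income ~ age, data = .)
all.equal(coef(fit_direct), coef(fit_piped))
```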
1.3 The functions from dplyr
You can take a look at the documentation directly at dplyr.tidyverse.org.
Four key functions for manipulating the data are mutate(), select(), filter(), and arrange().
mutate(): make a new variable; gen in Stata
select(): pick which variables to keep; keep in Stata (or get rid of variables by prefixing them with -, like drop in Stata)
filter(): keep the rows where a given variable matches a certain value; if variable == 123 in Stata
arrange(): sort in Stata
There is also one key data grouping function, group_by() (same as by in Stata), and one data reporting function, summarise(). Let’s see what we can do with these six functions! Keep following along with the tutorial.
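As a preview, the six functions can be sketched on a small example. This assumes the dplyr package is installed. The toy data frame, its column names, and its values are all made up for illustration and do not come from NSHAP.

```r
library(dplyr)

# Made-up values for illustration; not NSHAP data
toy <- data.frame(
  id     = 1:6,
  gender = c("f", "m", "f", "m", "f", "m"),
  age    = c(62, 71, 80, 66, 75, 90),
  income = c(50, 40, 30, 45, 35, 20)
)

result <- toy %>%
  mutate(age_decade = floor(age / 10) * 10) %>%  # new variable (gen in Stata)
  select(-id) %>%                                # drop the id column
  filter(age < 85) %>%                           # keep rows where age is below 85
  arrange(desc(income))                          # sort, highest income first

by_gender <- toy %>%
  group_by(gender) %>%                           # one group per gender value
  summarise(mean_income = mean(income),          # one summary row per group
            count = n())

result
by_gender
```

Each step receives the data frame produced by the step before it, so the chain reads top to bottom in the order the operations are applied.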