Part 1 Before diving in

Let’s make sure we load up our packages for data exploration and reading an Excel file.

# Loading necessities
library(tidyverse)
library(readxl)
#library(dplyr)   # for coding with data (already loaded by tidyverse)
#library(ggplot2) # for graphing (already loaded by tidyverse)

And import the data…

#rm(list=ls())
nshap <- read_excel("P:/5707/5707B/Administrative/Research Assistant Essentials/NSHAP Stata Training Materials/R tutorial winter 2019/nshap_w3_core.xlsx") # make sure to use quotes!
#head(nshap)

You can uncomment #head(nshap) to preview the data frame and confirm that it loaded correctly.


1.1 General syntax for R

Entering commands in R works much as it does in other languages, with a few quirks of its own. The lines below show the general form of an R function call and the variations that R accepts:

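function(argument1 = value1, argument2 = value2)  # original command
function(value1, value2)  # correct: unnamed values are matched by position
function(argument2 = value2, argument1 = value1)  # correct, safer option: named arguments work in any order
function(value2, value1)  # wrong: unnamed values given in the wrong order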

  # Think of the above this way:
regress(data = nshap, y = income, x = age)
regress(x = age, data = nshap, y = income)
  # Note: function "regress" does not exist in R -- look into "lm" which stands for linear model

This can make things easier than in Stata. Stata needs some arguments in a fixed order (for example, the y variable must come first), and all options must come at the end (for example, robust). R accepts arguments in any order as long as you name them. Naming arguments also makes the code easier to share, since other people can see exactly which value went to which argument. For readability's sake, and as a common convention in R, we format the code as follows instead:

regress(data = nshap,
        y = income,
        x = age)
regress(x = age,
        data = nshap,
        y = income)

Though some people do it the following way and line up the equal signs:

regress(data = nshap,
        y    = income,
        x    = age)
regress(x    = age,
        data = nshap,
        y    = income)

The aligned style can be easier to scan when you are sharing your code with someone else, but lining up the equal signs is not common practice. Either way, RStudio makes it easy to apply this kind of layout through its code formatting tools.
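
To try the argument-matching rules from this section with a function that really exists, here is a minimal sketch using lm() on a small made-up data frame. The object toy and its values are hypothetical stand-ins for nshap, not real NSHAP data.

```r
# Hypothetical toy data standing in for nshap (values are made up)
toy <- data.frame(age    = c(62, 70, 75, 81, 68),
                  income = c(41, 38, 30, 27, 45))

# Named arguments can be given in any order
fit_a <- lm(formula = income ~ age,
            data = toy)
fit_b <- lm(data = toy,
            formula = income ~ age)

# Unnamed arguments are matched by position: formula first, then data
fit_c <- lm(income ~ age, toy)

# All three calls estimate the same coefficients
same_ab <- isTRUE(all.equal(coef(fit_a), coef(fit_b)))
same_ac <- isTRUE(all.equal(coef(fit_a), coef(fit_c)))

# Unnamed values in a different order mean something different:
up   <- seq(2, 10)   # counts up from 2 to 10
down <- seq(10, 2)   # counts down from 10 to 2
```

Naming the arguments, as in fit_a and fit_b, is the safer habit because the result does not depend on remembering the order in which the function defines them.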


1.2 The dplyr way (or magrittr way)

This is where things get a bit tricky: using “pipes”, written %>%. In RStudio, you can insert a pipe by pressing Ctrl + Shift + M. The two commands below are equivalent:

regress(data = nshap,
        y    = income,
        x    = age)

nshap %>%
    regress(y = income, x = age)

  # or the alternative format for the above code:

nshap %>%
    regress(y = income,
            x = age)

As long as nshap is already defined as a data frame, this removes the need to type data = nshap. The pipe takes whatever is on its left (in the above case, nshap) and passes it as the first argument of the function on its right (regress()). This works with any function whose first argument is the data, which is the case for the functions in the tidyverse. Think of the left side as being imported or loaded. We will see other shortcuts from dplyr later on.
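
Since regress() is only a placeholder, the following sketch shows the same idea with real functions. It assumes dplyr is installed, and it uses a made-up data frame called toy (hypothetical, not NSHAP data). The last call shows one way to pipe into lm(), whose first argument is a formula rather than the data: the dot marks where the left side should go.

```r
library(dplyr)  # loading dplyr also makes the %>% pipe available

# Hypothetical toy data standing in for nshap (values are made up)
toy <- data.frame(age    = c(62, 70, 75, 81, 68),
                  income = c(41, 38, 30, 27, 45))

# The pipe passes its left side as the first argument of the right side,
# so these two calls do the same thing
direct <- filter(toy, age > 65)
piped  <- toy %>%
    filter(age > 65)

# lm() expects a formula first, so the dot tells the pipe
# to place the data frame in the data argument instead
fit <- toy %>%
    lm(income ~ age, data = .)
```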


1.3 The functions from dplyr

You can take a look at the documentation directly at dplyr.tidyverse.org.

Four key functions for manipulating the data are mutate(), select(), filter(), and arrange().

  • mutate(): make a new variable, like gen in Stata
  • select(): keep only the variables you list, like keep in Stata (or drop variables by putting - in front of them, like drop in Stata)
  • filter(): keep the rows where a given variable matches a certain value, like keep if variable == 123 in Stata
  • arrange(): sort the rows, like sort in Stata

There is also one key data grouping function, group_by() (same as by in Stata), and one key data reporting function, summarise(). Let’s see what we can do with these six functions! Keep following along with the tutorial.
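
As a quick preview, here is a minimal sketch that chains all six functions together. It assumes dplyr is installed. The data frame toy, its column names, and its values are hypothetical and only serve to make the example runnable without the NSHAP file.

```r
library(dplyr)

# Hypothetical toy data (values are made up)
toy <- data.frame(id     = 1:6,
                  gender = c("f", "m", "f", "m", "f", "m"),
                  age    = c(62, 70, 75, 81, 68, 59),
                  income = c(41, 38, 30, 27, 45, 52))

result <- toy %>%
    mutate(income_k = income * 1000) %>%  # make a new variable
    select(-id) %>%                       # drop the id variable
    filter(age >= 60) %>%                 # keep rows where age is 60 or more
    arrange(age)                          # sort by age, lowest first

by_gender <- result %>%
    group_by(gender) %>%                  # one group per gender
    summarise(mean_income = mean(income), # average income in each group
              n = n())                    # number of rows in each group
```

Each step receives the data frame produced by the step before it, so the code reads from top to bottom in the order the operations happen.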