Starting with R

0.1 Downloading and installing R

R by itself (also known as base R) is a language, so most people work in an integrated development environment (IDE) that wraps around base R. The IDE understands the language, executes the code, helps debug it, and offers an intuitive interface for graphics, among other things. The most popular IDE is RStudio. Here are the download links:

  1. Base R: https://cran.rstudio.com or https://cran.r-project.org
  2. RStudio: https://www.rstudio.com/products/rstudio/download/#download

After you have installed both, you will most likely never need to open R directly, though you might need to update R from time to time. Run RStudio and you should see four panes: source (code), console, environment, and files. Go to File → New File → R Markdown (or R Script, the most basic file type). RStudio may walk you through installing some packages.

Markdown is a separate language, but R Markdown is the standard choice for a couple of reasons:

  1. Easy publishing (for example, the documentation you are reading now was built with bookdown, an extension of R Markdown)
  2. Results appear inline, directly below the code, so you can see which lines of your code produced which output

After giving your Rmd (R Markdown) file a title, go ahead and delete all the template text and code (or read it first if you’d like). To start coding, press Ctrl + Alt + I (Windows/Linux) or Cmd + Option + I (macOS) to create an R code block.


0.2 Note for this tutorial

Within the code blocks of this tutorial, a single number sign (“#”, also called a hash) marks a code comment, while a double number sign (“##”) marks output from R.
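As a small illustration of this convention (a made-up one-liner, not taken from the NSHAP material), a block in this tutorial looks like this:

```r
# add two numbers (this line is a comment)
1 + 1
## [1] 2
```

Because R treats everything after a hash as a comment, a block copied together with its “##” output lines still runs without changes.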


0.3 Loading some libraries and code

To load data into R, we need to install packages that handle different data types. More generally, unlike Stata, where commands such as reg are available out of the box, much of R’s functionality lives in packages that you install separately (think of it as downloading extra commands). R does come with a set of basic functions, including lm() for linear regression, but it is bare-bones when first installed. Think of it this way: we know English grammar, but do we need to know every single word in the dictionary for one specialized field of work or for everyday life? Not really. What we are doing here is adding a package of commonly used vocabulary, known as tidyverse.

Here is a command to run (remove # to un-comment the code):

#install.packages("tidyverse")
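Since installation only needs to happen once per machine, it can help to check what is already there before installing. The following is a minimal sketch, not part of the original tutorial; the variable names pkgs and to_install are made up for illustration, and the install line is left commented out in the same way as above.

```r
# Packages this tutorial relies on
pkgs <- c("tidyverse", "readxl")

# requireNamespace() returns TRUE if a package is installed and FALSE otherwise
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Names of the packages that still need to be installed (possibly none)
to_install <- pkgs[!installed]
to_install

# install.packages(to_install)  # remove the # to install whatever is missing
```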

Pressing Ctrl + Enter runs the current line of code, and clicking the green arrow button at the top right of a code block runs the whole block. If the code fails, R prints an error message. If it succeeds, it may print nothing at all, depending on how the authors of the libraries built them. Let’s see if we can run the code below. Does it run?

library(tidyverse)
library(readxl) # for reading Excel files

The readxl package enables R to import Excel files. The most important packages we will use from the tidyverse are dplyr and ggplot2. Once the packages are installed, the library() calls above load them so that we can use the functions that come with them.

Note that the install.packages() function requires quotes around the package name, while library() does not need them (think of it as learning how to pronounce the word in quotes when installing). After a package is installed, R recognizes it like vocabulary. Installing a package may also automatically install other packages that are not named. For example, tidyverse comes with several packages. However, library(tidyverse) loads only the core ones (such as dplyr and ggplot2); the others have to be loaded separately. For example, library(readxl) needs to be called even after calling library(tidyverse), even though readxl is installed as part of tidyverse. The same was true of lubridate in tidyverse versions before 2.0.0.
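The difference between a package being installed and being loaded can be seen without installing anything. The sketch below uses the tools package as a stand-in, on the assumption that it behaves like a non-core tidyverse package in this respect: it ships with every R installation but is normally not loaded at start-up.

```r
# Is tools on the search path (i.e. loaded with library()) yet?
"package:tools" %in% search()

# Is it installed? requireNamespace() checks without attaching it
requireNamespace("tools", quietly = TRUE)

# Load it, just like library(readxl) after library(tidyverse)
library(tools)
"package:tools" %in% search()

# A function from tools can now be called by name
file_ext("nshap_w3_core.xlsx")
## [1] "xlsx"
```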

To start fresh, similar to clear all in Stata, we use rm() to remove objects; list = ls() passes it the names of all objects currently in the workspace.

rm(list=ls()) # for resetting
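To see what this command does without wiping your own workspace, the same call can be pointed at a throwaway environment. This is only a sketch; the names scratch, x, and y are invented for the example.

```r
# A separate, empty environment to experiment in
scratch <- new.env()
assign("x", 1:3, envir = scratch)
assign("y", "hello", envir = scratch)

before <- ls(scratch)                    # names of the objects that exist
rm(list = ls(scratch), envir = scratch)  # remove every one of them
after <- ls(scratch)

before
## [1] "x" "y"
after
## character(0)
```

Dropping the scratch argument gives back rm(list = ls()), which does the same thing to the main workspace.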

0.4 Help docs

In Stata, help X brings up the documentation for the command X. In R, you can type ?X in the console. Note that if a function comes from a package, you have to load that package first. For example, arrange() is from the dplyr package, so you need to run library(dplyr) first and then ?arrange.
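A few equivalent ways of reaching the help system are sketched below, using mean() from base R so that nothing needs to be installed; the search term "regression" is just an example.

```r
?mean                      # documentation for mean()
help("mean")               # the same thing, written as a function call
help.search("regression")  # search all installed help pages (same as ??regression)
```

A package-qualified form such as ?dplyr::arrange should also work without calling library() first, as long as the package is installed.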


0.5 Importing data

Now, let’s import some data. While there are libraries that enable R to load Stata data directly, there are some minor problems with missing values. Stata distinguishes 27 missing-value codes (. and .a through .z), but R has only one (NA). NSHAP uses several of these codes, so exporting the data from Stata as an Excel file and importing that into R makes it easier to work with, because the export turns the different missing codes into strings (and these strings can be recoded manually in R if needed). After importing, let’s preview the data with the head() function to check that it loaded correctly.

df <- read_excel("P:/5707/5707B/Administrative/Research Assistant Essentials/NSHAP Stata Training Materials/R tutorial winter 2019/nshap_w3_core.xlsx")
head(df)
## # A tibble: 6 x 496
##   su_id hh_id fi_id surveytype3 partnerconfirm partner_id version3
##   <chr> <chr> <chr> <chr>       <chr>          <chr>      <chr>   
## 1 1000~ 1000~ 2132~ returning ~ not applicable 10000011   returni~
## 2 1000~ 1000~ 2132~ returning ~ not applicable 10000010   returni~
## 3 1000~ 1000~ 0760~ returning ~ not applicable 10000051   returni~
## 4 1000~ 1000~ 0760~ returning ~ not applicable 10000050   returni~
## 5 1000~ 1000~ 8128~ returning ~ not applicable <NA>       returni~
## 6 1000~ 1000~ 5341~ returning ~ not applicable <NA>       not ret~
## # ... with 489 more variables: int_start <dttm>, weight_sel <dbl>,
## #   weight_adj <dbl>, stratum <dbl>, cluster <dbl>, military <chr>,
## #   gender <chr>, age <chr>, hschl <chr>, hschlgr <chr>, college <chr>,
## #   collegey <chr>, degree <chr>, race <chr>, hispanic <chr>,
## #   ethgrp <chr>, degree_coded <chr>, educ <chr>, rosterintro <chr>,
## #   maritlst <chr>, spartner <chr>, spnamed <chr>, splineno <chr>,
## #   other_hh <chr>, born_us <chr>, born_stat <chr>, famhappy <chr>,
## #   dadeduc <chr>, momeduc <chr>, famfin <chr>, liveparent <chr>,
## #   chldhlth <chr>, expviol <chr>, witnviol <chr>, volunteer <chr>,
## #   attend <chr>, social <chr>, rlthappy <chr>, sptime <chr>,
## #   spopen2 <chr>, sprely2 <chr>, spdemand2 <chr>, spcritze2 <chr>,
## #   spunderstand <chr>, sptalk <chr>, sprelyhelp <chr>, spletdown <chr>,
## #   par_nerve <chr>, famopen2 <chr>, famrely2 <chr>, famdeman2 <chr>,
## #   famcritz2 <chr>, famfeel <chr>, famworries <chr>, famhelp <chr>,
## #   famdown <chr>, fam_nerve <chr>, clsrel <chr>, fropen2 <chr>,
## #   frrely2 <chr>, frdemn2 <chr>, frcritz2 <chr>, frfeel <chr>,
## #   frworries <chr>, frhelp <chr>, frdown <chr>, fr_nerve <chr>,
## #   framt <chr>, physhlth <chr>, mntlhlth3 <chr>, coldtoday <chr>,
## #   mammogram3 <chr>, regularmammogram <chr>, futuremammogram <chr>,
## #   psa3 <chr>, regularpsa <chr>, futurepsa <chr>, colonos3 <chr>,
## #   generaldoc <chr>, placeappt3 <chr>, conditns_6 <chr>, arthritis <chr>,
## #   osteo_rheu <chr>, hrtprob3 <chr>, hrtattack <chr>, hrtcard <chr>,
## #   hrtchf <chr>, hrtatrfib <chr>, othcan <chr>, howmanyc2 <chr>,
## #   cdiag_1 <chr>, cdiagm_1 <chr>, cdiagy_1 <chr>, cdiagage_1 <chr>,
## #   cbegin_1 <chr>, spread_1 <chr>, cdiag_2 <chr>, cdiagm_2 <chr>,
## #   cdiagy_2 <chr>, cdiagage_2 <chr>, ...

Besides the massive output, here’s an ugly part of R: the backslash (\) is an escape character inside strings, so a Windows file path has to be written with forward slashes (or with every backslash doubled as \\). Make sure to note this for your next R script!

What the code above did is assign (generate or replace in Stata) an object called df, which stands for “data frame” in the usual R lingo, and then look at the head of the data, that is, the first 6 rows of df. The output also shows other information, such as how many variables there are (496 in this case) and what type each imported variable is. The seven types of variables are:
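The options for writing a Windows path can be sketched as follows. The path itself is hypothetical, and the raw-string form assumes R 4.0.0 or later.

```r
# Writing "C:\Users\me\data.xlsx" directly is an error, because \U starts an escape.

p1 <- "C:/Users/me/data.xlsx"       # forward slashes
p2 <- "C:\\Users\\me\\data.xlsx"    # doubled backslashes
p3 <- r"(C:\Users\me\data.xlsx)"    # raw string (R >= 4.0.0): paste the path as-is

identical(p2, p3)
## [1] TRUE

# Convert backslashes to forward slashes programmatically
gsub("\\", "/", p2, fixed = TRUE)
## [1] "C:/Users/me/data.xlsx"
```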

  1. int: integers
  2. dbl: double (real numbers)
  3. chr: characters (strings)
  4. date: date
  5. dttm: date + times
  6. lgl: logical (TRUE or FALSE)
  7. fct: factor (discrete categories)
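Because the Excel export turns Stata’s missing codes into strings, a numeric variable such as age arrives as chr. One way to recode it is sketched below on a toy data frame; the values and the labels "refused" and "don't know" are invented for illustration and are not NSHAP’s actual codes.

```r
# Toy stand-in for an imported column (made-up values)
toy <- data.frame(
  age = c("62", "refused", "71", "don't know"),
  stringsAsFactors = FALSE
)
class(toy$age)
## [1] "character"

# Assumed missing-value labels; replace with the ones in your data
missing_codes <- c("refused", "don't know")

toy$age[toy$age %in% missing_codes] <- NA  # strings become R's single NA
toy$age <- as.numeric(toy$age)             # chr -> dbl
class(toy$age)
## [1] "numeric"

mean(toy$age, na.rm = TRUE)                # ignore NA when averaging
## [1] 66.5
```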

There is a slight upside to using R: you can bring multiple data files into the session at once. Stata traditionally holds one dataset in memory at a time, while R can hold multiple data files (and data types; there’s a library for pretty much everything!) as separate objects. Here is an example:

library(haven) # for read_dta(); installed with tidyverse but loaded separately
df1 <- read_excel("excel_data.xlsx")
df2 <- read_dta("stata_data.dta")
df3 <- read_csv("csv_data.csv")

To convey a simple analogy, you can tell R to “regress variable1 from df1 on variable5 from df2 and variable9 from df3”. With this basic understanding, we are almost ready to do some real coding. Head to the next part for functions.
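The analogy above can be sketched with two toy data frames. Everything here (the object names outcomes and predictors, the columns, and the numbers) is invented for illustration, and lm() from base R stands in for Stata’s reg.

```r
# Two separate objects living side by side in the same session
outcomes   <- data.frame(id = 1:5, y = c(2.1, 3.9, 6.2, 7.8, 10.1))
predictors <- data.frame(id = 1:5, x = 1:5)

# Regress y from one object on x from another.
# This only makes sense because the rows are in the same order in both.
fit_direct <- lm(outcomes$y ~ predictors$x)

# Safer: merge on the shared id first, then regress within one data frame
merged <- merge(outcomes, predictors, by = "id")
fit_merged <- lm(y ~ x, data = merged)

coef(fit_direct)
coef(fit_merged)
```

Both fits give the same coefficients here; merging first is the more robust habit, since it does not rely on the rows of the two objects lining up.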