Part 4 Date data and regex parsing

The library lubridate in the tidyverse package is the go-to library for time-series data. It can recognize all kinds of time formats, keep them in a uniformed format, calculate time, and more. However, not all datasets are created equally; some date entries may not be easily recognized, so this section will go over how to parse w2 data using stringr library from tidyverse package and convert the parsed data into recognizable date format.

4.1 W3 – already in date format

Let’s take a look at the variable int_start, which is interview start. Let’s convert wave 3 data first since it is simpler to work with.

library(lubridate)

df %>% 
    select(int_start) %>%
    head()
## # A tibble: 6 x 1
##   int_start          
##   <dttm>             
## 1 2015-09-01 00:00:00
## 2 2015-09-01 00:00:00
## 3 2015-11-01 00:00:00
## 4 2015-10-01 00:00:00
## 5 2016-01-01 00:00:00
## 6 2016-02-01 00:00:00
df <- df %>% 
    mutate(int_year = year(ymd(int_start)),
           int_month = month(ymd(int_start)),
           int_day = day(ymd(int_start)))

df %>% 
    select(int_start, int_year, int_month, int_day) %>% 
    head()
## # A tibble: 6 x 4
##   int_start           int_year int_month int_day
##   <dttm>                 <dbl>     <dbl>   <int>
## 1 2015-09-01 00:00:00     2015         9       1
## 2 2015-09-01 00:00:00     2015         9       1
## 3 2015-11-01 00:00:00     2015        11       1
## 4 2015-10-01 00:00:00     2015        10       1
## 5 2016-01-01 00:00:00     2016         1       1
## 6 2016-02-01 00:00:00     2016         2       1

4.2 W2 – regex to convert into date format

Alright, let’s take a look at what int_start variable looks like for w2 after importing w2 data first.

w2 <- read_excel("P:/5707/5707B/Administrative/Research Assistant Essentials/NSHAP Stata Training Materials/R tutorial winter 2019/nshap_w2_core.xlsx")

head(w2$int_start)
## [1] "2010m11" "2011m2"  "2010m10" "2010m11" "2010m10" "2010m9"

Note: alternative head() code for the w3 code above without using pipes – see the difference in presented format? This code might be shorter, but as codes become more and more complex, readability becomes more and more important. Using pipes help with readability a lot, where it follows similar pattern to English grammar, from subject to verb/object. The code below starts with a verb, with subject and object included within the verb. However, for short and simple codes like this, it should be ok, as long as the output is understandable.

Oh no, what is this format for w2!? The value “2010m11” looks like the year 2010 and the month of November. In order to convert this, we have to use “regular expressions” or regex (the stringr library from tidyverse) to parse it, because lubridate won’t recognize this. For all things regex, take a look at the official documentation: https://stringr.tidyverse.org/articles/regular-expressions.html

YYYYmMM <- regex('\\d{4}    # Four numbers
                  m         # The "m"
                  \\d{1,2}  # One or two number matches',
                  comments = T)

# Let's see how the matches look
head(str_match(w2$int_start, YYYYmMM))
##      [,1]     
## [1,] "2010m11"
## [2,] "2011m2" 
## [3,] "2010m10"
## [4,] "2010m11"
## [5,] "2010m10"
## [6,] "2010m9"
# Alternative code:
#head(str_match(w2$int_start, regex('\\d{4}m\\d{1,2}')))

Alright, so it seems that the regex is working well from what we can see; it listed all the matches starting from “2010m11” in the head() output. Here is what is happening within the regex code:

  • Essentially, \ is an escape character. We need a double escape here because we don’t want the alphabet “d” but we want the regex function d after escaping the escape character, which stands for digit. {4} means four repetitions, which makes this four digits.

  • Similar to \\d{4} in the first line, {1,2} makes 1 as minimum number of matches, while 2 is the maximum number of matches.

  • comments = TRUE allows comments after “#” symbol in the multiline code within the quoted argument.

  • We save all of this in an object called YYYYmMM so that we can easily compare the data with the regex object. Alternatively, in the str_match code, you can replace “YYYYmMM” argument with the regex code, as demonstrated.

Now, we can just take it apart and separate them into variables year and month.

regex_w2_years <- regex('^\\d{4}')
regex_w2_month <- regex('\\d{1,2}$')

w2 <- w2 %>% 
    mutate(int_year = as.numeric(str_match(int_start, regex_w2_years)),
           int_month = as.numeric(str_match(int_start, regex_w2_month)))

w2 %>% 
    select(int_start, int_year, int_month) %>% 
    head()
## # A tibble: 6 x 3
##   int_start int_year int_month
##   <chr>        <dbl>     <dbl>
## 1 2010m11       2010        11
## 2 2011m2        2011         2
## 3 2010m10       2010        10
## 4 2010m11       2010        11
## 5 2010m10       2010        10
## 6 2010m9        2010         9

Now that we have the year and month separate for both waves, you can choose to combine them. Combining them into a single time variable can be convenient too! In order to combine them, we can use mutate() to create a variable, and use date() to make sure the variable created is a date variable. Since there is no day data in w2, R will default to the first day of the month.

# w2
w2 <- w2 %>% 
    mutate(yearmonth = make_date(int_year, int_month))

# w3
df <- df %>% 
    mutate(yearmonthday = make_date(int_year, int_month, int_day))

w2 %>% 
    select(yearmonth) %>% 
    head()
## # A tibble: 6 x 1
##   yearmonth 
##   <date>    
## 1 2010-11-01
## 2 2011-02-01
## 3 2010-10-01
## 4 2010-11-01
## 5 2010-10-01
## 6 2010-09-01
df %>% 
    select(yearmonthday) %>% 
    head()
## # A tibble: 6 x 1
##   yearmonthday
##   <date>      
## 1 2015-09-01  
## 2 2015-09-01  
## 3 2015-11-01  
## 4 2015-10-01  
## 5 2016-01-01  
## 6 2016-02-01

The key function here is makedate(). Notice that the variable type is <date> instead of double or integer. Now, time series data analysis is possible!