Part 4 Date data and regex parsing
The library lubridate
in the tidyverse
package is the go-to library for time-series data. It can recognize all kinds of time formats, keep them in a uniformed format, calculate time, and more. However, not all datasets are created equally; some date entries may not be easily recognized, so this section will go over how to parse w2 data using stringr
library from tidyverse
package and convert the parsed data into recognizable date format.
4.1 W3 – already in date format
Let’s take a look at the variable int_start
, which is interview start. Let’s convert wave 3 data first since it is simpler to work with.
library(lubridate)
df %>%
select(int_start) %>%
head()
## # A tibble: 6 x 1
## int_start
## <dttm>
## 1 2015-09-01 00:00:00
## 2 2015-09-01 00:00:00
## 3 2015-11-01 00:00:00
## 4 2015-10-01 00:00:00
## 5 2016-01-01 00:00:00
## 6 2016-02-01 00:00:00
df <- df %>%
mutate(int_year = year(ymd(int_start)),
int_month = month(ymd(int_start)),
int_day = day(ymd(int_start)))
df %>%
select(int_start, int_year, int_month, int_day) %>%
head()
## # A tibble: 6 x 4
## int_start int_year int_month int_day
## <dttm> <dbl> <dbl> <int>
## 1 2015-09-01 00:00:00 2015 9 1
## 2 2015-09-01 00:00:00 2015 9 1
## 3 2015-11-01 00:00:00 2015 11 1
## 4 2015-10-01 00:00:00 2015 10 1
## 5 2016-01-01 00:00:00 2016 1 1
## 6 2016-02-01 00:00:00 2016 2 1
4.2 W2 – regex to convert into date format
Alright, let’s take a look at what int_start
variable looks like for w2 after importing w2 data first.
w2 <- read_excel("P:/5707/5707B/Administrative/Research Assistant Essentials/NSHAP Stata Training Materials/R tutorial winter 2019/nshap_w2_core.xlsx")
head(w2$int_start)
## [1] "2010m11" "2011m2" "2010m10" "2010m11" "2010m10" "2010m9"
Note: alternative head()
code for the w3 code above without using pipes – see the difference in presented format? This code might be shorter, but as codes become more and more complex, readability becomes more and more important. Using pipes help with readability a lot, where it follows similar pattern to English grammar, from subject to verb/object. The code below starts with a verb, with subject and object included within the verb. However, for short and simple codes like this, it should be ok, as long as the output is understandable.
Oh no, what is this format for w2!? The value “2010m11” looks like the year 2010 and the month of November. In order to convert this, we have to use “regular expressions” or regex (the stringr
library from tidyverse) to parse it, because lubridate
won’t recognize this. For all things regex, take a look at the official documentation: https://stringr.tidyverse.org/articles/regular-expressions.html
YYYYmMM <- regex('\\d{4} # Four numbers
m # The "m"
\\d{1,2} # One or two number matches',
comments = T)
# Let's see how the matches look
head(str_match(w2$int_start, YYYYmMM))
## [,1]
## [1,] "2010m11"
## [2,] "2011m2"
## [3,] "2010m10"
## [4,] "2010m11"
## [5,] "2010m10"
## [6,] "2010m9"
# Alternative code:
#head(str_match(w2$int_start, regex('\\d{4}m\\d{1,2}')))
Alright, so it seems that the regex is working well from what we can see; it listed all the matches starting from “2010m11” in the head()
output. Here is what is happening within the regex code:
Essentially,
\
is an escape character. We need a double escape here because we don’t want the alphabet “d” but we want the regex functiond
after escaping the escape character, which stands for digit.{4}
means four repetitions, which makes this four digits.Similar to
\\d{4}
in the first line,{1,2}
makes 1 as minimum number of matches, while 2 is the maximum number of matches.comments = TRUE
allows comments after “#” symbol in the multiline code within the quoted argument.We save all of this in an object called
YYYYmMM
so that we can easily compare the data with the regex object. Alternatively, in thestr_match
code, you can replace “YYYYmMM” argument with the regex code, as demonstrated.
Now, we can just take it apart and separate them into variables year
and month
.
regex_w2_years <- regex('^\\d{4}')
regex_w2_month <- regex('\\d{1,2}$')
w2 <- w2 %>%
mutate(int_year = as.numeric(str_match(int_start, regex_w2_years)),
int_month = as.numeric(str_match(int_start, regex_w2_month)))
w2 %>%
select(int_start, int_year, int_month) %>%
head()
## # A tibble: 6 x 3
## int_start int_year int_month
## <chr> <dbl> <dbl>
## 1 2010m11 2010 11
## 2 2011m2 2011 2
## 3 2010m10 2010 10
## 4 2010m11 2010 11
## 5 2010m10 2010 10
## 6 2010m9 2010 9
Now that we have the year and month separate for both waves, you can choose to combine them. Combining them into a single time variable can be convenient too! In order to combine them, we can use mutate()
to create a variable, and use date()
to make sure the variable created is a date variable. Since there is no day data in w2, R will default to the first day of the month.
# w2
w2 <- w2 %>%
mutate(yearmonth = make_date(int_year, int_month))
# w3
df <- df %>%
mutate(yearmonthday = make_date(int_year, int_month, int_day))
w2 %>%
select(yearmonth) %>%
head()
## # A tibble: 6 x 1
## yearmonth
## <date>
## 1 2010-11-01
## 2 2011-02-01
## 3 2010-10-01
## 4 2010-11-01
## 5 2010-10-01
## 6 2010-09-01
df %>%
select(yearmonthday) %>%
head()
## # A tibble: 6 x 1
## yearmonthday
## <date>
## 1 2015-09-01
## 2 2015-09-01
## 3 2015-11-01
## 4 2015-10-01
## 5 2016-01-01
## 6 2016-02-01
The key function here is makedate()
. Notice that the variable type is <date> instead of double or integer. Now, time series data analysis is possible!