Part 2 Data exploration

The dplyr package is a large part of what makes R well suited to data exploration. Exploring data efficiently helps an investigator decide what steps to take next in the preliminary stages of research. So, let’s start with what we previewed in part 1.

nshap %>% 
    group_by(gender) %>%
    summarise(n = n())
## # A tibble: 2 x 2
##   gender     n
##   <chr>  <int>
## 1 female  2648
## 2 male    2129
# Seems like there are about 500 more female respondents

To explain the code above: it takes the nshap data, groups the rows by the two discrete gender values (male and female), and creates a variable called n that counts the rows in each group using n(). As a side note, group_by() is usually used on discrete variables. Its closest equivalents in Stata are by and over.
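As a small, self-contained sketch of the same pattern, the snippet below builds a made-up tibble (the name toy and its values are hypothetical, not the nshap data) and counts rows per group. It also shows count(), a dplyr shorthand that should give the same counts as group_by() followed by summarise(n = n()).

```r
library(dplyr)

# Hypothetical toy data standing in for nshap (values are made up)
toy <- tibble(gender = c("female", "male", "female", "female", "male"))

# Long form: group the rows, then count the rows in each group
by_hand <- toy %>%
    group_by(gender) %>%
    summarise(n = n())

# Shorthand: count() groups, tallies, and ungroups in one step
shorthand <- toy %>%
    count(gender)

by_hand
shorthand
```

The long form is worth knowing because summarise() can compute other statistics in the same call, while count() is handy when a tally is all you need.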


2.1 Age and gender

Let’s look at age and gender.

age_gender <- nshap %>%
    select(age, gender) %>%
    mutate(age = as.numeric(age))

head(age_gender)
## # A tibble: 6 x 2
##     age gender
##   <dbl> <chr> 
## 1    79 female
## 2    84 male  
## 3    83 male  
## 4    83 female
## 5    62 female
## 6    76 male

Here, only the variables age and gender are selected, and age is then converted into a numeric variable. This reduced data of two variables is stored in an object called age_gender. Then head() shows the first 6 rows of the age_gender dataframe. Essentially, R dropped all variables except for these two and saved the result as age_gender.
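To see what select() and as.numeric() do in isolation, here is a minimal sketch on made-up data. It assumes, purely for illustration, that age arrives as text and that one respondent has a non-numeric entry; the tibble toy, its id column, and its values are hypothetical and not taken from nshap.

```r
library(dplyr)

# Hypothetical toy data: age stored as text, as it might be after an import
toy <- tibble(
    id     = 1:3,
    age    = c("79", "84", "refused"),
    gender = c("female", "male", "male")
)

toy_age_gender <- toy %>%
    select(age, gender) %>%                          # keep only these two columns
    mutate(age = suppressWarnings(as.numeric(age)))  # "refused" cannot be parsed and becomes NA

toy_age_gender
```

Without suppressWarnings(), R would warn that NAs were introduced by coercion, which is a useful signal that some values were not numbers to begin with.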


2.2 arrange() (Stata: sort)

From this data, we can see how many respondents there are for each age and gender. Let’s count them up and sort them in descending order.

age_gender %>%
    group_by(age, gender) %>% 
    summarise(n = n()) %>% 
    arrange(desc(n))
## # A tibble: 125 x 3
## # Groups:   age [68]
##      age gender     n
##    <dbl> <chr>  <int>
##  1    68 female   104
##  2    54 female    98
##  3    68 male      96
##  4    67 female    94
##  5    69 female    89
##  6    73 female    88
##  7    53 female    77
##  8    55 female    76
##  9    59 female    76
## 10    69 male      76
## # ... with 115 more rows

Interestingly enough, there seem to be quite a few respondents in their late 60s. We’ll investigate this a bit more in part 4, where we visualize our data.
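The output above still carries a "Groups: age" line, because summarise() removes only the last grouping variable. The sketch below, on hypothetical toy data, shows one way to drop that leftover grouping and to break ties in the sort. It assumes dplyr 1.0.0 or later, where summarise() accepts the .groups argument.

```r
library(dplyr)

# Hypothetical toy data (not the nshap values)
toy <- tibble(
    age    = c(68, 68, 68, 54, 54, 67),
    gender = c("female", "female", "male", "female", "female", "male")
)

toy_counts <- toy %>%
    group_by(age, gender) %>%
    summarise(n = n(), .groups = "drop") %>%  # drop the leftover grouping by age
    arrange(desc(n), age)                     # largest n first; ties broken by younger age

toy_counts
```

Passing a second column to arrange() makes the order of tied rows predictable instead of leaving it to whatever order the rows happened to be in.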


2.3 head() (Stata: list in 1/6)

The variable par_nerve seems interesting: it records, as discrete values, how often respondents find that their partner gets on their nerves. Let’s see how the data looks:

nerve <- nshap %>%
    group_by(par_nerve, gender) %>% 
    summarise(n = n()) %>% 
    arrange(desc(n))

head(nerve)
## # A tibble: 6 x 3
## # Groups:   par_nerve [3]
##   par_nerve             gender     n
##   <chr>                 <chr>  <int>
## 1 not applicable        female   921
## 2 some of the time      female   842
## 3 some of the time      male     715
## 4 hardly ever or rarely male     639
## 5 hardly ever or rarely female   477
## 6 not applicable        male     359

It seems like there is quite a bit of missing data: 921 “not applicable” responses from females and 359 from males. Setting those aside, respondents most commonly report that their partners get on their nerves “some of the time”.
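One way to check that impression is to turn the counts into shares. The sketch below re-enters the 14 counts that the nerve dataframe holds (section 2.4 prints all of them, with “not applicable” shown as NA) into a tibble named nerve_counts, a name made up for this example, so that it runs without the nshap file.

```r
library(dplyr)

# Counts by answer and gender, copied from the printed nerve output
nerve_counts <- tibble(
    par_nerve = rep(c("often", "some of the time", "hardly ever or rarely",
                      "never", "refused", "don't know", "not applicable"),
                    each = 2),
    gender    = rep(c("female", "male"), times = 7),
    n         = c(247, 139, 842, 715, 477, 639, 157, 271, 2, 3, 2, 3, 921, 359)
)

shares <- nerve_counts %>%
    filter(par_nerve != "not applicable") %>%  # leave out the "not applicable" rows
    group_by(par_nerve) %>%
    summarise(n = sum(n)) %>%                  # pool both genders
    mutate(share = n / sum(n)) %>%             # fraction of the remaining answers
    arrange(desc(share))

shares
```

With the full nshap data loaded, the same pipeline could start from nerve instead of the hand-entered tibble.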


2.4 Ascending order

Now let’s sort in ascending order instead, ordering the discrete values as we see fit. Rather than sorting by the count n, we sort by the par_nerve variable itself, after assigning an order (“levels”) to its values. As a side note, head(., 10) shows 10 rows instead of the default 6.

nerve$par_nerve <- ordered(nerve$par_nerve, levels=c("often",
                                                     "some of the time",
                                                     "hardly ever or rarely",
                                                     "never",
                                                     "refused",
                                                     "don't know"))

nerve %>% 
    arrange(-desc(par_nerve))
## # A tibble: 14 x 3
## # Groups:   par_nerve [7]
##    par_nerve             gender     n
##    <ord>                 <chr>  <int>
##  1 often                 female   247
##  2 often                 male     139
##  3 some of the time      female   842
##  4 some of the time      male     715
##  5 hardly ever or rarely male     639
##  6 hardly ever or rarely female   477
##  7 never                 male     271
##  8 never                 female   157
##  9 refused               male       3
## 10 refused               female     2
## 11 don't know            male       3
## 12 don't know            female     2
## 13 <NA>                  female   921
## 14 <NA>                  male     359

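To explain the code: it replaces the par_nerve variable in the nerve dataframe. The function ordered() assigns an order to the discrete values, where the levels argument specifies that order. The c() function creates a vector; in this case, a vector of six strings. Because “not applicable” is not among the six levels, those rows become NA and sort last. A little trick: -desc() orders in ascending order. There is no asce() function, but arrange() already sorts in ascending order by default, so arrange(par_nerve) gives the same result.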
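Two details of this step are easy to miss: any value left out of levels is turned into NA, and arrange() sorts in ascending order on its own. The sketch below illustrates both on a hypothetical three-row tibble (toy and its counts are made up).

```r
library(dplyr)

# Hypothetical toy data using three of the answer labels
toy <- tibble(
    par_nerve = c("never", "not applicable", "often"),
    n         = c(3, 2, 1)
)

# "not applicable" is not listed in levels, so it is turned into NA
toy$par_nerve <- ordered(toy$par_nerve, levels = c("often", "never"))

plain   <- toy %>% arrange(par_nerve)         # arrange() is ascending by default
flipped <- toy %>% arrange(-desc(par_nerve))  # the double negation used above

plain
identical(plain$n, flipped$n)  # compares the row order of the two results
```

If you would rather keep “not applicable” as a visible category, adding it to the levels vector avoids the conversion to NA.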


2.5 More head(), variable type, and mean

Let’s take a look at household earnings, age, gender, and education.

hage <- nshap %>% # hearn, age, gender, education
    select(hearn, age, gender, educ) %>%
    mutate(hearn = as.numeric(hearn),
           age = as.numeric(age))

head(hage, 10)
## # A tibble: 10 x 4
##    hearn   age gender educ                       
##    <dbl> <dbl> <chr>  <chr>                      
##  1 16000    79 female voc cert/some college/assoc
##  2 34000    84 male   voc cert/some college/assoc
##  3 50000    83 male   bachelors or more          
##  4 25000    83 female voc cert/some college/assoc
##  5 30000    62 female < hs                       
##  6    NA    76 male   bachelors or more          
##  7 95000    79 male   hs/equiv                   
##  8    NA    64 female hs/equiv                   
##  9    NA    78 male   < hs                       
## 10    NA    80 female hs/equiv
mean(hage$hearn)
## [1] NA

First, the code converts hearn and age into numeric variables, similar to destring in Stata. Also, since there is missing data (represented as NA in R), mean() does not skip it and returns NA. There are a couple of options in this situation.

  1. Don’t count the missing values when calculating the mean
mean(hage$hearn, na.rm = TRUE)
## [1] 73746.95
  2. Remove the rows with missing values
hage <- na.omit(hage)
mean(hage$hearn)
## [1] 73746.95

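Notice the difference between na.rm and na.omit(). The na.rm argument tells mean() to skip NA values in the one vector it is given, while na.omit() removes every row of the dataframe that has an NA in any column. The two means agree here, which suggests that the dropped rows are exactly the ones missing hearn. The first method is more surgical while the second is more invasive, but the second method will be used in part 4 for graphing.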
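The two approaches do not always give the same mean, because na.omit() also drops rows whose NA sits in a different column. Here is a minimal base R sketch on made-up data (the dataframe toy and its values are hypothetical) where the results differ.

```r
# Hypothetical toy data: one NA in hearn and one NA in educ (values are made up)
toy <- data.frame(
    hearn = c(10000, NA, 30000, 80000),
    educ  = c("hs/equiv", "< hs", NA, "bachelors or more")
)

# na.rm = TRUE skips only the missing hearn value: (10000 + 30000 + 80000) / 3
mean_skip <- mean(toy$hearn, na.rm = TRUE)

# na.omit() drops rows 2 and 3, because each has an NA in some column
toy_complete <- na.omit(toy)
mean_complete <- mean(toy_complete$hearn)  # (10000 + 80000) / 2

mean_skip
mean_complete
```

When the goal is a single statistic, na.rm = TRUE keeps as much data as possible; na.omit() is the better fit when every later step needs complete rows.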