Part 2 Data exploration
The dplyr package is probably what makes R one of the best tools for data exploration. Efficient data exploration helps the investigator decide what steps to take next in the preliminary stages of their research. So, let’s start with what we previewed earlier.
nshap %>%
group_by(gender) %>%
summarise(n = n())
## # A tibble: 2 x 2
## gender n
## <chr> <int>
## 1 female 2648
## 2 male 2129
# Seems like there are about 500 more female respondents
The code above takes the nshap data, groups it by the two discrete values of gender (male and female), and creates a variable called n that counts the rows in each group using n(). As a side note, group_by() is usually used on discrete variables. Its closest equivalents in Stata are by and over.
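As a minimal sketch of the same pattern, the example below counts rows per group on a small made-up data frame (toy is a hypothetical stand-in, not the nshap data). It also shows dplyr's count(), which gives the same counts as the group_by() and summarise() pair in one step.

```r
library(dplyr)

# Toy data standing in for nshap (values are made up for illustration)
toy <- data.frame(gender = c("female", "male", "female", "female", "male"))

# Long form: group the rows, then count the rows in each group
by_group <- toy %>%
  group_by(gender) %>%
  summarise(n = n())

# Shorthand: count() does the grouping and the counting in one step
by_count <- toy %>%
  count(gender)

print(by_group)
print(by_count)
```

The long form is worth knowing because summarise() can compute other statistics, such as a mean, in the same call.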
2.1 Age and gender
Let’s look at age and gender.
age_gender <- nshap %>%
select(age, gender) %>%
mutate(age = as.numeric(age))
head(age_gender)
## # A tibble: 6 x 2
## age gender
## <dbl> <chr>
## 1 79 female
## 2 84 male
## 3 83 male
## 4 83 female
## 5 62 female
## 6 76 male
Here, only the variables age and gender are selected, and age is then converted into a numeric variable. This reduced data of two variables is stored in an object called age_gender. Then head() shows the first 6 rows of the age_gender dataframe. Essentially, R dropped all variables except these two and saved the result as age_gender.
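The select-then-convert step can be tried on a small made-up data frame. This sketch assumes age arrives as text, which is one common reason to call as.numeric(). The names toy and age_gender_toy and all values are hypothetical.

```r
library(dplyr)

# Toy data: age is stored as text, as it might be after an import
toy <- data.frame(
  id     = 1:3,
  age    = c("79", "84", "62"),
  gender = c("female", "male", "female")
)

age_gender_toy <- toy %>%
  select(age, gender) %>%          # keep only these two columns
  mutate(age = as.numeric(age))    # convert the text to numbers

str(age_gender_toy)
```

The original toy data frame is unchanged. Only the new object holds the reduced, converted data.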
2.2 arrange() (Stata: sort)
From this data, we can see how many respondents there are for each combination of age and gender. Let’s count them up and sort them in descending order.
age_gender %>%
group_by(age, gender) %>%
summarise(n = n()) %>%
arrange(desc(n))
## # A tibble: 125 x 3
## # Groups: age [68]
## age gender n
## <dbl> <chr> <int>
## 1 68 female 104
## 2 54 female 98
## 3 68 male 96
## 4 67 female 94
## 5 69 female 89
## 6 73 female 88
## 7 53 female 77
## 8 55 female 76
## 9 59 female 76
## 10 69 male 76
## # ... with 115 more rows
Interestingly enough, there seem to be quite a few respondents in their late 60s. We’ll investigate this further in part 4, where we visualize our data.
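The "# Groups: age [68]" line in the output above appears because summarise() removes only the last grouping variable (gender), so the result is still grouped by age. The sketch below, on made-up data with hypothetical names, repeats the count-and-sort pattern and adds ungroup() to return a plain, ungrouped table.

```r
library(dplyr)

# Toy data (made up): six respondents with an age and a gender
toy <- data.frame(
  age    = c(68, 68, 68, 54, 54, 67),
  gender = c("female", "female", "male", "female", "female", "male")
)

counts <- toy %>%
  group_by(age, gender) %>%
  summarise(n = n()) %>%     # result is still grouped by age
  ungroup() %>%              # drop the leftover grouping
  arrange(desc(n), age)      # largest counts first; ties broken by age

print(counts)
```

Adding a second sort key, as with age here, makes the order of tied counts predictable.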
2.3 head() (Stata: list in 1/6)
The variable par_nerve seems interesting: it records, as discrete values, how often respondents find their “partner to get on their nerves.” Let’s see how the data looks:
nerve <- nshap %>%
group_by(par_nerve, gender) %>%
summarise(n = n()) %>%
arrange(desc(n))
head(nerve)
## # A tibble: 6 x 3
## # Groups: par_nerve [3]
## par_nerve gender n
## <chr> <chr> <int>
## 1 not applicable female 921
## 2 some of the time female 842
## 3 some of the time male 715
## 4 hardly ever or rarely male 639
## 5 hardly ever or rarely female 477
## 6 not applicable male 359
There is quite a bit of missing data: 921 “not applicable” responses from females and 359 from males. Setting those aside, the most common answer for both genders is “some of the time.”
2.4 Ascending order
Let’s sort in ascending order and also order the discrete values as we see fit. (As an aside, head(., 10) shows 10 rows instead of the default 6.) Instead of sorting by the count n in descending order, let’s sort by the par_nerve variable after assigning an order (“levels”) to its values.
nerve$par_nerve <- ordered(nerve$par_nerve, levels=c("often",
"some of the time",
"hardly ever or rarely",
"never",
"refused",
"don't know"))
nerve %>%
arrange(-desc(par_nerve))
## # A tibble: 14 x 3
## # Groups: par_nerve [7]
## par_nerve gender n
## <ord> <chr> <int>
## 1 often female 247
## 2 often male 139
## 3 some of the time female 842
## 4 some of the time male 715
## 5 hardly ever or rarely male 639
## 6 hardly ever or rarely female 477
## 7 never male 271
## 8 never female 157
## 9 refused male 3
## 10 refused female 2
## 11 don't know male 3
## 12 don't know female 2
## 13 <NA> female 921
## 14 <NA> male 359
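The code above replaces the par_nerve variable in the nerve dataframe. The function ordered() converts it into an ordered factor, where the levels argument specifies the order. The c() function creates a vector, in this case a vector of six strings. Any value not listed in levels, here “not applicable,” becomes NA, which is why the last two rows show <NA>. A little trick: -desc() sorts in ascending order (there is no asce). That said, arrange() already sorts in ascending order by default, so arrange(par_nerve) gives the same result.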
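A minimal sketch, on made-up data with hypothetical names, of how ordered() and arrange() interact: rows are sorted by the position of each value in levels rather than alphabetically, and a value left out of levels turns into NA and sorts last.

```r
library(dplyr)

# Toy data (made up): four answers with a count each
toy <- data.frame(
  answer = c("never", "often", "not applicable", "some of the time"),
  n      = c(4, 2, 9, 7)
)

# "not applicable" is not listed in levels, so it becomes NA
toy$answer <- ordered(toy$answer,
                      levels = c("often", "some of the time", "never"))

sorted <- toy %>%
  arrange(answer)   # ascending by level order; NA rows go last

print(sorted)
```

If the unlisted category should be kept, one option is to add it to levels at the position where it should sort.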
2.5 More head(), variable type, and mean
Let’s take a look at household earnings, age, gender, and education.
hage <- nshap %>% # hearn, age, gender, education
select(hearn, age, gender, educ) %>%
mutate(hearn = as.numeric(hearn),
age = as.numeric(age))
head(hage, 10)
## # A tibble: 10 x 4
## hearn age gender educ
## <dbl> <dbl> <chr> <chr>
## 1 16000 79 female voc cert/some college/assoc
## 2 34000 84 male voc cert/some college/assoc
## 3 50000 83 male bachelors or more
## 4 25000 83 female voc cert/some college/assoc
## 5 30000 62 female < hs
## 6 NA 76 male bachelors or more
## 7 95000 79 male hs/equiv
## 8 NA 64 female hs/equiv
## 9 NA 78 male < hs
## 10 NA 80 female hs/equiv
mean(hage$hearn)
## [1] NA
First, the code converts hearn and age into numeric variables, similar to destring in Stata. Also, since there is missing data (represented as “NA” in R), mean() refuses to calculate the mean and returns NA. There are a couple of options in this situation.
- Don’t count the missing values when calculating the mean
mean(hage$hearn, na.rm = TRUE)
## [1] 73746.95
- Remove the rows with missing values
hage <- na.omit(hage)
mean(hage$hearn)
## [1] 73746.95
Notice the difference between na.rm and na.omit(). Essentially, na.rm is an argument that tells mean() to skip NA values in that one calculation, while na.omit() is a function that removes every row containing an NA value in any column. The first method is more surgical while the second is more invasive, but the second method will be used in part 4 for graphing.
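The two means above happen to agree, but they need not. The sketch below uses a made-up data frame (toy and its values are hypothetical) where a row has a valid hearn but a missing age. na.omit() drops that row too, so the two approaches give different means.

```r
# Toy data (made up): row 2 is missing hearn, row 3 is missing age
toy <- data.frame(
  hearn = c(16000, NA, 50000, 30000),
  age   = c(79, 84, NA, 62)
)

# na.rm = TRUE skips NA only inside this one calculation; toy is unchanged
mean_rm <- mean(toy$hearn, na.rm = TRUE)   # uses 16000, 50000, 30000

# na.omit() drops every row that has an NA in any column
toy_complete <- na.omit(toy)               # keeps rows 1 and 4 only
mean_omit <- mean(toy_complete$hearn)      # uses 16000, 30000

print(c(mean_rm = mean_rm, mean_omit = mean_omit))
```

When only one variable matters for a calculation, na.rm avoids discarding rows that are complete for that variable.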