A Response rate prediction

A.1 The goal

As of writing this tutorial, we have up to NSHAP wave 3. Using these W2 and W3 databases, we will:

  1. Find what proportion/probability of respondents from wave 2 completes the NSHAP questionnaire or not and

  2. See how well this probability carries over to wave 3 to determine if it is usable for wave 4

The main strength of this prediction technique is that the conditional probabilities can be easily transferred over to different databases (such as the NIH’s) because of its simple approach. Simply put, the question is “given certain characteristics, what is the probability that this person will complete the NSHAP survey?”

A.1.1 Notes

This statistical technique was used by Jin-Kyu Choi/Jack Coghlan for Colm’s research in SAS. The rest of the section will show a reduced version of their work. R script that combines all the code chunks is provided at the end.


A.2 The set up with W2

First, we have to find the conditional probabilities for W2. The wave 2 data is P:\5707\5707B\Administrative\Research Assistant Essentials\NSHAP Stata Training Materials\R tutorial winter 2019\nshap_w3_core.xlsx (W3 data is the same as before, but this will be covered later). Let’s import that first.

#w2 <- read_excel("P:/5707/5707B/Administrative/Research Assistant Essentials/NSHAP Stata Training Materials/R tutorial winter 2019/nshap_w2_core.xlsx")

w2 %>% 
    select(su_id, hh_id, fi_id, path, surveytype, category2, ageelig, partner, version2) %>% 
    head()
## # A tibble: 6 x 9
##   su_id  hh_id  fi_id path  surveytype category2   ageelig partner version2
##   <chr>  <chr>  <chr> <chr> <chr>      <chr>       <chr>   <chr>   <chr>   
## 1 10000~ 10000~ 1805~ 6     W1 NIR     partner in~ 1       0       partner 
## 2 10000~ 10000~ 4359~ 6     partner    partner in~ 1       1       partner 
## 3 10000~ 10000~ 6605~ 4     W1 R       partner in~ 1       0       partner 
## 4 10000~ 10000~ 6605~ 4     partner    partner in~ 1       1       partner 
## 5 10000~ 10000~ 3958~ 4     W1 R       not applic~ 1       0       no part~
## 6 10000~ 10000~ 4153~ 3     W1 R       partner in~ 1       0       partner

There are whopping 889 variables! For demonstration purposes, let’s reduce it down to gender and race for personal characteristics. As for defining what constitutes as “completing the questionnaire” in this case, let’s simply define that if we have invalid data for 150 questions or more, it will count as an incomplete; this means that since there are 889 variables, they should have 150 or fewer “not applicable” data points per row. Let’s create a variable to count the number of “not applicable” (not “NA” in R, but as string “not applicable” from the Excel file) for each row first. After that, we can create a binary variable that either meets or does not meet threshold.

w2 <- w2 %>% 
    mutate_all(as.character)

w2_completes <- w2 %>%
    mutate(count_na = rowSums(. == "not applicable")) %>%
# . or the dot works with %>% or the "pipe" to call w2 into the function;
# essentially, it is equivalent to rowSums(w2 == "not applicable").
    mutate(completed = (count_na < 150))

# Let's see how many completes and incompletes there are
w2_completes %>% 
    group_by(completed) %>% 
    summarise(n = n())
## # A tibble: 3 x 2
##   completed     n
##   <lgl>     <int>
## 1 FALSE      1706
## 2 TRUE       1659
## 3 NA           12

Note: converting the data as string or “character” is not best practice, but easier to work with for demonstration purposes. This solves couple of things. 1) it keeps leading zeroes in ID variables (though that should already be the case when importing the data), and 2) it makes upcoming string comparison code easier to work.

Now, let’s reduce the number of variables and remove w2 and w2_completes dataframes for efficiency.

w2_comp_selected <- w2_completes %>%
    select(gender, race, completed)

#rm(w2, w2_completes)

head(w2_comp_selected)
## # A tibble: 6 x 3
##   gender race            completed
##   <chr>  <chr>           <lgl>    
## 1 female white/caucasian TRUE     
## 2 male   white/caucasian TRUE     
## 3 male   white/caucasian TRUE     
## 4 female white/caucasian TRUE     
## 5 female white/caucasian FALSE    
## 6 male   white/caucasian FALSE

We are all set to go for wave 2! Using this dataframe, you can now generate conditional probabilities of completing the survey, where the conditions are given gender and race variables.


A.3 Generating conditional probabilities for W2

So, which cohorts are likely to complete our questionnaire? Let’s first divide the number of people that completed the survey by the total number of people to get the probability per gender per race cohorts.

w2_probs <- w2_comp_selected %>%
    group_by(gender, race) %>%
    summarise(n = n(), completes = sum(completed, na.rm = T)) %>%
    mutate(p2 = completes / n)

w2_probs %>% 
    arrange(desc(p2)) %>%
    head()
## # A tibble: 6 x 5
## # Groups:   gender [2]
##   gender race                                  n completes    p2
##   <chr>  <chr>                             <int>     <int> <dbl>
## 1 female american indian or alaskan native    10         6 0.6  
## 2 female don't know                            5         3 0.6  
## 3 female asian or pacific islander            21        12 0.571
## 4 male   black/african american              217       113 0.521
## 5 female refused                               2         1 0.5  
## 6 female white/caucasian                    1402       696 0.496

Just to see what the data looks like, it seems that highest rates of response rates are from females in general, though there are concerns such as missing data.


A.4 W3

Now, all we have to do is replicate the previous code for setting up the data and generating probabilities. We can combine all the W2 code into one through continuous piping. Although it becomes unreadable since it is a giant block of text, since it is nearly identical code to the W2 code, there is no need to worry about readability.

w3 <- read_excel("P:/5707/5707B/Administrative/Research Assistant Essentials/NSHAP Stata Training Materials/R tutorial winter 2019/nshap_w3_core.xlsx")

w3 <- w3 %>% 
    mutate_all(as.character)

w3_probs <- w3 %>%
    mutate(count_na = rowSums(. == "not applicable")) %>%
    mutate(completed = (count_na < 150)) %>% 
    group_by(completed) %>%
    select(gender, race, completed) %>%
    group_by(gender, race) %>%
    summarise(n = n(), completes = sum(completed, na.rm = T)) %>%
    mutate(p3 = completes / n)

w3_probs %>% 
    arrange(desc(p3)) %>%
    head()
## # A tibble: 6 x 5
## # Groups:   gender [2]
##   gender race                                  n completes    p3
##   <chr>  <chr>                             <int>     <int> <dbl>
## 1 female don't know                            4         4 1    
## 2 male   american indian or alaskan native    19        11 0.579
## 3 male   other                               175        89 0.509
## 4 male   white/caucasian                    1566       753 0.481
## 5 male   black/african american              321       137 0.427
## 6 female white/caucasian                    1890       798 0.422

Interestingly enough, there seems to be higher rates of responses from males in this wave. Note that there are concerns such as missing data as mentioned before. Either way, now that the W2 and W3 datasets have very similar variables, we can now merge them!


A.5 Merging W2 and W3

If you are familiar with SQL or its variants, then this should look familiar. The tidyverse library comes with several different methods for dataframe joining, such as left_join, inner_join, etc.

df_w2_w3 <- w2_probs %>% 
    full_join(w3_probs, by = c("gender", "race"))

df_w2_w3 %>% 
    arrange(desc(p2 + p3)) %>%
    head()
## # A tibble: 6 x 8
## # Groups:   gender [2]
##   gender race                 n.x completes.x    p2   n.y completes.y    p3
##   <chr>  <chr>              <int>       <int> <dbl> <int>       <int> <dbl>
## 1 female don't know             5           3 0.6       4           4 1    
## 2 male   american indian o~    15           6 0.4      19          11 0.579
## 3 male   white/caucasian     1198         589 0.492  1566         753 0.481
## 4 female asian or pacific ~    21          12 0.571    61          24 0.393
## 5 female american indian o~    10           6 0.6      17           6 0.353
## 6 male   other                 84          37 0.440   175          89 0.509

In order to compare the probabilities between the waves, we can subtract their differences to see how they compare:

df_w2_w3 <- df_w2_w3 %>%
    mutate(diff = p2 - p3)

df_w2_w3 %>%
    select(-completes.x, -completes.y, -p2, -p3) %>% 
    arrange(desc(abs(diff))) %>% # absolute value
    head()
## # A tibble: 6 x 5
## # Groups:   gender [2]
##   gender race                                n.x   n.y    diff
##   <chr>  <chr>                             <int> <int>   <dbl>
## 1 female don't know                            5     4 -0.4   
## 2 female american indian or alaskan native    10    17  0.247 
## 3 female black/african american              300   470  0.185 
## 4 male   american indian or alaskan native    15    19 -0.179 
## 5 female asian or pacific islander            21    61  0.178 
## 6 male   black/african american              217   321  0.0939

Looking at the bigger sample size cohorts, it seems that the difference between the two waves are from 7 to 19 percentage points. Given that the sample sizes are 100 or more in each waves, median difference seems to be hovering somewhere around 8 percentage points.

A.6 Graphing

Now, in order to visualize the differences between the two waves, we can use the graphing techniques we learned from the previous section. Let’s visualize how different they are across race and gender.

race_diff <- df_w2_w3 %>%
    filter(race != "don't know" &
           race != "refused" &
           race != "not applicable") %>% 
    ggplot() +
    geom_bar(aes(y = abs(diff),
                   x = race),
             stat = "identity") +
    facet_wrap(~gender, ncol = 1) +
    dark_theme_bw() +
    scale_y_continuous(limits = c(0, 0.5)) +
    coord_flip() +
    ggtitle("Differences in response rates between wave 2 and 3") +
    ylab("Absolute value of the differences")

ggsave(filename = "race_vs_diff.png",
       plot = race_diff,
       width = 12.8,
       height = 7.2,
       dpi = 150)

knitr::include_graphics("race_vs_diff.png")

It seems that white/caucasian and male alongside asian or pacfic islander and male values are the most consistent between the two waves just by looking at this demonstration. This does not mean that they have the highest response rates, but rather it will (probably) be easier to predict whether they will respond or not, though this also depends on sample size. However, this simple conditional probability model can be expanded further by predicting weekly (and moving averages or smoothed means), adding more characteristics, diving the data by sample types, etc.



A.7 R script

rm(list=ls())

library(tidyverse)
library(readxl)

w2 <- read_excel("P:/5707/5707B/Administrative/Research Assistant Essentials/NSHAP Stata Training Materials/R tutorial winter 2019/nshap_w2_core.xlsx")

# wait a few seconds for it to load the data...

w3 <- read_excel("P:/5707/5707B/Administrative/Research Assistant Essentials/NSHAP Stata Training Materials/R tutorial winter 2019/nshap_w3_core.xlsx")


# W2

w2 <- w2 %>% 
    mutate_all(as.character)

w2_completes <- w2 %>%
    mutate(count_na = rowSums(. == "not applicable")) %>%
    mutate(completed = (count_na < 150))

w2_completes %>% 
    group_by(completed) %>% 
    summarise(n = n())

w2_comp_selected <- w2_completes %>%
    select(gender, age, race, completed)
    
w2_probs <- w2_comp_selected %>%
    group_by(gender, race) %>%
    summarise(n = n(), completes = sum(completed, na.rm = T)) %>%
    mutate(p2 = completes / n)

#rm(w2, w2_completes)

w2_comp_selected %>%
    group_by(gender, age, race, completed) %>%
    summarise(n = n(), p = sum(completed))


# W3

w3 <- w3 %>% 
    mutate_all(as.character)

w3_probs <- w3 %>%
    mutate(count_na = rowSums(. == "not applicable")) %>%
    mutate(completed = (count_na < 150)) %>% 
    group_by(completed) %>%
    select(gender, race, completed) %>%
    group_by(gender, race) %>%
    summarise(n = n(), completes = sum(completed, na.rm = T)) %>%
    mutate(p3 = completes / n)

rm(w3)


# Merge W2 and W3

df_w2_w3 <- w2_probs %>% 
    full_join(w3_probs, by = c("gender", "race"))

df_w2_w3 <- df_w2_w3 %>%
    mutate(diff = p2 - p3)

race_diff <- df_w2_w3 %>%
    filter(race != "don't know" & race != "refused" & race != "not applicable") %>% 
    ggplot() +
    geom_bar(aes(y = abs(diff),
                   x = race),
             stat = "identity") +
    facet_wrap(~gender, ncol = 1) +
    dark_theme_bw() +
    scale_y_continuous(limits = c(0, 0.5)) +
    coord_flip() +
    ggtitle("Differences in response rates between wave 2 and 3") +
    ylab("Absolute value of the differences")

ggsave(filename = "race_vs_diff.png",
       plot = race_diff,
       width = 12.8,
       height = 7.2,
       dpi = 150)

knitr::include_graphics("race_vs_diff.png")