Part 3 T-test

3.1 What is the Student’s t-test?

The t-test usually tests whether the mean of a variable differs between two groups at a chosen significance level (usually 5%). For example, is the difference in mean height between males and females statistically significant? The t-test has many other uses as well, including post-estimation checks that validate or support results. For this exercise, the focus is on a common preliminary test: comparing means between two groups.

Why/when would I (want to) use the t-test?

To return to the example above, a researcher might be interested in whether there is a height difference between males and females in our NSHAP data. For the sake of example, let’s pretend that we don’t know the answer and start by checking some basic stats. Try the command svy: mean height, over(gender), which shows the mean height by gender while accounting for survey clusters, strata, and weights. Here is the output:

. svy: mean height, over(gender)
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =      50        Number of obs   =      3,255
Number of PSUs   =     100        Population size = 3,252.5777
                                  Design df       =         50

         male: gender = male
       female: gender = female

--------------------------------------------------------------
             |             Linearized
        Over |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
height       |
        male |   68.89757   .1366571      68.62308    69.17205
      female |   62.87602   .1123597      62.65034     63.1017
--------------------------------------------------------------

For now, let’s pretend that we calculated only the means and not the standard errors or the confidence intervals. There appears to be a difference of about 6 inches in mean height, accounting for clusters, strata, and weights. Concluding that “because there is a 6-inch difference in mean height, males tend to be taller” is not very convincing on the basis of the means alone. Is the sample large enough? Is the difference significant? To draw a crude analogy, whether $1,000 is a significant amount depends on whether it goes to a rich household or a poor one: the same difference can matter a lot or very little depending on the variation around it.

However, before we carry out our t-test, we have to make sure that we meet some conditions and, Houston, we have a problem.

The first condition is that there are no outliers. The command sum height or hist height shows that the minimum height is 5.5 inches for one observation. The 1.5×IQR rule can also be used to detect outliers. Either way, a value like this is a problem when performing a t-test. A researcher would typically drop this outlier or correct it.
The second condition is that the data are approximately normally distributed. You can test this with the command swilk height or sfrancia height. The null hypothesis of both tests is that the data are normal, so a Prob>z below the chosen significance level (for example, 0.05) is evidence against normality, and a larger value means normality is not rejected. An alternative is to eyeball a histogram. We seem to be fine on this condition.
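
The two checks above can be sketched in a short do-file. This is a minimal sketch, assuming the NSHAP data are already loaded; the names lo_fence, hi_fence, and height_outlier are hypothetical and not part of the dataset. Note that these diagnostics are unweighted and do not account for the survey design.

```stata
* Summary statistics with percentiles; the minimum reveals implausible values.
summarize height, detail

* 1.5 x IQR fences, computed from the quartiles that summarize leaves in r().
scalar lo_fence = r(p25) - 1.5*(r(p75) - r(p25))
scalar hi_fence = r(p75) + 1.5*(r(p75) - r(p25))

* Flag observations outside the fences (height_outlier is a hypothetical name).
* The if qualifier keeps missing heights from being flagged.
generate byte height_outlier = (height < scalar(lo_fence) | height > scalar(hi_fence)) if !missing(height)
list height gender if height_outlier == 1

* Normality checks on the unflagged observations.
sfrancia height if height_outlier == 0
histogram height if height_outlier == 0, normal
```

Flagging rather than dropping keeps the original data intact, so the researcher can still decide whether to correct or exclude each flagged value. The sketch uses sfrancia rather than swilk because, to my knowledge, swilk is limited to samples of at most 2,000 observations, which is fewer than the 3,255 here.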

Researchers can check the two conditions above in Stata. The conditions below depend on the research design, so the researcher has to exercise due diligence there.

The variable of interest should be continuous (height in our case), and the group variable must be binary, allowing only two groups. However, researchers commonly force a group variable to be binary by converting a continuous variable into a binary one. For example, to convert age into a binary variable, we can use the commands gen age_bin = 1 if age > 20 & !missing(age) and replace age_bin = 0 if age <= 20. A new variable name (age_bin here) is needed because age already exists, and the !missing(age) condition matters because Stata treats missing values as larger than any number.
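
The same conversion can be written in one step. This is a sketch only; age_over20 is a hypothetical variable name, and the cutoff of 20 simply follows the example in the text.

```stata
* The logical expression evaluates to 1 when true and 0 when false.
* The if qualifier leaves the new variable missing wherever age is missing.
generate byte age_over20 = (age > 20) if !missing(age)

* Confirm the coding: 1 = older than 20, 0 = 20 or younger, . = age missing.
tabulate age_over20, missing
```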

3.2 The unfortunate side of t-test for survey data

The problem with the t-test for this data is that Stata’s ttest command is not compatible with survey data. This means that we cannot use the command svy: ttest.... The command ttest height, by(gender) is available, but it would not account for clusters, strata, and weights. However, we can use the regress (or reg) command to test the difference in height! This works because the gender variable is binary: OLS then compares only the two values of gender, so the slope coefficient is the difference in mean height between the two groups. So let’s try running the command svy: reg height gender.

. svy: reg height gender
(running regress on estimation sample)

Survey: Linear regression

Number of strata   =        50                  Number of obs     =      3,255
Number of PSUs     =       100                  Population size   = 3,252.5777
                                                Design df         =         50
                                                F(   1,     50)   =    1221.10
                                                Prob > F          =     0.0000
                                                R-squared         =     0.4118

------------------------------------------------------------------------------
             |             Linearized
      height |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      gender |  -6.021551   .1723185   -34.94   0.000    -6.367663   -5.675439
       _cons |   74.91912   .2900223   258.32   0.000     74.33659    75.50165
------------------------------------------------------------------------------

The coefficient on gender is the difference in mean height between the two values of the gender variable, which is indeed about 6 inches. Its negative sign means that the group with the higher gender code (female, judging by the means above) is shorter. The values of interest here are t and P>|t|, which show that the height difference is statistically significant at the 5% significance level. To interpret the rest of the results, look through the OLS tutorial.
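
To see how the regression reproduces the group means, the stored results can be inspected after estimation. The sketch below assumes that gender is coded 1 for male and 2 for female. That coding is not documented in this tutorial; it is inferred from the output, since 74.91912 - 6.021551 × 1 matches the male mean of 68.89757 and 74.91912 - 6.021551 × 2 matches the female mean of 62.87602.

```stata
svy: regress height gender

* Fitted mean height for each group (codes 1 and 2 are an assumption, see above).
display _b[_cons] + _b[gender]*1
display _b[_cons] + _b[gender]*2

* The t statistic is the coefficient divided by its linearized standard error.
display _b[gender] / _se[gender]

* Two-sided p-value using the design degrees of freedom stored in e(df_r).
display 2*ttail(e(df_r), abs(_b[gender] / _se[gender]))

* Factor-variable notation treats gender as categorical, so the constant
* becomes the mean of the base (lowest-coded) group.
svy: regress height i.gender
```

With the i.gender specification the coefficient and its t statistic are unchanged, but the constant is easier to read because it no longer refers to a hypothetical group coded 0.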