Part 3 T-test
3.1 What is the Student’s t-test?
The t-test usually tests whether the mean of a variable differs between two groups at a given level of statistical significance (usually 5%). For example, is the difference in mean height between males and females significant? The t-test has a huge number of other uses as well, including post-estimation tests for validating or supporting results. For this exercise, the focus will be on a common, preliminary test of means between two groups.
Why/when would I (want to) use the t-test?
To use the previously mentioned example, a researcher could be interested in whether there is a height difference between males and females in our NSHAP data. For the sake of example, let’s pretend that we don’t know the answer and start by checking some basic stats. Try the command svy: mean height, over(gender), which shows the mean height by gender accounting for survey clusters, strata, and weights. Here is the output:
. svy: mean height, over(gender)
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =      50       Number of obs   =      3,255
Number of PSUs   =     100       Population size = 3,252.5777
                                 Design df       =         50

         male: gender = male
       female: gender = female

--------------------------------------------------------------
             |             Linearized
        Over |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
height       |
        male |   68.89757   .1366571      68.62308    69.17205
      female |   62.87602   .1123597      62.65034     63.1017
--------------------------------------------------------------
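The svy: prefix used above only works once the survey design has been declared with svyset. A minimal sketch of that step follows; psu_var, strata_var, and weight_var are hypothetical placeholders, not the actual NSHAP variable names, so substitute the design variables documented for your copy of the data.

```stata
* Minimal sketch, assuming the data are already loaded in memory.
* psu_var, strata_var, and weight_var are HYPOTHETICAL names:
* replace them with the real cluster, stratum, and weight variables.
svyset psu_var [pweight = weight_var], strata(strata_var)

* Typing svyset with no arguments shows the design Stata has stored.
svyset

* Summarize strata and PSUs to confirm the design looks as expected.
svydescribe
```

If the design is declared correctly, the header of any svy: command should report the same number of strata and PSUs as the output above.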
For now, let’s pretend that we only calculated the means and not the standard errors or the confidence intervals. It seems that there is a difference of about 6 inches in mean height, accounting for the clusters, strata, and weights. Concluding that “because there is a 6-inch difference in mean height, males tend to be taller” might not be very convincing with just the means. Do we have a large enough sample? Is the difference significant? To draw a crude analogy, is a gift of $1,000 equally significant to a rich household and to a poor household?
However, before we carry out our t-test, we have to make sure that we meet some conditions and Houston, we have a problem.
The first condition is that there are no outliers. Running the command sum height or hist height shows that the minimum height is 5.5 inches for one observation. The 1.5×IQR rule can also be used to detect outliers. Either way, this is a problem when performing a t-test. A researcher would typically get rid of this outlier or correct it.
The second condition is that the data should be approximately normally distributed. You can test this with the command swilk height or sfrancia height. Both tests take normality as the null hypothesis, so a Prob>z below the researcher's preferred significance level (for example, 0.05) is evidence against normality, while a larger value means that normality is not rejected. An alternative is to eyeball it through a histogram. It seems like we are ok with this condition.
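The outlier and normality checks described above can be sketched as a short do-file. This is only an illustration, assuming the data are loaded and height is measured in inches; the scalar names and the flag variable height_out are hypothetical, and the quartiles here are unweighted.

```stata
* Minimal sketch: flag outliers in height with the 1.5 x IQR rule.
* h_q1, h_q3, h_iqr, and height_out are HYPOTHETICAL names.
quietly summarize height, detail
scalar h_q1  = r(p25)
scalar h_q3  = r(p75)
scalar h_iqr = h_q3 - h_q1

* Stata treats missing values as larger than any number,
* so restrict the flag to nonmissing heights.
generate byte height_out = (height < h_q1 - 1.5*h_iqr | height > h_q3 + 1.5*h_iqr) if !missing(height)

* Count the flagged observations and inspect their values.
tabulate height_out
list height if height_out == 1

* Normality checks: both tests take normality as the null hypothesis,
* so a small Prob>z is evidence AGAINST normality.
swilk height
sfrancia height

* Visual check: histogram with a normal density overlaid.
histogram height, normal
```

Stata's documentation gives a recommended sample-size range for each of the two tests, so with roughly 3,255 observations it is worth confirming which one applies before relying on the result.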
Researchers can check the conditions above in Stata. For the conditions below, the researcher has to exercise due diligence in the research design.
- The variable of interest should be a continuous variable (height in our case), and our group variable must be binary, therefore only allowing two groups. However, researchers use a common trick to force a group variable to be binary: converting a continuous variable into a binary variable. For example, to convert age into a binary variable, we can use the command gen age_bin = 1 if age > 20 & !missing(age) and then replace age_bin = 0 if age <= 20. The result goes into a new variable (here called age_bin) because gen cannot overwrite an existing variable, and the !missing(age) condition keeps missing ages from being coded as 1.
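The age example can also be written in one step. The following is a sketch only: age_bin and age_bin_lbl are hypothetical names, and the cutoff of 20 is simply the one used in the text.

```stata
* Minimal sketch: build a 0/1 indicator from a continuous variable.
* age_bin and age_bin_lbl are HYPOTHETICAL names.
* (age > 20) evaluates to 1 or 0; the if qualifier leaves the
* indicator missing wherever age itself is missing.
generate byte age_bin = (age > 20) if !missing(age)

* Attach labels so that output is readable.
label define age_bin_lbl 0 "20 or younger" 1 "older than 20"
label values age_bin age_bin_lbl

* Check the split, including any missing values.
tabulate age_bin, missing
```

Keeping the original age variable intact makes it easy to try a different cutoff later.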
3.2 The unfortunate side of t-test for survey data
The problem with using the t-test on this data is that Stata's ttest command does not work with the svy: prefix. This means that we cannot use the command svy: ttest.... The command ttest height, by(gender) is available, but this would not account for clusters, strata, and weights. However, we can use the regress (or reg) command to test the difference in height! This works because the gender variable is binary, so OLS only compares the two different values of the gender variable. So let’s try running the command svy: reg height gender.
. svy: reg height gender
(running regress on estimation sample)

Survey: Linear regression

Number of strata   =        50                  Number of obs     =      3,255
Number of PSUs     =       100                  Population size   = 3,252.5777
                                                Design df         =         50
                                                F(   1,     50)   =    1221.10
                                                Prob > F          =     0.0000
                                                R-squared         =     0.4118

------------------------------------------------------------------------------
             |             Linearized
      height |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      gender |  -6.021551   .1723185   -34.94   0.000    -6.367663   -5.675439
       _cons |   74.91912   .2900223   258.32   0.000     74.33659    75.50165
------------------------------------------------------------------------------
The coefficient on gender is the height difference between the male and female values of the gender variable, which does indeed seem to be about 6 inches. The values of interest here are t and P>|t|, which show that the height difference is statistically significant at the 5% significance level. To interpret the rest of the results, look through the OLS tutorial.
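To make the link between the regression table and the group means concrete, the numbers can be checked by hand in Stata. The sketch below has two parts. The first is plain arithmetic on the printed values and assumes gender is coded 1 = male and 2 = female, which is an inference from the two outputs rather than something the data documentation confirms here. The second is a cross-check with lincom after svy: mean; the coefficient names used there are an assumption based on the older output style shown in this tutorial, and newer Stata releases name them differently.

```stata
* Part 1: arithmetic on the numbers printed in the regression table.

* t statistic = coefficient / standard error (about -34.94).
display -6.021551 / .1723185

* Two-sided p-value using the 50 design degrees of freedom.
display 2 * ttail(50, abs(-6.021551 / .1723185))

* Two-sided 5% critical value for 50 df (about 2.01).
display invttail(50, 0.025)

* Fitted group means, ASSUMING gender is coded 1 = male, 2 = female.
* These should land near 68.90 and 62.88, the svy: mean estimates.
display 74.91912 + (-6.021551) * 1
display 74.91912 + (-6.021551) * 2

* Part 2: cross-check the difference directly from svy: mean.
* Requires the data in memory and the survey design already svyset.
svy: mean height, over(gender)

* List the stored coefficient names before referring to them.
matrix list e(b)

* ASSUMPTION: the names below match what e(b) shows. Edit them if
* your Stata version labels the group means differently.
lincom [height]male - [height]female
```

Because the coefficient is the female mean minus the male mean under this coding, it is negative, while the lincom difference above is male minus female and should come out positive with the same magnitude.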