The framework — introduction
In statistics, econometrics, psychometrics, etc., running a regression means inferring a variable of interest (say, earnings) from explanatory variables (age, race, etc.). In short, we can estimate how much of your earnings is attributable to your age, height, race, sex, marital status, etc., assuming the model (AKA “specification”) and its assumptions are correctly set up. As an example, this is how it looks in math form at a quick glance:
\[ household\ earnings = \beta_0 + \beta_1 age + \beta_2 height + \cdots + \varepsilon \] Our left-hand side variable, earnings, is called the dependent variable, because its inference depends on the right-hand side variables — the independent variables. The \(\beta\)s are called “betas” in this equation, also known as coefficients (on the independent variables). The last term, \(\varepsilon\), or “epsilon” in this equation, is called the error term. Let’s quickly go through some alternative names.
The dependent variable is also called the \(y\) variable, regressand, or left-hand variable/side (sometimes shortened to LHS).
The independent variables are also called the \(x\) variables, regressors, or right-hand variables/side (RHS).
The coefficients are the slopes (assuming linearity) for each of the \(x\) variables; \(\beta_0\), or “beta naught,” is the intercept rather than a slope, and the slopes start at \(\beta_1\), or “beta sub one” (sub meaning subscript), and so forth. You may see other forms of this equation. Here is an example:
\[ y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \cdots + \beta_p x_{i,p} + \varepsilon_i \]
where the subscript \(i\) is the index, which is basically the row number in Stata, running from 1 to \(n\), the sample size (or how many rows of data there are). The second subscript is another form of indexing, basically saying “the first \(x\) variable, the second \(x\) variable,” etc.; it is usually written \(j\), running from 1 to \(p\), the number of \(x\) variables used. Basically, the data used is \(n\) rows by \(p\) columns. The data point on the 5th row for the 2nd \(x\) variable (height from the previous equation) is \(x_{5,2}\), that is, \(i = 5\) and \(j = 2\). Note that there are other notations (commonly \(k\) instead of \(p\)).
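To make the indexing concrete, here is a minimal pure-Python sketch (not Stata, and the values are made up for illustration) that stores a small \(n \times p\) dataset and pulls out the observation at \(i = 5\), \(j = 2\):

```python
# A tiny made-up dataset: n = 5 rows (observations), p = 3 columns (x variables).
# Column order: age, height, race (all values invented for illustration).
X = [
    [61, 170, 1],
    [58, 165, 2],
    [72, 180, 1],
    [65, 158, 3],
    [70, 175, 2],
]

n = len(X)      # sample size (number of rows)
p = len(X[0])   # number of x variables (columns)

# The data point on the 5th row, 2nd x variable (height): i = 5, j = 2.
# Python lists are 0-indexed, so we subtract 1 from each subscript.
i, j = 5, 2
x_ij = X[i - 1][j - 1]
print(n, p, x_ij)  # 5 3 175
```

Stata, like the math notation, counts rows from 1, which is why the sketch subtracts 1 when translating the subscripts into Python.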
0.1 Types of data (picking the next section to read)
Generally, the \(y\) variable dictates what kind of model one should use. The \(x\) variables do not dictate the model, but you do need to tell Stata what kind of variables they are; that part is simpler, so let’s go over how to use them in Stata first.
If the \(x\) variable is ordinal, categorical, or binary, use the i. prefix. So if we were to keep the specification above, the Stata command would be reg hearn age height i.race i.gender i.maritlst. Here, Stata treats age and height as continuous variables, while race, gender, and marital status are treated as discrete variables.
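Conceptually, the i. prefix tells Stata to expand a discrete variable into indicator (dummy) columns, one per category, with one category left out as the base level. Here is a rough pure-Python sketch of that expansion (the race codes are made up, not NSHAP’s actual coding):

```python
# Made-up race codes for 5 respondents (1, 2, 3 are category labels).
race = [1, 2, 1, 3, 2]

# Expand into indicator columns, omitting the first (base) category,
# which is roughly what Stata's i.race does inside a regression.
categories = sorted(set(race))   # [1, 2, 3]
base = categories[0]             # category 1 serves as the base level
dummies = [[1 if r == c else 0 for c in categories if c != base]
           for r in race]
print(dummies)  # [[0, 0], [1, 0], [0, 0], [0, 1], [1, 0]]
```

Each remaining column answers “is this respondent in category \(c\)?”, so the coefficients on those columns are differences relative to the base category.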
The following part can help you pick the next tutorial, though I would recommend reading the rest of this section before jumping ahead.
If the \(y\) variable is…
- continuous → check out the OLS tutorial. Some example variables are hearn or age in the NSHAP data. The command for OLS in Stata is reg or regress.
- binary/dummy → check out the logit tutorial. In NSHAP data, binary variables are typically yes or no questions, such as the variable college.
- ordinal → check out the ologit tutorial. The variable attend, for example, is an ordinal variable.
- categorical → check out the mlogit tutorial. In NSHAP data, race is a categorical variable. The difference between ordinal and categorical is that unlike ordinal variables, categorical variables have no inherent ordering, which makes numerical assignments arbitrary.
There is no count data in NSHAP, but in case some appears in the future, it may help to search online for how to use Poisson or negative binomial methods. Count data is basically “how many heads am I going to get out of 5 coin flips?”, with different caveats for each statistical method.
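The coin-flip description can be simulated directly; this pure-Python sketch (not tied to NSHAP) generates count outcomes, the kind of \(y\) variable that Poisson and negative binomial models are built for:

```python
import random

random.seed(42)  # fix the seed so the draws are reproducible

# Each observation: number of heads out of 5 fair coin flips --
# a count outcome that can only take the values 0 through 5.
counts = [sum(random.random() < 0.5 for _ in range(5)) for _ in range(10)]
print(counts)
```

Notice the outcomes are non-negative integers; treating them as continuous with OLS would ignore that structure, which is what the count-data models address.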
0.2 Declaring survey data in Stata
Because NSHAP is a survey, it comes with many considerations on design and delivery. The survey design team set up clusters and strata before sending out the interviewers. There are two clusters assigned to each stratum; think of this as having two comparison groups, similar to having treatment and control groups. The variable for the strata is stratum and the variable for the clusters is cluster.
The data also has weights, more specifically sampling weights. Use the variable weight_adj, which Stata should recognize under the pweight option.
In order for Stata to take these three variables into account, use the command svyset cluster [pweight=weight_adj], strata(stratum). To check whether they are (or were already) set, simply run svyset by itself. The output should look as follows:
. svyset
pweight: weight_adj
VCE: linearized
Single unit: missing
Strata 1: stratum
SU 1: cluster
FPC 1: <zero>
For more information on clusters, strata, and weights, check out the weights tutorial and the 1. SurveyDesign.pdf slides.