The framework — introduction

In statistics, econometrics, psychometrics, etc., running a regression means inferring a variable of interest (say, earnings) from variables describing characteristics (age, race, etc.). In short, we can estimate how much you earn given your age, height, race, sex, marital status, etc., under the assumption that the model (AKA “specification”) and its assumptions are correctly set up. As an example, this is how it looks in math form at a quick glance:

\[ household\ earnings = \beta_0 + \beta_1 age + \beta_2 height + \cdots + \varepsilon \] Our left-hand side variable, earnings, is called the dependent variable, because its inference depends on the right-hand side variables, the independent variables. The \(\beta\) terms, called “beta” in this equation, are the coefficients (on the independent variables). The last term, \(\varepsilon\), or “epsilon” in this equation, is called the error term. Let’s quickly go through some alternative names.

The dependent variable is also called the \(y\) variable, the regressand, or the left-hand variable/side (sometimes shortened to LHS).

The independent variables are also called the \(x\) variables, regressors, or right-hand variables/side (RHS).

The coefficients are the slopes (assuming linearity) for each of the \(x\) variables: \(\beta_0\), or “beta naught,” is the intercept (the constant term), while \(\beta_1\), or “beta sub one” (sub meaning subscript), and so forth are the slopes. You may see other forms of this equation. Here is an example:

\[ y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \cdots + \beta_p x_{p,i} + \varepsilon_i \]

where \(i\) is the index, which is basically the row number in Stata, running from 1 to \(n\), the sample size (or how many rows of data there are). The numeric subscripts are another form of indexing, basically saying “the first \(x\) variable, the second \(x\) variable,” etc., usually running up to \(p\), the number of \(x\) variables used. Basically, the data used is \(n\) rows by \(p\) columns. To indicate the data point on the 5th row for the 2nd \(x\) variable (height from the previous equation), write \(x_{j,i}\) with \(i = 5\) and \(j = 2\), that is, \(x_{2,5}\). Note that there are other notations (commonly \(k\) instead of \(p\)).


0.1 Types of data (picking the next section to read)

Generally, the \(y\) variable dictates what kind of model one should use. The \(x\) variables do not dictate the model, but you do need to tell Stata what kind of variables they are; since that is the simpler part, let’s go over how to use them in Stata first.

If an \(x\) variable is ordinal, categorical, or binary, use the i. prefix. So if we were to keep the specification above, the Stata command would be reg hearn age height i.race i.gender i.maritlst. Here, Stata treats age and height as continuous variables, while race, gender, and marital status are treated as discrete variables.
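Written out as a command block (variable names as used in this tutorial; adapt them to your own dataset):

    . reg hearn age height i.race i.gender i.maritlst

The i. prefix tells Stata to expand race, gender, and marital status into indicator (dummy) variables behind the scenes, while age and height enter the regression as-is.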


The following part can help pick the next tutorial, though I would recommend reading the rest of this section before jumping ahead.

If the \(y\) variable is…

  • continuous → check out the OLS tutorial. Some example variables are hearn or age in the NSHAP data. The command for OLS in Stata is reg or regress.

  • binary/dummy → check out the logit tutorial. In NSHAP data, binary variables are typically yes or no questions, such as the variable college.

  • ordinal → check out the ologit tutorial. The variable attend, for example, is an ordinal variable.

  • categorical → check out the mlogit tutorial. In NSHAP data, race is a categorical variable. The difference between ordinal and categorical is that, unlike ordinal variables, categorical variables have no inherent ordering (and thus no minimum or maximum value), which makes numerical assignments arbitrary.
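As a quick sketch of the four cases above (using the NSHAP example variables; the right-hand-side covariates are illustrative):

    . regress hearn age i.race
    . logit   college age i.race
    . ologit  attend age i.race
    . mlogit  race age i.gender

Each command pairs the \(y\) variable’s type with its model; the right-hand side is specified the same way in all four.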

There is no count data in NSHAP, but in case there is some in the future, it may help to search online on how to use Poisson or negative binomial methods. Count data is basically “how many heads am I going to get out of 5 coin flips?” with different caveats for each statistical method.
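If count data does show up, a minimal sketch would look like this (the outcome variable numevents is hypothetical):

    . poisson numevents age i.race
    . nbreg   numevents age i.race

poisson assumes the mean and variance of the count are equal; nbreg (negative binomial) relaxes that assumption for overdispersed counts.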


0.2 Declaring survey data in Stata

Because NSHAP is a survey, it comes with several design and delivery considerations. The survey design team set up clusters and strata before sending out the interviewers. Two clusters are assigned to each stratum; think of this as having two comparison groups, similar to having treatment and control groups. The variable for the strata is stratum and the variable for the clusters is cluster.

The data also has weights, more specifically sampling weights. Use the variable weight_adj, which Stata should recognize under the pweight option.

In order for Stata to recognize that these three variables need to be considered, use the command svyset cluster [pweight=weight_adj], strata(stratum). To check whether they are (or were already) set, simply run svyset by itself. The output should look as follows:

. svyset

      pweight: weight_adj
          VCE: linearized
  Single unit: missing
     Strata 1: stratum
         SU 1: cluster
        FPC 1: <zero>
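Once the data are svyset, prefix estimation commands with svy: so Stata applies the clusters, strata, and weights, for example (reusing the earlier specification):

    . svy: regress hearn age height i.race i.gender i.maritlst

Without the svy: prefix, Stata ignores the declared design and treats the data as a simple random sample.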

For more information on clusters, strata, and weights, check out the weights tutorial and the 1. SurveyDesign.pdf slides.