Part 4 OLS

4.1 What is OLS?

OLS stands for ordinary least squares; it fits a line through data points, like the picture above, by minimizing the sum of squared vertical distances between the points and the line. A basic OLS regression is a good preliminary step in a statistical analysis. It quickly gives an approximate picture of how much impact one variable has on another (magnitude) and whether the relationship between them is positive or negative. However, plain OLS estimates should not be treated as the final step of a research project.

4.2 Simple example — the math step by step

As an example, let’s look at just 2 variables for now: \(x\) and \(y\). The regression equation is \(y_i = \beta _0 + \beta _1 x _i + \varepsilon _i\). In order to find out the impact of \(x\) on \(y\), we have to estimate \(\beta _1\) using OLS. The formula for that is \(\hat\beta _1 = \frac{cov(x,y)}{var(x)}\). Another way of writing this is \(\hat\beta _1 = \frac{\sum _i (x _i - \bar{x})(y _i - \bar{y})}{\sum _i (x _i - \bar{x})^2}\). Watch this super simple, 8-minute video on how to get the \(\hat\beta _1\) by hand: https://www.youtube.com/watch?v=JvS2triCgOY
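To see the slope formula at work, here is a small worked example. The four data points are made up for illustration only and do not come from the survey. The slope is the sum of the cross-products of the deviations divided by the sum of the squared deviations of \(x\). Take \(x = (1, 2, 3, 4)\) and \(y = (2, 3, 5, 6)\), so \(\bar{x} = 2.5\) and \(\bar{y} = 4\).

\[
\sum _i (x _i - \bar{x})(y _i - \bar{y}) = (-1.5)(-2) + (-0.5)(-1) + (0.5)(1) + (1.5)(2) = 7
\]

\[
\sum _i (x _i - \bar{x})^2 = 2.25 + 0.25 + 0.25 + 2.25 = 5
\]

\[
\hat\beta _1 = \frac{7}{5} = 1.4, \qquad \hat\beta _0 = \bar{y} - \hat\beta _1 \bar{x} = 4 - 1.4 \times 2.5 = 0.5
\]

The fitted line for this toy data is therefore \(\hat{y} = 0.5 + 1.4x\): each 1-unit increase in \(x\) goes with a 1.4-unit increase in the predicted \(y\).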

4.3 In Stata

Since this is a survey with clusters, strata, and weights, let’s make sure we tell Stata that this is survey data. Check whether it is already set up using the command svyset.

. svyset

      pweight: weight_adj
          VCE: linearized
  Single unit: missing
     Strata 1: stratum
         SU 1: cluster
        FPC 1: <zero>

If the output does not look like this, then run the command svyset cluster [pw = weight_adj], str(stratum). This makes sure that our regression accounts for clusters, strata, and weights.

Now, on to OLS: when Stata runs an OLS regression, it reports many other statistics alongside the coefficients. For the sake of example, let the \(y\) variable be hearn (household earning) and \(x\) be age. Let’s take a look at the output of the survey regression and talk about a few commonly discussed values. Note that svy: is a prefix command that tells Stata to account for clusters, strata, and weights.

. svy: reg hearn age
(running regress on estimation sample)

Survey: Linear regression

Number of strata   =        50                  Number of obs     =      2,423
Number of PSUs     =       100                  Population size   = 2,467.7556
                                                Design df         =         50
                                                F(   1,     50)   =      29.78
                                                Prob > F          =     0.0000
                                                R-squared         =     0.0226

------------------------------------------------------------------------------
             |             Linearized
       hearn |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |  -1955.745   358.3929    -5.46   0.000    -2675.598   -1235.891
       _cons |   202385.9   27774.05     7.29   0.000     146600.1    258171.7
------------------------------------------------------------------------------

Starting with the equation, our \(y\) variable is hearn, while our \(x\) variable is age. So our OLS formula would be \(hearn = \beta _0 + \beta _1 age + \varepsilon\).

From the output, the estimate of \(\beta _1\) is -1955.745. This shows that each 1-unit (year) increase in age is associated with a decrease in household earning of 1955.745 units (dollars) on average. (Remember the ages of the survey sample!)

Though not commonly discussed, for the sake of interpretation, _cons is our \(\beta _0\). This means that if age were 0, the predicted mean hearn would be 202385.9.
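Because nobody in an earnings survey is aged 0, it can be more informative to evaluate the fitted line at a realistic age. The sketch below assumes it is run right after svy: reg hearn age, so that the stored coefficients _b[_cons] and _b[age] are available; the age of 40 is an arbitrary value chosen for illustration.

```stata
* run right after: svy: reg hearn age
* predicted mean household earning at age 40 (40 is an illustrative choice)
display _b[_cons] + _b[age] * 40

* same quantity, but with a standard error and confidence interval
lincom _cons + 40 * age
```

With the coefficients reported above, this is 202385.9 - 1955.745 × 40, or roughly 124,156 dollars.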

Since the 95% confidence interval for age does not include 0, the coefficient is statistically significant at the 5% significance level. Similarly, P>|t| is smaller than 0.05, which also tells us that the coefficient is statistically significant at the 5% significance level.
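The t statistic, p-value, and confidence interval for age can be reproduced by hand from the coefficient and its standard error. This is a minimal sketch that plugs in the numbers from the table above and assumes the reference distribution is a t distribution with the 50 design degrees of freedom shown in the output; ttail() and invttail() are Stata's upper-tail t functions.

```stata
* t statistic: coefficient divided by its standard error
display -1955.745 / 358.3929

* two-sided p-value with 50 design degrees of freedom
display 2 * ttail(50, abs(-1955.745 / 358.3929))

* two-sided 5% critical value
display invttail(50, 0.025)

* lower and upper bounds of the 95% confidence interval
display -1955.745 - invttail(50, 0.025) * 358.3929
display -1955.745 + invttail(50, 0.025) * 358.3929
```

Up to rounding, the results should agree with the t, P>|t|, and [95% Conf. Interval] columns of the regression table.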

R-squared looks extremely low, but that is expected since age is our only regressor. As you add more \(x\) variables, the specification will explain more of the variation in the \(y\) variable, so R-squared will not decrease and will typically increase. The value 0.0226 tells us that age explains about 2.26% of the variance in hearn.

4.4 Handling discrete variables

If the \(y\) variable is a discrete variable, then OLS is generally not the appropriate estimator. However, OLS works fine when the \(x\) variables are discrete (categorical). In order to estimate with a categorical \(x\) variable, the prefix i. has to be attached to it, so that Stata treats it as a set of indicator variables instead of a continuous number. An example Stata command would be reg y i.x1 i.x2 x3 x4 i.x5.
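Applied to the survey regression from the previous section, this could look like the sketch below. The variable region is hypothetical (it is not part of the output shown earlier) and stands in for any categorical variable in the data set.

```stata
* region is a hypothetical categorical variable, used only for illustration
svy: reg hearn age i.region

* pick a different base (omitted) category, here the one coded 2;
* this assumes that region actually takes the value 2
svy: reg hearn age ib2.region
```

Stata leaves out one category as the base (by default the one with the lowest value), and each reported coefficient for the remaining categories is the difference in mean hearn relative to that base, holding age fixed.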