A Structural equation modeling

A.1 What does SEM accomplish?

You can think of sem as reg (or OLS) with much more flexilibity (but also comes with depth and complexities), where gsem is the generalized version. It is good for handling unobservables, correlations, and coefficient constraints. As an example for unobservables, with given certain assumptions, an investigator might use multiple academic performance metrics to infer intelligence — the latent variable. Vanilla bivariate SEM will result in exactly the same results as bivariate OLS, but beyond bivariate specifications, SEM’s flexibility on how it treats covariances and autocorrelations for multiple variables makes it a useful modeling technique. On top of this, we can give the model constraints if we (happen to) know the (true) value of a coefficient or just to give it a quick test. Without a doubt, it requires strong assumptions especially because of its flexibilities, but gives a bit of leeway for OLS’ limitations.

In the above diagram, intelligence and academic performance are latent variables, while the boxes are observed variables.

Notes

This statistical technique was used by Boyan Zheng for Louise’s research. The rest of the section will show a reduced version of their work for demonstration purposes. Also, this method should not be confused with structural models in econometrics and economics.

Also note that in Stata’s documentation for SEM, the lingo for “path coefficient” means beta coefficient.

.do file that combines all the chunks of Stata commands is provided at the end.


A.2 What was their goal?

Their goal was to see what kind of relationship different variables had on husbands and wives. The following are the reduced version of their research questions.

  1. What kind of effects does wife + husband depression have on husband + wife social life?
  2. What kind of intra and inter relations can we draw from them given that they live in the same household?

In order to set this up, we have to set up a CESD scoring system, reshape the NSHAP data so the CESD/social activity are into separate variables for husbands and wives. As a side note, there are few non-heterosexual observations in the data.

If you would like to learn more about the scoring system, check out NSHAP’s own publication “Measuring Social Isolation among Older Adults Using Multiple Indicators from the NSHAP Study” by Cornwell and Waite in the journals of gerontonology series b of psychological sciences, volume 64B, supplment 1 (November 2009). It is freely available online and hardcopies are available freely in the NORC office! You can see that this tutorial’s method is somewhat similar if you take a quick look at the table in page i41.


A.3 The set up

A.3.1 CESD scoring

First, let’s create a score variable for all of the depression variables. Essentially, the scoring is set up as an ordinal variable for each of the depression variables, then all of the depression ordinal variables combined. There are definitely concerns with this, but will be (partially) addressed later. Let’s take a look at one of the CESD variables first, namely noteat.

. tab noteat

   cesd: did not feel like |
                    eating |      Freq.     Percent        Cum.
---------------------------+-----------------------------------
rarely or none of the time |      3,467       72.68       72.68
          some of the time |        741       15.53       88.22
              occasionally |        392        8.22       96.44
          most of the time |        170        3.56      100.00
---------------------------+-----------------------------------
                     Total |      4,770      100.00

Looking at the above output, this is what the scoring would translate to:

1: rarely or none of the time
2: some of the time
3: occasionally
4: most of the time

And similar story applies to rest of the CESD variables, fltdep, ftleff, nosleep, etc. Note that washapy and enjlife are in reverse order, so when we come to the part where we combine the CESD variables to create a score, make sure to change the two variables’ order to match the ordinality with rest of the CESD variables with the commands gen r_enjlife = 5 - enjlife and gen r_washapy = 5 - washapy. Now, let’s combine them and see what the result looks like.

. gen cesd_score = r_washapy + r_enjlife + noteat + fltdep + flteff + nosleep + waslonly + unfriend + fltsad + dislikd + notgetgo

. tab cesd_score

 cesd_score |      Freq.     Percent        Cum.
------------+-----------------------------------
         11 |        599       12.71       12.71
         12 |        518       11.00       23.71
         13 |        495       10.51       34.22
         14 |        453        9.62       43.83
         15 |        370        7.85       51.69
         16 |        334        7.09       58.78
         17 |        270        5.73       64.51
         18 |        261        5.54       70.05
         19 |        212        4.50       74.55
...

Since there were 11 CESD variables, 11 being the minimum makes sense.


A.3.2 Concern with CESD scoring system

The unavoidable concern for validity is combining the CESD variables into one simple measure like above. Not only that, it is a combination of ordinal variables, which became a continuous variable. Furthermore, it does not help that each of the CESD variables’ correlation values do not seem very promising. Here is a truncated output:

. corr noteat fltdep flteff nosleep washapy waslonly unfriend enjlife fltsad dislikd notgetgo

             |   noteat   fltdep   flteff  nosleep  washapy waslonly unfriend  enjlife   fltsad  dislikd notgetgo
-------------+---------------------------------------------------------------------------------------------------
      noteat |   1.0000
      fltdep |   0.3005   1.0000
      flteff |   0.3391   0.4391   1.0000
     nosleep |   0.2321   0.2973   0.2894   1.0000
...

However, correlation is only a measurement of standard deviations and covariances. Correlation’s weakness lies in the fact that it does not account for shared covariances between the numerous variables. This is where Cronbach’s alpha comes in.

A.3.3 Cronbach’s alpha (\(\tau\)-equivalent reliability)

Cronbach’s alpha is a measurement of reliability over multiple variables — though the word “consistency” might be better to understand it. Basically, alpha shows that answers were consistently picked for the questions. Although SEM is better compatible with Raykov’s composite reliability (AKA congeneric reliability), it is not (yet) available in Stata, even though it is a more modern replacement for alpha. Further, alpha does not perform very well with missing data and checking \(\tau\)-equivalency adds a bit of stringency (which tends to be overlooked in the literature).

A researcher should go through the due diligence for the mentioned concerns, but for demonstration purposes, we will briefly go over what the process looks like. Let’s see how Cronbach’s alpha looks for the CESD variables.

. drop if partner_id == ""
. alpha noteat fltdep flteff nosleep washapy waslonly unfriend enjlife fltsad dislikd notgetgo

Test scale = mean(unstandardized items)
Reversed items: washapy enjlife

Average interitem covariance:     .1741151
Number of items in the scale:           11
Scale reliability coefficient:      0.8041

0.8041 is a solid number to work with. Although there is no general rule of thumb for alpha, rule of thumb for congeneric reliability is that we want 0.7 or higher.


A.3.4 Reshaping data by husband/wife partners

In order to continue with the SEM modeling, we have to make sure that our data is compatible to our research question for Stata to understand in variables. This requires the command reshape. Here is a visual explanation, given in Stata’s documentation.

   long
+------------+                  wide
| i  j  stub |                 +----------------+
|------------|                 | i  stub1 stub2 |
| 1  1   4.1 |     reshape     |----------------|
| 1  2   4.5 |   <--------->   | 1    4.1   4.5 |
| 2  1   3.3 |                 | 2    3.3   3.0 |
| 2  2   3.0 |                 +----------------+
+------------+

In our case, i is hh_id or household ID, and j is gender. We are trying to convert our data into the wide format, so we will have new variables for husbands and wives. However, Stata requires that j has unique values within each i. This means that in the data, gender variable contains only two values, 1 and 2. Since there are only two values, this means that there should be only two observations per household in order for them to be unique observations per household. This unique-value-per-household rule also goes same for other variables; this means that we cannot use the given weight_adj variable anymore (though we can still keep cluster and stratum). Let’s only keep the variables of interest for now. You can also choose to keep cluster and stratum.

keep hh_id gender cesd_score waslonly

For demonstration purposes, we will use partner_id variable, which is the su_id of their partner respondent. The only catch of this demonstration is that even though some NSHAP respondents are partnered, they are not necessarily in the NSHAP data. Let’s start with dropping observations with missing partner_id with the command keep if partner_id != "" or drop if partner_id == "".

To fit with the said research questions and for demonstration purposes, let’s focus on hetero partners. Summing up the gender with the two partners in the same household, and checking whether that sum is 3 makes sure that one partner is female and other is male (2 and 1).

. sort hh_id
. by hh_id: egen sum_gender = sum(gender)
. keep if sum_gender == 3

Now, we are ready run the reshape command. Let’s see how it looks and rename the variables for readability.

. reshape wide cesd_score social, i(hh_id) j(gender)
. list in 1/5

     +-----------------------------------------------------------------------------+
     |   hh_id                social1   cesd_s~1                social2   cesd_s~2 |
     |-----------------------------------------------------------------------------|
  1. | 1000001             every week         12   several times a year         16 |
  2. | 1000005             every week         15             every week         14 |
  3. | 1000009   several times a week         17             every week         25 |
  4. | 1000013             every week         11             every week         14 |
  5. | 1000021                  never         19   several times a year         18 |
     +-----------------------------------------------------------------------------+

. ren social1 social_husband
. ren social2 social_wife
. ren cesd_score1 cesd_husband
. ren cesd_score2 cesd_wife

During the reshape process, Stata automatically assigns the two gender values to the variable names. So in the above output, cesd_s~1 means CESD score for husband, while cesd_s~2 means CESD score for wife. Now, we are ready to execute some SEM commands!


A.4 Use SEM in Stata

Going back to the research question mentioned previously, we are trying to predict how CESD explains social life. In other words, our \(y\) variable is social, while our \(x\) variables is the CESD score. Since we are investigating a discrete variable, we use gsem instead of sem (which become essentially the same if working with continuous variable). Here is how to set up the structural equation with a truncated output. Note that the command works fine without the covariance option.

. gsem (i.social_husband -> cesd_wife) (i.social_husband -> cesd_husband) (i.social_wife -> cesd_wife) (i.social_wife -> cesd_husband), cov(e.cesd_husband*e.cesd_wife)

Generalized structural equation model           Number of obs     =      1,075
...
Log likelihood = -6319.4902

-------------------------------------------------------------------------------------------------
                                |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
--------------------------------+----------------------------------------------------------------
cesd_wife                       |
                 social_husband |
         less than once a year  |  -.1796659   2.029517    -0.09   0.929    -4.157445    3.798114
    about once or twice a year  |  -1.521126   1.724117    -0.88   0.378    -4.900333    1.858081
          several times a year  |  -3.255614    1.60672    -2.03   0.043    -6.404728   -.1065003
            about once a month  |  -3.838579   1.599506    -2.40   0.016    -6.973553   -.7036041
                    every week  |  -3.396186   1.593948    -2.13   0.033    -6.520267   -.2721047
          several times a week  |   -3.47962   1.617088    -2.15   0.031    -6.649054   -.3101854
                                |
                    social_wife |
         less than once a year  |   3.006314    2.25652     1.33   0.183    -1.416384    7.429011
    about once or twice a year  |   .3544712   1.687279     0.21   0.834    -2.952534    3.661476
          several times a year  |  -.1305966   1.588978    -0.08   0.934    -3.244936    2.983742
            about once a month  |  -.6342675   1.574404    -0.40   0.687    -3.720043    2.451508
                    every week  |  -1.401933   1.566622    -0.89   0.371    -4.472455    1.668589
          several times a week  |  -1.928945   1.592559    -1.21   0.226    -5.050303    1.192413
                                |
                          _cons |   20.85038   1.987248    10.49   0.000     16.95545    24.74532
--------------------------------+----------------------------------------------------------------
cesd_husband                    |
                 social_husband |
         less than once a year  |    2.61445     1.8641     1.40   0.161    -1.039119    6.268019
...
--------------------------------+----------------------------------------------------------------
                var(e.cesd_wife)|   25.02529   1.081345                      22.99316    27.23702
             var(e.cesd_husband)|   19.91174   .8648453                      18.28682    21.68105
--------------------------------+----------------------------------------------------------------
 cov(e.cesd_wife,e.cesd_husband)|   3.951548   .7003826     5.64   0.000     2.578823    5.324272
-------------------------------------------------------------------------------------------------

Interpreting the coefficients is similar to mlogit; we can interpret the signs, but not the magnitude.

As a side note, when Stata is calculating the maximum likelihood, it should neither be backed up nor its functional form be convex. If Stata notifies (backed up) or (not concave) for the maximum likelihood iterations, then restrategizing the model may be necessary, or the direction of the prediction from the independent variables to the dependent variables may need restructuring.

A.4.1 Simple SEM comparison to OLS

Try comparing the outputs from reg cesd_wife social_husband and sem (social_husband -> cesd_wife) commands.


A.5 Estimating the first two moments

In Stata, there are few neat features that comes with SEM that lets you estimate the first and second moments (though conditions are slightly different for GSEM). The first moment means the mean, and the second moment means the variance. More specifically for SEM however, you can

  1. Estimate the moments
  2. Not estimate the moments
  3. Constrain a coefficient to a fixed value
  4. Constrain two or more moment coefficients to be equal

From the previous gsem output, the results for the covariance are reporting that there is slight covariance on CESD between the partners and it is statistically significant. Notice that rest of the results are the same as before, except for the maximum likelihood. For sem, similar story applies to noxconditional option, which is used by default when constraints are applied to the mean, variance, and/or covariance.

For the other three options, see help sem and gsem path notation in the Stata documentation.


A.6 Comparing coefficients

After running a model, changing the magnitudes between different coefficients may help understand the model’s specification a bit better, relating one coefficient to another. In other words, you might want to know the difference between x2 and x3 on their effects on the \(y\) variable. You can use lincom or nlcom for (non-)linear comparisons, where non-linear comparison allows for non-linear expressions such as division.

To give an example scenario from the above gsem specification, an investigator might be curious on the effects of husband’s social activity compared to wife’s social activity, maybe to see which has a bigger effect on wife’s CESD score.

. lincom [cesd_wife]2.social_wife - [cesd_wife]2.social_husband
...
------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |  -1.875598   2.596621    -0.72   0.470    -6.964882    3.213687
------------------------------------------------------------------------------

In the command, 2. means the second ordinal value, in which case would be “some of the time” value. And because this is a comparison, the formulation is subtraction, whereas if we were to see the effects of both of their social activity to wife’s CESD score, then we would use addition. If need be, the relative-risk ratio option can be used with , rr. Looking at the results above, it seems that wife’s social activity would contribute more towards wife’s CESD scoring compared to husband’s social activity; this sounds reasonable, but the statistical significance shows that the model performs poorly on accuracy, most likely due to the sample size. This can also be alleviated if there were fewer answer choices, so there are more samples per answer.

For other functional forms and methods such as logit, see help lincom in the Stata documentation.


A.7 Marginal effects and graphing

GSEM with discrete \(y\) variable specification cannot directly interpret the magnitudes of coefficients, which is similar to logit and its derivatives. This means that an extra step is necessary for investigating marginal effects. What we need to do is select a point in the domain and see what the marginal effect of that point is. Stata automatically chooses the middle point for you. Then, let’s make a graph of the margins.

. margins social_wife
. marginsplot, xdimension(social_wife) horizontal ytitle("Wife socializing")

Looking at the above graph, it is apparent that there may be sample size issues for certain answers, especially for “never” answer. Beyond that, this shows that wife’s different levels of social activity does not have much effect on husband’s CESD score. In the x-axis, “Mu” stands for “marginal units” for CESD score.



A.8 .do file

use "P:\5707\ARC\W3\Release as of 12-06-18\nshap-3.0b3\w3\nshap_w3_core.dta"

gen r_enjlife = 5 - enjlife
gen r_washapy = 5 - washapy
gen cesd_score = r_washapy + r_enjlife + noteat + fltdep + flteff + nosleep + waslonly + unfriend + fltsad + dislikd + notgetgo

keep if partner_id != ""

sort hh_id
by hh_id: egen sum_gender = sum(gender)
keep if sum_gender == 3

keep hh_id gender cesd_score social

reshape wide cesd_score social, i(hh_id) j(gender)
list in 1/5

ren social1 social_husband
ren social2 social_wife
ren cesd_score1 cesd_husband
ren cesd_score2 cesd_wife

gsem (i.social_husband -> cesd_wife) (i.social_husband -> cesd_husband) (i.social_wife -> cesd_wife) (i.social_wife -> cesd_husband), cov(e.cesd_husband*e.cesd_wife)

lincom [cesd_wife]2.social_husband - [cesd_wife]2.social_wife

margins social_wife
marginsplot, xdimension(social_wife) horizontal ytitle("Wife socializing")