Part 1 Weights
1.1 The goal
The goal of sampling weights is to give a smaller group (of people) the same amount of influence on the results as the bigger groups, and vice versa, achieving a balanced dataset. To draw a crude example, there are fewer rich people, but they still need to be represented as much as the middle class in the results to reduce bias. Another example is the male/female ratio in surveys: we know the ratio is very close to 1:1 in the world, but females tend to answer surveys more. Based on such survey data, some statistical techniques would predict more females in the world – a bias. However, weights are made up, down to the researcher’s discretion (e.g. “I’m going to up/down-weight these outliers”), so one observation can count twice as much as a down-weighted observation for unnecessary reasons; weights therefore need careful consideration before approaching.
This happens because no data are equal: non-responses are unavoidable, some regions have bigger populations than others, and so forth. Due to these countless research limitations, bias can creep into every step of the surveying process when gathering data. However, using sampling weights can amend a few of these statistical problems. Here is a story to explain how the weights work.
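As a toy illustration of the male/female example above (made-up numbers, not NSHAP data), a weight of population share over sample share rebalances a survey where women respond more often than men:

```python
# Illustrative sketch (made-up numbers): the population is ~50/50,
# but the sample came back 70 women / 30 men.
pop_share = {"male": 0.5, "female": 0.5}   # known from the census
sample_n  = {"male": 30,  "female": 70}    # who actually answered
n_total   = sum(sample_n.values())

# Weight = population share / sample share, so each group's total
# influence matches its share of the population.
weights = {g: pop_share[g] / (sample_n[g] / n_total) for g in sample_n}

# Suppose the quantity of interest averages 10 for men, 20 for women.
raw_mean      = (30 * 10 + 70 * 20) / 100                              # biased toward women
weighted_mean = (30 * weights["male"] * 10
                 + 70 * weights["female"] * 20) / 100                  # rebalanced

print(weights)        # men up-weighted (> 1), women down-weighted (< 1)
print(raw_mean)       # 17.0
print(weighted_mean)  # 15.0 -- the population-balanced answer
```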
1.2 Sampling frame: who we want to study
This is simple. The sampling frame targets a specified part of a population for data gathering. More concretely, NSHAP gathers data on older adults in the US in the years 2005, 2010, etc. More specifically for the 2005 wave, the data come from adults who were born between 1920 and 1947. Essentially, the sampling frame sets the target goal for the research project – “what do the elderly think about X, or how are they doing on Y?” – the NSHAP target.
1.3 Clusters: groups within the data (usually grouped by geography)
Now that there is a sampling frame, the interviewers go out to survey whoever meets the criteria or is eligible. After the surveys are in and the data are stored, the question is whether the data from our sampling frame are representative of the US population, as good as random, and/or meet other criteria depending on the goal of the research. Although our target (mentioned above) is the US, we cannot sample every single person in the US who meets the criteria, due to many research limitations. So what we do is set geographical areas of data collection (AKA primary sampling units – PSUs) called clusters, assigning a name/number to each cluster. Some examples of clusters are hospitals, registers, organizations, zip codes, etc., and they divide the data up into different groups (clusters). The problem is that different clusters have different population sizes, as well as different rates of non-response; these will be addressed later.
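A minimal sketch (hypothetical cluster labels, plain Python) of the first thing one checks after forming clusters – how many respondents landed in each one:

```python
from collections import Counter

# Hypothetical toy data: each respondent is tagged with a cluster (PSU) id.
clusters = ["zip_601", "zip_601", "zip_601", "zip_714", "zip_714", "zip_980"]

# Tally respondents per cluster -- the sizes differ, which is one
# reason weights are needed later.
sizes = Counter(clusters)
print(sizes)
```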
1.4 Strata: groups within the data, grouped by mutually exclusive characteristics (stratum for singular)
The necessary condition for all strata is that each stratum be mutually exclusive of every other stratum. In NSHAP’s W2 case, there are 56 strata, each with a pair of clusters within it. The extent of mutual exclusivity also depends on the research goals. Some examples are sizable municipalities or a minority group (e.g. the very rich). To see what the strata do, below is the same data sorted alphabetically (where grouping does not matter) and sorted by groups (where names do not matter), with randomly chosen observations highlighted:
| Name (sorted) | Stratum | | Name | Stratum (sorted) |
|---|---|---|---|---|
| **A** | **3** | | B | 1 |
| **B** | **1** | | **D** | **1** |
| C | 2 | | **C** | **2** |
| D | 1 | | F | 2 |
| **E** | **3** | | A | 3 |
| F | 2 | | **E** | **3** |

(Bold marks the randomly drawn observations in each scheme; the simple random draw on the left happens to miss stratum 2 entirely.)
The left side is simple random sampling, whereas the right side is randomly picked within each stratum. Although both have the same number of samples, picking randomly within strata ensures that all groups are represented in the data, reducing bias in our results – for example, the very rich. Hooray! But…
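The contrast above can be sketched in a few lines of Python (using the six toy names from the table; the random draws are illustrative):

```python
import random

random.seed(1)

# The six observations from the table: name -> stratum.
data = {"A": 3, "B": 1, "C": 2, "D": 1, "E": 3, "F": 2}

# Simple random sample of 3: may miss a stratum entirely.
srs = random.sample(list(data), 3)

# Stratified sample: pick one name at random *within each* stratum,
# guaranteeing every group appears.
strata = {}
for name, s in data.items():
    strata.setdefault(s, []).append(name)
stratified = [random.choice(names) for s, names in sorted(strata.items())]

print(srs)          # strata covered only by luck
print(stratified)   # exactly one name per stratum: 1, 2, 3
```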
1.5 Weights: solution for including data that’s highlighted and not highlighted in the tables above
The problem with the strata example above is that we are ignoring the data that are not highlighted, and we definitely do not want to lose any information. Another problem is that clusters come in different sizes (try the command `bysort cluster: count`). To solve both problems at once, we assign weights to the observations – down-weight groups where there are many (e.g. bigger populations), up-weight groups where there are too few (e.g. the very rich). There are different formulas for the weights depending on research objectives (for example, “proportionate allocation” and “Neyman allocation” for strata).
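The two allocation rules just named can be sketched with their textbook formulas, using made-up stratum sizes and standard deviations:

```python
# Toy numbers. N_h = stratum population, S_h = stratum std. deviation.
N = {"middle": 9000, "rich": 1000}   # hypothetical stratum sizes
S = {"middle": 10.0, "rich": 40.0}   # rich incomes vary much more
n = 100                              # total interviews we can afford

# Proportionate allocation: n_h = n * N_h / N_total
N_tot = sum(N.values())
prop = {h: n * N[h] / N_tot for h in N}

# Neyman allocation: n_h = n * N_h*S_h / sum_k(N_k*S_k)
# -- more interviews go to strata that are large AND variable.
NS_tot = sum(N[h] * S[h] for h in N)
neyman = {h: n * N[h] * S[h] / NS_tot for h in N}

print(prop)    # {'middle': 90.0, 'rich': 10.0}
print(neyman)  # the variable 'rich' stratum gets ~31 interviews, not 10
```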
NSHAP uses different weights across strata (try the command `bysort stratum: sum weight_adj weight_sel` and you can see the mean weight assigned to each particular stratum; similarly, you can `bysort cluster ...`). For both weight variables, whichever command it may be, use the option `pweight` in Stata. So an example command could be
```
. regress hearn age [pweight = weight_adj], vce(cluster cluster)
*** or in the short form
. reg hearn age [pw = weight_adj], vce(cl cluster)
```
where the `vce` option readjusts standard errors for correlation within the cluster. This cluster option should not be used if you have too few (say 20 or fewer) clusters; it then pretty much becomes robust standard errors (which is not inherently a bad thing, but clustered standard errors should be different from robust standard errors).
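To see why the within-cluster adjustment matters, here is a hand-rolled sketch (toy numbers, and without Stata's exact small-sample corrections) comparing a naive and a cluster-robust standard error for a simple mean:

```python
import math
from collections import defaultdict

# Toy data: observations within a cluster are strongly correlated,
# so treating them as independent understates the standard error.
y       = [1.0, 1.2, 1.1, 5.0, 5.2, 5.1]
cluster = ["a", "a", "a", "b", "b", "b"]
n, mean = len(y), sum(y) / len(y)

# Naive SE of the mean: assumes all n observations are independent.
naive_var = sum((v - mean) ** 2 for v in y) / (n - 1) / n

# Cluster-robust version: sum the residuals within each cluster first,
# then build the variance from the g cluster totals, not n observations.
tot = defaultdict(float)
for v, c in zip(y, cluster):
    tot[c] += v - mean
g = len(tot)
cl_var = g / (g - 1) * sum(t ** 2 for t in tot.values()) / n ** 2

print(math.sqrt(naive_var), math.sqrt(cl_var))
# the clustered SE is much larger here, as it should be
```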
In the simplest form of an example to explain a type of (probability) weight (the `smote-variants` package in Python alone implements 85 oversampling techniques!), the basic design weight is the inverse of the selection fraction:
\[W_j = \frac{population\ size\ in\ stratum\ j\ given\ certain\ characteristics}{number\ of\ samples\ in\ stratum\ j\ given\ certain\ characteristics}\]
where the number of samples is how many interviews are completed in that stratum, and the population size is given by the census for that stratum. After normalizing so that the mean weight is 1, the \(W_j\) (similar to the `weight_sel` or `weight_adj` variable in Stata) determines how light (below 1) or heavy (above 1 – counting a rare observation more heavily) the observation should count.