The Degrees of Freedom of a Variance Estimator in a Probability Sample

ii


Introduction
Many surveys are based on stratified multi-stage probability samples. When analyzing probabilitysample data, one usually needs to take the complex nature of the sample design into account because sampled elements from the same strata or cluster can be correlated. Moreover, the item values of a sampled element may be related to its probability of selection. Probability-sample theory provides a convenient way of analyzing probability-sample data in a rigorous manner. (The more commonly used but misleading term "design-based theory" is not employed here because models are often used, at least implicitly, in motivating the probability-sampling design.) One common failing of probability-sampling theory is its reliance on an ad hoc method for determining the degrees of freedom (dof) of its variance estimators. Ingram et al. (2018, p. 5) describe this method in their recommendation for analyses of National Center of Health Statistics record-level survey data: "the recommended number of degrees of freedom . . . is the number of PSUs minus the number of sampling strata. " This is the default dof measure employed in the probability-sampling software language SUDAAN (RTI International, 2012) and in the probability-sampling procedures in SAS (SAS Institute Inc., 2015).
Here we investigate using an alternative method of dof calculation derived by assuming a contrafactual model for the sample data. We focus on estimating means, although generalizations to other statistics will be briefly discussed in the final section. We will look at full-population and domain-mean estimates computed from a stratified cluster samples drawn from a particular population and evaluate how well the conventional and the model-based alternative dof measures fare at constructing two-sided confidence intervals (CIs) for the estimands.
We first provide some technical background. We then describe our population and conduct the evaluation. Finally, we offer some caveats, conclusions, and speculations.

Background
Suppose we are estimating the mean of a finite population of item values using data from a complex probability sample. For simplicity, we will assume that there is no nonresponse and that the sample has been drawn from a complete population using a stratified multistage sample. In particular, the population U consisting of M elements has been divided in mutually exclusive and exhaustive primary sampling units (PSUs), and the PSUs themselves have been grouped into H mutually exclusive and exhaustive strata. A probability sample of n h ≥ 2 PSUs is selected in each stratum h (h = 1, …, H), and a probability sample of elements is selected within each sampled PSU. By "probability sample, " we mean every PSU in stratum h has a known positive probability of selection (or expected number of selections) into the sample of PSUs, and every element in each sampled PSU i has a known positive probability of selection into the sample of elements from that PSU. The selection of elements from PSUs can have any number of stages.
Let S 1h denote the sample of the n h PSUs selected from stratum h, S 1 the union of n = ∑ H n h sampled PSUs, and S the sample of elements drawn from within those n PSUs. The subscript 1 in S 1h and S 1 refers to the first stage of sampling in what could be many stages of sample selection.
Let k∈U denote an element in the population and π k its probability of selection into S: the product of the probability of selection of the PSU (call it i) containing k into S 1 and the probability of k's selection into the element sample drawn from within PSU i. That element sample is denoted by S i .
The sampling (or design) weight of element k is denoted d k = 1/π k . A nearly unbiased estimator for a population mean T y = ∑ k∈U y k /M under probabilitysampling theory is An estimator like ty is nearly unbiased when its relative bias tends to zero as the number of sampled PSUs in S 1 grows arbitrarily large under mild conditions we assume to hold. See, for example, Isaki and Fuller (1982) for a discussion of those conditions.
We will also assume the mild conditions under which the contribution of the bias of t y to its mean squared error tends to zero as the number of sampled PSUs in S 1 grows arbitrarily large. The reader should understand that, in what follows, the mild conditions we assume to hold remain in effect.
One conventionally estimates the variance/mean squared error of t y under probability-sampling theory by treating the random selection of PSUs within each stratum as if it were done with replacement, even though that is rarely the case in practice. We follow that conventional treatment here because it greatly simplifies variance estimation.
The conventional probability-sampling variance estimator for t y is Observe that e i+ is nearly equal to u i+ = ∑ k∊S i d k (y k -t y )/M , and the u i+ are independent random variables under the stratified withreplacement selection of PSUs. "Nearly equal" means that the ratio of the two, e i+ /u i+ , converges to 1 as the number of PSUs in S 1 grows arbitrary large.
If the u i+ were normally distributed and each had the same variance, then v would nearly have a (central) chi-squared distribution with n -H degrees of freedom. (It would be exactly such a distribution if the e i+ were replaced by the u i+ .) As a result, the statistic z = (t y − T y )/v 1/2 , which is used to make inferences about of t y as an estimate for of T y , would nearly have a Student t distribution with n -H degrees of freedom. This is where the conventional practice of determining the degrees of freedom for a statistic derived from a stratified multistage probability sample comes from. Many have noted that even under the ideal conditions assumed here-a stratified multistage probability sample drawn from a complete population, where there is no nonresponse, at least two PSUs chosen with replacement from each stratum, and a probability sample of elements selected within each sampled PSUs-the conventional practice is rarely justified. See, for example, Valliant and Rust (2010). That is because the PSU-level "residuals" captured in the u i+ seldom have the same variance and may not be normally distributed.
The probability-sampling variance estimation in Equation (2) captures any loss in efficiency of the estimator t y due to the individual y k within each PSU being correlated (whether there are additional stages of sampling) as well as any gains in efficiency due to stratification. The variance estimate also captures any loss efficiency due to the d k being unequal. Were all the y k in U normal, independent, and identically distributed, T y could be efficiently estimated by a simple mean of the m y k -values in S, the variance of which would be efficiently estimated by which has a chi-squared distribution with m − 1 degrees of freedom. Heuristically, in addition to the v usually being larger than v I , the smaller n -H nominal degrees of freedom in v (compared with m−1) is the price one pays for using t y and v for making inferences about T y . That price may not be steep enough, as we shall see.
The degrees of freedom of a chi-squared variance estimator is 2 divided by its relative variance. Accordingly, it seems reasonable to employ a Satterthwaite approximation (1946) to measure the effective degrees of freedom (edof) for a nonchi-squared variance estimator, like v; namely, edof(v) = 2/relvar(v). Kott (1994) proposed computing such edof for v by treating each y k as it is were normal, independent, and identically distributed. Assuming this contrafactual model makes computing the relative variance of v straightforward, and truly provides a measure of the price one pays for using v in place of v I when drawing inferences about a population mean.
Here we will look at estimates of domain means as well as population means. A domain mean has the form T y,g = ∑ k∈U g k y k /∑ k∈U g k , where g k = 1 when element k is in the domain and 0 otherwise. Its nearly-unbiased estimator under mild conditions is t y,g = ∑ k∊S d k g k y k /∑ k∊S d k g k , and its conventional probability-sampling variance estimator has the form of Equation (2) with the ei+ in Equation (3) replaced by A population mean and its estimator can be expressed as a domain mean and its estimator by letting g k = 1 for every k∈U.
Setting w k = d k g k , the edof measure for the probability-sampling variance estimator of a domain mean derived under the contrafactual assumption that the y k in the domain are normal, independent, and identically distributed, is (Kott, 1994).
A little algebra reveals that if ∑k∊S i w k 2 for each PSU were the same, the right-hand side of Equation (4) would collapse to across PSUs, whether due to the variability of the w k 2 or the sizes of the PSUs, increases.

The Population and Estimands of Interest
We now will compare how well the contrafactual edof in Equation (4) and the nominal degrees do at measuring "the true degrees of freedom" of a probability sampling variance estimator given a stratified cluster sample drawn for the MU281 population (Ohlsson, n.d., based on data from Appendix B of Särndal et al., 1992). Determining the true degrees of freedom is elusive. Instead, we define the empirical degrees of freedom for a two-sided q percent CI as the degrees of freedom of a chi-squared distribution with the same q percent CI as that produced by the sample data.
The MU281 population is a collection of Swedish municipalities with three outliers removed. The population is divided into 50 clusters from 8 regional strata. After combing strata 7 and 8, the clusters and strata are further divided (based on 1975 population counts) into 108 PSUs within 20 design strata. A stratified cluster sample of 4 PSUs is sampled with replacement from each design stratum via simple random sampling. All the municipalities in each sampled PSU are the elements of the sample. By sampling with replacement, the variance estimator in Equation (1) should produce nearly unbiased estimates of the true mean squared error.
This sampling process is repeated 10,000 times twice. • RMT/P85, a binary variable equal to 100 when the ratio RTM85/P85 is greater than 8, 0 otherwise (18.5053, which is the estimate the percent of municipalities with RMT/P85 ≥ 8).
All exhibit some degree of intra-stratum and intracluster correlation. These can be measured indirectly by treating the population as an equally weighted stratified with-replacement cluster sample and comparing alternative standard-error estimates for the means assuming: no strata or clustering, clustering but no strata, strata but no clustering, and both strata and clustering. This was done, but the results are not shown.

Results
For each of our simulated samples, the sampling weights in Equation (1)  The set S is now the set of sampled element selections, and S 1h and S 1 are defined analogously. Under withreplacement sampling, a PSU can be selected more than once; each selection is a separate member of S 1 , leading to potentially multiple memberships for its component elements in S.
We denote the results computed from each r = 1, . . . , 10,000 simulated sample with the subscript (r), and define the following empirical summary statistics (recall a full-population mean can be expressed as a domain mean): In the previous expressions, v is the probabilitysampling variance of t y,g.
The nominal dof of v is 60 (80 PSUs minus 20 strata), whereas the predicted edof are computed using Equation (4), which varies from sample to sample, but it is same for all population variables depending on whether a full-population of domain mean is being estimated. Table 1 displays the summary statistics for the edof(v). All summaries in this paper were computed using the PROC MEANS procedure within the SAS software (SAS Institute Inc., 2015). Estimated probability- The empirical relative bias of t y,g is ∑ r = 1 10,000 (t y,g(r) -T y,g ) T y,g ERB(t y,g ) = × 100%. The empirical variance of t y,g is The empirical relative bias of v is The empirical variance of v is The empirical relative variance of v is The empirical Satterthwaite DOF of v is sampling means and variances were computed using the DESCRIPT procedure in the SUDAAN language (RTI International, 2012) before being summarized using SAS.
Tables 2 and 3 display many of the empirical results of the simulation including those of the twosided 95th and 90th percentiles of the test statistic z = | t y(r) − T y |/v 1/2 , denoted by z95 and z90. Each of these percentiles has its own empirical dof (i.e., the degrees of freedom that would produce that value of z were the u i+ normal, independent, and identically distributed). These are denoted ED95(v) and ED90(v) respectively (they were computed using the TINV function in SAS).
For comparison purposes, the z95 and z90 values for the nominal 60 degrees of freedom (rounding to two decimal places) are 2.00 and 1.67. For 31.2 degrees of freedom (the mean of the full-population predicted edof in Table 1), they are 2.04 and 1.70. For 20.6 the mean of the domain prediction edof), they are 2.08 and 1.72. Under normality (infinite degrees of freedom), z95 is 1.96, and z90 is 1.645.
Tables 2 and 3 also show that, unlike their predictions, the empirical degrees of freedom, whether measured by ESD, ED95, or ED90 depend on the variable. That is likely because each variable has a different amount of intra-stratum and intra-PSU correlation. The predictions in Table 1, while stable, are not that good. Still, they are clearly better than the nominal measure (60) or infinity. Surprisingly, although developed to predict the Satterthwaite degrees of freedom, which are only supposed to be approximations, the edof appear to be closer to the empirical confidence-interval degrees of freedom (ED95 and ED90). Also, surprising, is that although the edof from the domain means is smaller than that of the population mean, empirically the dof measures for domain means are usually smaller-but are sometimes higher-than their full-population analogs, depending on the variable and measure. Notes: EDX = the empirical degrees of freedom of the value zX at the Xth percentile; ERB = the empirical relative bias; ESD = the empirical Satterthwaite (approximate) degrees of freedom; zX = the X percentile of the absolute difference between the estimator and estimand divided by the estimated standard error of the estimator. Notes: EDX = the empirical degrees of freedom of the value zX at the X th percentile; ERB = the empirical relative bias; ESD = the empirical Satterthwaite (approximate) degrees of freedom; zX = the X percentile of the absolute difference between the estimator and estimand divided by the estimated standard error of the estimator.
There is no obvious explanation why the empirical measures of the degrees of freedom for RMT/P85 were so much lower than for the other four variables investigated. If anything, the estimated population and domain means for this variable were less influenced by stratification and clustering than the other four (not shown). How stratification, clustering, and weight variability affect empirical dof is an area for future study.

Discussion
That the predicted edof measure produced by Equation (4) is not a complete success should not surprise anyone. It is based on the contrafactual assumption that v can be treated as a chi-squared distribution and that the variable whose mean is being estimated has no intra-stratum nor intracluster correction. None of those assumptions are correct for the variables in the MU281 data set being analyzed. In addition, the Satterthwaite formula, edof(v) = 2/relvar(v), is only an approximation when v is not chi-squared. Finally, an estimated mean can itself have a skewness and kurtosis that affect building appropriate CIs around them. These characteristics are extremely difficult to measure with a probability sample. Valliant and Rust (2010) attempt incorporating stratum-level kurtosis terms into a degrees of freedom measure with little success. Nearly unbiased kurtosis estimation requires four sampled PSUs in every stratum. Kott (2020) discusses the potential impact of skewness on CIs, an impact greater on an interval's symmetry than its size, which is why only two-sided intervals were evaluated here. A nearly unbiased skewness measure requires at least three sampled PSUs in every stratum.
Despite Equation (4) failing to predict the empirical edof, however they are measured, employing the equation when making statistical inferences is a realistic alternative to subtracting the number of sampled PSUs from the number of strata, which is the conventional practice. The equation provides a useful measure of the potential loss in efficiency from using a probability-sampling variance estimator. Although it is based on simple model assumptions, those contrafactual assumptions are more reasonable that then implicit assumption justifying the conventional measure of the edof.
Some speculations about extensions beyond those situations covered in the body of the text are in order. It may often be reasonable to use the contrafactual edof measure in Equation (4) when the weights in a weighted estimator like t y,g contain adjustments for nonresponse and coverage and when the design strata or design PSUs in Equation (2) are replaced by variance strata or PSUs (i.e., simplifications used in practice when, for example, there are less than two design PSUs in a design stratum). This assertion assumes that the difference t y,g − T y,g can be put in the form ∑k∊S w k e k . For example, with linear calibration weighting (Kott, 2015), e k = y k − x k (∑ j ∊S d j x j x j T ) -1 ∑ j ∊S d j y j , where x k is a vector of calibration variables, and w k = d k [1 + (∑ j ∊ U x j −∑ j ∊S d j x j ) (∑ j ∊S d j x j x j T ) −1 x k ]. Using Equation (4) is most reasonable under the simple linear model for y k , y k = x k T β + ε k , where the ε k (which are nearly equal to the e k ) are independent and identically distributed normal random variables.
The error in a single regression coefficient can also be put in the form ∑ k ∊S w k e k , where the w k are more complicated expressions than the sampling weights. The w k for the j th coefficient of a probability-sampling linear regression estimator b = (∑ j ∊S d j x j x j T ) −1 ∑ j ∊S d j x j y j is the j th row of (∑ j ∊S d j x j x j T ) −1 d k x k , and e k = y k − x k b, which is nearly equal to ε k in the linear model, y k = x k T β + ε k .
Similarly, the w k for the j th coefficient of a of a probability-sampling logistic-regression estimator or to any b that solves an estimating equation like ∑ k ∊S d k x k [y k − f(x k T b)] = 0, where f(.) is scalar and monotonic, is the j th row of (∑ j ∊S d j f '(x j T b) x j x j T ) −1 d k x k , and e k = y k − f(x k T b). This suggests that using Equation (4) as a reasonable, if imperfect, measure of the edof for the variance estimator of a regression coefficient, linear or otherwise. Because this measure will usually vary across the coefficients of a regression estimator, using a Bonferroni-adjusted t test (1974), rather than a probability-sampling F test (for examples, see RTI International, 2012, p. 217) may be advisable for (imperfectly) testing a joint hypothesis involving multiple coefficients.
Some readers may be wondering about the edof measure for the variance estimator of the estimated difference between two domain means. Expressing the estimated difference between two domain means as a linear regression coefficient is simple and has been discussed previously in this report.