SUDAAN Results

1: My SUDAAN estimates, standard errors, and/or tests of hypothesis are not the same as the ones I get in other packages. Why is this?

2: Why am I getting ******** in the output instead of results?

3: I am getting non-zero parameter estimates in LOGISTIC for the reference group. What is wrong?

4: I am trying to estimate quantiles for a variable with a large percentage of 0 values. I am getting missing values for the quantile and for the SEs and upper and lower confidence limits. Is there anything I can do?

5: Does it matter whether I use SUBPOPN or subset my data outside of SUDAAN before analyzing?

6: Can I use "-2 log likelihood" to evaluate the relative fit of two models?

7: I ran the same procedure using both WR and Delete-1 Jackknife designs. Results were very similar, but the Jackknife design takes much longer to execute. Which method do you recommend?

8: Which SEMETHOD should I use with R=EXCHANGEABLE?

9: My regression model contains independent variables that are coded as 0-1 indicator variables, and I listed these variables on the SUBGROUP statement. SUDAAN seems to be deleting a lot of observations from my analysis, and the regression coefficients don't look correct to me. What could be the problem?

10: I have used LSMEANS in PROC REGRESS to estimate means by race and income level controlling for body weight, sex, race, and income level. I have a set of LSMEANS for race and income. How can I test for differences between those LSMEANS?

11: Is there something about the way percentiles are calculated that would make the percentile estimates appear incompatible with proportions from a dichotomous variable?

1: My SUDAAN estimates, standard errors, and/or tests of hypothesis are not the same as the ones I get in other packages. Why is this?

  • If you are analyzing data from a complex sample survey, you will likely get different results in SUDAAN vs. other packages. First, if you cannot use the survey sampling weights in other packages, the point estimates will be different. Some packages allow a WEIGHT statement, and that will ensure that the point estimates are the same between SUDAAN and most other packages. Point estimates are generally biased if the survey sampling weights cannot be utilized. In addition, the variances, standard errors, tests of hypotheses, and p-values will still be different, even when weights are utilized. This is because SUDAAN allows the user to specify the sampling design and thereby compute a robust variance estimate, yielding valid inferences. If another package does not allow for specification of the complex sampling design (stratification, clustering, etc.), then variance estimation, and hence test statistics and p-values, will be wrong. Usually, this results in variances that are too small and false-positive tests of hypothesis.

  • In some procedures different estimates as well as different standard errors may be due to different tolerances for matrix inversion. Try changing the value of the TOL parameter on the PROC statement.

  • In the iterative regression procedures, different estimates may be due to a different number of iterations. Try changing the values of MAXITER, EPSILON and / or P_EPSILON on the PROC statement.
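
For example, the design information that drives SUDAAN's robust variance estimation is supplied on the design statements. This sketch assumes a with-replacement (WR) design and hypothetical variable names (STRATUM, PSU, SAMWT, BMI):

```
PROC DESCRIPT DATA=MYDATA DESIGN=WR;
  NEST STRATUM PSU;   /* stratification and clustering */
  WEIGHT SAMWT;       /* survey sampling weight */
  VAR BMI;
```

A package that accepts only the WEIGHT information can reproduce the point estimates, but without the NEST (design) information its standard errors will generally be too small.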

2: Why am I getting ******** in the output instead of results?

A run of asterisks (********) indicates that the default field width is not large enough to display the result. Suppose, for instance, you find asterisks in the output of one of the descriptive procedures where you requested WSUM. You can add something similar to the following to your PRINT statement after the slash:

PRINT / WSUMFMT=Fw.d;

where w is the overall field width you desire and d is the number of decimal places. You should choose w large enough to accommodate the number of decimal places d, the decimal point, and enough digits to the left of the decimal to contain the largest value.
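
For instance, to print weighted sums in a 15-character field with no decimal places (the widths here are arbitrary examples):

```
PRINT / WSUMFMT=F15.0;
```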

3: I am getting non-zero parameter estimates in LOGISTIC for the reference group. What is wrong?

The large number of records in your data set may be the cause of the problem. The large size reduces the precision of the sums of squares and cross products that are accumulated in order to estimate the parameters. In this case, the round-off errors may be larger than the default tolerance for matrix inversion (TOL=1E-6). We suggest supplying a larger tolerance on the PROC statement (TOL=1E-5, for example) and rerunning the job.
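
For example, the tolerance is supplied on the PROC statement; the variable names and LEVELS value in this sketch are hypothetical:

```
PROC LOGISTIC DATA=MYDATA DESIGN=WR TOL=1E-5;
  NEST STRATUM PSU;
  WEIGHT SAMWT;
  SUBGROUP RACE;
  LEVELS 3;
  MODEL DISEASE = RACE;
```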

4: I am trying to estimate quantiles for a variable with a large percentage of 0 values. I am getting missing values for the quantile and for the SEs and upper and lower confidence limits. Is there anything I can do?

SUDAAN is unable to estimate any quantile that falls at or below the percentage of the data accounted for by the 0 values. This will happen for any variable whose smallest value is tied across a large share of the observations.

5: Does it matter whether I use SUBPOPN or subset my data outside of SUDAAN before analyzing?

It makes a difference any time parts of the sampling design (e.g., an entire PSU) are lost after subsetting the data. SUDAAN needs the entire design present in order to estimate variances correctly. In most cases, it will make a difference. The difference shows up in the variance estimation and hypothesis testing.

Here is how the SUBPOPN statement works. Imagine a new variable named ELIGIBLE that equals 1 if the observation is to be included in the analysis through SUBPOPN, and 2 if it is not. If this variable is placed on the SUBGROUP statement with the corresponding LEVELS set to 2, and is also crossed with every term on a TABLES statement, SUDAAN will produce results for both levels, ELIGIBLE=1 and ELIGIBLE=2. Using SUBPOPN ELIGIBLE=1 produces results identical to the "ELIGIBLE=1" cells of that two-level analysis.

If you instead subset the population outside of SUDAAN and then analyze the data using SUDAAN, the results may differ between the two analyses. One case for which the results will be the same is when DESIGN=WR and the subset contains at least one observation (with positive weight) in each of the original PSUs.

In conclusion, the safe (and therefore preferred) approach is to use SUBPOPN rather than subsetting the data before running SUDAAN.
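
As a sketch, the restriction is declared inside SUDAAN so that the full design remains available for variance estimation (variable names here are hypothetical):

```
PROC DESCRIPT DATA=MYDATA DESIGN=WR;
  NEST STRATUM PSU;
  WEIGHT SAMWT;
  SUBPOPN ELIGIBLE=1;  /* restrict inference without dropping design units */
  VAR BMI;
```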

6: Can I use "-2 log likelihood" to evaluate the relative fit of two models?

You can use "-2 log likelihood" to informally compare the relative fit of two models, but not to compute a p-value for a formal hypothesis test, since the distribution of the likelihood under complex sampling is not known.

7: I ran the same procedure using both WR and Delete-1 Jackknife designs. Results were very similar, but the Jackknife design takes much longer to execute. Which method do you recommend?

Both methods are good large sample approximations. Here "large" refers to the number of PSUs (Primary Sampling Units), not the number of observations. Which to use is a matter of preference. There is no evidence that one method is superior to the other in general.

8: Which SEMETHOD should I use with R=EXCHANGEABLE?

You can use either SEMETHOD=ZEGER or SEMETHOD=BINDER to obtain the robust variance. BINDER is most often used in complex sample surveys. ZEGER is most often used in randomized experiments and non-survey applications. In many cases, ZEGER and BINDER are identical.

Use SEMETHOD=MODEL to obtain the model-based or "naive" variance estimate. This estimate assumes that exchangeable intracluster correlations (R=EXCHANGEABLE) are correct. This is the most efficient variance estimate when the "working" correlation assumption (R=EXCHANGEABLE or R=INDEPENDENT) is correct. SEMETHOD=MODEL is most often used with randomized experiments and other non-survey applications.
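
As a sketch, both options are specified on the PROC statement; the procedure, design, and model below are placeholders:

```
PROC REGRESS DATA=MYDATA DESIGN=WR R=EXCHANGEABLE SEMETHOD=ZEGER;
  NEST STRATUM PSU;
  WEIGHT SAMWT;
  MODEL Y = X1 X2;
```

Replacing SEMETHOD=ZEGER with SEMETHOD=BINDER or SEMETHOD=MODEL selects the other estimators discussed above.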

9: My regression model contains independent variables that are coded as 0-1 indicator variables, and I listed these variables on the SUBGROUP statement. SUDAAN seems to be deleting a lot of observations from my analysis, and the regression coefficients don't look correct to me. What could be the problem?

Do not list independent variables that are coded as 0-1 on the SUBGROUP statement. Values of 0 are treated as missing for variables listed on the SUBGROUP statement, so those observations will be excluded from your analysis. Independent variables coded 0-1 may be placed on the CLASS statement if you wish to treat them as categorical, or you can enter them into the model as-is.
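
For example, a hypothetical 0-1 covariate SMOKER can be handled either way (all names in this sketch are placeholders):

```
PROC REGRESS DATA=MYDATA DESIGN=WR;
  NEST STRATUM PSU;
  WEIGHT SAMWT;
  CLASS SMOKER;            /* treat the 0-1 variable as categorical */
  MODEL SBP = SMOKER AGE;  /* or drop the CLASS statement to enter it as-is */
```

Listing SMOKER on a SUBGROUP statement, by contrast, would drop every SMOKER=0 record.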

10: I have used LSMEANS in PROC REGRESS to estimate means by race and income level controlling for body weight, sex, race, and income level. I have a set of LSMEANS for race and income. How can I test for differences between those LSMEANS?

First, you can use the t-tests that are printed by SUDAAN to test H0: Beta=0. Tests of the betas=0 are equivalent to testing for differences in LSMEANS. These t-tests automatically compare each level of the categorical covariates to the reference cells. You can also use the EFFECTS statement to compare other specific levels of the categorical covariates, and that is also equivalent to comparing LSMEANS.

11: Is there something about the way percentiles are calculated that would make the percentile estimates appear incompatible with proportions from a dichotomous variable?

I calculated percentiles for a duration variable (number of minutes walked) using:

proc descript ...;
  var walkdur;
  tables _one_;
  percentile 10 25 50 75 90;

Then, using a cut-off of 30 minutes, I created a dichotomous variable and used proc crosstab to calculate the proportion who walked for 30 or more minutes.

Percentile results were:

10th percentile: 18.4 min.
25th percentile: 29.1 min.
50th percentile: 34.5 min.
75th percentile: 59.3 min.
90th percentile: 77.2 min.

Using the dichotomous variable, I estimated that 79% walked for 30 minutes or more. However, since the estimated 25th percentile (29.1 min.) is below 30, one would expect no more than 75% to have walked for 30 minutes or more.

If a large percentage of the data is tied at a single value, then SUDAAN's interpolation between successive values can produce exactly the behavior you describe.

SUDAAN interpolates between 30 and the value just below it in order to estimate the 25th percentile. The value closest to, but less than, 30 that occurs for the walkdur variable is 29. Values at or below 29 account for roughly 22% of the data, and the tied value 30 accounts for roughly another 28%. SUDAAN treats that 28% as if it were spread uniformly between 29 and 30, so 29 falls at the 22nd percentile and 30 falls at the 22 + 28 = 50th percentile. To locate the 25th percentile, find the fraction x of the interval from 29 to 30 that must be traversed by solving:

28x = 25 - 22, which yields x ≈ 0.11. So the 25th percentile is estimated as 29 + 0.11 = 29.11.
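
The interpolation above can be checked with a few lines of arithmetic. This sketch uses the approximate cumulative percentages quoted in the answer (about 22% of the data at or below 29, and a tie of about 28% at 30):

```python
# Linear interpolation of a percentile across a heavily tied value,
# using the approximate percentages from the answer above.
lower = 29.0          # largest observed walkdur value below the tie at 30
upper = 30.0          # the tied value
pct_at_lower = 22.0   # ~22% of the weighted data is at or below 29
tie_mass = 28.0       # the value 30 accounts for ~28% of the data
target = 25.0         # percentile we want to locate

# Fraction of the tied mass needed to move from the 22nd to the 25th percentile:
fraction = (target - pct_at_lower) / tie_mass   # (25 - 22) / 28

p25 = lower + fraction * (upper - lower)
print(round(p25, 2))  # 29.11, matching the reported 25th percentile of 29.1
```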
