Imputing Race/Ethnicity: Part 2

Lisa M. Lines Senior Health Services Researcher

Jamie Humphrey Health Geographer

September 08, 2021

This blog originally appeared on The Medical Care Blog and is republished here with permission.

Part 1 of this two-part series laid out arguments for and shortcomings of imputing race/ethnicity from the perspective of health equity. In this post, we’ll talk about gaps in the evidence and a few alternatives to imputation, including approaches involving population-level and neighborhood-level data.

Imputation is a common solution to deal with “the missing-data problem.” But much is still unknown about imputation’s implications. We need better data to begin with – how should we collect these data? What additional research could inform the next evolution in dealing with this perennial problem?

We Need More and Better Data

Discrimination in healthcare occurs on the basis of skin color and other physical characteristics, less than perfect English or stigmatized accents, sexual orientation, gender identity and its presentation, poverty, lack of education or literacy, history of substance use, disability, overweight, various other health conditions, and more. Truly, if the goal is health equity for all, we need to be taking all kinds of potential gaps into consideration.

One person of color interviewed on the topic of racial identification questions on a health-related survey said:

What is the point of asking me [my race on this survey]? If it is [about] experiencing discrimination, why aren’t you asking if I’ve experienced discrimination because of my race? Why does it even matter what race I am -- if the point is to uncover quality [issues], then that should matter regardless of race.

Following this argument, collecting more specific data on inequities, mistreatment, and gaps in high-quality care should be a priority. Similarly, every survey instrument that asks about race/ethnicity ought to allow respondents to select “Prefer not to answer” as a valid response option.

Also, researchers and policymakers alike need to examine their own internal assumptions. Are you using race/ethnicity as a proxy for something else? Are there data available, or could data be collected, to measure that thing? Either way, be explicit about why race/ethnicity is being used in your models.

Ask the Experts

What if we asked a racially diverse panel of individuals to weigh in on what should happen if they leave the race/ethnicity question blank on a survey? Shockingly, it does not appear that anyone has published research on this to date.

We should ask those who are most likely to be affected to weigh in on current practices and approaches. How would they feel about: being dropped from an analysis altogether? Being lumped into a missing/unknown or “Other” category? Having probabilities of being of various races/ethnicities assigned based on their last name and where they live?

We should also ask about the race/ethnicity questions themselves. How would they feel about open-ended race and ethnicity questions? What about asking people about their ancestry or family origins instead of their race/ethnicity? We need qualitative data to get more perspectives on this issue.

The Census has conducted extensive focus groups on race/ethnicity survey collection in recent years, but it's not clear from the report [PDF] whether the participants shed light on these specific issues. Per a comment on Part 1, the Census used imputation to address missing race/ethnicity data in the most recent decennial count. According to their explainer: "...if race is reported for a parent, we could use that information to fill in their child’s missing race. If no information is available within the household, we would impute the information using data from similar nearby households."

Data Linkages and Improved Enrollment

In addition to more qualitative data on this issue, we also need more quantitative data. Self-reported racial/ethnic identification collected via surveys, clinical assessments (such as in nursing homes), registries, and electronic health records (EHRs) could be linked with administrative data. Ideally, we would have a national healthcare blockchain system (or at least interoperability) to protect privacy and confidentiality.

Failing that, the Federal government has some power to facilitate better data linkages and collection. For example, the Medicare enrollment form could be changed to collect more information besides the basics. The Centers for Medicare & Medicaid Services (CMS) could link survey, assessment, registry, and EHR data with enrollment data to improve the accuracy of older, Social-Security-supplied race data. CMS could work with other Federal and state agencies to leverage both qualitative and quantitative approaches to improve race/ethnicity measurement.

Even so, some states limit when and how race and ethnicity can be collected. Others may push back against collecting such data for ideological reasons concerning the scope and limits of government’s role in citizens’ lives. These and other obstacles to collecting race/ethnicity will continue to stymie efforts to promote health equity.

Alternatives to Imputation

Admittedly, we currently lack good alternatives to imputation in the analysis phase. This is one reason why the best alternative is to collect better data to begin with.

When running regression models in many statistical software packages, individuals with missing data on predictors are, by default, simply removed from the model. This is known as “complete-case analysis” or “listwise deletion.” This approach decreases information and sample size and can introduce bias if data are not missing completely at random. Still, many studies in the literature have used this approach.
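To make the cost of listwise deletion concrete, here is a minimal sketch in plain Python. The records, field names, and helper functions are all hypothetical and invented for illustration; they do not come from any real dataset or statistical package. The example only shows how dropping incomplete rows shrinks the sample and can shift a simple summary statistic when missingness is related to the outcome.

```python
# Hypothetical records: None marks a missing race/ethnicity response.
records = [
    {"id": 1, "race_ethnicity": "Black", "outcome": 1},
    {"id": 2, "race_ethnicity": "white", "outcome": 0},
    {"id": 3, "race_ethnicity": None, "outcome": 1},
    {"id": 4, "race_ethnicity": "Hispanic", "outcome": 0},
    {"id": 5, "race_ethnicity": None, "outcome": 1},
    {"id": 6, "race_ethnicity": "white", "outcome": 0},
    {"id": 7, "race_ethnicity": "Asian", "outcome": 1},
    {"id": 8, "race_ethnicity": None, "outcome": 0},
]


def complete_cases(rows, required_fields):
    """Listwise deletion: keep only rows with no missing required field."""
    return [
        row for row in rows
        if all(row.get(field) is not None for field in required_fields)
    ]


def mean_outcome(rows):
    """Share of rows with outcome == 1."""
    return sum(row["outcome"] for row in rows) / len(rows)


kept = complete_cases(records, ["race_ethnicity", "outcome"])

print(f"All records:    n={len(records)}, mean outcome={mean_outcome(records):.2f}")
print(f"Complete cases: n={len(kept)}, mean outcome={mean_outcome(kept):.2f}")
```

In this made-up example, three of eight people are dropped, and the outcome rate among the remaining five differs from the rate in the full group. That is the kind of shift that can occur when data are not missing completely at random.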

Another approach is to narrow the analysis to only people with known white or Black race and drop all others or put them in an “Other” category. While this sidesteps the missing data issue, it leads to a new problem: lack of generalizability. Similarly, creating a separate category of people with missing data and analyzing them separately leads to the inability to draw meaningful conclusions about those in that category.

More Alternatives

For research in which proportional representation is a primary concern, we could consider designs that sample portions of more-represented groups, rather than add imputed data to less-represented groups. This approach would be based on an understanding that statistical results based on groups with greater representation – such as white males – would be robust even with fewer observations.

In a design based on relative representation, researchers could – based on reliable sources of population characteristics – under-sample over-represented groups while including all (or nearly all) of the least-represented group, most of the next-least represented group, and so on. This approach still assumes that people from underrepresented groups who have missing racial/ethnic information are roughly the same with regard to outcomes as people from the same group whose information is reported, which may not be the case.

This design would not avoid the need for more, and more reliable, data on the health and health outcomes of underserved populations. However, it could allow for fairer proportional representation without requiring imputation of identification data. This would also be a decision made at the design stage of the study, rather than the analysis phase.
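The relative-representation design described above could be sketched roughly as follows. This is an illustration under stated assumptions, not a recommended sampling protocol: the group labels, group sizes, keep fractions, and the function name `undersample_by_group` are all hypothetical, and in practice the fractions would need to come from reliable population sources. The sketch also does not address people whose group is unknown.

```python
import random
from collections import Counter, defaultdict


def undersample_by_group(rows, group_key, keep_fraction, seed=0):
    """Randomly keep a fraction of each group; groups not listed are kept whole.

    keep_fraction maps a group label to the share of that group to retain
    (hypothetical values chosen by the study designer).
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    by_group = defaultdict(list)
    for row in rows:
        by_group[row[group_key]].append(row)

    sample = []
    for group, members in by_group.items():
        fraction = keep_fraction.get(group, 1.0)
        n_keep = round(len(members) * fraction)
        sample.extend(rng.sample(members, n_keep))
    return sample


# Hypothetical study population: group A is heavily over-represented.
population = (
    [{"group": "A"} for _ in range(80)]
    + [{"group": "B"} for _ in range(15)]
    + [{"group": "C"} for _ in range(5)]
)

# Keep a quarter of A, most of B, and all of C (assumed fractions).
fractions = {"A": 0.25, "B": 0.8, "C": 1.0}

sample = undersample_by_group(population, "group", fractions)
counts = Counter(row["group"] for row in sample)
print(dict(counts))
```

The sampling fractions are set at the design stage, so no one's racial/ethnic identification has to be imputed to balance the groups. The trade-off is a smaller overall sample.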

Another approach is to use geographic, population-level data instead of, or in addition to, individual-level data. Area-level measures related to race/ethnicity include residential segregation and isolation, dissimilarity indices, and historical redlining, all of which reflect a place-based view of health. Area-based measures are often independently associated with health outcomes, and including them in analyses can support health equity work by better accounting for social factors.
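As one example of an area-based measure, the sketch below computes a two-group index of dissimilarity. The post does not say which formulation it has in mind, so this assumes the commonly used form, half the sum across areas of the absolute difference between each group's share of its own total population. The tract counts and the function name are hypothetical.

```python
def dissimilarity_index(tracts):
    """Two-group index of dissimilarity.

    tracts is a list of (group_a_count, group_b_count) pairs, one per area.
    Returns a value from 0 (both groups distributed identically across
    areas) to 1 (the groups share no areas at all).
    """
    total_a = sum(a for a, _ in tracts)
    total_b = sum(b for _, b in tracts)
    return 0.5 * sum(abs(a / total_a - b / total_b) for a, b in tracts)


# Hypothetical tract counts for two groups in three imaginary regions.
even_region = [(50, 50), (50, 50)]          # identical distributions
mixed_region = [(80, 20), (20, 80)]         # partial separation
segregated_region = [(100, 0), (0, 100)]    # complete separation

for name, tracts in [
    ("even", even_region),
    ("mixed", mixed_region),
    ("segregated", segregated_region),
]:
    print(f"{name}: D = {dissimilarity_index(tracts):.2f}")
```

A value like this could be attached to each person's area of residence and used as a model covariate, either alongside or in place of individual-level race/ethnicity.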

Conclusions

Take-away messages: We need more and better data to reduce the need to impute in the first place. Also, race/ethnicity should not be used as a proxy for whatever the “real” exposure or confounder might be in your conceptual model.

Have you seen a better approach to handling missing data on race/ethnicity? Do you have input to share? The APHA Medical Care Section’s Health Equity Research Collective is interested in your perspective. You can leave a comment below or send an email to hltheq@outlook.com.

Opinions in this series are those of the authors. The authors gratefully acknowledge the following RTI staff for their helpful feedback on this series: Jane Allen, Dan Barch, Anupa Bir, Darryl Creel, Susan Haber, and Pam Spain.

Disclaimer: This piece was written by Lisa M. Lines (Senior Health Services Researcher) and Jamie Humphrey (Health Geographer) to share perspectives on a topic of interest. Opinions expressed within are those of the author or authors.