

Adding COVID-19 Awareness to a Major Survey of U.S. Doctoral Graduates Using Machine Learning

Survey of Earned Doctorates expands to include open-ended questions on the wide-ranging effects of the pandemic


The goal was to track how university shutdowns during the COVID-19 pandemic affected the early careers and lives of doctoral students.


We partnered with the National Science Foundation to develop COVID-19-related questions for its annual Survey of Earned Doctorates (SED). We then used advanced data analytic methods to organize and report the results.


Our coding created a database that we used with machine learning techniques to identify common themes within open-ended responses. The results will contribute to the understanding of COVID-19's societal effects. The survey's results and RTI's implementation of machine learning techniques will influence future cycles of the SED.

Each year, some 55,000 people earn research doctorates from universities in the United States. It’s a milestone achievement for both these individuals and the institutions that educated them. And it’s the starting point for careers that can take any number of paths.

To keep track of trends in doctoral education, the National Science Foundation (NSF) funds the Survey of Earned Doctorates (SED), a census of all research doctorate graduates each academic year. The survey includes data on fields of study, student demographics, and post-graduation employment plans. Decades of results have helped the NSF, other federal agencies, and other stakeholders understand changes in this critical sector of education and the economy.

When the COVID-19 pandemic hit in 2020 and universities began shutting down, it was clear that the pandemic would have an unprecedented effect on the lives and early careers of doctoral students. To capture these effects, RTI worked with the NSF project officer to quickly develop and add questions to the SED and then used advanced data analytic methods to help tell this unfolding story.

Using the power of machine learning to enable collection and analysis of open-ended questions

Because the pandemic had the potential to impact the doctoral experience in many ways, we added a series of broad, open-ended questions that asked about changes to graduation timelines, doctorate research impacts, funding changes, career plans, and other effects.

Adding open-ended questions isn’t easy in a large survey, mostly because coding thousands of responses by hand would take too much time. Machine learning makes it feasible by giving researchers the ability to understand a diverse landscape of responses and group them together in a data-driven, systematic way.

RTI had been adding innovative approaches to the SED since we began collecting the data for the 2017 survey cycle. To leverage the advantages of machine learning for the SED, RTI’s experts in postsecondary education surveys teamed up with our data scientists. Our survey methodologists provided an initial coding of a subset of the open-ended responses, creating a database that we used to start the machine learning process. The approach paid off: the machine learning analysis confirmed the experts’ analysis and supported revisions to the COVID-19 items for the 2022 SED.
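As a rough illustration of how an initial set of expert-coded responses can seed an automated classifier, here is a minimal sketch in Python. It is not RTI's actual method, which the article does not describe in detail. It assumes a simple bag-of-words Naive Bayes model, and the theme names and example responses are invented for illustration only.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical hand-coded subset: (response text, theme assigned by an expert).
# These responses and theme names are invented; they are not SED data.
HAND_CODED = [
    ("My defense was postponed and graduation delayed by a semester", "graduation_timeline"),
    ("Graduation was delayed because the campus closed", "graduation_timeline"),
    ("Lab access was suspended so my experiments stopped", "research_disruption"),
    ("Could not travel for field research and data collection", "research_disruption"),
    ("My job offer was rescinded due to a hiring freeze", "career_plans"),
    ("Postdoc position start date was pushed back, hiring frozen", "career_plans"),
]

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def train(coded_responses):
    """Count responses per theme and word occurrences per theme."""
    theme_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, theme in coded_responses:
        theme_counts[theme] += 1
        for word in tokenize(text):
            word_counts[theme][word] += 1
            vocabulary.add(word)
    return theme_counts, word_counts, vocabulary

def predict(model, text):
    """Return the theme with the highest Naive Bayes log-probability."""
    theme_counts, word_counts, vocabulary = model
    total_responses = sum(theme_counts.values())
    best_theme, best_score = None, -math.inf
    for theme in sorted(theme_counts):
        # Start from the log prior: how common the theme is in the coded subset.
        score = math.log(theme_counts[theme] / total_responses)
        total_words = sum(word_counts[theme].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # Ignore words never seen in the coded subset.
            # Add-one (Laplace) smoothing avoids log(0) for unseen theme/word pairs.
            score += math.log(
                (word_counts[theme][word] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_theme, best_score = theme, score
    return best_theme

model = train(HAND_CODED)
print(predict(model, "The hiring freeze cancelled my job offer"))
```

In practice a model like this would be trained on far more coded responses and checked against expert judgment before its labels were trusted, which mirrors the confirmation step described above.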

Combining natural language processing with machine learning techniques provided a broader understanding of the open-ended responses, including the relative magnitude of the common themes and the identification of new themes. One of these new themes also emerged during independent qualitative cognitive interviews. It was nevertheless excluded from the major response options in the revised COVID-19 questions because the quantitative data showed that only a small minority of respondents mentioned it.
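The idea of weighing themes by their relative magnitude can be sketched as follows. This is an illustrative example only: the theme labels, the counts, and the 10 percent cutoff are all assumptions, not figures from the SED analysis.

```python
from collections import Counter

def theme_shares(assigned_themes):
    """Return each theme's share of all coded responses, most common first."""
    if not assigned_themes:
        return {}
    counts = Counter(assigned_themes)
    total = len(assigned_themes)
    return {theme: count / total for theme, count in counts.most_common()}

def split_by_prevalence(shares, min_share):
    """Separate themes at or above the cutoff from those below it."""
    major = [theme for theme, share in shares.items() if share >= min_share]
    minor = [theme for theme, share in shares.items() if share < min_share]
    return major, minor

# Hypothetical theme labels for 20 responses (invented counts, not SED results).
labels = (
    ["graduation_timeline"] * 10
    + ["research_disruption"] * 6
    + ["career_plans"] * 3
    + ["hypothetical_new_theme"] * 1
)

shares = theme_shares(labels)
# The 0.10 cutoff is an assumed threshold chosen for this example.
major, minor = split_by_prevalence(shares, min_share=0.10)
print(major)
print(minor)
```

A theme that falls below the cutoff, like the invented `hypothetical_new_theme` here, would be a candidate for leaving out of the main response options even if it also surfaced in qualitative interviews.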

We still have more to learn. Data collection for the SED is ongoing, and the results of this initial analysis informed the revision of the COVID-19 questions for the 2022 cycle of the SED. But the convergence of expertise that this project represents remains a competitive strength of RTI. We have built on our decades of leadership in survey research by integrating our growing data science capabilities.

A baseline for future understanding of pandemic-related trends in doctoral-level education

We believe the changes we made to the SED during the pandemic year will have an influence on other surveys for many years to come. Machine learning has great potential for survey research – it helps us develop better questions, analyze more results than before, and complete large projects in a cost-effective way. Adding open-ended questions enables surveys to adapt to changes and expand into new areas.

Meanwhile, the data we collected on how COVID-19 affected doctoral students at the onset of the pandemic will contribute to our understanding of the pandemic's effects on society. SED is a baseline study, meaning it forms the basis of a longitudinal study. Every two years until they reach the age of 75, a sample of science and engineering doctoral graduates receives a follow-up survey, allowing the NSF to keep up with their varied career paths. Over time, a picture will emerge of the pandemic’s long-term effects. What was the impact of travel restrictions on doctoral students studying in the U.S. on temporary visas? Did male and female doctoral students experience similar obstacles to graduation? How did delays in graduation impact the career opportunities of doctorate recipients? We will be able to answer these and many other questions based on future SED data.