Using Data Science to Solve Social Science Dilemmas
Exploring the promises and problems of big data at the Summer Institute in Computational Social Science
It’s become a bit of a cliché, however true, to say that the abundance of big data can help solve many of the world’s most pressing problems. But many of those who understand data analytics, writing on topics from corporate decision-making to baseball, have pointed out that data alone are not enough. These data need to be leveraged alongside rapid advances in data science and the exponential growth of computing power as part of the iterative process of scientific inquiry. For example, someone struggling to make meaning out of huge swaths of text data might employ machine learning to identify patterns that would otherwise have gone unnoticed. The application of rigorous computational methods to ready-made and custom-made data in order to solve problems is often referred to as computational science. Like many other social scientists who may not have been exposed to computer science and programming during their graduate education, I have found this world of data analytics and AI largely out of reach. The Summer Institute in Computational Social Science (SICSS) seeks to bridge that gap for researchers like me, bringing content expertise and computational science together to give analysts and academics the tools to apply powerful data analytics to their research.
SICSS, hosted by Princeton University and sponsored by the Russell Sage Foundation and the Alfred P. Sloan Foundation, brings together scholars, researchers, data scientists, and content experts from a variety of fields to promote computational social science (CSS), which merges computational science and social science research. The institute provides both instruction and practice with cutting-edge computational methodologies. This year, RTI served as a partner location for SICSS, providing a venue, unique data sources, and expert speakers for our group of 31 participants to collaborate and engage with these methods. In just two short weeks, we ran a gauntlet of CSS programming: we created web scrapers, conducted sentiment analysis on tweets pulled from Twitter’s API, trained machine learning algorithms, and developed crowdsourced surveys through Amazon MTurk. We engaged in crucial discussions about the ethics of computational science, acknowledging the dangers of deploying computational methods carelessly and the ways our research can be used and manipulated that we never intended. We were encouraged to think more deeply about the ethical use of CSS and to go further than typical IRB guidelines to ensure we properly respect our study participants’ privacy and livelihoods.
Beyond the lectures, we also had the opportunity to hear from CSS experts such as Beth Noveck, Alondra Nelson, and Jennifer Pan. The guest speakers taught us more about the applications of CSS in current research, how to publish CSS papers in academic journals, and how these methods are being used to address large-scale societal problems. Matthew Salganik, a Professor of Sociology at Princeton and one of the SICSS faculty, shared the example of the Fragile Families Challenge, a mass collaboration effort using the Fragile Families and Child Wellbeing Study to train machine learning algorithms to predict social outcomes like academic achievement, household eviction, and economic hardship. The RTI site had lectures from our own experts as well, including Sam Adams, Rob Chew, Kasey Jones, and Georgiy Bobashev, who discussed work that RTI is doing in CSS, including neural networks, deep learning, knowledge graphs, our cutting-edge data science work on agricultural drone imagery in Rwanda, and RTI’s U.S. Synthetic Household Population dataset. Despite the short time frame, we got a glimpse of how these innovative methods and strategies can be applied to social science research and gained access to the tools we needed to apply them to our own work.
The second half of the institute allowed us to apply what we learned in group projects. In three and a half days, we developed research questions, designed studies involving the methods we learned, and executed the research projects with enough time to present our findings. It is a testament to the talent we have both here at RTI and across the state that our groups were able to design compelling research and provide high-quality insights in such a short amount of time. One group, which included RTI employees Ben Allaire and Lawrence Whitley, conducted text and sentiment analyses on Yelp reviews to predict the number of stars reviewers would give restaurants. RTI employees Chris Inkpen, Marwa Salem, and Siri Warkentien worked in another group that used machine learning to model and estimate the undocumented immigrant population of North Carolina in conjunction with a text analysis of local newspaper articles about immigration. I worked alongside my RTI colleagues Derek Ramirez and Tara Weatherholt to develop machine learning algorithms to predict student-level outcomes from the High School Longitudinal Study of 2009 public-use file in conjunction with RTI’s Synthetic Household Population dataset. We explored whether students who strongly preferred to live close to home during college clustered in particular areas of North Carolina. We then created distance matrices to find the nearest in-state two-year and four-year postsecondary institutions and determine whether these students had college options nearby. This hands-on work was invaluable; by directly applying what we learned during SICSS and seeing our fellow participants conduct their own research, we gained a better sense of how these methods can be applied across diverse topics and brainstormed applications within our own content areas.
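For readers curious about what a distance-matrix step like the one described above might look like, here is a minimal Python sketch. It is not the code our group wrote: the great-circle (haversine) formula, the function names, and the coordinates are all illustrative assumptions, and the points are made up rather than drawn from the study data or the Synthetic Household Population dataset.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_matrix(origins, destinations):
    """Rows are origins, columns are destinations, entries are km."""
    return [[haversine_km(o[0], o[1], d[0], d[1]) for d in destinations]
            for o in origins]


def nearest_institution(households, institutions):
    """Return (institution name, distance in km) for each household."""
    names = list(institutions)
    coords = [institutions[name] for name in names]
    results = []
    for row in distance_matrix(households, coords):
        j = min(range(len(row)), key=row.__getitem__)  # column with smallest distance
        results.append((names[j], row[j]))
    return results


# Hypothetical (latitude, longitude) points, for illustration only.
households = [(35.78, -78.64), (35.60, -82.55)]
institutions = {
    "two_year_east": (35.91, -79.06),
    "four_year_west": (35.30, -83.18),
    "four_year_coast": (34.23, -77.87),
}

for name, km in nearest_institution(households, institutions):
    print(f"nearest: {name} ({km:.1f} km)")
```

Straight-line distance is only a rough proxy for access; a fuller analysis might substitute driving time, or build separate matrices for two-year and four-year institutions.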
Through SICSS, I was able to see first-hand the value of computational social science: the ability to creatively combine multiple complex data sources, programming languages such as R and Python, and machine learning gives social scientists the potential to examine pressing social questions. That ability goes hand-in-hand with very challenging methodological and ethical hurdles. These hurdles, however, provide us with a great opportunity for collaboration across disciplines and content areas, and when we overcome them, we can develop research agendas that promote RTI’s mission of improving the human condition through high-quality science.