Unified Framework Brings Fresh Approach to Text Classification

Peter Baumgartner, Senior Research Data Scientist

Jason Nance

September 16, 2019

RTI International’s gobbli bridges natural language processing techniques and real-world challenges to innovate in the text classification field.

How did we get here?

Imagine having to completely relearn how to read every time you picked up a new book. Until recently, this was the process used by machine learning algorithms to solve problems like sentiment analysis or document classification. These algorithms—unable to carry information between tasks or understand anything about general language—were only capable of solving the narrow problem defined by the data they learned from.

In the last year, there’s been a paradigm shift in how these problems are solved, which experts call natural language processing’s “ImageNet moment,” after the shift that pretrained models brought to computer vision. Computer vision now allows machines to detect pneumonia from X-rays, classify diabetic retinopathy in fundus photographs, and catalog residential areas in satellite imagery for surveys. But while computer vision applications have had time to flourish, the world of text is just getting its first taste of potential improvements.

At the heart of recent improvements is a concept known as transfer learning. In machine learning, an algorithm uses patterns in example data to learn statistical rules that associate an input with an output. Traditionally, a model has had to learn those rules from scratch for each new task and couldn’t carry information between problems. Transfer learning solves this problem by first modeling how a language works in general and then fine-tuning that language model for specific tasks. In this way, the algorithm learns from text much as humans do: by applying a general understanding of language to a specific task, without relearning the language each time a new task begins.
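The two-stage idea can be illustrated with a deliberately tiny sketch. It is only an analogy, not how modern language models or gobbli work: real transfer learning fine-tunes a neural language model, whereas this toy "pretrains" by counting which words appear together in unlabeled sentences and then reuses those counts to classify with just two labeled examples. All data and function names here are made up for illustration.

```python
import math
from collections import Counter, defaultdict

# Stage 1 data: unlabeled text used only to learn general word usage.
UNLABELED = [
    "the food was great and awesome",
    "an awesome and great experience",
    "the service was terrible and awful",
    "an awful and terrible experience",
]

# Stage 2 data: a very small labeled set for the specific task.
LABELED = [("great", "positive"), ("terrible", "negative")]


def pretrain(corpus):
    """Describe each word by counts of the other words in its sentences."""
    vectors = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for word in words:
            for other in words:
                if other != word:
                    vectors[word][other] += 1
    return vectors


def embed(text, vectors):
    """Sum the pretrained vectors of the words in a text."""
    total = Counter()
    for word in text.split():
        total.update(vectors.get(word, {}))
    return total


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fit(labeled, vectors):
    """Build one summed vector per label from the labeled examples."""
    centroids = defaultdict(Counter)
    for text, label in labeled:
        centroids[label].update(embed(text, vectors))
    return centroids


def predict(text, centroids, vectors):
    """Pick the label whose summed vector is most similar to the text."""
    doc = embed(text, vectors)
    return max(centroids, key=lambda label: cosine(doc, centroids[label]))


VECTORS = pretrain(UNLABELED)
CENTROIDS = fit(LABELED, VECTORS)

# "awesome" and "awful" never appear in LABELED, yet the knowledge
# carried over from the unlabeled text is enough to place them.
print(predict("awesome", CENTROIDS, VECTORS))
print(predict("awful", CENTROIDS, VECTORS))
```

A classifier trained only on the two labeled words would have nothing to say about a word it had never seen; the reused word statistics are what let the toy model generalize.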

A unified approach with gobbli

The shift to transfer learning with text has led to huge improvements on common language tasks. However, new language models are developed by different research teams at various institutions, so an applied practitioner has to learn the specifics of how each team implemented its model, including the interface the researchers built to interact with it.

To fill this gap, RTI has developed gobbli, an open-source Python library intended to bridge state-of-the-art research in natural language processing and its application to real-world problems and data. Gobbli provides a unified interface and supplementary tools for common text classification problems, making it easier to leverage transfer learning.
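To make the value of a unified interface concrete, here is a minimal sketch in plain Python. It does not use gobbli and does not reproduce its API: the TextClassifier interface, both toy models, and the sample survey comments are hypothetical, chosen only to show that once every model exposes the same train and predict methods, a single piece of evaluation code can run all of them.

```python
from collections import Counter
from typing import List, Protocol, Sequence


class TextClassifier(Protocol):
    """Hypothetical shared interface; not gobbli's actual API."""

    def train(self, texts: Sequence[str], labels: Sequence[str]) -> None: ...

    def predict(self, texts: Sequence[str]) -> List[str]: ...


class MajorityClassifier:
    """Baseline: always predicts the most common training label."""

    def train(self, texts, labels):
        self.label = Counter(labels).most_common(1)[0][0]

    def predict(self, texts):
        return [self.label for _ in texts]


class WordVoteClassifier:
    """Each known word votes for the labels it was seen with in training."""

    def train(self, texts, labels):
        self.fallback = Counter(labels).most_common(1)[0][0]
        self.word_labels = {}
        for text, label in zip(texts, labels):
            for word in text.lower().split():
                self.word_labels.setdefault(word, Counter())[label] += 1

    def predict(self, texts):
        predictions = []
        for text in texts:
            votes = Counter()
            for word in text.lower().split():
                votes.update(self.word_labels.get(word, {}))
            # Fall back to the majority label when no word is recognized.
            predictions.append(votes.most_common(1)[0][0] if votes else self.fallback)
        return predictions


def accuracy(model: TextClassifier, train, test) -> float:
    """Train and score any model that follows the shared interface."""
    model.train([text for text, _ in train], [label for _, label in train])
    predicted = model.predict([text for text, _ in test])
    return sum(p == label for p, (_, label) in zip(predicted, test)) / len(test)


TRAIN = [
    ("pay raise request", "compensation"),
    ("salary too low", "compensation"),
    ("better pay please", "compensation"),
    ("manager never listens", "management"),
    ("great manager support", "management"),
]
TEST = [("low pay", "compensation"), ("my manager listens", "management")]

# The same loop evaluates every model because they share one interface.
RESULTS = {
    type(model).__name__: accuracy(model, TRAIN, TEST)
    for model in (MajorityClassifier(), WordVoteClassifier())
}
print(RESULTS)
```

Swapping in a different model means adding one more object to the tuple, with no change to the evaluation code. That is the convenience a shared interface is meant to provide.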

Beyond providing a consistent interface to state-of-the-art models, gobbli provides supplementary tools inspired by the types of problems and datasets commonly faced when applying natural language processing to social science and survey research.

The first of these supplementary tools is an Experiments interface, which lets a practitioner evaluate performance tradeoffs across the available models and settings and empirically determine which models suit their use case. Gobbli also provides an interface for data augmentation, a powerful technique for generating synthetic text data to aid model development when there aren’t many examples to learn from. Data augmentation, combined with transfer learning, unlocks the potential to apply machine learning to problems where high performance was previously out of reach.
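As a rough illustration of data augmentation, the sketch below creates extra training examples by randomly swapping words for synonyms. It is an assumption-laden toy rather than gobbli's augmentation interface: the synonym table, the augment function, and its parameters are all hypothetical, and practical tools usually draw replacements from word embeddings or a language model instead of a hand-written table.

```python
import random

# Hypothetical synonym table, written by hand for this example only.
SYNONYMS = {
    "pay": ["salary", "wages"],
    "boss": ["manager", "supervisor"],
    "good": ["great", "excellent"],
}


def augment(text, n_variants=3, swap_prob=0.5, seed=0):
    """Return up to n_variants new texts made by swapping in synonyms."""
    rng = random.Random(seed)  # seeded so repeated runs give the same output
    variants = set()
    for _ in range(n_variants * 10):  # bounded number of attempts
        words = [
            rng.choice(SYNONYMS[word])
            if word in SYNONYMS and rng.random() < swap_prob
            else word
            for word in text.split()
        ]
        candidate = " ".join(words)
        if candidate != text:
            variants.add(candidate)
        if len(variants) == n_variants:
            break
    return sorted(variants)


LABELED = [("my boss is good", "management")]

# Each synthetic variant keeps the label of the example it came from.
AUGMENTED = [
    (variant, label)
    for text, label in LABELED
    for variant in augment(text)
]
for variant, label in AUGMENTED:
    print(label, "|", variant)
```

The synthetic examples would be added to the original labeled data before training, giving a model more varied wording to learn from without any additional manual coding.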

Proven success in practice

This unified framework has already proven useful on text classification projects. The RTI International Employee Survey used gobbli to consistently evaluate models and develop a final model that provided suggestions for how free-text responses should be coded into thematic categories. The suggested codes helped the team by pointing out the most likely relevant themes for a given response, while the qualitative team read through every comment and reviewed the suggestions for accuracy.

“The automated codes really helped speed up the coding process this year,” said Dawn Ohse, RTI Employee Survey Project Manager. “It was surprising the number of times it coded some fairly nuanced sentiment accurately. Given its high predictive accuracy it has great potential to make the coding process less and less time intensive in future administrations of the survey.”

The survey is a great example of the type of problem where transfer learning shines. Using 3,000 responses from the previous year’s survey, we built an algorithm that made coding suggestions for 25 unique codes. Typically, this type of problem sees poor results because the themes are complex and there are few examples to learn from. However, because transfer learning takes advantage of a language model, we can see much better performance from fewer examples.

Open source for future advancement

As is standard practice in the machine learning research community, gobbli will be released as open source on RTI International's GitHub. Releasing this software as open source has several benefits, including improving the quality of the software and helping advance the field by allowing others to apply gobbli to their own problems and contribute back their improvements.

Transfer learning shows great potential for solving text classification problems when little data is available. The unified interface in gobbli will allow applied practitioners to use and evaluate text classification methodologies and deliver the best solution for the problem at hand.

Visit RTI International’s GitHub today to download gobbli or check out this post for more technical information on getting started.

Disclaimer: This piece was written by Peter Baumgartner (Senior Research Data Scientist) and Jason Nance to share perspectives on a topic of interest. Expression of opinions within are those of the author or authors.