How applying AI to academic transcript coding allows for processing more data in a fraction of the time
Academic transcripts can provide robust insights about education systems, but they have traditionally been time-consuming to process. To improve the efficiency and accuracy of the transcript data pipeline, RTI conducted an AI-assisted pilot project to auto-extract transcript data and recommend standardized course codes to which coders map course content. The team collaborated with the 2012/17 Beginning Postsecondary Students Longitudinal Study (BPS:12/17) to collect and analyze student data for the Department of Education and illustrate how AI can improve academic transcript processing.
Traditional Academic Transcript Processing
Put simply, the data processing pipeline has three main stages:
- Data extraction
- Data harmonization, transformation, and coding
- Data insights
Academic transcripts are difficult to handle in this pipeline for two reasons. First, they are dense. They contain a wealth of information, including schools attended, degrees and majors, courses and grades, and more, so manually extracting data from them is tedious. Second, standardizing fields of study and course content across colleges and universities is complex. The Classification of Instructional Programs (CIP) and the College Course Map (CCM) provide sets of six-digit codes that are used to create comparable data across postsecondary institutions.
In a recent transcript study, the Education and Workforce Development (EWD) group at RTI processed more than 40,000 transcripts containing over 500,000 courses. Manually keying and coding this volume of data can take upwards of eight months. Code suggestions are based on keyword searches, and staff often must look up course descriptions in an online catalog. Staff also double-code 10 percent of the data to ensure accuracy. The process is time-intensive and costly, which is where RTI saw an opportunity to apply AI capabilities.
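The double-coding check described above can be illustrated with a minimal sketch that measures how often two coders agree on a course. It assumes the six-digit codes are hierarchical, so that a 2-digit or 4-digit prefix represents a broader level of the taxonomy, as in the published CIP structure. The function names and the sample code pairs are hypothetical and are not taken from the study.

```python
def normalize(code: str) -> str:
    """Strip the dot so '23.0101' becomes the six-digit string '230101'."""
    digits = code.replace(".", "")
    if len(digits) != 6 or not digits.isdigit():
        raise ValueError(f"expected a six-digit code, got {code!r}")
    return digits


def agreement_by_level(pairs, levels=(2, 4, 6)):
    """Share of double-coded courses where both coders agree at each prefix length."""
    if not pairs:
        raise ValueError("no double-coded pairs supplied")
    normalized = [(normalize(a), normalize(b)) for a, b in pairs]
    return {
        level: sum(a[:level] == b[:level] for a, b in normalized) / len(normalized)
        for level in levels
    }


# Hypothetical double-coded sample: (first coder's code, second coder's code).
sample = [
    ("23.0101", "23.0101"),  # agree at every level
    ("23.0101", "23.1301"),  # agree at the 2-digit level only
    ("27.0101", "27.0102"),  # agree through the 4-digit level
    ("27.0101", "40.0801"),  # disagree at every level
]

rates = agreement_by_level(sample)
print(rates)  # {2: 0.75, 4: 0.5, 6: 0.25}
```

Agreement naturally falls as the level becomes more specific, which is why the 6-digit rate is the most demanding measure of coder consistency.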
Using AI in Academic Transcript Processing
Course coding is complex. While linking titles such as "Intro to English" and "Freshman English" makes intuitive sense to humans, it is very difficult to develop a rule-based coding system to classify courses that go by different names at different colleges. That’s where AI comes in. Traditional coding begins with rules, applies them to datasets, and presents insights. Machine learning instead uses past examples to learn rules that are not easily articulated and develops insights from them. To create an effective machine learning model for academic transcript processing, the pilot project needed examples of how humans have coded specific courses in the past.
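The contrast between hand-written rules and rules learned from past coding can be sketched in a few lines. This is a deliberately simple word-vote classifier, not the deep learning model used in the pilot, and the course titles, code labels (`ENGLISH_COMP`, `CALCULUS`), and function names are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical historical examples: (course title, code chosen by a human coder).
HISTORY = [
    ("Intro to English", "ENGLISH_COMP"),
    ("English Composition I", "ENGLISH_COMP"),
    ("Freshman Composition", "ENGLISH_COMP"),
    ("Calculus I", "CALCULUS"),
    ("Intro to Calculus", "CALCULUS"),
]

# Rule-based approach: an exact-title lookup that someone must maintain by hand.
RULES = {"intro to english": "ENGLISH_COMP", "calculus i": "CALCULUS"}


def rule_based(title):
    """Return a code only when the title matches a hand-written rule exactly."""
    return RULES.get(title.lower())


def train(history):
    """Count how often each title word co-occurs with each human-assigned code."""
    word_counts = defaultdict(Counter)
    for title, code in history:
        for word in title.lower().split():
            word_counts[word][code] += 1
    return word_counts


def predict(word_counts, title):
    """Score codes by summing each word's share of past votes; None if no word is known."""
    scores = Counter()
    for word in title.lower().split():
        votes = word_counts.get(word)
        if votes:
            total = sum(votes.values())
            for code, count in votes.items():
                scores[code] += count / total
    return scores.most_common(1)[0][0] if scores else None


model = train(HISTORY)
print(rule_based("Freshman English"))      # None: no rule covers this title
print(predict(model, "Freshman English"))  # ENGLISH_COMP
```

The lookup table fails on a title it has never seen, while the learned word statistics generalize from "Freshman" and "English" appearing in previously coded titles.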
Luckily, RTI coders have coded more than 2.2 million courses over the last decade, which provided a large potential dataset. To develop a model that could predict course codes as a human would, the pilot project ran 1.5 million coded courses through the machine learning model 50 times, so the model saw 75 million training examples in total. After developing an adept model, the team applied AI to the academic transcript data pipeline via course code recommendations.
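The 75 million figure follows directly from the two numbers above. In machine learning terms, each full pass over the training set is commonly called an epoch, so the total counts repeated views of the same 1.5 million courses rather than 75 million distinct courses. The variable names below are illustrative.

```python
coded_courses = 1_500_000  # coded courses used for training (from the article)
passes = 50                # full passes over the training set, often called epochs

examples_seen = coded_courses * passes
print(f"{examples_seen:,}")  # 75,000,000
```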
First, the pilot used OpenText, an off-the-shelf software product, to auto-extract transcript data. Humans then validated that data within OpenText to ensure that there were no gaps or instances of mislabeling. Previously, it took a human an average of 10.7 minutes to enter the courses from one transcript. With the incorporation of AI, the time needed to process the entire transcript, courses plus additional information, dropped to 7.5 minutes. That included roughly three minutes for automation and validation and roughly four minutes for manual entry of items not designed to be captured via automation. Future uses of this software could capture more information.
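As a rough worked example, the per-transcript times above can be turned into a percentage reduction and a total at scale. The 40,000-transcript volume is borrowed from the earlier study mentioned in this article, and applying the pilot's timings to it is an assumption made only for illustration. Because the 10.7-minute figure covers courses only while the 7.5-minute figure covers the whole transcript, the result is a conservative estimate.

```python
manual_minutes = 10.7    # manual entry of courses only, per transcript
assisted_minutes = 7.5   # AI-assisted processing of the entire transcript

saved_minutes = manual_minutes - assisted_minutes
reduction = saved_minutes / manual_minutes

transcripts = 40_000     # illustrative volume, taken from the earlier study
hours_saved = saved_minutes * transcripts / 60

print(f"{reduction:.1%}")      # 29.9%
print(f"{hours_saved:,.0f}")   # 2,133
```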
The bottom line is that the use of AI for auto-extraction leads to more data in less time.
Course Code Recommendations
In addition to auto-extraction, the team developed a course code recommender that used historical data and deep learning to recommend the five course codes with the highest match probability. The model also produced an indication of how confident it was in each recommendation, although this information was not shared with coders to avoid bias. The course code recommender resulted in:
- Time savings averaging 12 seconds per course, which translates to more than 650 hours saved across 200,000 courses (12 seconds × 200,000 courses is about 667 hours).
- Coders choosing one of the five recommended codes more than 90 percent of the time and choosing the top recommended code nearly 70 percent of the time.
- Agreement rates between coders increasing with the incorporation of AI and the largest increase occurring at the most specific level of the coding taxonomy.
- The ability to process transcripts 67 percent more quickly, a gain that is expected to keep growing over time given the nature of the deep learning framework.
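The recommendation step and the acceptance figures above can be sketched as follows. This is a minimal illustration rather than the pilot's actual model: it assumes the model emits one raw score per candidate code and that a softmax converts those scores to probabilities, and the single-letter code labels, scores, and coder picks are hypothetical.

```python
import math


def softmax(scores):
    """Convert raw model scores into probabilities that sum to 1."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {code: math.exp(s - peak) for code, s in scores.items()}
    total = sum(exps.values())
    return {code: e / total for code, e in exps.items()}


def recommend(scores, k=5):
    """Return the k (code, probability) pairs with the highest probability, best first."""
    probs = softmax(scores)
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]


def acceptance_rates(recommendations, chosen):
    """Share of courses where the coder picked the top code, and any recommended code."""
    pairs = list(zip(recommendations, chosen))
    top1 = sum(recs[0] == pick for recs, pick in pairs)
    topk = sum(pick in recs for recs, pick in pairs)
    return top1 / len(pairs), topk / len(pairs)


# Hypothetical raw scores for one course across seven candidate codes.
raw_scores = {"A": 2.0, "B": 1.0, "C": 0.5, "D": 0.1, "E": -0.3, "F": -1.0, "G": -2.0}
top_five = [code for code, _ in recommend(raw_scores)]
print(top_five)  # ['A', 'B', 'C', 'D', 'E']

# Hypothetical recommendations for four courses and the code each coder chose.
recs = [list("ABCDE"), list("BACDE"), list("CABDE"), list("ACBED")]
picks = ["A", "A", "Z", "A"]
top1_rate, top5_rate = acceptance_rates(recs, picks)
print(top1_rate, top5_rate)  # 0.5 0.75
```

Only the ranked codes would be shown to coders in this sketch; the probabilities stay internal, consistent with the decision not to share confidence information.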
Applying AI to challenges such as academic transcript processing has the potential to lower costs and save time without sacrificing accuracy. While this project focused on the optimization of data extraction, this software format can be applied beyond transcripts and is currently being piloted for medical billing records and resumes. RTI is also developing a College Course Coder 2.0, which incorporates full course descriptions and raises the accuracy of the top-recommended course code from 69 percent to 83 percent. Finally, the team is working on a High School Course Coder to be used on high school transcript studies.
As the AI space continues to emerge and evolve, it is important to remember three things. First, legacy systems pose a significant barrier to reaping the full benefits of AI, so applying AI effectively will mean being open to changing traditional processes. Second, the incorporation of AI across projects will only get easier in the future, since a framework can be applied across domains after the initial development. Finally, the AI space is not as abstract as it might seem. This pilot project illustrated that incorporating AI can not only create measurable gains in efficiency and quality but also expand the scope of the future projects we take on, the problems we can solve, and the questions we can answer.