This process turns what was an arduous, days- or weeks-long data preparation task into one that returns predicted charge categories in a matter of seconds.
Behind the scenes: A fusion of machine learning and open-source software at work
ROTA implements what is known in the Natural Language Processing (NLP) domain as a Transformer model. Transformer models provide a number of benefits over classic machine learning methods, chief among them high classification accuracy even when little labeled training data is available.
Transformer models derive much of their benefit from transfer learning, so named because these models can transfer general knowledge to a specific use case. For ROTA, this means we can build on the general language knowledge contained in a pretrained language model, as well as any criminal justice text included in its original training data, rather than starting that process from scratch.
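To make this concrete, here is a minimal sketch of how transfer learning is typically set up with the open-source Hugging Face transformers library. The base model named here is illustrative rather than ROTA's actual configuration; the key idea is that the pretrained weights are reused and only a small classification head starts from scratch.

```python
# A minimal transfer-learning sketch using Hugging Face transformers.
# The base model is illustrative, not necessarily ROTA's actual choice.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Start from a general-purpose pretrained language model...
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# ...and attach a fresh classification head sized to the label set.
# The pretrained weights carry general language knowledge; only the
# task-specific head is initialized from scratch.
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=84,  # one label per NCRP charge category
)

# Fine-tuning on labeled offense texts then adapts the general
# language knowledge to the specific charge-coding task.
```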
The model used in ROTA is trained on a publicly available national lookup table combined with other hand-labeled offense text datasets. It achieves an overall accuracy of 93% when predicting across 84 unique NCRP charge categories. The power of transfer learning is most evident in categories for which we have very little data: the “Trafficking – Heroin” category, for example, has roughly 100 training observations, yet on average ROTA correctly classifies 90% of the examples in that category.
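For readers curious how per-category numbers like these are computed, here is a small illustrative sketch; the example records are made up, not drawn from ROTA's evaluation data.

```python
# Per-category accuracy: the kind of breakdown that shows how rare
# categories (like "Trafficking - Heroin") fare. Data is illustrative.
from collections import defaultdict

def per_category_accuracy(pairs):
    """pairs: iterable of (true_category, predicted_category)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for true_cat, pred_cat in pairs:
        total[true_cat] += 1
        if pred_cat == true_cat:
            correct[true_cat] += 1
    return {cat: correct[cat] / total[cat] for cat in total}

pairs = [
    ("Trafficking - Heroin", "Trafficking - Heroin"),
    ("Trafficking - Heroin", "Drug Possession"),
    ("Burglary", "Burglary"),
]
print(per_category_accuracy(pairs))
# {'Trafficking - Heroin': 0.5, 'Burglary': 1.0}
```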
The model also outputs a confidence score for each assignment, which a researcher can use to flag coded texts that need additional review and to accept texts that can be automatically coded with less concern. For example, automating only the predictions with a confidence score above 95 covers 93% of the data we used, and the model is 97% accurate on those texts.
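As a sketch of how a researcher might apply such a threshold, the following hypothetical helper splits predictions into auto-coded records and records queued for manual review. The function name, score scale, and example records are assumptions for illustration, not ROTA's actual interface.

```python
# Confidence-based triage, assuming each prediction carries a
# 0-100 confidence score. Names here are hypothetical.
THRESHOLD = 95

def triage(predictions):
    """Split predictions into auto-coded records and records needing review."""
    auto_coded, needs_review = [], []
    for text, category, confidence in predictions:
        if confidence > THRESHOLD:
            auto_coded.append((text, category))
        else:
            needs_review.append((text, category, confidence))
    return auto_coded, needs_review

# Example: only the high-confidence prediction is accepted automatically.
preds = [
    ("POSS HEROIN W/INTENT", "Trafficking - Heroin", 97.2),
    ("UNK OFFENSE CODE 42", "Other", 61.5),
]
auto_coded, needs_review = triage(preds)
```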
For charge category-specific performance, see our open-source model repository, which contains the latest trained model embedded in the web application along with its performance statistics. RTI decided to open-source the model to provide transparency about its performance and to enable researchers to continue developing it for similar data or tasks, for example by continuing to train the model or fine-tuning its weights.
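As a rough sketch of what continuing development could look like, the snippet below fine-tunes a released checkpoint on new labeled data using the Hugging Face Trainer API. The model ID and the tiny dataset are placeholders; see the repository for the actual checkpoint name and training setup.

```python
# Continuing training from a released checkpoint with Hugging Face
# transformers + datasets. Model ID and data are placeholders.
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

MODEL_ID = "your-org/rota-checkpoint"  # placeholder; see the repo for the real ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)

# Tiny illustrative dataset; real use would load a researcher's labeled file.
data = Dataset.from_dict({
    "text": ["POSS HEROIN W/INT TO DEL", "BURGLARY 2ND DEGREE"],
    "label": [0, 1],  # integer-encoded charge categories
})

def tokenize(batch):
    # Convert raw offense texts into the model's input format.
    return tokenizer(batch["text"], truncation=True, padding="max_length")

data = data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="rota-finetuned", num_train_epochs=3),
    train_dataset=data,
)
trainer.train()  # continues training from the released weights
```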
ROTA utilizes additional open-source technology in the form of the Streamlit library and hosting services to bring the tool to a web interface. Streamlit enabled our developers to stand up a web application in short order, making the model's predictions easy for researchers in the field to access and use.
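For a sense of how little code Streamlit requires, here is a minimal sketch of a text-classification front end. The predict helper is a stand-in for however the trained model is actually invoked, not ROTA's real application code.

```python
# A minimal Streamlit front end for a text classifier.
import streamlit as st

def predict(text: str):
    # Placeholder: the real app would run the transformer model and
    # return (category, confidence) for the offense text.
    return "Trafficking - Heroin", 97.2

st.title("ROTA")

offense_text = st.text_input("Enter a raw offense description:")
if offense_text:
    category, confidence = predict(offense_text)
    st.write(f"Predicted NCRP category: {category}")
    st.write(f"Confidence: {confidence}")
```

Running `streamlit run app.py` serves this page locally, which is essentially all it takes to put a model behind a browser interface.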
Utilizing NCRP charge categories is only one method of training the ROTA tool to make predictions. Future iterations of ROTA may include additional categorization schemes for offense texts.