Classification Scoring for Cleaning Inconsistent Survey Data

M. Thissen

Thissen, M. (2017). Classification Scoring for Cleaning Inconsistent Survey Data. International Journal of Data Engineering, 7(1), 1-14. http://www.cscjournals.org/library/manuscriptinfo.php?mc=IJDE-122

Abstract

Data engineers are often asked to detect and resolve inconsistencies within data sets. For some problematic data sources, there is no option to ask for corrections or updates, and the processing steps must do their best with the values in hand. Such circumstances arise in processing survey data, in constructing knowledge bases or data warehouses [1], and in using some public or open data sets.

The goal of data cleaning, sometimes called data editing or integrity checking, is to improve the accuracy of each data record and by extension the quality of the data set as a whole. Generally, this is accomplished through deterministic processes that recode specific data points according to static rules based entirely on data from within the individual record. This traditional method works well for many purposes. However, when high levels of inconsistency exist within an individual respondent's data, classification scoring may provide better results.
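
As a point of comparison, the traditional deterministic approach can be sketched as a static rule that recodes a value using only other fields in the same record. The field names and the rule below are hypothetical illustrations, not taken from the survey described in this paper.

```python
def clean_record(record):
    """Apply a static, record-local edit rule and return a cleaned copy.

    Hypothetical rule: a respondent who reports never having smoked
    cannot also report a positive cigarettes-per-day count, so the
    count is recoded to 0.
    """
    cleaned = dict(record)  # leave the caller's record untouched
    if cleaned.get("ever_smoked") == "no" and cleaned.get("cigs_per_day"):
        cleaned["cigs_per_day"] = 0
    return cleaned


inconsistent = {"ever_smoked": "no", "cigs_per_day": 5}
consistent = {"ever_smoked": "yes", "cigs_per_day": 5}
print(clean_record(inconsistent))
print(clean_record(consistent))
```

A rule like this never looks beyond the single record, which is why it can struggle when many fields within one respondent's data contradict each other.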

Classification scoring is a two-stage process that makes use of information from more than the individual data record. In the first stage, population data is used to define a model, and in the second stage the model is applied to the individual record. The author and colleagues turned to a classification scoring method to resolve inconsistencies in a key value from a recent health survey. Drawing records from a pool of about 11,000 survey respondents for use in training, we defined a model and used it to classify the vital status of the survey subject, since in the case of proxy surveys, the subject of the study may be a different person from the respondent. The scoring model was tested on the next several months' receipts and then applied on a flow basis during the remainder of data collection to the scanned and interpreted forms for a total of 18,841 unique survey subjects. Classification results were confirmed through external means to further validate the approach. This paper provides methodology and algorithmic details and suggests when this type of cleaning process may be useful.
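
The two-stage idea can be illustrated with a minimal sketch. This is not the scoring model from the paper, whose details are not given in this abstract; it assumes a simple smoothed log-odds weighting (similar in spirit to naive Bayes), and the indicator names, function names, and training records are all hypothetical.

```python
import math
from collections import Counter


def fit_weights(training, smoothing=1.0):
    """Stage 1: derive one weight per indicator from labelled population data.

    training: iterable of (indicators, label) pairs, where indicators is a
    set of strings and label is True or False (assumed known for training).
    Each weight is a smoothed log ratio of how often the indicator appears
    in True records versus False records.
    """
    pos, neg = Counter(), Counter()
    n_pos = n_neg = 0
    for indicators, label in training:
        if label:
            n_pos += 1
            pos.update(indicators)
        else:
            n_neg += 1
            neg.update(indicators)
    weights = {}
    for ind in set(pos) | set(neg):
        p_pos = (pos[ind] + smoothing) / (n_pos + 2 * smoothing)
        p_neg = (neg[ind] + smoothing) / (n_neg + 2 * smoothing)
        weights[ind] = math.log(p_pos / p_neg)
    return weights


def score(indicators, weights):
    """Stage 2: sum the weights of the indicators present in one record."""
    return sum(weights.get(ind, 0.0) for ind in indicators)


def classify(indicators, weights, threshold=0.0):
    """Classify a single record as True when its score exceeds the threshold."""
    return score(indicators, weights) > threshold


# Hypothetical training pool: label True means "subject is deceased".
training = [
    ({"proxy_respondent", "past_tense_answers"}, True),
    ({"proxy_respondent", "past_tense_answers"}, True),
    ({"reports_current_activity"}, False),
    ({"proxy_respondent", "reports_current_activity"}, False),
]
weights = fit_weights(training)
print(classify({"past_tense_answers"}, weights))
print(classify({"reports_current_activity"}, weights))
```

The point of the sketch is the division of labor: the weights come from the population of records, while the decision for any one record uses only that record's own indicators combined with those weights.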
