Testing a tool for the classification of study designs in systematic reviews of interventions and exposures showed moderate reliability and low accuracy
Hartling, L., Bond, K., Santaguida, P. L., Viswanathan, M., & Dryden, D. M. (2011). Testing a tool for the classification of study designs in systematic reviews of interventions and exposures showed moderate reliability and low accuracy. Journal of Clinical Epidemiology, 64(8), 861-871. https://doi.org/10.1016/j.jclinepi.2011.01.010
Objectives: To develop and test a study design classification tool.
Study design: We contacted relevant organizations and individuals to identify tools used to classify study designs and ranked these using predefined criteria. The highest ranked tool was a design algorithm developed, but no longer advocated, by the Cochrane Non-Randomized Studies Methods Group; this was modified to include additional study designs and decision points. We developed a reference classification for 30 studies; 6 testers applied the tool to these studies. Interrater reliability (Fleiss' κ) and accuracy against the reference classification were assessed. The tool was further revised and retested.
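As an illustration of the two measures named above, the following minimal sketch computes Fleiss' κ and a simple accuracy figure for a set of studies classified by several testers. It is not the authors' analysis code. The function names, the study design labels, and the definition of accuracy as the proportion of individual tester classifications that match the reference classification are all assumptions made for this example.

```python
import math
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of studies, each a list of one label per rater.

    Every study must be rated by the same number of raters (at least two).
    """
    n_studies = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(row) != n_raters for row in ratings):
        raise ValueError("each study needs the same number (>= 2) of ratings")

    category_totals = Counter()
    agreement_sum = 0.0
    for row in ratings:
        counts = Counter(row)
        category_totals.update(counts)
        # Proportion of agreeing rater pairs for this study.
        agreement_sum += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )

    observed = agreement_sum / n_studies
    total = n_studies * n_raters
    # Agreement expected by chance from the overall category proportions.
    expected = sum((c / total) ** 2 for c in category_totals.values())
    if math.isclose(expected, 1.0):
        return float("nan")  # undefined when only one category was ever used
    return (observed - expected) / (1 - expected)


def accuracy(ratings, reference):
    """Share of individual ratings that match the reference label (assumed definition)."""
    if len(ratings) != len(reference):
        raise ValueError("one reference label is needed per study")
    matches = sum(
        label == ref for row, ref in zip(ratings, reference) for label in row
    )
    return matches / sum(len(row) for row in ratings)


# Hypothetical data: 3 studies, each classified by 3 testers.
example_ratings = [
    ["RCT", "RCT", "cohort"],
    ["cohort", "cohort", "cohort"],
    ["case-control", "cohort", "case-control"],
]
example_reference = ["RCT", "cohort", "case-control"]

print(f"kappa = {fleiss_kappa(example_ratings):.2f}")
print(f"accuracy = {accuracy(example_ratings, example_reference):.2f}")
```

With real data, each inner list would hold the 6 testers' classifications of one study, and the reference list would hold the reference classification for each of the 30 studies.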
Results: Initial reliability was fair among the testers (κ=0.26) and among the reference standard raters (κ=0.33). Testing after revisions showed improved reliability (κ=0.45, moderate agreement) with improved, but still low, accuracy. The most common disagreements were whether the study design was experimental (5 of 15 studies) and whether there was a comparison of any kind (4 of 15 studies). Agreement was higher among testers who had completed graduate-level training than among those who had not.
Conclusion: The moderate reliability and low accuracy may reflect a lack of clarity and comprehensiveness in the tool, inadequate reporting in the studies, and variability in tester characteristics. The results may not be generalizable to all published studies, as the test studies were selected because they had posed challenges for previous reviewers with respect to their design classification. Application of such a tool should be accompanied by training, pilot testing, and context-specific decision rules.