Inter-Rater Variability in Checklist Assessment of Resident Performance
Heather N Mack, Suzanne Dintzis, Sheila Mehri, Daniel F Luff, Jennie Stuijk, Gregory Kotnis, Stephen S Raab. University of Washington, Seattle, WA; University of Colorado, Denver; Cleveland Clinic, Cleveland, OH
Background: The assessment of resident performance through the use of test checklists provides feedback for improvement. Measuring inter-rater agreement on test items establishes the checklist reliability necessary before these tests can serve as a true measure of performance. We measured inter-rater variability in assessing resident responses in simulated communication scenarios.
Design: We developed 20 information transfer scenarios of resident communication with pathologists and clinical educators who role-played clinicians, laboratory staff, and pathologists. From 2 to 7 raters (pathologists, technicians, and pathologist assistants) measured resident performance with checklists that contained 15-25 elements, divided into categories of Introduction, Information Giving, Information Seeking, Information Verification, Content, and Empathy/Conflict Resolution. The scenarios represented a range of typical communications and behaviors, including anger management and empathy. We tested 2-10 point Likert scales (ranging from categories of performed well to performed poorly to not performed) for each checklist item to measure initial and experienced inter-rater scoring variability using the metric of crude agreement.
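The abstract does not spell out how crude agreement was computed; a common definition is the proportion of rater pairs assigning the identical score to a checklist item. A minimal sketch under that assumption:

```python
from itertools import combinations

def crude_agreement(scores):
    """Crude (percent) agreement for one checklist item:
    the fraction of rater pairs that assigned identical scores.
    `scores` is one Likert score per rater."""
    pairs = list(combinations(scores, 2))
    if not pairs:  # a single rater trivially agrees with itself
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)

# Hypothetical example: 5 raters scoring one item on a 5-point scale
print(crude_agreement([4, 4, 4, 3, 4]))  # 6 of 10 pairs agree -> 0.6
```

With only 2 raters the statistic reduces to a simple match/no-match per item, which is one reason coarser (2-3 category) scales tend to show higher agreement.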
Results: The inter-rater variability in scoring depended on checklist category and rater experience. Depending on the number of raters, Introduction and Content items had the highest level of scoring agreement, with crude agreement ranging from 80%-100% even for novice raters on Likert scale checklists of 5 or more categories. Information Giving/Seeking/Verification items generally had a lower level of scoring agreement (20%-100%) for novice raters on Likert scale checklists of 3 or more categories. Scoring agreement for these items improved with more experienced raters. Empathy/Conflict Resolution items had variable scoring agreement (20%-100%) that depended on specific scenarios and rater experience; Likert scales with 2-3 categories had higher scoring agreement than scales with more categories.
Conclusions: We conclude that inter-rater checklist scoring variability in communication scenarios testing resident performance depended on factors such as rater experience, checklist item, and scenario design. In testing scenarios, lower scoring variability across all checklist items was generally achieved using Likert scales comprising 2 or 3 categories. For training and educational scenarios, a larger number of Likert scale categories allows for more granular feedback.
Tuesday, March 20, 2012 9:30 AM
Poster Session III # 129, Tuesday Morning