The Hierarchical Rater Model for Rated Test Items and Its Application to Large-Scale Educational Assessment Data
Authors
Abstract
Single and multiple ratings of test items have become a stock component of standardized educational tests and surveys. For both formative and summative evaluation of raters, a number of multiple-read rating designs are now commonplace (Wilson and Hoskens, 1999), including designs with as many as six raters per item (e.g. Sykes, Heidorn and Lee, 1999). As digital image-based distributed rating becomes widespread, we expect the use of multiple raters as a routine part of test scoring to grow; increasing the number of raters also raises the possibility of improving the precision of examinee proficiency estimates. In this paper we develop Patz’s (1996) hierarchical rater model (HRM) for polytomously scored item response data, and show how it can be used, for example, to scale examinees and items, to model aspects of consensus among raters, and to model individual rater severity and consistency effects. The HRM treats examinee responses to open-ended items as unobserved discrete variables, and it explicitly models the “proficiency” of raters in assigning accurate scores as well as the proficiency of examinees in providing correct responses. We show how the HRM fits into the generalizability theory framework that has been the traditional analysis tool for rated item response data, and give some relationships between the HRM, the design effects correction of Bock, Brennan and Muraki (1999), and the rater bundles model of Wilson and Hoskens (1999). Using simulated data, we compare analyses using the conventional IRT Facets model for rating data (e.g. Linacre, 1989; Engelhard, 1994, 1996) and illustrate parameter recovery for the HRM.
We also analyze data from a study of three different rating modalities intended to support a Grade 5 mathematics exam given in the State of Florida (Sykes, Heidorn and Lee, 1999) to show how the HRM can be used to identify individual raters of poor reliability or excessive severity, how standard errors of estimation of examinee scale scores are affected by multiple reads, and how the HRM scales up to rating designs involving large numbers of raters.
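The rater layer of the HRM described above can be thought of as a discrete signal-detection process: conditional on an examinee's latent "ideal" score category for an item, each observed rating is drawn from a distribution centered at that category shifted by the rater's severity, with spread governed by the rater's consistency. The following is a minimal simulation sketch of that idea, not the paper's implementation; the parameter values (`phi`, `psi`, the number of categories, and the latent category) are hypothetical, chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical illustration parameters (not estimated from the paper's data).
K = 5        # number of score categories: 0, 1, ..., K-1
phi = -0.5   # rater severity shift (negative = severe: tends to rate low)
psi = 0.7    # rater inconsistency: larger = noisier ratings

def rating_probs(xi, phi, psi, K):
    """P(observed rating k | latent 'ideal' category xi) for one rater:
    probabilities proportional to a normal kernel centered at xi + phi
    with standard deviation psi, evaluated at the K categories."""
    k = np.arange(K)
    w = np.exp(-0.5 * ((k - (xi + phi)) / psi) ** 2)
    return w / w.sum()

# Three independent raters score a response whose latent category is 3;
# a severe rater shifts mass toward lower categories.
probs = rating_probs(xi=3, phi=phi, psi=psi, K=K)
ratings = rng.choice(K, size=3, p=probs)
print(probs.round(3), ratings)
```

Under this kind of two-stage structure, adding raters sharpens the estimate of the latent category (and hence of examinee proficiency), while a rater's `phi` and `psi` capture the individual severity and reliability effects the paper uses to flag problematic raters.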
Similar Articles
Towards a Task-Based Assessment of Professional Competencies
Performance assessment is increasingly considered a key concept in teacher education programs worldwide. Accordingly, in Iran, a national assessment system was proposed by Farhangian University to assess the professional competencies of its ELT graduates. The concerns regarding the validity and authenticity of traditional measures of teachers' competencies have motivated us to devise a localized...
Constructing and Validating a Q-Matrix for Cognitive Diagnostic Analysis of a Reading Comprehension Test Battery
Of paramount importance in the study of cognitive diagnostic assessment (CDA) is the absence of tests developed for small-scale diagnostic purposes. Currently, much of the research carried out has been mainly on large-scale tests, e.g., TOEFL, MELAB, IELTS, etc. Even so, formative language assessment with a focus on informing instruction and engaging in identification of student’s strengths and...
Implicational Scaling of Reading Comprehension Construct: Is it Deterministic or Probabilistic?
In English as a Second Language teaching and testing situations, it is common to draw inferences about a learner's reading ability based on his or her total score on a reading test. This assumes the unidimensional and reproducible nature of reading items. However, little research has been conducted to probe the issue through psychometric analyses. In the present study, the IELTS exemplar module C (1994) wa...
Validity and Reliability of the Persian Version of the Lawton Instrumental Activities of Daily Living Scale in Patients with Dementia
Objectives: The Lawton IADL (instrumental activities of daily living) Scale is considered one of the most widely used tools to assess activities of daily living in patients with dementia, but its validity and reliability have never been assessed in Persian-speaking populations. The purpose of this study was to investigate the validity and reliability of this widely used scale among patients with ...
A Many-facet Rasch Model to Detect Halo Effect in Three Types of Raters
Raters play a central role in rater-mediated assessment, and rater variability, manifested in various forms including rater errors, contributes to construct-irrelevant variance which can adversely affect an examinee's test score. The halo effect, as a subcomponent of rater error, is one of the most pervasive errors which, if not detected, can result in obscuring an examinee's score and threatening val...