Kappa Statistics for Multiple Raters Using Categorical Classifications
ABSTRACT
In order to assess the reliability of a given characterization of a subject, it is often necessary to obtain multiple readings, usually but not always from different individuals or raters. The degree of agreement among the various raters gives some indication of the consistency of the ratings. If agreement is high, we feel more confident that the ratings reflect the actual circumstance; if agreement among the raters is low, we are less confident in the results. While several methods are available for measuring agreement when there are only two raters, this paper concentrates on presenting a generalized implementation of the Fleiss (1981) technique, which can be used when there are more than two raters and/or more than two categories. A review of the statistical theory behind the intraclass correlation coefficients and kappa statistics obtained in these situations is presented, and SAS code that uses only basic SAS procedures is provided.

INTRODUCTION

We are often faced with measuring interrater agreement when the ratings are on a categorical scale. When there are exactly two raters, this is easily accomplished: SAS PROC CORR gives an estimate of the correlation coefficient, and in SAS 6.10 PROC FREQ with the AGREE option also provides the kappa statistic (see the sketches following the METHODS section below). Fleiss describes a technique for measuring interrater agreement when the number of raters is greater than or equal to two. This paper concentrates on obtaining a measure of agreement when there are more than two raters, and on the technique required when the ratings can fall into more than two categories.

BACKGROUND

The data used in this paper to demonstrate the calculation of interrater agreement were compiled from a gastric graft-versus-host disease study (Washington et al., 1996, submitted). This research compares three pathologists' diagnoses of 51 different gastric biopsies; each pathologist reviewed the biopsies in a blinded fashion. The specific data used in the example concern agreement on the density of the inflammatory infiltrate in the lamina propria, graded in categories ranging from zero to three.

METHODS

Since the kappa statistic (κ) was proposed by Cohen (1960), numerous variants and related measures have appeared, including those of Scott (1955), Maxwell and Pilliner (1968), and Bangdiwala (1987). SAS code has also been presented by Gaccione (1993) to compute the kappa statistic. The various ...
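As a minimal illustration of the two-rater case mentioned in the INTRODUCTION (a sketch, not the paper's actual program), assume a data set RATINGS with one observation per biopsy and the two raters' grades in variables RATER1 and RATER2; the data set and variable names are hypothetical:

   proc freq data=ratings;
      * AGREE requests the simple and weighted kappa statistics ;
      * along with the test of symmetry for the square table    ;
      tables rater1*rater2 / agree;
   run;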
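For more than two raters, Fleiss' statistic can be assembled from basic procedures. Overall kappa is (Pbar - Pe) / (1 - Pe), where Pbar is the mean per-subject agreement and Pe is the agreement expected by chance. The sketch below is one way to compute it, assuming a data set COUNTS with one observation per biopsy and variables CAT0-CAT3 holding how many of the three pathologists assigned each inflammation grade; all data set, variable, and macro names here are illustrative and are not taken from the paper's code.

   %let nraters = 3;

   data persubj;
      set counts;
      array c{4} cat0-cat3;
      * per-subject agreement P(i) = (sum over j of n(i,j)**2 - n) / (n*(n-1)) ;
      sumsq = 0;
      do j = 1 to 4;
         sumsq = sumsq + c{j}**2;
      end;
      p_i = (sumsq - &nraters) / (&nraters * (&nraters - 1));
   run;

   * average the per-subject agreement and the per-category counts ;
   proc means data=persubj noprint;
      var p_i cat0-cat3;
      output out=sums mean=pbar m0 m1 m2 m3;
   run;

   data kappa;
      set sums;
      * chance agreement P(e) = sum over j of p(j)**2, with p(j) = mean count / n ;
      pe = (m0/&nraters)**2 + (m1/&nraters)**2
         + (m2/&nraters)**2 + (m3/&nraters)**2;
      kappa = (pbar - pe) / (1 - pe);
      put 'Fleiss kappa = ' kappa;
   run;

Applying the same arithmetic to each category separately yields the category-specific kappas that Fleiss also describes.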