Inter-rater Agreement Measures and the Refinement of Metrics in the PLATO MT Evaluation Paradigm
Authors
Abstract
The PLATO machine translation (MT) evaluation (MTE) research program aims to systematically develop a predictive relationship between discrete, well-defined MTE metrics and the specific information processing tasks that can be reliably performed with MT output. At its core are traditional measures of quality informed by the International Standards for Language Engineering (ISLE), namely clarity, coherence, morphology, syntax, general and domain-specific lexical robustness, and named-entity translation, as well as a DARPA-inspired measure of adequacy. For robust validation, which is indispensable for refining tests and guidelines, we measure inter-rater reliability on the assessments. Here we report our results, focusing on the PLATO Clarity and Coherence assessments, and discuss our method for iteratively refining both the linguistic metrics and the guidelines for applying them within the PLATO evaluation paradigm. Finally, we discuss reasons why kappa might not be the best measure of inter-rater agreement for our purposes, and suggest directions for future investigation.
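To make the agreement statistic discussed above concrete, the following is a minimal sketch of Cohen's kappa for two raters, computed from scratch: observed agreement corrected for the agreement expected by chance given each rater's label marginals. The ratings shown are hypothetical illustrations, not data from the PLATO study.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items where the raters match.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label rates.
    ca, cb = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    p_e = sum((ca[lab] / n) * (cb[lab] / n) for lab in labels)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical clarity scores (1-3 scale) from two judges.
judge_1 = [3, 2, 3, 1, 2, 3, 2, 1]
judge_2 = [3, 2, 2, 1, 2, 3, 3, 1]
print(round(cohens_kappa(judge_1, judge_2), 3))  # 0.619
```

Note that kappa's chance correction depends on the marginal distributions, which is one source of the known paradoxes (e.g. high raw agreement but low kappa when one category dominates) that motivate considering alternative agreement measures.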
Similar Resources
Test-Retest and Inter-Rater Reliability Study of the Schedule for Oral-Motor Assessment in Persian Children
Objectives: Reliable and valid clinical tools to screen, diagnose, and describe eating functions and dysphagia in children are highly warranted. Today most specialists are aware of the role of assessment scales in the treatment of affected individuals. However, the problem is that the clinical tools used might be nonstandard, and worldwide, there is no integrated assessment performed to assess ...
Functional Movement Screen in Elite Boy Basketball Players: A Reliability Study
Purpose: To investigate the reliability of Functional Movement Screen (FMS) in basketball players. A few studies have compared the reliability of FMS between raters with different experience in athletes. The purpose of this study was to compare the FMS scoring between the beginners and expert raters using video records. Methods: This is a cross-sectional study. The study subjects compris...
Nurse-Physician Agreement on Triage Category: A Reliability Analysis of Emergency Severity Index
Background and Objectives: The Emergency Severity Index (ESI) triage is commonly used in clinical settings to determine the patients’ emergency severity. However, the reliability of this index is not sufficiently explored. The present study examines the inter-rater reliability of ESI by comparing triage ratings as performed by nurses and physicians. Methods: This prospective cross-sectional st...
Formal v. Informal: Register-Differentiated Arabic MT Evaluation in the PLATO Paradigm
Tasks performed on machine translation (MT) output are associated with input text types such as genre and topic. Predictive Linguistic Assessments of Translation Output, or PLATO, MT Evaluation (MTE) explores a predictive relationship between linguistic metrics and the information processing tasks reliably performable on output. PLATO assigns a linguistic signature, which cuts across the task-b...
Quiz-Based Evaluation of Machine Translation
This paper proposes a new method of manual evaluation for statistical machine translation, the so-called quiz-based evaluation, estimating whether people are able to extract information from machine-translated texts reliably. We apply the method to two commercial and two experimental MT systems that participated in WMT 2010 in English-to-Czech translation. We report inter-annotator agreement for...