Learning from human judgments of machine translation output
Abstract
Human translators are the key to evaluating machine translation (MT) quality, and also to addressing the so-far unanswered question of when and how to use MT in professional translation workflows. Usually, human judgments come in the form of ranking the outputs of different translation systems; more recently, post-edits of MT output have also come into focus. This paper describes the results of a detailed, large-scale human evaluation consisting of three tightly connected tasks: ranking, error classification and post-editing. Translation outputs from three domains and six translation directions, generated by five distinct translation systems, were analysed with the goal of gaining relevant insights for further improvement of MT quality and applicability.
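To make the ranking task concrete, the sketch below shows one common way of turning pairwise ranking judgments into a per-system score by counting wins. This is an illustrative simplification, not the aggregation procedure used in the paper; the `judgments` data layout and the system names are hypothetical.

```python
# Minimal sketch (not the paper's actual protocol): turn pairwise ranking
# judgments into a per-system score by counting how often each system wins.
from collections import Counter

def rank_systems(judgments):
    """judgments: iterable of (winner, loser) system-name pairs."""
    wins = Counter()
    comparisons = Counter()
    for winner, loser in judgments:
        wins[winner] += 1
        comparisons[winner] += 1
        comparisons[loser] += 1
    # Score = fraction of pairwise comparisons each system won; best first.
    scores = {s: wins[s] / comparisons[s] for s in comparisons}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

toy = [("sysA", "sysB"), ("sysA", "sysC"), ("sysB", "sysC"), ("sysC", "sysB")]
print(rank_systems(toy))  # sysA wins all its comparisons, so it ranks first
```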
Similar resources
Combining Confidence Estimation and Reference-based Metrics for Segment-level MT Evaluation
We describe an effort to improve standard reference-based metrics for Machine Translation (MT) evaluation by enriching them with Confidence Estimation (CE) features and using a learning mechanism trained on human annotations. Reference-based MT evaluation metrics compare the system output against reference translations, looking for overlaps at different levels (lexical, syntactic, and semantic)...
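As a rough illustration of the combination idea, the following sketch trains a regressor on a mix of a reference-based overlap score and confidence-estimation features, using human segment-level scores as targets. The feature set, the toy numbers and the choice of Ridge regression are assumptions made for illustration, not the authors' actual setup.

```python
# Sketch: learn to combine a reference-based score with CE features,
# supervised by human segment-level judgments (illustrative data only).
import numpy as np
from sklearn.linear_model import Ridge

# Each row: [reference_overlap_score, source_length, lm_perplexity_of_output]
X_train = np.array([
    [0.42, 18, 120.0],
    [0.61, 22,  80.5],
    [0.35, 15, 210.3],
    [0.70, 25,  60.1],
])
y_train = np.array([2.5, 3.8, 2.0, 4.2])  # human segment scores (e.g. 1-5 scale)

model = Ridge(alpha=1.0).fit(X_train, y_train)

X_new = np.array([[0.55, 20, 95.0]])
print("predicted segment score:", model.predict(X_new)[0])
```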
Training a Sentence-Level Machine Translation Confidence Measure
We present a supervised method for training a sentence-level confidence measure on translation output using a human-annotated corpus. We evaluate a variety of machine learning methods. The resultant measure, while trained on a very small dataset, correlates well with human judgments, and proves to be effective in one task-based evaluation. Although the experiments have only been run on one MT sy...
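A minimal sketch of what such a sentence-level confidence measure could look like is given below: a binary classifier trained on human-annotated acceptability labels, whose predicted probability serves as the confidence score. The features and the logistic-regression learner are illustrative assumptions rather than the paper's configuration.

```python
# Sketch: a sentence-level confidence measure as a binary classifier
# trained on human-annotated output (acceptable vs. not acceptable).
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-sentence features: [model score, length ratio, OOV count]
X = np.array([
    [-4.2, 0.95, 0],
    [-9.8, 0.60, 3],
    [-3.5, 1.05, 0],
    [-7.1, 0.80, 2],
])
y = np.array([1, 0, 1, 0])  # 1 = judged acceptable by the annotator

clf = LogisticRegression().fit(X, y)

# The predicted probability of the "acceptable" class is the confidence score.
new_sentence = np.array([[-5.0, 0.90, 1]])
print("confidence:", clf.predict_proba(new_sentence)[0, 1])
```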
Crowd-Sourcing of Human Judgments of Machine Translation Fluency
Human evaluation of machine translation quality is a key element in the development of machine translation systems, as automatic metrics are validated through correlation with human judgment. However, achieving consistent human judgments of machine translation is not easy, with decreasing levels of consistency reported in annual evaluation campaigns. In this paper we describe experiences g...
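One standard way to quantify the consistency mentioned here is an inter-annotator agreement coefficient such as Cohen's kappa; the sketch below computes it directly for two annotators. The label values and the choice of kappa are illustrative, since evaluation campaigns report several different agreement measures.

```python
# Sketch: Cohen's kappa for two annotators, computed without external libraries.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items where the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in set(labels_a) | set(labels_b)) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["better", "worse", "better", "tie", "better"]
b = ["better", "worse", "tie",    "tie", "worse"]
print(round(cohens_kappa(a, b), 3))
```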
X-Score: Automatic Evaluation of Machine Translation Grammaticality
In this paper we report an experiment with an automated metric used to analyze the grammaticality of machine translation output. The approach (Rajman, Hartley, 2001) is based on the distribution of linguistic information within a translated text, which is assumed to be similar between a learning corpus and the translation. This method is quite inexpensive, since it does not need any reference tran...
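The sketch below only illustrates the underlying intuition of comparing distributions of linguistic information between a learning corpus and a translation; it is not the actual X-Score formula from Rajman and Hartley (2001), and the POS-tag input and Euclidean distance are assumptions made for the example.

```python
# Rough illustration: compare the distribution of linguistic categories
# (hypothetical POS tags) in a translation against a learning corpus.
from collections import Counter
from math import sqrt

def tag_distribution(tags):
    counts = Counter(tags)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def distribution_distance(dist_a, dist_b):
    """Euclidean distance between two tag distributions (illustrative choice)."""
    keys = set(dist_a) | set(dist_b)
    return sqrt(sum((dist_a.get(k, 0.0) - dist_b.get(k, 0.0)) ** 2 for k in keys))

learning_corpus_tags = ["DET", "NOUN", "VERB", "DET", "NOUN", "ADJ", "NOUN"]
translation_tags = ["DET", "NOUN", "NOUN", "NOUN", "VERB"]

d = distribution_distance(tag_distribution(learning_corpus_tags),
                          tag_distribution(translation_tags))
print("distance to learning corpus:", round(d, 3))
```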
A Study of Translation Edit Rate with Targeted Human Annotation
We examine a new, intuitive measure for evaluating machine-translation output that avoids the knowledge-intensiveness of more meaning-based approaches, and the labor-intensiveness of human judgments. Translation Edit Rate (TER) measures the amount of editing that a human would have to perform to change a system output so it exactly matches a reference translation. We show that the single-refere...
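A simplified sketch of the TER idea follows: word-level edit distance (insertions, deletions, substitutions) normalized by the reference length. Real TER additionally allows block shifts and averages over multiple references; those parts are omitted here for brevity.

```python
# Simplified TER-style score: word-level edit distance / reference length.
def simple_ter(hypothesis, reference):
    hyp, ref = hypothesis.split(), reference.split()
    # Standard dynamic-programming edit distance over words.
    dp = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        dp[i][0] = i
    for j in range(len(ref) + 1):
        dp[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(hyp)][len(ref)] / len(ref)

print(simple_ter("the cat sat on mat", "the cat sat on the mat"))  # 1 edit / 6 words
```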