PE2rr Corpus: Manual Error Annotation of Automatically Pre-annotated MT Post-edits

Authors

  • Maja Popovic
  • Mihael Arcan
Abstract

We present a freely available corpus containing source language texts from different domains along with their automatically generated translations into several distinct morphologically rich languages, their post-edited versions, and error annotations of the performed post-edit operations. We believe that the corpus will be useful for many different applications. The main advantage of the approach used for creation of the corpus is the fusion of post-editing and error classification tasks, which have usually been seen as two independent tasks, although naturally they are not. We also show benefits of coupling automatic and manual error classification which facilitates the complex manual error annotation task as well as the development of automatic error classification tools. In addition, the approach facilitates annotation of language pair related issues.

Similar resources

Terra: a Collection of Translation Error-Annotated Corpora

Recently the first methods of automatic diagnostics of machine translation have emerged; since this area of research is relatively young, the efforts are not coordinated. We present a collection of translation error-annotated corpora, consisting of automatically produced translations and their detailed manual translation error analysis. Using the collected corpora we evaluate the available stat...

Automatic Annotation and Evaluation of Error Types for Grammatical Error Correction

Until now, error type performance for Grammatical Error Correction (GEC) systems could only be measured in terms of recall because system output is not annotated. To overcome this problem, we introduce ERRANT, a grammatical ERRor ANnotation Toolkit designed to automatically extract edits from parallel original and corrected sentences and classify them according to a new, dataset-agnostic, ruleb...

Automatic Error Detection in Annotated Corpora

An annotated corpus is a linguistic resource that explicitly encodes syntactic and semantic information for each sentence. Annotated corpora play a crucial role in many applications of natural language processing (NLP). Error-free and consistent annotated corpora are vital for these applications. Creating annotated corpora is an expensive and time-consuming process. Errors or anomali...

Treebank Development with Deductive and Abductive Explanation-based Learning: Exploratory Experiments

In pace with the success of corpus-based approaches to theoretical and computational linguistics, the collocation of corpora has evolved into a research activity in its own. As the currently available corpora either lack annotation depth or closure, more data will be annotated in the future, preferably with minimal human intervention. This paper tries to approach the problem of treebank develop...

Coping with the Subjectivity of Human Judgements in MT Quality Estimation

Supervised approaches to NLP tasks rely on high-quality data annotations, which typically result from expensive manual labelling procedures. For some tasks, however, the subjectivity of human judgements might reduce the usefulness of the annotation for real-world applications. In Machine Translation (MT) Quality Estimation (QE), for instance, using human-annotated data to train a binary classifi...

Publication date: 2016