Translation errors from English to Portuguese: an annotated corpus
Authors
Abstract
Analysing translation errors can help us find and describe translation problems in greater detail, and can also suggest where automatic engines should be improved. With these aims in mind, we created a corpus of 150 sentences: 50 from the TAP magazine, 50 from a TED talk, and 50 from the TREC collection of factoid questions. We automatically translated these sentences from English into Portuguese using Google Translate and Moses. After we analysed the errors and created an error annotation taxonomy, the corpus was annotated by a linguist who is a native speaker of Portuguese. Although Google’s overall performance was better in the translation task (we also calculated the BLEU and NIST scores), there are some error types that Moses coped with better, especially discourse-level errors.
Similar resources
TimeBankPT: A TimeML Annotated Corpus of Portuguese
In this paper, we introduce TimeBankPT, a TimeML annotated corpus of Portuguese. It has been produced by adapting an existing resource for English, namely the data used in the first TempEval challenge. TimeBankPT is the first corpus of Portuguese with rich temporal annotations (i.e. it includes annotations not only of temporal expressions but also about events and temporal relations). In additi...
The Presence and Influence of English in the Portuguese Financial Media
As the lingua franca of the 21st century, English has become the main language for intercultural communication for those wanting to embrace globalization. In Portugal, it is the second language of most public and private domains influencing its culture and discourses. Language contact situations transform languages by the incorporations they make from other languages and Portugal has...
Domain-Specific Hybrid Machine Translation from English to Portuguese
Machine translation (MT) from English to Portuguese has not typically received much attention in existing research. In this paper, we focus on MT from English to Portuguese for the specific domain of information technology (IT), building a small in-domain parallel corpus to address the lack of IT-specific and publicly-available parallel corpora and then adapted an existing hybrid MT system to t...
Can Projected Chains in Parallel Corpora Help Coreference Resolution?
The majority of current coreference resolution systems rely on annotated corpora to train classifiers for this task. However, this is possible only for languages for which annotated corpora are available. This paper presents a system that automatically extracts coreference chains from texts in Portuguese without the need for Portuguese corpora manually annotated with coreferential information. ...
Morphological Annotation System for Automated Tagging of Electronic Textual Corpora: from English to Romance Languages
Based on the Penn-Helsinki Parsed Corpus of Middle English[1], the Tycho Brahe Parsed Corpus of Historical Portuguese[2] consists of an electronic annotated corpus composed of prose, originally written in Portuguese by native speakers of European Portuguese (henceforth EP) born between the 16th and 19th centuries. The present annotation system to be applied to Portuguese has been developed in t...