Large SMT data-sets extracted from Wikipedia
Abstract
The article presents experiments on mining Wikipedia to extract sentence pairs useful for SMT in three language pairs. Each extracted sentence pair is associated with a cross-lingual lexical similarity score, based on which several evaluations were conducted to estimate the similarity thresholds that allow the extraction of the most useful data for training SMT systems for the three language pairs. The experiments showed that for a similarity score higher than 0.7, all sentence pairs in the three language pairs were fully parallel. However, including less parallel sentence pairs (that is, pairs with a lower similarity score) in the training sets yielded significant improvements in translation quality (BLEU-based evaluations). The optimized SMT systems were evaluated on unseen test sets also extracted from Wikipedia. As one of the main goals of our work was to help Wikipedia contributors translate new articles (with as little post-editing as possible) from major languages into less-resourced languages and vice versa, we call this type of translation experiment "in-genre" translation. As in the case of "in-domain" translation, our evaluations showed that using only "in-genre" training data for translating new texts of the same genre is better than mixing the training data with "out-of-genre" texts, even fully parallel ones.
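To make the threshold-based selection concrete, below is a minimal Python sketch of filtering scored sentence pairs before SMT training. It assumes each extracted pair is a (source, target, similarity) triple; the function name, the toy data, and the lower 0.5 threshold are illustrative assumptions, not the paper's actual extraction pipeline or similarity metric.

```python
# Minimal sketch: select sentence pairs by cross-lingual similarity threshold.
# The data layout and names below are assumptions for illustration only.

def filter_by_similarity(scored_pairs, threshold):
    """Keep sentence pairs whose cross-lingual lexical similarity score
    is at or above the given threshold."""
    return [(src, tgt) for src, tgt, score in scored_pairs if score >= threshold]

# Toy extracted data: (source sentence, target sentence, similarity score).
extracted_pairs = [
    ("La casa es roja.", "The house is red.", 0.82),
    ("La region tambien exporta vino y aceite.", "The region also exports wine.", 0.55),
    ("Historia temprana.", "See also: Economy.", 0.21),
]

# Per the abstract, pairs scoring above 0.7 were fully parallel, yet admitting
# lower-scoring (less parallel) pairs into the training set still improved BLEU,
# so the practical cut-off is tuned empirically per language pair.
strictly_parallel = filter_by_similarity(extracted_pairs, threshold=0.7)
larger_training_set = filter_by_similarity(extracted_pairs, threshold=0.5)
```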
Similar resources
Wikipedia as an SMT Training Corpus
This article reports on massive experiments supporting the idea that data extracted from strongly comparable corpora may successfully be used to build statistical machine translation systems of reasonable translation quality for in-domain new texts. The experiments were performed for three language pairs: Spanish-English, German-English and Romanian-English, based on large bilingual corpora of simil...
Wiki-Translator: Multilingual Experiments for In-Domain Translations
The benefits of using comparable corpora to improve translation quality for statistical machine translation have already been shown by various researchers. The usual approach starts with a baseline system trained on out-of-domain parallel corpora, followed by its adaptation to the domain in which new translations are needed. The adaptation to a new domain, especially for a narrow one, ...
Mining for Domain-specific Parallel Text from Wikipedia
Previous attempts at extracting parallel data from Wikipedia were restricted by the monotonicity constraint of the alignment algorithm used for matching possible candidates. This paper proposes a method for exploiting Wikipedia articles without worrying about the position of the sentences in the text. The algorithm ranks the candidate sentence pairs by means of a customized metric, which combin...
Automatic Building and Using Parallel Resources for SMT from Comparable Corpora
Building parallel resources for corpus-based machine translation, especially Statistical Machine Translation (SMT), from comparable corpora has recently received wide attention in the field of Machine Translation research. In this paper, we propose an automatic approach for the extraction of parallel fragments from comparable corpora. The comparable corpora are collected from Wikipedia documents and t...
DBpedia - A large-scale, multilingual knowledge base extracted from Wikipedia
The DBpedia community project extracts structured, multilingual knowledge from Wikipedia and makes it freely available using Semantic Web and Linked Data standards. The extracted knowledge, comprising more than 1.8 billion facts, is structured according to an ontology maintained by the community. The knowledge is obtained from different Wikipedia language editions, thus covering more than 100 l...