Morphological Processing of Compounds for Statistical Machine Translation
Abstract
Machine translation is the translation of a text from one language into another by a computer program. In the age of the internet and globalisation, the need for machine translation has grown steadily. Consider the European Union, with its 24 official languages, into each of which every official document must be translated. Without computer-aided translation systems, translating these documents would be less manageable and far less affordable.
Most state-of-the-art machine translation systems are based on statistical models. These models are trained on a bilingual text collection to “learn” translational correspondences between the words (and phrases) of the two languages. The underlying text collection must be parallel, i.e. each line in one language must correspond exactly to the translation of that line in the other language. Once trained, the statistical models can be used to translate new texts. However, one drawback of Statistical Machine Translation (SMT) is that it can only translate words that have occurred in the training texts.
This applies in particular to SMT systems designed for translating from and into German. German is well known for its productive word formation processes: speakers of German can combine existing words to form new words, called compounds. An example is “Apfel + Baum = Apfelbaum” (= “apple + tree = apple tree”). In theory, there is no limit to the length of a German compound. Whereas “Apfelbaum” (= “apple tree”) is a rather common German compound, “Apfelbaumholzpalettenabtransport” (= “apple|tree|wood|pallet|removal”) is a spontaneous new creation which has (probably) not occurred in any text collection yet. The productivity of German compounding leads to a large number of distinct compound types, many of which occur only with low frequency in a text collection, if they occur at all.
This fact makes German compounds a challenge for SMT systems, as only words which have occurred in the parallel training data can later be translated by the systems. Splitting compounds into their component words can solve this problem. For example, splitting “Apfelbaumholzpalettenabtransport” into its component words, it becomes in-
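The idea of splitting a compound into known component words can be illustrated with a minimal sketch. The abstract does not say which splitting method is used, so everything here is an assumption for illustration only: the toy lexicon `LEXICON`, the short list of linking elements `LINKERS`, the function name `split_compound`, and the preference for the split with the fewest parts. A real system would typically rely on corpus frequencies or a morphological analyser instead.

```python
from functools import lru_cache

# Hypothetical toy lexicon; a real system would derive it from training data.
LEXICON = {"apfel", "baum", "holz", "palette", "abtransport"}
# A few German linking elements (not exhaustive); "" means no linking element.
LINKERS = ("", "s", "n", "en", "es")


def split_compound(word, lexicon=LEXICON, linkers=LINKERS):
    """Return a list of lexicon words that cover `word`, or None if no split exists.

    Between two parts, one linking element from `linkers` may be skipped.
    Among all valid splits, the one with the fewest parts is returned.
    """
    word = word.lower()

    @lru_cache(maxsize=None)
    def solve(start):
        # Best split of word[start:], as a tuple of parts, or None.
        if start == len(word):
            return ()
        best = None
        for end in range(start + 1, len(word) + 1):
            part = word[start:end]
            if part not in lexicon:
                continue
            for linker in linkers:
                if not word.startswith(linker, end):
                    continue
                nxt = end + len(linker)
                # A linking element must be followed by another part.
                if linker and nxt >= len(word):
                    continue
                rest = solve(nxt)
                if rest is None:
                    continue
                candidate = (part,) + rest
                if best is None or len(candidate) < len(best):
                    best = candidate
        return best

    result = solve(0)
    return None if result is None else list(result)


print(split_compound("Apfelbaum"))
print(split_compound("Apfelbaumholzpalettenabtransport"))
```

With this toy lexicon, the long compound is covered by five known words even though the compound itself never appears in the lexicon, which is the effect the abstract describes: each part can be looked up individually.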
Similar resources
A new model for persian multi-part words edition based on statistical machine translation
Multi-part words in the English language are hyphenated, with the hyphen separating the different parts. The Persian language has multi-part words as well. According to Persian morphology, a half-space character is needed to separate the parts of multi-part words, but in many cases people incorrectly use a space character instead. This common incorrect use of the space character leads to some s...
Orthographic and Morphological Processing for Persian-to-English Statistical Machine Translation
In statistical machine translation, data sparsity is a challenging problem especially for languages with rich morphology and inconsistent orthography, such as Persian. We show that orthographic preprocessing and morphological segmentation of Persian verbs in particular improves the translation quality of Persian-English by 1.9 BLEU points on a blind test set.
Morphological Processing for English-Tamil Statistical Machine Translation
Various experiments from literature suggest that in statistical machine translation (SMT), applying either pre-processing or post-processing to morphologically rich languages leads to better translation quality. In this work, we focus on the English-Tamil language pair. We implement suffix-separation rules for both of the languages and evaluate the impact of this preprocessing on translation qu...
Effects of Morphological Analysis in Translation between German and English
We describe the LIU systems for German-English and English-German translation submitted to the Shared Task of the Third Workshop of Statistical Machine Translation. The main features of the systems, as compared with the baseline, is the use of morphological pre- and post-processing, and a sequence model for German using morphologically rich parts-of-speech. It is shown that these additions lead to...
A Joint Dependency Model of Morphological and Syntactic Structure for Statistical Machine Translation
When translating between two languages that differ in their degree of morphological synthesis, syntactic structures in one language may be realized as morphological structures in the other, and SMT models need a mechanism to learn such translations. Prior work has used morpheme splitting with flat representations that do not encode the hierarchical structure between morphemes, but this structur...
Publication date: 2014