Agreement Matters: Challenges of Translating into a Morphologically Rich Language, and the Advantages of a Syntax-Based System
Abstract
Consider the following simple English sentences: “I drive a car”, “I don’t know how to drive”, “I wash the car”, and “I wash the floor”. Translating them into Hebrew with Google’s statistical MT system yields: אני נוהג במכונית (I drive(masculine) a car); אני לא יודעת לנהוג (I don’t know(feminine) how to drive); אני רוחץ את המכונית (I wash(masculine) the car); and אני שוטפת את הרצפה (I wash(feminine) the floor). While amusing and not quite politically correct, these are all arguably very good translations: without explicit gender marking, the translator cannot know whether the speaker is masculine or feminine, and he (she?) resorts to deciding based on her (his?) cultural knowledge. This does, however, highlight a class of problems that arise when translating from a morphologically clean language (e.g. English) into a morphologically rich one (e.g. Hebrew): many words in the target language are morphologically marked for gender and number, and the translator should be able to generate these markings correctly, based on little, elusive, or sometimes no evidence in the source language. These issues are orthogonal to the data sparsity issues associated with highly inflected languages. Can current state-of-the-art statistical MT systems handle this? In what follows we present a few cases where the target-language output should be morphologically marked for either gender or number, with varying amounts (and sources) of information available in the source-language text, and discuss how well current translation models can handle these phenomena. We show that correct handling of morphological agreement is beyond the reach of current systems, as it requires better syntactic models, looking beyond a single sentence, and performing accurate anaphora resolution. However, while phrase-based models cannot model even the simplest cases, syntax-based models already possess most of the necessary machinery.
While we demonstrate using English ⇒ Hebrew translations, similar issues will occur when translating into practically any morphologically rich language. Moreover, the issues discussed remain relevant even when the source language is itself morphologically rich.
Similar resources
Enriching Morphologically Poor Languages for Statistical Machine Translation
We address the problem of translating from morphologically poor to morphologically rich languages by adding per-word linguistic information to the source language. We use the syntax of the source sentence to extract information for noun cases and verb persons and annotate the corresponding words accordingly. In experiments, we show improved performance for translating from English into Greek an...
Improving Translation to Morphologically Rich Languages (Améliorer la traduction des langages morphologiquement riches) [in French]
While statistical techniques for machine translation have made significant progress in the last 20 years, results for translating to morphologically rich languages are still mixed versus previous generation rule-based systems. Current research in statistical techniques for translating to morphologically rich languages varies greatly ...
Integrating morpho-syntactic features in English-Arabic statistical machine translation
This paper presents a hybrid approach to the enhancement of English to Arabic statistical machine translation quality. Machine Translation has been defined as the process that utilizes computer software to translate text from one natural language to another. Arabic, as a morphologically rich language, is a highly flexional language, in that the same root can lead to various forms according to i...
The musical language Elements of Persian musical language: modes, rhythm and syntax
In treating the subject of musical language, a Persian musician would be intrinsically drawn to the structural similarities between the Persian music and language. Indeed Persian music and language are extremely related in their metrics, intonations and structural phrases (syntax). Although we will draw upon this relationship, our aim in this article is to present “music as a language,” c...
Morphological, Syntactical and Semantic Knowledge in Statistical Machine Translation
This tutorial focuses on how morphology, syntax and semantics may be introduced into a standard phrase-based statistical machine translation system with techniques such as machine learning, parsing and word sense disambiguation, among others. Regarding the phrase-based system, we will describe only the key theory behind it. The main challenges of this approach are that the output contains unkno...