Copied Monolingual Data Improves Low-Resource Neural Machine Translation

نویسندگان

  • Anna Currey
  • Antonio Valerio Miceli Barone
  • Kenneth Heafield
چکیده

We train a neural machine translation (NMT) system to both translate sourcelanguage text and copy target-language text, thereby exploiting monolingual corpora in the target language. Specifically, we create a bitext from the monolingual text in the target language so that each source sentence is identical to the target sentence. This copied data is then mixed with the parallel corpus and the NMT system is trained like normal, with no metadata to distinguish the two input languages. Our proposed method proves to be an effective way of incorporating monolingual data into low-resource NMT. On Turkish↔English and Romanian↔English translation tasks, we see gains of up to 1.2 BLEU over a strong baseline with back-translation. Further analysis shows that the linguistic phenomena behind these gains are different from and largely orthogonal to back-translation, with our copied corpus method improving accuracy on named entities and other words that should remain identical between the source and target languages.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Adapting Attention-Based Neural Network to Low-Resource Mongolian-Chinese Machine Translation

Neural machine translation (NMT) has shown very promising results for some resourceful languages like En-Fr and En-De. The success partly relies on the availability of large scale and high quality parallel corpora. We research on how to adapt NMT to very low-resource Mongolian-Chinese machine translation by introducing attention mechanism, sub-words translation, monolingual data and a NMT corre...

متن کامل

On Using Monolingual Corpora in Neural Machine Translation

Recent works on end-to-end neural network-based architectures for machine translation have shown promising results for English-French and English-German translation. Unlike these language pairs, however, in the majority of scenarios, there is a lack of high quality parallel corpora. In this work, we focus on applying neural machine translation to challenging/low-resource languages Turkish and l...

متن کامل

Improving word alignment for low resource languages using English monolingual SRL

We introduce a new statistical machine translation approach specifically geared to learning translation from low resource languages, that exploits monolingual English semantic parsing to bias inversion transduction grammar (ITG) induction. We show that in contrast to conventional statistical machine translation (SMT) training methods, which rely heavily on phrase memorization, our approach focu...

متن کامل

Improving Neural Machine Translation Models with Monolingual Data

Neural Machine Translation (NMT) has obtained state-of-the art performance for several language pairs, while only using parallel data for training. Monolingual data plays an important role in boosting fluency for phrase-based statistical machine translation, and we investigate the use of monolingual data for neural machine translation (NMT). In contrast to previous work, which integrates a sepa...

متن کامل

Semi-Supervised Learning for Neural Machine Translation

While end-to-end neural machine translation (NMT) has made remarkable progress recently, NMT systems only rely on parallel corpora for parameter estimation. Since parallel corpora are usually limited in quantity, quality, and coverage, especially for low-resource languages, it is appealing to exploit monolingual corpora to improve NMT. We propose a semisupervised approach for training NMT model...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2017