Handling of Out-of-vocabulary Words in Japanese-English Machine Translation by Exploiting Parallel Corpus

Authors

  • Juan Luo
  • Yves Lepage
Abstract

The large number of loanwords and orthographic variants in Japanese poses a challenge for machine translation. In this article, we present a hybrid model for handling out-of-vocabulary (OOV) words in Japanese-to-English statistical machine translation output by exploiting a parallel corpus. As the Japanese writing system makes use of four different script sets (kanji, hiragana, katakana, and romaji), we treat these scripts differently. A machine transliteration model is built to transliterate out-of-vocabulary Japanese katakana words into English words. A Japanese dependency structure analyzer is employed to tackle out-of-vocabulary kanji and hiragana words. The evaluation results demonstrate that the model is an effective approach for addressing out-of-vocabulary word problems and decreasing the OOV rate in Japanese-to-English machine translation tasks.
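The script-dependent treatment described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `script_of` and `route_oov` and the returned labels are hypothetical, the script test relies only on the standard Unicode blocks for hiragana, katakana, and the basic CJK Unified Ideographs range, and the abstract does not say how romaji or mixed-script words are handled, so those are simply labeled `"unhandled"` here.

```python
def script_of(ch: str) -> str:
    """Classify one character by Unicode block (basic blocks only)."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:  # includes the prolonged sound mark U+30FC
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:  # CJK Unified Ideographs, base block only
        return "kanji"
    if ch.isascii() and ch.isalpha():
        return "romaji"
    return "other"


def route_oov(word: str) -> str:
    """Pick a (hypothetical) handler label for an out-of-vocabulary word."""
    scripts = {script_of(ch) for ch in word}
    if scripts == {"katakana"}:
        # Pure katakana words go to the transliteration model.
        return "transliteration"
    if scripts and scripts <= {"kanji", "hiragana"}:
        # Kanji and hiragana words go to the dependency structure analyzer.
        return "dependency_analysis"
    # Assumption: the abstract gives no rule for romaji or mixed scripts.
    return "unhandled"


if __name__ == "__main__":
    for w in ["コンピューター", "食べる", "ABC"]:
        print(w, route_oov(w))
```

In a real pipeline the routing step would receive the OOV tokens left untranslated in the decoder output; here it only shows how the script check separates the two cases named in the abstract.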


Similar resources

Exploiting Parallel Corpus for Handling Out-of-Vocabulary Words

This paper presents a hybrid model for handling out-of-vocabulary words in Japanese-to-English statistical machine translation output by exploiting parallel corpus. As the Japanese writing system makes use of four different script sets (kanji, hiragana, katakana, and romaji), we treat these scripts differently. A machine transliteration model is built to transliterate out-of-vocabulary Japanese k...


Applying Text Categorization to Vocabulary Enhancement for Japanese-English Cross-Language Retrieval

In this paper we explore a new method for vocabulary enhancement in cross-language retrieval. The focus is on whether we can improve upon dictionary-based retrieval, machine translation of queries, or the use of a bilingual lexicon derived from parallel corpus alignment. All experiments are done with the NACSIS collection of Japanese scientific abstracts with titles and author-assigned keywords...


CMU Haitian Creole-English Translation System for WMT 2011

This paper describes the statistical machine translation system submitted to the WMT11 Featured Translation Task, which involves translating Haitian Creole SMS messages into English. In our experiments we try to address the issue of noise in the training data, as well as the lack of parallel training data. Spelling normalization is applied to reduce out-of-vocabulary words in the corpus. Using ...


Assamese-English Bilingual Machine Translation

Machine translation is the process of translating text from one language to another. In this paper, statistical machine translation is performed on Assamese and English using a parallel corpus of the two languages. The statistical phrase-based translation toolkit Moses is used. To build the language model and to align the words, we used two other tools, IRSTLM and GIZA, respectively. BLEU sc...


Improving Japanese-to-English Neural Machine Translation by Paraphrasing the Target Language

Neural machine translation (NMT) produces sentences that are more fluent than those produced by statistical machine translation (SMT). However, NMT has a very high computational cost because of the high dimensionality of the output layer. Generally, NMT restricts the size of the vocabulary, which results in infrequent words being treated as out-of-vocabulary (OOV) and degrades the performance o...



Journal:
  • Int. J. of Asian Lang. Proc.

Volume 23

Publication date: 2015