Pre-processing of Bilingual Corpora for Mandarin-English EBMT

نویسندگان

  • Ying Zhang
  • Ralf Brown
  • Robert Frederking
  • Alon Lavie
چکیده

Pre-processing of bilingual corpora plays an important role in Example-Based Machine Translation (EBMT) and Statistical-Based Machine Translation (SBMT). For our Mandarin-English EBMT system, pre-processing includes segmentation for Mandarin, bracketing for English and building a statistical dictionary from the corpora. We used the Mandarin segmenter from the Linguistic Data Consortium (LDC). It uses dynamic programming with a frequency dictionary to segment the text. Although the frequency dictionary is large, it does not completely cover the corpora. In this paper, we describe the work we have done to improve the segmentation for Mandarin and the bracketing process for English to increase the length of English phrases. A statistical dictionary is built from the aligned bilingual corpus. It is used as feedback to segmentation and bracketing to re-segment / re-bracket the corpus. The process iterates several times to achieve better results. The final results of the corpus pre-processing are a segmented/bracketed aligned bilingual corpus and a statistical dictionary. We achieved positive results by increasing the average length of Chinese terms about 60% and 10% for English. The statistical dictionary gained about a 30% increase in coverage.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Noun versus Verb Bias in Mandarin- English Bilingual Pre-School Children

This study investigated the presence of noun or verb bias in 15 MandarinEnglish bilingual pre-school children. The naturalistic bilingual child-caregiver interactions were tape-recorded for 30 minutes each time. The study also addressed the relationship between children‟s language production and the salient positions of the caregivers‟ language input. The findings show that the bilingual childr...

متن کامل

Building Multiword Expressions Bilingual Lexicons for Domain Adaptation of an Example-Based Machine Translation System

We describe in this paper a hybrid approach to build automatically bilingual lexicons of Multiword Expressions (MWEs) from parallel corpora. We more specifically investigate the impact of using a domain-specific bilingual lexicon of MWEs on domain adaptation of an Example-Based Machine Translation (EBMT) system. We conducted experiments on the English-French language pair and two kinds of texts...

متن کامل

Subsentential Translation Memory for Computer Assisted Writing and Translation

This paper describes a database of translation memory, TotalRecall, developed to encourage authentic and idiomatic use in second language writing. TotalRecall is a bilingual concordancer that support search query in English or Chinese for relevant sentences and translations. Although initially intended for learners of English as Foreign Language (EFL) in Taiwan, it is a gold mine of texts in En...

متن کامل

The Effect of Second-Language Experience on Native-Language Processing.

Previous work on bilingual language processing indicates that native-language skills can influence second-language acquisition. The goal of the present work was to examine the influence of second-language experiences on native-language vocabulary and reading skills in two groups of bilingual speakers. English-Spanish and English-Mandarin bilingual adults were tested on vocabulary knowledge and ...

متن کامل

Multilingual Speech Corpora for TTS System Development

In this paper, four speech corpora collected in the Speech Lab of NCTU in recent years are discussed. They include a Mandarin treebank speech corpus, a Min-Nan speech corpus, a Hakka speech corpus, and a Chinese-English mixed speech corpus. Currently, they are used separately to develop a corpus-based Mandarin TTS system, a Min-Nan TTS system, a Hakka TTS system, and a Chinese-English bilingual...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2001