Bilingual Corpus Cleaning Focusing on Translation Literality

Author

  • Kenji Imamura
Abstract

When we automatically acquire translation knowledge from a bilingual corpus, redundant rules are generated due to translation variety. To overcome this problem, we propose bilingual corpus cleaning based on translation literality. Word-level correspondence and phrase-level correspondence are applied as the criteria of literality. Using these criteria, a bilingual corpus was cleaned, and translation knowledge for a pattern-based MT system was acquired from the cleaned corpus. As a result, the translation quality of the MT was improved despite reductions in the corpus size to about 81% and 87% by using word-level and phrase-level literality scores, respectively.
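The abstract does not give the literality formula, so the following is only a minimal sketch of the general idea of word-level cleaning. It assumes a hypothetical score defined as the fraction of tokens, on both sides of a sentence pair, that are linked through a bilingual lexicon. The names `word_level_literality`, `clean_corpus`, `lexicon`, and `threshold` are illustrative assumptions and are not taken from the paper.

```python
from typing import Dict, List, Set, Tuple

# A sentence pair is (source tokens, target tokens).
SentencePair = Tuple[List[str], List[str]]


def word_level_literality(
    source: List[str],
    target: List[str],
    lexicon: Dict[str, Set[str]],
) -> float:
    """Hypothetical score: share of tokens on both sides that are linked
    through the lexicon. This is an assumption, not the paper's formula."""
    if not source or not target:
        return 0.0
    target_vocab = set(target)
    linked_source = 0
    linked_target_words: Set[str] = set()
    for word in source:
        # Translations of this source word that actually occur in the target.
        matches = lexicon.get(word, set()) & target_vocab
        if matches:
            linked_source += 1
            linked_target_words |= matches
    linked_target = sum(1 for word in target if word in linked_target_words)
    return (linked_source + linked_target) / (len(source) + len(target))


def clean_corpus(
    corpus: List[SentencePair],
    lexicon: Dict[str, Set[str]],
    threshold: float,
) -> List[SentencePair]:
    """Keep only pairs whose assumed literality score reaches the threshold."""
    return [
        (source, target)
        for source, target in corpus
        if word_level_literality(source, target, lexicon) >= threshold
    ]
```

With a toy lexicon, a word-for-word pair scores high and is kept, while a free translation with no lexicon links scores zero and is dropped. A real system would need a principled threshold and a proper alignment model rather than exact lexicon lookup.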


Similar resources

Bilingual corpus cleaning focusing on translation literality

When we automatically acquire translation knowledge from a bilingual corpus, redundant rules are generated due to translation variety. To overcome this problem, we propose bilingual corpus cleaning based on translation literality. Word-level correspondence and phrase-level correspondence are applied as the criteria of literality. Using these criteria, a bilingual corpus was cleaned, and transla...


Automatic Thesaurus Generation through Multiple Filtering

In this paper, we propose a method of generating bilingual keyword clusters or thesauri from parallel or comparable bilingual corpora. The method combines morphological and lexical processing, bilingual word alignment, and graph-theoretic cluster generation. An experiment shows that the method is promising. 1 Introduction In this paper, we propose a method of automatic bilingua...


Lattice Score Based Data Cleaning For Phrase-Based Statistical Machine Translation

Statistical machine translation relies heavily on parallel corpora to train its models for translation tasks. While more and more bilingual corpora are readily available, the quality of the sentence pairs should be taken into consideration. This paper presents a novel lattice score-based data cleaning method to select proper sentence pairs from the ones extracted from a bilingual corpus by the ...


Application of Translation Knowledge Acquired by Hierarchical Phrase Alignment for Pattern-based MT

Hierarchical phrase alignment is a method for extracting equivalent phrases from bilingual sentences, even though they belong to different language families. The method automatically extracts transfer knowledge from about 125K English and Japanese bilingual sentences and then applies it to a pattern-based MT system. The translation quality is then evaluated. The knowledge needs to be cleaned, s...


Primary Data Encoding of a Bilingual Corpus

This paper discusses the building of a bilingual corpus of legal and administrative texts, focusing on the encoding of documentation and structural information according to the Corpus Encoding Standard. The corpus is one module in an ongoing research project about (semi-)automatic terminology acquisition at the European Academy Bolzano and will serve as a basis for applying term extraction prog...




Publication date: 2002