Automatic Construction of Translation Knowledge for Corpus-based Machine Translation

Author

  • Kenji Imamura
Abstract

Many machine translation (MT) systems that use knowledge automatically acquired from bilingual corpora have been proposed alongside efforts to accumulate corpora. This thesis calls this approach corpus-based machine translation. It focuses on the automatic construction of the translation knowledge needed for corpus-based MT and addresses the following three tasks:

1. Proposing a method for acquiring knowledge from bilingual corpora.
2. Applying the acquired knowledge to an actual MT engine and measuring the MT quality.
3. Identifying the inherent problems of corpus-based MT that decrease MT quality and proposing solutions.

A distinguishing feature of this thesis is that it investigates not only the first task but also the second and third. To clarify the characteristics of corpus-based MT, it identifies inherent problems by translating sentences with the acquired knowledge and proposes solutions.

For the first task, this thesis proposes a hierarchical phrase alignment (HPA) method. This method automatically extracts equivalent phrases, which are corresponding expressions between bilingual sentences. HPA employs parsers. Previous methods extract correspondences after determining the parse trees of the bilingual sentences, whereas HPA simultaneously extracts the best parse trees and the corresponding phrases by using a structural similarity measure called the phrase correspondence score. HPA has two features. One is the ability to resolve ...

∗Doctor’s Thesis, Department of Information Processing, Graduate School of Information Science, Nara Institute of Science and Technology, NAIST-IS-DT0261006, May 10, 2004.


Similar resources

The Correlation of Machine Translation Evaluation Metrics with Human Judgement on Persian Language

Machine Translation Evaluation Metrics (MTEMs) are the central core of Machine Translation (MT) engines as they are developed based on frequent evaluation. Although MTEMs are widespread today, their validity and quality for many languages are still in question. The aim of this research study was to examine the validity and assess the quality of MTEMs from Lexical Similarity set on machine tra...


Automatic Construction of Machine Translation Knowledge Using Translation Literalness

When machine translation (MT) knowledge is automatically constructed from bilingual corpora, redundant rules are acquired due to translation variety. These rules increase ambiguity or cause incorrect MT results. To overcome this problem, we constrain the sentences used for knowledge extraction to "the appropriate bilingual sentences for the MT." In this paper, we propose a method using translat...


Automatic and Human Evaluation Study of a Rule-based and a Statistical Catalan-Spanish Machine Translation Systems

Machine translation systems can be classified into rule-based and corpus-based approaches, in terms of their core technology. Since both paradigms have largely been used during the last years, one of the aims in the research community is to know how these systems differ in terms of translation quality. To this end, this paper reports a study and comparison of a rule-based and a corpus-based (pa...


Automatic Translation Template Acquisition Based on Bilingual Structure Alignment

Knowledge acquisition is a bottleneck in machine translation and many NLP tasks. A method for automatically acquiring translation templates from bilingual corpora is proposed in this paper. Bilingual sentence pairs are first aligned in syntactic structure by combining a language parsing with a statistical bilingual language model. The alignment results are used to extract translation templates ...


Corpus based coreference resolution for Farsi text

"Coreference resolution" or "finding all expressions that refer to the same entity" in a text, is one of the important requirements in natural language processing. Two words are coreference when both refer to a single entity in the text or the real world. So the main task of coreference resolution systems is to identify terms that refer to a unique entity. A coreference resolution tool could be...




Publication date: 2004