Learning Translations from Comparable Corpora

Author

  • David Talbot
Abstract

This thesis examines the possibility of using comparable corpora to augment statistical models of translation. Treating comparable corpora as marginal samples from an aligned bilingual joint distribution, we cast the estimation of translation models from a combination of bilingual parallel and comparable corpora as a variation of the labelled-unlabelled problem [Seeger, 2000b]. Results on synthetic data confirm that successful re-estimation within the EM framework [Dempster et al., 1977] is highly dependent on the balance between complete and incomplete data [Nigam, 2001]. We further show that the utility of re-estimation with additional incomplete data is highly dependent on the accuracy of the initial parameters estimated from the complete data alone. We propose a method for constraining the re-estimation procedure in relation to the degree of comparability between marginal samples. This results in better conditional models when the assumption of comparability is valid. Finally, we consider how more complex marginal models could be used to further constrain the re-estimation of the conditional.
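
The labelled-unlabelled setup described above can be pictured with a small sketch. This is not the thesis's model: it assumes a toy discrete joint table p(x, y) over source and target words, with parallel data given as pair counts (complete data) and comparable data given as separate source and target counts (incomplete, marginal data). The function names, the `weight` parameter that scales the incomplete data, and the `smooth` constant are all hypothetical choices made for illustration.

```python
def _ratio(a, b):
    # Guarded division so that cells with no probability mass stay at zero.
    return a / b if b > 0 else 0.0


def em_joint(pair_counts, src_counts, tgt_counts, iters=50, weight=1.0, smooth=1e-3):
    """Re-estimate a joint table p(x, y) from paired and marginal counts.

    pair_counts[x][y]: counts of aligned pairs (complete data).
    src_counts[x], tgt_counts[y]: counts observed without a partner
    (incomplete data). Assumes at least one observation overall.
    `weight` (hypothetical) scales the influence of the incomplete data.
    """
    m, n = len(pair_counts), len(pair_counts[0])
    # Initialise from the complete data alone, lightly smoothed.
    total = sum(map(sum, pair_counts)) + smooth * m * n
    joint = [[(pair_counts[x][y] + smooth) / total for y in range(n)]
             for x in range(m)]
    for _ in range(iters):
        row = [sum(joint[x]) for x in range(m)]                       # p(x)
        col = [sum(joint[x][y] for x in range(m)) for y in range(n)]  # p(y)
        # E-step: expected pair counts. Each marginal observation is spread
        # over its unseen partner in proportion to the current posterior.
        expected = [[pair_counts[x][y]
                     + weight * src_counts[x] * _ratio(joint[x][y], row[x])
                     + weight * tgt_counts[y] * _ratio(joint[x][y], col[y])
                     for y in range(n)] for x in range(m)]
        # M-step: renormalise the expected counts into a joint distribution.
        z = sum(map(sum, expected))
        joint = [[expected[x][y] / z for y in range(n)] for x in range(m)]
    return joint


def conditional(joint):
    """p(y | x), obtained by normalising each row of the joint table."""
    return [[_ratio(v, sum(r)) for v in r] for r in joint]
```

Calling `em_joint([[3, 1], [1, 3]], [10, 10], [10, 10])` and passing the result to `conditional` gives a translation table re-estimated from both kinds of data. Because the initial table comes only from the pair counts, a poor initial estimate is reinforced rather than corrected by the marginal counts, which is the sensitivity the abstract points to.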

Similar resources

Mining New Word Translations from Comparable Corpora

New words such as names, technical terms, etc. appear frequently. As such, the bilingual lexicon of a machine translation system has to be constantly updated with these new word translations. Comparable corpora such as news documents of the same period from different news agencies are readily available. In this paper, we present a new approach to mining new word translations from comparable corp...

Bootstrapping Entity Translation on Weakly Comparable Corpora

This paper studies the problem of mining named entity translations from comparable corpora with some “asymmetry”. Unlike the previous approaches relying on the “symmetry” found in parallel corpora, the proposed method is tolerant to asymmetry often found in comparable corpora, by distinguishing different semantics of relations of entity pairs to selectively propagate seed entity translations on...

Rare Word Translation Extraction from Aligned Comparable Documents

We present a first known result of high precision rare word bilingual extraction from comparable corpora, using aligned comparable documents and supervised classification. We incorporate two features, a context-vector similarity and a co-occurrence model between words in aligned documents in a machine learning approach. We test our hypothesis on different pairs of languages and corpora. We obta...

Identification of Fertile Translations in Medical Comparable Corpora: a Morpho-Compositional Approach

This paper defines a method for lexicon in the biomedical domain from comparable corpora. The method is based on compositional translation and exploits morpheme-level translation equivalences. It can generate translations for a large variety of morphologically constructed words and can also generate ’fertile’ translations. We show that fertile translations increase the overall quality of the ex...

Utilizing Citations of Foreign Words in Corpus-Based Dictionary Generation

Previous work concerned with the identification of word translations from text collections has been either based on parallel or on comparable corpora of the respective languages. In the case of comparable corpora basic dictionaries have been necessary to form a bridge between the languages under consideration. We present here a novel approach to identify word translations from a single monoling...

Publication date: 2003