LINA: Identifying Comparable Documents from Wikipedia

نویسندگان

  • Emmanuel Morin
  • Amir Hazem
  • Florian Boudin
  • Elizaveta Loginova Clouet
چکیده

This paper describes the LINA system for the BUCC 2015 shared track. Following (Enright and Kondrak, 2007), our system identify comparable documents by collecting counts of hapax words. We extend this method by filtering out document pairs sharing target documents using pigeonhole reasoning and cross-lingual information.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

On-line Compilation of Comparable Corpora and Their Evaluation

Using comparable corpora is became a topic in the mainstream Machine Translation (MT) research because, for less resourced languages, mining the Web for comparable corpora is assumed to be more productive than searching for parallel corpora. The experiments in using comparable corpora in enhancing translation models demonstrated significant improvements in MT accuracy. This paper reports on spe...

متن کامل

Identifying Word Translations from Comparable Documents Without a Seed Lexicon

The extraction of dictionaries from parallel text corpora is an established technique. However, as parallel corpora are a scarce resource, in recent years the extraction of dictionaries using comparable corpora has obtained increasing attention. In order to find a mapping between languages, almost all approaches suggested in the literature rely on a seed lexicon. The work described here achieve...

متن کامل

Identifying Parallel Documents from a Large Bilingual Collection of Texts: Application to Parallel Article Extraction in Wikipedia

While several recent works on dealing with large bilingual collections of texts, e.g. (Smith et al., 2010), seek for extracting parallel sentences from comparable corpora, we present PARADOCS, a system designed to recognize pairs of parallel documents in a (large) bilingual collection of texts. We show that this system outperforms a fair baseline (Enright and Kondrak, 2007) in a number of contr...

متن کامل

A Wikipedia-based Corpus for Contextualized Machine Translation

We describe a corpus for and experiments in target-contextualized machine translation (MT), in which we incorporate language models from target-language documents that are comparable in nature to the source documents. This corpus comprises (i) a set of curated English Wikipedia articles describing news events along with (ii) their comparable Spanish counterparts, (iii) a number of the Spanish s...

متن کامل

Improving MT System Using Extracted Parallel Fragments of Text from Comparable Corpora

In this article, we present an automated approach of extracting English-Bengali parallel fragments of text from comparable corpora created using Wikipedia documents. Our approach exploits the multilingualism of Wikipedia. The most important fact is that this approach does not need any domain specific corpus. We have been able to improve the BLEU score of an existing domain specific EnglishBenga...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2015