Improving Named Entity Translation by Exploiting Comparable and Parallel Corpora
Abstract
Translation of named entities (NEs), such as person, organization, country, and location names, is very important for several natural language processing applications. It plays a vital role in applications such as cross-lingual information retrieval and machine translation. Web and news documents introduce new named entities on a regular basis, and ordinary machine translation systems cannot capture those new names. In this paper, we introduce a framework for extracting named entity translation pairs. The framework contains methods for exploiting both comparable and parallel corpora to generate a regularly updated list of named entity translation pairs. We evaluate the quality of the extracted translation pairs by showing that they improve the performance of a named entity translation system. We report results on the ACE 2007 Entity Translation (ET) pilot evaluation development set.
Related resources
Extracting a Parallel Corpus from Comparable Documents to Improve Translation Quality in Machine Translation Systems
Data used for training statistical machine translation methods are usually prepared from three resources: parallel, non-parallel, and comparable text corpora. Parallel corpora are an ideal resource for translation, but because such texts are scarce, non-parallel and comparable corpora are also used for parallel text extraction. Most of the existing methods for exploiting comparable corpora loo...
Bootstrapping Entity Translation on Weakly Comparable Corpora
This paper studies the problem of mining named entity translations from comparable corpora with some “asymmetry”. Unlike the previous approaches relying on the “symmetry” found in parallel corpora, the proposed method is tolerant to asymmetry often found in comparable corpora, by distinguishing different semantics of relations of entity pairs to selectively propagate seed entity translations on...
ACCURAT Toolkit for Multi-Level Alignment and Information Extraction from Comparable Corpora
The lack of parallel corpora and linguistic resources for many languages and domains is one of the major obstacles for the further advancement of automated translation. A possible solution is to exploit comparable corpora (non-parallel bi- or multi-lingual text resources) which are much more widely available than parallel translation data. Our presented toolkit deals with parallel content extract...
Using Word Embeddings to Translate Named Entities
In this paper we investigate the usefulness of neural word embeddings in the process of translating Named Entities (NEs) from a resource-rich language to a language low on resources relevant to the task at hand, introducing a novel, yet simple way of obtaining bilingual word vectors. Inspired by observations in (Mikolov et al., 2013b), which show that training their word vector model on compara...
Corpus based coreference resolution for Farsi text
"Coreference resolution", or "finding all expressions that refer to the same entity" in a text, is one of the important requirements in natural language processing. Two words are coreferent when both refer to a single entity in the text or the real world. So the main task of coreference resolution systems is to identify terms that refer to a unique entity. A coreference resolution tool could be...