Translating Chinese Romanized Name into Chinese Idiographic Characters via Corpus and Web Validation

Authors

  • Yiping Li
  • Gregory Grefenstette
Abstract

Cross-language information retrieval performance depends on the quality of the translation resources used to pass from a user’s source language query to target language documents. Translation lists of proper names are rare but vital resources for cross-language retrieval between languages using different character sets. Named entity translation dictionaries can be extracted from bilingual corpora with some degree of success, but the limited coverage of these scarce corpora remains a problem. In this article, we present a technique for finding Chinese transliterations for any Chinese name written in English script. Our system performs transliteration of Pinyin (the standard Romanization for Chinese) to Chinese characters via corpus and web validation. Though Chinese family names form a small set, the number and variety of multisyllabic first names is great, and treatment is complicated by the fact that one Pinyin transliteration can correspond to hundreds of different Chinese characters. Our method finds the best translations of a Chinese name written in Pinyin by filtering out unlikely translations using a bigram model derived from a very large monolingual Chinese corpus, and then vetting the remaining candidate transliterations using Web statistics. We experimentally validate our method using an independent gold standard.

CORIA 05, Grenoble, France, 9-11 March 2005
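The pipeline outlined in the abstract (generate candidate characters for each Pinyin syllable, filter with a corpus bigram model, then vet with Web statistics) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the tiny `PINYIN_TO_CHARS` table, the `BIGRAM_COUNTS` and `WEB_HITS` values, the function names, and the simple count threshold are hypothetical and are not taken from the paper, which may score and filter candidates differently.

```python
from itertools import product

# Hypothetical toy data. A real system would use a full Pinyin-to-character
# table and bigram counts from a very large monolingual Chinese corpus.
PINYIN_TO_CHARS = {
    "li": ["李", "黎", "力"],
    "yi": ["一", "怡", "毅"],
    "ping": ["平", "萍"],
}
BIGRAM_COUNTS = {
    ("李", "一"): 20, ("一", "平"): 30,
    ("李", "怡"): 8, ("怡", "平"): 5,
    ("李", "毅"): 6, ("毅", "平"): 4,
    ("黎", "怡"): 1,
}
# Stand-in for Web statistics; a real system would query a search engine.
WEB_HITS = {"李一平": 1200, "李怡平": 300, "李毅平": 4500}


def generate_candidates(syllables):
    """Return every character string obtainable from the Pinyin syllables."""
    char_lists = [PINYIN_TO_CHARS[s] for s in syllables]
    return ["".join(chars) for chars in product(*char_lists)]


def filter_by_bigrams(candidates, bigram_counts, min_count=2):
    """Keep candidates whose adjacent character pairs all occur often enough.

    The fixed threshold is an assumption; unseen bigrams count as zero.
    """
    kept = []
    for cand in candidates:
        pairs = zip(cand, cand[1:])
        if all(bigram_counts.get(pair, 0) >= min_count for pair in pairs):
            kept.append(cand)
    return kept


def rank_by_web_hits(candidates, hit_count):
    """Order the surviving candidates by descending (stubbed) Web hit count."""
    return sorted(candidates, key=hit_count, reverse=True)


if __name__ == "__main__":
    all_candidates = generate_candidates(["li", "yi", "ping"])
    survivors = filter_by_bigrams(all_candidates, BIGRAM_COUNTS)
    ranked = rank_by_web_hits(survivors, lambda name: WEB_HITS.get(name, 0))
    print(len(all_candidates), survivors, ranked)
```

Even in this toy setting the candidate space grows multiplicatively with the number of syllables, which is why a cheap corpus-based filter is applied before the more expensive Web lookups.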


Similar resources

Finding Ideographic Representations of Japanese Names Written in Latin Script via Language Identification and Corpus Validation

Multilingual applications frequently involve dealing with proper names, but names are often missing in bilingual lexicons. This problem is exacerbated for applications involving translation between Latin-scripted languages and Asian languages such as Chinese, Japanese and Korean (CJK) where simple string copying is not a solution. We present a novel approach for generating the ideographic repre...


Mining Web data for Chinese segmentation

within documents as indexing terms for search of relevant documents. As Chinese is an ideographic character-based language, the words in the texts are not delimited by white spaces. Indexing of Chinese documents is impossible without a proper segmentation algorithm. Many Chinese segmentation algorithms have been proposed in the past. Traditional segmentation algorithms cannot operate without a ...


Dissociation in the neural basis underlying Chinese tone and vowel production.

Neuropsychologists have debated over whether the processing of segmental and suprasegmental units involves different neural mechanisms. Focusing on the production of Chinese lexical tones (suprasegmental units) and vowels (segmental units), this study used the adaptation paradigm to investigate a possible neural dissociation for tone and vowel production. Ten native Chinese speakers were asked ...


Korean-Chinese Person Name Translation for Cross Language Information Retrieval

Named entity translation plays an important role in many applications, such as information retrieval and machine translation. In this paper, we focus on translating person names, the most common type of named entity in Korean-Chinese cross language information retrieval (KCIR). Unlike other languages, Chinese uses characters (ideographs), which makes person name translation difficult because one...


Translating Characters’ Names in Hong Lou Meng during the 20th Century: From Seeking Lexical Equivalence to Maintaining Communicative Function

The classic Chinese novel Hong Lou Meng has been introduced into many different cultures through an important medium: translation. More than a dozen English versions have been published so far and have been studied by many researchers. In those translated works, a variety of translation strategies are adopted for translating characters’ names. Name translation is a small field of studies on...




Publication date: 2005