Substring-based unsupervised transliteration with phonetic and contextual knowledge

Authors

  • Anoop Kunchukuttan
  • Pushpak Bhattacharyya
  • Mitesh M. Khapra
Abstract

We propose an unsupervised approach to substring-based transliteration that incorporates two new sources of knowledge into the learning process: (i) context, by learning substring mappings rather than single-character mappings, and (ii) phonetic features, which capture cross-lingual character similarity via prior distributions. Our approach is a two-stage, iterative bootstrapping solution that vastly outperforms the state-of-the-art unsupervised transliteration method of Ravi and Knight (2009) and outperforms a rule-based baseline by up to 50% in top-1 accuracy on multiple language pairs. We show that substring-based models are superior to character-based models, and observe that their top-10 accuracy is comparable to the top-1 accuracy of supervised systems. Our method requires only a phonemic representation of the words. This is possible for many language-script combinations with a high grapheme-to-phoneme correspondence, e.g., the scripts of Indian languages derived from the Brahmi script. Hence, Indian languages were the focus of our experiments. For other languages, a grapheme-to-phoneme converter would be required.
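As a rough illustration of point (ii), the sketch below turns phonetic feature similarity between characters into a prior distribution over candidate target characters. It is only a minimal sketch under stated assumptions: the feature inventory, the cosine similarity measure, the `smoothing` constant, and the names `FEATURES`, `cosine`, and `phonetic_prior` are all hypothetical and are not taken from the paper, whose actual feature set and prior are not described in this abstract.

```python
import math

# Hypothetical toy phonetic feature vectors (an assumption, not the paper's set):
# (is_vowel, is_voiced, is_aspirated, is_labial, is_velar)
FEATURES = {
    "p": (0, 0, 0, 1, 0),
    "b": (0, 1, 0, 1, 0),
    "k": (0, 0, 0, 0, 1),
    "g": (0, 1, 0, 0, 1),
    "a": (1, 1, 0, 0, 0),
}


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def phonetic_prior(source_char, target_chars, smoothing=0.01):
    """Normalize smoothed similarities into a distribution over target characters.

    The smoothing constant (an assumed value) keeps every candidate at a
    nonzero probability, so later learning stages could still revise it.
    """
    scores = {
        t: cosine(FEATURES[source_char], FEATURES[t]) + smoothing
        for t in target_chars
    }
    total = sum(scores.values())
    return {t: s / total for t, s in scores.items()}


prior = phonetic_prior("p", ["p", "b", "k", "g", "a"])
```

With these toy features, the prior for "p" places the most mass on "p", less on the phonetically close "b", and only the smoothing mass on unrelated characters such as "k". In an unsupervised learner, a distribution of this kind could serve as the starting point for character mappings before any substring mappings are learned.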


Related resources

Unsupervised Named Entity Transliteration Using Temporal and Phonetic Correlation

In this paper we investigate unsupervised name transliteration using comparable corpora, corpora where texts in the two languages deal in some of the same topics — and therefore share references to named entities — but are not translations of each other. We present two distinct methods for transliteration, one approach using an unsupervised phonetic transliteration method, and the other using t...


Learning Transliteration Lexicons from the Web

This paper presents an adaptive learning framework for Phonetic Similarity Modeling (PSM) that supports the automatic construction of transliteration lexicons. The learning algorithm starts with minimum prior knowledge about machine transliteration, and acquires knowledge iteratively from the Web. We study the active learning and the unsupervised learning strategies that minimize human supervis...


Phoneme-based Statistical Transliteration of Foreign Names for OOV Problem

Given a source language term, machine transliteration is to automatically generate the phonetic equivalents in a target language. It is useful in many cross language applications. Recently, there are increasing concerns about automatic transliteration, especially with languages with significant distinctions in their phonetic representations, e.g. English and Chinese. Despite many cross-language...


An English-Korean Transliteration Model Using Pronunciation and Contextual Rules

There is increasing concern about English-Korean (E-K) transliteration recently. In the previous works, direct converting methods from English alphabets to Korean alphabets were a main research topic. In this paper, we present an E-K transliteration model using pronunciation and contextual rules. Unlike the previous works, our method uses phonetic information such as phoneme and its context. We...


Substring-Based Transliteration

Transliteration is the task of converting a word from one alphabetic script to another. We present a novel, substring-based approach to transliteration, inspired by phrase-based models of machine translation. We investigate two implementations of substring-based transliteration: a dynamic programming algorithm, and a finite-state transducer. We show that our substring-based transducer not only ou...




Publication date: 2016