Bootstrapping a Stochastic Transducer for Arabic-English Transliteration Extraction

Authors

  • Tarek Sherif
  • Grzegorz Kondrak
Abstract

We propose a bootstrapping approach to training a memoryless stochastic transducer for the task of extracting transliterations from an English-Arabic bitext. The transducer learns its similarity metric from the data in the bitext, and can therefore operate directly on strings written in different scripts without any additional language knowledge. We show that this bootstrapped transducer performs as well as or better than a model designed specifically to detect Arabic-English transliterations.
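
To make the idea of a memoryless stochastic transducer concrete, here is a minimal sketch of how such a model could score a candidate string pair. It assumes one common formulation in which the pair is generated by a sequence of independent substitution, insertion, and deletion operations followed by a termination event. The abstract does not give the model details, so the function name `forward_probability`, the parameter tables `sub`, `ins`, `dele`, and `end`, and all toy probabilities are hypothetical illustrations, not the authors' implementation.

```python
def forward_probability(source, target, sub, ins, dele, end):
    """Sum the probabilities of all edit sequences generating (source, target).

    sub  : dict mapping (source_char, target_char) -> probability (assumed table)
    ins  : dict mapping target_char -> probability of emitting it alone
    dele : dict mapping source_char -> probability of consuming it alone
    end  : probability of the termination event
    Being memoryless, each operation's probability ignores earlier operations.
    """
    n, m = len(source), len(target)
    # alpha[i][j] = total probability of generating source[:i] and target[:j]
    alpha = [[0.0] * (m + 1) for _ in range(n + 1)]
    alpha[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0:  # deletion: consume one source character
                alpha[i][j] += alpha[i - 1][j] * dele.get(source[i - 1], 0.0)
            if j > 0:  # insertion: emit one target character
                alpha[i][j] += alpha[i][j - 1] * ins.get(target[j - 1], 0.0)
            if i > 0 and j > 0:  # substitution: consume one, emit one
                pair = (source[i - 1], target[j - 1])
                alpha[i][j] += alpha[i - 1][j - 1] * sub.get(pair, 0.0)
    return alpha[n][m] * end


# Toy parameters (made up for illustration only).
toy_sub = {("a", "x"): 0.3}
toy_ins = {"x": 0.1}
toy_dele = {"a": 0.1}
score = forward_probability("a", "x", toy_sub, toy_ins, toy_dele, end=0.5)
```

In a bootstrapping setup of the kind the abstract describes, scores like this could be used to rank candidate word pairs from the bitext, with the highest-scoring pairs fed back in as training data for the next round. That loop is not shown here because the abstract does not specify it.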

Similar resources

English-Arabic Transliteration

Proper nouns may be considered as the most important query words in information retrieval. If the two languages use the same alphabet, the same proper nouns can be found in either language. However, if the two languages use different alphabets, the names must be transliterated. Short vowels are not usually marked on the Arabic words in almost all Arabic documents (except very important document...

Noise-Aware Character Alignment for Bootstrapping Statistical Machine Transliteration from Bilingual Corpora

This paper proposes a novel noise-aware character alignment method for bootstrapping statistical machine transliteration from automatically extracted phrase pairs. The model is an extension of a Bayesian many-to-many alignment method for distinguishing nontransliteration (noise) parts in phrase pairs. It worked effectively in the experiments of bootstrapping Japanese-to-English statistical mach...

English to Indonesian Transliteration to Support English Pronunciation Practice

The work presented in this paper explores the use of Indonesian transliteration to support English pronunciation practice. It is mainly aimed at Indonesian speakers who have no or minimal English language skills. The approach implemented combines a rule-based and a statistical method. The rules of English-Phone-to-Indonesian-Grapheme mapping are implemented with a Finite State Transducer (FST)...

Transliteration System Using Pair HMM with Weighted FSTs

This paper presents a transliteration system based on pair Hidden Markov Model (pair HMM) training and Weighted Finite State Transducer (WFST) techniques. Parameters used by WFSTs for transliteration generation are learned from a pair HMM. Parameters from pair-HMM training on English-Russian data sets are found to give better transliteration quality than parameters trained for WFSTs for corresp...

Hindi Urdu Machine Transliteration using Finite-State Transducers

Finite-state transducers (FSTs) can implement inter-dialectal transliteration very efficiently. We illustrate this on the Hindi and Urdu language pair. FSTs can also be used for translation between surface-close languages. We introduce UIT (universal intermediate transcription) for the same pair on the basis of their common phonetic repository in such a way that it can be extended to other l...

Publication date: 2007