Arabic Language Modeling with Finite State Transducers
نویسنده
چکیده
In morphologically rich languages such as Arabic, the abundance of word forms resulting from increased morpheme combinations is significantly greater than for languages with fewer inflected forms (Kirchhoff et al., 2006). This exacerbates the out-of-vocabulary (OOV) problem. Test set words are more likely to be unknown, limiting the effectiveness of the model. The goal of this study is to use the regularities of Arabic inflectional morphology to reduce the OOV problem in that language. We hope that success in this task will result in a decrease in word error rate in Arabic automatic speech recognition.
منابع مشابه
Arabic Diacritization Using Weighted Finite-State Transducers
Arabic is usually written without short vowels and additional diacritics, which are nevertheless important for several applications. We present a novel algorithm for restoring these symbols, using a cascade of probabilistic finitestate transducers trained on the Arabic treebank, integrating a word-based language model, a letter-based language model, and an extremely simple morphological model. ...
متن کاملModeling Imperative String Operations with Transducers
We present a domain-specific imperative language, Bek, that directly models low-level string manipulation code featuring boolean state, search operations, and substring substitutions. We show constructively that Bek is reversible through a semantics-preserving translation to symbolic finite state transducers, a novel representation for transducers that annotates transitions with logical formula...
متن کاملNovel Probabilistic Finite-State Transducers for Cognate and Transliteration Modeling
We present and empirically compare a range of novel probabilistic finite-state transducer (PFST) models targeted at two major natural language string transduction tasks, transliteration selection and cognate translation selection. Evaluation is performed on 10 distinct language pair data sets, and in each case novel models consistently and substantially outperform a well-established standard re...
متن کاملExplicit Modeling of Phonological Changes in Finite-state Transducer Based Hungarian Lvcsr
This article describes the operation and the experimental evaluation of the pronunciation modeling component of the first Hungarian large vocabulary continuous speech recognition system. The proposed method is based on the implementation of context dependent rewrite rules by weighted finite state transducers (WFSTs). The proposed phonological model decreases the error rate by 8.32% relatively c...
متن کاملRational Kernels for Arabic Stemming and Text Classification
In this paper, we address the problems of Arabic Text Classification and stemming using Transducers and Rational Kernels. We introduce a new stemming technique based on the use of Arabic patterns (Pattern Based Stemmer). Patterns are modelled using transducers and stemming is done without depending on any dictionary. Using transducers for stemming, documents are transformed into finite state tr...
متن کامل