Finite State Automata and Arabic Writing
Abstract
Arabic writing has specific features that impose a computational overhead on any arabicized software. Finite state automata are well known to give efficient solutions to translation problems that can be formalized as regular languages. Such automata are all the easier to build when their alphabet has been reduced through careful linguistic analysis. This reduction makes it possible to write an automaton directly, without going through the intermediate stage of contextual rules, which would then have to be translated into an automaton for the sake of efficiency. This paper presents two Moore automata. The first, taken as an example, solves the choice of the right shape for a letter to be printed or displayed (usually known as contextual analysis); the second addresses the more complex problem of determining the right carrying letter for hamza. Every arabicized software has to face these questions, and finite state automata are certainly a good answer to them.

INTRODUCTION

Arabic writing has specific features that impose a computational overhead on any arabicized software. The first one, well known for many years now, is that Arabic printing tries to imitate handwriting. Because of this, consonants and long vowels can have four or only two shapes, depending on their ability to be bound to the following letter and on where they appear in the word.
These shapes can be very different, for example the letter ه² (h):
isolated ه, final ـه, medial ـهـ, initial هـ
or present only small variations, for example the letter س (s):
isolated س, final ـس, medial ـسـ, initial سـ
Letters which cannot be bound to the next one have only two shapes, for example the letters د (d) and و (w and ū):
isolated د, final ـد
isolated و, final ـو

During the seventies and the beginning of the eighties, heated controversies took place among the Arab linguists and computer scientists concerned with these questions. Finally, in 1983 the ASMO (Arab Society for Normalization, which unfortunately no longer exists), influenced by Pr. Lakhdar-Ghazal from IERA (Rabat, Morocco), chose to give a unique code to all shapes of a given letter. This is certainly a good choice from a linguistic point of view, but even so, compromises had to be made to take into account writing habits that conflicted with it. The letter hamza is the most noticeable example of such a compromise, for reasons we shall explain later.

¹ CERTAL: Centre d'Études et de Recherche en Traitement Automatique des Langues, INALCO: Institut National des Langues et Civilisations Orientales
² The Arabic parts of this paper have been typeset using Klaus Lagally's ArabTeX

1 CONTEXTUAL ANALYSIS

Whatever choice is made for coding, from a typesetting or a computational point of view there must be different codes for the different shapes of a letter. So every arabicized software has to use two coding systems: the reduced code we have just introduced, and the extended code in which the different shapes have different ...
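The passage from the reduced code to the extended code can be illustrated with a small Moore automaton. What follows is only a minimal sketch under stated assumptions, not the automaton of the paper: Unicode base letters stand in for the reduced code, Unicode Arabic Presentation Forms stand in for the extended code, the input alphabet is reduced to three hypothetical classes (`D` for letters that bind on both sides, `R` for letters that cannot be bound to the next one, `#` for anything else), and only the 28 basic letters are classified. Because the shape of a letter depends on the letter that follows it, each state outputs the shape of the previously read letter.

```python
import unicodedata

# Assumption: a partial classification covering only the 28 basic letters.
_DUAL_NAMES = ("BEH TEH THEH JEEM HAH KHAH SEEN SHEEN SAD DAD TAH ZAH "
               "AIN GHAIN FEH QAF KAF LAM MEEM NOON HEH YEH").split()
_RIGHT_NAMES = "ALEF DAL THAL REH ZAIN WAW".split()
DUAL = {unicodedata.lookup("ARABIC LETTER " + n) for n in _DUAL_NAMES}
RIGHT_ONLY = {unicodedata.lookup("ARABIC LETTER " + n) for n in _RIGHT_NAMES}


def classify(ch):
    """Reduce the input alphabet to three classes: D, R, or # (not a letter)."""
    if ch in DUAL:
        return "D"
    if ch in RIGHT_ONLY:
        return "R"
    return "#"


# A state is (class of the pending letter, is it bound to the previous one,
# output). The output belongs to the state, which is what makes this a Moore
# machine; it is the shape of the letter read one step earlier.
START = ("#", False, None)


def step(state, c):
    """Transition function: read class c, decide the shape of the pending letter."""
    pending, joined, _ = state
    if pending == "#":
        return (c, False, None)
    joins_next = pending == "D" and c != "#"
    if joined:
        out = "medial" if joins_next else "final"
    else:
        out = "initial" if joins_next else "isolated"
    return (c, joins_next, out)


def shapes(word):
    """Return one shape name per letter of `word`, in logical order."""
    state, result = START, []
    for ch in list(word) + ["\0"]:  # "\0" is a hypothetical end marker
        state = step(state, classify(ch))
        if state[2] is not None:
            result.append(state[2])
    return result


def to_extended(word):
    """Map reduced-code letters to presentation forms (the extended-code stand-in)."""
    letters = [ch for ch in word if classify(ch) != "#"]
    return "".join(
        unicodedata.lookup(f"{unicodedata.name(ch)} {shape.upper()} FORM")
        for ch, shape in zip(letters, shapes(word))
    )
```

For instance, `shapes` applied to the three letters s, h, m yields initial, medial, final, while a word made only of letters that cannot be bound to the next one, such as d, w, r, yields three isolated shapes. Keeping the alphabet down to three classes is what keeps the state set small enough to write by hand.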
Similar resources
Reduction of Computational Complexity in Finite State Automata Explosion of Networked System Diagnosis (RESEARCH NOTE)
This research puts forward rough finite state automata which have been represented by two variants of BDD called ROBDD and ZBDD. The proposed structures have been used in networked system diagnosis and can overcome combinatorial explosion. In the implementation, the CUDD (Colorado University Decision Diagrams) package is used. A mathematical proof of the claimed complexity is provided which shows ZBDD ...
Arabic Morphology Using Only Finite-State Operations
Finite-state morphology has been successful in the description and computational implementation of a wide variety of natural languages. However, the particular challenges of Arabic, and the limitations of some implementations of finite-state morphology, have led many researchers to believe that finite-state power was not sufficient to handle Arabic and other Semitic morphology. This paper illus...
Creating and Weighting Hunspell Dictionaries as Finite-State Automata
There are numerous formats for writing spell-checkers for open-source systems and there are many lexical descriptions for natural languages written in these formats. In this paper, we demonstrate a method for converting Hunspell and related spell-checking lexicons into finite-state automata. We also present a simple way to apply unigram corpus training in order to improve the spellchecking sugg...
Odin's Runes: A Rule Language for Information Extraction
Odin is an information extraction framework that applies cascades of finite state automata over both surface text and syntactic dependency graphs. Support for syntactic patterns allows us to concisely define relations that are otherwise difficult to express in languages such as Common Pattern Specification Language (CPSL), which are currently limited to shallow linguistic features. The interacti...
Applying Finite State Morphology to Conversion Between Roman and Perso-Arabic Writing Systems
This paper presents a method for converting back and forth between the Perso-Arabic and Romanized writing systems for Persian. Given a word in one writing system, we use finite state transducers to generate a morphological analysis of the word, which is subsequently used to regenerate the orthography of the word in the other writing system. The system has been implemented in XFST and LEXC.