Parsing Arabic Dialects
Abstract
The Arabic language is a collection of spoken dialects with important phonological, morphological, lexical, and syntactic differences, along with a standard written language, Modern Standard Arabic (MSA). Since the spoken dialects are not officially written, it is very costly to obtain adequate corpora to use for training dialect NLP tools such as parsers. In this paper, we address the problem of parsing transcribed spoken Levantine Arabic (LA). We do not assume the existence of any annotated LA corpus (except for development and testing), nor of an LA-MSA parallel corpus. Instead, we use explicit knowledge about the relation between LA and MSA.
Similar Resources
Arabic Dialect Processing Tutorial
The existence of dialects for any language constitutes a challenge for NLP in general since it adds another set of variation dimensions from a known standard. The problem is particularly interesting and challenging in Arabic and its different dialects, where the diversion from the standard could, in some linguistic views, warrant a classification as different languages. This problem would not b...
COLING • ACL 2006 TAG+8
This paper discusses a novel probabilistic synchronous TAG formalism, synchronous Tree Substitution Grammar with sister adjunction (TSG+SA). We use it to parse a language for which there is no training data, by leveraging off a second, related language for which there is abundant training data. The grammar for the resource-rich side is automatically extracted from a treebank; the grammar on the...
Automatic Transliteration of Judeo-Arabic Texts into Arabic Script
The Judeo-Arabic languages comprise a set of dialects spoken and written by Jewish communities living in Arab countries, mainly during the Middle Ages. Judeo-Arabic is typically written in Hebrew letters, enriched with various diacritic marks. The Judeo-Arabic spoken and written by any particular Jewish community is similar to the Arabic dialect used by their local Muslim community. In additi...
Automatically building a Tunisian Lexicon for Deverbal Nouns
The sociolinguistic situation in Arabic countries is characterized by diglossia (Ferguson, 1959): whereas one variant, Modern Standard Arabic (MSA), is highly codified and mainly used for written communication, other variants (dialects) coexist in regular everyday situations. Similarly, while a number of resources and tools exist for MSA (lexica, annotated corpora, taggers, parsers...), ve...
The Hidden TAG Model: Synchronous Grammars for Parsing Resource-Poor Languages
This paper discusses a novel probabilistic synchronous TAG formalism, synchronous Tree Substitution Grammar with sister adjunction (TSG+SA). We use it to parse a language for which there is no training data, by leveraging off a second, related language for which there is abundant training data. The grammar for the resource-rich side is automatically extracted from a treebank; the grammar on the...