A Multi-Dialect, Multi-Genre Corpus of Informal Written Arabic
نویسندگان
چکیده
This paper presents a multi-dialect, multi-genre, human annotated corpus of dialectal Arabic with data obtained from both online newspaper commentary and Twitter. Most Arabic corpora are small and focus on Modern Standard Arabic (MSA). There has been recent interest, however, in the construction of dialectal Arabic corpora (Zaidan and Callison-Burch, 2011a; Al-Sabbagh and Girju, 2012). This work differs from previously constructed corpora in two ways. First, we include coverage of five dialects of Arabic: Egyptian, Gulf, Levantine, Maghrebi and Iraqi. This is the most complete coverage of any dialectal corpus known to the authors. In addition to data, we provide results for the Arabic dialect identification task that outperform those reported in Zaidan and Callison-Burch (2011a).
منابع مشابه
Topic Modeling of Phonetic Latin-Spelled Arabic for the Relative Analysis of Genre-Dependent and Dialect-Dependent Variation
We demonstrate a data collection and analysis system that can be used to analyze the relative contributions of dialect dependent variation in the lexical of speech-like Arabic text. We utilize Latent Dirichlet Allocation (LDA), a generative Probabilistic modeling method, to analyze a phonetic Latin Spelled Arabic online chat corpus. The corpus produces different word choices and word relations ...
متن کاملTowards Developing a Multi-Dialect Morphological Analyser for Arabic
In this paper we address the problem of the analysis of multi-dialect Arabic morphology. Our method involves based on the synthesis of two methods. The first method is linguistic based, using an adopted Modern Standard Arabic (MSA) Morphology Analyser to first deal with dialect prefixes and suffixes and then analyse the words. This method improves accuracy of dialect words by 69%. The second me...
متن کاملYouDACC: the Youtube Dialectal Arabic Comment Corpus
In the Arab world, while Modern Standard Arabic is commonly used in formal written context, on sites like Youtube, people are increasingly using Dialectal Arabic, the language for everyday use to comment on a video and interact with the community. These user-contributed comments along with the video and user attributes, offer a rich source of multi-dialectal Arabic sentences and expressions fro...
متن کاملSentiment Analysis of Tunisian Dialects: Linguistic Ressources and Experiments
Dialectal Arabic (DA) is significantly different from the Arabic language taught in schools and used in written communication and formal speech (broadcast news, religion, politics, etc.). There are many existing researches in the field of Arabic language Sentiment Analysis (SA); however, they are generally restricted to Modern Standard Arabic (MSA) or some dialects of economic or political inte...
متن کاملLarge Multi-lingual, Multi-level and Multi-genre Annotation Corpus
High accuracy for automated translation and information retrieval calls for linguistic annotations at various language levels. The plethora of informal internet content sparked the demand for porting state-of-art natural language processing (NLP) applications to new social media as well as diverse language adaptation. Effort launched by the BOLT (Broad Operational Language Translation) program ...
متن کامل