A Multi-Dialect, Multi-Genre Corpus of Informal Written Arabic

Authors

  • Ryan Cotterell
  • Chris Callison-Burch
Abstract

This paper presents a multi-dialect, multi-genre, human-annotated corpus of dialectal Arabic, with data obtained from both online newspaper commentary and Twitter. Most Arabic corpora are small and focus on Modern Standard Arabic (MSA). There has been recent interest, however, in the construction of dialectal Arabic corpora (Zaidan and Callison-Burch, 2011a; Al-Sabbagh and Girju, 2012). This work differs from previously constructed corpora in two ways. First, we include coverage of five dialects of Arabic: Egyptian, Gulf, Levantine, Maghrebi and Iraqi. This is the most complete coverage of any dialectal corpus known to the authors. In addition to data, we provide results for the Arabic dialect identification task that outperform those reported in Zaidan and Callison-Burch (2011a).

Similar resources

Topic Modeling of Phonetic Latin-Spelled Arabic for the Relative Analysis of Genre-Dependent and Dialect-Dependent Variation

We demonstrate a data collection and analysis system that can be used to analyze the relative contributions of dialect-dependent variation in the lexicon of speech-like Arabic text. We utilize Latent Dirichlet Allocation (LDA), a generative probabilistic modeling method, to analyze a phonetic Latin-spelled Arabic online chat corpus. The corpus produces different word choices and word relations ...

Towards Developing a Multi-Dialect Morphological Analyser for Arabic

In this paper we address the problem of the analysis of multi-dialect Arabic morphology. Our method is based on the synthesis of two methods. The first method is linguistically based, using an adopted Modern Standard Arabic (MSA) Morphology Analyser to first deal with dialect prefixes and suffixes and then analyse the words. This method improves accuracy of dialect words by 69%. The second me...

YouDACC: the Youtube Dialectal Arabic Comment Corpus

In the Arab world, while Modern Standard Arabic is commonly used in formal written contexts, on sites like YouTube people are increasingly using Dialectal Arabic, the language of everyday use, to comment on videos and interact with the community. These user-contributed comments, along with the video and user attributes, offer a rich source of multi-dialectal Arabic sentences and expressions fro...

Sentiment Analysis of Tunisian Dialects: Linguistic Ressources and Experiments

Dialectal Arabic (DA) is significantly different from the Arabic language taught in schools and used in written communication and formal speech (broadcast news, religion, politics, etc.). There is much existing research in the field of Arabic-language Sentiment Analysis (SA); however, it is generally restricted to Modern Standard Arabic (MSA) or some dialects of economic or political inte...

Large Multi-lingual, Multi-level and Multi-genre Annotation Corpus

High accuracy for automated translation and information retrieval calls for linguistic annotations at various language levels. The plethora of informal internet content sparked the demand for porting state-of-the-art natural language processing (NLP) applications to new social media as well as diverse language adaptation. Effort launched by the BOLT (Broad Operational Language Translation) program ...

Publication date: 2014