Tweet Conversation Annotation Tool with a Focus on an Arabic Dialect, Moroccan Darija
Abstract
This paper presents the DATOOL, a graphical tool for annotating conversations consisting of short messages (i.e., tweets), and the results obtained by using it to annotate tweets in Darija, a historically unwritten Arabic dialect that is spoken by millions but is not taught in schools and lacks standardization and linguistic resources. With the DATOOL, a native Darija speaker annotated hundreds of mixed-language and mixed-script conversations at approximately 250 tweets per hour. The resulting corpus was used in developing and evaluating Arabic dialect classifiers described briefly herein. The DATOOL supports downstream discourse analysis of tweeted “conversations” by mapping extracted relations, such as who tweets to whom in which language, into graph markup formats for analysis in network visualization tools.
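The mapping from annotated relations to graph markup can be illustrated with a minimal sketch. It is not the DATOOL's own export code: the abstract does not name a specific markup format, so GraphML is assumed here as one format that network visualization tools commonly read, and the record layout, user names, and the function `to_graphml` are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical annotated tweets: (author, addressee, language label).
# The real DATOOL annotation schema is not given in the abstract.
ANNOTATED = [
    ("user_a", "user_b", "darija"),
    ("user_b", "user_a", "french"),
    ("user_a", "user_b", "darija"),
]

def to_graphml(records):
    """Return a GraphML string with one directed edge per
    (author, addressee, language) triple, weighted by tweet count."""
    counts = Counter(records)
    root = ET.Element("graphml", xmlns="http://graphml.graphdrawing.org/xmlns")
    # Declare the two edge attributes: language label and number of tweets.
    ET.SubElement(root, "key", {"id": "lang", "for": "edge",
                                "attr.name": "language", "attr.type": "string"})
    ET.SubElement(root, "key", {"id": "n", "for": "edge",
                                "attr.name": "tweets", "attr.type": "int"})
    graph = ET.SubElement(root, "graph", id="conversation", edgedefault="directed")
    # One node per user who appears as author or addressee.
    users = sorted({u for src, dst, _ in counts for u in (src, dst)})
    for user in users:
        ET.SubElement(graph, "node", id=user)
    # One edge per distinct "who tweets to whom in which language" relation.
    for (src, dst, lang), n in sorted(counts.items()):
        edge = ET.SubElement(graph, "edge", source=src, target=dst)
        ET.SubElement(edge, "data", key="lang").text = lang
        ET.SubElement(edge, "data", key="n").text = str(n)
    return ET.tostring(root, encoding="unicode")

if __name__ == "__main__":
    print(to_graphml(ANNOTATED))
```

Under these assumptions, the string returned by `to_graphml` could be saved to a `.graphml` file and opened in a network visualization tool, where edges could be colored by the `language` attribute.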
Similar resources
Finding Romanized Arabic Dialect in Code-Mixed Tweets
Recent computational work on Arabic dialect identification has focused primarily on building and annotating corpora written in Arabic script. Arabic dialects however also appear written in Roman script, especially in social media. This paper describes our recent work developing tweet corpora and a token-level classifier that identifies a romanized Arabic dialect and distinguishes it from French...
An Arabic-Moroccan Darija Code-Switched Corpus
In multilingual communities, speakers often switch between languages or dialects within the same context. This phenomenon is called code-switching. It can be observed, e.g., in the Arab world, where Modern Standard Arabic and Dialectal Arabic coexist. Recently, the computational treatment of code-switching has received attention. Just as other natural language processing tasks, this task requir...
Syllable structure in spoken Arabic: a comparative investigation
The aim of this study is to demonstrate that rhythm variation across Arabic dialects is to a great extent correlated with the different types of syllabic structure observed in these dialects, especially with regard to the relative complexity of onsets and codas. The main focus is on the relationship between syllabic structures on the one hand, and rhythm classes based on segmental duration on t...
Morphologically Annotated Corpora and Morphological Analyzers for Moroccan and Sanaani Yemeni Arabic
We present new language resources for Moroccan and Sanaani Yemeni Arabic. The resources include corpora for each dialect which have been morphologically annotated, and morphological analyzers for each dialect which are derived from these corpora. These are the first sets of resources for Moroccan and Yemeni Arabic. The resources will be made available to the public.
F0 Alignment Patterns in Arabic Dialects
A comparison of F0 alignment values was carried out for three Arabic dialects (Moroccan Arabic, Kuwaiti Arabic and Yemeni Arabic) using five speakers from each dialect. Clear differences found in alignment enable separation of Moroccan Arabic from the two other dialects: a) values of the F0 valley differed significantly, with Moroccan Arabic showing a later synchronisation than Kuwaiti Arabic a...