Cross-Dialectal Arabic Processing

نویسندگان

  • Salima Harrat
  • Karima Meftouh
  • Mourad Abbas
  • Salma Jamoussi
  • Motaz Saad
  • Kamel Smaïli
چکیده

We present, in this paper an Arabic multi-dialect study including dialects from both the Maghreb and the Middle-east that we compare to the Modern Standard Arabic (MSA). Three dialects from Maghreb are concerned by this study: two from Algeria and one from Tunisia and two dialects from Middle-east (Syria and Palestine). The resources which have been built from scratch have lead to a collection of a multi-dialect parallel resource. Furthermore, this collection has been aligned by hand with a MSA corpus. We conducted several analytical studies in order to understand the relationship between these vernacular languages. For this, we studied the closeness between all the pairs of dialects and MSA in terms of Hellinger distance. We also performed an experiment of dialect identification. This experiment showed that neighbouring dialects as expected tend to be confused, making difficult their identification. Because the Arabic dialects are different from one region to another which make the communication between people difficult, we conducted cross-lingual machine translation between all the pairs of dialects and also with MSA. Several interesting conclusions have been carried out from this experiment.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Cross-Dialectal Data Transferring for Gaussian Mixture Model Training in Arabic Speech Recognition

Dialectal Arabic speech recognition is a difficult problem and is relatively less studied. In this paper, we propose a cross-dialectal Gaussian mixture model training criteria to transfer knowledge from one domain to the other by data sharing. Specifically, phone classification experiments on West Point Modern Standard Arabic Speech corpus and Babylon Levantine Arabic Speech corpus demonstrate ...

متن کامل

Creating Resources for Dialectal Arabic from a Single Annotation: A Case Study on Egyptian and Levantine

Arabic dialects present a special problem for natural language processing because there are few Arabic dialect resources, they have no standard orthography, and they have not been studied much. However, as more and more written dialectal Arabic is found on social media, natural language processing for Arabic dialects has become an important goal. We present a methodology for creating a morpholo...

متن کامل

POS Tagging of Dialectal Arabic: A Minimally Supervised Approach

Natural language processing technology for the dialects of Arabic is still in its infancy, due to the problem of obtaining large amounts of text data for spoken Arabic. In this paper we describe the development of a part-of-speech (POS) tagger for Egyptian Colloquial Arabic. We adopt a minimally supervised approach that only requires raw text data from several varieties of Arabic and a morpholo...

متن کامل

COLABA: Arabic Dialect Annotation and Processing

In this paper, we describe COLABA, a large effort to create resources and processing tools for Dialectal Arabic Blogs. We describe the objectives of the project, the process flow and the interaction between the different components. We briefly describe the manual annotation effort and the resources created. Finally, we sketch how these resources and tools are put together to create DIRA, a term...

متن کامل

DALILA: The Dialectal Arabic Linguistic Learning Assistant

Dialectal Arabic (DA) poses serious challenges for Natural Language Processing (NLP). The number and sophistication of tools and datasets in DA are very limited in comparison to Modern Standard Arabic (MSA) and other languages. MSA tools do not effectively model DA which makes the direct use of MSA NLP tools for handling dialects impractical. This is particularly a challenge for the creation of...

متن کامل

Crowdsource a little to label a lot: labeling a speech corpus of dialectal Arabic

Arabic is a language with great dialectal variety, with Modern Standard Arabic (MSA) being the only standardized dialect. Spoken Arabic is characterized by frequent code-switching between MSA and Dialectal Arabic (DA). DA varieties are typically differentiated by region, but despite their wide-spread usage, they are under-resourced and lack viable corpora and tools necessary for speech recognit...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2015