Advances in Dialectal Arabic Speech Recognition: A Study Using Twitter to Improve Egyptian ASR

نویسندگان

  • Ahmed Ali
  • Hamdy Mubarak
  • Stephan Vogel
چکیده

This paper reports results in building an Egyptian Arabic speech recognition system as an example for under-resourced languages. We investigated different approaches to build the system using 10 hours for training the acoustic model, and results for both grapheme system and phoneme system using MADA. The phoneme-based system shows better results than the grapheme-based system. In this paper, we explore the use of tweets written in dialectal Arabic. Using 880K Egyptian tweets reduced the Out Of Vocabulary (OOV) rate from 15.1% to 3.2% and the WER from 59.6% to 44.7%, a relative gain 25% in WER.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Collecting Data for Automatic Speech Recognition Systems in Dialectal Arabic Using Games with a Purpose

Building Automatic Speech Recognition (ASR) systems for spoken languages usually suffer from the problem of limited available transcriptions. Automatic Speech Recognition (ASR) systems require large speech corpora that contain speech and their corresponding transcriptions for training acoustic models. In this paper, we target the Egyptian dialectal Arabic. As other spoken languages, it is mainl...

متن کامل

Multi-Reference Evaluation for Dialectal Speech Recognition System: A Study for Egyptian ASR

Dialectal Arabic has no standard orthographic representation. This creates a challenge when evaluating an Automatic Speech Recognition (ASR) system for dialect. Since the reference transcription text can vary widely from one user to another, we propose an innovative approach for evaluating dialectal speech recognition using Multi-References. For each recognized speech segments, we ask five diff...

متن کامل

Best Practices for Crowdsourcing Dialectal Arabic Speech Transcription

In this paper, we investigate different approaches in crowdsourcing transcriptions of Dialectal Arabic speech with automatic quality control to ensure good transcription at the source. Since Dialectal Arabic has no standard orthographic representation, it is very challenging to perform quality control. We propose a complete recipe for speech transcription quality control that includes using out...

متن کامل

Crowdsource a little to label a lot: labeling a speech corpus of dialectal Arabic

Arabic is a language with great dialectal variety, with Modern Standard Arabic (MSA) being the only standardized dialect. Spoken Arabic is characterized by frequent code-switching between MSA and Dialectal Arabic (DA). DA varieties are typically differentiated by region, but despite their wide-spread usage, they are under-resourced and lack viable corpora and tools necessary for speech recognit...

متن کامل

Analysis of Dialectal Influence in Pan-Arabic ASR

In this paper, we analyze the impact of five Arabic dialects on the front-end and pronunciation dictionary components of an Automatic Speech Recognition (ASR) system. We use ASR’s phonetic decision tree as a diagnostic tool to compare the robustness of MFCC and MLP front-ends to dialectal variations in the speech data and found that MLP Bottle-Neck features are less robust to such variations. W...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2015