Open-Source Boundary-Annotated Corpus for Arabic Speech and Language Processing
نویسندگان
چکیده
A boundary-annotated and part-of-speech tagged corpus is a prerequisite for developing phrase break classifiers. Boundary annotations in English speech corpora are descriptive, delimiting intonation units perceived by the listener. We take a novel approach to phrase break prediction for Arabic, deriving our prosodic annotation scheme from Tajwīd (recitation) mark-up in the Qur‟an which we then interpret as additional text-based data for computational analysis. This mark-up is prescriptive, and signifies a widely-used recitation style, and one of seven original styles of transmission. Here we report on version 1.0 of our Boundary-Annotated Qur‟an dataset of 77430 words and 8230 sentences, where each word is tagged with prosodic and syntactic information at two coarse-grained levels. In (Sawalha et al., 2012), we use the dataset in phrase break prediction experiments. This research is part of a larger-scale project to produce annotation schemes, language resources, algorithms, and applications for Classical and Modern Standard Arabic.
منابع مشابه
روشی جدید جهت استخراج موجودیتهای اسمی در عربی کلاسیک
In Natural Language Processing (NLP) studies, developing resources and tools makes a contribution to extension and effectiveness of researches in each language. In recent years, Arabic Named Entity Recognition (ANER) has been considered by NLP researchers due to a significant impact on improving other NLP tasks such as Machine translation, Information retrieval, question answering, query result...
متن کاملText Analytics and Transcription Technology for Quranic Arabic
Natural Language Processing Working Together with Arabic and Islamic Studies is a 2-year project funded by the UK Engineering and Physical Sciences Research Council (EPSRC) to study prosodicsyntactic mark-up in the Quran (Atwell et al 2013). Tajwīd or correct Quranic recitation is very important in Islam. The original insight informing this project is to view tajwīd mark-up in the Quran as addi...
متن کاملA Corpus of Spontaneous Speech in Lectures: The KIT Lecture Corpus for Spoken Language Processing and Translation
With the increasing number of applications handling spontaneous speech, the needs to process spoken languages become stronger. Speech disfluency is one of the most challenging tasks to deal with in automatic speech processing. As most applications are trained with well-formed, written texts, many issues arise when processing spontaneous speech due to its distinctive characteristics. Therefore, ...
متن کاملBuilding Annotated Written and Spoken Arabic LRs in NEMLAR Project
The NEMLAR project: Network for Euro-Mediterranean LAnguage Resource and human language technology development and support; (www.nemlar.org) is a project supported by the EC with partners from Europe and the Middle East; whose objective is to build a network of specialized partners to promote and support the development of Arabic Language Resources in the Mediterranean region. The project focus...
متن کاملROOTS: a toolkit for easy, fast and consistent processing of large sequential annotated data collections
The development of new methods for given speech and natural language processing tasks usually consists in annotating large corpora of data before applying machine learning techniques to train models or to extract information. Beyond scientific aspects, creating and managing such annotated data sets is a recurrent problem. While using human annotators is obviously expensive in time and money, re...
متن کامل