Building A Modern Standard Arabic Corpus
Abstract
Language Engineering, including Information Retrieval, Machine Translation and other Natural Language-related disciplines, has shown growing interest in the Arabic language in recent years. Suitable resources for Arabic are becoming a vital necessity for the progress of this research. Until recently, only two Arabic corpora were commonly available to researchers: the AFP Arabic newswire from LDC and the AlHayat newspaper collection from the European Language Resources Distribution Agency. A suitable corpus, however, is essential for any objective research. In this paper we present the results of experiments in building a corpus for Modern Standard Arabic using data available on the World Wide Web. We selected samples of newspapers published online in different Arabic countries. The selection was driven mainly by the amount of data available. We demonstrate the completeness and the representativeness of this corpus using standard metrics and show its suitability for Language Engineering experiments.
Similar resources
A New Method for Extracting Named Entities in Classical Arabic
In Natural Language Processing (NLP) studies, developing resources and tools contributes to the breadth and effectiveness of research in each language. In recent years, Arabic Named Entity Recognition (ANER) has drawn the attention of NLP researchers because of its significant impact on other NLP tasks such as machine translation, information retrieval, question answering, query result...
Language Variation as a Context for Information Retrieval
Speakers of widespread languages may encounter problems in information retrieval and document understanding when they access documents in the same language from another country. The work described here focuses on the development of resources to support improved document retrieval and understanding by users of Modern Standard Arabic (MSA). The lexicon of an Egyptian Arabic speaker and the lexico...
A Lexical Database for Modern Standard Arabic Interoperable with a Finite State Morphological Transducer
Current Arabic lexicons, whether computational or otherwise, make no distinction between entries from Modern Standard Arabic (MSA) and Classical Arabic (CA), and tend to include obsolete words that are not attested in current usage. We address this problem by building a large-scale, corpus-based lexical database that is representative of MSA. We use an MSA corpus of 1,089,111,204 words, a pre-a...
A study of a non-resourced language: an Algerian dialect
The objective of this paper is to present an under-resourced language related to Arabic. In several countries throughout the Arab world, no one speaks Modern Standard Arabic. People speak a language that is inspired by Arabic but can differ considerably from Modern Standard Arabic, which is reserved for official broadcast news, official speeches and so on. Th...
Building Annotated Written and Spoken Arabic LRs in NEMLAR Project
The NEMLAR project (Network for Euro-Mediterranean LAnguage Resource and human language technology development and support; www.nemlar.org) is supported by the EC, with partners from Europe and the Middle East. Its objective is to build a network of specialized partners to promote and support the development of Arabic Language Resources in the Mediterranean region. The project focus...