Assessment of a Significant Arabic Corpus

Authors

  • Abduelbaset Goweder
  • Anne De Roeck
Abstract

The development of Language Engineering and Information Retrieval applications for Arabic requires the availability of sizeable, reliable corpora of modern Arabic text. These are not routinely available. This paper describes how we constructed an 18.5-million-word corpus from Al-Hayat newspaper text, with articles tagged as belonging to one of 7 domains. We outline the profile of the data and how we assessed its representativeness. The literature suggests that the statistical profile of Arabic text is significantly different from that of English in ways that might affect the applicability of standard techniques. The corpus allowed us to verify a collection of experiments which had, so far, only been conducted on small, manually collected datasets. We draw some comparisons with English and conclude that there is evidence that Arabic data is much sparser than English for the same data size.

Similar resources

A New Method for Extracting Named Entities in Classical Arabic

In Natural Language Processing (NLP) studies, developing resources and tools contributes to the extension and effectiveness of research in each language. In recent years, Arabic Named Entity Recognition (ANER) has attracted the attention of NLP researchers because of its significant impact on improving other NLP tasks such as machine translation, information retrieval, question answering, query result...

Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents

Besides its own merits, text classification (TC) has become a cornerstone in many applications. The work presented here is part of, and a prerequisite for, a project we have undertaken to create a corpus for Arabic text processing. It is an attempt to create modules automatically that would help speed up the process of classification for any text categorization task. It also serves as a tool for...

Optical Coherence Tomography and Corpus Callosum Index in Cognitive Assessment of Multiple Sclerosis Patients

Background: Multiple Sclerosis (MS) is a neurodegenerative disease of the central nervous system. Different approaches have been developed to study MS progression and cognitive dysfunction as the major symptom of the disease. The current study compared Optical Coherence Tomography (OCT) and Corpus Callosum Index (CCI) for the early evaluation of cognitive dysfunction in MS patients. Objectives: T...

Assessment of epididymal sperm obtained from dromedary camel

Testicles were isolated from dromedary camels in a local slaughterhouse during the breeding and non-breeding seasons. Sperm were recovered from different parts of the epididymis (caput, corpus and cauda), stained separately on glass slides using the eosin-nigrosin staining method, dried with a hair dryer, and carried to the laboratory. In the lab, slides were observed for evaluation of the proportion of ...

Quality Assessment of General and Categorized Arabic Text Corpora

Many Natural Language Processing and Information Retrieval methods are based on the extensive use of text corpora. The credibility of the results can be heavily influenced by the underlying corpus quality. Much research has used Arabic corpora in various tasks of Arabic Information Processing. In this paper we discuss a suite of metrics that can be used to ascertain the quality of A...

Publication date: 2001