Constructing An Automatic Lexicon for Arabic Language
Abstract
In this paper, we design and implement a system for automatically building a lexicon for the Arabic language. Our Arabic lexicon contains word-specific information. This includes morphological information such as the root (stem) of the word, its pattern, and its affixes; the part-of-speech tag of the word, which classifies it as a noun, verb, or particle; and lexical attributes such as gender, number, person, case, definiteness, aspect, and mood, which are also extracted and stored with the word in the lexicon. A lexicon is a collection of representations for words, used by a natural language processor as a source of word-specific information; a representation may contain information about the morphology, phonology, syntactic argument structure, and semantics of the word. A good lexicon is badly needed for many natural language applications, such as parsing, text generation, and noun phrase and verb phrase construction. Our system uses many rules based on the grammar of the Arabic language to identify the part-of-speech tag and the related lexical attributes of the word [13]. We tested our system on vowelized and non-vowelized Arabic text documents taken from the Holy Qur'an and on 242 Arabic abstracts chosen randomly from the proceedings of the Saudi Arabian National Computer Conference, and we achieved an accuracy of about 96%. We discuss the factors behind the remaining errors and how this accuracy rate can be improved.
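The kind of entry described in the abstract can be pictured with a minimal Python sketch. This is an illustration only, not the authors' schema: the names `LexiconEntry` and `Lexicon`, the field names, and the attribute values are all hypothetical, chosen to mirror the information the abstract lists (root, pattern, affixes, part-of-speech tag, and lexical attributes).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# The three part-of-speech classes named in the abstract.
POS_TAGS = {"noun", "verb", "particle"}


@dataclass
class LexiconEntry:
    """One analysis of a surface word (hypothetical layout, not the paper's)."""
    word: str                      # surface form as it appears in text
    root: str                      # root (stem) of the word
    pattern: str                   # morphological pattern
    pos: str                       # "noun", "verb", or "particle"
    prefixes: List[str] = field(default_factory=list)
    suffixes: List[str] = field(default_factory=list)
    # Lexical attributes; None means "not applicable or not determined".
    gender: Optional[str] = None
    number: Optional[str] = None
    person: Optional[str] = None
    case: Optional[str] = None
    definiteness: Optional[str] = None
    aspect: Optional[str] = None
    mood: Optional[str] = None

    def __post_init__(self) -> None:
        if self.pos not in POS_TAGS:
            raise ValueError(f"unknown part-of-speech tag: {self.pos!r}")


class Lexicon:
    """Maps a surface word to every analysis stored for it."""

    def __init__(self) -> None:
        self._entries: Dict[str, List[LexiconEntry]] = {}

    def add(self, entry: LexiconEntry) -> None:
        self._entries.setdefault(entry.word, []).append(entry)

    def lookup(self, word: str) -> List[LexiconEntry]:
        # Return a copy so callers cannot mutate the stored list.
        return list(self._entries.get(word, []))


# Usage: the definite noun "al-kitab" ("the book"), root k-t-b.
lexicon = Lexicon()
lexicon.add(LexiconEntry(
    word="الكتاب", root="كتب", pattern="فعال", pos="noun",
    prefixes=["ال"], gender="masculine", number="singular",
    definiteness="definite",
))
```

A word maps to a list of entries rather than a single one because non-vowelized Arabic text is often ambiguous, so one surface form may have several analyses; that choice is an assumption of this sketch rather than something the abstract states.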
Similar resources
A Supervised Method for Constructing Sentiment Lexicon in Persian Language
Due to the increasing growth of digital content on the internet and social media, sentiment analysis is one of the emerging fields. This problem, which deals with information extraction and knowledge discovery from textual data using natural language processing, has attracted the attention of many researchers. Construction of a sentiment lexicon as a valuable language resource is one of the imp...
Constructing and Using Broad-coverage Lexical Resource for Enhancing Morphological Analysis of Arabic
Broad-coverage language resources which provide prior linguistic knowledge must improve the accuracy and the performance of NLP applications. We are constructing a broad-coverage lexical resource to improve the accuracy of morphological analyzers and part-of-speech taggers of Arabic text. Over the past 1200 years, many different kinds of Arabic language lexicons were constructed; these lexicons...
Clitics in Arabic Language: A Statistical Study
Clitics in the Arabic language can be attached to a stem or to each other without orthographic marks such as an apostrophe. In this paper we present a statistical study of clitics and their effect in the Arabic language. We tokenize a large Arabic text using white-spaces and an automatic clitics tokenizer (AMIRA 2.0) and compare the unique-word count in both cases with that of English. We also show the re...
Building the Valency Lexicon of Arabic Verbs
This paper describes the building of a valency lexicon of Arabic verbs using a morphologically and syntactically annotated corpus, the Prague Arabic Dependency Treebank, as its primary source. We present the theoretical account on valency developed within the Functional Generative Description theory. We apply the framework to Arabic and discuss various valency-related phenomena with respect to ...
Collaboratively Constructed Linguistic Resources for Language Variants and their Exploitation in NLP Application - the case of Tunisian Arabic and the Social Media
Modern Standard Arabic (MSA) is the formal language in most Arabic countries. Arabic Dialects (AD), the daily language, differ from MSA, especially in social media communication. However, most Arabic social media texts have mixed forms and many variations, especially between MSA and AD. This paper aims to bridge the gap between MSA and AD by providing a framework for the translation of texts of soc...