Arabic text classification

TAT: An Author Profiling Tool with Application to Arabic Emails

2007

Dominique Estival Tanja Gaustad Son Bao Pham Will Radford Ben Hutchinson

This paper reports on the application of the Text Attribution Tool (TAT) to profiling the authors of Arabic emails. The TAT system has been developed for the purpose of language-independent author profiling and has now been trained on two email corpora, English and Arabic. We describe the overall TAT system and the Machine Learning experiments resulting in classifiers for the different author t...


Correction Annotation for Non-Native Arabic Texts: Guidelines and Corpus

2015

Wajdi Zaghouani Nizar Habash Houda Bouamor Alla Rozovskaya Behrang Mohit Abeer Heider Kemal Oflazer

We present our correction annotation guidelines to create a manually corrected non-native (L2) Arabic corpus. We develop our approach by extending an L1 large-scale Arabic corpus and its manual corrections to include manually corrected non-native Arabic learner essays. Our overarching goal is to use the annotated corpus to develop components for automatic detection and correction of language er...


A Study of Text Preprocessing Tools for Arabic Text Categorization

2009

Dina A. Said Nayer M. Wanas Nevin M. Darwish Nadia H. Hegazy

Text preprocessing is an essential stage in text categorization (TC) particularly and text mining generally. Morphological tools can be used in text preprocessing to reduce multiple forms of the word to one form. There has been a debate among researchers about the benefits of using morphological tools in TC. Studies in the English language illustrated that performing stemming during the preproc...


Benchmarking Strategy for Arabic Screen-Rendered Word Recognition

2012

Fouad Slimane Slim Kanoun Jean Hennebert Rolf Ingold Adel M. Alimi

This chapter presents a new benchmarking strategy for Arabic screen-based word recognition. Firstly, we report on the creation of the new APTI (Arabic Printed Text Image) database. This database is a large-scale benchmarking of open-vocabulary, multi-font, multi-size and multi-style word recognition systems in Arabic. Such systems take as input a text image and compute as output a character stri...


Language Classification and Segmentation of Noisy Documents in Hebrew Scripts

2012

Alex Zhicharevich Nachum Dershowitz

Language classification is a preliminary step for most natural-language related processes. The significant quantity of multilingual documents poses a problem for traditional language-classification schemes and requires segmentation of the document to monolingual sections. This phenomenon is characteristic of classical and medieval Jewish literature, which frequently mixes Hebrew, Aramaic, Judeo...


An automated Arabic text categorization based on the frequency ratio accumulation

Journal: Int. Arab J. Inf. Technol. 2014

Baraa T. Sharef Nazlia Omar Zeyad T. Sharef

Compared to other languages, there is still a limited body of research which has been conducted for the automated Arabic Text Categorization (TC) due to the complex and rich nature of the Arabic language. Most of such research includes supervised Machine Learning (ML) approaches such as Naïve Bayes (NB), K-Nearest Neighbour (KNN), Support Vector Machine and Decision Tree. Most of these techniqu...


An Improved Arabic Word's Roots Extraction Method Using n-Gram Technique

Journal: JCS 2014

Nidal Yousef Aymen M. Abu-Errub Ashraf Odeh Hayel Khafajeh

The Arabic language is distinguished by its morphological richness, which forces those working in Arabic language processing (e.g., information retrieval, document classification, text summarization) to deal with many words that seem to be different but in reality come from an identical root word. One of the methods to overcome this problem is to return the words to their roots. Thi...


Automatic Tagging of Arabic Text: From Raw Text to Base Phrase Chunks

2004

Mona Diab Kadri Hacioglu Daniel Jurafsky

To date, there are no fully automated systems addressing the community’s need for fundamental language processing tools for Arabic text. In this paper, we present a Support Vector Machine (SVM) based approach to automatically tokenize (segmenting off clitics), part-of-speech (POS) tag and annotate base phrases (BPs) in Arabic text. We adapt highly accurate tools that have been developed for Engl...


Towards a Suitable Rhetorical Representation for Arabic Text Summarization

2005

Waleed Al-Sanie Ameur Touir Hassan Mathkour

Text summarization based on rhetorical structure theory has shown extremely interesting results. The process of extracting the text summary from the result of the rhetorical parser is not a singleton. Different rhetorical structure trees are generated from one text. Unfortunately, the result of the generated summary is not equivalent for those trees, and the correctness of the result is affected...


Arabizi Detection and Conversion to Arabic

2014

Kareem Darwish

Arabizi is Arabic text that is written using Latin characters. Arabizi is used to present either Modern Standard Arabic (MSA) or Arabic dialects. It is commonly used in informal settings such as social networking sites and is often mixed with English. In this paper we address the problems of identifying Arabizi in text and converting it to Arabic characters. We used word and sequence-level ...
