Adaptive models of Arabic text
Abstract
The main aim of this thesis is to build adaptive language models of Arabic text that achieve better compression performance than existing models. Prediction by partial matching (PPM) language models have been the best-performing adaptive language models over the past three decades in terms of compression performance. To achieve such performance for Arabic text, the rich morphological nature of the Arabic language should be taken into consideration. In this thesis, two new Arabic language resources are introduced for understanding the nature of the Arabic language and standardizing experiments on Arabic text. The first is a new corpus, the Bangor Arabic Compression Corpus (BACC), for standardizing compression experiments and creating a benchmark corpus for future compression experiments on Arabic text. The second is a new corpus, the Bangor Balanced Corpus of Contemporary Arabic (BBCCA). The purpose of this corpus is to mirror the balanced corpora that are available for the English language (Brown and LOB), but for the Arabic language. Two new adaptive models, BS-PPM and CS-PPM, based on the PPM compression scheme, are then introduced to improve the compression performance of the standard PPM model by using preprocessing techniques. The first model works by replacing the most
Similar resources
High capacity steganography tool for Arabic text using 'Kashida'
Steganography is the ability to hide secret information in cover media such as sound, pictures and text. A new approach is proposed to hide a secret in Arabic text cover media using "Kashida", an Arabic extension character. The proposed approach is an attempt to maximize the use of "Kashida" to hide more information in Arabic text cover media. To approach this, some algorithms have been des...
Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents
Besides its own merits, text classification (TC) has become a cornerstone in many applications. The work presented here is part of, and a prerequisite for, a project we have undertaken to create a corpus for the Arabic text process. It is an attempt to create modules automatically that would help speed up the process of classification for any text categorization task. It also serves as a tool for...
Off-line Arabic Handwritten Recognition Using a Novel Hybrid HMM-DNN Model
To facilitate the entry of data into the computer and its digitization, automatic recognition of printed texts and manuscripts is a considerable aid to many applications. Research on automatic document recognition started decades ago with the recognition of isolated digits and letters, and today, due to advancements in machine learning methods, efforts are being made to iden...
Using Machine Learning Algorithms for Automatic Cyber Bullying Detection in Arabic Social Media
Social media allows people to interact and express their thoughts or feelings about different subjects. However, some users may write offensive tweets to others via social media, which is known as cyber bullying. Successful prevention depends on automatically detecting malicious messages. Automatic detection of bullying in the text of social media by analyzing the text "tweets" via one of the machine l...
An Adaptive Algorithm for the Automatic Segmentation of Printed Arabic Text
Character segmentation is a crucial step in most Arabic optical text recognition systems. The recognition process depends mainly on the accuracy of the character segmentation. This paper presents a novel adaptive algorithm for the off-line segmentation of printed Arabic text. There are many challenging features in Arabic writing; for example, it is cursive and characters in a word can take ...