Adaptive models of Arabic text

Author

  • Khaled M. Alhawiti
Abstract

The main aim of this thesis is to build adaptive language models of Arabic text that achieve better compression performance than existing models. Prediction by partial matching (PPM) language models have outperformed other adaptive language models in compression performance over the past three decades. To obtain such performance for Arabic text, the rich morphology of the Arabic language should be taken into consideration. In this thesis, two new Arabic language resources are introduced for understanding the nature of the language and for standardizing experiments on Arabic text. The first is a new corpus, the Bangor Arabic Compression Corpus (BACC), intended to standardize compression experiments and to serve as a benchmark corpus for future compression experiments on Arabic text. The second is a new corpus, the Bangor Balanced Corpus of Contemporary Arabic (BBCCA), which mirrors the balanced corpora available for the English language (Brown and LOB) but comprises Arabic text instead. Two new adaptive models, BS-PPM and CS-PPM, based on the PPM compression scheme, are then introduced to improve the compression performance of the standard PPM model by using preprocessing techniques. The first model works by replacing the most...
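To make the idea of an adaptive model and its compression performance concrete, here is a minimal Python sketch. It is not PPM, and it is not the BS-PPM or CS-PPM model of the thesis: it is a much simpler adaptive order-k context model over UTF-8 bytes with add-one smoothing, with no escape mechanism and no preprocessing. The function names `adaptive_code_length_bits` and `bits_per_character` are hypothetical, chosen only for this illustration. The model updates its counts after every symbol, so the ideal code length it reports is what an arithmetic coder driven by the same model would approach.

```python
import math
from collections import defaultdict


def adaptive_code_length_bits(data: bytes, order: int = 2) -> float:
    """Ideal code length in bits of `data` under an adaptive order-`order` byte model.

    Illustrative assumption: add-one (Laplace) smoothing over all 256 byte
    values, instead of the escape mechanism that real PPM uses.
    """
    counts = defaultdict(lambda: defaultdict(int))  # context -> symbol -> count
    totals = defaultdict(int)                       # context -> symbols seen so far
    bits = 0.0
    for i, sym in enumerate(data):
        ctx = data[max(0, i - order):i]             # up to `order` preceding bytes
        # Probability is estimated only from what has been seen so far.
        p = (counts[ctx][sym] + 1) / (totals[ctx] + 256)
        bits += -math.log2(p)
        # Adaptive step: update the model after coding the symbol.
        counts[ctx][sym] += 1
        totals[ctx] += 1
    return bits


def bits_per_character(text: str, order: int = 2) -> float:
    """Code length divided by the number of characters (not bytes) in `text`."""
    if not text:
        return 0.0
    return adaptive_code_length_bits(text.encode("utf-8"), order) / len(text)
```

Calling `bits_per_character(sample, order=2)` on an Arabic sample and on an English sample of similar size gives a rough feel for how the encoding affects a byte-oriented model: Arabic letters occupy two bytes each in UTF-8, so a byte-level model has to predict two symbols per letter.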

Similar resources

High capacity steganography tool for Arabic text using 'Kashida'

Steganography is the ability to hide secret information in cover media such as sound, pictures and text. A new approach is proposed to hide a secret in Arabic text cover media using "Kashida", an Arabic extension character. The proposed approach is an attempt to maximize the use of "Kashida" to hide more information in Arabic text cover media. To achieve this, some algorithms have been des...


Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents

Besides its own merits, text classification (TC) has become a cornerstone in many applications. The work presented here is part of, and a prerequisite for, a project we have undertaken to create a corpus for Arabic text processing. It is an attempt to create modules automatically that would help speed up the process of classification for any text categorization task. It also serves as a tool for...


Off-line Arabic Handwritten Recognition Using a Novel Hybrid HMM-DNN Model

To facilitate the entry of data into the computer and its digitization, automatic recognition of printed texts and manuscripts is a considerable aid to many applications. Research on automatic document recognition started decades ago with the recognition of isolated digits and letters, and today, due to advancements in machine learning methods, efforts are being made to iden...


Using Machine Learning Algorithms for Automatic Cyber Bullying Detection in Arabic Social Media

Social media allows people to interact and express their thoughts or feelings about different subjects. However, some users may write offensive tweets to others via social media, which is known as cyber bullying. Successful prevention depends on automatically detecting malicious messages. Automatic detection of bullying in the text of social media by analyzing the text of tweets via one of the machine l...


An Adaptive Algorithm for the Automatic Segmentation of Printed Arabic Text

Character segmentation is a crucial step in most Arabic optical text recognition systems. The recognition process depends mainly on the accuracy of the character segmentation. This paper presents a novel adaptive algorithm for the off-line segmentation of printed Arabic text. Arabic writing has many challenging features; for example, it is cursive and characters in a word can take ...


Publication date: 2014