Arabic text classification using k-nearest neighbour algorithm
نویسندگان
چکیده
Many algorithms have been implemented to the problem of Automatic Text Categorization (ATC). Most of the work in this area has been carried out on English texts, with only a few researchers addressing Arabic texts. We have investigated the use of the K-Nearest Neighbour (K-NN) classifier, with an Inew, cosine, jaccard and dice similarities, in order to enhance Arabic ATC. We represent the dataset as un-stemmed and stemmed data; with the use of TREC-2002, in order to remove prefixes and suffixes. However, for statistical text representation, Bag-Of-Words (BOW) and character-level 3 (3-Gram) were used. In order to, reduce the dimensionality of feature space; we used several feature selection methods. Experiments conducted with Arabic text showed that the K-NN classifier, with the new method similarity Inew 92.6% Macro-F1, had better performance than the K-NN classifier with cosine, jaccard and dice similarities. Chi-square feature selection, with representation by BOW, led to the best performance over other feature selection methods using BOW and 3-Gram.
منابع مشابه
An Improved K-Nearest Neighbor with Crow Search Algorithm for Feature Selection in Text Documents Classification
The Internet provides easy access to a kind of library resources. However, classification of documents from a large amount of data is still an issue and demands time and energy to find certain documents. Classification of similar documents in specific classes of data can reduce the time for searching the required data, particularly text documents. This is further facilitated by using Artificial...
متن کاملAn Improved K-Nearest Neighbor with Crow Search Algorithm for Feature Selection in Text Documents Classification
The Internet provides easy access to a kind of library resources. However, classification of documents from a large amount of data is still an issue and demands time and energy to find certain documents. Classification of similar documents in specific classes of data can reduce the time for searching the required data, particularly text documents. This is further facilitated by using Artificial...
متن کاملNatural Language Text Classification and Filtering with Trigrams and Evolutionary Nearest Neighbour Classifiers
N grams o er fast language independent multi-class text categorization. Text is reduced in a single pass to ngram vectors. These are assigned to one of several classes by a) nearest neighbour (KNN) and b) genetic algorithm operating on weights in a nearest neighbour classi er. 91% accuracy is found on binary classi cation on short multi-author technical English documents. This falls if more cat...
متن کاملTEXT CATEGORIZATION Building a kNN classifier for the Reuters-21578 collection
Categorization of texts into topical categories has gained booming interest over the past few years. There is a growing need for tools that help in finding, filtering and managing the highdimensional data due to the rapid growth of online information. Building a text classifier by hand is time consuming and costly and hence automated text categorization has gained a lot of importance. A general...
متن کاملDynamic & Attribute Weighted KNN for Document Classification Using Bootstrap Sampling
Although publicly accessible databases containing speech documents. It requires a great deal of time and effort required to keep them up to date is often burdensome. In an effort to help identify speaker of speech if text is available, text-mining tools, from the machine learning discipline, it can be applied to help in this process also. Here, we describe and evaluate document classification a...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- Int. Arab J. Inf. Technol.
دوره 12 شماره
صفحات -
تاریخ انتشار 2015