Arabic Part of speech Tagging using k-Nearest Neighbour and Naive Bayes Classifiers Combination

Authors

  • Rund Mahafdah
  • Nazlia Omar
  • Omaia Al-Omari
Abstract

Part-of-speech (POS) tagging is an important preprocessing step in many natural language processing applications, such as text summarization, question answering and information retrieval. It is the process of assigning every word in a given context to its appropriate part of speech. Many POS tagging techniques have been developed and evaluated in the literature. However, some POS tagging models are known to perform poorly on Quranic Arabic because of the complexity of the text. This complexity presents several challenges for POS tagging, such as high ambiguity, data sparseness and a large number of unknown words. The main problem addressed here is therefore to determine how existing, efficient methods perform on Arabic and how the Quranic corpus can be used to build an efficient framework for Arabic POS tagging. We propose an experimental classifier-combination framework for Arabic POS tagging that selects two diverse probabilistic classifiers that have performed well in numerous works on non-Arabic languages, namely k-Nearest Neighbour (KNN) and Naive Bayes (NB). Majority voting is used as the combination strategy to exploit the strengths of both classifiers. In addition, an in-depth study was conducted on a large list of features to identify effective ones and to investigate their role in improving POS tagging performance for Quranic Arabic. The study thus aims to integrate different feature sets and tagging algorithms into a more accurate POS tagging procedure. The data used in this study is the Arabic Quranic Corpus, an annotated linguistic resource consisting of 77,430 words with Arabic grammar, syntax and morphology for each word in the Holy Quran. The highest accuracy achieved is 98.32%, which may represent a significant improvement over the state of the art for Arabic Quranic text. The most effective features, which yield this accuracy, are a combination of w0 (the current word), p0 (POS of the current word), p-3 (POS of the word three positions before), p-2 (POS of the word two positions before) and p-1 (POS of the preceding word).
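As a rough illustration of the feature set and the combination step described in the abstract, the following Python sketch builds the five named features (w0, p0, p-1, p-2, p-3) for one token and combines classifier outputs by majority voting. It is not the authors' implementation. The padding symbol, the function names, the toy words and tags, and the tie-breaking rule (ties go to the first-listed classifier) are all assumptions made for this example. The KNN and NB classifiers themselves are not implemented; their predictions are stood in for by plain strings.

```python
from collections import Counter

# Hypothetical padding symbol for context positions before the sentence start.
PAD = "<PAD>"


def extract_features(words, tags, i):
    """Build the feature dict named in the abstract for the token at index i.

    words and tags are parallel lists for one sentence.
    """
    def tag_at(j):
        # Positions before the start of the sentence get the padding symbol.
        return tags[j] if j >= 0 else PAD

    return {
        "w0": words[i],        # the current word
        "p0": tags[i],         # POS of the current word
        "p-1": tag_at(i - 1),  # POS of the preceding word
        "p-2": tag_at(i - 2),  # POS of the word two positions before
        "p-3": tag_at(i - 3),  # POS of the word three positions before
    }


def majority_vote(predictions):
    """Return the most frequent label in a non-empty list of predictions.

    Assumption: on a tie, the label from the earliest-listed classifier wins.
    The abstract does not say how ties between two classifiers are resolved.
    """
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


if __name__ == "__main__":
    # Toy sentence with placeholder words and tags (not taken from the corpus).
    words = ["w_a", "w_b", "w_c", "w_d"]
    tags = ["N", "V", "P", "N"]
    print(extract_features(words, tags, 1))

    # Stand-ins for the outputs of a KNN tagger and an NB tagger on one token.
    knn_prediction, nb_prediction = "V", "V"
    print(majority_vote([knn_prediction, nb_prediction]))
```

With only two voters, every disagreement is a tie, so the tie-breaking rule effectively decides which classifier is trusted when they differ. A real system would need to choose that rule deliberately, for example by using classifier confidence.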


Related articles

Comparison of Classification Methods: Peril to Avoid for Binary and Multi Propose Combination Approach

ABSTRACT: Classification plays an important role in various fields such as object recognition, text categorization, etc. Studying classifiers for purpose of estimating probability for a ce is crucial for classification. In this paper, we present a survey of four k Nearest Neighbour, Naive Bayes and Neural Network focusing on their merits and demerits. We will also shed light on combination of the ab...


Ensembles of nearest neighbour classifiers and serial analysis of gene expression

In this paper, we present experimental results obtained with ensembles of nearest neighbour classifiers on the binary classification problem of cancer classification using serial analysis of gene expression (SAGE) data. Nearest neighbours are selected as classifiers since they were rarely employed in building ensembles because their predictions are stable to small perturbations of data, which...


Twitter Sentiment Analysis: Lexicon Method, Machine Learning Method and Their Combination

This paper presents a step-by-step methodology for Twitter sentiment analysis. Two approaches are tested to measure variations in public opinion about retail brands. The first, a lexicon-based method, uses a dictionary of words with semantic scores assigned to them to calculate the final polarity of a tweet, and incorporates part-of-speech tagging. The second, a machine learning approach, tackl...


Scaling up the Accuracy of K-nearest-neighbour Classifiers: a Naive-bayes Hybrid

k-nearest-neighbour (KNN) has been widely used as an effective classification model. In this paper, we summarize three main shortcomings confronting KNN and then single out three categories of approaches for overcoming them. After reviewing some algorithms in each category, we present a hybrid algorithm called dynamic k-nearest-neighbour naive Bayes with attribute weigh...


Performance Evaluation of Multistage Classifier

Ensembles of classifiers have been among the most researched methods in pattern classification in recent years. It is well known that multiple phases of evaluation provide more accuracy. In this paper we propose a multistage classifier approach that applies three supervised classifiers to classification in pattern recognition. The three classifiers are Multilayer Perceptron (MLP), K-Near...



Journal title:
  • JCS

Volume 10

Publication date 2014