Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents
Abstract:
Beyond its own merits, text classification (TC) has become a cornerstone of many applications. The work presented here is part of, and a prerequisite for, a project we have undertaken to create a corpus for Arabic text processing. It is an attempt to automatically create modules that would help speed up classification for any text categorization task. It also serves as a tool for creating Arabic text corpora. In particular, we create a text classification process for Arabic news articles downloaded from web news portals and sites. The suggested procedure is a pilot project that uses a set of documents that humans have pre-assigned to subjects or categories. Vectorized Term Frequency-Inverse Document Frequency (TF-IDF) processing was used for the initial verification of the categories. The resulting validated categories were then used to predict categories for new documents. The experiment used 1000 initial documents pre-assigned to five categories of 200 documents each. An initial set of 2195 documents was downloaded from a number of Arabic news sources. They were pre-processed and used to test the utility of the suggested classification procedure, with cosine similarity as the classifier. The results were encouraging, with satisfactory precision, recall, and F1-score. The authors intend to improve the procedure and to use it for Arabic corpora creation.
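The seed-document idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the whitespace tokenizer, the smoothed IDF formula, the use of one summed TF-IDF vector (centroid) per category, and the toy seed texts are all assumptions made for illustration, and the function names (`build_centroids`, `classify`) are hypothetical.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Assumption: plain whitespace tokens. Real Arabic preprocessing would
    # likely add normalization, stop-word removal, or stemming.
    return text.split()

def build_idf(token_lists):
    # Document frequency: number of seed documents containing each term.
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    # Smoothed IDF (an assumed variant) so every weight is positive.
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0
            for t, df in doc_freq.items()}

def tfidf_vector(tokens, idf):
    # Sparse TF-IDF vector; terms unseen in the seed set are dropped.
    counts = Counter(t for t in tokens if t in idf)
    return {t: c * idf[t] for t, c in counts.items()}

def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def build_centroids(seed_docs):
    # seed_docs: list of (category, text) pairs labeled by humans.
    token_lists = [tokenize(text) for _, text in seed_docs]
    idf = build_idf(token_lists)
    centroids = defaultdict(lambda: defaultdict(float))
    for (category, _), tokens in zip(seed_docs, token_lists):
        for term, weight in tfidf_vector(tokens, idf).items():
            # Summing is enough: cosine similarity ignores vector length.
            centroids[category][term] += weight
    return idf, {cat: dict(vec) for cat, vec in centroids.items()}

def classify(text, idf, centroids):
    # Predict the category whose centroid is most similar to the document.
    vec = tfidf_vector(tokenize(text), idf)
    scores = {cat: cosine(vec, c) for cat, c in centroids.items()}
    return max(scores, key=scores.get), scores

# Toy seed documents, invented for illustration only.
SEEDS = [
    ("sports", "مباراة كرة القدم الفريق"),
    ("sports", "الفريق فاز في مباراة"),
    ("economy", "البنك السوق الاقتصاد"),
    ("economy", "أسعار السوق ارتفعت"),
]

if __name__ == "__main__":
    idf, centroids = build_centroids(SEEDS)
    label, scores = classify("الفريق خسر مباراة", idf, centroids)
    print(label, scores)
```

In a setting like the one described, the 1000 pre-assigned documents would play the role of `SEEDS`, and each downloaded article would be passed to `classify`. Comparing against every seed document individually instead of a per-category centroid is an equally plausible reading of the abstract.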
Related resources
Statistical Classification Methods for Arabic News Articles
In this paper, we present experimental results on document clustering and classification achieved on the Arabic NEWSWIRE corpus using statistical methods. Arabic is a highly inflecting language. The methods presented here prove to be very robust and reliable without morphological analysis.
Event Based Emotion Classification for News Articles
Reading news articles can trigger emotional reactions in readers. But compared to other genres of text, news articles, which are mainly used to report events, lack emotion-linked words and other features for emotion classification. In this paper, we propose an event anchor based method for emotion classification of news articles. Firstly, we build an emotion linked news corpus through c...
Classification of News Web Documents Based on Structural Features
The motivation for this work comes from the need for a Thai web corpus for testing our information retrieval algorithm. Two collections of news web documents are gathered from two different Thai newspaper web sites. Our goal is to find a simple yet effective method to extract news articles from these web collections. We explore the use of machine learning methods to distinguish article pages from...
Indexing and classification of TV news articles based on speech dictation using word bigram
In order to construct a news database with a video on demand (VOD) function, news articles must be classified into topics. In this paper, we propose a method to automatically index and classify TV news articles into 10 topics based on speech dictation techniques using speaker-independent triphone HMMs and a word bigram.
News Articles Classification Using Random Forests and Weighted Multimodal Features
This research investigates the problem of news articles classification. The classification is performed using N-gram textual features extracted from text and visual features generated from one representative image. The application domain is news articles written in English that belong to four categories: Business-Finance, Lifestyle-Leisure, Science-Technology and Sports downloaded from three we...
Arabic documents classification using fuzzy R.B.F. classifier with sliding window
In this paper, we propose a system for contextual and semantic Arabic document classification that improves the standard fuzzy model. Specifically, we promote neighboring semantic terms, which seem absent in this model, by using radial basis modeling in order to identify the documents relevant to the query. This approach calculates the similarity between related terms by determining the relevance of...
Volume 5, Issue 2
Pages 117-128
Publication date: 2019-05-01