Text Categorization with Class-Based and Corpus-Based Keyword Selection

Authors

  • Arzucan Özgür
  • Levent Özgür
  • Tunga Güngör
Abstract

In this paper, we examine the use of keywords in text categorization with SVM. Contrary to the common belief, we show that using keywords instead of all words yields better performance in terms of both accuracy and time. Unlike previous studies, which focus on keyword selection metrics, we compare two approaches to keyword selection. In the corpus-based approach, a single set of keywords is selected for all classes. In the class-based approach, a distinct set of keywords is selected for each class. We perform the experiments on the standard Reuters-21578 dataset, with both boolean and tf-idf weighting. Our results show that although tf-idf weighting performs better, boolean weighting can be used where time and space resources are limited. The corpus-based approach with 2000 keywords performs best. However, for a small number of keywords, the class-based approach outperforms the corpus-based approach with the same number of keywords.
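The difference between the two selection approaches, and between the two weighting schemes, can be illustrated with a minimal Python sketch. The toy documents, the function names, and the scoring rule (summed tf-idf per term) are assumptions made for illustration only; the abstract does not state which selection metric the authors used, and the SVM training step is omitted.

```python
import math
from collections import Counter

# Toy corpus of (tokens, class label) pairs. Illustrative data, not Reuters-21578.
DOCS = [
    (["wheat", "grain", "export", "price"], "grain"),
    (["grain", "harvest", "wheat", "price"], "grain"),
    (["oil", "crude", "barrel", "price"], "crude"),
    (["crude", "oil", "opec", "price"], "crude"),
]

def idf(docs):
    """Inverse document frequency of every term, computed over the whole corpus."""
    df = Counter()
    for tokens, _ in docs:
        df.update(set(tokens))
    return {term: math.log(len(docs) / count) for term, count in df.items()}

def score_terms(docs, weights):
    """Assumed scoring rule: sum of tf * idf over the given documents."""
    scores = Counter()
    for tokens, _ in docs:
        for term, tf in Counter(tokens).items():
            scores[term] += tf * weights[term]
    return scores

def top_k(scores, k):
    """The k highest-scoring terms; ties are broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]

def corpus_based_keywords(docs, k):
    """One keyword set shared by all classes."""
    return top_k(score_terms(docs, idf(docs)), k)

def class_based_keywords(docs, k):
    """A separate keyword set for each class, scored on that class's documents."""
    weights = idf(docs)
    keywords = {}
    for label in sorted({label for _, label in docs}):
        in_class = [doc for doc in docs if doc[1] == label]
        keywords[label] = top_k(score_terms(in_class, weights), k)
    return keywords

def vectorize(tokens, keywords, weights=None):
    """Boolean vector if weights is None, otherwise a tf-idf vector."""
    tf = Counter(tokens)
    if weights is None:
        return [1 if tf[term] > 0 else 0 for term in keywords]
    return [tf[term] * weights.get(term, 0.0) for term in keywords]
```

In this sketch, a term that occurs in every document (here "price") gets an idf of zero and is never preferred over a more discriminative term. With the class-based variant, each class would be paired with its own keyword list, for example when training one binary classifier per class.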


Similar resources

Improving the Operation of Text Categorization Systems with Selecting Proper Features Based on PSO-LA

With the explosive growth in the amount of information, tools and methods are needed to search, filter and manage resources. One of the major problems in text classification relates to the high-dimensional feature space. Therefore, the main goal of text classification is to reduce the dimensionality of the feature space. There are many feature selection methods. However...


Comparison of text feature selection policies and using an adaptive framework

DOI: http://dx.doi.org/10.1016/j.eswa.2013.02.019. Text categorization is the task of automatically assigning unlabeled text documents to some predefined category labels by means of an induction algorithm. Since the data in text categoriz...


A Technique for Proper Feature Selection with Automated Text Categorization in the Vector Space Model

Efficient and effective text categorization and information retrieval techniques play a major role in managing the ever-increasing amount of data and textual information available in digital form. Text categorization has important applications such as information retrieval, bad information identification, and document and web resource filtering. Before the application of various...


Classification of Skewed and Homogenous Document Corpora with Class-Based and Corpus-Based Keywords

In this paper, we examine the performance of the two policies for keyword selection over standard document corpora of varying properties. While in corpus-based policy a single set of keywords is selected for all classes globally, in class-based policy a distinct set of keywords is selected for each class locally. We use SVM as the learning method and perform experiments with boolean and tf-idf ...


A feature selection approach based on term distributions

Feature selection has a direct impact on text categorization. Most existing algorithms work at the document level and do not consider the influence of term frequency on text categorization. To address this, we put forward a feature selection approach, FSATD, based on term distributions. In our proposed algorithm, three critical factors which are term frequency, the inter-c...




Publication date: 2005