Support Vector Machines for Text Categorization
Authors
Abstract
Text categorization is the process of sorting text documents into one or more predefined categories or classes of similar documents. Differences in the results of such categorization arise from the feature set chosen as the basis for associating a given document with a given category. Advocates of text categorization recognize that sorting text documents into categories of like documents reduces the overhead required for fast retrieval of such documents and provides smaller domains in which users may explore similar documents. In this paper we examine whether automatic classification of news texts can be improved by prefiltering the vocabulary to reduce the feature set used in the computations. First, we compare artificial neural network and support vector machine algorithms for use as text classifiers of news items. Second, we identify a reduction in the feature set that provides improved results.
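As a rough illustration of what prefiltering the vocabulary can look like, the sketch below drops terms that occur in too few or too many documents before building bag-of-words vectors. The abstract does not say which filter the paper uses, so the document-frequency criterion, the function names (`tokenize`, `build_vocabulary`, `vectorize`), the thresholds (`min_df`, `max_df_ratio`), and the toy news snippets are all assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(docs, min_df=2, max_df_ratio=0.9):
    """Map each retained term to a column index.

    Assumed filter: keep a term only if it occurs in at least `min_df`
    documents and in at most `max_df_ratio` of all documents.
    """
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))  # count each term once per document
    n_docs = len(docs)
    kept = sorted(
        term
        for term, count in doc_freq.items()
        if count >= min_df and count / n_docs <= max_df_ratio
    )
    return {term: index for index, term in enumerate(kept)}


def vectorize(doc, vocab):
    """Turn a document into a term-count vector over the reduced vocabulary."""
    vector = [0] * len(vocab)
    for token in tokenize(doc):
        index = vocab.get(token)
        if index is not None:
            vector[index] += 1
    return vector


# Hypothetical toy corpus of news-like snippets.
docs = [
    "Stocks rise as the market rallies",
    "The market falls on weak earnings",
    "The team wins the championship game",
    "Injury forces the team out of the game",
]
vocab = build_vocabulary(docs)
features = vectorize("The team wins the game", vocab)
```

On this toy corpus the filter discards both the terms that appear in only one document and the term "the", which appears in every document, leaving a three-term feature set. The resulting vectors could then be passed to any classifier, such as a support vector machine or a neural network.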
Similar resources
Universität Dortmund, Fachbereich Informatik, Lehrstuhl VIII Künstliche Intelligenz: Text Categorization with Support Vector Machines: Learning with Many Relevant Features
This paper explores the use of Support Vector Machines (SVMs) for learning text classifiers from examples. It analyzes the particular properties of learning with text data and identifies why SVMs are appropriate for this task. Empirical results support the theoretical findings. SVMs achieve substantial improvements over the currently best performing methods and they behave robustly over a variety o...
Text Categorization and Support Vector Machines
Text categorization is used to automatically assign previously unseen documents to a predefined set of categories. This paper gives a short introduction to text categorization (TC) and describes the most important tasks of a text categorization system. It also focuses on Support Vector Machines (SVMs), the most popular machine learning algorithm used for TC, and gives some justification why ...
Support Vector Machines for Text Categorization Based on Latent Semantic Indexing
Text Categorization (TC) is an important component in many information organization and information management tasks. Two key issues in TC are feature coding and classifier design. This paper describes an approach to text categorization that applies Support Vector Machines (SVMs) on top of Latent Semantic Indexing (LSI). Latent Semantic Indexing [1][2] is a method for selecting informative subspaces of featu...
Using Bag-of-Concepts to Improve the Performance of Support Vector Machines in Text Categorization
This paper investigates the use of concept-based representations for text categorization. We introduce a new approach to create concept-based text representations, and apply it to a standard text categorization collection. The representations are used as input to a Support Vector Machine classifier, and the results show that there are certain categories for which concept-based representations co...
Effect of small sample size on text categorization with support vector machines
Datasets that answer difficult clinical questions are expensive in part due to the need for medical expertise and patient informed consent. We investigate the effect of small sample size on the performance of a text categorization algorithm. We show how to determine whether the dataset is large enough to train support vector machines. Since it is not possible to cover all aspects of sample size...