Automatic Text Categorization: Case Study
Abstract
Text categorization is the process of classifying documents into one or more existing categories [1] according to the themes or concepts present in their contents. Its most common application is document indexing in Information Retrieval Systems (IRS) [2]. Organizing texts into categories allows users to narrow the target of a search submitted to an IRS, to explore the collection, and to find information relevant to their needs even with little knowledge of the keywords of a theme. One way to make text categorization viable is to automate it with machine-learning algorithms, so that it can be carried out quickly, concisely, and on a large scale. The objective of this work is to present and compare the results of text categorization experiments using two types of artificial neural networks, the Multilayer Perceptron (MLP) [3] and Self-Organizing Maps (SOM) [6], and traditional machine-learning algorithms used for this task [4]: the C4.5 decision tree, PART decision rules, and the Naive Bayes classifier. The experiments were carried out on three text collections: K1 [5], PubsFinder [4], and a subcollection of the Reuters-21578 collection called the Metals Collection [1]. Comparing the best performance of each algorithm, measured as classification error on the test set of each collection, the experimental results show that artificial neural networks are good classifiers for text categorization problems. In general, the MLP networks stood out as the best classifiers, and the SOM networks performed better than the symbolic machine-learning algorithms. The classification error obtained by SOM was less than twice the lowest error obtained by the other classifiers for each collection. Thus, SOM networks can be used as an auxiliary tool for manual text categorization, as well as a way to explore a text collection, using as the initial interface the generated map labeled with the most frequent category in each neuron.
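The map-labeling step and the error measure mentioned in the abstract can be illustrated with a minimal sketch. It assumes a SOM has already been trained and that each document has been assigned to its best-matching neuron; the function names, the neuron indices, and the category names ("gold", "copper") are hypothetical and are not taken from the original experiments.

```python
from collections import Counter, defaultdict


def label_neurons(neuron_ids, categories):
    """Label each neuron with the most frequent category among the
    training documents mapped to it (majority vote)."""
    votes = defaultdict(Counter)
    for neuron, category in zip(neuron_ids, categories):
        votes[neuron][category] += 1
    return {neuron: counts.most_common(1)[0][0] for neuron, counts in votes.items()}


def classification_error(neuron_labels, neuron_ids, categories):
    """Fraction of documents whose neuron label differs from the true category.
    A document mapped to a neuron with no label counts as an error."""
    if not categories:
        raise ValueError("at least one document is required")
    wrong = sum(
        1
        for neuron, category in zip(neuron_ids, categories)
        if neuron_labels.get(neuron) != category
    )
    return wrong / len(categories)


# Hypothetical toy data: the neuron index each document was mapped to,
# and the true category of each document.
train_neurons = [0, 0, 0, 1, 1, 2]
train_categories = ["gold", "gold", "copper", "copper", "copper", "gold"]
labels = label_neurons(train_neurons, train_categories)

test_neurons = [0, 1, 2, 3]
test_categories = ["gold", "gold", "gold", "copper"]
error = classification_error(labels, test_neurons, test_categories)
print(labels, error)
```

The same `classification_error` measure could be applied to the predictions of any of the other classifiers, which is what makes a per-collection comparison on the test set possible.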
Similar resources
Automatic Text Categorization and Its Application to Text
We develop an automatic text categorization approach and investigate its application to text retrieval. The categorization approach is derived from a combination of a learning paradigm known as instance-based learning and an advanced document retrieval technique known as retrieval feedback. We demonstrate the effectiveness of our categorization approach using two real-world document collections f...
Automatic Text Categorization and Its Application to Text Retrieval
We develop an automatic text categorization approach and investigate its application to text retrieval. The categorization approach is derived from a combination of a learning paradigm known as instance-based learning and an advanced document retrieval technique known as retrieval feedback. We demonstrate the effectiveness of our categorization approach using two real-world document collections...
Automatic Generation of Background Text to Aid Classification
We illustrate that Web searches can often be utilized to generate background text for use with text classification. This is the case because there are frequently many pages on the World Wide Web that are relevant to particular text classification tasks. We show that an automatic method of creation of a secondary corpus of unlabeled but related documents can help decrease error rates in text cat...
Cross-Lingual Text Categorization
This article deals with the problem of Cross-Lingual Text Categorization (CLTC), which arises when documents in different languages must be classified according to the same classification tree. We describe practical and cost-effective solutions for automatic Cross-Lingual Text Categorization, both in case a sufficient number of training examples is available for each new language and in the cas...
Feature Selecting Model in Automatic Text Categorization of Chinese Financial Industrial News
This work focuses on feature selection in the automatic text categorization of Chinese industrial and financial news. We use a feature selecting method suited to the characteristics of subclass Chinese financial and industrial news. However, subclass news remains an open challenge in real-world problems, which are often high-dimensional. Therefore, we proposed a feature selecting model in aut...
Spanish/English Cross-Lingual Categorization
This article deals with the problem of Cross-Lingual Text Categorization (CLTC), which arises when documents in different languages must be classified according to the same classification tree. We describe practical and cost-effective solutions for automatic Cross-Lingual Text Categorization, both in case a sufficient number of training examples is available for each new language and in the cas...