Skinning the Cat: Comparing Alternative Text Mining Algorithms for Categorization
Abstract
Applying category labels to textual documents can be useful for 1) Search Indexing, 2) Document Filtering, and 3) Summarization. Many different algorithms have been proposed for applying category labels to text documents. We compare and contrast different approaches to text mining using Enterprise Miner for Text.

INTRODUCTION

The automatic classification of documents into categories is an increasingly important task. As document collections continue to grow at remarkable rates, the task of classifying the documents by hand can become unmanageable. However, without the organization provided by a classification system, the collection as a whole is nearly impossible to comprehend and specific documents are difficult to locate. Examples of document collections that are often organized into categories include web pages, patents, news articles, email, research papers, and various knowledge bases.

Text categorization techniques have traditionally determined a subset of terms that are most diagnostic of particular categories and then tried to predict the categories using the weighted frequencies of each of those terms in each document. We will refer to this technique as the truncation approach (since only a subset of terms is used). This approach is subject to several deficiencies:

1. It does not take into account terms that are highly correlated with each other, such as synonyms. As a result, it is very important to employ a useful stemming algorithm, as well.
2. Documents are rated close to each other only according to co-occurrence of terms. Documents may be semantically similar to each other while having very few of the truncated terms in common. Most of these terms only occur in a small percentage of the documents.
3. The terms used need to be recomputed for each category of interest.
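The truncation approach and its second and third deficiencies can be illustrated with a minimal sketch. It uses the nine example documents that appear later in this excerpt. Everything else is an assumption made for illustration: the stop list, the scoring rule (difference in document-frequency rate inside and outside the category), the function names, and the use of raw counts rather than weighted frequencies. The paper does not state which selection criterion or weighting it uses.

```python
from collections import Counter

# The nine example documents from this excerpt: (category, text).
DOCS = [
    ("fin", "route the cash and the check to the bank"),
    ("fin", "borrow with credit borrow based on credit"),
    ("fin", "I can borrow cash from the bank"),
    ("riv", "river boat floats up the river"),
    ("riv", "boat is by the dock near the bank"),
    ("riv", "the river boat is on the south bank"),
    ("riv", "the boat floats by the dock near the river bank"),
    ("par", "check the parade route to see the floats"),
    ("par", "the parade the parade the parade"),
]

# Hypothetical stop list; the excerpt does not say which words were dropped.
STOP = {"the", "and", "to", "with", "on", "i", "can", "from",
        "up", "is", "by", "near", "see"}


def tokens(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP]


def select_terms(docs, category, k):
    """Keep the k terms most diagnostic of `category`.

    Assumed score: share of in-category documents containing the term
    minus the share of out-of-category documents containing it.
    Ties are broken alphabetically so the result is deterministic.
    """
    inside = [set(tokens(t)) for c, t in docs if c == category]
    outside = [set(tokens(t)) for c, t in docs if c != category]
    vocab = set().union(*inside, *outside)

    def score(term):
        rate_in = sum(term in d for d in inside) / len(inside)
        rate_out = sum(term in d for d in outside) / len(outside)
        return rate_in - rate_out

    return sorted(vocab, key=lambda t: (-score(t), t))[:k]


def truncated_vector(text, kept_terms):
    """Represent a document by its raw counts of the kept terms only."""
    counts = Counter(tokens(text))
    return [counts[t] for t in kept_terms]


kept = select_terms(DOCS, "fin", k=2)
vectors = [truncated_vector(text, kept) for _, text in DOCS]
# Dot product between documents 1 and 2, which share a category.
overlap = sum(a * b for a, b in zip(vectors[0], vectors[1]))
print(kept, vectors[0], vectors[1], overlap)
```

Under these assumptions, documents 1 and 2 belong to the same category yet have no kept term in common, which is the situation described in deficiency 2. Calling `select_terms` with a different category returns a different term list, which is deficiency 3.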
These problems present themselves also for text retrieval; as a result it has become de rigueur to use a reduced-dimensionality vector-space model when retrieving documents using search terms. In the vector-space model, vectors in a multi-dimensional space can represent both documents and terms. To determine which documents match retrieval terms, Latent Semantic Analysis [3] is used to find the nearest documents in that space to the search terms.

We believe that the use of a reduced-dimensionality normalized vector-space model to represent documents in multi-dimensional space can be useful for both classification and categorization of text documents, particularly in the context of categorization approaches that are based on Euclidean distances between documents, such as discriminant analysis, neural networks, and memory-based reasoning.

In this paper, we compare the singular value decomposition (SVD) technique for projecting documents into a k-dimensional subspace to that of truncation, using Enterprise Miner for Text. Different techniques for weighting the terms for both approaches are compared, and both neural networks and memory-based

Cat   Document
fin   1  route the cash and the check to the bank
fin   2  borrow with credit borrow based on credit
fin   3  I can borrow cash from the bank
riv   4  river boat floats up the river
riv   5  boat is by the dock near the bank
riv   6  the river boat is on the south bank
riv   7  the boat floats by the dock near the river bank
par   8  check the parade route to see the floats
par   9  the parade the parade the parade

Term-Document Frequency Matrix, A

                 Documents
Term         1  2  3  4  5  6  7  8  9
route   1    1  0  0  0  0  0  0  1  0
cash    2    1  0  1  0  0  0  0  0  0
check   3    1  0  0  0  0  0  0  1  0
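As a rough illustration of the SVD projection described above, the following sketch builds the term-document frequency matrix A from the nine example documents and computes document coordinates in a k-dimensional subspace. Several choices here are assumptions rather than the paper's method: the stop list, the absence of any term weighting or normalization, the function names, and the use of power iteration with deflation on A^T A in place of a library SVD routine (chosen only to keep the example free of dependencies).

```python
import math

# The nine example documents from the table above: (category, text).
DOCS = [
    ("fin", "route the cash and the check to the bank"),
    ("fin", "borrow with credit borrow based on credit"),
    ("fin", "I can borrow cash from the bank"),
    ("riv", "river boat floats up the river"),
    ("riv", "boat is by the dock near the bank"),
    ("riv", "the river boat is on the south bank"),
    ("riv", "the boat floats by the dock near the river bank"),
    ("par", "check the parade route to see the floats"),
    ("par", "the parade the parade the parade"),
]

# Hypothetical stop list; the excerpt does not say which words were dropped.
STOP = {"the", "and", "to", "with", "on", "i", "can", "from",
        "up", "is", "by", "near", "see"}


def term_document_matrix(docs):
    """Return (terms, A) where A[i][j] counts terms[i] in document j."""
    tokenized = [[w for w in text.lower().split() if w not in STOP]
                 for _, text in docs]
    terms = []
    for words in tokenized:
        for w in words:
            if w not in terms:  # keep first-appearance order
                terms.append(w)
    A = [[words.count(t) for words in tokenized] for t in terms]
    return terms, A


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def top_eigenpairs(G, k, iters=2000):
    """Approximate the k largest eigenpairs of a symmetric PSD matrix
    by power iteration, deflating G after each pair is found."""
    n = len(G)
    G = [row[:] for row in G]  # work on a copy
    pairs = []
    for _ in range(k):
        v = [float(i + 1) for i in range(n)]  # fixed positive start vector
        for _ in range(iters):
            w = matvec(G, v)
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        lam = sum(a * b for a, b in zip(v, matvec(G, v)))  # Rayleigh quotient
        pairs.append((lam, v))
        for i in range(n):
            for j in range(n):
                G[i][j] -= lam * v[i] * v[j]
    return pairs


def project_documents(A, k):
    """Coordinates of each document in the k-dimensional SVD subspace."""
    n_docs = len(A[0])
    # Gram matrix G = A^T A; its eigenvectors are the right singular
    # vectors of A and its eigenvalues are the squared singular values.
    G = [[sum(row[i] * row[j] for row in A) for j in range(n_docs)]
         for i in range(n_docs)]
    pairs = top_eigenpairs(G, k)
    # Document j has coordinate sigma_i * v_i[j] along dimension i.
    return [[math.sqrt(max(lam, 0.0)) * v[j] for lam, v in pairs]
            for j in range(n_docs)]


terms, A = term_document_matrix(DOCS)
coords = project_documents(A, k=2)
for (cat, _), xy in zip(DOCS, coords):
    print(cat, [round(c, 3) for c in xy])
```

The first three rows of `A` correspond to the route, cash, and check rows shown in the matrix above. The signs of the printed coordinates are arbitrary, since a singular vector is only determined up to sign. Distances between rows of `coords` are what a distance-based method such as memory-based reasoning would operate on.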