text classification rocchio

Training Algorithms for Linear Text Classifiers

1996

David D. Lewis Robert E. Schapire James P. Callan Ron Papka

Systems for text retrieval, routing, categorization and other IR tasks rely heavily on linear classifiers. We propose that two machine learning algorithms, the Widrow-Hoff and EG algorithms, be used in training linear text classifiers. In contrast to most IR methods, theoretical analysis provides performance guarantees and guidance on parameter settings for these algorithms. Experimental data is p...
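As a rough illustration of the two update rules named in this abstract, the sketch below uses the standard textbook forms of the Widrow-Hoff (additive) and exponentiated-gradient (multiplicative) updates for squared error. The function names, the learning rate `eta`, and the toy vectors are assumptions made for this example and are not taken from the paper.

```python
import math


def dot(w, x):
    """Inner product of two equal-length lists of floats."""
    return sum(wi * xi for wi, xi in zip(w, x))


def widrow_hoff_step(w, x, y, eta):
    """One Widrow-Hoff (LMS) update: additive gradient step on squared error."""
    err = dot(w, x) - y
    return [wi - 2.0 * eta * err * xi for wi, xi in zip(w, x)]


def eg_step(w, x, y, eta):
    """One exponentiated-gradient update: multiplicative step, then
    renormalization so the weights stay positive and sum to 1."""
    err = dot(w, x) - y
    unnorm = [wi * math.exp(-2.0 * eta * err * xi) for wi, xi in zip(w, x)]
    z = sum(unnorm)
    return [u / z for u in unnorm]


# Hypothetical toy example: one document vector x with target label y.
x = [1.0, 0.0, 1.0]
y = 1.0
w_wh = widrow_hoff_step([0.0, 0.0, 0.0], x, y, eta=0.25)
w_eg = eg_step([1 / 3, 1 / 3, 1 / 3], x, y, eta=0.25)
```

In this toy case both rules raise the weights of the two features present in the document; EG additionally keeps the weight vector on the probability simplex.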


Text Categorization through Multistrategy Learning and Visualization

2001

Ali Hadjarian Jerzy W. Bala Peter W. Pachowicz

This paper introduces a multistrategy learning approach to the categorization of text documents. The approach benefits from two existing, and in our view complementary, sets of categorization techniques: those based on Rocchio's algorithm and those belonging to the rule learning class of machine learning algorithms. Visualization is used for the presentation of the output of learning.


Text Categorization Based on Topic Model

Journal: Int. J. Computational Intelligence Systems 2008

Shibin Zhou Kan Li Yushu Liu

In the text literature, many topic models were proposed to represent documents and words as topics or latent topics in order to process text effectively and accurately. In this paper, we propose LDACLM or Latent Dirichlet Allocation Category Language Model for text categorization and estimate parameters of models by variational inference. As a variant of Latent Dirichlet Allocation Model, LDACL...


Experiment on Pseudo Relevance Feedback Method Using Taylor Formula at NTCIR-3 Patent Retrieval Task

2002

Kazuaki Kishida

Pseudo relevance feedback is empirically known as a useful method for enhancing retrieval performance. For example, we can apply the Rocchio method, which is a well-known relevance feedback method, to the results of an initial search by assuming that the top-ranked documents are relevant a priori. In this paper, for searching the NTCIR-3 patent test collection through pseudo feedback, we try to emplo...
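The pseudo relevance feedback idea described in this abstract can be sketched as follows: rank the documents for the initial query, treat the top `k` as relevant, and move the query toward their centroid with the positive part of the Rocchio formula. This is a minimal sketch only. The sparse dict representation, the cosine ranking, the default values of `k`, `alpha` and `beta`, and the three toy documents are all assumptions for this example, not the setup used in the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    num = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return num / (na * nb) if na and nb else 0.0


def rank(query, docs):
    """Document ids sorted by decreasing similarity to the query."""
    return sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)


def rocchio_prf(query, docs, k=2, alpha=1.0, beta=0.75):
    """Expand the query with the centroid of the top-k ranked documents,
    which are assumed (not known) to be relevant."""
    top = rank(query, docs)[:k]
    new_q = {t: alpha * w for t, w in query.items()}
    for d in top:
        for t, w in docs[d].items():
            new_q[t] = new_q.get(t, 0.0) + beta * w / len(top)
    return new_q


# Hypothetical toy collection.
docs = {
    "d1": {"patent": 1.0, "claim": 1.0},
    "d2": {"patent": 1.0, "invention": 1.0},
    "d3": {"weather": 1.0, "rain": 1.0},
}
query = {"patent": 1.0}
expanded = rocchio_prf(query, docs, k=2)
```

The expanded query picks up `claim` and `invention` from the two top-ranked documents, while terms from the unrelated document are left out. The negative (non-relevant) term of the full Rocchio formula is omitted here, since pseudo feedback supplies no non-relevant judgments.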


Using WordNet to Complement Training Information in Text Categorization

Journal: CoRR 1997

Manuel de Buenaga Rodríguez José María Gómez Hidalgo Belén Díaz-Agudo

Automatic Text Categorization (TC) is a complex and useful task for many natural language applications, and is usually performed through the use of a set of manually classified documents, a training collection. We suggest the utilization of additional resources like lexical databases to increase the amount of information that TC systems make use of, and thus, to improve their performance. Our a...


A Robust Model for Intelligent Text Classification

2001

Roberto Basili Alessandro Moschitti

Methods for incorporating linguistic content into text retrieval are receiving growing attention [16],[14]. Text categorization is an interesting area for evaluating and quantifying the impact of linguistic information. Work on text retrieval over the Internet suggests that embedding linguistic information at a suitable level within traditional quantitative approaches (e.g. sense distinc...


Pseudo Relevance Feedback Method Based On Taylor Expansion Of Retrieval Function In NTCIR-3 Patent Retrieval Task

2003

Kazuaki Kishida

Pseudo relevance feedback is empirically known as a useful method for enhancing retrieval performance. For example, we can apply the Rocchio method, which is a well-known relevance feedback method, to the results of an initial search by assuming that the top-ranked documents are relevant. In this paper, for searching the NTCIR-3 patent test collection through pseudo feedback, we employ two releva...


Query-By-Multiple-Examples using Support Vector Machines

Journal: JDIM 2009

Dell Zhang Wee Sun Lee

We identify and explore an Information Retrieval paradigm called Query-By-Multiple-Examples (QBME) where the information need is described not by a set of terms but by a set of documents. Intuitive ideas for QBME include using the centroid of these documents or the well-known Rocchio algorithm to construct the query vector. We consider this problem from the perspective of text classification, a...


Extraction of user preferences from a few positive documents

2003

Byeong Man Kim Qing Li Jong-Wan Kim

In this work, we propose a new method for extracting user preferences from a few documents that might interest users. To this end, we first extract candidate terms and choose a number of terms called initial representative keywords (IRKs) from them through fuzzy inference. Then, by expanding IRKs and reweighting them using term co-occurrence similarity, the final representative keywords are ex...


KAN and RinSCut: Lazy Linear Classifier and Rank-in-Score Threshold in Similarity-Based Text Categorization

2007

Kang Hyuk Lee Judy Kay Byeong Ho Kang

Two important research areas in statistical approaches for automated text categorization are similarity-based learning algorithms and associated thresholding strategies. The combination of these techniques significantly influences the overall performance of text categorization systems. After researching common techniques in both areas, we describe a lazy linear classifier known as the keyword a...
