A New Probabilistic Model of Text Classification and Retrieval
Abstract
This paper introduces the multinomial model of text classification and retrieval. One important feature of the model is that the tf statistic, which usually appears in probabilistic IR models as a heuristic, is an integral part of the model. Another is that the variable length of documents is accounted for, without either making a uniform length assumption or using length normalization. The multinomial model employs independence assumptions which are similar to assumptions made in previous probabilistic models, particularly the binary independence model and the 2-Poisson model. The use of simulation to study the model is described. Performance of the model is evaluated on the TREC-3 routing task. Results are compared with the binary independence model and with the simulation studies.
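To make the role of the tf statistic concrete, the following is a minimal sketch of a generic multinomial document model, not the paper's exact formulation. It assumes two classes (relevant and non-relevant), add-one smoothing, and a log-odds scoring rule; the function names, the `alpha` parameter, and the toy documents are all hypothetical. Each term occurrence contributes one log-ratio factor, so term frequency enters the score directly and longer documents simply contribute more factors, with no separate length normalization step.

```python
import math
from collections import Counter


def estimate_term_probs(docs, vocab, alpha=1.0):
    """Estimate smoothed multinomial term probabilities for one class.

    docs  : list of tokenized documents (lists of strings)
    vocab : iterable of terms to model
    alpha : additive smoothing constant (an assumption of this sketch)
    """
    vocab = list(vocab)
    counts = Counter()
    for doc in docs:
        counts.update(t for t in doc if t in vocab)
    denom = sum(counts.values()) + alpha * len(vocab)
    return {t: (counts[t] + alpha) / denom for t in vocab}


def log_odds_score(doc, p_rel, p_non):
    """Log-odds of a document under the two class models.

    The multinomial coefficient is identical for both classes and
    cancels, leaving sum over terms of tf * log(p_rel / p_non).
    """
    tf = Counter(t for t in doc if t in p_rel)
    return sum(n * math.log(p_rel[t] / p_non[t]) for t, n in tf.items())


# Toy training data (hypothetical).
relevant = [["routing", "query", "retrieval"],
            ["retrieval", "model", "retrieval"]]
non_relevant = [["lisp", "framework", "model"],
                ["image", "lisp"]]

vocab = sorted({t for d in relevant + non_relevant for t in d})
p_rel = estimate_term_probs(relevant, vocab)
p_non = estimate_term_probs(non_relevant, vocab)

score_a = log_odds_score(["retrieval", "retrieval", "query"], p_rel, p_non)
score_b = log_odds_score(["lisp", "image"], p_rel, p_non)
print(score_a, score_b)
```

In this toy setup a document dominated by terms seen in the relevant set receives a positive score and one dominated by non-relevant terms receives a negative score; ranking documents by this score is one plausible way to use such a model for routing.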
Similar resources
A Term Association Translation Model for Naive Bayes Text Classification
Text classification (TC) has long been an important research topic in information retrieval (IR) related areas. In the literature, the bag-of-words (BoW) model has been widely used to represent a document in text classification and many other applications. However, BoW, which ignores the relationships between terms, offers a rather poor document representation. Some previous research has shown tha...
A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization
The Rocchio relevance feedback algorithm is one of the most popular and widely applied learning methods from information retrieval. Here, a probabilistic analysis of this algorithm is presented in a text categorization framework. The analysis gives theoretical insight into the heuristics used in the Rocchio algorithm, particularly the word weighting scheme and the similarity metric. It also sug...
Independent component analysis for understanding multimedia content
This paper focuses on using independent component analysis of combined text and image data from web pages. This has potential for search and retrieval applications in order to retrieve more meaningful and context dependent content. It is demonstrated that using ICA on combined text and image features provides a synergistic effect, i.e., the retrieval classification rates increase if based on mult...
A Common Lisp Framework for Document Classification and Retrieval
This paper describes the Document Classification Substrate (DCS) and accompanying protocols. The DCS is a framework of Lisp support code facilitating the prototyping and deployment of systems for automatic document classification and retrieval applications. The DCS design reflects the following observations concerning the problem of classification of texts. 1. Initial preprocessing (lexical feature...