Semi-subsumed Events: A Probabilistic Semantics of the BM25 Term Frequency Quantification
Authors
Abstract
Through BM25, the asymptotic term frequency quantification TF = tf/(tf+K), where tf is the within-document term frequency and K is a normalisation factor, became popular. This paper reports a finding regarding the meaning of the TF quantification: in the triangle of independence and subsumption, the TF quantification forms the altitude, that is, the middle between independent and subsumed events. We refer to this new assumption as semi-subsumed. While this finding of a well-defined probabilistic assumption solves the probabilistic interpretation of the BM25 TF quantification, it is also of wider impact regarding probability theory.
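As a minimal sketch of the quantification described above, the following Python snippet evaluates TF = tf/(tf+K) for a few term frequencies. The function name `tf_quantification` and the value K = 1.0 are illustrative assumptions and are not taken from the paper; the snippet shows only the saturating shape of the formula, not the semi-subsumed derivation itself.

```python
def tf_quantification(tf: float, K: float) -> float:
    """Asymptotic TF quantification TF = tf / (tf + K).

    tf: within-document term frequency (non-negative).
    K:  normalisation factor (positive); the value used below is illustrative.
    """
    if tf < 0 or K <= 0:
        raise ValueError("tf must be non-negative and K must be positive")
    return tf / (tf + K)


K = 1.0  # assumed value for illustration only
frequencies = (0, 1, 2, 3, 10, 100)
values = [tf_quantification(tf, K) for tf in frequencies]

for tf, value in zip(frequencies, values):
    # Each additional occurrence contributes less; the value stays below 1.
    print(f"tf={tf:>3}  TF={value:.4f}")
```

The values grow with tf but never reach 1, which is the asymptotic behaviour the abstract refers to: the first occurrence of a term counts most, and further occurrences add progressively less.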
Similar resources
A Proximity Probabilistic Model for Information Retrieval
We propose a proximity probabilistic model (PPM) that advances a bag-of-words probabilistic retrieval model. In our proposed model, a document is transformed to a pseudo document, in which a term count is propagated to other nearby terms. Then we consider three heuristics, i.e., the distance of two query term occurrences, their order, and term weights, and try four kernel functions in measuring...
The Effect of Weighted Term Frequencies on Probabilistic Latent Semantic Term Relationships
Probabilistic latent semantic analysis (PLSA) is a method of calculating term relationships within a document set using term frequencies. It is well known within the information retrieval community that raw term frequencies contain various biases that affect the precision of the retrieval system. Weighting schemes, such as BM25, have been developed in order to remove such biases and hence impro...
A Log-Logistic Model-Based Interpretation of TF Normalization of BM25
The effectiveness of BM25 retrieval function is mainly due to its sub-linear term frequency (TF) normalization component, which is controlled by a parameter k1. Although BM25 was derived based on the classic probabilistic retrieval model, it has been so far unclear how to interpret its parameter k1 probabilistically, making it hard to optimize the setting of this parameter. In this paper, we pr...
Semantically Enhanced Term Frequency
In this paper, we complement the term frequency, which is used in many bag-of-words based information retrieval models, with information about the semantic relatedness of query and document terms. Our experiments show that when employed in the standard probabilistic retrieval model BM25, the additional semantic information significantly outperforms the standard term frequency, and also improves...
Learning Term Weights for Ad-hoc Retrieval
Most Information Retrieval models compute the relevance score of a document for a given query by summing term weights specific to a document or a query. Heuristic approaches, like TF-IDF, or probabilistic models, like BM25, are used to specify how a term weight is computed. In this paper, we propose to leverage learning-to-rank principles to learn how to compute a term weight for a given docume...