Approximating Document Frequency with Term Count Values
Authors
Abstract
For bounded datasets such as the TREC Web Track (WT10g), computing term frequency (TF) and inverse document frequency (IDF) is not difficult. However, when the corpus is the entire web, direct IDF calculation is impossible and values must instead be estimated. Most available datasets provide values for term count (TC), meaning the number of times a certain term occurs in the entire corpus. Intuitively, this value differs from document frequency (DF), the number of documents (e.g., web pages) in which a certain term occurs. We conduct a comparison study between TC and DF values within the Web as Corpus (WaC). We find a very strong correlation, with Spearman’s ρ ≥ 0.8 (p ≤ 0.005), which makes us confident in claiming that for such recently created corpora the TC and DF values can be used interchangeably to compute IDF values. These results are useful for the generation of accurate lexical signatures based on the TF-IDF scheme.
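The distinction between TC and DF, and the rank correlation used to compare them, can be illustrated with a minimal sketch. The toy corpus, the whitespace tokenizer, and all function names below are assumptions made for illustration; they are not the paper's data or implementation, and a real study would run on a corpus such as WaC with proper tokenization.

```python
from collections import Counter
from math import sqrt


def term_count_and_document_frequency(documents):
    """Return (TC, DF) counters for a list of document strings.

    TC counts every occurrence of a term in the corpus;
    DF counts the number of documents that contain the term at least once.
    Tokenization is a naive lowercase whitespace split (an assumption).
    """
    tc, df = Counter(), Counter()
    for doc in documents:
        tokens = doc.lower().split()
        tc.update(tokens)        # every occurrence counts
        df.update(set(tokens))   # each document counts at most once per term
    return tc, df


def average_ranks(values):
    """Assign 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("rho is undefined when one variable is constant")
    return cov / (sx * sy)


# Hypothetical four-document corpus, for illustration only.
documents = [
    "the cat sat on the mat",
    "the dog sat",
    "a cat and a dog",
    "the the the end",
]
tc, df = term_count_and_document_frequency(documents)
terms = sorted(tc)
rho = spearman_rho([tc[t] for t in terms], [df[t] for t in terms])
print(f"TC/DF Spearman rho on the toy corpus: {rho:.3f}")
```

Since a term's TC can never be smaller than its DF, substituting TC into an IDF formula changes the absolute values. IDF decreases monotonically with the frequency it is given, so a high rank correlation between TC and DF implies that the two IDF variants order terms similarly, and that ordering is what a lexical signature depends on.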
Similar resources
Correlation of Term Count and Document Frequency for Google N-Grams
For bounded datasets such as the TREC Web Track (WT10g) the computation of term frequency (TF) and inverse document frequency (IDF) is not difficult. However, when the corpus is the entire web, direct IDF calculation is impossible and values must instead be estimated. Most available datasets provide values for term count (TC), meaning the number of times a certain term occurs in the entire corpus...
Filtering Methods for Feature Selection in Web-Document Clustering
This paper presents the results of a comparative study of filtering methods for feature selection in web document clustering. First, we focused on feature selection methods based on Mutual Information (MI) and Information Gain (IG). With those features and feature values, and using MI and IG, we extracted from documents representative max-value features as well as a representative cluster for a...
SentiTFIDF – Sentiment Classification using Relative Term Frequency Inverse Document Frequency
Sentiment classification refers to the computational techniques for classifying whether the sentiment of a text is positive or negative. Statistical techniques based on term presence and term frequency, using a Support Vector Machine, are popularly used for sentiment classification. This paper presents an approach for classifying a term as positive or negative based on its proportional frequency c...
A Proximity Probabilistic Model for Information Retrieval
We propose a proximity probabilistic model (PPM) that advances a bag-of-words probabilistic retrieval model. In our proposed model, a document is transformed to a pseudo document, in which a term count is propagated to other nearby terms. Then we consider three heuristics, i.e., the distance of two query term occurrences, their order, and term weights, and try four kernel functions in measuring...
Measuring Popularity of Machine-Generated Sentences Using Term Count, Document Frequency, and Dependency Language Model
We investigated the notion of “popularity” for machine-generated sentences. We defined a popular sentence as one that contains words that are frequently used, appear in many documents, and contain frequent dependencies. We measured the popularity of sentences based on three components: content morpheme count, document frequency, and dependency relationships. To consider the characteristics of a...
Journal title:
- CoRR
Volume abs/0807.3755, issue unspecified
Pages -
Publication date 2008