IRRA at TREC 2010: Index Term Weighting by Divergence From Independence Model

Authors

  • Bekir Taner Dinçer
  • Ilker Kocabas
  • Bahar Karaoglan
Abstract

The IRRA (IR-Ra) group participated in the 2010 Web track. This year, the major concern is to examine the effect of supplementary methods on the effectiveness of the new nonparametric index term weighting model, divergence from independence (DFI). Every written text document contains words, but the words used in individual documents may differ due to many divergent (latent) factors, such as topic, author, and style. Some words are used intentionally by authors to compose the information content of documents, while others are used because of grammatical rules. The former set of words is commonly referred to as keywords or content-bearing words, and the latter as function words or stop words. Since function words are used because of grammatical rules, they should appear, to a greater or lesser extent, in almost all documents, irrespective of (or independently from) the information content of those documents. It is therefore reasonable to expect function words to be distributed proportionally to the lengths of documents. On the other hand, since content-bearing words are used intentionally by authors, their frequency distributions must be affected, and hence should differ from the frequency distributions of function words over a collection of documents. The content-bearing words of a document can be identified by measuring the divergence from independence. According to the DFI model, if the ratio of the frequencies of two different words remains constant across all documents, the occurrences of those words are said to be independent of the documents. Assume that the magnitude of a word's contribution to the information content of a particular document is proportional to the observed frequency of the word in that document. Then it can be said that both words contribute equally to the information content of all documents.
However, notice that an equal contribution to the information content of all documents actually implies no contribution. Such words can only be words that are used for a particular reason or rule, such as grammar; otherwise, a word could not appear in all documents having different information content. By analogy, the use of HTML tags in Web pages is a good basis for illustrating the notion of independence. Since function words can appear in all documents, not because of their contribution to the information content of documents but because of grammatical rules, they can be thought of as HTML tags. For instance, every Web page contains exactly two “html” tags and two “body” tags, so the ratio of the frequencies of the “html” and “body” tags remains constant across all Web pages. According to the independence model, this suggests that the occurrence of the “html” tag relative to the “body” tag does not depend on the Web page, and that the “html” and “body” tags contribute equally to the information content of each Web page. It is already known that HTML tags are used by design, independently of the information content of Web pages. The point here is that, by using the independence model, this property of HTML tags can be related to their observed frequency distributions over Web pages, and can thereby be recovered without any external knowledge. This definition of independence is easy to understand but hard to use in practice. To use it in practice, it is necessary to measure the degree of independence or dependence between a word and a document individually. In fact, for each word and document pair, the independence model can suggest the frequency expected under independence. This enables us to decide whether a particular word is independent of a given document.
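
The idea of comparing an observed frequency with the frequency expected under independence can be illustrated with a small sketch. The abstract does not state a formula, so the code below assumes the usual contingency-table expectation: a word's collection frequency times the document length, divided by the total number of tokens in the collection. The function name `expected_frequencies` and the toy documents are hypothetical and are not taken from the paper.

```python
from collections import Counter


def expected_frequencies(docs):
    """Return observed counts and counts expected under independence.

    ASSUMPTION: expected count = (collection frequency of the term)
    * (document length) / (total tokens in the collection), i.e. the
    standard contingency-table expectation. The abstract gives no formula.
    """
    observed = [Counter(doc) for doc in docs]
    doc_lengths = [len(doc) for doc in docs]
    total_tokens = sum(doc_lengths)

    # Total occurrences of each term over the whole collection.
    collection_freq = Counter()
    for counts in observed:
        collection_freq.update(counts)

    # One dict per document, mapping every term to its expected count.
    expected = [
        {term: cf * length / total_tokens for term, cf in collection_freq.items()}
        for length in doc_lengths
    ]
    return observed, expected


# Hypothetical toy collection: "the" grows with document length, "cat" does not.
docs = [
    ["the", "cat", "the", "cat"],
    ["the", "dog", "the", "bird", "the", "fish", "the", "tree"],
]
observed, expected = expected_frequencies(docs)

for j, (obs, exp) in enumerate(zip(observed, expected)):
    for term in ("the", "cat"):
        diff = obs[term] - exp[term]
        print(f"doc {j}: {term!r} observed={obs[term]} "
              f"expected={exp[term]:.2f} diff={diff:+.2f}")
```

In this toy collection, "the" occurs exactly as often as its expected count in both documents, so it behaves like a function word. "cat" occurs more often than expected in the first document and less often in the second, which is the kind of divergence that would mark it as content-bearing for the first document.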


Similar resources

IRRA at TREC 2012: Divergence From Independence (DFI)

IRRA (IR-Ra) group participated in the 2012 Web track, with a system implementing a non-parametric term weighting method based on measuring the divergence from independence (DFI). This is the third year of participation for IRRA group, following the participations in TREC 2009 and 2010 Web tracks. In this year, the aim is to evaluate a new DFI-based term weighting model developed on the basis o...


IRRA at TREC 2009: Index Term Weighting Based on Divergence From Independence Model

IRRA (IR-Ra) group participated in the 2009 Web track (both adhoc task and diversity task) and the Million Query track. In this year, the major concern is to examine the effectiveness of a novel, nonparametric index term weighting model, divergence from independence (DFI). The notion of independence, which is the notion behind the well-known statistical exploratory data analysis technique calle...


Cengage Learning at the TREC 2010 Session Track

This paper details Cengage Learning’s TREC 2010 Session track submission and our efforts to improve retrieval performance over a user’s session. We use a number of different techniques to achieve this goal, including query term weighting, query expansion and re-ranking. In this paper we detail these techniques and the results of our submission. Using our query term weighting technique combined wi...


University of Glasgow at TREC 2007: Experiments in Blog and Enterprise Tracks with Terrier

In TREC 2007, we participate in four tasks of the Blog and Enterprise tracks. We continue experiments using Terrier [14], our modular and scalable Information Retrieval (IR) platform, and the Divergence From Randomness (DFR) framework. In particular, for the Blog track opinion finding task, we propose a statistical term weighting approach to identify opinionated documents. An alternative approa...


University of Glasgow at TREC 2004: Experiments in Web, Robust, and Terabyte Tracks with Terrier

With our participation in TREC 2004, we test Terrier, a modular and scalable Information Retrieval framework, in three tracks. For the mixed query task of the Web track, we employ a decision mechanism for selecting appropriate retrieval approaches on a per-query basis. For the robust track, in order to cope with the poorly performing queries, we use two pre-retrieval performance predictors and a ...



Publication date: 2010