Using TF-IDF to Determine Word Relevance in Document Queries

نویسنده

  • Juan Ramos
چکیده

In this paper, we examine the results of applying Term Frequency Inverse Document Frequency (TF-IDF) to determine what words in a corpus of documents might be more favorable to use in a query. As the term implies, TF-IDF calculates values for each word in a document through an inverse proportion of the frequency of the word in a particular document to the percentage of documents the word appears in. Words with high TF-IDF numbers imply a strong relationship with the document they appear in, suggesting that if that word were to appear in a query, the document could be of interest to the user. We provide evidence that this simple algorithm efficiently categorizes relevant words that can enhance query retrieval.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Comparative Analysis of IDF Methods to Determine Word Relevance in Web Document

Inverse document frequency (IDF) is one of the most useful and widely used concepts in information retrieval. When it is used in combination with the term frequency (TF), the result is a very effective term weighting scheme (TF-IDF) that has been applied in information retrieval to determine the weight of the terms. Terms with high TF-IDF values imply a strong relationship with the document the...

متن کامل

R2D2at NTCIR: Using the Relevance-based Superimposition Model

Our information retrieval project submitted fully automatic ad-hoc results. We use only description fields as queries. is the baseline tf idf result, and is the result using the proposed RS model which expands document vectors based on the relevance of documents. This method is expected to show better retrieval effectiveness than conventional methods, such as query expansion. The RS run achieve...

متن کامل

Assigning Belief Scores to Names in Queries

Assuming that the goal of a person name query is to find references to a particular person, we argue that one can derive better relevance scores using probabilities derived from a language model of personal names than one can using corpus based occurrence frequencies such as inverse document frequency (idf). We present here a method of calculating person name match probability using a language ...

متن کامل

Compact Indexes for Flexible Top- k k Retrieval

We engineer a self-index based retrieval system capable of rank-safe evaluation of top-k queries. The framework generalizes the GREEDY approach of Culpepper et al. (ESA 2010) to handle multiterm queries, including over phrases. We propose two techniques which significantly reduce the ranking time for a wide range of popular Information Retrieval (IR) relevance measures, such as TF×IDF and BM25....

متن کامل

Click-words: learning to predict document keywords from a user perspective

MOTIVATION Recognizing words that are key to a document is important for ranking relevant scientific documents. Traditionally, important words in a document are either nominated subjectively by authors and indexers or selected objectively by some statistical measures. As an alternative, we propose to use documents' words popularity in user queries to identify click-words, a set of prominent wor...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2003