Top-k Document Retrieval in External Memory
Authors
Abstract
Let D be a given set of (string) documents of total length n. The top-k document retrieval problem is to index D such that, when a pattern P of length p and a parameter k arrive as a query, the index returns the k documents most relevant to P. Hon et al. [22] proposed a linear-space framework that solves this problem in O(p + k log k) time. Navarro and Nekrich [33] improved the query time to O(p + k). These results are powerful enough to support arbitrary relevance functions such as frequency, proximity, and PageRank. Despite continued progress on the theoretical, practical, and compression aspects of this problem, non-trivial bounds in the external memory model have so far been elusive. In this paper, we propose the first external memory index supporting top-k document retrieval queries (with unsorted output) in optimal O(p/B + log_B n + k/B) I/Os, where B is the block size. The index occupies almost linear space, O(n log* n) words, where log* n is the iterated logarithm of n. We also improve the existing internal memory results. Specifically, we propose a linear-space index that retrieves the top-k documents in O(k) time once the locus of the pattern match is given.
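To make the problem statement concrete, the following is a minimal brute-force sketch of a top-k query that uses term frequency (one of the relevance functions named above) as the score. It is not the index proposed in the paper: it scans every document on each query, so it costs time proportional to n rather than O(p + k). The function names `count_occurrences` and `top_k_documents`, and the tie-breaking rule by document position, are assumptions made for illustration.

```python
import heapq


def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, allowing overlaps."""
    if not pattern:
        return 0
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Advance by one position so overlapping matches are counted.
        start = text.find(pattern, start + 1)
    return count


def top_k_documents(documents: list[str], pattern: str, k: int) -> list[int]:
    """Return indices of the k documents in which pattern occurs most often.

    Documents that do not contain the pattern are never reported.
    Ties are broken in favour of the smaller document index (an assumption).
    """
    scored = []
    for index, text in enumerate(documents):
        frequency = count_occurrences(text, pattern)
        if frequency > 0:
            scored.append((frequency, index))
    best = heapq.nlargest(k, scored, key=lambda pair: (pair[0], -pair[1]))
    return [index for _, index in best]


if __name__ == "__main__":
    docs = ["banana", "bandana", "cabana", "apple"]
    print(top_k_documents(docs, "an", 2))
```

A baseline like this is mainly useful for checking the answers of a real index on small inputs; the point of the results above is to avoid the full scan.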
Similar resources
On Optimal Top-K String Retrieval
Let D = {d1, d2, d3, ..., dD} be a given set of D (string) documents of total length n. The top-k document retrieval problem is to index D such that, when a pattern P of length p and a parameter k arrive as a query, the index returns the k documents most relevant to P. Hon et al. [13] gave the first linear-space framework to solve this problem in O(p + k log k) time. This was improv...
Space-efficient Data Structures for String Searching and Retrieval
Chapter 1: Introduction. 1.1 The Models of Computation. 1.2 Our Contributions ...
Investigating the Impact of Authors’ Rank in Bibliographic Networks on Expertise Retrieval
Background and Aim: This research investigates the impact of authors’ rank in bibliographic networks on the document-centered model of expertise retrieval. Its purpose is to find out which kinds of author ranking in bibliographic networks can improve the performance of the document-centered model. Methodology: The current research is experimental. To operationalize research goals, a new test colle...
Document Image Retrieval Based on Keyword Spotting Using Relevance Feedback
Keyword spotting is a well-known method in document image retrieval. In this method, search in document images is based on a query word image. In this paper, an approach for document image retrieval based on keyword spotting is proposed. In the proposed method, a framework using relevance feedback is presented. Relevance feedback, an interactive and efficient method, is used in this paper to imp...
Shared-Constraint Range Reporting
Orthogonal range reporting is one of the classic and most fundamental data structure problems. A (2,1,1) query is a three-dimensional query with a two-sided constraint on the first dimension and a one-sided constraint on each of the second and third dimensions. Given a set of N points in three dimensions, a particular formulation of such a (2,1,1) query (known as four-sided range reporting in three dimensions) ...