Distributed Multisearch and Resource Selection for the TREC Million Query Track

Authors

  • Christopher T. Fallen
  • Gregory B. Newby
  • Kylie McCormick
Abstract

A distributed information retrieval system with resource‐selection and result‐set merging capability was used to search subsets of the GOV2 document corpus for the 2008 TREC Million Query Track. The GOV2 collection was partitioned into host‐name subcollections and distributed to multiple remote machines. The Multisearch demonstration application restricted each search to a fraction of the available sub‐collections that was pre‐determined by a resource‐selection algorithm. Experiment results from topic‐by‐topic resource selection and aggregate topic resource selection are compared. The sensitivity of Multisearch retrieval performance to variations in the resource selection algorithm is discussed.

The information processing research group at ARSC works on problems affecting the performance of distributed information retrieval applications such as metasearch [1], federated search [2], and collection sampling [3]. An ongoing goal of this research is to guide the selection of standards and reference implementations for Grid Information Retrieval (GIR) applications [4]. Prototype GIR applications developed at ARSC help to evaluate theoretical research and gain experience with the capabilities and limitations of existing APIs, middleware, and security requirements. The TREC experiments provide an additional context to test and develop distributed IR technology.

Prior TREC Terabyte (TB) and Million Query (MQ) Track experiments performed at ARSC have explored the IR performance and search efficiency of result‐set merging and ranking across small numbers of heterogeneous systems and large numbers of homogeneous systems. In the 2005 TREC [5] TB Track [6], the ARSC IR group used a variant of the logistic regression merging strategy [7], modified for efficiency, to merge results from a metasearch‐style application that searched and merged results from two indexed copies of the GOV2 corpus [8]. One index was constructed with the Lucene Toolkit and the other index was constructed with the Amberfish application [9].

For the 2006 TREC [10] TB Track [11], the GOV2 corpus was partitioned into approximately 17,000 collections by grouping documents with identical URL host names [12]. Each query was searched against every collection and the ranked results from each collection were merged using the logistic regression algorithm used in the 2005 TB track. The large number of document collections used in the 2006 experiment, coupled with the distribution of collection size (measured in number of documents contained) that spanned five orders of magnitude, introduced significant wall‐clock, bandwidth, and IR performance problems. Follow‐up work found that the bandwidth performance could be improved somewhat without sacrificing IR performance by


Similar resources

Collection Selection Based on Historical Performance for Efficient Processing

A Grid Information Retrieval (GIR) simulation was used to process the TREC Million Query Track queries. The GOV2 collection was partitioned by hostname and the aggregate performance of each host, as measured by qrel counts from the past TREC Terabyte Tracks, was used to rank the hosts in order of quality. Only the 100 highest quality hosts were included in the Grid IR simulation, representing l...
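The host-ranking step described in this abstract can be illustrated with a minimal Python sketch. It assumes that "qrel counts" means the number of documents judged relevant on each host, which the abstract does not spell out; the function names, the `(topic_id, doc_id, relevance)` tuple layout, and the `doc_urls` mapping are hypothetical and chosen only for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the lower-cased host name of a URL (the partition key)."""
    return (urlsplit(url).hostname or "").lower()


def rank_hosts_by_qrels(qrels, doc_urls, top_k=100):
    """Rank hosts by how many judged-relevant documents they contain.

    qrels    : iterable of (topic_id, doc_id, relevance) tuples (assumed layout)
    doc_urls : mapping from doc_id to the document's URL (assumed to be available)
    top_k    : number of hosts to keep
    """
    counts = Counter()
    for _topic, doc_id, relevance in qrels:
        # Assumption: only positive relevance judgments count toward a host.
        if relevance > 0 and doc_id in doc_urls:
            counts[host_of(doc_urls[doc_id])] += 1
    # Sort by descending count, then by host name, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [host for host, _count in ranked[:top_k]]
```

Searching only the hosts returned by `rank_hosts_by_qrels` would then restrict each query to a fixed subset of the host-name collections; a per-topic variant could apply the same counting to the judgments of a single topic.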

IIIT Hyderabad at Million Query Track TREC 2009

This was our maiden attempt at the Million Query track, TREC 2009. We submitted three runs for the ad-hoc retrieval task in the Million Query track. We explored ad-hoc retrieval of web pages using Hadoop, a distributed infrastructure. To enhance recall, we expanded the queries using WordNet and also by combining the query with all possible subsets of tokens present in the query. To prevent query drift we ex...

University of Amsterdam and University of Twente at the TREC 2007 Million Query Track

In this paper, we document our submissions to the TREC 2007 Million Query track. Our main aim is to compare results of the earlier Terabyte tracks to the Million Query track. We submitted a number of runs using different document representations (such as full-text, title-fields, or incoming anchor-texts) to increase pool diversity. The initial results show broad agreement in system rankings ove...

RUC at TREC 2014: Select Resources Using Topic Models

This paper describes the work done at Renmin University of China for the Federated Web Search Track of TREC 2014. We participated in the resource selection task. We used the LDA topic modeling approach to select potentially relevant resources for a given query. The initial results are promising.

Query Transformations for Result Merging

This paper describes Carnegie Mellon University’s entry at the TREC 2014 Federated Web Search track (FedWeb14). Federated search pipelines typically have two components: (i) resource-selection, and (ii) result-merging. This work documents experiments to modify queries to merge results in the federated-search pipeline. Approaches from previous attempts at solving this problem involved custom que...


Publication date: 2008