Data cleansing for Web information retrieval using query independent features

Authors

  • Yiqun Liu
  • Min Zhang
  • Rongwei Cen
  • Liyun Ru
  • Shaoping Ma
Abstract

We report on a study undertaken to better understand which kinds of Web pages are most useful to Web search engine users by exploiting query-independent features of retrieval target pages. To our knowledge, there has been little research on query-independent Web page cleansing for Web information retrieval. Based on more than 30 million Web pages obtained from TREC and from a widely used Chinese search engine, SOGOU (www.sogou.com), we analyze the differences between retrieval target pages and ordinary ones. We also propose a learning-based data cleansing algorithm for removing Web pages that are unlikely to be useful for user requests. The results show that retrieval target pages can be separated from low-quality pages using query-independent features and cleansing algorithms. Our algorithm removes 95% of Web pages while losing less than 8% of retrieval target pages. This makes it possible for Web IR tools to meet over 92% of users' needs with only 5% of the pages on the Web.
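
The closing figures describe a trade-off between the fraction of pages removed and the fraction of retrieval target pages kept. The sketch below only illustrates how those two quantities are computed. It is not the authors' algorithm: the `Page` type, the `score` field (standing in for the output of some learned model over query-independent features), the threshold, and the toy data are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str          # hypothetical identifier
    score: float      # assumed quality score derived from query-independent features
    is_target: bool   # True if the page is a known retrieval target


def cleanse(pages, threshold):
    """Keep only pages whose score is at least the threshold."""
    return [p for p in pages if p.score >= threshold]


def reduction_rate(pages, kept):
    """Fraction of the original collection that was removed."""
    return 1 - len(kept) / len(pages)


def target_retention(pages, kept):
    """Fraction of retrieval target pages that survive cleansing."""
    total = sum(p.is_target for p in pages)
    return sum(p.is_target for p in kept) / total if total else 0.0


# Toy collection: 10 pages, 3 of which are retrieval targets.
pages = [
    Page("a", 0.9, True), Page("b", 0.8, True), Page("c", 0.3, True),
    Page("d", 0.7, False), Page("e", 0.6, False), Page("f", 0.5, False),
    Page("g", 0.4, False), Page("h", 0.2, False), Page("i", 0.1, False),
    Page("j", 0.1, False),
]

kept = cleanse(pages, threshold=0.75)
print(f"reduction: {reduction_rate(pages, kept):.2f}")
print(f"target retention: {target_retention(pages, kept):.2f}")
```

In these terms, the abstract reports a reduction rate of 95% (5% of pages kept) with target retention above 92% (less than 8% of target pages lost).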

Similar resources

Learning-based Web Data Cleansing for Information Retrieval

With the rapid growth of web information, selecting high-quality web pages that cover valuable information in a query-independent way is becoming more and more important in web IR research. Based on query-independent feature analysis, we propose a data cleansing algorithm that selects an important type of high-quality page (key resources) on the web. Study into the cleansed page set shows that the set co...


Improved Skips for Faster Postings List Intersection

Information retrieval can be achieved through computerized processes by generating a list of relevant responses to a query. The document processor, matching function, and query analyzer are the main components of an information retrieval system. Document retrieval systems are fundamentally based on Boolean, vector-space, probabilistic, and language models. In this paper, a new methodology for mat...


Developing a BIM-based Spatial Ontology for Semantic Querying of 3D Property Information

With the growing dominance of complex and multi-level urban structures, current cadastral systems, which are often developed based on 2D representations, are not capable of providing unambiguous spatial information about urban properties. Therefore, the concept of 3D cadastre is proposed to support 3D digital representation of land and properties and facilitate the communication of legal owners...


Selective Application of Query-Independent Features in Web Information Retrieval

The application of query-independent features, such as PageRank, can boost the retrieval effectiveness of a Web Information Retrieval (IR) system. In some previous works, a query-independent feature is uniformly applied to all queries. Other works predict the most useful feature based on the query type. However, the accuracy of the current query type prediction methods is not high. In this pape...

Journal:
  • JASIST

Volume 58

Publication date: 2007