Towards a Content-Provider-Friendly Web Page Crawler

Authors

  • Jie Xu
  • Qinglan Li
  • Huiming Qu
  • Alexandros Labrinidis
Abstract

Search engine quality depends on two factors: the quality of the ranking/matching algorithm and the freshness of the search engine’s index, which maintains a “snapshot” of the Web. Web crawlers capture web pages and refresh the index, but this is a never-ending quest, as web pages are updated frequently (and thus have to be re-crawled). Knowing when to re-crawl a web page is fundamentally linked to the freshness of the index, given the size of the Web today and the inherent resource constraints: re-crawling too frequently wastes bandwidth, while re-crawling too infrequently lowers the quality of the search engine. In this work, we address the scheduling problem for web crawlers, with the objective of optimizing the quality of the index (i.e., maximizing the freshness probability of the local repository as well as of the index). To this end, we utilize feedback from the users (content providers) on when their web pages are updated and consider the entire spectrum of collaboration, from no feedback to explicit update schedules. We propose a unified online scheduling algorithm that utilizes different levels of collaboration from content providers. Extensive experiments with real web traces demonstrate that cooperation from users plays a major role in improving search engine index quality.
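
The trade-off described in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes the common Poisson update model, in which a page with update rate λ is still fresh t hours after its last crawl with probability exp(-λt), and a greedy policy that spends a fixed crawl budget on the pages least likely to be fresh. The names `Page`, `freshness_probability`, `pick_pages_to_crawl`, and the `reported_update` field (standing in for provider feedback) are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Page:
    url: str
    update_rate: float   # assumed mean updates per hour (Poisson model)
    last_crawl: float    # time of the last crawl, in hours
    # Hypothetical provider feedback: time of an update the provider announced.
    reported_update: Optional[float] = None


def freshness_probability(page: Page, now: float) -> float:
    """Probability that the local copy still matches the live page."""
    if page.reported_update is not None:
        # Assumption: a cooperating provider reports every update, so the copy
        # is stale exactly when a reported update falls between the last crawl
        # and the current time.
        stale = page.last_crawl <= page.reported_update <= now
        return 0.0 if stale else 1.0
    # No feedback: fall back to the Poisson estimate exp(-rate * elapsed).
    elapsed = max(0.0, now - page.last_crawl)
    return math.exp(-page.update_rate * elapsed)


def pick_pages_to_crawl(pages: List[Page], now: float, budget: int) -> List[str]:
    """Greedy choice: re-crawl the `budget` pages least likely to be fresh."""
    ranked = sorted(pages, key=lambda p: freshness_probability(p, now))
    return [p.url for p in ranked[:budget]]


if __name__ == "__main__":
    pages = [
        Page("a", update_rate=0.1, last_crawl=0.0),
        Page("b", update_rate=1.0, last_crawl=0.0),
        Page("c", update_rate=0.01, last_crawl=0.0, reported_update=2.0),
    ]
    print(pick_pages_to_crawl(pages, now=5.0, budget=2))
```

In this sketch, a page with provider feedback is ranked with certainty (known stale or known fresh), while pages without feedback are ranked by an estimate, which is one way to see why cooperation could reduce wasted crawls.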

Related resources

Towards a Keyword-Focused Web Crawler

This paper concerns predicting the content of textual web documents based on features extracted from the web pages that link to them. The approach may be applied in an intelligent, keyword-focused web crawler. Experiments on publicly available real data from the Open Directory Project, using several classification models, are promising and indicate potential usefulness of the studied ap...

PDD Crawler: A focused web crawler using link and content analysis for relevance prediction

The majority of computer and mobile phone enthusiasts use the web for searching. Web search engines are used for searching; the results a search engine returns are provided to it by a software module known as the web crawler. The size of the web is increasing around the clock. The principal problem is to search this huge database for specific information. To state whethe...

Identifying Informative Web Content Blocks using Web Page Segmentation

Information extraction has become an important task for discovering useful knowledge or information from the Web. A crawler system, which gathers information from the Web, is one of the fundamental necessities of information extraction. A search engine uses a crawler to crawl and index web pages. The search engine takes into account only the informative content for indexing. In addition to info...

Improving the performance of focused web crawlers

This work addresses issues related to the design and implementation of focused crawlers. Several variants of state-of-the-art crawlers relying on web page content and link information for estimating the relevance of web pages to a given topic are proposed. Particular emphasis is given to crawlers capable of learning not only the content of relevant pages (as classic crawlers do) but also paths ...

Publication date: 2007