Learning Capable Focused Crawler for Information Technology Domain

نویسندگان

  • Mukesh Kumar
  • Renu Vig
  • C. Aggarwal
  • F. Al-Garawi
  • D. Bergmark
  • Carl Lagoze
  • Martin Ester
  • Matthias Groß
چکیده

The Web provides us with a huge and endless resource for information. But, the rapidly growing size of the Web poses great challenge for general purpose crawlers and search engines. It is impossible for any search engine to index the whole Web. Focused crawler collects domain relevant pages from the Web by avoiding the irrelevant portion of the Web. Focused crawler can help the search engine to index all documents present on the Web related to a specific domain which in turn provides the search engine's users complete and up-to-date contents. In this paper we present a focused crawler capable of learning from the previous crawl results to collect the relevant documents. Crawling results for three consecutive learning phases are shown. Results indicate significant improvement in terms of relevancy to the focused domain

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Prioritize the ordering of URL queue in Focused crawler

The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler it is not an simple task to download the domain specific web pages. This unfocused approach often shows undesired results. Therefore, several new ideas have been proposed, among them a key technique is focused crawling which is able to crawl particular topical...

متن کامل

Domain-Specific Web Site Identification: The CROSSMARC Focused Web Crawler

This paper presents techniques for identifying domain specific web sites that have been implemented as part of the EC-funded R&D project, CROSSMARC. The project aims to develop technology for extracting interesting information from domain-specific web pages. It is therefore important for CROSSMARC to identify web sites in which interesting domain specific pages reside (focused web crawling). Th...

متن کامل

Improving the performance of focused web crawlers

This work addresses issues related to the design and implementation of focused crawlers. Several variants of state-of-the-art crawlers relying on web page content and link information for estimating the relevance of web pages to a given topic are proposed. Particular emphasis is given to crawlers capable of learning not only the content of relevant pages (as classic crawlers do) but also paths ...

متن کامل

Semantic Focused Crawling for Retrieving E-Commerce Information

Focused crawling is proposed to selectively seek out pages that are relevant to a predefined set of topics without downloading all pages of the Web. With the rapid growth of the E-commerce, how to discovery the specific information such as about buyer, seller and products etc. adapting for the online business user becomes a focused issue to the information search engine. We present a novel sema...

متن کامل

Survey on Self Adaptive Semantic Focused Crawling Using Ontology Learning

The Internet today has become a vast storehouse for a scintillating amount of knowledge. It is an excellent source of information catering to the needs of people of varied interests. But this process of information retrieval does have its shortcomings too viz. heterogeneity, ubiquity and ambiguity. Thus a self-adaptive semantic focused crawler SASF crawler that addresses these issues and optimi...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2012