An Approach for Identifying URLs Based on Division Score and Link Score in Focused Crawler

نویسندگان

  • Debashis Hati
  • Amritesh Kumar
  • A. Pal
  • D. S. Tomar
چکیده

The rapid growth of the World Wide Web (WWW) poses unprecedented scaling challenges for general-purpose crawlers. Crawlers are software which can traverse the internet and retrieve web pages by hyperlinks. The focused crawler of a special-purpose search engine aims to selectively seek out pages that are relevant to a pre-defined set of topics, rather than to exploit all regions of the Web. Focused crawler is developed to collect relevant web pages of interested topics from the Internet. Maintaining currency of search engine indices by exhaustive crawling is rapidly becoming impossible due to the increasing size of the web. Focused crawlers aim to search only the subset of the web related to a specific topic, and offer a potential solution to the problem. In our proposed approach, we calculate the link score based on average relevancy score of parent pages (because we know that the parent page is always related to child page which means that for detailed information any author prefers the child page) and division score (means how many topic keywords belong to division in which particular link belongs). After finding out link score, we compare the link score with some threshold value. If link score is greater than or equal to threshold value, then it is relevant link. Otherwise, it is discarded. Focused crawler first fetches that link which has greater value compared to all link scores and threshold.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Prioritize the ordering of URL queue in Focused crawler

The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler it is not an simple task to download the domain specific web pages. This unfocused approach often shows undesired results. Therefore, several new ideas have been proposed, among them a key technique is focused crawling which is able to crawl particular topical...

متن کامل

Unvisited URL Relevancy Calculation in Focused Crawling Based on Naïve Bayesian Classification

Vertical search engines use focused crawler as their key component and develop some specific algorithms to select web pages relevant to some pre-defined set of topics. Crawlers are software which can traverse the internet and retrieve web pages by hyperlinks. The focused crawler of a special-purpose search engine aims to selectively seek out pages that are relevant to a pre-defined set of topic...

متن کامل

Intelligent Event Focused Crawling

There is need for an integrated event focused crawling system to collect Web data about key events. When an event occurs, many users try to locate the most up-todate information about that event. Yet, there is little systematic collecting and archiving anywhere of information about events. We propose intelligent event focused crawling for automatic event tracking and archiving, as well as effec...

متن کامل

Empirical evaluation of the link and content-based focused Treasure-Crawler

Indexing the Web is becoming a laborious task for search engines as the Web exponentially grows in size and distribution. Presently, the most effective known approach to overcome this problem is the use of focused crawlers. A focused crawler applies a proper algorithm in order to detect the pages on the Web that relate to its topic of interest. For this purpose we proposed a custom method that ...

متن کامل

Clustered Based User-Interest Ontology Construction for Selecting Seed URLs of Focused Crawler

With the increasing number of accessible web pages on Internet, it has become gradually difficult for users to find the web pages that are relevant to their particular needs. Knowledge about computer users is very beneficial for assisting them, predicting their future actions. Seed URLs selection for focused Web crawler intends to guide related and valuable information that meets a user’s perso...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2010