A Focused Crawler Combinatory Link and Content Model Based on T-Graph Principles

Author

  • Ali Seyfi
Abstract

A focused Web crawler has two main tasks: finding relevant, topic-specific documents on the Web, and prioritizing them so that they can later be downloaded effectively and reliably. For the first task, we propose a custom algorithm that fetches and analyzes the most effective HTML structural elements of the page, together with the topical boundary and anchor text of each unvisited link. From these, the topical focus of an unvisited page can be predicted with high accuracy. Our method thus combines link-based and content-based approaches. For the second task, we propose a scoring function for relevant URLs that uses a T-Graph (Treasure Graph) to help prioritize the unvisited links that will later be put into the fetching queue. Our Web search system is called the Treasure-Crawler. This paper presents the architectural design of the Treasure-Crawler system, which satisfies the principal requirements of a focused Web crawler, and supports the correctness of the system structure, including all its modules, through illustrations and test results.
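The second task described above, ordering unvisited links before they enter the fetching queue, can be illustrated with a minimal sketch. This is not the Treasure-Crawler's T-Graph scoring function, which the abstract does not specify. As a stand-in, it scores each link by simple keyword overlap between a set of topic terms and the link's anchor text and surrounding text. The names `link_score`, `Frontier`, and `anchor_weight`, and the 0.6 weight, are all assumptions made for illustration.

```python
import heapq
import re
from typing import List, Set, Tuple


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(text: str, topic_terms: Set[str]) -> float:
    """Fraction of topic terms that occur in the text (0.0 to 1.0)."""
    if not topic_terms:
        return 0.0
    return len(tokenize(text) & topic_terms) / len(topic_terms)


def link_score(anchor_text: str, context_text: str,
               topic_terms: Set[str], anchor_weight: float = 0.6) -> float:
    """Hypothetical relevance estimate for an unvisited link.

    Weighted mix of anchor-text overlap and surrounding-text overlap.
    A keyword-overlap stand-in, not the paper's T-Graph function.
    """
    return (anchor_weight * overlap(anchor_text, topic_terms)
            + (1.0 - anchor_weight) * overlap(context_text, topic_terms))


class Frontier:
    """Priority queue of unvisited URLs; the highest score is popped first."""

    def __init__(self) -> None:
        self._heap: List[Tuple[float, int, str]] = []
        self._seen: Set[str] = set()
        self._counter = 0  # tie-breaker: equal scores pop in insertion order

    def push(self, url: str, score: float) -> None:
        if url in self._seen:  # never enqueue the same URL twice
            return
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated.
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


def build_demo_frontier() -> Frontier:
    """Small usage example with made-up URLs and texts."""
    topic = {"focused", "crawler", "web"}
    frontier = Frontier()
    frontier.push("http://example.com/recipes",
                  link_score("cooking recipes", "pasta and sauces", topic))
    frontier.push("http://example.com/crawling",
                  link_score("focused web crawler", "survey of crawler design", topic))
    return frontier
```

In the demo, the crawling link is popped before the recipes link even though it was pushed second, because its anchor text and context share terms with the topic. A real focused crawler would replace `link_score` with its own predictor and would typically re-score links as new evidence arrives.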

Similar resources

Focused Crawling

Focused crawling is an efficient mechanism for discovering resources of interest on the web. Link structure is an important property of the web that defines its content. In this thesis, FOCUS, a novel focused crawler, is described, which primarily uses the link structure of the web in its crawling strategy. It uses currently available search engine APIs, provided by Google, to construct a layered...

An Adaptive Updating Topic Specific Web Search System Using T-Graph

Problem statement: The main goal of a Web crawler is to collect documents that are relevant to a given topic in which the search engine specializes. These topic-specific search systems typically use the whole document’s content to predict the importance of an unvisited link. But current research has proven that the document’s content pointed to by an unvisited link is mainly dependent on th...

Prioritize the ordering of URL queue in Focused crawler

The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler, it is not a simple task to download domain-specific web pages. An unfocused approach often shows undesired results. Therefore, several new ideas have been proposed; a key technique among them is focused crawling, which is able to crawl particular topical...

Empirical evaluation of the link and content-based focused Treasure-Crawler

Indexing the Web is becoming a laborious task for search engines as the Web exponentially grows in size and distribution. Presently, the most effective known approach to overcome this problem is the use of focused crawlers. A focused crawler applies a proper algorithm in order to detect the pages on the Web that relate to its topic of interest. For this purpose we proposed a custom method that ...

A Focused Crawler Based on Correlation Analysis

With the rapid development of network and information technology, there is a huge amount of data on the internet. A major problem faced by most researchers is how to effectively filter information on a particular subject or field out of these data. In this paper, we try to build a focused crawler based on the vector space model and TF-IDF text correlation analysis. We...

Journal title:
  • Computer Standards & Interfaces

Volume 43, Issue

Pages  -

Publication date 2016