A Novel Architecture of a Parallel Web Crawler

Author

  • Shruti Sharma
Abstract

Due to the explosion in the size of the WWW [1,4,5], it has become essential to parallelize the crawling process. In this paper we present an architecture for a parallel crawler that consists of multiple crawling processes, called C-procs, which can run on a network of workstations. The proposed crawler is scalable and resilient against system crashes and other events. The aim of this architecture is to efficiently and effectively crawl the current set of publicly indexable web pages, maximizing the download rate while minimizing the overhead of parallelization.
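The abstract only names the core idea (several independent crawling processes sharing the crawl workload), so the following is a minimal, hypothetical Python sketch of that idea, not the paper's actual design. The shared in-memory frontier, the seen-URL dictionary, the `c_proc` function name (loosely after the paper's C-procs), the seed URL, and the four-worker setup are all illustrative assumptions.

```python
# Minimal sketch of a parallel crawler: several crawling processes
# (c_proc, loosely after the paper's C-procs) share a URL frontier
# and a best-effort set of already-seen URLs. Illustrative only.
import multiprocessing as mp
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect href targets from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def c_proc(proc_id, frontier, seen, max_pages):
    """One crawling process: pull a URL, download it, enqueue new links."""
    pages = 0
    while pages < max_pages:
        try:
            url = frontier.get(timeout=5)  # idle processes exit after 5 s
        except Exception:
            break
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue  # resilience: a failed fetch must not kill the process
        pages += 1
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute not in seen:       # best-effort duplicate check
                seen[absolute] = proc_id
                frontier.put(absolute)
    print(f"c_proc {proc_id} finished after {pages} pages")


if __name__ == "__main__":
    manager = mp.Manager()
    frontier = manager.Queue()
    seen = manager.dict()
    frontier.put("https://example.com/")   # placeholder seed URL
    workers = [mp.Process(target=c_proc, args=(i, frontier, seen, 10))
               for i in range(4)]          # four parallel crawling processes
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```

In a real deployment on a network of workstations, the shared queue and dictionary would be replaced by a distributed frontier with some URL-partitioning scheme so that each process crawls a disjoint portion of the web.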


Related articles

A Novel Architecture for Domain Specific Parallel Crawler

The World Wide Web is an interlinked collection of billions of documents formatted using HTML. Due to the growing and dynamic nature of the web, it has become a challenge to traverse all URLs in web documents and handle them, so it has become imperative to parallelize the crawling process. The crawling process is further parallelized in the form of an ecology of crawler workers that para...


A Novel Architecture of Mercator: A Scalable, Extensible Web Crawler with Focused Web Crawler

This paper describes a novel architecture of Mercator: a scalable, extensible web crawler combined with a focused web crawler. We enumerate the major components of any scalable and focused web crawler and describe the particular components used in this novel architecture. We also describe this architecture's support for extensibility and for downloaded user-support information. We also describe how the ...


A Clickstream-based Focused Trend Parallel Web Crawler

The immense and growing dimension of the World Wide Web poses many obstacles for all-purpose single-process crawlers, including the presence of incorrect answers among search results and scaling drawbacks. As a result, more enhanced heuristics are needed to provide more accurate search outcomes in a timely manner. Given that employing link-dependent Web page impo...


Search optimization technique for Domain Specific Parallel Crawler

The architectural framework of the World Wide Web is used for accessing linked documents spread over millions of machines across the Internet. The web is a system that makes exchanging data on the Internet easy and efficient. Due to the exponential growth of the web, it has become a challenge to traverse all URLs in web documents and handle these documents, so it is necessary to optimize the paralle...


Analyzing new features of infected web content in detection of malicious web pages

Recent improvements in web standards and technologies enable attackers to hide and obfuscate infectious code with new methods, thus escaping security filters. In this paper, we study the application of machine learning techniques to detecting malicious web pages. To detect malicious web pages, we propose and analyze a novel set of features including HTML, JavaScript (jQuery...



Journal:

Volume   Issue

Pages  -

Publication date: 2011