Design and Implementation of an Efficient Distributed Web Crawler with Scalable Architecture
Abstract
Distributed Web crawlers have recently received increasing attention from researchers. Centralized solutions are known to suffer from problems such as link congestion and being a single point of failure, while fully distributed crawlers have become an attractive architectural paradigm because of their scalability and the increased autonomy of their nodes. This paper presents a distributed crawler system that consists of multiple controllers and takes advantage of both architectures. The design comprises a fully distributed architecture, a strategy for assigning tasks, and a method to ensure system scalability. Finally, an experimental study is used to verify the advantages of our crawler, and the results are comparatively satisfying.
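The abstract mentions a strategy for assigning crawl tasks across multiple controllers but does not specify it. A common baseline, shown here purely as an illustrative sketch (the function and node names are assumptions, not the paper's actual method), is to partition the URL space by hashing each URL's host name to a controller node:

```python
import hashlib

def assign_controller(url, controllers):
    """Illustrative sketch: map a URL to a controller by hashing its host.

    Hashing the host rather than the full URL keeps all pages of a site
    on the same node, which helps enforce per-site politeness limits.
    """
    host = url.split("//", 1)[-1].split("/", 1)[0]
    digest = hashlib.sha1(host.encode("utf-8")).hexdigest()
    return controllers[int(digest, 16) % len(controllers)]

nodes = ["ctrl-0", "ctrl-1", "ctrl-2"]
# All URLs from the same host land on the same controller.
a = assign_controller("http://example.org/page1", nodes)
b = assign_controller("http://example.org/page2", nodes)
assert a == b
```

A simple modulo scheme like this assigns tasks evenly, but note that changing the number of controllers remaps most hosts; the consistent-hashing variant sketched further below avoids that.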
Similar Papers
A Scalable, Distributed Web-Crawler*
In this paper we present the design and implementation of a scalable, distributed web crawler. The motivation for designing such a system is to effectively distribute crawling tasks to different machines in a peer-to-peer distributed network. Such an architecture leads to scalability and helps tame the exponential growth of the crawl space in the World Wide Web. With experiments on the implementation of th...
Trovatore: Towards a Highly Scalable Distributed Web Crawler
Trovatore is an ongoing project aimed at realizing an efficient distributed and highly scalable web crawler. This poster illustrates the main ideas behind its design.
UbiCrawler: a scalable fully distributed Web crawler
We present the design and implementation of UbiCrawler, a scalable distributed web crawler, and we analyze its performance. The main features of UbiCrawler are platform independence, fault tolerance, a very effective assignment function for partitioning the domain to crawl, and more in general the complete decentralization of every task.
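The "very effective assignment function" this abstract refers to partitions hosts among crawler agents so that agents can fail or join without reshuffling the whole domain. Consistent hashing is the standard way to get that property; the sketch below is a minimal illustration of the idea (class name, replica count, and agent names are assumptions for the example, not UbiCrawler's actual code):

```python
import hashlib
from bisect import bisect

class ConsistentHashRing:
    """Minimal consistent-hashing sketch: each host maps to the first
    node clockwise on a hash ring, so removing a crawler agent remaps
    only the hosts that agent owned. Replica count is an arbitrary
    illustrative choice to smooth the load distribution."""

    def __init__(self, nodes, replicas=64):
        # Each node gets `replicas` virtual points on the ring.
        self.ring = sorted(
            (self._h(f"{n}#{i}"), n) for n in nodes for i in range(replicas)
        )
        self.keys = [k for k, _ in self.ring]

    @staticmethod
    def _h(s):
        return int(hashlib.sha1(s.encode("utf-8")).hexdigest(), 16)

    def node_for(self, host):
        # First virtual point at or after the host's hash, wrapping around.
        i = bisect(self.keys, self._h(host)) % len(self.ring)
        return self.ring[i][1]

ring = ConsistentHashRing(["agent-a", "agent-b", "agent-c"])
owner = ring.node_for("example.org")  # stable while the agent set is fixed
```

The key property for fault tolerance: if one agent disappears, hosts assigned to the surviving agents keep their assignment, so only the failed agent's share of the crawl must be redistributed.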
World Wide Web Crawler
We describe our ongoing work on World Wide Web crawling: a scalable web crawler architecture that can use resources distributed world-wide. The architecture allows us to use loosely managed compute nodes (PCs connected to the Internet) and may save network bandwidth significantly. In this poster, we discuss why such an architecture is necessary, point out difficulties in designing such architectu...
Design, implementation and experiment of a YeSQL Web Crawler
We describe a novel, "focusable", scalable, distributed web crawler based on GNU/Linux and PostgreSQL that we designed to be easily extendible and which we have released under a GNU public licence. We also report a first use case related to an analysis of Twitter's streams about the French 2012 presidential elections and the URLs they contain.