Building a Peer-to-Peer, domain specific web crawler

Authors

Tushar Bansal

Ling Liu

Abstract

The introduction of the web crawler in the mid-1990s opened the floodgates for research in various application domains. Many attempts to create an ideal crawler failed due to the explosive growth of the web. In this paper, we describe the building blocks of PeerCrawl, a Peer-to-Peer web crawler. This crawler can be used for generic crawling, scales easily, and can be implemented on a grid of everyday computers. We also implement and demonstrate a novel scheme for coordinating peers to perform focused crawling. We cover the issues faced while building this crawler and the decisions taken to overcome them.


Similar resources

Apoidea: A Decentralized Peer-to-Peer Architecture for Crawling the World Wide Web

This paper describes a decentralized peer-to-peer model for building a Web crawler. Most of the current systems use a centralized client-server model, in which the crawl is done by one or more tightly coupled machines, but the distribution of the crawling jobs and the collection of crawled results are managed in a centralized system using a centralized URL repository. Centralized solutions ar...


Distributed High-Performance Web Crawler Based on Peer-to-Peer Network

Distributing the crawling activity among multiple machines spreads the processing load and reduces the web page analysis each machine must perform. This paper presents the design of a distributed web crawler based on a Peer-to-Peer network. The distributed crawler harnesses the excess bandwidth and computing resources of nodes in the system to crawl the web. Each crawler is deployed in a computing node of P2P to analyze web pa...


Prioritize the ordering of URL queue in Focused crawler

The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler, downloading domain-specific web pages is not a simple task. An unfocused approach often yields undesired results. Therefore, several new ideas have been proposed; a key technique among them is focused crawling, which is able to crawl particular topical...


Bookmark-driven Query Routing in Peer-to-Peer Web Search

We consider the problem of collaborative Web search and query routing strategies in a peer-to-peer (P2P) environment. In our architecture every peer has a full-fledged search engine with a (thematically focused) crawler and a local index whose contents may be tailored to the user’s specific interest profile. Peers are autonomous and post meta-information about their bookmarks and index lists to...
