Hypergraph-Theoretic Partitioning Models for Parallel Web Crawling
نویسندگان
چکیده
Parallel web crawling is an important technique employed by large-scale search engines for content acquisition. A commonly used inter-processor coordination scheme in parallel crawling systems is the link exchange scheme, where discovered links are communicated between processors. This scheme can attain the coverage and quality level of a serial crawler while avoiding redundant crawling of pages by different processors. The main problem in the exchange scheme is the high inter-processor communication overhead. In this work, we propose a hypergraph model that reduces the communication overhead associated with link exchange operations in parallel web crawling systems by intelligent assignment of sites to processors. Our hypergraph model can correctly capture and minimize the number of network messages exchanged between crawlers. We evaluate the performance of our models on four benchmark datasets. Compared to the traditional hash-based assignment approach, significant performance improvements are observed in reducing the inter-processor communication overhead.
منابع مشابه
Models and Algorithms for Parallel Text Retrieval
MODELS AND ALGORITHMS FOR PARALLEL TEXT RETRIEVAL Berkant Barla Cambazoğlu Ph.D. in Computer Engineering Supervisor: Prof. Dr. Cevdet Aykanat January, 2006 In the last decade, search engines became an integral part of our lives. The current state-of-the-art in search engine technology relies on parallel text retrieval. Basically, a parallel text retrieval system is composed of three components:...
متن کاملHypergraph Partitioning for Faster Parallel PageRank Computation
The PageRank algorithm is used by search engines such as Google to order web pages. It uses an iterative numerical method to compute the maximal eigenvector of a transition matrix derived from the web’s hyperlink structure and a user-centred model of web-surfing behaviour. As the web has expanded and as demand for user-tailored web page ordering metrics has grown, scalable parallel computation ...
متن کاملData-Parallel Web Crawling Models
The need to quickly locate, gather, and store the vast amount of material in the Web necessitates parallel computing. In this paper, we propose two models, based on multi-constraint graph-partitioning, for efficient data-parallel Web crawling. The models aim to balance the amount of data downloaded and stored by each processor as well as balancing the number of page requests made by the process...
متن کاملHypergraph-Partitioning-Based Decomposition for Parallel Sparse-Matrix Vector Multiplication
ÐIn this work, we show that the standard graph-partitioning-based decomposition of sparse matrices does not reflect the actual communication volume requirement for parallel matrix-vector multiplication. We propose two computational hypergraph models which avoid this crucial deficiency of the graph model. The proposed models reduce the decomposition problem to the well-known hypergraph partition...
متن کاملRevisiting Hypergraph Models for Sparse Matrix Partitioning
We provide an exposition of hypergraph models for parallelizing sparse matrix-vector multiplies. Our aim is to emphasize the expressive power of hypergraph models. First, we set forth an elementary hypergraph model for parallel matrix-vector multiply based on one-dimensional (1D) matrix partitioning. In the elementary model, the vertices represent the data of a matrix-vector multiply, and the n...
متن کامل