Distributed Web Crawling Using Network Coordinates
Authors
Abstract
In this report we will outline the relevant background research, the design, the implementation and the evaluation of a distributed web crawler. Our system is innovative in that it assigns Euclidean coordinates to crawlers and web servers such that the distances in the space give an accurate prediction of download times. We will demonstrate that our method gives the crawler the ability to adapt and compensate for changes in the underlying network topology, and in doing so can achieve significant decreases in download times when compared with other approaches.
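The abstract does not include code, but the core mechanism it describes, embedding crawlers and web servers in a Euclidean space whose distances predict download times, can be sketched roughly as follows. This is a minimal illustration assuming a Vivaldi-style spring-relaxation update; the names NetworkCoordinate and assign_url are hypothetical, and the paper's actual algorithm may differ.

```python
import math
import random

class NetworkCoordinate:
    """Euclidean network coordinate updated with a Vivaldi-style
    spring-relaxation rule (a common technique; the paper's exact
    algorithm may differ)."""

    def __init__(self, dims=3, delta=0.25):
        self.pos = [random.uniform(-1.0, 1.0) for _ in range(dims)]
        self.delta = delta  # step size (kept constant here for brevity)

    def distance(self, other):
        """Predicted download time = Euclidean distance in the space."""
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(self.pos, other.pos)))

    def update(self, other, measured_time):
        """Nudge this coordinate so its distance to `other` moves toward
        the measured download time."""
        predicted = self.distance(other)
        error = measured_time - predicted
        norm = max(predicted, 1e-9)  # avoid division by zero for coincident points
        # Move along the unit vector from `other` toward `self`, scaled by the error.
        self.pos = [
            p + self.delta * error * (p - q) / norm
            for p, q in zip(self.pos, other.pos)
        ]

def assign_url(server_coord, crawler_coords):
    """Pick the crawler whose coordinate is closest to the target server,
    i.e. the one with the lowest predicted download time."""
    return min(crawler_coords, key=lambda cid: crawler_coords[cid].distance(server_coord))
```

In a scheme like this, each crawler would periodically measure download times to servers it has fetched from, feed those measurements into update(), and route new URLs with assign_url() so that pages are fetched by the crawler predicted to be nearest. The spring relaxation lets the coordinates drift as the underlying network changes, which is the adaptivity the abstract refers to.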
Similar resources
A Dynamically Reconfigurable Model for a Distributed Web Crawling System
A web crawling system using a distributed architecture needs to coordinate the whole system when the nodes in the system change. This paper presents an efficient, dynamically reconfigurable model that can be used in such a system. By analyzing the model, we obtain methods for achieving optimal performance in the distributed web crawling system, i.e., retaining load balance and producing low net...
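The paper summarized above focuses on keeping a distributed crawler balanced as nodes join or leave. One common way to express that kind of reconfiguration, though not necessarily the model used in that paper, is consistent hashing of URLs onto crawler nodes; the sketch below uses hypothetical names and is purely illustrative.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Map URLs to crawler nodes so that adding or removing a node only
    remaps a small fraction of URLs (illustrative only; not the
    reconfiguration model from the cited paper)."""

    def __init__(self, nodes=(), replicas=100):
        self.replicas = replicas
        self._ring = []  # sorted list of (hash, node)
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node):
        # Each node gets several virtual points on the ring for smoother balance.
        for i in range(self.replicas):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def node_for(self, url):
        """Return the crawler responsible for this URL."""
        if not self._ring:
            raise ValueError("no crawler nodes registered")
        h = self._hash(url)
        idx = bisect.bisect(self._ring, (h, ""))
        return self._ring[idx % len(self._ring)][1]
```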
Georeferencing Semi-Structured Place-Based Web Resources Using Machine Learning
In recent years, the content shared on the web has grown significantly. A great part of this information is publicly available in the form of semi-structured data. Moreover, a significant amount of it is related to place. Such information refers to a location on the earth, yet it does not contain any explicit coordinates. In this research, we tried to georefer...
On the feasibility of geographically distributed web crawling
We identify the issues that are important in the design of a geographically distributed Web crawler. The identified issues are discussed from a “benefit” and “challenge” point of view. More specifically, we focus on the effect of the geographical locality of Web sites on crawling performance and, as a practical study, investigate the feasibility of a distributed crawler in terms of network costs. For ...
Minimizing the Network Distance in Distributed Web Crawling
Distributed crawling has shown that it can overcome important limitations of the centralized crawling paradigm. However, the distributed nature of current crawlers is not yet fully utilized, and the benefits of this approach are usually limited to the sites hosting the crawler. In this work we describe IPMicra, a distributed location-aware web crawler that utilizes an IP a...
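The IPMicra description is cut off here, but the general idea of location-aware delegation, routing each target site to a crawler that is close to it in network terms, can be illustrated with a simple longest-prefix match of the site's IP address against per-crawler subnets. The subnet table and names below are invented for illustration and are not taken from IPMicra.

```python
import ipaddress

# Hypothetical mapping of IP subnets to crawler locations; a real
# location-aware crawler would build such a table from registry data
# or from its own latency measurements.
SUBNET_TO_CRAWLER = {
    ipaddress.ip_network("203.0.113.0/24"): "crawler-eu",
    ipaddress.ip_network("198.51.100.0/24"): "crawler-us",
    ipaddress.ip_network("198.51.0.0/16"): "crawler-us-backup",
}

def delegate(site_ip, default="crawler-any"):
    """Delegate a site to the crawler whose subnet matches its IP most
    specifically (longest prefix wins); fall back to a default crawler."""
    addr = ipaddress.ip_address(site_ip)
    matches = [net for net in SUBNET_TO_CRAWLER if addr in net]
    if not matches:
        return default
    best = max(matches, key=lambda net: net.prefixlen)
    return SUBNET_TO_CRAWLER[best]

print(delegate("198.51.100.7"))  # -> crawler-us
print(delegate("192.0.2.1"))     # -> crawler-any
```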
Distributed Indexing of the Web Using Migrating Crawlers
Due to the tremendous growth rate and the high change frequency of Web documents, maintaining an up-to-date index for searching purposes (search engines) is becoming a challenge. Traditional crawling methods are no longer able to keep up with the constantly updating and growing Web. Recognizing this problem, in this paper we suggest an alternative distributed crawling method with the use of...