An investigation of web crawler behavior: characterization and metrics
Authors
Abstract
In this paper, we present a characterization study of search-engine crawlers. For the purposes of our work, we use Web-server access logs from five academic sites in three different countries. Based on these logs, we analyze the activity of different crawlers that belong to five search engines: Google, AltaVista, Inktomi, FastSearch and CiteSeer. We compare crawler behavior to the characteristics of general World-Wide Web traffic and to general characterization studies. We analyze crawler requests to derive insights into the behavior and strategy of crawlers. We propose a set of simple metrics that describe qualitative characteristics of crawler behavior, namely a crawler's preference for resources of a particular format, its frequency of visits to a Web site, and the pervasiveness of its visits to a particular site. To the best of our knowledge, this is the first extensive and in-depth characterization of search-engine crawlers. Our results and observations provide useful insights into crawler behavior and serve as the basis of our ongoing work on the automatic detection of Web crawlers. © 2005 Elsevier B.V. All rights reserved.
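The three proposed metrics (format preference, visit frequency, pervasiveness) can all be computed from ordinary access-log fields. The sketch below is purely illustrative and is not the paper's actual methodology: the metric definitions, the regular expression for the combined log format, the crawler user-agent substrings, and the use of active days as a frequency proxy are all assumptions made for the example.

```python
import re
from collections import Counter, defaultdict

# Apache "combined" log format; a simplified pattern assumed for illustration.
LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# Hypothetical user-agent substrings for the five crawlers studied in the paper.
CRAWLER_AGENTS = ("Googlebot", "Scooter", "Slurp", "FAST-WebCrawler", "CiteSeer")

def analyze(log_lines, total_site_resources):
    """Per-crawler format preference, active days, and pervasiveness (all illustrative)."""
    stats = defaultdict(lambda: {"formats": Counter(), "paths": set(), "days": set()})
    for line in log_lines:
        m = LOG_RE.match(line)
        if not m:
            continue  # skip lines that do not match the assumed log format
        crawler = next((c for c in CRAWLER_AGENTS if c in m.group("agent")), None)
        if crawler is None:
            continue  # not one of the crawlers we track
        path = m.group("path")
        name = path.rsplit("/", 1)[-1]
        ext = name.rsplit(".", 1)[-1].lower() if "." in name else "html"
        s = stats[crawler]
        s["formats"][ext] += 1               # resource-format counts
        s["paths"].add(path)                 # distinct resources touched
        s["days"].add(m.group("ts").split(":", 1)[0])  # date part of the timestamp
    report = {}
    for crawler, s in stats.items():
        total = sum(s["formats"].values())
        report[crawler] = {
            # share of requests per resource format
            "format_preference": {e: n / total for e, n in s["formats"].items()},
            # number of distinct days with activity, a crude proxy for visit frequency
            "active_days": len(s["days"]),
            # fraction of the site's resources the crawler has requested
            "pervasiveness": len(s["paths"]) / total_site_resources,
        }
    return report
```

In practice a characterization study would also segment requests into sessions and examine inter-arrival times, but the aggregate view above is enough to show how the three metrics differ in kind: one is a distribution, one a rate, one a coverage ratio.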
Similar resources
Prioritize the ordering of URL queue in Focused crawler
The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler, it is not a simple task to download only domain-specific web pages, and an unfocused approach often yields undesired results. Therefore, several new ideas have been proposed; among them, a key technique is focused crawling, which is able to crawl particular topical...
A Clickstream-based Focused Trend Parallel Web Crawler
The immensely growing dimension of the World Wide Web creates many obstacles for all-purpose single-process crawlers, including the presence of incorrect answers among search results and scaling drawbacks. As a result, more refined heuristics are needed to provide more accurate search outcomes in a timely manner. Regarding the fact that employing link-dependent Web page impo...
Classification of web robots: An empirical study based on over one billion requests
Many studies on the detection and classification of web robots have focused their attention mostly on text crawlers, and empirical experiments have used relatively small data sets collected at universities. In this paper, we analyzed more than one billion requests to www.microsoft.com over 24 h. Web logs were made anonymous to eliminate potential privacy concerns while preserving essential characteristics (e...
Review of ranked-based and unranked-based metrics for determining the effectiveness of search engines
Purpose: Traditionally, there have been many metrics for evaluating search engines; nevertheless, various researchers have proposed new metrics in recent years. Awareness of these new metrics is essential for conducting research on search-engine evaluation. The purpose of this study, therefore, was to provide an analysis of important and new metrics for evaluating search engines. Methodology: This is ...
An Integrated Approach to Defence Against Degrading Application-Layer DDoS Attacks
Application-layer Distributed Denial of Service (DDoS) attacks are recognized as some of the most damaging attacks on Internet security today. In our recent work [1], we have shown that unsupervised machine learning can be effectively utilized in the process of distinguishing between regular (human) and automated (web/botnet crawler) visitors to a web site. We have also shown that with a sli...
Journal: Computer Communications
Volume 28, Issue -
Pages -
Publication date: 2005