The Viúva Negra crawler: an experience report
Authors
Abstract
This paper documents hazardous situations on the Web that crawlers must address. This knowledge was accumulated over four years of developing and operating the Viúva Negra (VN) crawler to feed a search engine and an archive of the Portuguese Web. The design, implementation and evaluation of the VN crawler are also presented as a case study of Web crawler design. The case study provides tested crawling techniques that may be useful for the further development of crawlers. Copyright © 2007 John Wiley & Sons, Ltd.
Similar resources
The Viuva Negra crawler
This report discusses architectural aspects of web crawlers and details the design, implementation and evaluation of the Viuva Negra (VN) crawler. VN has been used for 4 years, feeding a search engine and an archive of the Portuguese web. In our experiments it crawled over 2 million documents per day, corresponding to 63 GB of data. We describe hazardous situations to crawling found on the web ...
Prioritize the ordering of URL queue in Focused crawler
The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler it is not a simple task to download domain-specific web pages. An unfocused approach often produces undesired results. Therefore, several new ideas have been proposed; a key technique among them is focused crawling, which is able to crawl particular topical...
Learnable Topic-specific Web Crawler
A topic-specific web crawler collects web pages relevant to topics of interest from the Internet. Much previous research has focused on algorithms for web page crawling. The main purpose of those algorithms is to gather as many relevant web pages as possible, and most of them only detail the approach for the first crawl. However, some important questions have not been addressed, such...
Web Crawling as an AI Project
This paper argues for the introduction of real-world programming projects into AI curricula, specifically using Python as an implementation language. We describe a modular set of projects centered around a focused web crawler, along with potential extensions. The author’s experiences using this project in a class of undergraduates and Master’s students are also discussed.
Caption Crawler: Enabling Reusable Alternative Text Descriptions using Reverse Image Search
Accessing images online is often difficult for users with vision impairments. This population relies on text descriptions of images that vary based on website authors’ accessibility practices. Where one author might provide a descriptive caption for an image, another might provide no caption for the same image, leading to inconsistent experiences. In this work, we present the Caption Crawler sy...
Journal title:
- Softw., Pract. Exper.
Volume 38, Issue
Pages -
Publication date 2008