Effects of Start URLs in Focused Web Crawling
Abstract
Web crawling refers to the process of gathering data from the Web. Focused crawlers are programs that selectively download Web documents (pages), restricting the scope of crawling to a pre-defined domain or topic. The downloaded documents can be indexed for a domain-specific search engine or a digital library. In this paper, we describe the focused crawling technique, review relevant literature, and report novel experimental results. Crawling is often started with URLs pointing to the pages of central North American and European universities, research institutions, and other organizations. In our experiments we first investigated how strongly this central region of the Web is connected to three other large geographical regions of the Web: Australia (top-level domain .au), China (.cn), and five South American countries (.ar, .br, .cl, .mx, and .uy). Test topics were selected from the domains of genomics and genetics, which are typical scientific fields. We found that two focused crawling processes, one started from the central region and the other from the Australia/China/South America region, overlap only to a small extent, identifying mainly different relevant documents. Document relevance was assessed (1) by a human judge and (2) by assigning probability scores to documents with a search engine. Second, we investigated the coverage (number) of relevant documents obtained by focused crawling processes started with URLs from the four geographical regions. The results showed that all regions considered in this study are good starting points for focused crawling in the domains of genetics and genomics, since each of them yielded high coverage. As genomics and genetics are typical scientific domains, we expect the results to generalize to other scientific domains. We discuss the implications of these results for selecting a crawling approach in scientific focused crawling tasks.
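To make the crawling process concrete, the following is a minimal best-first focused-crawler sketch in Python. The topic vocabulary, keyword-based relevance function, and threshold are illustrative assumptions; the experiments described above instead assess relevance with a human judge and with search-engine probability scores.

import re
import heapq
import urllib.request
from urllib.parse import urljoin

# Assumed topic vocabulary for the genomics/genetics domain (illustrative only).
TOPIC_TERMS = {"genomics", "genetics", "genome", "gene"}

def relevance(text):
    """Toy relevance score: fraction of topic terms occurring in the text."""
    text = text.lower()
    return sum(term in text for term in TOPIC_TERMS) / len(TOPIC_TERMS)

def focused_crawl(seed_urls, max_pages=100, threshold=0.25):
    """Best-first focused crawl: always expand the most promising URL next."""
    frontier = [(-1.0, url) for url in seed_urls]  # negated: heapq is a min-heap
    heapq.heapify(frontier)
    seen = set(seed_urls)
    relevant = []
    while frontier and len(relevant) < max_pages:
        _, url = heapq.heappop(frontier)
        try:
            page = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue  # unreachable or non-text page: skip it
        score = relevance(page)
        if score >= threshold:
            relevant.append(url)
            # Hard-focus rule: only follow links found on relevant pages.
            for href in re.findall(r'href="([^"#]+)"', page):
                link = urljoin(url, href)
                if link.startswith("http") and link not in seen:
                    seen.add(link)
                    heapq.heappush(frontier, (-score, link))
    return relevant

Starting focused_crawl with seed URLs from different top-level domains (e.g. .au or .cn pages versus North American university pages) is the kind of comparison the experiments above perform.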
Similar Papers
Ranking Hyperlinks Approach for Focused Web Crawler
The World Wide Web is growing rapidly, and many search engines do not cover all of the visible pages. Therefore, a more effective crawling method is required to collect more accurate data. In this paper, we introduce an effective focused web crawler containing smart methods. In text analysis, similarity measures are applied to different parts of the Web pages, including the title, body, anchor text, and U...
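As a rough illustration of such part-wise similarity measurement, here is a sketch under assumed weights and an assumed cosine-over-term-counts measure, not the measures of the cited paper itself:

from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two texts over raw term counts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = sqrt(sum(v * v for v in va.values())) * sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

def page_score(topic, title, body, anchor_text, url):
    """Weighted combination of per-part similarities (weights are assumed)."""
    weighted_parts = [(title, 0.3), (body, 0.4),
                      (anchor_text, 0.2), (url.replace("/", " "), 0.1)]
    return sum(w * cosine(topic, part) for part, w in weighted_parts)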
Intelligent Event Focused Crawling
There is a need for an integrated event-focused crawling system to collect Web data about key events. When an event occurs, many users try to locate the most up-to-date information about it. Yet little of the information about events is systematically collected and archived anywhere. We propose intelligent event-focused crawling for automatic event tracking and archiving, as well as effec...
Prioritize the ordering of URL queue in Focused crawler
The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler, it is not a simple task to download domain-specific web pages, and an unfocused approach often yields undesired results. Therefore, several new ideas have been proposed; among them, a key technique is focused crawling, which is able to crawl particular topical...
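A minimal sketch of such URL-queue ordering follows, assuming a priority that mixes the parent page's relevance with anchor-text evidence; the actual priority function proposed in the cited paper may differ.

import heapq

class Frontier:
    """URL queue ordered by a per-link priority score."""

    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker: equal priorities dequeue in FIFO order

    def push(self, url, parent_score, anchor_score):
        # A link inherits part of its parent page's relevance and adds
        # evidence from its own anchor text (the 0.5/0.5 mix is assumed).
        priority = 0.5 * parent_score + 0.5 * anchor_score
        heapq.heappush(self._heap, (-priority, self._counter, url))
        self._counter += 1

    def pop(self):
        """Return the URL with the highest priority."""
        return heapq.heappop(self._heap)[2]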
An Effective Focused Web Crawler for Web Resource Discovery
Given the volume of the Web and its speed of change, the coverage of modern search engines is relatively small. Web crawling is the process used by search engines to collect pages from the Web. Therefore, collecting domain-specific information from the Web is a special research theme in many papers. In this paper, we introduce a new effective focused web crawler. It uses smart methods to ...
Language Specific and Topic Focused Web Crawling
We describe an experiment on automatically collecting large language- and topic-specific corpora using a focused Web crawler. Our crawler combines efficient crawling techniques with a common text classification tool. Given a sample corpus of medical documents, we automatically extract query phrases and then acquire seed URLs with a standard search engine. Starting from these seed URLs, the cr...
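A sketch of this seed-acquisition step is shown below; `search_engine_query` is a hypothetical placeholder for whatever search API is used, and the bigram-frequency phrase extraction is an assumption, not the paper's method.

from collections import Counter

def extract_query_phrases(sample_corpus, n=10):
    """Most frequent word bigrams in the sample corpus (toy phrase extraction)."""
    bigrams = Counter()
    for doc in sample_corpus:
        words = doc.lower().split()
        bigrams.update(zip(words, words[1:]))
    return [" ".join(bigram) for bigram, _ in bigrams.most_common(n)]

def acquire_seed_urls(sample_corpus, search_engine_query):
    """Query a search engine with extracted phrases and pool the result URLs.

    `search_engine_query` is a hypothetical callable mapping a phrase to a
    list of URLs; plug in whichever search API is available.
    """
    seeds = []
    for phrase in extract_query_phrases(sample_corpus):
        seeds.extend(search_engine_query(phrase))
    return list(dict.fromkeys(seeds))  # de-duplicate while preserving order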