A method for language-specific Web crawling and its evaluation

Authors

  • Takayuki Tamura
  • Kulwadee Somboonviwat
  • Masaru Kitsuregawa
Abstract

Many countries have created Web archiving projects aimed at the long-term preservation of Web information, which is now considered precious in cultural and social terms. However, because the Web is borderless, it is difficult to gather comprehensively the information that originates in a specific nation or culture. This paper proposes an efficient method for selectively collecting Web pages written in a specific language. First, a linguistic graph analysis of real Web data obtained from a large crawl is conducted in order to derive a crawling guideline, which makes use of language attributes per Web server. The guideline is then turned into several variants of link selection strategies. Simulation-based evaluation reveals that one of the strategies, which carefully accepts newly discovered Web servers, shows superior results in terms of harvest rate/coverage and runtime efficiency. © 2007 Wiley Periodicals, Inc. Syst Comp Jpn, 38(2): 10–20, 2007; Published online in Wiley InterScience (www.interscience.wiley.com). DOI 10.1002/scj.20693
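The abstract does not spell out the link selection strategies, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a toy in-memory Web graph, a hypothetical acceptance rule (follow a link to a known server only if enough of its crawled pages are in the target language, and follow a link to a new server only from a target-language page), and the usual focused-crawling definitions of harvest rate (fraction of crawled pages that are relevant) and coverage (fraction of relevant pages that were crawled). All names, URLs, and the `min_ratio` threshold are illustrative assumptions.

```python
from collections import deque
from urllib.parse import urlsplit


def host(url):
    """Return the server (host) part of a URL."""
    return urlsplit(url).netloc


def crawl(graph, lang, seeds, target, min_ratio=0.5):
    """Simulate a breadth-first crawl with a per-server language filter.

    graph: url -> list of outgoing links; lang: url -> language code.
    The acceptance rule is a hypothetical stand-in for the paper's strategies.
    """
    stats = {}  # host -> [target-language pages crawled, pages crawled]
    seen = set(seeds)
    queue = deque(seeds)
    crawled = []
    while queue:
        url = queue.popleft()
        crawled.append(url)
        counts = stats.setdefault(host(url), [0, 0])
        counts[0] += int(lang[url] == target)
        counts[1] += 1
        for link in graph.get(url, []):
            if link in seen or link not in lang:
                continue
            h = host(link)
            if h in stats:
                # Known server: require enough target-language pages so far.
                accept = stats[h][0] / stats[h][1] >= min_ratio
            else:
                # New server: accept only when the linking page is relevant.
                accept = lang[url] == target
            if accept:
                seen.add(link)
                queue.append(link)
    return crawled


def harvest_rate(crawled, lang, target):
    """Fraction of crawled pages that are in the target language."""
    return sum(lang[u] == target for u in crawled) / len(crawled)


def coverage(crawled, lang, target):
    """Fraction of all target-language pages that were crawled."""
    total = sum(code == target for code in lang.values())
    return sum(lang[u] == target for u in crawled) / total


# Toy data: four Thai ("th") pages and two English ("en") pages on four servers.
graph = {
    "http://a.example/1": ["http://a.example/2", "http://b.example/1"],
    "http://a.example/2": ["http://c.example/1"],
    "http://b.example/1": ["http://b.example/2", "http://d.example/1"],
}
lang = {
    "http://a.example/1": "th",
    "http://a.example/2": "th",
    "http://b.example/1": "en",
    "http://b.example/2": "en",
    "http://c.example/1": "th",
    "http://d.example/1": "th",
}
crawled = crawl(graph, lang, ["http://a.example/1"], "th")
print(crawled, harvest_rate(crawled, lang, "th"), coverage(crawled, lang, "th"))
```

In this toy run the filter skips the second English page but also misses the Thai page that is reachable only through an English page, which illustrates the trade-off between harvest rate and coverage that the evaluation measures.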


Similar resources

Prioritize the ordering of URL queue in Focused crawler

The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler, it is not a simple task to download domain-specific web pages. This unfocused approach often shows undesired results. Therefore, several new ideas have been proposed; among them, a key technique is focused crawling, which is able to crawl particular topical...


DEWS 2006 3A-i6 Finding Thai Web

As the Web has been increasingly recognized as a culturally valuable social artifact, many nations endeavor to create national Web archives for long-term preservation. However, due to its borderlessness, gathering information for a specific nation from the Web is challenging. This paper proposes language-specific web crawling (LSWC) as a method of creating Web archives for countries with li...


The Method of Improving the Specific Language Focused Crawler

In recent years, more and more CJK (Chinese, Japanese, and Korean) web pages have appeared on the Internet, and the information in CJK web pages has become increasingly important. A web crawler is a tool for retrieving web pages. Previous research focused on English web crawlers, and web crawlers are usually optimized for English web pages. We found that the performance of the web crawler is wo...


An Effective Focused Web Crawler for Web Resource Discovery

Given the volume of the Web and its speed of change, the coverage of modern search engines is relatively small. Web crawling is the process used by search engines to collect pages from the Web. Collecting domain-specific information from the Web is therefore a research theme in many papers. In this paper, we introduce a new effective focused web crawler. It uses smart methods to ...


Evaluation Methods for Focused Crawling

The exponential growth of documents available in the World Wide Web makes it increasingly difficult to discover relevant information on a specific topic. In this context, growing interest is emerging in focused crawling, a technique that dynamically browses the Internet by choosing directions that maximize the probability of discovering relevant pages, given a specific topic. Predicting the rel...



Journal title:
  • Systems and Computers in Japan

Volume 38, Issue 2

Pages 10–20

Publication date: 2007