The Method of Improving the Specific Language Focused Crawler
Authors
Abstract
In recent years, more and more CJK (Chinese, Japanese, and Korean) web pages have appeared on the Internet, and the information in these pages has become increasingly important. A web crawler is a tool for retrieving web pages. Previous research has focused on English web crawlers, which are typically optimized for English web pages, and we found that their performance is worse when retrieving CJK pages. We tried to enhance the performance of a CJK crawler by analyzing the link structure, the anchor text, and the host name of each hyperlink, and by changing the crawling algorithm accordingly. Specifically, we distinguish the top-level domain name and the language of the anchor text of each hyperlink; to our knowledge, distinguishing the anchor-text language has not been used in other research on CJK language-specific crawlers. A controlled experiment is used in this research. According to the experimental results, when the target crawling language is Japanese, 87% of the crawled pages are Japanese, an efficiency improvement of about 0.24% over the baseline. When the target language is Chinese, 88% of the crawled pages are Chinese, an improvement of about 0.07% over the baseline. When the target language is Korean, 71% of the crawled pages are Korean, an improvement of about 10% over the baseline.
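The abstract does not include an implementation, but the two link signals it names (top-level domain and anchor-text language) can be sketched as a simple link-scoring function. This is a minimal sketch only: the weights, the Unicode-range language test, and the name link_priority are illustrative assumptions, not the authors' code.

import re
from urllib.parse import urlparse

# Rough script-based language test for anchor text (an assumption; the
# paper does not specify its language-identification method). Note that
# Han characters also occur in Japanese text, so Hiragana/Katakana
# presence is the distinguishing signal for "ja".
SCRIPTS = {
    "ja": re.compile(r"[\u3040-\u30ff]"),  # Hiragana / Katakana
    "ko": re.compile(r"[\uac00-\ud7af]"),  # Hangul syllables
    "zh": re.compile(r"[\u4e00-\u9fff]"),  # CJK unified ideographs
}

# Country-code top-level domains associated with each target language.
CC_TLDS = {"ja": {"jp"}, "ko": {"kr"}, "zh": {"cn", "tw", "hk"}}

def link_priority(url: str, anchor_text: str, target: str) -> int:
    """Score a hyperlink for the crawl frontier; higher = fetch sooner."""
    score = 0
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    if tld in CC_TLDS[target]:
        score += 1  # host's ccTLD matches the target language
    if SCRIPTS[target].search(anchor_text):
        score += 2  # anchor text is written in the target script
    return score

# Example: a .jp link with Japanese anchor text outranks a .com link.
assert link_priority("http://example.jp/news", "ニュース", "ja") == 3
assert link_priority("http://example.com/news", "news", "ja") == 0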
Similar References
Prioritize the ordering of URL queue in Focused crawler
The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler, it is not a simple task to download domain-specific web pages, and an unfocused approach often yields undesired results. Therefore, several new ideas have been proposed; among them, a key technique is focused crawling, which is able to crawl particular topical...
Full Text
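The abstract above does not give a concrete algorithm, so the following is only an assumed illustration of what URL-queue prioritization can look like: a frontier kept in a max-priority heap keyed by a relevance score.

import heapq
import itertools

class URLFrontier:
    """Priority-ordered URL queue: higher-scored links are fetched first,
    with FIFO order as the tie-breaker."""

    def __init__(self) -> None:
        self._heap: list[tuple[float, int, str]] = []
        self._tie = itertools.count()   # insertion counter for stable ordering
        self._seen: set[str] = set()    # avoid enqueueing a URL twice

    def push(self, url: str, score: float) -> None:
        if url not in self._seen:
            self._seen.add(url)
            # heapq is a min-heap, so store the negated score for max-first.
            heapq.heappush(self._heap, (-score, next(self._tie), url))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

The score pushed here would come from whatever topical or language relevance estimate the focused crawler uses.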
Domain-Specific Corpus Expansion with Focused Webcrawling
This work presents a straightforward method for extending or creating in-domain web corpora by focused webcrawling. The focused webcrawler uses statistical N-gram language models to estimate the relatedness of documents and weblinks, and needs as input only N-grams or plain texts of a predefined domain plus seed URLs as starting points. Two experiments demonstrate that our focused crawler is able...
Full Text
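The crawler above relies on statistical N-gram language models to judge relatedness; the abstract does not specify the model, so the sketch below substitutes a simple character-trigram cosine similarity between a document and an in-domain profile as a stand-in, with illustrative sample texts.

from collections import Counter
import math

def char_ngrams(text: str, n: int = 3) -> Counter:
    """Character n-gram counts of a text (trigrams by default)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) \
         * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# The domain profile is built once from in-domain seed texts; each fetched
# document (or an outgoing link's anchor text) is scored against it.
profile = char_ngrams("insulin glucose dosage diabetes treatment")
print(cosine(profile, char_ngrams("adjusting insulin dosage for glucose control")))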
Language Specific and Topic Focused Web Crawling
We describe an experiment on collecting large language- and topic-specific corpora automatically by using a focused Web crawler. Our crawler combines efficient crawling techniques with a common text classification tool. Given a sample corpus of medical documents, we automatically extract query phrases and then acquire seed URLs with a standard search engine. Starting from these seed URLs, the cr...
Full Text
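The pipeline sketched above (extract query phrases from a sample corpus, then fetch seed URLs from a search engine) might begin with a phrase-extraction step like the one below. The bigram-frequency heuristic and the name extract_query_phrases are assumptions, and the search-engine call is left abstract since the abstract names no API.

import re
from collections import Counter

def extract_query_phrases(corpus: list[str], top_k: int = 20) -> list[str]:
    """Return the most frequent word bigrams in a sample corpus,
    for use as query phrases when acquiring seed URLs."""
    counts: Counter = Counter()
    for doc in corpus:
        words = re.findall(r"[a-z]+", doc.lower())
        counts.update(" ".join(pair) for pair in zip(words, words[1:]))
    return [phrase for phrase, _ in counts.most_common(top_k)]

# Each phrase would then be submitted to a standard search engine and the
# result links collected as seed URLs for the focused crawl.
phrases = extract_query_phrases([
    "chronic insulin therapy and dosage",
    "insulin therapy for type 2 diabetes",
])
print(phrases[:5])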
Domain-Specific Web Site Identification: The CROSSMARC Focused Web Crawler
This paper presents techniques for identifying domain-specific web sites that have been implemented as part of the EC-funded R&D project CROSSMARC. The project aims to develop technology for extracting interesting information from domain-specific web pages. It is therefore important for CROSSMARC to identify web sites in which interesting domain-specific pages reside (focused web crawling). Th...
Full Text
An Effective Focused Web Crawler for Web Resource Discovery
Given the volume of the Web and its speed of change, the coverage of modern search engines is relatively small. Web crawling is the process used by search engines to collect pages from the Web; collecting domain-specific information from the Web is therefore a special research theme of many papers. In this paper, we introduce a new effective focused web crawler. It uses smart methods to ...
Full Text