A Task-specific Approach for Crawling the Deep Web

نویسندگان

  • Manuel Álvarez
  • Juan Raposo
  • Fidel Cacheda
  • Alberto Pan
چکیده

There is a great amount of valuable information on the web that cannot be accessed by conventional crawler engines. This portion of the web is usually known as the Deep Web or the Hidden Web. Most probably, the information of highest value contained in the deep web, is that behind web forms. In this paper, we describe a prototype hidden-web crawler able to access such content. Our approach is based on providing the crawler with a set of domain definitions, each one describing a specific data-collecting task. The crawler uses these descriptions to identify relevant query forms and to learn to execute queries on them. We have tested our techniques for several real world tasks, obtaining a high degree of effectiveness.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Prioritize the ordering of URL queue in Focused crawler

The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler it is not an simple task to download the domain specific web pages. This unfocused approach often shows undesired results. Therefore, several new ideas have been proposed, among them a key technique is focused crawling which is able to crawl particular topical...

متن کامل

From Focused Crawling to Expert Information: an Application Framework for Web Exploration and Portal Generation

Focused crawling is a relatively new, promising approach to improving the recall of expert search on the Web. It typically starts from a useror communityspecific tree of topics along with a few training documents for each tree node, and then crawls the Web with focus on these topics of interest. This process can efficiently build a theme-specific, hierarchical directory whose nodes are populate...

متن کامل

A Novel Approach to Integrated Search Information Retrieval Technique for Hidden Web for Domain Specific Crawling

The traditional web crawlers retrieve contents from only the “Surface web” and are unable to crawl through the hidden portion of the Web containing high quality information which is dynamically generated through querying databases when the queries are submitted through a search interface. For Hidden web, most of the published research has been done to identify/detect such searchable forms and m...

متن کامل

Exploiting Multiple Features with MEMMs for Focused Web Crawling

Focused web crawling traverses theWeb to collect documents on a specific topic. This is not an easy task, since focused crawlers need to identify the next most promising link to follow based on the topic and the content and links of previously crawled pages. In this paper, we present a framework based on Maximum Entropy Markov Models(MEMMs) for an enhanced focused web crawler to take advantage ...

متن کامل

Expert Discovery: A web mining approach

Expert discovery is a quest in search of finding an answer to a question: “Who is the best expert of a specific subject in a particular domain within peculiar array of parameters?” Expert with domain knowledge in any field is crucial for consulting in industry, academia and scientific community. Aim of this study is to address the issues for expert-finding task in real-world community. Collabor...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • Engineering Letters

دوره 13  شماره 

صفحات  -

تاریخ انتشار 2006