Fast and Scalable Pattern Mining for Media-Type Focused Crawling

Authors

  • Jürgen Umbrich
  • Marcel Karnstedt
  • Andreas Harth
Abstract

Search engines targeting content other than hypertext documents require a crawler that discovers resources identifying files of certain media types. Naïve crawling approaches do not guarantee a sufficient supply of new URIs (Uniform Resource Identifiers) to visit; effective and scalable mechanisms for discovering and crawling targeted resources are needed. One promising approach is to use data mining techniques to identify the media type of a resource without downloading its content. The idea is to apply a learning approach to features derived from patterns occurring in the resource identifier. We present a focused crawler as a use case for fast and scalable data mining and discuss classification and pattern mining techniques suited for selecting resources of specified media types. We show that we can process an average of 17,000 URIs/second and still detect the media type of resources with a precision of more than 80% and a recall of over 65% for all media types.
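To make the idea of predicting a media type from the identifier alone more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a simple multinomial naive Bayes classifier over URI tokens as a stand-in for the classification and pattern mining techniques the abstract mentions; the names `uri_tokens` and `UriTypeClassifier`, the training URIs, and the three media-type labels are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def uri_tokens(uri):
    """Split a URI into lowercase alphanumeric tokens (a crude stand-in for mined patterns)."""
    return [t for t in re.split(r"[^a-z0-9]+", uri.lower()) if t]


class UriTypeClassifier:
    """Multinomial naive Bayes over URI tokens with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # media type -> token frequencies
        self.class_counts = Counter()             # media type -> number of training URIs
        self.vocab = set()

    def train(self, labeled_uris):
        for uri, media_type in labeled_uris:
            self.class_counts[media_type] += 1
            for tok in uri_tokens(uri):
                self.token_counts[media_type][tok] += 1
                self.vocab.add(tok)

    def predict(self, uri):
        """Return the most likely media type without fetching the resource."""
        total = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for media_type, n in self.class_counts.items():
            counts = self.token_counts[media_type]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)
            for tok in uri_tokens(uri):
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best, best_score = media_type, score
        return best


# Hypothetical training data: (URI, media-type label) pairs.
TRAINING = [
    ("http://example.org/media/song01.mp3", "audio"),
    ("http://example.org/podcast/episode2.mp3", "audio"),
    ("http://example.org/video/clip.avi", "video"),
    ("http://example.org/video/trailer.mpg", "video"),
    ("http://example.org/people/foaf.rdf", "rdf"),
    ("http://example.org/data/card.rdf", "rdf"),
]

clf = UriTypeClassifier()
clf.train(TRAINING)
print(clf.predict("http://example.net/music/track9.mp3"))
```

In a focused crawler, a prediction like this could be used to decide which queued URIs to fetch first; how the paper's system actually ranks or filters URIs is not specified in the abstract.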

Similar resources

Prioritize the ordering of URL queue in Focused crawler

The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler it is not a simple task to download domain-specific web pages. This unfocused approach often shows undesired results. Therefore, several new ideas have been proposed; among them, a key technique is focused crawling, which is able to crawl particular topical...

Highly Efficient Architecture for Scalable Focused Crawling Using Incremental Parallel Web Crawler

Corresponding Author: P. Jaganathan, Department of Computer Application, PSNA College of Engineering and Technology, Dindigul, India. Abstract: With the growing industrial impact over the recent years in computer science, data mining has established itself as one of the most important disciplines. In the fast growing Web and in an appropriate amount of time, locating th...

Profile-Based Focused Crawling for Social Media-Sharing Websites

We present a novel profile-based focused crawling system for dealing with the increasingly popular social media-sharing websites. In this system, we treat the user profiles as ranking criteria for guiding the crawling process. Furthermore, we divide a user’s profile into two parts, an internal part, which comes from the user’s own contribution, and an external part, which comes from the user’s ...

Journal of International Scientific Publications

In recent years, several approaches have been proposed to extract information from web pages on the internet. In this research, a key technique focused on crawling and ontology is used to discover knowledge from the web. In this paper, we present an intelligent crawling system that uses patterns and ontology to extract particular information from web sites. The system developed as an efficient tool to con...

A new conforming mesh generator for three-dimensional discrete fracture networks

Nowadays, numerical modeling plays a key role in analyzing hydraulic problems in fractured rock media. The discrete fracture network model is one of the most used numerical models to simulate the geometrical structure of a rock-mass. In such media, discontinuities are considered as discrete paths for fluid flow through the rock-mass while its matrix is assumed impermeable. There are two main pa...

Publication date: 2009