Learning to Crawl: Classifier-guided Topical Crawlers

نویسندگان

  • Gautam Pant
  • Padmini Srinivasan
  • Nick Street
  • Shannon Bradshaw
  • Gary J. Russell
چکیده

Topical or focused crawlers follow the hyperlinked structure of the Web guided by the scent of information to identify and harvest topically relevant pages. For sniffing the appropriate scent they mine the content of pages that are already fetched to prioritize the fetching of unvisited pages. Topical crawling is currently a young and creative area of research that holds the promise of benefiting from several sophisticated data mining techniques. Sporadically, the use of classification algorithms to guide topical crawlers has been suggested in the literature. No systematic study, however, has been done on their relative merits. Using the lessons learned from our previous crawler evaluation studies, we experiment with multiple versions of different classification schemes. We also explore the effects of various techniques for deriving contexts of hyperlinks on crawling performance. The crawling process is modeled as a parallel best-first search over a graph defined by the Web. The classifiers provide heuristics to the crawler thus biasing it towards certain portions of the graph (i.e., the Web). We have designed and developed a crawling framework that allows for flexible addition of new classifiers. The crawlers themselves are implemented as multi-threaded objects that run concurrently. Our results show that Naive Bayes is a weak choice for guiding a topical crawler. We also find that a crawler that exploits words both in immediate vicinity of a hyperlink as well as the entire parent page performs better than a crawler that depends on just one of those cues. Also, a crawler that uses the tag tree hierarchy within Web pages provides effective coverage. We

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Expanding Reinforcement Learning Approaches for Efficient Crawling the Web

The amount of accessible information on World Wide Web is increasing rapidly, so that a general-purpose search engine cannot index everything on the Web. Focused crawlers have been proposed as a potential approach to overcome the coverage problem of search engines by limiting the domain of concentration of them. Focused crawling is a technique which is able to crawl particular topical portions ...

متن کامل

Defining Evaluation Methodologies for Topical Crawlers

Topical crawlers are becoming important tools to support applications such as specialized Web portals, online searching, and competitive intelligence. As the Web mining field matures, the disparate crawling strategies proposed in the literature will have to be evaluated and compared on common tasks through well-defined performance measures. We have argued that general evaluation methodologies a...

متن کامل

Combining Classifier Guided by Semi-Supervision

The article suggests an algorithm for regular classifier ensemble methodology. The proposed methodology is based on possibilistic aggregation to classify samples. The argued method optimizes an objective function that combines environment recognition, multi-criteria aggregation term and a learning term. The optimization aims at learning backgrounds as solid clusters in subspaces of the high...

متن کامل

Combining Classifier Guided by Semi-Supervision

The article suggests an algorithm for regular classifier ensemble methodology. The proposed methodology is based on possibilistic aggregation to classify samples. The argued method optimizes an objective function that combines environment recognition, multi-criteria aggregation term and a learning term. The optimization aims at learning backgrounds as solid clusters in subspaces of the high...

متن کامل

Topical Crawling for Business Intelligence

The Web provides us with a vast resource for business intelligence. However, the large size of the Web and its dynamic nature make the task of foraging appropriate information challenging. Generalpurpose search engines and business portals may be used to gather some basic intelligence. Topical crawlers, driven by richer contexts, can then leverage on the basic intelligence to facilitate in-dept...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2004