Design of an RSS Crawler with Adaptive Revisit Manager
Abstract
RSS is widely used for notifying readers of updated information on blogs and for delivering news to readers quickly. RSS is very simple, and so it is mostly used as a web service. However, there is no satisfactory search engine that works for RSS. The reason is that RSS is continuously modified, and the structure of general search engines is ineffective for collecting information from RSS sources. In this paper, we discuss a web crawling algorithm and propose a structure for an RSS crawler that is geared toward collecting and updating RSS in the Web 2.0 environment. The proposed method (1) uses the history of visited domain names to predict the location of the RSS of a new seed URL, and (2) updates RSS information adaptively, based on some update-checking heuristics. These approaches can serve as cornerstones for an efficient and effective RSS search engine.
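The two ideas named in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's data structures or heuristics, so everything below is an assumption made for illustration: the class `FeedPathHistory`, the function `next_interval`, the fallback paths, the interval bounds, and the halve/double rule are hypothetical and are not the authors' method.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit


class FeedPathHistory:
    """Remembers which feed paths were found on each visited domain (hypothetical helper)."""

    def __init__(self):
        # domain name -> Counter of feed paths seen on that domain
        self._by_domain = defaultdict(Counter)

    def record(self, feed_url):
        """Store the path of a feed that was actually found."""
        parts = urlsplit(feed_url)
        self._by_domain[parts.netloc.lower()][parts.path] += 1

    def predict(self, seed_url, fallback=("/rss", "/feed")):
        """Return candidate feed URLs for seed_url, most frequently seen path first."""
        parts = urlsplit(seed_url)
        seen = self._by_domain.get(parts.netloc.lower())
        # Unknown domain: fall back to assumed common paths.
        paths = [p for p, _ in seen.most_common()] if seen else list(fallback)
        return [f"{parts.scheme}://{parts.netloc}{p}" for p in paths]


def next_interval(current, changed, lo=300.0, hi=86400.0):
    """Assumed heuristic: halve the revisit interval (seconds) after a change, else double it."""
    proposed = current / 2 if changed else current * 2
    return max(lo, min(hi, proposed))
```

For example, after recording `http://example.com/blog/rss.xml`, calling `predict("http://example.com/post/1")` proposes that feed URL first, and a feed that keeps changing is revisited more often until the interval reaches the lower bound.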
Similar resources
Personalized RSS Search Service Using RSS Characteristics and User Context
RSS is one of the most important techniques in Web 2.0. Although many RSS feeds are available, finding which information is relevant to a user isn't easy. Previous RSS search services have not taken into account RSS feed characteristics and the user's contextual information. This seriously limits their ability to offer users useful information. This paper proposes a new personalized RSS searc...
Automated System for Improving RSS Feeds Data Quality
Nowadays, the majority of RSS feeds provide incomplete information about their news items. The lack of information leads to engagement loss in users. We present a new automated system for improving the RSS feeds’ data quality. RSS feeds provide a list of the latest news items ordered by date. Therefore, it makes it easy for a web crawler to precisely locate the item and extract its raw content....
Cobra: Content-based Filtering and Aggregation of Blogs and RSS Feeds
Blogs and RSS feeds are becoming increasingly popular. The blogging site LiveJournal has over 11 million user accounts, and according to one report, over 1.6 million postings are made to blogs every day. The “Blogosphere” is a new hotbed of Internet-based media that represents a shift from mostly static content to dynamic, continuously-updated discussions. The problem is that finding and tracki...
RSS-Crawler Enhancement for Blogosphere-Mapping
The massive adoption of social media has provided new ways for individuals to express their opinions online. The blogosphere, an inherent part of this trend, contains a vast array of information about a variety of topics. It is a huge think tank that creates an enormous and ever-changing archive of open source intelligence. Mining and modeling this vast pool of data to extract, exploit and desc...
Automatic Extraction of Complex Web Data
A new wrapper induction algorithm, WTM, for generating rules that describe the general web page layout template is presented. WTM is mainly designed for use in weblog crawling and indexing systems. Most weblogs are maintained by content management systems and have similar layout structures in all pages. In addition, they provide RSS feeds to describe the latest entries. These entries appear in the...
Publication date: 2008