RSS-Crawler Enhancement for Blogosphere-Mapping

Authors

  • Justus Bross
  • Patrick Hennig
  • Philipp Berger
  • Christoph Meinel
Abstract

The massive adoption of social media has provided new ways for individuals to express their opinions online. The blogosphere, an inherent part of this trend, contains a vast array of information on a wide variety of topics. It is a huge think tank that creates an enormous and ever-changing archive of open source intelligence. The higher-level aim of the research presented here is to mine and model this vast pool of data in order to extract, exploit and describe meaningful knowledge, and thereby to leverage the structures and dynamics of emerging networks within the blogosphere. Our proprietary, tailor-made feed crawler framework was developed to meet exactly this need. While the main concept, the basic techniques and the implementation details of the crawler have already been covered in earlier publications, this paper focuses on several recent optimizations of the crawler framework that proved crucial to its overall performance.

Similar resources

Mapping the Australian Networked Public Sphere

This article reports on a research program that has developed new methodologies for mapping the Australian blogosphere and tracking how information is disseminated across it. The authors improve on conventional web crawling methodologies in a number of significant ways: First, the authors track blogging activity as it occurs, by scraping new blog posts when such posts are announced through Real...

Cobra: Content-based Filtering and Aggregation of Blogs and RSS Feeds

Blogs and RSS feeds are becoming increasingly popular. The blogging site LiveJournal has over 11 million user accounts, and according to one report, over 1.6 million postings are made to blogs every day. The “Blogosphere” is a new hotbed of Internet-based media that represents a shift from mostly static content to dynamic, continuously-updated discussions. The problem is that finding and tracki...

Design of an RSS Crawler with Adaptive Revisit Manager

RSS is widely used for notifying readers of updated information on blogs and feeding news to readers quickly. RSS is very simple, and so is mostly used as a web service. However, there is no satisfactory search engine for RSS. The reason is that RSS content is continuously modified, and the structure of general search engines is ill-suited to collecting information from RSS sources. In this paper,...

Personalized RSS Search Service Using RSS Characteristics and User Context

RSS is one of the most important techniques in Web 2.0. Although many RSS feeds are available, finding the information relevant to a user is not easy. Previous RSS search services have not taken into account RSS feed characteristics or the user’s contextual information, which seriously limits their ability to offer users useful information. This paper proposes a new personalized RSS searc...

Automated System for Improving RSS Feeds Data Quality

Nowadays, the majority of RSS feeds provide incomplete information about their news items. This lack of information leads to a loss of user engagement. We present a new automated system for improving the RSS feeds’ data quality. RSS feeds provide a list of the latest news items ordered by date. This makes it easy for a web crawler to precisely locate the item and extract its raw content....

Publication date: 2013