Exploiting content redundancy for web information extraction

Similar resources

Exploiting Structural Similarity For Effective Web Information Extraction

In this paper we propose an architecture that exploits the structural information of web pages to extract relevant information from them. In this architecture, a distance-based classification methodology plays a primary role. This methodology is based on an efficient and effective technique for detecting structural similarities among semistructured documents, which significan...

Web-based Multimedia Information Extraction Based on Social Redundancy

Social networking sites are among the most frequently visited on the web (Cha et al. 2007) and their use has expanded into professional contexts for expertise sharing and knowledge discovery (Millen, Feinberg and Kerr 2006). These virtual communities can be enormous, with millions of users and shared resources. Social multimedia websites, such as YouTube, are particularly popular. Network traff...

Graph compression—save information by exploiting redundancy

In this paper we raise the question of how to compress sparse graphs. By introducing the idea of redundancy, we find a way to measure the overlap of neighbors between nodes in networks. We exploit symmetry and information by making use of the overlap in neighbors and analyzing how information is reduced by shrinking the network and, using the specific data structure we created, we generalize th...

Automatic Corpus Expansion for Chinese Word Segmentation by Exploiting the Redundancy of Web Information

Currently, most state-of-the-art methods for Chinese word segmentation (CWS) are based on supervised learning, which depends on large-scale annotated corpora. However, these supervised methods do not work well on a new domain that lacks sufficient annotated data. In this paper, we propose a method to automatically expand the training corpus for the out-of-domain texts by exp...

Exploiting ASP for Semantic Information Extraction

The paper describes HıLεX, a new ASP-based system for extracting information from unstructured documents. Unlike previous systems, which are mainly syntactic, HıLεX combines semantic and syntactic knowledge for powerful information extraction. In particular, the exploitation of background knowledge, stored in a domain ontology, makes it possible to significantly empower the information extra...

Journal

Journal title: Proceedings of the VLDB Endowment

Year: 2010

ISSN: 2150-8097

DOI: 10.14778/1920841.1920915