Multi-Pattern Wrappers for Relation Extraction from the Web
Abstract
Numerous sources of data are available on the web, such as product catalogs, directories, and conference and event sites. Extracting information from these sources is challenging because they are heterogeneous and dynamic. This paper presents a new method for extracting wrappers and relations from the web using both page encoding and context generalization. Its starting point is a training set of instances of the relation the user wishes to extract. Multiple patterns are then extracted from the occurrences of the input instances in the data source. Generalizing these patterns makes it possible to identify new instances of the relation in the same data source. The main features of this method are its simplicity, generality, and robustness in the face of diverse sources. Its efficiency is shown by experimental results on different kinds of sources, including search engines, shopping sites, product catalogs, and paper listings.
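The pipeline the abstract outlines (seed instances, patterns taken from their occurrences, generalization, extraction of new instances) can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the function names `find_contexts`, `learn_patterns`, and `extract`, the `window` and `max_gap` parameters, the toy page, and the choice to generalize by taking the longest shared left and right context are all assumptions made for illustration.

```python
import os
import re
from collections import defaultdict


def find_contexts(page, seeds, window=10, max_gap=40):
    """Return (prefix, middle, suffix) for each occurrence of a seed pair."""
    gap = "(.{1,%d}?)" % max_gap  # short text separating the two values
    contexts = []
    for first, second in seeds:
        occurrence = re.compile(re.escape(first) + gap + re.escape(second))
        for m in occurrence.finditer(page):
            prefix = page[max(0, m.start() - window):m.start()]
            suffix = page[m.end():m.end() + window]
            contexts.append((prefix, m.group(1), suffix))
    return contexts


def learn_patterns(page, seeds):
    """Build one pattern per distinct middle string (hypothetical scheme)."""
    by_middle = defaultdict(list)
    for prefix, middle, suffix in find_contexts(page, seeds):
        by_middle[middle].append((prefix, suffix))

    patterns = []
    for middle, sides in by_middle.items():
        # Generalize: keep only the context shared by every occurrence.
        left = os.path.commonprefix([p[::-1] for p, _ in sides])[::-1]
        right = os.path.commonprefix([s for _, s in sides])
        if len(sides) < 2 or not left or not right:
            continue  # too little evidence to trust this pattern
        patterns.append(re.compile(
            re.escape(left) + "(.+?)" + re.escape(middle)
            + "(.+?)(?=" + re.escape(right) + ")"
        ))
    return patterns


def extract(page, patterns):
    """Apply every pattern to the page and collect distinct pairs in order."""
    found = []
    for pattern in patterns:
        for pair in pattern.findall(page):
            if pair not in found:
                found.append(pair)
    return found


# Toy data source and seed instances (invented for this example).
page = (
    "<li><b>Dune</b> by <i>Frank Herbert</i></li>\n"
    "<li><b>Emma</b> by <i>Jane Austen</i></li>\n"
    "<li><b>Ulysses</b> by <i>James Joyce</i></li>\n"
    "<li><b>Beloved</b> by <i>Toni Morrison</i></li>\n"
)
seeds = [("Dune", "Frank Herbert"), ("Emma", "Jane Austen")]

patterns = learn_patterns(page, seeds)
found = extract(page, patterns)
new_instances = [pair for pair in found if pair not in seeds]
```

On this toy page the two seeds share the same surrounding markup, so a single pattern is learned and `new_instances` holds the two pairs that were not in the training set. A real source would need tokenization of the page encoding and some scoring of competing patterns, which this sketch leaves out.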
Similar resources
Data Extraction using Content-Based Handles
In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers to extract data records from web pages. In our approach, we mainly rely on the visible page content to identify data regions on a web page. Our extraction algorithm is inspired by the way a human user scans the page content for specific data. In particular, we use text fea...
Anti-Unification Based Learning of T-Wrappers for Information Extraction
We present a method for learning wrappers for multi-slot extraction from semi-structured documents. The method learns how to automatically construct wrappers from positive examples, consisting of text tuples occurring in the document. These wrappers (T-wrappers) are based on a pattern language for information extraction built on feature structure unification. The presented technique is a...
Learning T-Wrappers for Information Extraction
We present a method for learning wrappers for multi-slot extraction from semi-structured documents. The method learns how to automatically construct wrappers from positive examples, consisting of text tuples occurring in the document. These wrappers (T-wrappers) are based on a pattern language for information extraction built on feature structure unification. The presented technique is a...
Multi-level Alignment for Attribute Extraction in IEPAD
The problem of information extraction (IE) concerns the automatic generation of extraction programs (also called wrappers). As with a compiler generator, the core problem is to generate extraction rules. In this paper, we introduce IEPAD (an acronym for Information Extraction based on PAttern Discovery), a system that generalizes extraction patterns from Web pages without user-labeled examples. The...
Declarative Information Extraction, Web Crawling, and Recursive Wrapping with Lixto
Lixto is a system and method for the visual and interactive generation of wrappers for Web pages under the supervision of a human developer, for automatically extracting information from Web pages using such wrappers, and for translating the extracted content into XML. This paper describes some advanced features of Lixto, such as disjunctive pattern definitions, specialization rules, and Lixto’...