Cooperative CG-Wrappers for Web Content Extraction

نویسندگان

  • Fotis Kokkoras
  • Nick Bassiliades
  • Ioannis P. Vlahavas
چکیده

We use Conceptual Graphs (CGs) to model web content extraction rules (CG-Wrappers). The approach presented incorporates all major existing extraction techniques and allows the definition of synergies of cooperative wrappers for handling complex extraction task, without requiring programming.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Data Extraction using Content-Based Handles

In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers to extract data records from web pages. In our approach, we mainly rely on the visible page content to identify data regions on a web page. In our extraction algorithm, we inspired by the way a human user scans the page content for specific data. In particular, we use text fea...

متن کامل

WysiWyg Web Wrapper Factory (W4F)

In this paper, we present the W4F toolkit for the generation of wrappers for Web sources. W4F consists of a retrieval language to identify Web sources, a declarative extraction language (the HTML Extraction Language) to express robust extraction rules and a mapping interface to export the extracted information into some userde ned data-structures. To assist the user and make the creation of wra...

متن کامل

Gleaning answers from the web∗

A wide variety of valuable textual information resides on the Web, but very little is in a machineunderstandable form such as XML. Instead, the content is usually embedded in HTML markup or other encodings designed for human consumption. The information extraction task is to automatically populate a database with content gleaned from information sources such as Web pages. Wrappers are an import...

متن کامل

AutoWrapper: automatic wrapper generation for multiple online services

A crucial challenge for information extraction from the WWW is to generate wrappers, which are information extraction patterns or rules, which apply to numerous Web sites with great diversity in both format and content. Generating wrappers manually is tedious, time consuming and errorprone. Recent research has successfully adapted machine learning technology to generate wrappers for semi-struct...

متن کامل

Declarative Information Extraction, Web Crawling, and Recursive Wrapping with Lixto

Lixto is a system and method for the visual and interactive generation of wrappers for Web pages under the supervision of a human developer, for automatically extracting information from Web pages using such wrappers, and for translating the extracted content into XML. This paper describes some advanced features of Lixto, such as disjunctive pattern definitions, specialization rules, and Lixto’...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2007