Estratégias baseadas em exemplos para extração de dados semi-estruturados da web

Author

  • Altigran Soares da Silva
Abstract

In this work we propose, implement, and evaluate strategies and techniques for extracting semistructured data from Web data sources, within the context of an approach we call DEByE (Data Extraction By Example). Our results have been used to implement a data extraction tool, also called DEByE, and their effectiveness has been verified through experiments. The DEByE approach is semi-automatic, in the sense that the role of users (i.e., wrapper developers) is limited to providing examples of the data to be extracted, which shields them from having to be aware of the specific formatting features of the target pages. The examples describe the structure of the objects being extracted by means of nested tables, which are simple and intuitive, yet expressive enough to represent the structure of the data normally present in Web pages. To deal with typical variations of complex semistructured objects, we have extended the original concept of nested tables by relaxing the assumption that all inner tables nested in a column must have the same internal structure. Based on this extended form of nested tables, we formalize the concept of wrappers by means of tabular grammars. These context-free grammars are formed by productions that lead to parse trees that can be directly mapped to nested tables. We have developed strategies for generating tabular grammars from a set of example objects that a user provides from a sample page. This includes: (1) the generation of terminal productions for extracting single values belonging to a specific domain (e.g., an item description, a price, etc.) and (2) the generation of non-terminal productions that represent the structure of the complex objects to be extracted. Data is extracted from target pages by parsing these pages with a tabular grammar. For this, we have developed an efficient bottom-up strategy.
This strategy includes two distinct phases: an extraction phase, in which atomic attribute values are extracted based on local context information available in the extraction productions, and an assembling phase, in which these values are assembled into complex objects according to the target structure supplied by the user through examples, which is encoded in the non-terminal productions. We experimentally demonstrate the effectiveness of the bottom-up strategy for dealing with multi-level objects that present structural variations. The general principle behind the bottom-up algorithm, that is, first extracting atomic values and then grouping these values to assemble complex objects, has been further exploited by the Hot Cycles algorithm we have developed. This algorithm aims at uncovering a plausible tabular structure for assembling complex objects from a given set of atomic values extracted from a target page. It is useful for deploying the DEByE approach in applications where the user is not available to assemble example tables.
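
The two-phase, bottom-up principle described above can be illustrated with a minimal Python sketch. It is not the DEByE implementation: the regular expressions merely stand in for extraction productions, the hard-coded author/books nesting stands in for the non-terminal productions, and every name here (`EXTRACTION_PATTERNS`, `extract`, `assemble`, `PAGE`) as well as the sample page is a hypothetical chosen for illustration.

```python
import re
from typing import NamedTuple


class Token(NamedTuple):
    attr: str   # attribute name, e.g. "title"
    value: str  # extracted atomic value
    pos: int    # character offset in the page


# Hypothetical stand-ins for extraction productions: one regex per atomic
# attribute, where the literal text around the group is the local context.
EXTRACTION_PATTERNS = {
    "author": re.compile(r"<h2>(.*?)</h2>"),
    "title": re.compile(r"<li><i>(.*?)</i>"),
    "price": re.compile(r"\$(\d+\.\d{2})"),
}


def extract(page: str) -> list[Token]:
    """Extraction phase: collect atomic values, ordered by page position."""
    tokens = [
        Token(attr, match.group(1), match.start())
        for attr, pattern in EXTRACTION_PATTERNS.items()
        for match in pattern.finditer(page)
    ]
    return sorted(tokens, key=lambda t: t.pos)


def assemble(tokens: list[Token]) -> list[dict]:
    """Assembling phase: group values into nested author -> books objects.

    The nesting is hard-coded here (an assumption of this sketch); in the
    approach described above it would come from the user's example tables.
    """
    objects: list[dict] = []
    for tok in tokens:
        if tok.attr == "author":
            objects.append({"author": tok.value, "books": []})
        elif not objects:
            continue  # value found before any author: nothing to attach to
        elif tok.attr == "title":
            objects[-1]["books"].append({"title": tok.value})
        elif tok.attr == "price" and objects[-1]["books"]:
            objects[-1]["books"][-1]["price"] = tok.value
    return objects


# Hypothetical sample page; the second book has no price, which mimics a
# structural variation between inner rows.
PAGE = (
    "<h2>Ann Lee</h2><ul>"
    "<li><i>Graphs</i> $12.50</li>"
    "<li><i>Trees</i></li></ul>"
    "<h2>Bo Diaz</h2><ul>"
    "<li><i>Sets</i> $8.00</li></ul>"
)

result = assemble(extract(PAGE))
```

Here `result` holds two nested objects, one per author, and the inner row for "Trees" simply lacks a `price` key. This mirrors, in a very reduced form, the relaxed nested tables in which inner tables in the same column need not share one internal structure.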

Similar resources

Uma Abordagem para Armazenamento de Dados Semi-Estruturados em Bancos de Dados Relacionais

This paper presents an approach to storing semistructured data in relational databases. We focus on semistructured data as extracted from Web pages by a tool called DEByE (Data Extraction By Example), and organized according to its data model, the DEByE Object Model (DEByE-OM). The approach presented here consists in representing the structure of objects extracted by DEByE by a relational schem...

Mapeamento de Relacionamentos em Rede Armazenados em Bancos de Dados Espaciais para Documentos GML

Abstract. Data represented in GML documents are used in many GIS and Web applications, mainly for storing, manipulating, and exchanging geographic information. However, a large part of geographic information is stored in spatial databases. This work presents a methodology for mapping geographic data, structured using relation...

Extracção de Relações Semânticas de Textos em Português Explorando a DBpédia e a Wikipédia

Identifying semantic relations expressed between entities mentioned in texts is an important step toward the automatic extraction of knowledge from large document collections such as the Web. Several previous works have addressed this task for the English language, using supervised machine learning techniques for relation classification, ...

Uma Abordagem para Detecção e Extração de Rótulos em Formulários Web

The volume of the Deep Web continues to grow, as does the interest in discovering and extracting the data and schemata of hidden Web databases. This is motivated by applications that aim to provide unified search over several Web forms or over the hidden content of Web databases. In this context, this paper presents an approach for detecting and extracting labels in Web forms. For detecting a Web form, w...

Descoberta de ruído em páginas da web oculta através de uma abordagem de aprendizagem supervisionada

One of the problems in Web data extraction is removing the noise present in pages. This task seeks to identify all non-informative elements amid the content, such as headers, menus, or advertisements. The presence of noise can seriously harm the performance of search engines and of Web data mining tasks. This work addresses the problem of discovering...

Ferramentas de Apoio à Criação e Edição de Ontologias: Tainacan Ontology e uma Análise Comparativa

With the need to handle large amounts of data on the Web, and to treat this data as meaningful knowledge, the Web of Data has migrated to a new paradigm, the Semantic Web. Ontologies form the core of the Semantic Web, and developing them requires tools called ontology editors. This article presents a comparative analysis of these tools, focused on providing functionality based on O...

Publication date: 2002