Automatic Ontology-Based Knowledge Extraction from Web Documents vs. Automating the Extraction of Data from HTML Tables with Unknown Structure

Authors

Stefan Bischof

Stefan Rümmele

Abstract

In this report we compare the papers [AKM+03] and [ETL03]. We show that the two proposed systems pursue different goals using the same or similar underlying techniques.

• Source data of interest: [ETL03] takes as input web pages containing HTML tables of interest for a given application domain, whereas [AKM+03] considers unstructured text from web pages for the knowledge extraction process.
• Resulting data format: The outputs differ in the same way as the inputs. While [ETL03] returns structured data fitted to a target schema, [AKM+03] generates text in a narrative form specified by the user.
• Interface specification: [AKM+03] is designed as a complete process, from knowledge extraction through information management to narrative generation, and thus provides an interface for human users. [ETL03], however, only provides an interface to an information extraction procedure and is not designed as a complete application.
• Internal data source: While [AKM+03] uses a knowledge base to store the acquired knowledge, [ETL03] does not outline a specific internal data storage approach.
• Project structure: The approach of [ETL03] is a standalone project that builds only on previous research on extraction ontologies. [AKM+03] uses several projects for the various tasks required and combines them into a single process.
• Extraction ontology: In both papers the central aspect of the information extraction process is the extraction ontology. Without it, a wrapper has to be created for every new page; with it, wrapper creation can be automated.
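To make the idea of ontology-driven wrapper creation concrete, here is a minimal Python sketch. It is not the method of [ETL03] or [AKM+03]; it only illustrates how a small, hand-written "extraction ontology" (header keywords plus value patterns per target attribute) could map the columns of a table with unknown headers onto a target schema. The `ONTOLOGY` contents, the scoring rule, the 0.5 threshold, and the function names `score`, `map_columns`, and `extract` are all assumptions made for illustration. The car-advertisement domain is borrowed from the running example mentioned in the related review further down the page.

```python
import re

# Hypothetical mini "extraction ontology" for car ads (illustrative only):
# each target attribute has header keywords and a pattern for its values.
ONTOLOGY = {
    "year": {
        "keywords": ["year", "yr"],
        "pattern": re.compile(r"^(19|20)\d{2}$"),
    },
    "price": {
        "keywords": ["price", "cost"],
        "pattern": re.compile(r"^\$\d{1,3}(,\d{3})*$"),
    },
    "make": {
        "keywords": ["make", "brand"],
        "pattern": re.compile(r"^(Ford|Honda|Toyota)$"),
    },
}


def score(attr, header, cells):
    """Score how well a column fits an attribute (assumed heuristic):
    1.0 for a header keyword hit plus the fraction of matching cells."""
    spec = ONTOLOGY[attr]
    header_hit = any(k in header.lower() for k in spec["keywords"])
    value_hits = sum(1 for c in cells if spec["pattern"].match(c.strip()))
    return (1.0 if header_hit else 0.0) + value_hits / max(len(cells), 1)


def map_columns(headers, rows):
    """Return {column index: attribute} for columns the ontology recognises."""
    mapping = {}
    for i, header in enumerate(headers):
        cells = [row[i] for row in rows]
        best = max(ONTOLOGY, key=lambda a: score(a, header, cells))
        if score(best, header, cells) >= 0.5:  # arbitrary threshold
            mapping[i] = best
    return mapping


def extract(headers, rows):
    """Rewrite table rows as records in the target schema."""
    mapping = map_columns(headers, rows)
    return [{attr: row[i] for i, attr in mapping.items()} for row in rows]


if __name__ == "__main__":
    headers = ["Brand", "Yr.", "Asking"]
    rows = [["Ford", "1999", "$4,500"], ["Honda", "2001", "$7,200"]]
    for record in extract(headers, rows):
        print(record)
```

Because the mapping is derived from the ontology rather than from the layout of one page, the same code can be pointed at a differently labelled table without writing a new wrapper, which is the property the last bullet describes.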


Similar resources

Automating the extraction of data from HTML tables with unknown structure

Data on the Web in HTML tables is mostly structured, but we usually do not know the structure in advance. Thus, we cannot directly query for data of interest. We propose a solution to this problem based on document-independent extraction ontologies. Our solution entails elements of table understanding, data integration, and wrapper creation. Table understanding allows us to find tables of inter...


Presenting a method for extracting structured domain-dependent information from Farsi Web pages

Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...


Automating the Extraction of Data from HTML Tables with Unknown Structure

The authors propose a solution to the problem of web information extraction, which aims to extract relevant information from web pages. However, since this is a broad field, they have limited their work to information that is available in HTML tables found on the Web and relates to a specific domain of interest. As a running example in their paper, the authors use car advertisements. I suggest ...


Categorisation of web documents using extraction ontologies

Automatically recognising which HTML documents on the Web contain items of interest for a user is non-trivial. As a step toward solving this problem, we propose an approach based on information-extraction ontologies. Given HTML documents, tables, and forms, our document recognition system extracts expected ontological vocabulary (keywords and keyword phrases) and expected ontological instance d...
