A Framework for Populating Ontological Models from Semi-structured Web Documents

Authors

  • Hassan A. Sleiman
  • Inma Hernández
Abstract

The Web is the largest repository of information that has ever existed. This information is presented in a human-friendly format using HTML, which complicates its consumption by automatic processes. The Semantic Web and Web Services are solutions to this problem, but the lack of such services on the majority of web sites has increased interest in information extraction, which allows extracting and structuring information from web documents into ontological models. Despite the high number of proposals on information extraction, no universally applicable information extractor exists. As a consequence, when populating an ontological model automatically from a web site, it is not unusual to need more than one information extractor. We propose a framework that allows the development, training, and application of information extractors on semi-structured web documents to produce semantic data. We have developed a version of the framework and verified it by means of experiments on 35 web sites. Experimental results are very promising.
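The abstract does not describe the framework's interfaces, so the following is only a minimal sketch of the general idea: several information extractors share one train/apply contract, and their outputs are combined into triples for an ontological model. Every name here (`InformationExtractor`, `DelimiterExtractor`, `populate`, the predicate labels) is hypothetical and is not taken from the paper, and the toy delimiter rule merely stands in for whatever learning technique a real extractor would use.

```python
from abc import ABC, abstractmethod
import re

# A (subject, predicate, object) statement destined for the ontological model.
Triple = tuple[str, str, str]


class InformationExtractor(ABC):
    """Hypothetical common interface; the paper's actual API is not shown in the abstract."""

    @abstractmethod
    def train(self, examples: list[tuple[str, str]]) -> None:
        """Learn from (html_document, expected_value) pairs."""

    @abstractmethod
    def extract(self, html: str) -> list[str]:
        """Return the values found in one document."""


class DelimiterExtractor(InformationExtractor):
    """Toy extractor: learns the text immediately before and after a value."""

    def __init__(self, context: int = 4) -> None:
        self.context = context  # number of characters kept on each side
        self.pattern = None

    def train(self, examples: list[tuple[str, str]]) -> None:
        html, value = examples[0]  # one example suffices for this toy rule
        start = html.index(value)
        end = start + len(value)
        prefix = html[max(0, start - self.context):start]
        suffix = html[end:end + self.context]
        self.pattern = re.compile(re.escape(prefix) + "(.*?)" + re.escape(suffix))

    def extract(self, html: str) -> list[str]:
        if self.pattern is None:
            raise RuntimeError("extractor has not been trained")
        return self.pattern.findall(html)


def populate(subject: str, html: str,
             extractors: dict[str, InformationExtractor]) -> list[Triple]:
    """Apply one extractor per predicate and collect the resulting triples."""
    triples = []
    for predicate, extractor in extractors.items():
        for value in extractor.extract(html):
            triples.append((subject, predicate, value))
    return triples
```

Keeping each extractor behind the same two methods is what would let a site that needs more than one extractor mix them freely; `populate` only needs a mapping from predicate to trained extractor.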

Similar resources

Populating Ontologies with Data from OCRed Lists

A flexible, accurate, and efficient method of automatically extracting facts from lists in OCRed documents and inserting them into an ontology would help make those facts machine searchable, queryable, and linkable and expose their rich ontological interrelationships. To work well, such a process must be adaptable to variations in list format, tolerant of OCR errors, and careful in its selectio...

A Case Study on Linked Data Generation and Consumption

The availability of large amounts of interlinked semantic data is a fundamental prerequisite of the Semantic Web. At present, almost all the usable ontological data is built manually or by directly transforming certain (semi-)structured data sources into certain formats of semantic data. To solve the “isolated data island” problem of the Semantic Web caused by this situation, the Linking Open D...

Toward Ontology-based Knowledge Extraction from Web Data with the Lexicalization of Ontology for Korean QA System

Most knowledge is written in natural language, and structured knowledge bases include only part of it. From a QA system perspective, the quality of a knowledge base depends on how well it covers the knowledge needed to answer users' questions. To deal with this knowledge base construction problem, we define the natural language question sets and answer documents which contain know...

Information extraction and imprecise query answering from web documents

Word based searches for relevant information from texts retrieve a huge collection and burden the user with information overload. Ontology based text information retrieval can perform concept-based search and extract only relevant portions of text containing concepts that are present in the query or those that are semantically linked to query concepts. While these systems have better precision ...

Searching web data: An entity retrieval and high-performance indexing model

More and more (semi) structured information is becoming available on the Web in the form of documents embedding metadata (e.g., RDF, RDFa, Microformats and others). There are already hundreds of millions of such documents accessible and their number is growing rapidly. This calls for large scale systems providing effective means of searching and retrieving this semi-structured information with ...

Publication date: 2012