Record-Level Information Extraction from a Web Page based on Visual Features

Author

  • A Suresh Babu
Abstract

Web databases contain a huge amount of structured data that can be obtained only through their query interfaces. Query results are presented in dynamically generated web pages, usually in the form of data records, for human use. Automatically extracting data records from query result pages is a crucial problem for web data integration applications such as comparison shopping sites and meta-search engines. A number of approaches to query result extraction have been proposed. As the structures of web pages become more complex, these approaches start to fail. Query result pages usually also contain other types of information in addition to query results, e.g., advertisements and navigation bars. Most of the existing approaches do not remove such irrelevant content, which may affect the accuracy of data record extraction. We have observed that query results are usually displayed in regular visual patterns and that terms used in a query often reappear in query results. The paper proposes a novel approach that makes use of visual features and query terms to identify the data section and extract data records from it. It also uses several content and visual features of visual blocks in a data section to filter out noisy blocks. Experimental results on a large set of query result pages in different domains show that the proposed approach is...
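The two observations in the abstract (results follow regular visual patterns, and query terms tend to reappear in results) can be illustrated with a small sketch. This is not the paper's algorithm, which is not given here. The `Block` structure, the scoring functions, the equal weighting, and the sample data are all assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import pstdev


@dataclass
class Block:
    """A rendered visual block (hypothetical structure, not from the paper)."""
    left: float   # x-coordinate of the block's left edge, in pixels
    width: float  # rendered width, in pixels
    text: str     # visible text content


def query_term_ratio(blocks, query_terms):
    """Fraction of blocks whose text contains at least one query term."""
    if not blocks:
        return 0.0
    terms = [t.lower() for t in query_terms]
    hits = sum(any(t in b.text.lower() for t in terms) for b in blocks)
    return hits / len(blocks)


def visual_regularity(blocks):
    """Score in [0, 1]; 1.0 when all blocks share the same left edge and width."""
    if len(blocks) < 2:
        return 0.0
    spread = pstdev([b.left for b in blocks]) + pstdev([b.width for b in blocks])
    return 1.0 / (1.0 + spread)


def pick_data_section(regions, query_terms, weight=0.5):
    """Return the name of the candidate region with the best combined score.

    The linear combination and the default weight are assumptions.
    """
    def score(blocks):
        return (weight * query_term_ratio(blocks, query_terms)
                + (1 - weight) * visual_regularity(blocks))
    return max(regions, key=lambda name: score(regions[name]))


# Made-up candidate regions for a query result page about cameras.
regions = {
    "nav": [Block(0, 120, "Home"), Block(130, 80, "About"),
            Block(220, 200, "Contact us")],
    "results": [Block(200, 600, "Canon camera EOS 250D"),
                Block(200, 600, "Nikon camera D3500"),
                Block(200, 600, "Sony camera A6000")],
}
best = pick_data_section(regions, ["camera"])
```

With this sample data, the "results" region wins because its blocks are aligned, equally wide, and all contain the query term. A real page would need block coordinates from a rendering engine, which is outside the scope of this sketch.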


Similar resources

Data Extraction using Content-Based Handles

In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers to extract data records from web pages. In our approach, we mainly rely on the visible page content to identify data regions on a web page. Our extraction algorithm is inspired by the way a human user scans the page content for specific data. In particular, we use text fea...


Extraction of Flat and Nested Data Records from Web Pages

This paper studies the problem of identification and extraction of flat and nested data records from a given web page. With the explosive growth of information sources available on the World Wide Web, it has become increasingly difficult to identify the relevant pieces of information, since web pages are often cluttered with irrelevant content like advertisements, navigation-panels, copyright n...


Visual Architecture based Web Information Extraction

The World Wide Web has many online web databases that can be searched through their web query interfaces. Deep Web contents are accessed by queries submitted to Web databases, and the returned data records are enwrapped in dynamically generated Web pages. Extracting structured data from deep Web pages is a challenging task due to the underlying complic...


An Efficient Image Based Approach for Extraction of Deep Web Data

The Internet presents a huge amount of useful information which is usually formatted for its users, which makes it difficult to extract relevant data from various sources. Deep Web contents are extracted by submitting the queries to semi structured Web databases and the returned data records are enwrapped in dynamically generated Web pages. Extracting structured data from deep Web pages is a ch...


Hybrid Adaptive Educational Hypermedia Recommender Accommodating User’s Learning Style and Web Page Features

Personalized recommenders have proved to be of use as a solution to reduce the information overload problem. Especially in Adaptive Hypermedia System, a recommender is the main module that delivers suitable learning objects to learners. Recommenders suffer from the cold-start and the sparsity problems. Furthermore, obtaining learner’s preferences is cumbersome. Most studies have only focused...




Publication date: 2012