Data Extraction using Content-Based Handles

Authors

A. Pouramini Department of Computer Engineering, University of Sirjan Technology, Sirjan, Iran.

S. Khaje Hassani Department of Computer Engineering, University of Sirjan Technology, Sirjan, Iran.

Sh. Nasiri Department of Computer Engineering, University of Sirjan Technology, Sirjan, Iran.

Abstract:

In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers that extract data records from web pages. Our approach relies mainly on the visible page content to identify data regions on a web page. The extraction algorithm is inspired by the way a human user scans the page content for specific data. In particular, we use text features such as textual delimiters, keywords, constants, or text patterns, which we call handles, to construct patterns for the target data regions and data records. We offer a polynomial-time algorithm in which these patterns are checked against the page elements in a mixed bottom-up and top-down traversal of the DOM tree. The extracted data is mapped directly onto a hierarchical XML structure, which forms the output of the wrapper. The wrappers generated by this method are robust and independent of the HTML structure. Therefore, they can be adapted to similar websites to gather and integrate information.
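The idea of locating records through handles can be illustrated with a small sketch. This is not the HWrap algorithm itself, which the abstract does not spell out; it is a simplified illustration under stated assumptions. The handle names, the regular expressions, the sample page, and the functions `fields_in`, `extract_records`, and `to_xml` are all hypothetical. The sketch finds a text node that matches one handle, climbs bottom-up to the smallest ancestor whose subtree contains every handle, then reads the field values top-down and writes them to XML.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical handles: visible labels a human reader would scan for.
# The first entry is used as the anchor that starts the bottom-up search.
HANDLES = {
    "author": re.compile(r"Author:\s*(.+)"),
    "price": re.compile(r"Price:\s*\$?([\d.]+)"),
}

# Hypothetical, well-formed sample page (a real page would need an HTML parser).
PAGE = """
<html><body>
  <div id="nav"><span>Home</span></div>
  <table>
    <tr><td><b>Author: A. Smith</b></td><td><i>Price: $12.50</i></td></tr>
    <tr><td><b>Author: B. Jones</b></td><td><i>Price: $8.00</i></td></tr>
  </table>
</body></html>
"""


def own_text(element):
    """Text directly inside the element (tail text is ignored in this sketch)."""
    return (element.text or "").strip()


def fields_in(element, handles):
    """Top-down pass: first value of each handle found in the element's subtree."""
    found = {}
    for node in element.iter():
        for name, pattern in handles.items():
            match = pattern.search(own_text(node))
            if match and name not in found:
                found[name] = match.group(1).strip()
    return found


def extract_records(page_source, handles):
    root = ET.fromstring(page_source.strip())
    parent_of = {child: parent for parent in root.iter() for child in parent}
    anchor = handles[next(iter(handles))]
    records, seen = [], set()
    for node in root.iter():
        if not anchor.search(own_text(node)):
            continue
        # Bottom-up pass: climb until the subtree covers every handle.
        candidate = node
        while candidate is not None and len(fields_in(candidate, handles)) < len(handles):
            candidate = parent_of.get(candidate)
        if candidate is not None and id(candidate) not in seen:
            seen.add(id(candidate))
            records.append(fields_in(candidate, handles))
    return records


def to_xml(records):
    """Map the extracted records onto a simple hierarchical XML structure."""
    out = ET.Element("records")
    for record in records:
        item = ET.SubElement(out, "record")
        for name, value in record.items():
            ET.SubElement(item, name).text = value
    return ET.tostring(out, encoding="unicode")


if __name__ == "__main__":
    print(to_xml(extract_records(PAGE, HANDLES)))
```

Because the sketch matches on visible text and never refers to tag names, replacing the table with nested `div` elements would leave the extracted records unchanged, which is the kind of independence from HTML structure the abstract describes.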

similar resources

A Model-Based Analysis of Semiautomated Data Discovery and Entry Using Automated Content-Extraction

JADE GOLDSTEIN-STEWART and JUSTIN GROSSMAN, United States Department of Defense, Washington, DC

Content extraction systems can automatically extract entities and relations from raw text and use the information to populate knowledge bases, potentially eliminating the need for manual data discovery and entry. Unfortunately, c...

Content-Based Authentication Using Digital Speech Data

A watermarking technique for a speech content and speaker authentication scheme is proposed in this paper. It is based on abstracts of speech features relevant to semantic meaning, combined with an ID for the speaker. The ID, which represents the watermark for the speaker, is embedded using a spread spectrum technique. While the extracted abstracts of speech features are used to represe...

XSLT-based Web-Content Extraction

In this paper, we describe the Semantic

Content Based Video Retrieval Using Integrated Feature Extraction

Traditional video retrieval methods fail to meet technical challenges due to the large and rapid growth of multimedia data, which demands effective retrieval systems. In the last decade, Content Based Video Retrieval (CBVR) has become more and more popular. The amount of lecture video data on the World Wide Web (WWW) is growing rapidly. Therefore, a more efficient method for video retrieval in WWW or wit...

EXTRACTION-BASED TEXT SUMMARIZATION USING FUZZY ANALYSIS

Due to the explosive growth of the world-wide web, automatic text summarization has become an essential tool for web users. In this paper we present a novel approach for creating text summaries. Using fuzzy logic and word-net, our model extracts the most relevant sentences from an original document. The approach utilizes fuzzy measures and inference on the extracted textual information from the docu...

Deep Web Data Extraction by Using Vision-Based Item and Data Extraction Algorithms

Deep Web contents are accessed by queries submitted to Web databases and the returned data records are enwrapped in dynamically generated Web pages (they will be called deep Web pages in this paper). Extracting structured data from deep Web pages is a challenging problem due to the underlying intricate structures of such pages. Until now, a large number of techniques have been proposed to addre...

Journal title

Journal of Artificial Intelligence and Data Mining

volume 6 issue 2

pages 399-407

publication date 2018-07-01

Keywords

Web Data Record Extraction, Web Wrapper Generation, Web Information Extraction
