Data Extraction using Content-Based Handles
Authors
Abstract:
In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers that extract data records from web pages. Our approach relies mainly on the visible page content to identify data regions on a web page. The extraction algorithm is inspired by the way a human user scans page content for specific data. In particular, we use text features such as textual delimiters, keywords, constants, or text patterns, which we call handles, to construct patterns for the target data regions and data records. We offer a polynomial-time algorithm in which these patterns are checked against the page elements in a mixed bottom-up and top-down traversal of the DOM tree. The extracted data is mapped directly onto a hierarchical XML structure, which forms the output of the wrapper. Wrappers generated by this method are robust and independent of the HTML structure. Therefore, they can be adapted to similar websites to gather and integrate information.
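The abstract does not give the algorithm itself, so the following is only a minimal sketch of the general idea, not HWrap's actual implementation. It assumes a well-formed page that `xml.etree.ElementTree` can parse, and every name in it (`HANDLES`, `text_of`, `find_record_regions`, `locate`, `extract`, the sample `PAGE`) is hypothetical. Handles are modeled as regular expressions over visible text: a bottom-up pass climbs from each element whose text matches a handle to the smallest ancestor that matches all handles (a candidate data record), and a top-down pass descends inside that record to pick out each field before writing it to XML.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical handles: one text pattern per output field (assumption, not from the paper).
HANDLES = {
    "title": re.compile(r"Title:\s*(\w+)"),
    "price": re.compile(r"Price:\s*\$(\d+\.\d{2})"),
}

# Sample page; the two records deliberately use different markup.
PAGE = """<html><body>
<div>Home | About</div>
<table>
<tr><td>Title: Dune</td><td>Price: $9.99</td></tr>
<tr><td><b>Title: Emma</b></td><td>Price: $4.50</td></tr>
</table>
</body></html>"""


def text_of(el):
    """Visible text of an element and its descendants, whitespace-normalized."""
    return " ".join(" ".join(el.itertext()).split())


def find_record_regions(root):
    """Bottom-up pass: climb from each handle match to the smallest
    ancestor whose visible text satisfies every handle."""
    parent = {child: p for p in root.iter() for child in p}
    regions = []
    for el in root.iter():
        # Text owned directly by this element (not by its descendants).
        own = (el.text or "") + "".join(c.tail or "" for c in el)
        if not any(h.search(own) for h in HANDLES.values()):
            continue
        node = el
        while node is not None and not all(
            h.search(text_of(node)) for h in HANDLES.values()
        ):
            node = parent.get(node)
        if node is not None and node not in regions:
            regions.append(node)
    # Keep only minimal regions (drop any region that contains another one).
    return [
        r for r in regions
        if not any(o is not r and o in r.iter() for o in regions)
    ]


def locate(region, pattern):
    """Top-down pass: descend to the deepest element that still matches."""
    node = region
    while True:
        hit = next((c for c in node if pattern.search(text_of(c))), None)
        if hit is None:
            return pattern.search(text_of(node)).group(1)
        node = hit


def extract(root):
    """Map every detected record onto a small hierarchical XML structure."""
    out = ET.Element("records")
    for region in find_record_regions(root):
        record = ET.SubElement(out, "record")
        for field, pattern in HANDLES.items():
            ET.SubElement(record, field).text = locate(region, pattern)
    return out


if __name__ == "__main__":
    result = extract(ET.fromstring(PAGE))
    print(ET.tostring(result, encoding="unicode"))
```

Because the sketch matches on visible text rather than on tag names or paths, both sample rows are extracted even though their markup differs, which illustrates the kind of structure independence the abstract describes. Real-world HTML is rarely well-formed XML, so a tolerant HTML parser would be needed in practice.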
Journal title
volume 6 issue 2
pages 399-407
publication date 2018-07-01