VIPS: A VIsion based Page Segmentation Algorithm

Authors

  • Deng Cai
  • Shipeng Yu
  • Ji-Rong Wen
  • Wei-Ying Ma
Abstract

This paper proposes a new approach to web content structure analysis based on visual representation. Many web applications, such as information retrieval, information extraction, and automatic page adaptation, can benefit from this structure. The paper presents an automatic, top-down, tag-tree-independent approach to detecting web content structure. It simulates how a user understands web layout structure based on visual perception. Compared with existing techniques such as the DOM tree, our approach is independent of the HTML document representation. Our method works well even when the HTML structure differs considerably from the visual layout structure. Several experiments show the effectiveness of our method.

Similar resources

Segmenting Webpage with Gomory-Hu Tree Based Clustering

We propose a novel web page segmentation algorithm based on finding the Gomory-Hu tree in a planar graph. The algorithm firstly distills vision and structure information from a web page to construct a weighted undirected graph, whose vertices are the leaf nodes of the DOM tree and the edges represent the visible position relationship between vertices. Then it partitions the graph with the Gomor...

A language independent web data extraction using vision based page segmentation algorithm

Web usage mining is a process of extracting useful information from server logs, i.e. the user's history. Web usage mining is a process of finding out what users are looking for on the internet. Some users might be looking at only textual data, whereas others might be interested in multimedia data. One would retrieve the data by copying it and pasting it to the relevant document. But this is t...

TVGraz: Multi-Modal Learning of Object Categories by Combining Textual and Visual Features

Internet offers a vast amount of multi-modal and heterogeneous information mainly in the form of textual and visual data. Most of the current web-based visual object classification methods only utilize one of these data streams. As we will show in this paper, combining these modalities in a proper way often provides better results not attainable by relying on only one of these data streams. How...

Vision-based Presentation Modeling of Web Applications: A Reverse Engineering Approach

Presentation modeling, which captures the layout of an HTML page, is a very important aspect of modeling Web Applications (WAs). However, presentation modeling is often neglected during forward engineering of Web Applications; therefore, most of these applications are poorly modeled or not modeled at all. This paper discusses the design, implementation, and evaluation of a reverse engineering t...

An Efficient Image Based Approach for Extraction of Deep Web Data

The Internet presents a huge amount of useful information which is usually formatted for its users, which makes it difficult to extract relevant data from various sources. Deep Web contents are extracted by submitting the queries to semi structured Web databases and the returned data records are enwrapped in dynamically generated Web pages. Extracting structured data from deep Web pages is a ch...

Publication date: 2003