Document layout structure extraction using bounding boxes of different entitles

نویسندگان

  • Jisheng Liang
  • Jaekyu Ha
  • Robert M. Haralick
  • Ihsin T. Phillips
چکیده

This paper presents an eficient technique for document page layout structure extraction and classification by analyzing the spatial configuration of the bounding boxes of different entities on the given image. The algorithm segments an image into a list of homogeneous zones.The classification algorithm labels each zone as text, table, line-drawing, halftone, ruling, or noise. The text-lines and words are extracted within text zones and neighboring text-lines are merged to form text-blocks. The tabular structure is further decomposed into row and column items. Finally, the document layout hierarchy is produced from these extracted entities.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Document Layout Structure Extraction Using Bounding Boxes of Diierent Entities

This paper presents an eecient and accurate technique for document page layout structure extraction and classiication by analyzing the spatial connguration of the bounding boxes of diierent entities on a given image. The text, table, and nontext structures are detected on document images. The text-lines and words are extracted and the tabular structure is further decomposed into row and column ...

متن کامل

Extraction of text lines and text blocks on document images based on statistical modeling

In this article, we developed a Bayesian model to characterize text line and text block structures on document images using the text word bounding boxes. We posed the extraction problem as finding the text lines and text blocks that maximize the Bayesian probability of the text lines and text blocks given the text word bounding boxes. In particular, we derived the so-called probabilistic linear...

متن کامل

Bayesian Learning of 2d Document Layout Models for Preservation Metadata Extraction

Digital preservation addresses the storage, maintenance, accessibility, and technical integrity of digital materials over the long term. Preservation metadata is the information required to perform these tasks. Given the volume of these journals and high labor cost of manual metadata entry, automated metadata extraction is necessary. Document layout analysis is a process of partitioning documen...

متن کامل

Word Segmentation for Document Images by Successively Merging Adjacent Character Bounding Boxes by Iterative Dilation

A new method of word segmentation for document images is presented. The method uses the bounding box regions to enclose the letters (characters) of the words and then the resulting letter spaces are progressively filled to merge the character bounding boxes to get the word bounding boxes. The method holds good for inclined and irregularly distributed words. The proposed method completely avoids...

متن کامل

An Algorithm for Finding Maximal Whitespace Rectangles at Arbitrary Orientations for Document Layout Analysis

The analysis of the background structure (whitespace) of page images has become an important technique for physical document layout analysis. Globally maximal whitespace rectangles have been previously demonstrated to constitute a concise representation of the major layout features of documents. However, previous methods for computing maximal whitespace rectangles were limited to axisaligned re...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 1996