Multi-Class Document Layout Classification using Random Chopping

Author

Mei Huang

Abstract

This paper proposes a multi-class document layout classification/recognition system using a method called random chopping. A scanned document image undergoes text line extraction and is represented as a set of quadrilaterals, one for every pair of text lines. For compact representation, a dictionary of quadrilateral clusters is built beforehand, and a document image is then represented as a word occurrence histogram by looking up its quadrilaterals in the dictionary. The training process iteratively chops all training classes into two partitions and trains a linear classifier for each split. A binary coordinate space is built from all chops, and every document’s histogram descriptor is then projected to this space to form a binary signature. Layout similarity is reduced to distance computation between two signatures. Our experiments demonstrate that this multi-class classification system achieves very good performance not only on trained classes but also on instances from layout classes never seen during training.
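The chopping and signature steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the author's implementation: the function names, the toy histograms, the random assignment of classes to the two partitions, the use of a perceptron as the linear classifier, and the choice of Hamming distance between signatures are all assumptions made for the example.

```python
import random

def train_perceptron(X, y, epochs=20):
    """Fit a linear classifier (weights, bias) on labels in {-1, +1}.
    A perceptron is used here only as a stand-in linear classifier."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if t * score <= 0:  # misclassified: nudge the hyperplane
                w = [wi + t * xi for wi, xi in zip(w, x)]
                b += t
    return w, b

def train_chops(X, labels, n_chops, seed=0):
    """Repeatedly split the classes into two partitions (assumed random)
    and train one linear classifier per split."""
    classes = sorted(set(labels))
    if len(classes) < 2:
        raise ValueError("need at least two classes to chop")
    rng = random.Random(seed)
    chops = []
    for _ in range(n_chops):
        # Assign each class to one side; retry until both sides are used.
        while True:
            side = {c: rng.choice((-1, 1)) for c in classes}
            if len(set(side.values())) == 2:
                break
        y = [side[label] for label in labels]
        chops.append(train_perceptron(X, y))
    return chops

def signature(x, chops):
    """Project a histogram onto every chop: one bit per classifier."""
    return tuple(
        1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        for w, b in chops
    )

def hamming(a, b):
    """Number of differing bits (assumed signature distance)."""
    return sum(u != v for u, v in zip(a, b))

# Hypothetical 3-bin word occurrence histograms for three layout classes.
X = [[5, 0, 1], [4, 1, 0], [0, 5, 1], [1, 4, 0], [0, 1, 5], [1, 0, 4]]
labels = ["letter", "letter", "form", "form", "invoice", "invoice"]

chops = train_chops(X, labels, n_chops=8)
sigs = [signature(x, chops) for x in X]
```

Because a signature only records which side of each hyperplane a histogram falls on, a document from an unseen layout class can still be projected and compared against stored signatures, which is consistent with the abstract's claim about classes never seen during training.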

Similar resources

Content-free Document Genre Classification using First Order Random Graphs

We approach the general problem of machine-printed document genre classification using content-free layout structure analysis. Document genre is determined from the layout structure detected from scanned binary images of the document pages, using no OCR results and minimal a priori knowledge of document logical structures. Our approach uses attributed relational graphs (ARGs) to represent the lay...

Model-Guided Segmentation and Layout Labelling of Document Images Using a Hierarchical Conditional Random Field

We present a model-guided segmentation and document layout extraction scheme based on hierarchical Conditional Random Fields (CRFs, hereafter). Common methods to classify a pixel of a document image into classes text, background and image are often noisy, and error-prone, often requiring post-processing through heuristic methods. The input to the system is a pixel-wise classification based on t...

Fine-Grained Document Genre Classification Using First Order Random Graphs

We approach the general problem of classifying machine-printed documents into genres. Layout is a critical factor in recognizing fine-grained genres, as document content features are similar. Document genre is determined from the layout structure detected from scanned binary images of the document pages, using no OCR results and minimal a priori knowledge of document logical structures. Our met...

Text Classification and Layout Analysis of Paper Fragments

Document image analysis, such as text classification and layout analysis, allows for the automated extraction of document properties. In general these methodologies are pre-processing steps for Optical Character Recognition (OCR) systems. In contrast, the proposed method aims at clustering document snippets so that an automated clustering of documents can be performed. First, localized words are c...

Feature-based Malicious URL and Attack Type Detection Using Multi-class Classification

Nowadays, malicious URLs are a common threat to businesses, social networks, net-banking, etc. Existing approaches have focused on binary detection, i.e. whether the URL is malicious or benign. Very little literature addresses the detection of both malicious URLs and their attack types. Hence, it becomes necessary to know the attack type and adopt an effective countermeasure. This pa...

Publication date: 2007
