Multi-Class Document Layout Classification using Random Chopping
Abstract
This paper proposes a multi-class document layout classification and recognition system based on a method called random chopping. Text lines are extracted from a scanned document image, and the image is represented as a set of quadrilaterals, one for every pair of text lines. For a compact representation, a dictionary of quadrilateral clusters is built beforehand; a document image is then represented as a word-occurrence histogram by looking up its quadrilaterals in the dictionary. The training process iteratively chops all training classes into two partitions and trains a linear classifier for each split. A binary coordinate space is built from all chops, and every document's histogram descriptor is projected into this space to form a binary signature. Layout similarity then reduces to computing the distance between two signatures. Our experiments demonstrate that this multi-class classification system achieves very good performance not only on trained classes but also on instances from layout classes never seen during training.
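The training and signature steps described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the abstract does not say which linear classifier or which signature distance is used, so a least-squares fit stands in for the classifier and Hamming distance for the signature distance. The function names (`train_chops`, `signature`, `hamming`) and the synthetic histograms are hypothetical.

```python
import numpy as np

def train_chops(X, y, n_chops=32, seed=0):
    """Train one linear classifier per random two-way split ("chop") of the classes.

    X: (n_docs, n_bins) histogram descriptors; y: (n_docs,) class labels.
    Assumption: a least-squares fit stands in for the linear classifier.
    """
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    if classes.size < 2:
        raise ValueError("need at least two classes to chop")
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
    weights = []
    for _ in range(n_chops):
        # Randomly assign each class to side 0 or 1; redraw one-sided splits.
        side = rng.integers(0, 2, size=classes.size)
        while side.min() == side.max():
            side = rng.integers(0, 2, size=classes.size)
        # Each document inherits the side of its class as a +1/-1 target.
        target = np.where(side[np.searchsorted(classes, y)] == 1, 1.0, -1.0)
        w, *_ = np.linalg.lstsq(Xb, target, rcond=None)
        weights.append(w)
    return np.array(weights)  # shape (n_chops, n_bins + 1)

def signature(W, X):
    """Project histograms onto every chop; each chop contributes one bit."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return (Xb @ W.T > 0).astype(np.uint8)

def hamming(a, b):
    """Number of differing bits between two binary signatures (assumed distance)."""
    return int(np.count_nonzero(a != b))

# Synthetic stand-in for word-occurrence histograms: 3 classes, 8 bins.
rng = np.random.default_rng(1)
centers = 10.0 * np.eye(3, 8)
y = np.repeat(np.arange(3), 20)
X = centers[y] + 0.5 * rng.standard_normal((60, 8))

W = train_chops(X, y)
S = signature(W, X)
print("same class:", hamming(S[0], S[1]), "different class:", hamming(S[0], S[20]))
```

Because a signature only records which side of each chop a document falls on, it can be computed for any histogram, including one from a layout class that was not among the training classes.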
Similar resources
Content-free Document Genre Classification using First Order Random Graphs
We approach the general problem of machine-printed document genre classification using content-free layout structure analysis. Document genre is determined from the layout structure detected from scanned binary images of the document pages, using no OCR results and minimal a priori knowledge of document logical structures. Our approach uses attributed relational graphs (ARGs) to represent the lay...
Model-Guided Segmentation and Layout Labelling of Document Images Using a Hierarchical Conditional Random Field
We present a model-guided segmentation and document layout extraction scheme based on hierarchical Conditional Random Fields (CRFs). Common methods for classifying each pixel of a document image as text, background, or image are often noisy and error-prone, and often require post-processing with heuristic methods. The input to the system is a pixel-wise classification based on t...
Fine-Grained Document Genre Classification Using First Order Random Graphs
We approach the general problem of classifying machine-printed documents into genres. Layout is a critical factor in recognizing fine-grained genres, as document content features are similar. Document genre is determined from the layout structure detected from scanned binary images of the document pages, using no OCR results and minimal a priori knowledge of document logical structures. Our met...
Text Classification and Layout Analysis of Paper Fragments
Document image analysis methods such as text classification and layout analysis allow for the automated extraction of document properties. In general, these methodologies are pre-processing steps for Optical Character Recognition (OCR) systems. In contrast, the proposed method aims at clustering document snippets so that an automated clustering of documents can be performed. First, localized words are c...
Feature-based Malicious URL and Attack Type Detection Using Multi-class Classification
Nowadays, malicious URLs are a common threat to businesses, social networks, net-banking, etc. Existing approaches have focused on binary detection, i.e. whether a URL is malicious or benign. Very little literature has focused on detecting malicious URLs together with their attack types. Hence, it becomes necessary to know the attack type and adopt an effective countermeasure. This pa...