Fine-Grained Document Genre Classification Using First Order Random Graphs
Abstract
We approach the general problem of classifying machine-printed documents into genres. Layout is a critical factor in recognizing fine-grained genres, because documents in closely related genres have similar content features. Document genre is determined from the layout structure detected in scanned binary images of the document pages, using no OCR results and minimal a priori knowledge of document logical structures. Our method uses attributed relational graphs (ARGs) to represent the layout structure of document instances, and first order random graphs (FORGs) to represent document genres. In this paper we develop our FORG-based genre classification method and present a comparative evaluation between our technique and a variety of statistical pattern classifiers. FORGs are capable of modeling common layout structure within a document genre and are shown to significantly outperform traditional pattern classification techniques when fine-grained genre distinctions must be drawn.
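As a rough illustration of the ARG representation mentioned in the abstract, the following minimal Python sketch stores layout regions as attributed nodes and pairwise spatial relations as attributed edges. The class names, the bounding-box attributes, and the centre-offset edge attributes are all assumptions made for this example; the abstract does not specify which attributes the authors use, and the sketch does not attempt to model FORGs or the classification step.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Region:
    """A layout region (graph node). The bounding-box attributes are hypothetical."""
    name: str
    x: int       # left edge, in pixels
    y: int       # top edge, in pixels (image coordinates: y grows downward)
    width: int
    height: int


@dataclass
class AttributedRelationalGraph:
    """Nodes carry region attributes; edges carry relational attributes."""
    nodes: dict = field(default_factory=dict)  # name -> Region
    edges: dict = field(default_factory=dict)  # (name_a, name_b) -> attribute dict

    def add_region(self, region: Region) -> None:
        self.nodes[region.name] = region

    def relate(self, a: str, b: str) -> None:
        """Attach an edge whose attributes describe where b lies relative to a."""
        ra, rb = self.nodes[a], self.nodes[b]
        # Offset between region centres: an illustrative choice of edge attribute.
        dx = (rb.x + rb.width / 2) - (ra.x + ra.width / 2)
        dy = (rb.y + rb.height / 2) - (ra.y + ra.height / 2)
        self.edges[(a, b)] = {"dx": dx, "dy": dy, "a_above_b": dy > 0}


# Build a two-region page: a title block sitting above a body-text block.
page = AttributedRelationalGraph()
page.add_region(Region("title", x=100, y=50, width=400, height=40))
page.add_region(Region("body", x=100, y=120, width=400, height=600))
page.relate("title", "body")
```

In this sketch each document instance would get one such graph, so two pages with the same regions in different arrangements differ only in their edge attributes, which is the kind of layout-only distinction the abstract relies on.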
Similar resources
Content-free Document Genre Classification using First Order Random Graphs
We approach the general problem of machine-printed document genre classification using content-free layout structure analysis. Document genre is determined from the layout structure detected from scanned binary images of the document pages, using no OCR results and minimal a priori knowledge of document logical structures. Our approach uses attributed relational graphs (ARGs) to represent the lay...
Label Propagation for Fine-Grained Cross-Lingual Genre Classification
Cross-lingual methods can bring the benefits of genre classification to languages which lack genre-annotated training data. However, prior work in this field has been evaluated on coarse genres only. To predict fine-grained genres across languages, we propose a label propagation method, which combines separate sets of features. The results are promising, as the approach outperforms most baselin...
Enhanced Genre Classification through Linguistically Fine-Grained POS Tags
We propose the use of fine-grained part-of-speech (POS) tags as discriminatory attributes for automatic genre classification and report empirical results from an experiment that indicate a substantial accuracy gain from such features over the conventional bag-of-words approach based on word unigrams. In particular, this paper reports our research to investigate the performance of a fine-grained tag ...
Fine-Grained Sentiment Analysis for Movie Reviews in Bulgarian
We present a system for fine-grained sentiment analysis in Bulgarian movie reviews. As this is pioneering work for this combination of language and sentiment granularity, we create suitable, freely available resources: a dataset of movie reviews with fine-grained scores, and a sentiment polarity lexicon. We further compare experimentally the performance of classification, regression and ordinal...
Discovering Fine-Grained Sentiment with Latent Variable Structured Prediction Models
In this paper we investigate the use of latent variable structured prediction models for fine-grained sentiment analysis in the common situation where only coarse-grained supervision is available. Specifically, we show how sentence-level sentiment labels can be effectively learned from document-level supervision using hidden conditional random fields (HCRFs) [10]. Experiments show that this tech...