Document-specific character template estimation
Authors
Abstract
An approach to supervised training of document-specific character templates from sample page images and unaligned transcriptions is presented. The template estimation problem is formulated as one of constrained maximum likelihood parameter estimation within the document image decoding (DID) framework. This leads to a two-phase iterative training algorithm consisting of transcription alignment and aligned template estimation (ATE) steps. The ATE step is the heart of the algorithm and involves assigning template pixel colors to maximize likelihood while satisfying a template disjointness constraint. The training algorithm is demonstrated on a variety of English documents, including newspaper columns, 15th-century books, degraded images of 19th-century newspapers, and connected text in a script-like font. Three applications enabled by the training procedure are described: high-accuracy document-specific decoding, transcription error visualization, and printer font generation.
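The ATE step described in the abstract can be illustrated with a small greedy sketch. This is not necessarily the authors' exact procedure. It assumes an asymmetric bit-flip noise model with parameters alpha0 and alpha1 (not stated in the abstract), assumes the glyph origins have already been produced by the alignment step, and uses a hypothetical function name, ate_greedy. Under that noise model, turning a template pixel on changes the log likelihood by gamma + beta for each glyph instance whose image pixel is black and by beta for each instance whose image pixel is white. The sketch repeatedly turns on the best-scoring template pixel and then clears the image pixels it explains, which is one simple way to discourage two templates from claiming the same ink.

```python
import math


def ate_greedy(image, origins, canvas, alpha0=0.9, alpha1=0.9):
    """Greedy template estimation sketch (hypothetical helper, not the paper's code).

    image   -- list of rows of 0/1 pixels (1 = black)
    origins -- dict: character label -> list of (row, col) glyph origins
    canvas  -- (height, width) of the template canvas, offsets start at the origin
    alpha0, alpha1 -- assumed P(observe 0 | ideal 0) and P(observe 1 | ideal 1)
    Returns dict: label -> set of (d_row, d_col) foreground offsets.
    """
    # Per-pixel log-likelihood terms under the assumed bit-flip channel.
    gamma = math.log(alpha0 * alpha1 / ((1 - alpha0) * (1 - alpha1)))
    beta = math.log((1 - alpha1) / alpha0)

    z = [row[:] for row in image]  # working copy; the input is not modified
    n_rows, n_cols = len(z), len(z[0])
    height, width = canvas
    templates = {label: set() for label in origins}

    def in_bounds(r, c):
        return 0 <= r < n_rows and 0 <= c < n_cols

    def score(label, dr, dc):
        # Likelihood gain from turning on pixel (dr, dc) of this template,
        # summed over all aligned instances of the character.
        total = 0.0
        for r, c in origins[label]:
            rr, cc = r + dr, c + dc
            is_black = in_bounds(rr, cc) and z[rr][cc] == 1
            total += (gamma if is_black else 0.0) + beta
        return total

    while True:
        best = None
        for label in origins:
            for dr in range(height):
                for dc in range(width):
                    if (dr, dc) in templates[label]:
                        continue
                    s = score(label, dr, dc)
                    if s > 0 and (best is None or s > best[0]):
                        best = (s, label, dr, dc)
        if best is None:
            break  # no remaining pixel would increase the likelihood
        _, label, dr, dc = best
        templates[label].add((dr, dc))
        # Clear the explained image pixels so other templates cannot claim them.
        for r, c in origins[label]:
            rr, cc = r + dr, c + dc
            if in_bounds(rr, cc):
                z[rr][cc] = 0
    return templates
```

For example, with two aligned instances of a 2x2 block labelled "a" and one instance of a vertical bar labelled "b", calling ate_greedy(image, {"a": [(0, 0), (0, 5)], "b": [(0, 3)]}, (2, 2)) recovers one template per label. In the full two-phase algorithm, these templates would be fed back into the alignment step and the two steps repeated.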
Similar resources
Document image decoding approach to character template estimation
Gary E. Kopec (Xerox Palo Alto Research Center) and Mauricio Lomelin (Microsoft Corp.), November 29, 1995. Abstract: This paper develops an approach to supervised training of character templates from page images and unaligned transcriptions. The template estimation problem is formulated as one of constrained maximum likelihood parameter estimation within the document image decoding...
Supervised Template Estimation for Document Image Decoding
Gary E. Kopec, Member, IEEE, and Mauricio Lomelin, Member, IEEE. July 20, 1997. Abstract: An approach to supervised training of character templates from page images and unaligned transcriptions is proposed. The template training problem is formulated as one of constrained maximum likelihood parameter estimation within the document image decoding framework. This leads to a three-phase iterative tra...
Prototype Extraction and Adaptive OCR
To maintain OCR accuracy with decreasing quality of page image composition, production, and digitization, it is essential to tune the system to each document. We propose a prototype extraction method for document-specific OCR systems. The method automatically generates training samples from unsegmented text images and the corresponding transcripts. It is tolerant of transcription errors, so a ...
Can Wavelet Denoising Improve Motor Unit Potential Template Estimation?
Background: Electromyographic (EMG) signals obtained from a contracted muscle contain valuable information on its activity and health status. Much of this information lies in motor unit potentials (MUPs) of its motor units (MUs), collected during the muscle contraction. Hence, accurate estimation of a MUP template for each MU is crucial. Objective: To investigate the possibility of improv...
Optical Character Recognition from Text Image
Optical Character Recognition (OCR) is a system that provides full alphanumeric recognition of printed or handwritten characters by simply scanning the text image. An OCR system interprets the image of printed or handwritten characters and converts it into a corresponding editable text document. The text image is divided into regions by isolating each line, then individual characters with spaces. Aft...