Document-specific character template estimation

Authors

  • Gary E. Kopec
  • Mauricio Lomelin
Abstract

An approach to supervised training of document-specific character templates from sample page images and unaligned transcriptions is presented. The template estimation problem is formulated as one of constrained maximum likelihood parameter estimation within the document image decoding (DID) framework. This leads to a two-phase iterative training algorithm consisting of transcription alignment and aligned template estimation (ATE) steps. The ATE step is the heart of the algorithm and involves assigning template pixel colors to maximize likelihood while satisfying a template disjointness constraint. The training algorithm is demonstrated on a variety of English documents, including newspaper columns, 15th-century books, degraded images of 19th-century newspapers, and connected text in a script-like font. Three applications enabled by the training procedure are described: high-accuracy document-specific decoding, transcription error visualization, and printer font generation.
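
The ATE step described in the abstract can be illustrated with a small sketch. This is not the paper's algorithm or its likelihood function: the function name `estimate_templates`, the `threshold` parameter, and the score (count of black aligned image pixels minus a per-instance penalty) are all assumptions made for illustration. The sketch only shows the general idea of assigning template pixels greedily while letting each image pixel be claimed by at most one template pixel, which is one simple way to keep overlapping templates disjoint.

```python
def estimate_templates(image, glyphs, width, height, threshold=0.5):
    """Greedy template estimation sketch (hypothetical, for illustration).

    image  : list of rows of 0/1 pixels (1 = black)
    glyphs : list of (label, x, y) aligned glyph origins on the page
    width, height : size of every template canvas
    threshold : assumed per-instance penalty; a template pixel is only
                worth setting if enough aligned image pixels are black
    """
    # Work on a copy so that claimed image pixels can be cleared.
    page = [row[:] for row in image]
    rows, cols = len(page), len(page[0])

    # Group glyph origins by character label.
    origins = {}
    for label, x, y in glyphs:
        origins.setdefault(label, []).append((x, y))

    templates = {
        label: [[0] * width for _ in range(height)] for label in origins
    }

    def covered(label, i, j):
        # Image pixels that template pixel (i, j) lands on, over all instances.
        return [
            (y + j, x + i)
            for x, y in origins[label]
            if 0 <= y + j < rows and 0 <= x + i < cols
        ]

    def score(label, i, j):
        # Assumed score: black aligned pixels minus a per-instance penalty.
        black = sum(page[r][c] for r, c in covered(label, i, j))
        return black - threshold * len(origins[label])

    # Sorted list so that ties are broken deterministically.
    candidates = sorted(
        (label, i, j)
        for label in origins
        for j in range(height)
        for i in range(width)
    )

    while candidates:
        best = max(candidates, key=lambda k: score(*k))
        if score(*best) <= 0:
            break  # no remaining pixel improves the assumed score
        label, i, j = best
        templates[label][j][i] = 1
        # Clear the claimed image pixels so no other template can reuse them.
        for r, c in covered(label, i, j):
            page[r][c] = 0
        candidates.remove(best)

    return templates
```

For example, calling `estimate_templates([[1, 1, 1]], [("a", 0, 0), ("b", 1, 0)], 2, 1)` places two one-row templates whose canvases overlap in the middle column; because the first template claims that column, the second one cannot set the pixel that covers it.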

Similar resources

Document image decoding approach to character template estimation

Gary E. Kopec, Xerox Palo Alto Research Center; Mauricio Lomelin, Microsoft Corp. November 29, 1995. Abstract: This paper develops an approach to supervised training of character templates from page images and unaligned transcriptions. The template estimation problem is formulated as one of constrained maximum likelihood parameter estimation within the document image decoding...

Supervised Template Estimation for Document Image Decoding

Gary E. Kopec, Member, IEEE, and Mauricio Lomelin, Member, IEEE. July 20, 1997. Abstract: An approach to supervised training of character templates from page images and unaligned transcriptions is proposed. The template training problem is formulated as one of constrained maximum likelihood parameter estimation within the document image decoding framework. This leads to a three-phase iterative tra...

Prototype Extraction and Adaptive OCR

To maintain OCR accuracy with decreasing quality of page image composition, production, and digitization, it is essential to tune the system to each document. We propose a prototype extraction method for document-specific OCR systems. The method automatically generates training samples from unsegmented text images and the corresponding transcripts. It is tolerant of transcription errors, so a ...

Can Wavelet Denoising Improve Motor Unit Potential Template Estimation?

Background: Electromyographic (EMG) signals obtained from a contracted muscle contain valuable information on its activity and health status. Much of this information lies in motor unit potentials (MUPs) of its motor units (MUs), collected during the muscle contraction. Hence, accurate estimation of a MUP template for each MU is crucial. Objective: To investigate the possibility of improv...

Optical Character Recognition from Text Image

Optical Character Recognition (OCR) is a system that provides full alphanumeric recognition of printed or handwritten characters by simply scanning the text image. An OCR system interprets the image of printed or handwritten characters and converts it into a corresponding editable text document. The text image is divided into regions by isolating each line, then individual characters with spaces. Aft...

Publication date: 1996