Text Pre-processing and Text Segmentation for OCR
Abstract
Optical Character Recognition (OCR) systems have been developed effectively for the recognition of printed script. The accuracy of an OCR system depends mainly on the text preprocessing and segmentation algorithms it uses. When a document is scanned, it may be placed at an arbitrary angle, and the scanned image then appears skewed by that same angle. This paper presents an algorithm for correcting the skew angle introduced when a text document is scanned, and a novel profile-based method for segmenting printed text that separates the text in a document image into lines, words and characters.
Keywords: Skew correction, Segmentation, Text preprocessing, Horizontal Profile, Vertical Profile.
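The abstract does not spell out the segmentation procedure, so the following is only a minimal sketch of generic projection-profile segmentation, not the paper's own method. It assumes a binarized, already deskewed image stored as a list of rows in which 1 marks an ink pixel; the function names and the `word_gap` threshold are hypothetical choices made for illustration.

```python
def runs(profile, min_gap=1):
    """Return (start, end) spans of non-zero entries in a profile.

    Spans separated by fewer than min_gap zero entries are merged.
    """
    spans, start = [], None
    for i, value in enumerate(profile):
        if value > 0 and start is None:
            start = i
        elif value == 0 and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(profile)))

    merged = []
    for s, e in spans:
        if merged and s - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    return merged


def horizontal_profile(img):
    """Ink count per row; empty rows separate text lines."""
    return [sum(row) for row in img]


def vertical_profile(img, top, bottom):
    """Ink count per column, restricted to rows top..bottom-1."""
    width = len(img[0])
    return [sum(img[r][c] for r in range(top, bottom)) for c in range(width)]


def segment(img, word_gap=2):
    """Split an image into lines, then each line into words and characters."""
    result = []
    for top, bottom in runs(horizontal_profile(img)):
        cols = vertical_profile(img, top, bottom)
        result.append({
            "rows": (top, bottom),
            # Any empty column separates two characters.
            "chars": runs(cols),
            # Only gaps of at least word_gap columns separate two words.
            "words": runs(cols, min_gap=word_gap),
        })
    return result


page = [
    [0, 0, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 1, 0],
    [1, 0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0],
]
lines = segment(page)
```

In this toy page the first text line occupies rows 1 to 2 and holds three one-column characters grouped into two words, because only the two-column gap reaches the assumed `word_gap`. Real scans would likely need noise-tolerant thresholds rather than strict zero tests, and touching characters cannot be split by an empty-column test alone.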
Related resources
Directional Stroke Width Transform to Separate Text and Graphics in City Maps
City maps are among the more complex documents encountered in the real world. In these maps, text labels overlap with graphics and appear in a variety of fonts and styles in different orientations. Usually, text and graphic colours are not predefined, because they vary between map publishers. In most city maps, text and graphic lines form a single connected component. Moreover, the common regions of text and graphic lin...
OCR-Optical Character Recognition
Optical Character Recognition (OCR) is the electronic conversion of images of handwritten, typewritten or printed text into machine-encoded text. It is widely used to recognize and search text in electronic documents or to publish the text on a website. OCR is the machine replication of human reading and has been the subject of intensive research for more than three decades. OCR can be described...
Improved document image segmentation algorithm using multiresolution morphology
Page segmentation into text and non-text components is an essential preprocessing step before OCR operation. If this is not done properly, an OCR classification engine produces garbage text due to the presence of nontext components. This paper describes improvements to the text/image segmentation algorithm described by Bloomberg, which is also available in his open-source Leptonica library. The...
Document Image Dewarping Based on Text Line Detection and Surface Modeling (RESEARCH NOTE)
Document images produced by a scanner or digital camera usually suffer from geometric and photometric distortions, both of which degrade the performance of OCR systems. In this paper, we present a novel method to compensate for undesirable geometric distortions, aiming to improve OCR results. Our methodology is based on finding text lines by dynamic local connectivity map and then applying a l...
A Morphological Image Preprocessing Suite for OCR on Natural Scene Images
As demand grows for mobile applications, research in optical character recognition (OCR), a technology well-developed for document imaging, is shifting focus to the recognition of text embedded in digital photographs or video. Segmenting text and background in natural scenes is a difficult classification problem, and the accuracy of this segmentation is of utmost importance when the output of a...