A Survey on Various OCR Errors
Abstract
Research on correcting words in OCR text has centered mainly on (1) non-word errors, (2) isolated-word error correction, and (3) context-dependent word correction. Various techniques have been developed for each. This paper surveys techniques for correcting these errors and assesses which of them perform better.
General Terms: Optical Character Recognition, Natural Language Processing
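To make the three categories named in the abstract concrete, here is a minimal sketch and not a method taken from the surveyed paper. It assumes a toy lexicon (`LEXICON`), plain Levenshtein distance as the similarity measure, and a hypothetical `max_distance` threshold; all function names are illustrative. A token missing from the lexicon is flagged as a non-word error, and isolated-word correction picks the closest lexicon entry without looking at neighboring words.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions cost 1 each."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Toy lexicon, assumed for illustration only.
LEXICON = {"the", "modern", "arm", "art", "of", "war"}


def is_non_word(token: str, lexicon=LEXICON) -> bool:
    """Non-word error detection: the token is not in the lexicon."""
    return token.lower() not in lexicon


def correct_isolated(token: str, lexicon=LEXICON, max_distance: int = 2) -> str:
    """Isolated-word correction: nearest lexicon entry, ignoring context."""
    if not is_non_word(token, lexicon):
        return token  # valid word; a real-word error would pass through here
    # Ties on distance are broken alphabetically by the tuple comparison.
    distance, word = min((edit_distance(token.lower(), w), w) for w in lexicon)
    return word if distance <= max_distance else token


if __name__ == "__main__":
    for token in ["tbe", "rnodern", "arm"]:
        print(token, is_non_word(token), correct_isolated(token))
```

In this sketch, "rnodern" (a typical "rn" for "m" confusion) is caught because it is not in the lexicon, whereas "arm" misrecognized for "art" is a valid word and passes unchanged. Catching that kind of real-word error is what context-dependent correction addresses, and it needs information about surrounding words that this example deliberately leaves out.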
Similar resources
Problems and Review of Line Segmentation of Handwritten Text Document
Optical character recognition (OCR) has been a very popular research area since the 1950s. Many people have done a lot of work on various scripts. Line segmentation is a very important step in OCR, as the accuracy of the recognition algorithm depends heavily on correct line segmentation. Incorrect line segmentation not only decreases accuracy but may also lead to other errors. The objective of...
Strategies for Reducing and Correcting OCR Errors
In this paper we describe our efforts in reducing and correcting OCR errors in the context of building a large multilingual heritage corpus of Alpine texts which is based on digitizing the publications of various Alpine clubs. We have already digitized the yearbooks of the Swiss Alpine Club from its start in 1864 until 1995 with more than 75,000 pages resulting in 29 million running words. Sinc...
A Study of Style Effects on OCR Errors in the MEDLINE Database
The National Library of Medicine has developed a system for the automatic extraction of data from scanned journal articles to populate the MEDLINE database. Although the 5-engine OCR system used in this process exhibits good performance overall, it does make errors in character recognition that must be corrected in order for the process to achieve the requisite accuracy. The correction process ...
Study of style effects on OCR errors in the MEDLINE database
The National Library of Medicine has developed a system for the automatic extraction of data from scanned journal articles to populate the MEDLINE database. Although the 5-engine OCR system used in this process exhibits good performance overall, it does make errors in character recognition that must be corrected in order for the process to achieve the requisite accuracy. The correction process ...
Evaluating supervised topic models in the presence of OCR errors
Supervised topic models are promising tools for text analytics that simultaneously model topical patterns in document collections and relationships between those topics and document metadata, such as timestamps. We examine empirically the effect of OCR noise on the ability of supervised topic models to produce high quality output through a series of experiments in which we evaluate three superv...