A Semi-automatic Adaptive OCR for Digital Libraries
Authors
Abstract
This paper presents a novel approach to designing a semi-automatic, adaptive OCR system for large document image collections in digital libraries. We describe and implement an interactive system that continuously improves OCR results. The applicability of the design to the recognition of Indian languages is demonstrated. Recognition errors are used to retrain the OCR so that it adapts and learns, improving its accuracy. Limited human intervention is allowed for evaluating the system's output and taking corrective action during the recognition process.
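The feedback loop described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the paper's method: the toy nearest-centroid classifier stands in for a real OCR engine, and the names `NearestCentroidOCR`, `adaptive_pass`, and `reviewer` are hypothetical. The sketch only shows the general pattern of recognizing, collecting human corrections, and retraining on the errors.

```python
from collections import defaultdict


class NearestCentroidOCR:
    """Toy stand-in for an OCR classifier (assumption, not the paper's model)."""

    def __init__(self):
        self.samples = defaultdict(list)  # label -> list of feature vectors
        self.centroids = {}               # label -> mean feature vector

    def train(self, labelled):
        """Add (features, label) pairs and recompute every centroid."""
        for features, label in labelled:
            self.samples[label].append(features)
        for label, vectors in self.samples.items():
            n = len(vectors)
            self.centroids[label] = [sum(col) / n for col in zip(*vectors)]

    def recognize(self, features):
        """Return the label whose centroid is closest (squared Euclidean)."""
        def dist(label):
            centroid = self.centroids[label]
            return sum((a - b) ** 2 for a, b in zip(features, centroid))
        return min(self.centroids, key=dist)


def adaptive_pass(ocr, page, reviewer):
    """Recognize a page, collect reviewer corrections, retrain on the errors.

    `reviewer(features, guess)` is a hypothetical callback that returns the
    true label; it models the limited human intervention step.
    """
    errors = []
    for features in page:
        guess = ocr.recognize(features)
        truth = reviewer(features, guess)
        if truth != guess:
            errors.append((features, truth))
    if errors:
        ocr.train(errors)  # recognition errors become new training data
    return len(errors)


if __name__ == "__main__":
    ocr = NearestCentroidOCR()
    ocr.train([((0, 0), "a"), ((10, 10), "b")])
    truth = {(4, 4): "b", (6, 6): "b", (1, 1): "a"}
    page = list(truth)
    for round_number in (1, 2):
        n = adaptive_pass(ocr, page, lambda f, g: truth[f])
        print(f"pass {round_number}: {n} error(s)")
```

In this toy setup the number of errors drops on the second pass because the corrected glyph has shifted its class centroid. A real system would replace the classifier and the feature vectors with an actual OCR engine and image features.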
Similar resources
Adaptive detection of missed text areas in OCR outputs: application to the automatic assessment of OCR quality in mass digitization projects
The French National Library (BnF) has launched many mass digitization projects in order to give access to its collection. Digital documents on Gallica (the digital library of the BnF) are indexed through their textual content, which is obtained from service providers that use Optical Character Recognition (OCR) software. OCR software packages have become increasingly complex systems composed of ...
Performance Characterization and Parallelization of Tesseract Optical Character Recognition on Multicore Architectures
Optical Character Recognition, or OCR, is one of the major topics in computer vision technology. It is widely used in various applications, such as digital libraries, automatic banking systems, and mailing services. The Tesseract OCR Engine, which we evaluate in this paper, is one of the best-known OCR programs. It was originally developed by Hewlett Packard Lab between 1985 and 1995, and has been main...
On Automatic Similarity Linking in Digital Libraries
Hypertext links are a powerful extension of standard information retrieval techniques based on query languages. However, generating links is often impractical because of the large manual and/or computational effort involved. In this paper, we analyze the effects of two main approaches that aim to reduce the necessary effort: the direct use of OCR-processed documents instead of manually post-pr...
Video OCR for Video Indexing
OCR is a technique that can greatly help to locate topics of interest in video through the automatic extraction and reading of captions and annotations. Text in video can provide key indexing information, so recognizing such text is critical for search applications. A major difficulty in character recognition for video is degraded and deformed characters, low-resolution characters, or very ...
2 Toshio
The automatic extraction and recognition of news captions and annotations can be of great help in locating topics of interest in digital news video libraries. To achieve this goal, we present a technique, called Video OCR (Optical Character Reader), which detects, extracts, and reads text areas in digital video data. In this paper, we address problems, describe the method by which Video OCR operat...