A Survey of Retrieval Strategies for OCR Text Collections
Abstract
The importance of effectively retrieving OCR text has grown significantly in recent years. We provide a brief overview of work done to improve the effectiveness of retrieval of OCR text.
Similar Resources
Length Normalization in Degraded Text Collections
Optical character recognition (OCR) is the most commonly used technique to convert printed material into electronic form. Using OCR, large repositories of machine readable text can be created in a short time. An information retrieval system can then be used to search through large information bases thus created. Many information retrieval systems use sophisticated term weighting functions to im...
Querying Short OCR'd Documents
Studies have shown that OCR errors have little effect on average precision for full text collections. A question that was left unanswered from these studies was how OCR errors would affect short document collections. This issue was examined in this study using documents consisting of only titles and abstracts. The results of our experimentation are presented in this paper.
Retrieving OCR Text: A Survey of Current Approaches
The importance of effectively retrieving OCR text has grown significantly in recent years. We provide a brief overview of work done to improve the effectiveness of retrieval of OCR text.
Effects of OCR Errors on Ranking and Feedback Using the Vector Space Model
We report on the performance of the vector space model in the presence of OCR errors. We show that average precision and recall are not affected for our full text document collection when the OCR version is compared to its corresponding corrected set. We do, however, see divergence between the relevant document rankings of the OCR and corrected collections with different weighting combinations. In ...
OCR correction based on document level knowledge
For over 10 years, the Information Science Research Institute (ISRI) at UNLV has worked on problems associated with the electronic conversion of archival document collections. Such collections typically have a large fraction of poor quality images and present a special challenge to OCR systems. Frequently, because of the size of the collection, manual correction of the output is not affordable....