Probabilistic Retrieval Methods for Text with Miss-Recognized OCR Characters

Authors

  • Manabu Ohta
  • Atsuhiro Takasu
  • Jun Adachi
Abstract

This paper presents two probabilistic text retrieval methods specifically designed to carry out a full-text search of Japanese documents containing OCR errors. By searching for any query term under the premise that errors exist in the recognized text, the presented methods can tolerate such errors, and therefore manual post-editing is not required after OCR recognition. In the applied approach, confusion matrices are used to store (i) the characters that are likely to be substituted when a particular character is misrecognized, and (ii) the probability of each such substitution. Multiple search terms are generated for an input query term by referencing these matrices, after which a full-text search is applied for each search term. The validity of retrieved terms is determined based on the error-occurrence probabilities, and those with a validity value greater than a certain threshold are judged to satisfy the input query. In addition, method performance is experimentally evaluated by determining retrieval effectiveness, i.e., by calculating recall and precision rates. Results indicate marked improvement in comparison with exact matching.
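The pipeline described in the abstract (expand the query through a confusion matrix, search for each variant, keep matches whose validity exceeds a threshold) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the matrix contents, the function names `expand_term` and `search`, and the choice to score validity as the product of per-character probabilities are all assumptions, and Latin characters stand in for Japanese ones.

```python
from itertools import product

# Hypothetical confusion matrix: true character -> {recognized character: probability}.
# The values are made up for illustration; they are not taken from the paper.
CONFUSION = {
    "l": {"l": 0.90, "1": 0.10},
    "o": {"o": 0.80, "0": 0.20},
}

def expand_term(term, confusion):
    """Return {variant: probability} for each way OCR might have rendered `term`."""
    # Characters absent from the matrix are assumed to be recognized correctly.
    per_char = [confusion.get(ch, {ch: 1.0}).items() for ch in term]
    variants = {}
    for combo in product(*per_char):
        variant = "".join(ch for ch, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p  # assumed: character errors are independent
        variants[variant] = variants.get(variant, 0.0) + prob
    return variants

def search(text, term, confusion, threshold):
    """Return sorted (position, matched string, validity) for variants above threshold."""
    hits = []
    for variant, validity in expand_term(term, confusion).items():
        if validity <= threshold:
            continue  # variant judged too unlikely to satisfy the query
        start = text.find(variant)
        while start != -1:
            hits.append((start, variant, validity))
            start = text.find(variant, start + 1)
    return sorted(hits)
```

Calling `search("a po1l and a p0ll", "poll", CONFUSION, 0.05)` would match both corrupted spellings, while raising the threshold to 0.1 keeps only the more probable `p0ll`. The threshold therefore trades recall against precision, which is the trade-off the abstract's evaluation measures.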

Similar resources

Retrieval methods for English-text with missrecognized OCR characters

This paper presents three probabilistic text retrieval methods designed to carry out a full-text search of English documents containing OCR errors. By searching for any query term on the premise that there are errors in the recognized text, the methods presented can tolerate such errors, and therefore costly manual postediting is not required after OCR recognition. In the applied approach, conf...

OCR for Handwritten Kannada Language Script

Optical character recognition (OCR) is the process of converting a scanned image of text into a computer-editable format. The proposed OCR system is for complex handwritten Kannada characters. One of the major challenges faced by a Kannada OCR system is the recognition of handwritten text from an image. The input text image is subjected to preprocessing and then converted into a binary image. Segment...

Optical Font Recognition from Projection Profiles

  • Recognition of logical document structures [1], where knowledge of the font used in a word, line, or text block may be useful for defining its logical label (chapter title, section title or paragraph).
  • Document reproduction, where knowledge of the font is necessary in order to reproduce (reprint) the document.
  • Document indexing and information retrieval, where word indexes are generally p...

Fusion of Multiple Corrupted Transmissions and its effect on Information Retrieval

Much previous work has focused on correction of OCR degraded text with little work addressing the possibility of fusing the generated text from different OCR systems, which are assumed to produce different types of errors. This paper explores text fusion, which involves the use of language modeling to determine which OCR system (if any) properly recognized individual words. The technique was ap...

Report on the TREC-5 Confusion Track

For TREC-5, retrieval from corrupted data was studied through retrieval of single target documents from a corpus which was corrupted by producing page images, corrupting the bit maps, and applying OCR techniques to the results. In general, methods which attempted a probabilistic estimation of the original clean text fare better than methods which simply accept corrupted versions of the query text.

Publication date: 1996