Information retrieval for OCR documents: a content-based probabilistic correction model

نویسندگان

  • Rong Jin
  • ChengXiang Zhai
  • Alexander G. Hauptmann
چکیده

The difficulty with information retrieval for OCR documents lies in the fact that OCR documents comprise of a significant amount of erroneous words and unfortunately most information retrieval techniques rely heavily on word matching between documents and queries. In this paper, we propose a general content-based correction model that can work on top of an existing OCR correction tool to “boost” retrieval performance. The basic idea of this correction model is to exploit the whole content of a document to supplement any other useful information provided by an existing OCR correction tool for word corrections. Instead of making an explicit correction decision for each erroneous word as typically done in a traditional approach, we consider the uncertainties in such correction decisions and compute an estimate of the original “uncorrupted” document language model accordingly. The document language model can then be used for retrieval with a language modeling retrieval approach. Evaluation using the TREC standard testing collections indicates that our method significantly improves the performance compared with simple word correction approaches such as using only the top ranked correction.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A Content-based Probabilistic Correction Model for OCR Document Retrieval

The difficulty with information retrieval for OCR documents lies in the fact that OCR documents comprise of a significant amount of erroneous words and unfortunately most information retrieval techniques rely heavily on word matching between documents and queries. In this paper, we propose a general content-based correction model that can work on top of an existing OCR correction tool to “boost...

متن کامل

Retrieving Arabic Printed Document: a Survey

This paper surveys some of the literature pertaining to searching and retrieving OCR’ed printed documents with emphasis on Arabic documents. It examines peculiarities of Arabic morphology, orthography, retrieval, word clustering, display, OCR, and error correction. The paper surveys existing evaluation test-beds for retrieval of Arabic OCR texts. Lastly, it concludes with possible directions fo...

متن کامل

Probabilistic Logical Information Retrieval for Content, Hypertext, and Database Querying

Classical retrieval models support content-oriented searching for documents using a set of words as data model. However, in hypertext and database applications we want to consider the link structure and attribute values of documents in addition to the pure content. In this paper, we present a framework based on probabilistic logical retrieval for describing the retrieval function for a query wh...

متن کامل

Retrieving Images of Scanned Text Documents

Information retrieval is the task of nding documents, usually text, which are relevant to a user's information need. A conventional approach to information management of paper documents is normally based on classifying them into a hierarchical classiication structure. More recently we have seen electronic document management systems which manage scanned images of documents in the same way as pa...

متن کامل

A Survey on Various Word Spotting Techniques for Content Based Document Image Retrieval

Searching documents for information and retrieval of relevant documents is a basic activity. Various tools are readily available for searching and retrieval from digital documents, but not much robust methods are available for retrieval from historic documents and old manuscripts as they are not digitized but available in scanned formats. Conventional way of retrieval from scanned document imag...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2003