Word Spotting: Indexing Handwritten Archives
Abstract
Many historical manuscripts written in a single hand would be useful to index. Examples include the early Presidential papers at the Library of Congress and the collected works of W. E. B. Du Bois at the library of the University of Massachusetts. The standard technique for indexing documents is to scan them, convert them to machine-readable form (ASCII) using Optical Character Recognition (OCR), and then index them using a text retrieval engine. However, OCR does not work well on handwriting. Here, an alternative scheme is proposed for indexing such texts. Each page of the document is segmented into words. The images of the words are then matched against each other to create equivalence classes (each equivalence class contains multiple instances of the same word). The user then provides ASCII equivalents for, say, the top 2000 equivalence classes. The current paper deals with the matching aspects of this process. Due to variations in even a single person's handwriting, it is expected that the matching will be the most difficult step in the whole process. Two different techniques for matching words are discussed. The first method, based on Euclidean distance mapping, matches words assuming that the transformation between the words may be modelled by a translation (shift). The second method, based on an algorithm developed by Scott and Longuet-Higgins, matches words assuming that the transformation between the words may be modelled by an affine transform. Experiments are shown demonstrating the feasibility of the approach for indexing handwriting.
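The translation-only matching step and the grouping into equivalence classes can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes binary word images of equal size (1 = ink, 0 = background), reads "Euclidean distance mapping" as weighting each mismatching pixel of the XOR image by its distance to the nearest matching pixel, and groups words greedily against the first member of each class. The names `xor_shifted`, `edm_error`, `match_error`, and `group_words`, the `max_shift` range, and the `threshold` value are all hypothetical.

```python
from math import hypot
from typing import List

Image = List[List[int]]  # binary image: 1 = ink, 0 = background (assumed format)


def xor_shifted(a: Image, b: Image, dx: int, dy: int) -> Image:
    """XOR image a with image b translated by (dx, dy); pixels outside b count as background."""
    h, w = len(a), len(a[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            by, bx = y - dy, x - dx
            inside = 0 <= by < len(b) and 0 <= bx < len(b[0])
            out[y][x] = a[y][x] ^ (b[by][bx] if inside else 0)
    return out


def edm_error(xor: Image) -> float:
    """Sum, over mismatching pixels, of the distance to the nearest matching pixel.

    Pixels outside the frame are treated as matching, so isolated mismatches
    cost little while large mismatching blobs cost a lot (brute force, small images only).
    """
    h, w = len(xor), len(xor[0])
    zeros = [(y, x) for y in range(h) for x in range(w) if xor[y][x] == 0]
    total = 0.0
    for y in range(h):
        for x in range(w):
            if xor[y][x]:
                border = min(x + 1, y + 1, w - x, h - y)
                nearest = min((hypot(y - zy, x - zx) for zy, zx in zeros), default=border)
                total += min(nearest, border)
    return total


def match_error(a: Image, b: Image, max_shift: int = 1) -> float:
    """Smallest error over all translations of b by at most max_shift pixels."""
    shifts = range(-max_shift, max_shift + 1)
    return min(edm_error(xor_shifted(a, b, dx, dy)) for dy in shifts for dx in shifts)


def group_words(words: List[Image], threshold: float) -> List[List[int]]:
    """Greedy grouping: a word joins the first class whose representative it matches."""
    classes: List[List[int]] = []
    for i, word in enumerate(words):
        for cls in classes:
            if match_error(words[cls[0]], word) <= threshold:
                cls.append(i)
                break
        else:
            classes.append([i])
    return classes


a = [[0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
b = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 0, 0]]  # a shifted one pixel to the right
c = [[1, 0, 0, 1], [0, 0, 0, 0], [1, 1, 1, 1]]  # a different shape
print(group_words([a, b, c], threshold=1.0))
```

In this toy example the shifted copy `b` lands in the same class as `a`, while `c` forms its own class. Real word images differ in size and slant, which is the case the affine method in the abstract is meant to address and which this sketch does not cover.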
Similar resources
Connected Component Based Word Spotting on Persian Handwritten image documents
Word spotting makes unindexed image documents searchable by locating a word or words in a document image, given a query word. This problem is challenging, mainly due to the large number of word classes with very small inter-class and substantial intra-class distances. In this paper, a segmentation-based word spotting method is presented for multi-writer Persian handwritten doc-...
Indexing and Retrieval of On-line Handwritten Documents
Recent advances in on-line data capturing technologies and their widespread deployment in devices like PDAs and notebook PCs are creating large amounts of handwritten data that need to be archived and retrieved efficiently. Word-spotting, which is based on a direct comparison of a handwritten keyword to words in the document, is commonly used for indexing and retrieval. We propose a string matchin...
Script Independent Word Spotting in Multilingual Documents
This paper describes a method for script independent word spotting in multilingual handwritten and machine printed documents. The system accepts a query in the form of text from the user and returns a ranked list of word images from a document image corpus based on similarity with the query word. The system is divided into two main components. The first component, known as the Indexer, performs indexi...
Scale space technique for word segmentation in handwritten manuscripts
Indexing large archives of historical manuscripts is required to allow rapid perusal by scholars and researchers who wish to consult the original manuscripts. However, automatic conversion of handwritten manuscripts to digital form allowing efficient storage and retrieval of the original documents is a challenging problem. Word spotting is a scheme to index such data. The important steps in this ...
Word Spotting in Cursive Handwritten Documents Using Modified Character Shape Codes
There is a large collection of handwritten English paper documents of historical and scientific importance. But paper documents cannot be recognised directly by a computer. Hence the closest way of indexing these documents is to store their digital document images. Hence a large database of document images can replace the paper documents. But the document and data corresponding to each image canno...
A survey of document image word spotting techniques
Vast collections of documents available in image format need to be indexed for information retrieval purposes. In this framework, word spotting is an alternative solution to optical character recognition (OCR), which is rather inefficient for recognizing text of degraded quality and unknown fonts usually appearing in printed text, or writing style variations in handwritten documents. Over the p...