Advanced Training Set Construction for Retrieval in Historic Documents
نویسندگان
چکیده
Retrieval in historic documents with non-standard spelling requires a mapping from search terms onto the historic terms in the document. For describing this mapping, we have developed a rule-based approach. The bottleneck of this method has been the training set construction for the algorithm where an expert has to assign manually current word forms to historic spelling variants. As a better solution, we apply a spell checker on a corpus of historic texts, which gives us a list of candidate terms and associated suggestions. The new method generates possible rules for the suggestions and accepts the most frequent rules. Experimental results with German and English texts from different centuries demonstrate the feasibility of our approach. Thus a training set can be constructed with much less initial effort.
منابع مشابه
A Cross-Language Approach to Historic Document Retrieval
Our cultural heritage, as preserved in libraries, archives and museums, is made up of documents written many centuries ago. Largescale digitization initiatives make these documents available to nonexpert users through digital libraries and vertical search engines. For a user, querying a historic document collection may be a disappointing experience: queries involving modern words may not be ver...
متن کاملA text mining approach for automatic construction of hypertexts
The research on automatic hypertext construction emerges rapidly in the last decade because there exists a urgent need to translate the gigantic amount of legacy documents into web pages. Unlike traditional ‘flat’ texts, a hypertext contains a number of navigational hyperlinks that point to some related hypertexts or locations of the same hypertext. Traditionally, these hyperlinks were construc...
متن کاملUsing Fuzzy LR Numbers in Bayesian Text Classifier for Classifying Persian Text Documents
Text Classification is an important research field in information retrieval and text mining. The main task in text classification is to assign text documents in predefined categories based on documents’ contents and labeled-training samples. Since word detection is a difficult and time consuming task in Persian language, Bayesian text classifier is an appropriate approach to deal with different...
متن کاملA Survey on Various Word Spotting Techniques for Content Based Document Image Retrieval
Searching documents for information and retrieval of relevant documents is a basic activity. Various tools are readily available for searching and retrieval from digital documents, but not much robust methods are available for retrieval from historic documents and old manuscripts as they are not digitized but available in scanned formats. Conventional way of retrieval from scanned document imag...
متن کاملLearning from Relevant Documents in Large Scale Routing Retrieval
The normal practice of selecting relevant documents for training routing queries is to either use all relevants or the 'best n' of them after a (retrieval) ranking operation with respect to each query. Using all relevants can introduce noise and ambiguities in training because documents can be long with many irrelevant portions. Using only the 'best n' risks leaving out documents that do not re...
متن کامل