Revisiting N-Gram Based Models for Retrieval in Degraded Large Collections
Abstract
Traditional retrieval models based on term matching are not effective on collections of degraded documents (for instance, the output of OCR or ASR systems). This paper presents an n-gram based distributed model for retrieval on large collections of degraded text. Evaluation was carried out on both the TREC Confusion Track and Legal Track collections, showing that the presented approach is more effective than the classical term-centred approach and most of the participating systems in the TREC Confusion Track.
Similar resources
Emphasizing the Need for TREC-like Collaboration Towards MIR Evaluation
The need for standardized large-scale evaluation of music information retrieval (MIR) and music digital library (MDL) methodologies is being addressed with the recent resolution calling for the construction of the infrastructure necessary to support MIR/MDL research. The methodology of our MIR study investigating the use of n-grams for polyphonic music retrieval has been based on a small-scale ...
Fast Database Indexing for Large Protein Sequence Collections Using Parallel N-Gram Transformation Algorithm
With rapid development in the life sciences and the flood of genomic information, the need for faster and more scalable search methods has become urgent. One of the approaches that has been investigated is indexing. Indexing methods fall into three categories: length-based index algorithms, transformation-based algorithms and mixed techniques-based alg...
Probabilistic Retrieval of OCR Degraded Text Using N-Grams
The retrieval of OCR degraded text using n-gram formulations within a probabilistic retrieval system is examined in this paper. Direct retrieval of documents using n-gram databases of 2 and 3-grams or 2, 3, 4 and 5-grams resulted in improved retrieval performance over standard (word based) queries on the same data when a level of 10 percent degradation or worse was achieved. A secon...
Multi Word Term queries for focused Information Retrieval
In this paper, we address both standard and focused retrieval tasks based on comprehensible language models and interactive query expansion (IQE). Query topics are expanded using an initial set of Multi Word Terms (MWTs) selected from top n ranked documents. MWTs are special text units that represent domain concepts and objects. As such, they can better represent query topics than ordinary phra...
Probabilistic Retrieval of OCR
The retrieval of OCR degraded text using n-gram formulations within a probabilistic retrieval system is examined in this paper. Direct retrieval of documents using n-gram databases of 2 and 3-grams or 2, 3, 4 and 5-grams resulted in improved retrieval performance over standard (word based) queries on the same data when a level of 10 percent degradation or worse was achieved. A second method of ...