A comparison of sub-word indexing methods for information retrieval
نویسنده
چکیده
This paper compares different methods of subword indexing and their performance on the English and German domain-specific document collection of the Cross-language Evaluation Forum (CLEF). Four major methods to index sub-words are investigated and compared to indexing stems: 1) sequences of vowels and consonants, 2) a dictionary-based approach for decompounding, 3) overlapping character n-grams, and 4) Knuth’s algorithm for hyphenation. The performance and effects of sub-word extraction on search time and index size and time are reported for English and German retrieval experiments. The main results are: For English, indexing sub-words does not outperform the baseline using standard retrieval on stemmed word forms (–8% mean average precision (MAP), – 11% geometric MAP (GMAP), +1% relevant and retrieved documents (rel ret) for the best experiment). For German, with the exception of n-grams, all methods for indexing sub-words achieve a higher performance than the stemming baseline. The best performing sub-word indexing methods are to use consonant-vowelconsonant sequences and index them together with word stems (+17% MAP, +37% GMAP, +14% rel ret compared to the baseline), or to index syllable-like sub-words obtained from the hyphenation algorithm together with stems (+9% MAP, +23% GMAP, +11% rel ret).
منابع مشابه
Content Based Radiographic Images Indexing and Retrieval Using Pattern Orientation Histogram
Introduction: Content Based Image Retrieval (CBIR) is a method of image searching and retrieval in a database. In medical applications, CBIR is a tool used by physicians to compare the previous and current medical images associated with patients pathological conditions. As the volume of pictorial information stored in medical image databases is in progress, efficient image indexing and retri...
متن کاملWord and Sub-word Indexing Approaches for Reducing the Effects of OOV Queries on Spoken Audio
We explore the problem of out of vocabulary (OOV) queries in audio indexing systems by comparing three indexing methods on a broadcast news repository containing 75 hours of audio. Our systems are word-based, phoneme-based and a novel system based on syllable-like units called particles. To better examine the performance of these three approaches we use a query set where the percentage of OOVs ...
متن کاملبررسی تأثیرات ریشهیابی در بازیابی اطلاعات در زبان فارسی
Using the language-specific behavior in information retrieval systems can improve the quality of the retrieved results significantly. Part of the word that remains after removing its affixes is called stem. Stemming process can be used for improving the relevancy of the results in information retrieval system. Different morphological variants of words (plural, past tense…) will be mapped into t...
متن کاملFast decoding for open vocabulary spoken term detection
Information retrieval and spoken-term detection from audio such as broadcast news, telephone conversations, conference calls, and meetings are of great interest to the academic, government, and business communities. Motivated by the requirement for high-quality indexes, this study explores the effect of using both word and sub-word information to find in-vocabulary and OOV query terms. It also ...
متن کاملEvaluation of Stop Word Lists in Chinese Language
In modern information retrieval systems, effective indexing can be achieved by removal of stop words. Till now many stop word lists have been developed for English language. However, no standard stop word list has been constructed for Chinese language yet. With the fast development of information retrieval in Chinese language, exploring the evaluation of Chinese stop word lists becomes critical...
متن کامل