Examining the Content Load of Part of Speech Blocks for Information Retrieval
نویسندگان
چکیده
We investigate the connection between part of speech (POS) distribution and content in language. We define POS blocks to be groups of parts of speech. We hypothesise that there exists a directly proportional relation between the frequency of POS blocks and their content salience. We also hypothesise that the class membership of the parts of speech within such blocks reflects the content load of the blocks, on the basis that open class parts of speech are more content-bearing than closed class parts of speech. We test these hypotheses in the context of Information Retrieval, by syntactically representing queries, and removing from them content-poor blocks, in line with the aforementioned hypotheses. For our first hypothesis, we induce POS distribution information from a corpus, and approximate the probability of occurrence of POS blocks as per two statistical estimators separately. For our second hypothesis, we use simple heuristics to estimate the content load within POS blocks. We use the Text REtrieval Conference (TREC) queries of 1999 and 2000 to retrieve documents from the WT2G and WT10G test collections, with five different retrieval strategies. Experimental outcomes confirm that our hypotheses hold in the context of Information Retrieval.
منابع مشابه
Content Based Radiographic Images Indexing and Retrieval Using Pattern Orientation Histogram
Introduction: Content Based Image Retrieval (CBIR) is a method of image searching and retrieval in a database. In medical applications, CBIR is a tool used by physicians to compare the previous and current medical images associated with patients pathological conditions. As the volume of pictorial information stored in medical image databases is in progress, efficient image indexing and retri...
متن کاملA syntactically-based query reformulation technique for information retrieval
Whereas in language words of high frequency are generally associated with low content [Bookstein, A., & Swanson, D. (1974). Probabilistic models for automatic indexing. Journal of the American Society of Information Science, 25(5), 312–318; Damerau, F. J. (1965). An experiment in automatic indexing. American Documentation, 16, 283–289; Harter, S. P. (1974). A probabilistic approach to automatic...
متن کاملبررسی تأثیرات ریشهیابی در بازیابی اطلاعات در زبان فارسی
Using the language-specific behavior in information retrieval systems can improve the quality of the retrieved results significantly. Part of the word that remains after removing its affixes is called stem. Stemming process can be used for improving the relevancy of the results in information retrieval system. Different morphological variants of words (plural, past tense…) will be mapped into t...
متن کاملLight Syntactically-Based Index Pruning for Information Retrieval
Most index pruning techniques eliminate terms from an index on the basis of the contribution of those terms to the content of the documents. We present a novel syntactically-based index pruning technique, which uses exclusively shallow syntactic evidence to decide upon which terms to prune. This type of evidence is document-independent, and is based on the assumption that, in a general collecti...
متن کاملAn Efficient Secret Sharing-based Storage System for Cloud-based Internet of Things
Internet of things (IoTs) is the newfound information architecture based on the internet that develops interactions between objects and services in a secure and reliable environment. As the availability of many smart devices rises, secure and scalable mass storage systems for aggregate data is required in IoTs applications. In this paper, we propose a new method for storing aggregate data in Io...
متن کامل