Developing an Automatic Linguistic Truncation Operator for Best-match Retrieval of Finnish in Inflected Word Form Text Database Indexes
Abstract
The paper presents a new method for handling morphological variation of query terms in best-match IR. The method is based on enhanced inflectional stems. Inflectional stems have earlier been shown to be a good retrieval method in inflected indexes in a best-match environment for Finnish, a highly inflected and compound-rich language. In this paper the earlier stem method is elaborated by enhancing the stems with regular expressions. Contrary to our expectations, the results show that the enhanced stem queries do not outperform basic inflectional stems, but neither are they considerably worse with long queries. With short web-like queries they perform relatively better than with long queries and clearly outperform both stemming (the Finnish stemmer of Snowball) and plain, unprocessed query words. The main benefits of the proposed method, besides fairly good precision and recall (P-R) performance, are shorter and more manageable queries, which is of practical importance, e.g. with large web ...
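The abstract does not give the actual stem patterns, so the following is only a minimal sketch of the general idea: a query word is replaced by a regular expression built from its inflectional stems, and that expression is expanded against the inflected word forms in the index. The function name `expand_stem_pattern`, the toy vocabulary, and both patterns are assumptions made for illustration; they are not taken from the paper.

```python
import re


def expand_stem_pattern(pattern, index_terms):
    """Return the index terms that fully match a stem pattern (hypothetical helper)."""
    compiled = re.compile(pattern)
    return sorted(term for term in index_terms if compiled.fullmatch(term))


# Toy inflected word form index (hypothetical vocabulary).
# The first six terms are inflected forms of Finnish "käsi" (hand).
INDEX_TERMS = {
    "käsi", "käden", "kädessä", "käteen", "kättä", "käsissä",
    "käsitellä",  # a different word ("to handle") sharing the prefix "käsi"
    "talo",       # unrelated word ("house")
}

# Plain right truncation of the base form: misses forms built on other stems
# (käde-, käte-, kätt-) and also picks up the unrelated "käsitellä".
PLAIN_TRUNCATION = r"käsi.*"

# Hypothetical enhanced stem: the stem variants are merged into one
# alternation and the ending is limited to at most three letters.
ENHANCED_STEM = r"kä(si|de|te|tt)[a-zäö]{0,3}"

if __name__ == "__main__":
    print(expand_stem_pattern(PLAIN_TRUNCATION, INDEX_TERMS))
    print(expand_stem_pattern(ENHANCED_STEM, INDEX_TERMS))
```

In this toy setting a single pattern covers all six forms of "käsi" while excluding "käsitellä", which illustrates how one compact query term can stand in for a list of separate stems. The length limit on the ending is a design choice of this sketch only.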
Similar resources
Developing an automatic linguistic truncation operator for best-match retrieval of Finnish in inflected word form text database indexes
The paper presents a new method for handling morphological variation of query terms in best-match IR. The method is based on enhanced inflectional stems. Inflectional stems have earlier been shown to be a good retrieval method in inflected indexes in a best-match environment for Finnish, a highly inflected and compound-rich language. In this paper the earlier stem method is elaborated ...
Rule-based Automatic Multi-word Term Extraction and Lemmatization
In this paper we present a rule-based method for multi-word term extraction that relies on extensive lexical resources in the form of electronic dictionaries and finite-state transducers for modelling various syntactic structures of multi-word terms. The same technology is used for lemmatization of extracted multi-word terms, which is unavoidable for highly inflected languages in order to pass ...
Eija Airio: The Effects of Separate and Merged Indexes and Word Normalization in Multilingual CLIR
Multilingual IR may be performed in two environments: there may exist a separate index for each target language, or all the languages may be indexed in a merged index. In the first case, retrieval must be performed separately in each index, after which the result lists have to be merged. In the case of the merged index, there are two alternatives: either to perform retrieval with a merged query...
Using Linguistic Knowledge in Information Retrieval Technical Report
The current practice in Information Retrieval is largely based on statistical techniques. These techniques are reasonably successful but many researchers believe that statistical techniques have reached their upper bound. Some recent research in IR is aimed at investigating whether Natural Language Processing techniques can be used to improve the performance of existing retrieval strategies. In...
Improving the Automatic Retrieval of Text Documents
This paper reports on a statistical stemming algorithm based on link analysis. Considering that a word is formed by a prefix (stem) and a suffix, the key idea is that the interlinked prefixes and suffixes form a community of sub-strings. Thus, discovering these communities means searching for the best word splits that give the best word stems. The algorithm has been used in our participation in...