Stemming and n-grams in Spanish: an evaluation of their impact on information retrieval

Authors

  • Carlos G. Figuerola
  • Raquel Gómez Díaz
  • Eva López de San Román
Abstract

At some stage, most of the models and techniques implemented in IR use frequency counts of the terms appearing in documents and in queries. However, many words, since they are derived from the same stem, have very close semantic contents. This makes a grouping of such variants under a single term advisable. Otherwise, dispersal occurs in the calculation of frequency of these terms, and it also becomes difficult to compare queries and documents. On the other hand, there are notable differences between different languages in the way of forming derivatives and inflected forms, so that the application of specific techniques can produce unequal results according to the language of the documents and queries. A description is given of the tests carried out for documents in Spanish, which involved some stemming techniques widely used in English, as well as the application of n-grams, and the results are compared.

Most of the models and techniques employed in IR at some stage use frequency counts of the terms that appear in documents and queries. However, in this context, the concept of term is not exactly equivalent to that of word. Leaving aside the matter of so-called empty words, which cannot be considered terms as such, we have the case of words derived from the same stem, which can be said to have a very close semantic content [1]. The possible variations of the derivatives, together with inflected forms, changes in gender and number, etc., make the grouping of these variants under a single term advisable. Otherwise, dispersal in the calculation of frequency of these terms occurs, and it is difficult to compare queries and documents [2]. On the other hand, if this grouping does not occur, the comparison between a query and the documents of a collection becomes problematic.
Somehow the programmes that are to solve this query must identify inflected forms or derivatives, which may be different in the query and in the document, as similar and corresponding to the same stem. Therefore, it is necessary to incorporate into information retrieval systems a mechanism that standardizes different words, in the sense of representing the different forms of the same stem that may appear in documents and queries under a single form. This operation is generically known as “stemming”, in that it is a matter of automatically obtaining the stem corresponding to each word that may appear in documents and queries. The very concept of stem can be approached in different ways. Although it seems evident that the derivation, and even the mere inflexion, of a word modifies its semantic content, there is no clear line that makes it possible to decide whether two forms correspond to the same stem or are clearly differentiated terms [3]. Naturally, this is directly related to the specific objective that stemming pursues. In our case, this objective consists of improving the performance of IR systems, but for other types of applications the criteria do not have to be the same. Thus, a linguist will admit that a change in gender or number, for example, is perfectly acceptable; in this way, ‘catalogue’ and ‘catalogues’ clearly correspond to the same stem. However, the case of ‘catalogue’ and ‘cataloguing’ is different: they are evidently different words, and even belong to different grammatical categories. For the purposes of retrieval it seems reasonable to suppose that if someone makes a query related to ‘catalogue’ they should obtain documents in which the word ‘cataloguing’ appears. This question has been approached in diverse ways, from simple stripping to the application of rather more sophisticated algorithms.
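As a minimal sketch of the "simple stripping" end of that range, the following Python fragment removes one ending from each word and then counts term frequencies. The suffix list, the minimum stem length and the sample sentence are assumptions made for illustration only; they are not taken from the article and are not one of the published stemming algorithms.

```python
from collections import Counter

# Hypothetical list of endings, for illustration only (longest first).
SUFFIXES = ("ing", "es", "e", "s")

def strip_suffix(word, min_stem=4):
    """Remove the first matching ending, keeping at least min_stem characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

def term_counts(tokens, conflate=False):
    """Count terms, optionally grouping variants under their stripped form."""
    if conflate:
        tokens = [strip_suffix(t) for t in tokens]
    return Counter(tokens)

tokens = "the catalogue lists catalogues used in cataloguing".split()
raw = term_counts(tokens)                     # each variant counted separately
grouped = term_counts(tokens, conflate=True)  # variants share one term
```

Without conflation the three variants are counted as three separate terms, which is the dispersal the text describes; with conflation they accumulate under a single term, so a query containing one variant can match a document containing another.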
Among the better-known contributions is the algorithm proposed by Lovins in 1968 [4], which, to some extent, is the basis of subsequent algorithms and proposals, such as those of Dawson [5], Porter [2] and Paice [6]. The results of the different forms of stemming, however, are irregular. They have been widely applied to texts in English with satisfactory results. With other, more morphologically complex languages, such as those derived from Latin, it is quite a different matter. On the one hand, there has generally been less IR work done in these languages and, on the other hand, the application of stemming algorithms requires the implementation of considerable linguistic knowledge, which is not always available. In any case, it is possible to find proposals and algorithms for specific languages, among which are Latin itself, despite its being a dead language [7], Malay [8], French [9], [10] or Arabic [11].
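The abstract also mentions n-grams as an alternative that needs no language-specific rules. One common way to use them, sketched here under stated assumptions, is to compare the sets of character bigrams of two words with the Dice coefficient. The function names and the choice of bigrams are illustrative and are not claimed to be the procedure evaluated in the article.

```python
def char_ngrams(word, n=2):
    """Return the set of overlapping character n-grams in a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def dice_similarity(a, b, n=2):
    """Dice coefficient between the n-gram sets of two words (0.0 to 1.0)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 0.0  # both words shorter than n: nothing to compare
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))

related = dice_similarity("catalogue", "cataloguing")
unrelated = dice_similarity("catalogue", "retrieval")
```

Here 'catalogue' and 'cataloguing' share 7 of their 8 and 10 distinct bigrams, giving a coefficient of 14/18, while 'catalogue' and 'retrieval' share only one. Where to set the threshold above which two words are treated as the same term is a tuning decision that this sketch leaves open.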

Similar resources

An investigation of the effects of stemming on information retrieval in Persian

Using language-specific behavior in information retrieval systems can improve the quality of the retrieved results significantly. The part of a word that remains after removing its affixes is called the stem. The stemming process can be used to improve the relevance of the results in an information retrieval system. Different morphological variants of words (plural, past tense…) will be mapped into t...

Statistical Phrases in Automated Text Categorization

In this work we investigate the usefulness of n-grams for document indexing in text categorization (TC). We call n-gram a set tk of n word stems, and we say that tk occurs in a document dj when a sequence of words appears in dj that, after stop word removal and stemming, consists exactly of the n stems in tk, in some order. Previous research has investigated the use of n-grams (or some varia...

Evaluation of N-grams Conflation Approach in Text-Based Information Retrieval

This paper examines a conflation method based on the N-grams approach and evaluates its performance relative to the results achieved by other techniques such as Porter algorithm and successor variety stemming. In addition to that, an alternative way of enhancing the N-grams method, derived from the concept of inverse frequency weighing, is introduced and evaluated. The experimental results gene...

Modified Makagonov’s Method for Testing Word Similarity and its Application to Constructing Word Frequency Lists

By (morphologically) similar wordforms we understand wordforms (strings) that have the same base meaning (roughly, the same root), such as sadly and sadden. The task of deciding whether two given strings are similar (in this sense) has numerous applications in text processing, e.g., in information retrieval, for which usually stemming is employed as an intermediate step. Makagonov has suggested...

Factors Affecting Student's Scientific Information Retrieval based on Fuzzy Logic Method Compared to Traditional Method

Background and aim: The aim of this study was to identify the factors affecting students' performance in information retrieval based on the fuzzy logic method compared to the traditional method. Materials and methods: This descriptive survey study was performed using a quantitative approach. The research population was 34 PhD students, and a researcher-made questionnaire was used. Data were analyzed...

A comparison of sub-word indexing methods for information retrieval

This paper compares different methods of subword indexing and their performance on the English and German domain-specific document collection of the Cross-language Evaluation Forum (CLEF). Four major methods to index sub-words are investigated and compared to indexing stems: 1) sequences of vowels and consonants, 2) a dictionary-based approach for decompounding, 3) overlapping character n-grams...

Journal title:
  • J. Information Science

Volume 26  Issue 

Pages  -

Publication date 2000