Benefits of Resource-Based Stemming in Hungarian Information Retrieval

Authors

  • Péter Halácsy
  • Viktor Trón
Abstract

This paper discusses the impact of resource-driven stemming on information retrieval tasks. We conducted experiments to identify the relative benefit of various stemming strategies in a language with highly complex morphology. The results reveal the importance of various aspects of stemming in enhancing system performance in the IR task of the CLEF ad-hoc monolingual Hungarian track.

The first Hungarian test collection for information retrieval (IR) appeared in the monolingual track of the 2005 CLEF ad-hoc task. Prior to that, no experiments had been published that measured the effect of Hungarian language-specific knowledge on retrieval performance. Hungarian is a language with highly complex morphology (a more detailed descriptive grammar of Hungarian is available at http://mokk.bme.hu/resources/ir). Its inventory of morphological processes includes both affixation (prefixes and suffixes) and compounding. Morphological processes are standardly viewed as providing the grammatical means for (i) creating new lexemes (derivation) and (ii) expressing morphosyntactic variants belonging to a lexeme (inflection). To illustrate the complexity of Hungarian morphology: a nominal stem can be followed by 7 types of possessive, 3 plural, 3 anaphoric possessive and 17 case suffixes, yielding as many as 1134 possible inflected forms. As in German and Finnish, compounding is very productive in Hungarian: almost any two (or more) adjacent nominals can form a compound (e.g., üveg+ház+hat-ás = glass+house+effect 'greenhouse effect'). The complexity of Hungarian and the problems it creates for IR are detailed in [1].

1 Stemming and Hungarian

All of the top five systems of the 2005 track (Table 1) had some method for handling the rich morphology of Hungarian: either words were tokenized into n-grams or an algorithmic stemmer was used.

Table 1. The top five runs for the Hungarian ad hoc monolingual task of CLEF 2005

  participant    run         MAP      stemming method
  jhu/apl        aplmohud    41.12%   4-gram
  unine          UniNEhu3    38.89%   Savoy's stemmer + decompounding
  miracle        xNP01ST1    35.20%   Savoy's stemmer
  hummingbird    humHU05tde  33.09%   Savoy's stemmer + 4-gram
  hildesheim     UHIHU2      32.64%   5-gram

The best result was achieved by JHU/APL with an IR system based on language modelling, in the run called aplmohud [2]. This system used character 4-gram tokenization. Such n-gram techniques can efficiently get around the problem of rich agglutinative morphology and compounding. For example, the word atomenergia 'atomic energy' in a query is tokenized into the strings atom, tome, omen, mene, ener, nerg, ergi, rgia. Even when the text contains only the form atomenergiával 'with atomic energy', the system still finds the relevant document (a minimal sketch of this tokenization follows below). Although this system used the Snowball stemmer together with n-gram tokenization for the English and French tasks, the Hungarian results were nearly as good: English 43.46%, French 41.22% and Hungarian 41.12%. These results suggest that the difference between isolating and agglutinating languages can largely be eliminated by character n-gram methods.
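As a concrete illustration of this tokenization, here is a minimal sketch of a character 4-gram tokenizer; the function name is ours and the snippet is illustrative, not a reconstruction of the JHU/APL system.

```python
def char_ngrams(word: str, n: int = 4) -> list:
    """Return all overlapping character n-grams of a word."""
    if len(word) < n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams("atomenergia"))
# ['atom', 'tome', 'omen', 'mene', 'ener', 'nerg', 'ergi', 'rgia']
print(char_ngrams("atomenergiával"))
# shares the first seven 4-grams above ('rgia' becomes 'rgiá'), then
# continues with 'giáv', 'iává', 'ável'
```

Because the query form and the suffixed document form share most of their 4-grams, the overlap is enough for the document to be retrieved even though the surface forms never match exactly.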
Unine [3], Miracle [4] and Hummingbird [5] all employ the same algorithmic stemmer for Hungarian, which removes the nominal suffixes corresponding to the different cases, the possessive and the plural (http://www.unine.ch/info/clef). UniNEhu3 [3] also uses a language-independent decompounding algorithm that tries to segment words according to corpus statistics calculated from the document collection [6]. The idea is to find a segmentation that maximizes the probability of the hypothesized segments given the document and the language. Given the density of short words in the language, spurious segmentations can be avoided by setting a minimum length limit (8 characters in Savoy's case) on the words to be segmented.

Among the 2005 CLEF contestants, [7] is especially important for us, since she uses the same baseline system. She developed four stemmers which implement successively more aggressive stripping strategies. The lightest only strips some of the frequent case suffixes and the plural, while the heaviest strips all major inflections. We conjecture that the order in which the suffix list is enriched is based on an intuitive scale of suffix transparency or meaningfulness, which is assumed to affect the retrieval results. In line with previous findings, she reports that stemming enhances retrieval, with the most aggressive strategy winning. However, the title of her paper, 'Four stemmers and a funeral', aptly captures her main finding that even the best of her stemmers performs no better than a 6-gram method.

2 The Experimental Setting

In order to measure the effects of various flavours of stemming on retrieval performance, we put together an unsophisticated IR system. We used Jakarta Lucene 2.0, an off-the-shelf system, to perform indexing and retrieval, with ranking based on its default vector space model. No further ranking heuristics or post-retrieval query expansion are applied. Before indexing, the XML documents are converted to plain text format, with the header (title, lead, source, etc.) and body appended sequentially. For tokenization we used Lucene's LetterTokenizer class: this treats every non-alphanumeric character (according to the Java Character class) as a token boundary. All tokens are lowercased but not stripped of accents. The document and query texts are tokenized the same way, including the topic and description fields used in the search.

Our various runs differ only in how these text tokens are mapped onto terms for document indexing and querying. In the current context, we define stemming as any solution to this mapping. Tokens that exactly matched a stopword before or after the application of our stemming algorithms were eliminated (we used the same stopword list as [7], downloadable at http://ilps.science.uva.nl/Resources/HungarianStemmer/). Stemming can map initial document and query tokens onto zero, one or more terms (zero only for stopwords). If stemming yields more than one term, each resulting term (but not the original token) was used as an index term. For the construction of the query, each term resulting from the stemmed token (including multiple ones) was used as a disjunctive query term (see the sketch below). This simple framework allows us to isolate and compare the impact of various stemming strategies. Using an unsophisticated architecture has the further advantage that any result reminiscent of a competitive outcome will point to the beneficial effect of the particular stemming method, even where no direct comparison to other systems is possible due to the different retrieval and ranking solutions employed.
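To make the token-to-term mapping concrete, the following sketch shows the indexing and query-construction logic under our assumptions: `stem` is a hypothetical stand-in for any of the stemming strategies compared below, the stopword list is a toy one, and the OR syntax merely illustrates disjunctive query terms rather than Lucene's actual query API.

```python
STOPWORDS = {"a", "az", "és"}  # toy list; the actual runs used the list from [7]

def stem(token: str) -> list:
    """Hypothetical stand-in for a stemming strategy: maps a token to zero
    terms (stopwords), one term, or several terms (e.g. compound parts)."""
    if token in STOPWORDS:
        return []
    return [token]  # identity mapping as a placeholder

def index_terms(text: str) -> list:
    """Lowercase and tokenize, then index every term a token maps to;
    the original token itself is not indexed."""
    tokens = [t.lower() for t in text.split()]  # stand-in for LetterTokenizer
    return [term for tok in tokens for term in stem(tok)]

def build_query(text: str) -> str:
    """All terms derived from the same token become disjunctive query terms."""
    clauses = []
    for tok in (t.lower() for t in text.split()):
        terms = stem(tok)
        if terms:  # stopwords contribute nothing to the query
            clauses.append("(" + " OR ".join(terms) + ")")
    return " ".join(clauses)

print(build_query("az atomenergia hatása"))  # -> "(atomenergia) (hatása)"
```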
2.1 Strategies of Stemming

Instead of the algorithmic stemmers employed in all previous work, we leveraged our word-analysis technology designed for generic NLP tasks, including morphological analysis and POS-tagging. Hunmorph is a word-analysis toolkit with a language-independent analyser that uses a language-specific lexicon and morphological grammar [8]. The core engine for recognition and analysis can (i) perform guessing (i.e., hypothesize new stems) and (ii) analyse compounds. Guessing means that possible analyses (a morphosyntactic tag and a hypothesized lemma) are given even if the input word's stem is not in the lexicon. If no lexicon is used, this feature allows for a stemming mode very similar to that of resourceless algorithmic stemmers; however, guessing can also be used in addition to the lexicon. To facilitate resource sharing and to enable systematic task-dependent optimizations from a central lexical knowledge base, the toolkit offers a general framework for describing the lexicon and morphology of any language. hunmorph uses the Hungarian lexical database and morphological grammar called morphdb.hu [9], by far the broadest-coverage resource for Hungarian, reaching about 93% recall on the 700-million-word Hungarian Webcorpus [10].

We set out to test the impact of using this lexical resource and grammar in an IR task. In particular, we wanted to test to what extent guessing can compensate for the lack of a lexicon, and what types of affixes should be recognized (stripped). We also compared this technology with the two other stemmers mentioned above, Savoy's stemmer and Snowball.

Decompounding is done based on the compound analyses of hunmorph, according to the compounding rules in the resource. In our resource, only two nominals can form a compound. Although compounding can be used with guessing, this only makes sense if the head of the compound can be hypothesized independently, i.e., if we use a lexicon. We wanted to test to what extent decompounding boosts IR effectiveness.

Due to extensive morphological ambiguity, hunmorph often gives several alternative analyses. Owing to the limitations of our simple IR system, we choose only one candidate. We have various strategies as to which of these alternatives should be retained as the index term. We can (i) use basic heuristics to choose among the alternants, or (ii) use a POS-tagger that disambiguates the analysis based on textual context. As a general heuristic, used in both (i) and (ii), we prefer analyses that are neither compound nor guessed; if no such analysis exists, we prefer non-guessed compounds over guessed analyses. Strategy (ii) involves a linguistically sophisticated method, POS-tagging, which restricts the candidate set by choosing an inflectional category based on contextual disambiguation. We used a statistical POS-tagger [11] and tested its impact on IR performance. This method relies on the availability of a large tagged training corpus; if a language has such a corpus, lexical resources for stemming are very likely to exist. Therefore it seemed somewhat irrelevant to test the effect of POS-tagging without lexical resources (i.e., only on guesser output). If there are still multiple analyses, either with or without POS-tagging, we found that choosing the shortest lemma for guessed analyses (aggressive stemming) and the longest lemma for known analyses (blocking of further analysis by known lexemes) works best. When a compound analysis is chosen, the lemmata of all constituents are retained as index terms (a sketch of this selection heuristic follows below).
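The preference order just described can be summarized in a short sketch; the Analysis structure is our own simplification for illustration, not hunmorph's actual output format.

```python
from dataclasses import dataclass

@dataclass
class Analysis:
    """Simplified stand-in for one hunmorph candidate analysis."""
    lemma: str
    guessed: bool = False
    compound_parts: tuple = ()  # constituent lemmata; empty if not a compound

def select_terms(analyses: list) -> list:
    """Apply the selection heuristic: known non-compounds first, then known
    compounds, then guessed analyses; ties are broken by the longest lemma
    for known analyses (blocking) and the shortest for guessed ones
    (aggressive stemming)."""
    def rank(a: Analysis):
        tier = (a.guessed, bool(a.compound_parts))  # False sorts before True
        tie_break = len(a.lemma) if a.guessed else -len(a.lemma)
        return (tier, tie_break)
    best = min(analyses, key=rank)
    # for a compound, every constituent lemma becomes an index term
    return list(best.compound_parts) if best.compound_parts else [best.lemma]

candidates = [
    Analysis("atomenergia", compound_parts=("atom", "energia")),
    Analysis("atomenergiáv", guessed=True),
]
print(select_terms(candidates))  # -> ['atom', 'energia']
```

Here the known compound analysis beats the guessed one, so the constituent lemmata, rather than the whole compound or the guessed stem, end up as index terms.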
