Combining Linguistics with statistics for multiword term extraction: a fruitful association?
Abstract
The acquisition of multiword terms from large text collections is a fundamental issue in Information Retrieval. Their identification improves the indexing process and helps guide users in their search for information. In this paper, we present an original methodology that extracts multiword terms either (1) by considering statistical word regularities exclusively or (2) by combining word statistics with endogenously acquired linguistic information. For that purpose, we combine a new association measure called the Mutual Expectation with a new acquisition process called the LocalMaxs. On the one hand, the Mutual Expectation, based on the concept of Normalised Expectation, evaluates the degree of cohesiveness that links together all the textual units contained in an n-gram (i.e. ∀n, n ≥ 2). On the other hand, the LocalMaxs retrieves the candidate terms from the set of all the valued n-grams by identifying local maxima of the association measure values. Finally, we compare the results obtained by applying the methodology to a raw Portuguese text with those obtained by combining word statistics with linguistic information endogenously acquired from a previously tagged version of the same corpus.
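The abstract names the two components without giving their formulas. The sketch below illustrates one common formulation and rests on several assumptions: the Normalised Expectation of an n-gram is taken as its probability divided by the mean probability of the n sub-patterns obtained by omitting one word at a time, the Mutual Expectation is that value multiplied by the n-gram probability, and LocalMaxs keeps an n-gram whose score is at least that of its (n-1)-gram antecedents and strictly greater than that of its (n+1)-gram successors. Only contiguous n-grams are considered, antecedents are limited to the two contiguous sub-grams, and all names (`count_patterns`, `mutual_expectation`, `local_maxs`, `GAP`) are hypothetical. Exact fractions are used so that ties between scores are handled deterministically.

```python
from collections import Counter
from fractions import Fraction

GAP = None  # wildcard standing for an omitted interior word (illustrative choice)


def count_patterns(tokens, max_n):
    """Count contiguous n-grams (n = 1..max_n) and their gapped variants."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            counts[gram] += 1
            # Variants with one interior word replaced by the wildcard.
            for k in range(1, n - 1):
                counts[gram[:k] + (GAP,) + gram[k + 1:]] += 1
    return counts


def mutual_expectation(gram, counts, total):
    """Assumed definition: ME = p(gram) * NE(gram), where NE is p(gram)
    divided by the mean probability of the n leave-one-word-out patterns."""
    n = len(gram)
    p = Fraction(counts[gram], total)
    subs = [gram[1:], gram[:-1]]  # omit first word, omit last word
    subs += [gram[:k] + (GAP,) + gram[k + 1:] for k in range(1, n - 1)]
    mean_sub = sum(Fraction(counts[s], total) for s in subs) / n
    return p * (p / mean_sub)


def local_maxs(tokens, max_n=3):
    """Keep n-grams (2 <= n <= max_n) whose score is a local maximum."""
    counts = count_patterns(tokens, max_n + 1)  # successors need n + 1
    total = len(tokens)
    me = {g: mutual_expectation(g, counts, total)
          for g in counts if GAP not in g and len(g) >= 2}
    selected = []
    for gram, score in me.items():
        n = len(gram)
        if n > max_n:
            continue
        # Successors: (n+1)-grams that extend gram by one word on either side.
        succ = [v for s, v in me.items()
                if len(s) == n + 1 and (s[:-1] == gram or s[1:] == gram)]
        # Antecedents exist only for n > 2 (simplified to contiguous ones).
        ante = [me[a] for a in (gram[:-1], gram[1:])] if n > 2 else []
        if all(score >= a for a in ante) and all(score > s for s in succ):
            selected.append(gram)
    return selected


if __name__ == "__main__":
    toy = "i love new york and she hates new york".split()
    print(local_maxs(toy, max_n=3))
```

On a toy corpus like the one in the usage lines, sequences that occur only once produce many tied scores, so a real run would need a much larger corpus for the local-maximum test to be meaningful.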
Similar resources
Multiword Unit Hybrid Extraction
This paper describes an original hybrid system that extracts multiword unit candidates from part-of-speech tagged corpora. While classical hybrid systems manually define local part-of-speech patterns that lead to the identification of well-known multiword units (mainly compound nouns), our solution automatically identifies relevant syntactical patterns from the corpus. Word statistics are then c...
Multi-word Term Extraction Based on New Hybrid Approach for Arabic Language
Arabic multiword terms are relevant strings of words in text documents. Once automatically extracted, they can be used to improve the performance of text mining applications such as categorisation, clustering, information retrieval, machine translation, and summarization. This paper introduces our proposed multiword term extraction system based on the contextual informa...
Language Independent Automatic Acquisition of Rigid Multiword Units from Unrestricted Text Corpora
Multiword units are groups of words that occur together more often than expected by chance in sub-languages. Président de la République, Coupe du monde and Traité de Maastricht are multiword units. Unfortunately, most of the machine-readable dictionaries contain clearly insufficient information about multiword units. Therefore, their automatic extraction from corpora is an important issue not o...
Yet Another Ranking Function for Automatic Multiword Term Extraction
Term extraction is an essential task in domain knowledge acquisition. We propose two new measures to extract multiword terms from a domain-specific text. The first measure is both linguistically and statistically based. The second measure is graph-based, allowing assessment of the importance of a multiword term within a domain. Existing measures often solve some problems related (but not completely) to t...
A System for Compound Noun Multiword Expression Extraction for Hindi
Compound noun multiword expressions are important for many NLP applications like machine translation and information retrieval. This paper describes a system for Hindi compound noun multiword expressions (MWE) extraction from a given corpus. We identify major categories of compound noun MWEs, based on linguistic and psycholinguistic principles. Our extraction methods use various statistical co-...