A Corpus-Based Approach to Deriving Lexical Mappings
Abstract
This paper proposes a novel, corpus-based method for producing mappings between lexical resources. Results from a preliminary experiment using part-of-speech tags suggest this is a promising area for future research.

1 Introduction

Dictionaries are now commonly used resources in NLP systems. However, different lexical resources are not uniform; they contain different types of information and do not assign words the same number of senses. One way in which this problem might be tackled is by producing mappings between the senses of different resources, the "dictionary mapping problem". However, this is a non-trivial problem, as examination of existing lexical resources demonstrates. Lexicographers have been divided between "lumpers", who prefer a few general senses, and "splitters", who create a larger number of more specific senses, so there is no guarantee that a word will have the same number of senses in different resources. Previous attempts to create lexical mappings have concentrated on aligning the senses in pairs of lexical resources and based the mapping decision on information in the entries. For example, Knight and Luk (1994) merged WordNet and LDOCE using information in the hierarchies and textual definitions of each resource. Thus far we have mentioned only mappings between dictionary senses. However, it is possible to create mappings between any pair of linguistic annotation tag-sets, for example part-of-speech tags. We dub this more general class lexical mappings: mappings between two sets of lexical annotations. One example which we shall consider further is that of mappings between part-of-speech tag-sets. This paper proposes a method for producing lexical mappings based on corpus evidence. It builds on the existence of large-scale lexical annotation tools such as part-of-speech taggers and sense taggers, several of which have now been developed (Brill, 1994; Stevenson and Wilks, 1999).
The availability of such taggers brings the possibility of automatically annotating large bodies of text. Our proposal is, briefly, to use a pair of taggers, each assigning annotations from one of the lexical tag-sets we are interested in mapping. These taggers can then be applied to the same large body of text and a mapping derived from the joint distribution of the pair of tag-sets in the corpus.
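A minimal sketch of this proposal, assuming both taggers have already produced token-aligned tag sequences over the same corpus (the function name, the toy tag-sets, and the most-frequent-co-occurrence rule are our own illustration, not details taken from the paper):

```python
from collections import Counter, defaultdict

def derive_mapping(tags_a, tags_b):
    """Map each tag in set A to the tag in set B it most often
    co-occurs with, counted over token-aligned annotations of the
    same corpus."""
    cooccurrence = defaultdict(Counter)
    for tag_a, tag_b in zip(tags_a, tags_b):
        cooccurrence[tag_a][tag_b] += 1
    # Pick the most frequent partner tag for each source tag.
    return {tag_a: counts.most_common(1)[0][0]
            for tag_a, counts in cooccurrence.items()}

# Toy example: two part-of-speech tag-sets over the same six tokens.
tags_a = ["NN", "VB", "NN", "DT", "NN", "VB"]
tags_b = ["NOUN", "VERB", "NOUN", "DET", "NOUN", "VERB"]
print(derive_mapping(tags_a, tags_b))
# {'NN': 'NOUN', 'VB': 'VERB', 'DT': 'DET'}
```

A fuller treatment would presumably retain the whole co-occurrence distribution rather than only the single most frequent pairing, since tag-sets of different granularity ("lumpers" versus "splitters") may require one-to-many mappings.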
Similar articles
A Corpus-based Study of Lexical Bundles in Discussion Section of Medical Research Articles
There has been increasing interest in utilizing corpora in linguistic research and pedagogy in recent years. Rhetorical organization of different sections of research articles may appear similar in various disciplines, but close examination may show subtle differences nonetheless. One of the features that has been at the center of attention especially in recent years is the idiomaticity of a di...
Lexical Bundles in English Abstracts of Research Articles Written by Iranian Scholars: Examples from Humanities
This paper investigates a special type of recurrent expressions, lexical bundles, defined as a sequence of three or more words that co-occur frequently in a particular register (Biber et al., 1999). Considering the importance of this group of multi-word sequences in academic prose, this study explores the forms and syntactic structures of three- and four-word bundles in English abstracts writte...
A Corpus-Based Study of the Lexical Make-up of Applied Linguistics Article Abstracts
This paper reports results from a corpus-based study that explored the frequency of words in the abstracts of applied linguistics journal articles. The abstracts of major articles in leading applied linguistics journals, published since 2005 up to November 2001, were analyzed using software modules from the Compleat Lexical Tutor. The output includes a list of the most frequent content words, list...
Inferring parts of speech for lexical mappings via the Cyc KB
We present an automatic approach to learning criteria for classifying the parts-of-speech used in lexical mappings. This will further automate our knowledge acquisition system for non-technical users. The criteria for the speech parts are based on the types of the denoted terms along with morphological and corpus-based clues. Associations among these and the parts-of-speech are learned using th...
Mining of Parsed Data to Derive Deverbal Argument Structure
The availability of large parsed corpora and improved computing resources now make it possible to extract vast amounts of lexical data. We describe the process of extracting structured data and several methods of deriving argument structure mappings for deverbal nouns that significantly improves upon non-lexicalized rule-based methods. For a typical model, the F-measure of performance improves ...
Publication date: 1999