Modeling Subcategorization through Co-occurrence: a Computational Lexical Resource for Italian Verbs

Authors

  • Gabriella Lapesa
  • Alessandro Lenci
Abstract

1. Goals and Methodology

The aim of this abstract is to introduce LexIt, a freely available lexical resource that characterizes the argument properties of Italian verbs in terms of distributional information automatically extracted from large corpora with state-of-the-art computational linguistics methods. Research on the automatic extraction of subcategorization frames from corpora has a long tradition in computational linguistics, but to the best of our knowledge this is the first large-scale resource of this type for Italian, and it aims to characterize predicate valence properties entirely on distributional grounds. Theoretically grounded in Levin’s assumption that distributional data can be used as “a probe into the elements entering into the lexical representations of word meaning” (Levin, 1993: 14) and methodologically based on the Distributional Semantics framework (Miller and Charles, 1991), our approach models subcategorization and semantic selection properties through co-occurrence data. Co-occurrence turns out to be a powerful instrument for the study of subcategorization, for more than one reason. First, co-occurrences can be automatically extracted from large corpora. Second, co-occurrences can be used to model the association between verbs and syntactic constructions, arguments and semantic classes as a gradient preference rather than a categorical selection. Finally, the basic notion of collocation as surface co-occurrence can be integrated with more abstract syntactic or semantic information. We used stochastic association measures (Evert, 2008), traditionally applied to the study of word collocations, to evaluate the strength of the correlation between: verbs and syntactic frames; argument slots and the words filling them; and argument slots and the semantic classes (or polysemies) selected by them.
Currently, LexIt contains more than 3,900 Italian verbs, each associated with a syntactic and a semantic profile automatically extracted from the La Repubblica Corpus (Baroni et al., 2004). The syntactic profile contains the syntactic frames that best characterize the target verb. The semantic profile is further divided into two parts: the prototypical fillers of each argument slot and the semantic classes abstracted over these fillers.
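
As an illustration only, the idea of ranking a verb's frames by association strength can be sketched as follows. The abstract does not say which association measure LexIt uses, so this sketch assumes one common choice: observed frequency multiplied by the log ratio of observed to expected frequency. The counts, the frame labels (`subj`, `subj#obj`) and the function names are hypothetical and are not taken from the resource.

```python
import math
from collections import Counter

# Invented toy counts of (verb, frame) co-occurrences; not real corpus data.
# The frame labels are hypothetical placeholders.
pair_counts = Counter({
    ("mangiare", "subj#obj"): 8,
    ("mangiare", "subj"): 2,
    ("dormire", "subj"): 9,
    ("dormire", "subj#obj"): 1,
})


def association(counts, verb, frame):
    """Score a (verb, frame) pair as O * log2(O / E).

    O is the observed joint frequency and E the frequency expected if
    verb and frame were independent. This measure is an assumption
    made for the example, not necessarily the one used in LexIt.
    """
    total = sum(counts.values())
    verb_total = sum(c for (v, _), c in counts.items() if v == verb)
    frame_total = sum(c for (_, f), c in counts.items() if f == frame)
    observed = counts[(verb, frame)]
    if observed == 0:
        return 0.0
    expected = verb_total * frame_total / total
    return observed * math.log2(observed / expected)


def syntactic_profile(counts, verb):
    """Return the frames seen with `verb`, strongest association first."""
    frames = {f for (v, f) in counts if v == verb}
    scored = [(f, association(counts, verb, f)) for f in frames]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for verb in ("mangiare", "dormire"):
        print(verb, syntactic_profile(pair_counts, verb))
```

A frame that occurs with a verb more often than chance predicts receives a positive score, and one that occurs less often receives a negative score, which gives the gradient preference described above rather than a yes/no decision.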

Similar resources

LexIt: A Computational Resource on Italian Argument Structure

The aim of this paper is to introduce LexIt, a computational framework for the automatic acquisition and exploration of distributional information about Italian verbs, nouns and adjectives, freely available through a web interface at the address http://sesia.humnet.unipi.it/lexit. LexIt is the first large-scale resource for Italian in which subcategorization and semantic selection properties ar...

LexFr: Adapting the LexIt Framework to Build a Corpus-based French Subcategorization Lexicon

This paper introduces LexFr, a corpus-based French lexical resource built by adapting the framework LexIt, originally developed to describe the combinatorial potential of Italian predicates. As in the original framework, the behavior of a group of target predicates is characterized by a series of syntactic (i.e., subcategorization frames) and semantic (i.e., selectional preferences) statistical...

Unsupervised Acquisition of Verb Subcategorization Frames from Shallow-Parsed Corpora

In this paper, we reported experiments of unsupervised automatic acquisition of Italian and English verb subcategorization frames (SCFs) from general and domain corpora. The proposed technique operates on syntactically shallow-parsed corpora on the basis of a limited number of search heuristics not relying on any previous lexico-syntactic knowledge about SCFs. Although preliminary, reported res...

Large-Scale Induction and Evaluation of Lexical Resources from the Penn-II and Penn-III Treebanks

We present a methodology for extracting subcategorization frames based on an automatic lexical-functional grammar (LFG) f-structure annotation algorithm for the Penn-II and Penn-III Treebanks. We extract syntactic-function-based subcategorization frames (LFG semantic forms) and traditional CFG category-based subcategorization frames as well as mixed function/category-based frames, with or witho...

Lexical Semantics and Selection of TAM in Bantu Languages: A Case of Semantic Classification of Kiswahili Verbs

The existing literature on Bantu verbal semantics demonstrated that inherent semantic content of verbs pairs directly with the selection of tense, aspect and modality formatives in Bantu languages like Chasu, Lucazi, Lusamia, and Shiyeyi. Thus, the gist of this paper is the articulation of semantic classification of verbs in Kiswahili based on the selection of TAM types. This is because the sem...

Publication date: 2011