Automatic Extraction of Subcorpora based on Subcategorization Frames from a Part-of-Speech Tagged Corpus
Abstract
This paper presents a method for extracting subcorpora documenting different subcategorization frames for verbs, nouns, and adjectives in the 100-million-word British National Corpus. The extraction tool consists of a set of batch files for use with the Corpus Query Processor (CQP), which is part of the IMS Corpus Workbench (cf. Christ 1994a,b). A macroprocessor has been developed that allows the user to specify in a simple input file which subcorpora are to be created for a given lemma. The resulting subcorpora can be used (1) to provide evidence for the subcategorization properties of a given lemma and to facilitate the selection of corpus lines for lexicographic research, and (2) to determine the frequencies of different syntactic contexts of each lemma.
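The macroprocessor idea can be illustrated with a minimal sketch, which is not the authors' tool. It assumes a hypothetical input format (one line per lemma: lemma, part-of-speech tag prefix, then frame names), a hypothetical set of three frame templates, and a corpus that exposes `lemma`, `pos`, and `word` attributes with CLAWS C5 tags. Each lemma/frame pair is expanded into a CQP-style named query; the patterns are deliberately crude and would need refinement against real data.

```python
# Hypothetical frame templates written as CQP-style token patterns.
# Tag names (AT0, AJ0, NN.*, PRP, TO0, V.I) assume the CLAWS C5 tagset.
FRAMES = {
    "np": '{target} [pos="AT0"]? [pos="AJ0"]* [pos="NN.*"]',
    "pp": '{target} [pos="PRP" & word="{prep}"]',
    "to_inf": '{target} [pos="TO0"] [pos="V.I"]',
}


def parse_spec(text):
    """Parse lines of the assumed form '<lemma> <pos-prefix> <frame> [<frame> ...]'."""
    specs = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        lemma, pos, *frames = line.split()
        specs.append((lemma, pos, frames))
    return specs


def expand(lemma, pos, frame):
    """Expand one lemma/frame pair into a named CQP-style query string."""
    name, _, arg = frame.partition(":")  # e.g. 'pp:to' -> ('pp', 'to')
    target = f'[lemma="{lemma}" & pos="{pos}.*"]'
    pattern = FRAMES[name].format(target=target, prep=arg)
    label = "_".join(part for part in (lemma.capitalize(), name, arg) if part)
    return f"{label} = {pattern};"


def build_batch(spec_text):
    """Return one query line per requested subcorpus."""
    return [
        expand(lemma, pos, frame)
        for lemma, pos, frames in parse_spec(spec_text)
        for frame in frames
    ]


if __name__ == "__main__":
    spec = "give V np pp:to\nattempt N to_inf\n"
    print("\n".join(build_batch(spec)))
```

Keeping the frame inventory in a single table means a new frame can be added without touching the expansion logic, which is the main appeal of a macro layer over hand-written queries.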
Similar resources
Automatic Extraction of Subcategorization Frames for Corpus-based Dictionary-building
This paper presents a method for automatically extracting subcorpora isolating different subcategorization frames for nouns, adjectives, and verbs in the 100-million-word BNC. The tool is being used in the FrameNet project, an NSF-funded project that is involved in producing a database and tools for dictionary-building, based on the principles of Frame Semantics. The subcorpora are used (1) to facil...
The Automatic Acquisition Of Frequencies Of Verb Subcategorization Frames From Tagged Corpora
We describe a mechanism for automatically acquiring verb subcategorization frames and their frequencies in a large corpus. A tagged corpus is first partially parsed to identify noun phrases and then a linear grammar is used to estimate the appropriate subcategorization frame for each verb token in the corpus. In an experiment involving the identification of six fixed subcategorization frames, o...
Distinguishing Complements from Adjuncts using Memory-Based Learning
The automatic distinction between complements and adjuncts, i.e. between subcategorized and non-subcategorized constituents, is crucial for the automatic acquisition of subcategorization lexicons from corpora. In this paper we present memory-based learning experiments for the task of distinguishing complements from adjuncts. Data is extracted from the part-of-speech tagged and parsed version of...
Corpus based coreference resolution for Farsi text
"Coreference resolution", or "finding all expressions that refer to the same entity" in a text, is one of the important requirements in natural language processing. Two words are coreferent when both refer to a single entity in the text or the real world. So the main task of coreference resolution systems is to identify terms that refer to a unique entity. A coreference resolution tool could be...
Automatic Extraction of Subcategorization Frames for Bulgarian
Knowledge of a verb's valency or subcategorization is essential for many NLP tasks. The present paper describes an attempt to learn this kind of information from a corpus of parsed sentences of Bulgarian. Our program acquired the subcategorization information for 38 verbs and achieved 87.7% precision and 68.3% recall. We did not use predefined sets of frames but automatically induced such from a ...