Noun Classification from Predicate-Argument Structures
Author
Abstract
A method of determining the similarity of nouns on the basis of a metric derived from the distribution of subject, verb and object in a large text corpus is described. The resulting quasi-semantic classification of nouns demonstrates the plausibility of the distributional hypothesis, and has potential application to a variety of tasks, including automatic indexing, resolving nominal compounds, and determining the scope of modification.
1. Introduction
A variety of linguistic relations apply to sets of semantically similar words. For example, modifiers select semantically similar nouns, selectional restrictions are expressed in terms of the semantic class of objects, and semantic type restricts the possibilities for noun compounding. Therefore, it is useful to have a classification of words into semantically similar sets. Standard approaches to classifying nouns, in terms of an "is-a" hierarchy, have proven hard to apply to unrestricted language. Is-a hierarchies are expensive to acquire by hand for anything but highly restricted domains, while attempts to automatically derive these hierarchies from existing dictionaries have been only partially successful (Chodorow, Byrd, and Heidorn 1985).
This paper describes an approach to classifying English words according to the predicate-argument structures they show in a corpus of text. The general idea is straightforward: in any natural language there are restrictions on what words can appear together in the same construction, and in particular, on what can be arguments of what predicates. For each noun, there is a restricted set of verbs that it appears as subject or object of. For example, wine may be drunk, produced, and sold but not pruned. Each noun may therefore be characterized according to the verbs that it occurs with. Nouns may then be grouped according to the extent to which they appear in similar environments.
This basic idea of the distributional foundation of meaning is not new.
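The idea of characterizing each noun by the verbs it occurs with can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the triples are invented toy data rather than parser output, the names `TRIPLES`, `noun_profiles` and `jaccard` are hypothetical, and Jaccard overlap is a simple stand-in, not the similarity measure the paper proposes.

```python
from collections import defaultdict

# Invented toy data: (verb, relation, noun) triples of the kind a parser
# might extract. These are NOT taken from the paper's corpus.
TRIPLES = [
    ("drink", "obj", "wine"),
    ("produce", "obj", "wine"),
    ("sell", "obj", "wine"),
    ("drink", "obj", "beer"),
    ("sell", "obj", "beer"),
    ("brew", "obj", "beer"),
    ("prune", "obj", "tree"),
    ("plant", "obj", "tree"),
    ("sell", "obj", "tree"),
]


def noun_profiles(triples):
    """Map each noun to the set of (verb, relation) contexts it occurs in."""
    profiles = defaultdict(set)
    for verb, relation, noun in triples:
        profiles[noun].add((verb, relation))
    return dict(profiles)


def jaccard(a, b):
    """Share of contexts two nouns have in common (stand-in similarity)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


profiles = noun_profiles(TRIPLES)
wine_beer = jaccard(profiles["wine"], profiles["beer"])
wine_tree = jaccard(profiles["wine"], profiles["tree"])
```

On this toy data, wine and beer share more verb contexts than wine and tree, so `wine_beer` is larger than `wine_tree`. A set-overlap score ignores how often each pairing occurs, which is one reason a frequency-sensitive measure is needed on a real corpus.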
Harris (1968) makes this "distributional hypothesis" central to his linguistic theory. His claim is that: "the meaning of entities, and the meaning of grammatical relations among them, is related to the restriction of combinations of these entities relative to other entities." (Harris 1968:12). Sparck Jones (1986) takes a similar view.
It is however by no means obvious that the distribution of words will directly provide a useful semantic classification, at least in the absence of considerable human intervention. The work that has been done based on Harris' distributional hypothesis (most notably, the work of the associates of the Linguistic String Project (see for example, Hirschman, Grishman, and Sager 1975)) unfortunately does not provide a direct answer, since the corpora used have been small (tens of thousands of words rather than millions) and the analysis has typically involved considerable intervention by the researchers. The stumbling block to any automatic use of distributional patterns has been that no sufficiently robust syntactic analyzer has been available.
This paper reports an investigation of automatic distributional classification of words in English, using a parser developed for extracting grammatical structures from unrestricted text (Hindle 1983). We propose a particular measure of similarity that is a function of mutual information estimated from text. On the basis of a six million word sample of Associated Press news stories, a classification of nouns was developed according to the predicates they occur with. This purely syntax-based similarity measure shows remarkably plausible semantic relations.
2. Analyzing the Corpus
A 6 million word sample of Associated Press news stories was analyzed, one sentence at a time,
Similar resources
Corpus Based Method of Transforming Nominalized Phrases into Clauses for Text Mining Application
Nominalization is a linguistic phenomenon in which events usually described in terms of clauses are expressed in the form of noun phrases. Extracting event structures is an important task in text mining applications. To achieve this goal, clauses are parsed and the argument structure of main verbs are extracted from the parsed results. This kind of preprocessing has been commonly done in the pa...
PropBank: Semantics of New Predicate Types
This research focuses on expanding PropBank, a corpus annotated with predicate argument structures, with new predicate types; namely, noun, adjective and complex predicates, such as Light Verb Constructions. This effort is in part inspired by a sister project to PropBank, the Abstract Meaning Representation project, which also attempts to capture “who is doing what to whom” in a sentence, but d...
Revisiting Possessors in Hungarian DPs: A New Perspective
This paper offers a new LFG analysis of possessive DPs in Hungarian. This account is designed to overcome two difficulties that the majority of previous generative approaches had to face: (a) the problem of the (a)symmetrical cohead relationship between the noun head and the possessive marker (b) the problem of dual theta role assignment in GB (or its LFG equivalent) when the noun head is relat...
A Semantic Kernel for Predicate Argument Classification
Automatically deriving semantic structures from text is a challenging task for machine learning. The flat feature representations, usually used in learning models, can only partially describe structured data. This makes difficult the processing of the semantic information that is embedded into parse-trees. In this paper a new kernel for automatic classification of predicate arguments has been d...
Complex Predicates are Multi-Word Expressions
Practitioners of English Natural Language Processing often feel fortunate because their tokens are clearly marked by spaces on either side. However, the spaces can be quite deceptive, since they ignore the boundaries of multi-word expressions, such as noun-noun compounds, verb particle constructions, light verb constructions and constructions from Construction Grammar, e.g., caused-motion const...