Running Head: Recognition of Word Combinations

Recognition of Common Word Combinations: Towards a Lexicon of Variable-Sized Units
Abstract
Connectionist approaches to word learning and word recognition imply that the units of lexical representation are not part of a fixed architecture. If the size of stored units is determined by extracting co-occurrence regularities, then we expect the lexicon to include multi-word units which have become entrenched due to repeated use. Experiments using Reicher's two-alternative forced-choice procedure revealed that letter detection was superior in high-frequency word combinations compared to the same word presented as part of a random combination or a syntactically legal combination. This "collocation effect" held even when both letter choices made a collocation (prop/crop up, faced/laced with, next step/stop). A second type of evidence came from drastically impaired letter detection on 'trick' items, word pairs in which the incorrect letter choice creates a collocation (first kid, odd man). It was also found that accuracy was better in semantically coherent 'trick' items (first kid, sometimes perceived as first aid) than in incongruent 'trick' items (low coat, often perceived as low cost). The data are discussed with reference to a model of the lexicon which is consistent with connectionist principles and with proposals by cognitive linguists.

Recognition of Common Word Combinations: Towards a Lexicon of Variable-Sized Units

Lexical access and representation are among the most heavily studied topics in cognitive science, yet there have been few studies of the processing of common word combinations. This omission may reflect the common assumption that communication is achieved by the purposeful combining of single words, accessed from a lexicon of single words (see, for example, Fodor, Bever & Garrett, 1974; Forster, 1979; Moulton & Robinson, 1981; Emmorey & Fromkin, 1988; Frazier & Clifton, 1995). English speakers easily recognize the familiarity and cohesive quality of noun compounds (last year, brand new), verb phrases (cut down, get a hold of, faced with) and other multi-word expressions such as common sayings and references to cultural concepts (saved by the bell, speed of light) (Jackendoff, 1995). A very general question is what type of representation allows speakers to recognize some word pairs as familiar.

The proposal to be investigated in this paper is that common word combinations are directly stored in the lexicon. This proposal may seem an old one, as theorists have long acknowledged that idiosyncratic word combinations, such as idioms, need to be stored in the lexicon (Chomsky, 1965; Swinney & Cutler, 1979). However, for these theorists, the lexicon was primarily a list of single words. Word combinations were only to be stored in the lexicon if their meaning could not be predicted from the meaning of their components.

Recently, linguists such as Langacker (1987), Fillmore (1988), and Goldberg (1992) have argued for the necessity of a data structure that is more flexible than a set of single words plus some listed exceptions. A view consistent with the data considered by these authors is that the lexicon is the mental data structure where lexical regularities are stored. The regularities are of two types: overtly occurring expressions (both single words and multi-word phrases), and generalizations over overtly occurring expressions (Langacker, 1987). The source of these regularities is the corpus of phrases speakers encounter in daily life.
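The notion that word pairs become entrenched through repeated exposure can be made concrete with a simple corpus measure. The sketch below is purely illustrative and is not part of the present proposal or experiments: it flags adjacent word pairs whose co-occurrence frequency exceeds what their individual frequencies would predict, using pointwise mutual information as a crude index of entrenchment. The mini-corpus, thresholds and function names are invented.

```python
from collections import Counter
from math import log2

def candidate_collocations(tokens, min_count=3, min_pmi=2.0):
    """Flag adjacent word pairs that are both frequent and more strongly
    associated than their individual frequencies would predict, using
    pointwise mutual information (PMI) as a crude index of entrenchment.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)

    flagged = []
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        pmi = log2((count / n) / ((unigrams[w1] / n) * (unigrams[w2] / n)))
        if pmi >= min_pmi:
            flagged.append(((w1, w2), count, round(pmi, 2)))
    return sorted(flagged, key=lambda item: -item[2])

# Toy usage on an invented mini-corpus: "lose sight" and "lose track"
# recur often enough to be flagged as candidate multi-word units.
corpus = ("we lose sight of the goal and we lose track of the time "
          "and they lose sight of what matters while we lose track "
          "of the plan and never lose sight of it").split()
print(candidate_collocations(corpus, min_count=2, min_pmi=1.0))
```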
Examples of common two-word combinations are lose sight, never mind, fast track, shelf life. An example of a set of word pairs which may give rise to a generalization is the set lose sight, lose track, and lose count. These local generalizations also support more abstract generalizations, such as verb+noun, which in turn support basic word-order rules, as suggested by Langacker (1987) and by some connectionist work (Elman, 1991). Generalizations at the level of phrases, such as the let alone construction, have been discussed extensively by Fillmore (1988; Fillmore et al., 1988) and Goldberg (1992), in work which is broadly compatible with the approach advocated here. According to this view, lexical structures form a continuum, from word combinations which have literally fossilized into single units (goldfish, nightclub, sandbar) to those that both exist as independent units and yet have bonds, varying in tightness, with the words with which they frequently co-occur. The lose examples given above illustrate this: lose can appear in many relatively unconstrained contexts, but also appears in the more formulaic contexts lose sight, lose track and lose touch.

A lexicon of variable-sized units fits easily into a connectionist framework. A key idea of connectionism is that the entities available to conscious reflection, and which appear to have causal force in human cognition, are emergent from an underlying microstructure (Smolensky, 1988). According to this view, the units of lexical representation are not part of a fixed architecture, but emerge through extracting co-occurrence regularities. One implication of this idea is that unit-status, and the size of units, may be a matter of degree (Harris, 1994a). Presumably the letter strings we call words come to have unit status via readers' frequent (and usually early) exposure to these letter combinations. Elman (1991) has shown that words as units can emerge from a back-propagation network trained to predict the next letter in a letter sequence composed of words strung together without separations (a toy illustration of this idea appears below). Van Orden, Pennington & Stone (1990) describe how the units of word identification can emerge through extracting co-occurrence regularities, which they call covariant learning. As they point out, "...any relatively invariant correspondence, at any grain size equal to or larger than the grain size of our subsymbols, may emerge as a rulelike force..." (Van Orden et al., 1990, p. 504).

The idea that co-occurrence patterns and corpus statistics are the source of linguistic regularities is an old one (Ervin-Tripp, 1970; Esper, 1973), but one that has never become central to psycholinguistics. The reasons for this are varied and involve competing conceptions of the nature of human intelligence (Chomsky, 1965; Fodor et al., 1974). However, there is currently in cognitive science a good deal of interest in the role played by the 'training corpus' (Elman, 1993; Charniak, 1993). It thus might be helpful to enrich our thinking about the relationship between the utterances humans are exposed to and the types of regularities extracted. A standard assumption at present is that the corpus exists entirely outside the speaker's mind. Inside the speaker's mind, the corpus is completely reduced to summary statistics. These summary statistics constitute speakers' knowledge of their language.
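As promised above, here is a rough, purely illustrative sketch of the boundary-emergence idea. It is not Elman's simple recurrent network; a character bigram model stands in for the network, the training text is invented, and the only point is that prediction uncertainty varies across positions of unsegmented text.

```python
from collections import Counter, defaultdict
from math import log2

def next_char_entropy(training_text, test_text):
    """Train a character bigram model on unsegmented text and report the
    model's prediction entropy after each character of a test string.
    This is only a crude stand-in for Elman's recurrent network: the
    point is that prediction is less constrained at some positions than
    at others, and such uncertainty profiles can mark word-like units.
    """
    counts = defaultdict(Counter)
    for a, b in zip(training_text, training_text[1:]):
        counts[a][b] += 1

    profile = []
    for a in test_text[:-1]:
        total = sum(counts[a].values())
        if total == 0:
            profile.append(None)          # character never seen in training
            continue
        h = -sum((c / total) * log2(c / total) for c in counts[a].values())
        profile.append(round(h, 2))
    return profile

# Toy usage: words strung together without separations, as in Elman's
# training regime; the entropy profile is printed character by character.
training = "manyyearsagotheboysawthesea" * 50
test = "manyyearsago"
print(list(zip(test[:-1], next_char_entropy(training, test))))
```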
Lund & Burgess (in press; Burgess & Lund, 1995) subjected a very large corpus of colloquial written English to datareduction procedures such that each word could be assigned a 200-bit vector, thus locating it in 200dimensional space. Burgess and Lund found that words form clusters in this space which reflect grammatical category information, semantic attributes, synonymy and prototypicality. They have also shown that Euclidean distances between these words correlate with semantic priming experiments on humans. Burgess and Lund’s project has many worthy attributes. However, as a model of human regularity extraction, it lacks something which speakers clearly have. It has no structures in between the 300-million word corpus and the 200-dimensional space. It has no common word combinations, fixed expressions or abstractions over phrases. The alternative to summary statistics is that, while speaking and hearing a corpus of utterances, humans learn and commit to memory salient pieces of the corpus, probably those patterns which frequently occur and are short enough to have semantic coherency or a single communicative function (such as formulaic sayings: make yourself at home; got a handle on it?). These phrases are then available to be used as “language ready to speak,” thus minimizing lexical search and facilitating fluent speech. The phrases also function as a standing corpus from which generalizations and abstractions can continue to be made, as may happen when we use common adverbials like out with a new phrasal verb, on analogy to stored patterns such as bailed out and freaked out (Lindner, 1991; Harris, 1994b). Introspective evidence that speakers have stored a corpus of utterances is that speakers report recalling common uses of a word when asked to give word definitions. There is currently no psycholinguistics of language units between single words and sentences, (with the exception of the recently burgeoning work on idioms; see papers collected in Cacciari & Tabossi, 1993; and Everaert et al, 1995). The traditional division of research into the areas of single words and sentences is so entrenched that it may not be immediately clear what types of questions Recognition of Word Combinations 5 about phrases and word combinations would be interesting or meaningful. Does the vision of a lexicon of variable sized units, or the vision of a human linguistic ability which depends on a corpus of memorized phrases, allow us to formulate novel, testable hypotheses? Substantiating the hypothesis of a variable-sized lexicon is a large undertaking, and will ideally involve a number of types of stimuli and experimental methods (Harris, 1996a; Harris & Rensink, 1996). My approach in this paper is to begin with a simple question which maximizes connections to a strong area of psycholinguistic research, visual word recognition. Are common word combinations such as noun phrases (tax bill, large part) and verb-preposition combinations (look out, appear in) are directly stored in the lexicon? If collocations are part of the lexicon, then empirical generalizations about words should be extendable to collocations. 
In reviewing theories of why letters are more easily recognized in words than in random letter strings (the Word Superiority Effect, WSE), Carr (1986) summarized: "words benefit from higher-order, unitized codes that bundle all the available stimulus information together in a form that is safe from visual masking and memorable for a long enough time to support all of the decision and response selection processes required by the task." Are collocations also a type of higher-order, unitized code? My starting point for investigating collocations' possible unit-status was the prediction that if collocations form higher-order, unitized codes, then a word should benefit from being part of a collocation. For example, recognition of war should be better when it occurs as part of cold war rather than as part of a non-collocation like cold way or cold one.

The question of whether words are easier to recognize in collocations than in non-collocations can be viewed as a question about how context influences perception. That is, in reading war in cold war, the word cold provides context for the word war. If collocations are a type of context which facilitates word recognition, a logical question is how they play their facilitating role. In anticipation of this question, I will briefly review some of the controversial aspects of how words function as contexts for recognizing letters. Following this, I describe and motivate my choice of a letter-detection task for investigating recognition of words in collocations.
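To make the task logic concrete, here is a hypothetical sketch of how a Reicher-style two-alternative forced-choice trial of the kind just described might be represented. The field names and trial structure are invented for illustration; only the example stimuli (cold war versus cold way, with war as the critical word) come from the text above.

```python
from dataclasses import dataclass

@dataclass
class ForcedChoiceTrial:
    """One Reicher-style two-alternative forced-choice display.

    The participant briefly sees `display`, followed by a mask, and must
    decide which of the two letter alternatives appeared at `position`.
    Field names and structure are illustrative, not the actual design.
    """
    display: str          # the word pair shown, e.g. "cold war"
    position: int         # index of the critical letter within the display
    alternatives: tuple   # the two letter choices offered after the mask
    condition: str        # e.g. collocation vs. legal non-collocation

# Illustrative trials built from examples cited in the text: recognition of
# "war" should be better in the collocation "cold war" than in "cold way".
trials = [
    ForcedChoiceTrial("cold war", 7, ("r", "y"), "collocation"),
    ForcedChoiceTrial("cold way", 7, ("r", "y"), "legal non-collocation"),
]

def correct_choice(trial: ForcedChoiceTrial) -> str:
    """Return the alternative actually present in the display."""
    letter = trial.display[trial.position]
    assert letter in trial.alternatives
    return letter

for t in trials:
    print(t.condition, "->", correct_choice(t))
```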