Running head: LIST STRENGTH EFFECTS

Differential Effects of List Strength on Recollection and Familiarity
Author
Abstract
The Complementary Learning Systems model of recognition (Norman & O'Reilly, 2001) predicts that increasing list strength (i.e., strengthening the memory traces associated with some studied items) should impair recognition of non-strengthened studied items when discrimination is based on recollection, but not when discrimination is based on familiarity. This finding implies that the magnitude of the list strength effect (LSE) for recognition sensitivity will depend on the extent to which participants are relying on recollection vs. familiarity. In the experiments reported here, we isolated the contribution of recollection to recognition performance in three different ways: by collecting self-report measures of recollection and familiarity (Experiments 1 and 2); by focusing on high-confidence responses (Experiments 2 and 3); and by using related lures at test (Experiment 3). In all three experiments, we found a significant LSE for measures of recognition sensitivity that isolate the contribution of recollection; in contrast, the LSE was not significant for sensitivity measures that load more heavily on familiarity.

Differential Effects of List Strength on Recollection and Familiarity

One of the fundamental goals of memory research is to characterize how memory traces interfere with one another. Traditionally, this question has been addressed by lengthening the study list (i.e., adding items that do not match other studied items) and seeing how this affects memory. Increasing list length impairs performance on tests of recognition, free recall, and cued recall (e.g., Gillund & Shiffrin, 1984; but see Dennis & Humphreys, 2001, for discussion of confounds that are frequently present in list length experiments). A related question is how increasing list strength (strengthening the memory traces associated with some, but not all, list items) affects memory for non-strengthened list items. Tulving and Hastie (1972) were the first to ask this question; they found that strengthening some items (by increasing the presentation frequency of those items) impaired free recall of non-strengthened items. After the Tulving and Hastie study, the list strength issue lay dormant until it was revisited by Ratcliff, Clark, and Shiffrin (1990). Ratcliff et al. found that increasing list strength (by increasing the presentation frequency or presentation duration of some items) impaired free recall and cued recall of non-strengthened items, but list strength manipulations had no effect on recognition of non-strengthened items (i.e., participants' ability to discriminate between non-strengthened studied items and lures was unimpaired). This null list strength effect (LSE) for recognition sensitivity has since been replicated by several other researchers (Murnane & Shiffrin, 1991a, 1991b; Ratcliff, Sheu, & Gronlund, 1992; Ratcliff, McKoon, & Tindall, 1994; Yonelinas, Hockley, & Murdock, 1992; Shiffrin, Huber, & Marinelli, 1995; Hirshman, 1995). The LSE for cued recall (using pairs of unrelated words as stimuli) has since been replicated by Kahana and Rizzuto (submitted). The finding that list strength effects are obtained for cued recall, but not recognition, is potentially problematic for contemporary dual-process theories of recognition memory (for more on dual-process theories, see, e.g., Mandler, 1980; Hintzman & Curran, 1994; Jacoby, Yonelinas, & Jennings, 1997; Yonelinas, Dobbins, Szymanski, Dhaliwal, & King, 1996).
Dual-process theories posit that recognition judgments are driven by (1) the familiarity of the test probe, a scalar that reflects, holistically, how similar the test probe is to studied items, and (2) recollection (retrieval) of specific details relating to presented items. The LSE for cued recall indicates that recollection of details from the study phase (i.e., which words were paired together) is impaired by list strength. If recollection contributes to recognition, and recollection is impaired by list strength, there should also be an LSE for overall recognition sensitivity; however, as discussed above, extant studies have consistently found a null LSE for recognition. A recently developed dual-process neural network model of recognition may help resolve this puzzle. The Complementary Learning Systems (CLS) model of recognition (Norman & O'Reilly, 2001) consists of a hippocampal network that supports recollection of studied details, and a cortical network that computes stimulus familiarity (but is not capable of supporting recollection on its own). The CLS model predicts that an LSE for recognition sensitivity should be obtained when recognition is driven by hippocampal recollection, but the LSE should be null or negative (i.e., strengthening some items may benefit recognition of non-strengthened items) when recognition is driven by cortical familiarity. More specifically, the CLS model predicts a null/negative LSE for familiarity-based discrimination because increasing list strength (initially) reduces lure familiarity more than studied-item familiarity; as such, the average difference in familiarity between studied items and lures can actually increase as a function of interference. There is an LSE for recollection-based discrimination because increasing list strength reduces studied-item recollection, but the amount of recollection triggered by lures is close to floor (and thus cannot decrease much with interference); because studied-item recollection decreases, but lure recollection does not, interference has the effect of moving the studied and lure recollection distributions together, resulting in decreased sensitivity (see Norman & O'Reilly, 2001, for a detailed account of why the model makes these predictions, and the boundary conditions on these predictions).
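To make the logic of this prediction concrete, the following toy simulation illustrates the asymmetry. It is not the CLS model; the distributions and parameter values are invented purely to show why a drop in studied-item recollection (with lure recollection pinned at floor) reduces discrimination, while a drop in familiarity that affects lures at least as much as studied items does not.

```python
# Toy illustration (not the CLS model): invented distributions showing why
# interference can hurt recollection-based discrimination while leaving
# familiarity-based discrimination unchanged.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def auc(studied, lures):
    # P(random studied item outscores a random lure) ~ area under the ROC
    return np.mean(studied > lures)

# Recollection: lures sit at floor (zero recollection); interference pushes
# the studied-item distribution toward zero, so only the "hit" side moves.
lure_rec = np.zeros(n)
weak_rec = np.clip(rng.normal(0.6, 0.4, n), 0, None)
strong_rec = np.clip(rng.normal(0.3, 0.4, n), 0, None)

# Familiarity: interference lowers lure familiarity at least as much as
# studied-item familiarity, so the studied-lure separation survives.
weak_fam_old, weak_fam_lure = rng.normal(1.0, 1, n), rng.normal(0.0, 1, n)
strong_fam_old, strong_fam_lure = rng.normal(0.7, 1, n), rng.normal(-0.4, 1, n)

print("recollection discrimination: weak %.2f, strong %.2f"
      % (auc(weak_rec, lure_rec), auc(strong_rec, lure_rec)))        # drops
print("familiarity discrimination:  weak %.2f, strong %.2f"
      % (auc(weak_fam_old, weak_fam_lure),
         auc(strong_fam_old, strong_fam_lure)))                      # does not drop
```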
If increasing list strength adversely affects recollection-based discrimination, but not familiarity-based discrimination, the size of the LSE for recognition sensitivity will be a function of the extent to which participants are relying on recollection vs. familiarity when making their recognition judgments. According to the CLS model, the null LSE for recognition sensitivity reported in the literature can be explained in terms of participants relying more on familiarity (which, by hypothesis, shows a null or negative LSE) than on recollection. This explanation implies that, if we can develop recognition tests and/or measures that load more heavily on the hippocampal recollection process, then it should be possible to observe a list strength effect for recognition sensitivity. The experiments reported here have two goals: First, we set out to replicate the null LSE that is typically obtained when familiarity is making a substantial contribution to recognition performance. Second, we want to show that a list strength effect can be obtained when steps are taken to isolate the contribution of recollection to recognition performance.

We used three distinct methods to isolate the contribution of recollection: In Experiments 1 and 2, we used self-report data to separately estimate the contributions of recollection and familiarity to recognition. In Experiments 2 and 3, we rely on the idea that high-confidence responses disproportionately reflect the contribution of recollection; in Experiment 3, we rely on the idea that recollection is especially important in discriminating studied items from highly related lures.

Shared Design Elements

The experiments presented here resemble previous list strength studies (conducted by Hirshman, 1995; Ratcliff et al., 1990; Ratcliff et al., 1992; Ratcliff et al., 1994; and Yonelinas et al., 1992) insofar as participants studied lists of unrelated words, and were given recognition tests consisting of studied words and unrelated lures (i.e., words without any planned relationship to specific studied items). We also included some novel design features to maximize the odds of detecting an LSE. Specifically, we used a more powerful strengthening manipulation than is typically used in list strength experiments; we also used an encoding task designed to minimize floor and ceiling effects on recollection. These design features are described in more detail below.

Footnote 1: The Norman & O'Reilly (2001) technical report is available online at this URL: http://psych.colorado.edu/~norman/NormanOReilly01_recmem.pdf; also, a version of the Norman & O'Reilly (2001) paper has been submitted to Psychological Review.

Basic paradigm

Practically all list strength studies have used the "mixed-pure" paradigm pioneered by Ratcliff et al. (1990). In this paradigm, there are three types of study lists: mixed lists, consisting of both "strong" items and "weak" items; pure weak lists; and pure strong lists. Strengthening is achieved either by using a longer study duration or additional study presentations. A typical experiment consists of multiple study-test blocks, alternating between mixed, pure weak, and pure strong lists. If there is an LSE for recognition, participants should be worse at recognizing non-strengthened (weak) items in mixed lists than in pure weak lists; likewise, they should be worse at recognizing strengthened (strong) items in pure strong lists than in mixed lists. One major limitation of the "mixed-pure" paradigm is that researchers are not free to repeat strong items as many times as they see fit; because strong items are tested, memory for those items has to be kept below ceiling (otherwise, it would be possible to explain away null list strength effects for strong items in terms of ceiling effects). To get around this limitation, the full "mixed-pure" design was not used in the experiments reported here; rather, a simplified design was used with only two kinds of lists, Weak Interference and Strong Interference. Both types of lists were comprised of Target (to-be-tested) items and non-tested Interference items. Target items were presented once in both conditions; list strength was manipulated by presenting Interference items once in the Weak Interference condition vs. multiple times in the Strong Interference condition; the effect of list strength could be measured by comparing memory for Targets in the Weak vs. Strong Interference conditions. A key facet of this design is that, because Interference items were not tested, these items could be overlearned in the Strong Interference condition without any adverse consequences.
Taking advantage of this fact, Interference items were presented six times on Strong Interference lists in the experiments reported here. In all of the experiments reported here, participants studied (and were tested on) at least one Weak Interference list and at least one Strong Interference list. Figure 1 shows the general structure of the Strong and Weak Interference blocks.

[Insert Figure 1 about here]

In both kinds of blocks, participants first studied Target and Interference items once (mixed together). For Strong Interference lists, participants then studied the list of Interference items five more times. The Interference items were presented in a different order each time that the list of Interference items was repeated. Participants played a video game ("Catch the Buzz", a variant of Tetris) immediately after each study list; Weak Interference study lists were followed by a long video game phase, and Strong Interference study lists were followed by a short video game phase, such that the average time between studying a Target and being tested on that item was the same in the Weak Interference and Strong Interference conditions. The Strong Interference video game phase lasted two minutes, and the Weak Interference video game phase lasted longer (the exact length was a function of the number of items and the study duration for that particular experiment). After the video game phase, participants were given a recognition test consisting of studied Target items and nonstudied Lure items.

Minimizing recollection ceiling/floor effects

Floor and ceiling effects on recollection can sabotage the LSE for recollection; crucially, even if floor and ceiling effects are not present for overall recollection, "floor" and "ceiling" effects on recollection of individual items can also sabotage the LSE. To use an extreme example, if the overall probability of (above-floor) recollection is .5, it is possible that 50% of memory traces are too impoverished to support any recollection, and 50% of memory traces are so rich and distinctive that, in effect, they are impervious to interference from other list items; in this situation, increasing list strength will have no effect on recollection. The key to minimizing floor and ceiling effects is to select encoding parameters that force participants to do enough processing of the stimulus to support subsequent recollection, but also prevent excessive processing of the stimulus. In these experiments we used a size judgment encoding task: words (concrete nouns) appeared on the screen and participants had to judge whether the thing denoted by that word would fit in a box (with pre-specified dimensions). Participants were given just over a second to make their size judgment; this encoding duration was selected in order to give participants just enough time to perform the encoding task, thereby guaranteeing that participants would do some elaboration on each stimulus, while at the same time preventing participants from over-elaborating on particular stimuli (e.g., forming a picture of a bicycle broken into pieces in order to fit into the box).

Eliminating rehearsal confounds
As discussed by Ratcliff et al. (1990), if participants rehearse weak items at the expense of strong items in the Strong Interference ("mixed list") condition, this can mask an LSE by artificially boosting memory for weak items; furthermore, if participants rehearse strong items at the expense of weak items in the Strong Interference condition, this can result in a spurious LSE by artificially reducing weak-item memory (this occurred in Yonelinas et al., 1992). The paradigm described above was designed to minimize rehearsal confounds: Stimuli were presented very briefly, and performing the encoding task took up almost the entire presentation interval, so participants had very little time left over for rehearsing previously presented stimuli. Furthermore, the study list was structured such that all of the Target items were presented before any of the Interference items were repeated; as such, there was no way to tell which items were "weak" and which were "strong" during the part of the list where Targets were presented, so there was no way for participants to redistribute rehearsal according to strength.

Experiment 1

A straightforward way to isolate the contribution of recollection is to look at self-report measures: Whenever a participant thinks an item is "old" (studied), you can simply ask them whether they "remember" studying the item (i.e., they recollect specific details) or whether they "know" the item was studied (i.e., it seems familiar, but no specific details come to mind; Tulving, 1985; Gardiner, 1988; Rajaram, 1993). Using this self-report data it is possible to estimate the probability that items will be called "old" based on recollection (this is simply the probability of making a "remember" response); furthermore, if one assumes that recollection and familiarity are independent, it is possible to estimate the probability that items will be called "old" based on familiarity (assuming independence, the unconditional probability of saying "old" based on familiarity equals the probability of saying "know" on trials where participants do not make a "remember" response; Jacoby et al., 1997). Thus, we can calculate three discrimination scores: discrimination based on "old" responses, discrimination based on "recollection old" responses, and discrimination based on "familiarity old" responses (where the latter two measures are derived from remember-know data). The CLS model's prediction in this context is clear: Increasing list strength should reduce discrimination based on recollection, but it should not reduce discrimination based on familiarity. The effect of list strength on old-new discrimination will depend on the relative extent to which recollection and familiarity are contributing to these old-new judgments. It seems safe to assume that familiarity contributes to some extent to these judgments; as such, the list strength effect should be smaller for old-new discrimination than for discrimination based on recollection. In addition to predicting that recollection-based discrimination should decrease, the CLS model also makes more specific predictions about how list strength should affect recollection-based hits and false alarms. According to the CLS model, recollection is a continuously varying signal, and increasing list strength shifts the studied-item recollection distribution to the left (towards zero).
Because increasing list strength increases the odds that a studied item will trigger zero recollection, the net effect of increasing list strength should be a decrease in recollection-based hits. The CLS model also predicts that lure recollection should be close to floor (regardless of interference) in this experiment because of the hippocampus' ability to assign relatively distinct representations to studied items and unrelated lures; as such, recollection-based false alarms should be low in both the Weak Interference and Strong Interference conditions (Norman & O'Reilly, 2001).

Remember-know controversies

The idea that "remember" and "know" responses tap into two distinct signals, and thus can be used to measure these signals' contribution to recognition memory, is controversial. Most prominently, Donaldson (1996) and Hirshman and Master (1997; see also Hirshman & Henzler, 1998) have argued that remember-know data can be explained in terms of a single-process signal-detection model whereby participants apply two criteria to a familiarity signal that is normally distributed for studied items and lures; items are called "old" if their familiarity exceeds the lower criterion; items that are called "old" are given a "remember" response if their familiarity exceeds the upper criterion (and a "know" response otherwise). According to this view, "remember" responses are just high-confidence "old" responses. This single-process model of remember-know data is appealingly simple; however, two recent meta-analyses, conducted by Gardiner and Gregg (1997) and Rotello, Macmillan, and Reeder (submitted), very clearly show that the single-process model does not fit extant data. For example, a key prediction of the single-process model is that estimates of recognition sensitivity computed using "remember" responses should equal estimates of recognition sensitivity computed using "old" responses; however, Gardiner and Gregg (1997) found a significant trend for sensitivity to be greater when sensitivity is computed using "old" responses, as opposed to "remember" responses alone. Furthermore, Rotello et al. (submitted) found that z-ROC curves generated by plotting "remember" hits vs. false alarms and "old" hits vs. false alarms have a larger slope, on average, than z-ROC curves that plot hits vs. false alarms based on a high confidence criterion and a lower confidence criterion. Finally, Gardiner and Java (1990) and Rajaram (1993) found that manipulations (use of words vs. nonwords as stimuli, and masked repetition priming at test, respectively) can differentially affect "remember-know" and "sure-unsure" judgments. Thus, it is clear that differences between "remember" and "know" responses cannot simply be reduced to differences in confidence (although, as discussed later, the "remember-know" difference is not completely uncorrelated with confidence, and there is no guarantee that "remember-know" is dissociable from confidence in every remember-know experiment); rather, it is necessary to posit some kind of dual-process account of remember-know data. The studies cited above provide general evidence in favor of dual-process accounts, but they do not specifically support the claim that "remember" responses index the hippocampal recollection process postulated by the CLS model. Two recent studies provide more direct evidence that "remember" responses index hippocampal contributions to recognition, but "know" responses do not.
First, Eldridge, Knowlton, Furmanski, Bookheimer, and Engel (2000) conducted an fMRI study and found that high-confidence "remember" responses were associated with increased hippocampal activity, but high-confidence "know" responses did not trigger increased hippocampal activity. Secondly, Holdstock (personal communication) gave a remember-know test to a patient with focal, bilateral hippocampal damage; to ensure that "remember" responses were actually based on retrieval of specific information, Holdstock had participants verbally justify their "remember" responses. The patient showed impaired discrimination based on "remember" responses (her "remember" discrimination score was 1.8 standard deviations below the control mean), but the patient's ability to discriminate based on familiarity (assuming that "know" responses index familiarity, and that recollection and familiarity are independent; Jacoby et al., 1997) was unimpaired.

Footnote 2: Hanley et al. (2001) found substantial "remember"-based discrimination in a patient with damage to the extended hippocampal system, but this patient showed above-floor levels of visuospatial recall, indicating some degree of spared hippocampal-system functioning. It is plausible that "remember" responses in this patient were driven by preserved hippocampal recollection of visuospatial details from the study phase.

To summarize: In order to relate "remember-know" data to the CLS model's predictions about list strength, it is important to establish that "remember" and "know" responses tap into the hippocampal recollection and cortical familiarity processes, respectively. We reviewed several findings showing that single-process (familiarity-only) models cannot explain remember-know data in its entirety. Furthermore, we reviewed evidence that "remember" responses (but not "know" responses) specifically tap the hippocampus. While none of this evidence is conclusive, we believe that it is strong enough for us to proceed (with caution) in using remember-familiar data to test the CLS model's predictions.

Method

Participants. Sixty-four Harvard University undergraduates and graduate students (32 women and 32 men, mean age = 19.3 years) volunteered to participate in the experiment. The experiment lasted approximately 1 hour and participants were either paid $10 or given course credit.

Materials. Stimuli were 320 highly imageable, concrete, medium-frequency nouns; imageability, concreteness, and Kucera-Francis frequency data were obtained from the MRC Psycholinguistic Database (Coltheart, 1981): mean imageability = 5.94 out of 7, range = 5.38 to 6.59; mean concreteness = 5.95 out of 7, range = 4.65 to 6.62; mean K-F frequency = 17 occurrences per million, range = 0 to 94. Twenty-four additional words were used as Primacy and Recency Buffers at study. Of the 320 words that were not used as Buffers, 160 were designated Target words, and 160 were designated Interference words. Words were selected as Targets with the help of data from a pilot experiment that approximated the Weak Interference condition of the actual experiment; items that were assigned "remember" responses close to zero percent or 100 percent of the time in the pilot experiment were, whenever possible, excluded, because recollection of these items might be subject to floor or ceiling effects (respectively) in the actual experiment. Stimuli were arranged into four groups.
Each group contained two sets of 20 Target words (hereafter called Target Set One and Target Set Two) and one set of 40 Interference words, for a total of 80 words. An effort was made to ensure that, within a group, none of the Interference words were related in an obvious way (e.g., "guitar" and "violin") to any of the Target words, and that none of the items from Target Set One were related to items from Target Set Two. Lastly, 6 of the 24 Buffer items were assigned to each of the four groups. All sets of 20 Targets and 40 Interference words were matched to one another for average imageability, concreteness, K-F frequency, and "recollectability" (operationalized as how often items were given "remember" responses in the pilot experiment). Making sure that Target sets used in Strong Interference and Weak Interference lists are, on average, equally recollectable should increase the reliability of the LSE for recollection.

Design. For each participant, there were four study-test blocks (two Weak Interference blocks and two Strong Interference blocks); each of the four stimulus groups was assigned to one of these blocks. In Strong Interference blocks, participants studied 20 Target items (either the 20 items from Target Set One or the 20 items from Target Set Two; the nonstudied set served as Lures on the recognition test) once and the 40 Interference items six times. In Weak Interference blocks, participants studied 20 Target items once and the 40 Interference items once. The general structure of the Weak and Strong Interference blocks follows the description provided in the Shared Design Elements section. Four primacy buffers were presented at the beginning of each study list, and two recency buffers were presented after all Targets and Interference items had been presented once (but before any Interference items were repeated). Across participants, each stimulus group was used equally often in the Weak and Strong Interference conditions, and appeared equally often in each of the four study-test blocks. For half of the participants, items from Target Set One (within each group) were used as Targets and items from Target Set Two were used as Lures, and vice versa for the other participants; this way, each item that served as a Target also served equally often as a Lure. The order of the four study-test blocks (two Weak Interference and two Strong Interference blocks) was counterbalanced, such that for half of the participants the order of blocks was Weak-Strong-Weak-Strong, and for the other participants the order of blocks was Strong-Weak-Strong-Weak. Overall, there were 16 between-subjects counterbalancing conditions: (4 possible group orders) X (Strong or Weak Interference first) X (study Target Set One or Two).

Procedure. The experiment was conducted using a Power Computing PowerCenter Pro 210 Macintosh-compatible computer, running the PsyScope experiment scripting environment (Cohen, MacWhinney, Flatt, & Provost, 1993). In deciding how long to present items at study, we sought to give participants just enough time to perform the encoding task. It was not possible to determine, based on pilot data, whether 1150 or 1300 ms study time (per item) would be optimal, given that participants varied in how quickly they could perform the size judgment task. Therefore, study time was included as a between-subjects manipulation.
Before the start of the experiment, participants were shown a "banker's box" (approximately 1 foot wide, 2 feet long, 1 foot deep) on the floor of the testing room. At study, each word was presented on the computer monitor for 1150 ms or 1300 ms, with a 500 ms interval between words; for each word, participants were asked to indicate (during the 1150 or 1300 ms interval) whether the object designated by the word would fit inside the banker's box. If the participant entered a response during the 1150 or 1300 ms interval, the computer made a "beep" noise; if they failed to respond within the 1150 or 1300 ms interval, the computer made a "buzz" noise. Participants were informed at the beginning of the experiment that their memory would be tested for the words they studied. They were also warned that some items would be presented multiple times at study. For the video game phase of the experiment, participants played a variant of Tetris called "Catch the Buzz". They were told that they should try their best to accumulate points, and that the experimenter would tell them when they could stop playing the game. If participants were unfamiliar with Tetris, they were allowed to play an easier version of "Catch the Buzz". For Strong Interference blocks, the video game phase lasted 2 minutes; for Weak Interference blocks, the video game phase lasted 7 minutes 30 seconds for participants in the 1150 ms study condition, and it lasted 8 minutes for participants in the 1300 ms study condition.

At test, data on recollection and familiarity were obtained using a variant of the remember-know procedure; the term "familiar" was used instead of "know" because "know" has undesirable connotations of certainty and "familiar" does not (Donaldson, MacKenzie, & Underhill, 1996). Words appeared one at a time on the computer monitor; participants were given an unlimited amount of time to judge whether each word was "old" or "new", and, if they pressed "old", whether they "remembered" the word or it was just "familiar". Participants were instructed to give a "remember" response if they specifically recollected doing the size rating task on that item (i.e., they remembered thinking about whether the item would fit in the box); if, on the other hand, the participant's "old" response was based purely on familiarity, and they did not recollect any specific details, they were instructed to give a "familiar" response. Participants were given as much time as they needed to make their old-new and remember-familiar judgments.

Participants were given a short practice phase, in which they studied a list of 10 words (some of which were presented multiple times), played "Catch the Buzz" for 2 minutes, and were tested on the words they studied, before the start of the actual experiment. Participants were asked to repeat the practice study phase if they failed to respond in time, or gave obviously incorrect responses, on more than eight trials; very few participants had to repeat the practice study phase, and no participant had to repeat the practice study phase more than once. Participants were informed, after the practice phase, that they would be cycling through the three tasks (study, video game, and test) multiple times; also, they were told that each test phase only contained (1) items from the immediately preceding study phase and (2) completely new items, so, for example, they did not need to worry about items from the practice phase or from the first study phase showing up on the third memory test.
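As a check on the lag-equating logic described above, the arithmetic works out if one assumes that the 500 ms inter-word interval also applied during the repeated presentations of the Interference list (the paper does not state this explicitly); the sketch below verifies it for both study-time groups.

```python
# Sanity check (our reading of the design): the extra time spent re-studying
# Interference items in the Strong condition should equal the extra
# video-game time in the Weak condition, so that study-test lag is matched.
n_interference = 40        # Interference items per list (Experiment 1)
extra_passes = 5           # additional presentations in the Strong condition
strong_game_s = 2 * 60     # Strong Interference video game phase, in seconds
isi_ms = 500               # interval between words (assumed to apply to repetitions)

for study_ms, weak_game_s in [(1150, 7 * 60 + 30), (1300, 8 * 60)]:
    restudy_s = n_interference * extra_passes * (study_ms + isi_ms) / 1000
    print(f"{study_ms} ms condition: Strong = {restudy_s + strong_game_s:.0f} s "
          f"(re-study + game), Weak = {weak_game_s} s (game only)")
# 1150 ms: 450 s vs. 450 s; 1300 ms: 480 s vs. 480 s -- the lags match.
```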
Dependent Measures

All measures of recognition sensitivity and bias assume an underlying model; for example, the commonly used d' and A' measures of sensitivity have a very specific interpretation under signal detection theory: if we assume that recognition is based on a familiarity signal that is normally distributed for both studied items and lures, d' is an estimate of the distance between the studied and lure distributions, and A' is an estimate of the area under the ROC curve (Donaldson, 1993). Even though our proposed explanation of the LSE conflicts in several ways with this single-process signal detection model (i.e., the CLS model posits two distinct processes, and the recollection signal is not distributed normally), we decided to use standard signal-detection indices of sensitivity, for the following reasons: First, both A' and d' should be sensitive to the predicted effect of list strength (a selective decrease in hits, when recollection is driving performance); both A' and d' decrease when hits decrease but false alarms do not. Second, the CLS recollection process is not strictly high-threshold (nor is it Gaussian; see Norman & O'Reilly, 2001), so high-threshold measures of recollection-based sensitivity (such as the measures used by Yonelinas et al., 1996) are no more "theoretically appropriate" than A' or d'. Third, all other list strength studies that have computed single-point estimates of sensitivity have used signal-detection measures, so using A' and d' increases comparability with these studies. In summary: There are practical benefits to using A' and d', but our use of these measures does not imply that we are endorsing the single-process signal-detection model. Our results show that, qualitatively, the effects of list strength on A' and d' are very similar; none of the major points that we make rely on use of A' as opposed to d'. Both d' and A' results are reported in the tables that accompany each experiment, but to save space, we only discuss A' results in the main text; A' = .50 indicates chance performance, and A' = 1.00 indicates perfect discrimination; below-chance discrimination yields A' values between zero and .50. When A' is used to measure sensitivity, Donaldson (1992) recommends using B''D to index bias; B''D varies between -1.00 (extremely liberal) and +1.00 (extremely conservative). We report B''D values in the text of the paper (and in the tables, we also report values of c, the companion bias measure for d'; Snodgrass & Corwin, 1988). However, we should emphasize that interpreting B''D and c as measures of bias assumes a single-process signal detection model; if the assumptions that underlie this model are not met, it is unclear what exactly these "bias" measures index. As with our choice of sensitivity measures, we are using B''D and c for practical reasons: to facilitate comparison with other studies that have discussed list strength effects for bias (e.g., Hirshman, 1995), and to provide a rough index of how willing participants are to say "old". We will call A' and B''D, computed based on "old" responses, A'(Old) and B''D(Old), to distinguish them from the other measures of sensitivity and bias discussed below.

Footnote 3: In instances where participants showed below-chance sensitivity (hits < false alarms), a modified formula for calculating A' was used, as specified by Aaronson and Watts (1987).
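For reference, the single-point indices just described can be computed as follows. These are the standard textbook formulas (the below-chance A' variant is the Aaronson & Watts, 1987, formula mentioned in Footnote 3); this sketch is ours, not the authors' analysis code.

```python
# Standard single-point sensitivity and bias indices.
from scipy.stats import norm

def a_prime(h, f):
    """Nonparametric sensitivity A' (chance = .5, perfect = 1.0);
    below-chance variant per Aaronson & Watts (1987)."""
    if h >= f:
        return 0.5 + ((h - f) * (1 + h - f)) / (4 * h * (1 - f))
    return 0.5 - ((f - h) * (1 + f - h)) / (4 * f * (1 - h))

def b_double_prime_d(h, f):
    """Donaldson's (1992) bias index B''D (-1 = liberal, +1 = conservative)."""
    return ((1 - h) * (1 - f) - h * f) / ((1 - h) * (1 - f) + h * f)

def d_prime(h, f):
    """Signal-detection sensitivity d' = z(hits) - z(false alarms)."""
    return norm.ppf(h) - norm.ppf(f)

def c_bias(h, f):
    """Criterion c, the companion bias measure for d' (Snodgrass & Corwin, 1988)."""
    return -0.5 * (norm.ppf(h) + norm.ppf(f))

# Example: hit rate .75, false alarm rate .20
print(a_prime(.75, .20), d_prime(.75, .20),
      b_double_prime_d(.75, .20), c_bias(.75, .20))
```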
Recollection-based discrimination was computed by plugging the "remember" hit and false alarm rates into the formula for A'; we call this measure A'(R). The next issue to resolve is how to measure familiarity-based discrimination. In order to estimate the probability of saying "old" to studied items (or lures) based on familiarity, one must correct for the fact that remember-familiar tests only measure the familiarity of non-recollected items (recollected items are assigned "remember" responses, regardless of how familiar they are). Jacoby et al. (1997) advocate resolving this problem by assuming that recollection and familiarity are stochastically independent (but see, e.g., Curran & Hintzman, 1995, for discussion of possible problems with this assumption). Assuming independence, the overall probability of saying "old" based on familiarity = the probability of saying "old" to non-recollected items based on familiarity = F / (1 - R); in this formula, F = the probability of making a "familiar" response and R = the probability of making a "remember" response. This formula can be used to calculate a familiarity-based hit rate and a familiarity-based false alarm rate. Familiarity-based discrimination was computed by plugging the familiarity-based hit rate and familiarity-based false alarm rate into the formula for A'; we call this measure A'(F). We also computed bias for recollection and familiarity by plugging the appropriate hit and false alarm values into the formula for B''D; we call these measures B''D(R) and B''D(F), respectively. Derived measures (A' and B''D, for old-new recognition, recollection, and familiarity) were computed based on individual participant data, then averaged across participants. For this experiment, and all subsequent experiments, alpha was set to .05, two-tailed.

Footnote 4: We also computed recollection-based discrimination using the formula suggested by Yonelinas, Kroll, Dobbins, Lazzara, and Knight (1998): P(Recollection) = ("remember" hits - "remember" false alarms) / (1 - "remember" false alarms). This formula assumes that recollection is a high-threshold process, and that "remember" false alarms are attributable to guessing. Results obtained using this formula were qualitatively identical to results obtained using A'(R); the LSE for A'(R) was significant if and only if the LSE for P(Recollection) was significant. To save space, we do not report P(Recollection) data here.

Footnote 5: Because some of the measures of sensitivity and bias that we compute are undefined with extremal values (i.e., hits or FA = 0 or 1), we (following the lead of Shiffrin et al., 1995) substituted .5/N (where N is the number of items per condition) when hits or FAs = 0, and we substituted 1 - (.5/N) when hits or FAs = 1.
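Putting the pieces together, the recollection- and familiarity-based rates can be derived from remember/familiar counts as follows. This sketch assumes independence (Jacoby et al., 1997) and applies the .5/N correction from Footnote 5; the counts and variable names are invented for illustration, and the code is ours rather than the authors' analysis script.

```python
# Sketch: recollection- and familiarity-based "old" rates for one item type
# (studied items or lures), given remember/familiar counts.
def clamp(p, n):
    # Footnote 5: replace extremal proportions with .5/N or 1 - .5/N
    return max(0.5 / n, min(1 - 0.5 / n, p))

def remember_familiar_rates(n_remember, n_familiar, n_items):
    r = n_remember / n_items            # P("remember")
    f = n_familiar / n_items            # P("familiar")
    fam = f / (1 - r)                   # independence assumption: F / (1 - R); fails if R = 1
    return clamp(r, n_items), clamp(fam, n_items)

# A' from the previous sketch, written inline for convenience:
a_prime = lambda h, f: (0.5 + (h - f) * (1 + h - f) / (4 * h * (1 - f)) if h >= f
                        else 0.5 - (f - h) * (1 + f - h) / (4 * f * (1 - h)))

# Invented example: 20 studied items and 20 lures for one participant/condition
r_hit, fam_hit = remember_familiar_rates(n_remember=12, n_familiar=5, n_items=20)
r_fa, fam_fa = remember_familiar_rates(n_remember=1, n_familiar=4, n_items=20)
print(f"A'(R) = {a_prime(r_hit, r_fa):.3f}, A'(F) = {a_prime(fam_hit, fam_fa):.3f}")
```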
Results

Raw recognition data from Experiment 1 are presented in Table 1 and derived measures are presented in Table 2. Data were analyzed using a 2 X 2 ANOVA, with Interference Strength as a within-subjects factor, and Study Time (1150 ms vs. 1300 ms) as a between-subjects factor. There were no significant main effects or interactions involving Study Time.

[Insert Table 1 and Table 2 about here]

A'(R) was significantly lower in the Strong Interference condition than in the Weak Interference condition, F(1, 62) = 45.45, MSE = .001. However, the effect of Interference Strength on familiarity-based discrimination, indexed by A'(F), was not significant, F(1, 61) = .576, MSE = .018. Finally, old-new recognition discrimination, as indexed by A'(Old), was not significantly affected by Interference Strength in this experiment, F(1, 62) = 1.22, MSE = .001. The LSE for A'(R) was significantly larger than the LSE for A'(Old), F(1, 62) = 57.60, MSE = .001, and it was also significantly larger than the LSE for A'(F), F(1, 61) = 5.50, MSE = .019. The size of the LSE for A'(Old) was larger than the LSE for A'(F), but this difference was not significant, F(1, 61) = 1.21, MSE = .016. Figure 2 plots the size of the LSE for A'(R), A'(Old), and A'(F).

[Insert Figure 2 about here]

Footnote 6: One participant in the 1150 ms condition was omitted from the A'(F) and B''D(F) analyses because they had a 100% "remember" hit rate in the Weak Interference condition (making it impossible to measure familiarity-based hits in this condition). Importantly, the LSE for A'(F) was still far from significant when extremal A'(F) values (.5 and 1.0) were substituted for the missing value.

Old-new responding, recollection-based responding, and familiarity-based responding were all significantly more conservative in the Strong Interference condition compared to the Weak Interference condition: for B''D(Old), F(1, 62) = 182.77, MSE = .106; for B''D(R), F(1, 62) = 40.83, MSE = .090; for B''D(F), F(1, 61) = 25.57, MSE = .181.

Discussion

In Experiment 1, a robust LSE for recollection-based discrimination was obtained with unrelated word stimuli. As predicted, the primary effect of list strength on recollection was a sharp decrease in the proportion of "remember" hits. Overall, "remember" false alarms were rare; this is consistent with the model's prediction that, most of the time, lures should not trigger recollection. "Remember" false alarms also decreased as a function of list strength; however, because these false alarm rates were close to floor, the decrease in false recollection was small relative to the decrease in correct recollection. Increasing list strength had no effect on participants' ability to discriminate between studied items and lures based on familiarity; familiarity-based discrimination (assuming recollection and familiarity are independent) was nonsignificantly better in the Strong Interference condition than in the Weak Interference condition. Crucially, the LSE for recognition sensitivity (computed based on "old" responses) was not significant in this experiment. This replicates the null LSE for recognition sensitivity obtained by other studies that have used unrelated verbal stimuli at study, and unrelated lures at test (e.g., Ratcliff et al., 1990; Ratcliff et al., 1992; Ratcliff et al., 1994; Yonelinas et al., 1992; Hirshman, 1995; Murnane & Shiffrin, 1991a). Given that there was a robust LSE for recollection-based discrimination in this experiment, the failure to obtain a significant LSE for old-new recognition sensitivity can be attributed to participants relying to a large extent on familiarity when making their old-new judgments; according to the CLS model, the more that participants rely on familiarity, the smaller the LSE for recognition will be. The LSE for old-new recognition discrimination and the LSE for familiarity-based discrimination were both significantly smaller than the LSE for recollection-based discrimination in this experiment.
Regarding list strength effects for bias: Several list strength studies, the present study included, have found, using signal-detection measures of bias, that participants respond more conservatively in the Strong Interference ("mixed weak") condition than in the Weak Interference ("pure weak") condition (see Hirshman, 1995, for a review). From the perspective of the CLS model, it is possible to explain the observed LSE for B''D(R) in terms of lost recollection: If the number of studied items triggering above-floor recollection decreases, the result will be a decrease in "remember" hits, which translates into a more conservative value of B''D(R). The most straightforward way to explain the LSE for B''D(Old) and B''D(F) (where recollection is contributing less, or not at all) is to posit that participants adopt a more conservative familiarity criterion in the Strong Interference condition. Hirshman (1995) argues that participants set their familiarity criterion based on the estimated range of familiarity scores associated with studied items (i.e., participants place their criterion some fixed proportion of the distance between the high endpoint and low endpoint of the range); according to this account, participants use a stricter criterion in the Strong Interference condition because their estimate of the high end of the range is higher in this condition than in the Weak Interference condition.

Finally, while the observed LSE for A'(R) is readily explained by the dual-process CLS model, it may also be possible to reconcile this result with the (very different) view that recognition is driven by a single familiarity process that is unaffected by list strength (e.g., Shiffrin, Ratcliff, & Clark, 1990; Shiffrin & Steyvers, 1997). It is a well-known fact that signal-detection measures of sensitivity such as A' and d' are not completely independent of bias (see, e.g., Snodgrass & Corwin, 1988). Thus, the LSE for A'(R) may simply be an artefact of the bias shift that was observed in this experiment. In the absence of other information about the nature of the underlying distributions, there is no way to rule out the hypothesis that our results are due to bias shifts (applied to a single underlying signal), with no real change in sensitivity. Subsequent experiments address this issue by collecting confidence rating data. This makes it possible to compute multi-point estimates of sensitivity (such as Ag; Macmillan & Creelman, 1991) that are unaffected by bias.

Experiment 2

In Experiment 2, we set out to smoothly vary the extent to which recognition performance is being driven by recollection vs. familiarity; the CLS model predicts that the magnitude of the LSE for recognition sensitivity should increase as the contribution of recollection (relative to familiarity) increases. The logic of Experiment 2 hinges on the idea that recollection triggers high-confidence "old" responses, whereas familiarity is associated with a range of different confidence values. This view differs from the view (refuted earlier) that recollection and familiarity result in high and low confidence, respectively, because it allows for high-confidence responses based on familiarity (as well as low-confidence responses based on familiarity, and high-confidence responses based on recollection). According to the CLS model, the high level of confidence associated with recollection is a consequence of the diagnosticity of the (hippocampal) recollection signal.
The CLS model predicts that false recollection should be very rare, because of the hippocampus' ability to assign distinct representations to stimuli (see Norman & O'Reilly, 2001, for more details). Thus, if an item triggers strong recollection, you can be very confident that it was studied (Yonelinas, 1994, 2001). If recollection leads to high-confidence "old" responses, it should be possible to vary the extent to which recollection is contributing to recognition by varying the confidence criterion for saying "old": Recognition sensitivity computed using high-confidence "old" responses should disproportionately reflect the influence of recollection; using a lower confidence criterion for saying "old" should increase the influence of familiarity, relative to recollection. Relating this back to list strength: If increasing list strength impairs recollection, and high-confidence "old" responses are primarily driven by recollection, then an LSE should be present when recognition sensitivity is computed using high-confidence "old" responses. Conversely, lowering the confidence criterion for saying "old" should reduce the size of the LSE, insofar as this will reduce the contribution of recollection, relative to familiarity (and, according to the CLS model, familiarity-based discrimination is not affected by list strength). These are the key predictions that we set out to test in this experiment.

A large body of empirical evidence supports the claim that recollection triggers high-confidence responses, whereas familiarity is associated with a wide range of confidence values. For example, several studies have asked participants to rate their confidence when they make remember-know judgments; these studies have found that "remember" responses tend to be assigned the highest confidence rating; in contrast, "know" responses are spread out more widely across the confidence scale (Tulving, 1985; Yonelinas et al., 1996; Yonelinas, 2001). Also, Yonelinas (1994) conducted an experiment where he computed "% old" responses based on different confidence criteria (e.g., call an item "old" if confidence > 3), and then he used the process dissociation procedure (Jacoby, 1991) to compute estimates of recollection and familiarity based on these "% old" scores; he found that criterion placement did not affect process-dissociation estimates of recollection, but process-dissociation estimates of familiarity-based responding increased with use of increasingly liberal (i.e., lower) confidence criteria. This is consistent with the idea that familiarity is often associated with low confidence responses (hence, lowering your criterion boosts familiarity-based responding) but recollection is not.

Predictions: Summary

In summary, there is good evidence for the claim that high-confidence recognition responses disproportionately reflect the influence of recollection, relative to recognition responses computed using a more liberal criterion. When combined with our hypothesis that list strength affects recollection but not familiarity, this claim implies that an LSE for recognition sensitivity (indexed using A') should be obtained when A' is computed using high-confidence responses; the LSE for A' should shrink and eventually disappear as the confidence criterion for recognition is shifted towards lower values.

Method
Participants. Thirty-six University of Colorado undergraduates (14 women and 22 men; mean age = 19.3 years) participated in the experiment. The experiment lasted approximately 1 hour and participants received course credit.

Materials. Stimuli were 300 highly imageable, concrete, familiar, medium-frequency nouns; imageability, concreteness, familiarity, and Kucera-Francis frequency data were obtained from the MRC Psycholinguistic Database (Coltheart, 1981): mean imageability = 5.76 out of 7, range = 5.02 to 6.59; mean concreteness = 5.83 out of 7, range = 5.00 to 6.48; mean familiarity = 5.02, range = 4.00 to 6.16; mean K-F frequency = 15.8 occurrences per million, range = 0 to 99; mean word length = 5.54 letters, range = 3 to 10. As discussed later, the CLS model predicts that list strength effects will vary in size as a function of target-lure similarity; because the purpose of this experiment was to examine list strength effects with nominally unrelated lures, we took steps to ensure that none of the words were strongly (semantically) related to one another. Pairwise semantic relatedness assessments of 847 concrete nouns were generated using Latent Semantic Analysis (applied to the GenCOL corpus, which is meant to reflect what a person has read up to the first year of college; semantic representations were constrained to use 300 feature dimensions; Landauer, Foltz, & Laham, 1998); from this candidate pool, 300 words were selected such that the maximum pairwise LSA cosine for the 300 words (larger cosines reflect higher semantic relatedness) was .42. A small number of near-synonymous words not caught by the LSA screening were removed by hand (e.g., coffin-casket). Also, some compound words were excluded because their constituent words were also included in the stimulus set, and an attempt was made to exclude ambiguous words (e.g., "ram"). In addition to the 300 words described above, 20 other words were used as Primacy and Recency Buffers at study. The 300 words not used as Buffers were split into two 150-word groups (Group One and Group Two), and each group was divided into three 50-word subgroups (Subgroups A, B, and C). As in Experiment 1, steps were taken to ensure that the six 50-word subgroups were matched, on average, for important word characteristics.

Footnote 7: Subgroups were matched for K-F frequency, familiarity, concreteness, imageability, and length. Also, we compiled remember-familiar and confidence ratings for individual items as part of a pilot experiment, and we made sure that subgroups were matched, on average, for item memorability (operationalized using these ratings).
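The paper does not spell out how the 300 words were selected from the LSA-screened candidate pool, but the screening it describes could be implemented along the following lines. This is an illustrative greedy sketch: the LSA vectors, the helper name, and the drop rule are ours, not the authors'.

```python
# Illustrative sketch: prune a candidate word pool until no pair of remaining
# words exceeds a maximum pairwise cosine similarity (the paper used .42).
import numpy as np

def screen_by_cosine(vectors, words, threshold=0.42):
    """Greedily drop words until the maximum pairwise cosine <= threshold."""
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    keep = list(range(len(words)))
    while True:
        sims = v[keep] @ v[keep].T
        np.fill_diagonal(sims, -1.0)                  # ignore self-similarity
        i, j = np.unravel_index(np.argmax(sims), sims.shape)
        if sims[i, j] <= threshold:
            return [words[k] for k in keep]
        # drop whichever member of the offending pair is similar to more words
        offender = i if (sims[i] > threshold).sum() >= (sims[j] > threshold).sum() else j
        del keep[offender]

# Usage (with hypothetical inputs): kept = screen_by_cosine(lsa_matrix, noun_list)
```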
Design. There were two study-test blocks: a Weak Interference block and a Strong Interference block. For half of the participants the Weak Interference block came first, and vice versa for the other half of the participants. Assignment of words to conditions was balanced such that each word appeared equally often as a Target, Interference item, and Lure; also, each word appeared equally often in the first vs. second block, and in the Weak Interference vs. Strong Interference block. This balancing was accomplished by having each of the two word Groups serve equally often in the Strong and Weak Interference blocks, and by having each of the three word Subgroups within each group serve equally often as Targets, Interference items, and Lures. Combining this rotation of 2 Groups and 3 Subgroups through conditions with whether Strong or Weak Interference came first, there were 2 X 3 X 2 = 12 between-subjects counterbalancing conditions. The overall structure of the Strong and Weak Interference blocks was the same as in Experiment 1. The only differences were as follows: Each of the blocks contained more items than in Experiment 1 (50 Targets and 50 Interference items, vs. 20 Targets and 40 Interference items in Experiment 1). Also, we used a different video game (Skittles, instead of Catch the Buzz). In this experiment, five primacy buffers were presented at the beginning of each study list, and five recency buffers were presented after all Targets and Interference items had been presented once (but before any Interference items were repeated). As in the previous experiment, the lengths of the study and video game phases were complementary, such that the time between studying and being tested on a Target item was equivalent for Strong Interference and Weak Interference blocks. The video game lasted 2 minutes in the Strong Interference condition, and 8 minutes, 53 seconds in the Weak Interference condition.

Procedure. Testing was done on an Apple iMac computer running PsyScope. The study procedure was essentially identical to the study procedure from Experiment 1: Words appeared onscreen for 1150 ms (with 500 ms between words), and participants were instructed to respond "yes" if a typical instance of that item would fit in the banker's box, and to respond "no" if a typical instance of that item would not fit in the box. If the participant entered a response during the 1150 ms interval, the computer made a "beep" noise; if they failed to respond within the 1150 ms interval, the computer made a "buzz" noise. At test, for each item, participants first rated their recognition confidence on a scale from 1 to 6; the numbers 1-6 were labeled (in order) "definitely new", "probably new", "guess new", "guess old", "probably old", and "definitely old". If participants gave a 4, 5, or 6 response (indicating that they thought the word was studied), they were asked to press the "remember" key if they recollected specific details from when the word was presented at study, and to press the "familiar" key if they responded "old" (4, 5, or 6) because the item seemed familiar (but they did not remember any specific details). The "remember-familiar" instructions were very similar to the instructions used in Experiment 1; one small change is that participants were given several examples of the kinds of things that would justify a "remember" response (e.g., if they remembered thinking about whether the stimulus would fit in the box, or if they remembered forming a mental image of the stimulus, or if they remembered how the word looked when it appeared on screen at study). Participants were encouraged to spread out their confidence ratings across the 1-6 scale. The test was self-paced, but participants were told not to dwell too long on any one item. As in Experiment 1, participants were given a short practice study and test phase before the start of the actual experiment.

Results

Raw data from Experiment 2 are presented in Table 3, and derived measures of sensitivity and bias are presented in Table 4. Data were analyzed using an ANOVA with Interference Strength as a within-subjects factor.

[Insert Table 3, Table 4, and Figure 3 about here]

The main purpose of this experiment was to examine how the size of the LSE for A' would vary as a function of the confidence criterion used to compute "old" responses. Recognition sensitivity (A') was computed using different confidence criteria for accepting an item as "old".
A' scores for the Weak and Strong Interference conditions (based on different confidence criteria) are listed in Table 4, and LSE difference scores (i.e., A' Weak Interference minus A' Strong Interference) for different confidence criteria are plotted in Figure 3. A planned linear contrast on these difference scores showed that, as predicted, the magnitude of the LSE for A' increased as recognition scores were computed with increasingly conservative (i.e., higher) confidence criteria; the contrast was highly significant, F(1, 140) = 49.53, MSE = .002. The LSE for A' was significant with the >4 confidence criterion, F(1, 35) = 9.13, MSE = .001, and with the >5 confidence criterion, F(1, 35) = 27.15, MSE = .001. The LSE for A' did not approach significance with more liberal confidence criteria (>1, >2, and >3), all F(1, 35) values <= .65.

Regarding list strength effects on recollection and familiarity: As in Experiment 1, we used "remember" responses to estimate the probability of saying "old" based on recollection, and we used the independence remember-know formula (applied to "remember" and "familiar" responses) to estimate the probability of saying "old" based on familiarity. The LSE for recollection-based discrimination, as indexed by A'(R), was significant, F(1, 35) = 27.04, MSE = .001. The LSE for familiarity-based discrimination, as indexed by A'(F), was not significant, F(1, 35) = .11, MSE = .006; numerically, performance was actually slightly better in the Strong Interference condition than in the Weak Interference condition. The size of the LSE for A'(R) was significantly larger than the size of the LSE for A'(Old) (i.e., A' computed using a >3 confidence criterion), F(1, 35) = 15.10, MSE = .001, and it was also significantly larger than the size of the LSE for A'(F), F(1, 35) = 5.83, MSE = .006. The size of the LSE for A'(Old) was larger than the LSE for A'(F), but this difference was not significant, F(1, 35) = 1.35, MSE = .002.

Looking at measures of bias: The LSE for B''D was highly significant for all five confidence criteria, F(1, 35) >= 11.45; also, as in Experiment 1, there was a significant LSE for B''D(R), F(1, 35) = 20.42, MSE = .091, and for B''D(F), F(1, 35) = 26.16, MSE = .097. For all of these B''D measures, responding was more conservative in the Strong Interference condition.

[Insert Figure 4 about here]

The ROC curves (plotted based on the raw data in Table 3) for the Weak Interference and Strong Interference conditions are shown in Figure 4. From this figure, it is apparent that points from the two Interference conditions lie on distinct ROC curves, with a larger area under the Weak Interference ROC than under the Strong Interference ROC (thereby indicating greater overall recognition sensitivity in the Weak Interference condition). We estimated the area under the ROC curve for individual participants using the Ag measure described in Macmillan and Creelman (1991), which factors in data from all points on the ROC. Ag was larger in the Weak Interference condition than in the Strong Interference condition (.914 versus .881); this difference was significant, F(1, 35) = 8.10, MSE = .002.

Footnote 8: To compute Ag, we assumed that the endpoints of the ROC are (0, 0) and (1, 1); this assumption is valid if recognition is based on a normally-distributed familiarity signal, but it is not valid according to dual-process models; e.g., both the CLS model and the Yonelinas (1994) model predict a positive Y-intercept when recollection is contributing to performance. If this prediction is correct, then Ag will underestimate the area under the ROC (to a varying extent, depending on the actual Y-intercept value). However, so long as the observed ROC curves encompass a wide range of hit and FA values, estimation error resulting from our use of (0, 0) and (1, 1) as endpoints will be proportionally quite small, and the overall utility of Ag (as a means of comparing sensitivity across conditions) will not be compromised.
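The Ag computation described above (and qualified in Footnote 8) amounts to sweeping the confidence criterion, plotting cumulative hit and false alarm rates, and taking the trapezoidal area with (0, 0) and (1, 1) appended as endpoints. A minimal sketch of that calculation, with invented rating counts standing in for the per-participant data in Table 3:

```python
# Sketch (ours) of the Ag calculation: area under the ROC traced out by the
# 1-6 confidence ratings, with endpoints (0, 0) and (1, 1) as in Footnote 8.
import numpy as np

def roc_points(old_counts, new_counts):
    """Cumulative P('old') for studied items and lures, sweeping the criterion
    from 'rating = 6 only' down to 'rating >= 2'."""
    old = np.asarray(old_counts, float)   # counts of ratings 1..6, studied items
    new = np.asarray(new_counts, float)   # counts of ratings 1..6, lures
    hits = np.cumsum(old[::-1])[:-1] / old.sum()
    fas = np.cumsum(new[::-1])[:-1] / new.sum()
    return hits, fas

def ag(hits, fas):
    """Trapezoidal area under the ROC with (0, 0) and (1, 1) appended."""
    x = np.concatenate(([0.0], fas, [1.0]))
    y = np.concatenate(([0.0], hits, [1.0]))
    return float(np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2))

# Invented rating counts for 50 studied items and 50 lures:
hits, fas = roc_points(old_counts=[2, 3, 5, 6, 14, 20],
                       new_counts=[20, 14, 8, 4, 3, 1])
print(f"Ag = {ag(hits, fas):.3f}")
```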
------------------------------------------
Insert Figure 5 about here
------------------------------------------

Finally, remember-familiar data obtained in Experiment 2 provide converging evidence for the claim -essential to the logic of this experiment -that using a high confidence criterion increases the relative extent to which recollection (vs. familiarity) is driving recognition performance. Figure 5 plots the average probability (collapsing across Study Status and Interference Strength) that an item will be called "old" based on recollection, and the average probability that an item will be called "old" based on familiarity, as a function of the old-new confidence criterion. The probability of saying "old" based on recollection does not change much as a function of criterion placement, indicating that most "remember" responses were given the highest possible confidence rating; in contrast, the probability of saying "old" based on familiarity increases steadily as the confidence criterion is relaxed, showing that "familiar" responses were more evenly distributed across the confidence scale. To determine the relative extent to which recollection vs. familiarity is driving performance, we can look at the ratio of the two probabilities ("recollection old" vs. "familiarity old") -this ratio is close to one when "old" responses are computed based on a neutral (>3) confidence criterion, and the ratio increases as the confidence criterion increases; for the >5 confidence criterion, the probability of saying "old" based on recollection is more than six times larger than the probability of saying "old" based on familiarity.

Footnote 9: These probability values were computed in the following manner: For each confidence criterion (starting with confidence = 3), all items assigned a confidence rating greater than the criterion value were called "old"; then, we used remember-familiar data from these "old" items to compute P("recollection old") = P("remember" and "old"), and P("familiarity old") = P("familiar" and "old") / (1 - P("remember" and "old")).

Signal-detection interpretations of these results

The canonical single-process, signal-detection view of recognition memory posits that recognition is based on a familiarity signal that is normally distributed for studied items and lures. According to this view, recognition performance can be described in terms of the slope of the z-ROC curve (which corresponds to the ratio of the standard deviations of the lure and studied-item familiarity distributions), and a sensitivity parameter (da) that indexes the standardized distance between the lure and studied-item familiarity distributions (Macmillan & Creelman, 1991). Although our results are consistent with the predictions of the dual-process CLS model, a single-process model could -in principle -also provide an adequate explanation of the data.
In particular, it may be possible to explain our finding that the LSE for A' increased as the confidence criterion was shifted (from liberal to conservative values) using a single-process model -if the slope of the z-ROC curve is lower in the Weak Interference condition than in the Strong Interference condition, then A' scores computed based on conservative criteria will show an LSE and A' scores computed based on liberal criteria will show a negative LSE, even if there is no actual difference in sensitivity. Figure 6 illustrates how z-ROC slope differences can lead to A' differences that vary in magnitude (and direction) as a function of criterion placement; it shows two hypothetical z-ROC curves that differ in slope but correspond to equal levels of sensitivity (indexed using da). When responding is conservative (towards the lower-left corner of the graph), the low-slope line is further from the diagonal -the difference between hits and false alarms is, on average, larger in the low-slope condition, so A' scores computed in this region tend to be larger in the low-slope condition. However, when responding is more liberal (towards the upper-right corner of the graph), the high-slope line is further from the diagonal, so A' scores computed in this region tend to be larger in the high-slope condition.

------------------------------------------
Insert Figure 6 about here
------------------------------------------

The slope-difference account of our data hinges on the idea that the z-ROC slope is lower in the Weak (vs. Strong) Interference condition. To test this idea, we computed maximum likelihood estimates of z-ROC slope and da (as a function of Interference Strength) for individual participants using the RSCORE+ algorithm (Harvey, 2001). Because estimating signal-detection parameters from the ROC was not the primary goal of this experiment, we collected less data per participant per condition than is typically the case in an ROC experiment (for discussion of problems that can arise in ROC analysis when too little data is collected per participant, see Yonelinas & Quamme, submitted); nonetheless, our parameter estimates were stable enough (across participants) to be informative. The average z-ROC slope for the Weak Interference condition was .70 (SEM = .05) and the average z-ROC slope for the Strong Interference condition was .66 (SEM = .06); this difference was not significant, F(1, 35) = .23, MSE = .082. da was significantly higher in the Weak Interference condition than in the Strong Interference condition (in the Weak Interference condition, da = 2.13, SEM = .09; in the Strong Interference condition, da = 1.76, SEM = .09; F(1, 35) = 9.17, MSE = .260). These da results converge with the Ag results presented earlier -both indicate that overall sensitivity was higher in the Weak Interference condition. Importantly, the results of this analysis rule out the possibility that observed differences in the size of the LSE for A' (across different confidence criteria) are attributable to slope differences.

Footnote 10: We thank Caren Rotello and David Huber for pointing out this possibility.

Footnote 11: RSCORE+ builds on the RSCORE parameter estimation algorithm developed by Dorfman and Alf (1969; see also Dorfman, Beavers, & Saslow, 1973) by incorporating more robust nonlinear fitting techniques and other mathematical advances. The RSCORE+ software can be downloaded from http://psych.colorado.edu/~lharvey (from the main page, follow the link to "Software").
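For readers who wish to carry out a comparable analysis, a rough approximation of this parameter-estimation step can be obtained by least-squares fitting of the z-transformed ROC points; note that the analysis reported above used maximum-likelihood estimation via RSCORE+, so the sketch below is only an approximation, and the ratings shown are hypothetical.

```python
from statistics import NormalDist, mean

def zroc_slope_and_da(target_ratings, lure_ratings, max_rating=6):
    """Rough least-squares fit of the z-ROC, z(H) = slope * z(F) + intercept
    (the reported analysis used maximum-likelihood estimation via RSCORE+)."""
    z = NormalDist().inv_cdf
    z_f, z_h = [], []
    for criterion in range(1, max_rating):                  # criteria ">1" ... ">5"
        h = sum(r > criterion for r in target_ratings) / len(target_ratings)
        f = sum(r > criterion for r in lure_ratings) / len(lure_ratings)
        if 0 < h < 1 and 0 < f < 1:                         # z is undefined at 0 and 1
            z_f.append(z(f))
            z_h.append(z(h))
    mf, mh = mean(z_f), mean(z_h)
    slope = (sum((x - mf) * (y - mh) for x, y in zip(z_f, z_h))
             / sum((x - mf) ** 2 for x in z_f))
    intercept = mh - slope * mf
    d_a = intercept * (2.0 / (1.0 + slope ** 2)) ** 0.5     # da from slope and intercept
    return slope, d_a

# Hypothetical confidence ratings (1-6) for one participant in one condition:
targets = [6, 5, 6, 4, 6, 5, 3, 6, 5, 4, 6, 2]
lures = [1, 2, 1, 3, 2, 4, 1, 2, 3, 1, 5, 2]
slope, d_a = zroc_slope_and_da(targets, lures)
print(f"z-ROC slope = {slope:.2f}, da = {d_a:.2f}")
```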
The average z-ROC slope was numerically (but nonsignificantly) higher in the Weak Interference condition, whereas the slope explanation of our results requires the z-ROC slope to be lower in the Weak Interference condition. To be clear, this analysis does not rule out single-process, signal-detection explanations of our results in general; but it does show that the basic findings from this experiment can not be attributed to anything as straightforward as a difference in slopes. Discussion Our central prediction for Experiment 2 was that the LSE for A' should be largest when a high confidence criterion is used (and performance is driven primarily by recollection), relative to conditions where a lower confidence criterion is used (and familiarity is making a more substantial contribution). The results of Experiment 2 confirm this prediction -a significant LSE for recognition sensitivity emerged when A' was computed using a high confidence criterion (4 or 5) for accepting an item as “old”. When A' was computed using lower confidence criteria, we obtained the same null (nonsignificant) list strength effect that other studies have found. Indeed, the results show a linear trend whereby the list strength effect decreases monotonically, and eventually turns negative, as the confidence criterion is lowered; this is consistent with the idea that lowering the confidence criterion for saying “old” increases the relative contribution of familiarity (thereby attenuating the list strength effect). Also, remember-know data from this experiment provide a conceptual replication of the most important finding from Experiment 1: There was a robust LSE for discrimination based on “remember” responses, but no LSE for familiarity-based discrimination (computed using the independence remember-know formula). Experiment 3 Yet another way to isolate the influence of recollection is to use related lures (i.e., lures that are highly similar to specific studied items) at test. The CLS model predicts that hippocampal recollection should discriminate well between studied items and related lures, because of the hippocampus' ability to assign relatively distinct representations to similar input patterns. In contrast, the cortical familiarity signal should not discriminate well -because cortex assigns similar representations to studied items and related lures, both studied items and related lures will trigger strong feelings of familiarity. The CLS model therefore predicts that yes-no (YN) recognition tests with related lures should load heavily on the hippocampal recollection process, relative to tests with unrelated lures (where both familiarity and recollection discriminate). As such, we would expect a larger list strength effect on YN test with related lures, than on a YN test with unrelated lures. Several empirical results suggest that recollection is especially important for discriminating between studied items and very similar distractors. A large number of these studies have used a plurality-memory paradigm (Hintzman, Curran, & Oppy, 1992), in which participants have to discriminate between studied words and switched-plurality lures (e.g., after studying “scorpions”, participants might be given the lure “scorpion” at test) as well as nominally unrelated lures (e.g., study “scorpions”, test with “banana”). 
One study, conducted by Hintzman and Curran (1994), exploits the idea that speeded responding disrupts recollection more than familiarity (see also Gronlund & Ratcliff, 1989; Hintzman & Caulton, 1997; Hintzman, Caulton, & Levitin, 1998). If discrimination with related lures depends critically on recollection (but discrimination with unrelated lures does not), speeded responding should disproportionately hurt participants' ability to discriminate between studied items and switched-plurality lures, relative to their ability to discriminate between studied items and nominally unrelated lures; Hintzman and Curran obtained exactly this result (but see Brockdorff & Lamberts, 2000, and Rotello & Heit, 1999, for a single-process account of these findings). The CLS model also predicts that ROC curves should have a (mostly) linear shape when recollection is driving performance and studied stimuli are relatively dissimilar, whereas ROC curves driven by familiarity should have a curvilinear shape (see Norman & O'Reilly, 2001, for more details; Yonelinas et al., 1996, also make this prediction). In keeping with the view that plurality discrimination is supported by recollection, Rotello, Macmillan, and Van Tassel (2000) found that plurality-discrimination ROC curves had a linear shape, whereas studied vs. unrelated-lure ROC curves had more of a curvilinear shape. Finally, the CLS model predicts that hippocampal damage should impair YN recognition with related lures more than YN recognition with unrelated lures (insofar as the former task depends on recollection, and recollection is supported by the hippocampus); consistent with this view, Holdstock et al. (in press) found that a patient with focal hippocampal damage was strongly impaired on a visual object recognition test with highly similar lures, but was relatively unimpaired on a version of the test where lures were not closely related to studied items. In summary: Several findings indicate that use of similar lures will increase the relative extent to which participants rely on recollection; this, in turn, should increase the odds of obtaining an LSE for recognition sensitivity. To test this hypothesis, we used a variant of the plurals recognition paradigm (Hintzman et al., 1992; Curran, 2000) described above. Participants studied singular and plural words. At test, participants were instructed to say "old" if the test word exactly matched a studied word, and to say "new" otherwise; there were two kinds of lures: related switched-plurality (SP) lures and unrelated lures.

Footnote 12: According to the CLS model, recollection comes "on line" after familiarity because the system responsible for recollection (the hippocampus) is located after the system responsible for familiarity (medial temporal lobe cortex) in the processing stream (Norman & O'Reilly, 2001).

Footnote 13: Using lures that are very similar (but not identical) to studied items should increase the odds of obtaining an LSE for recognition sensitivity, but using lures that are familiar because they appeared earlier in the experiment (but not on the study list; e.g., Ratcliff et al., 1990, Experiment 4) may actually decrease the odds of obtaining an LSE. In this situation, recognition discrimination is based primarily on recollection of list context information, which is shared by strengthened and non-strengthened studied items; according to the CLS model, recollection of shared details should not be adversely affected by list strength (see Norman & O'Reilly, 2001, for discussion of how list strength affects recollection of shared vs. item-specific information).
The CLS model predicts that the ability to discriminate between studied words and related SP lures should depend on recollection. Thus, we should find a significant list strength effect for studied vs. SP discrimination, but not necessarily for studied vs. unrelated discrimination, which can also be supported by familiarity. Furthermore, we can also look at SP vs. unrelated pseudodiscrimination, i.e., how much more likely participants are to say "old" to related vs. unrelated lures. Familiarity supports pseudodiscrimination (insofar as SP lures will be more familiar than unrelated lures), but recollection of plurality information lowers pseudodiscrimination by allowing participants to confidently reject SP lures (i.e., participants can confidently reject "scorpion" if they recollect having studied "scorpions"; for evidence that this recall-to-reject process contributes to performance on plurality recognition tests, see Rotello et al., 2000). If increasing list strength boosts discrimination based on familiarity, but lowers recollection, both of these effects will work in concert to boost pseudodiscrimination. Hence, we predict a large negative LSE for pseudodiscrimination (i.e., it should be higher in the Strong Interference condition than in the Weak Interference condition).

Method

Participants. Eighty University of Colorado undergraduates and graduate students (49 women and 31 men, mean age = 20.3 years) volunteered to participate in the experiment. The experiment lasted approximately 1 hour and participants were either paid $10 or given course credit. Materials. Stimuli were 250 highly imageable, concrete, familiar medium-frequency nouns; for all of these words, the plural form of the word is generated by adding "s" to the singular form of the word. Practically all of these words were also used as stimuli (in their singular form) in Experiment 2, so the overall characteristics of the stimuli used in this experiment were practically identical to the characteristics of the stimuli used in Experiment 2. In addition to the aforementioned 250 words, 20 other words were used as Primacy and Recency Buffers at study. The 250 words not used as Buffers were split into 10 groups of 25 words. These groups were matched, on average, for important word characteristics such as word frequency, as well as memorability (see Experiment 2 Methods for more details). Design. Apart from the plurality manipulation, the design of this experiment was very similar to the design of the prior two experiments. There were two study-test blocks: A Weak Interference block and a Strong Interference block. For half of the participants the Weak Interference block came first; vice-versa for the other half of the participants. Half of the items on the study list were studied in their singular form, and half were studied in their plural form. In situations where an item was presented repeatedly at study, the item was always presented as a singular word, or always presented as a plural word -in no case was an item studied in both its singular form and its plural form.
The 10 word groups were rotated across the 10 conditions shown in Table 5 to ensure that words from each group served equally often in each condition. ------------------------------------------Insert Table 5 about here ------------------------------------------Finally, each item was studied equally often (across participants) in its singular and plural form. Combining all of these factors together, there were 40 between-subjects List Strength Effects 24 counterbalancing conditions: (rotate 10 word groups across conditions) X (study each item as both a singular and a plural word) X (Weak block first vs. Strong block first). The overall structure of the Weak and Strong Interference study lists was the same as in Experiment 2; each list contained 50 Targets and 50 Interference items; Interference items were studied 1X in the Weak Interference list vs. 6X in the Strong Interference block (see the Shared Design Elements section for more details). A minor difference between this experiment and the preceding two experiments is that stimuli were presented in a random order in Experiment 3 (subject to the constraints outlined in Figure 1), whereas the preceding two experiments presented stimuli in a fixed order for a given counterbalancing condition. We used a different video game (Gem Master) in this experiment. As in Experiments 1 and 2, the length of the video game phase was complementary to the length of the study phase, such that the total time elapsed between studying a target item and being tested on that item was the same in the Weak and Strong Interference conditions. The video game lasted 2 minutes in the Strong Interference condition, and 8 minutes, 53 seconds in the Weak Interference condition. The recognition test was comprised of 25 studied target items (items presented in the same plurality at study and test), 25 switched-plurality lures (target items that were presented in a different plurality at study vs. at test), and 25 unrelated lures (items that were not presented in either plurality at study). The 75 test items were presented in a random order, with the constraint that each miniblock of 15 items consisted of 5 studied words, 5 switched-plurality lures, and 5 unrelated lures. After the aforementioned 75 items were presented, participants were given 15 extra test items: 5 studied interference items, 5 lures generated by switching the plurality of studied interference items, and 5 more unrelated lures; these different groups were randomly mixed together. We did not score these extra 15 test trials; the purpose of testing interference items was to reinforce the idea that participants should pay attention to interference items at study. Procedure. Testing was done on an Dell Dimension computer running E-Prime software (Psychology Software Tools, Pittsburgh, PA). The study procedure was very similar to the procedure used in Experiments 1 and 2: Words appeared on screen for 1150 ms (with 500 ms between words). The main difference is that, in this experiment, participants were asked to pay close attention to the plurality of studied items, and the encoding task was modified to force participants to attend to plurality. Specifically, participants were told: If the word is plural, they should picture more than one of that object, and say whether multiple (i.e., at least two) copies of that object would fit in the box; if the word is singular, they should picture only one of that object, and say whether that single object would fit in the box. 
The instructions repeatedly emphasized that -in order to have good plurality memory -participants had to actively try to picture multiple objects for plural words and single objects for singular words. If participants failed to enter a response (“no” or “yes”) within the 1150 ms interval in which an item was onscreen, the experiment was temporarily suspended and a message appeared onscreen telling the participant to respond more quickly; participants had to press the space bar to continue. At test, participants had to make a “studied-nonstudied” judgment for each item. Participants were told to respond “studied” if the test word exactly matched a word that was studied during the size judgment task (i.e., they studied this word in this plurality), List Strength Effects 25 and they were told to respond “nonstudied” if the test word did not exactly match a studied word. Participants were also told to be very particular about the plurality of test words (i.e., if “SCORPION” is presented at study, but its plural form, “SCORPIONS”, is presented at test, the correct answer would be “nonstudied”). For each item, after participants made their “studied-nonstudied” response, they were asked to rate their confidence on a 3-point scale (1 = guess; 2 = probably right; 3 = sure). Participants were encouraged to spread out their confidence ratings across the entire scale. When we analyzed the data, confidence ratings were converted to a 6-point scale that matches the scale used in Experiment 2 (1 = sure new; 2 = probably new; 3 = guess new; 4 = guess old; 5 = probably old; 6 = sure old). The test was self-paced but participants were told not to dwell too long on any one item. As in the preceding experiments, participants were given a short practice study and test phase before the start of the actual experiment. Results Raw data are presented in Table 6 and derived sensitivity measures are presented in Table 7. ------------------------------------------Insert Table 6, Table 7, and Figure 7 about here ------------------------------------------Data were analyzed using an ANOVA with Interference Strength as a withinsubjects factor. We computed three different kinds of recognition discrimination: studied vs. unrelated lure discrimination (S vs. U) , studied vs. switched-plurality lure discrimination (S vs. SP), and switched-plurality vs. unrelated lure pseudodiscrimination (SP vs. U); in each case, we used Ag to index recognition sensitivity. There was a significant LSE for S vs. SP discrimination, F(1, 79) = 5.58, MSE = .006; the LSE for S vs. U discrimination was not significant, F(1, 79) = 2.25, MSE = .005; and there was a significant negative LSE for SP vs. U pseudodiscrimination, F(1, 79) = 12.02, MSE = .007. Figure 7 plots the size of the LSE for S vs. SP discrimination, S vs. U discrimination, and SP vs. U pseudodiscrimination (all indexed using Ag). ------------------------------------------Insert Figure 8 about here ------------------------------------------Because we collected confidence data in this experiment, we can look at how the size of the LSE for A' varies as a function of the confidence criterion used to compute “old” responses; we also wanted to see whether the effect of criterion placement on the LSE for A' varies as a function of lure relatedness. Figure 8 plots the size of the LSE for S vs. U discrimination and S vs. SP discrimination (both indexed using A'), as a function of the confidence criterion for saying “old”. The results for S vs. 
U discrimination replicate the results of Experiment 2 -there was a significant LSE for A' when a high (>5) confidence criterion was used, F(1, 79) = 18.18, MSE = .002; as the confidence criterion is lowered, the size of the LSE for A' decreases and eventually the LSE becomes negative; the negative LSE for the >1 confidence criterion was significant, F(1, 79) = 16.52, MSE = .002. A planned linear contrast on the size of the LSE for S vs. U discrimination (indexed using A'), as a function of confidence criterion, was significant, F(1, 316) = 71.04, MSE = .005. The results for S vs. SP discrimination are very different from the results for S vs. U discrimination -the LSE for A' was numerically positive for all confidence criteria; the LSE was actually numerically smallest for the >5 confidence criterion and there was a trend for the LSE to increase as the confidence criterion was lowered, although this trend was far from significant; post-hoc linear contrast F(1, 316) = .72, MSE = .006. For S vs. SP discrimination, the LSE for A' was not significant for any of the individual confidence criteria at p < .05 two-tailed, although the LSE for the >1 confidence criterion was significant at p < .05 one-tailed, F(1, 79) = 3.53, MSE = .010. We directly compared the size of the LSE for A' when switched-plurality lures were used (S vs. SP discrimination) vs. when unrelated lures were used (S vs. U discrimination). For the >1, >2, >3, and >4 confidence criteria, the LSE for A' was (numerically) larger when SP lures were used vs. when U lures were used. This difference was significant for the >1 criterion, F(1, 79) = 24.43, MSE = .103, and for the >2 criterion, F(1, 79) = 7.48, MSE = .007. For the >3 confidence criterion, the difference was only significant one-tailed, F(1, 79) = 2.97, MSE = .006. For the >4 and >5 confidence criteria, the difference in the size of the LSE for A' (when SP lures were used vs. when U lures were used) was not significant, both F(1, 79) values <= 1.12. As in previous experiments, there was an LSE for bias -responding was more conservative in the Strong Interference condition than in the Weak Interference condition; for all comparisons (S vs. SP, S vs. U, SP vs. U) and all confidence criteria, the LSE for B''D was highly significant, all F(1, 79) values >= 19.48.

Footnote 14: To compute Ag (for each participant), we first calculated hits and false alarms based on different confidence criteria; then, we fed these (hit, FA) pairs, along with the points (0, 0) and (1, 1), into the formula for Ag. See Footnote 8 for more discussion of the Ag measure.

Discussion

There were two major predictions for this experiment: First, there should be an LSE for studied vs. SP lure discrimination; second, there should be a negative LSE for SP vs. unrelated lure pseudodiscrimination. Both of these predictions were confirmed by the data. The CLS model predicts a significant LSE for studied vs. SP lure discrimination because (according to the model) discrimination with related lures depends on recollection, and recollection is impaired by list strength. The negative LSE for SP vs. unrelated lure pseudodiscrimination can also be explained in terms of reduced recollection -according to the CLS model, increasing list strength lowers the odds that SP lures will be rejected (based on recollection of studied plurality information); thus, increasing list strength results in an increase in "old" responses to SP lures, which boosts pseudodiscrimination. The LSE for studied vs.
unrelated lure discrimination (indexed using Ag) was smaller in Experiment 3 than in Experiment 2 (in Experiment 3, mean LSE = .017 and SEM = .011; in Experiment 2, mean LSE = .033 and SEM = .011; the LSE was significant in Experiment 2 but not in Experiment 3). The CLS model predicts that recollection and familiarity can both support discrimination when lures are unrelated to studied items; the size of the LSE for studied vs. unrelated lure discrimination will be a function of exactly how much recollection (relative to familiarity) is contributing. Thus, we may be able to explain the reduced size of the LSE in terms of the idea that recollection was contributing less to studied vs. unrelated lure discrimination in this experiment than in Experiment 2. There are several reasons why this may have been the List Strength Effects 27 case. First, recollection of item information (i.e., did I study this word, regardless of plurality) is diagnostic in Experiment 2, but item information alone is not diagnostic in this experiment -if you do not remember plurality, you can not be sure that you studied this exact word. It therefore stands to reason that participants would weight item recollection less heavily in Experiment 3 than in Experiment 2. Also, we collected remember-familiar data in Experiment 2 but not in Experiment 3; use of rememberfamiliar testing may induce participants to pay more attention to recollection (in situations where familiarity also discriminates) than they would otherwise. In this experiment, we replicated the key pattern of results from Experiment 2: The LSE for S vs. U discrimination (indexed using A') was significant when a high (>5) confidence criterion was used, and the LSE for A' decreased as the confidence criterion was lowered. However, we found a different pattern of results for S vs. SP discrimination --when SP lures were used, the LSE for A' stayed relatively constant as the confidence criterion was lowered. Why does lowering the confidence criterion used to compute “old” responses have different effects on the LSE for A' when lures are related (SP) vs. when lures are unrelated (U) to studied items? According to the CLS model, increasing list strength affects S vs. U discrimination primarily by reducing the amount of recollection triggered by studied items -items that would have been assigned a high confidence rating (based on recollection) are assigned a somewhat lower confidence rating; this shift is detectable when a high confidence criterion is used but not when a lower confidence criterion is used. Thus, the LSE for A' should be larger with a high confidence criterion vs. when a lower confidence criterion is used. In contrast, the CLS model predicts that increasing list strength will affect S vs. SP discrimination in two distinct ways: Increasing list strength will reduce the number of high confidence “old” (i.e., confidence rating = 6) responses assigned to studied items due to recollection, and it will also reduce the number of high confidence “new” (i.e., confidence rating = 1) responses assigned to SP lures due to recollection of plurality information from the study phase. Thus, increasing list strength results in a decrease in confidence ratings assigned to studied items and an increase in confidence ratings assigned to lures. The former shift is maximally detectable when a high confidence criterion is used (e.g., >5) , and the latter shift is maximally detectable when a low confidence criterion is used (e.g., >1). 
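To make this logic concrete, the following toy example (with rating counts that are entirely hypothetical, not data from the experiment) shows how a downward shift in studied-item confidence produces a positive LSE for A' at a strict criterion, while an upward shift in SP-lure confidence produces a positive LSE at a lenient criterion.

```python
# Toy illustration with hypothetical rating counts: strengthening the interference
# list pulls some studied-item ratings down from 6 and pulls some switched-plurality
# (SP) lure ratings up from 1, so the A'-based LSE for S vs. SP discrimination is
# positive at both a strict (>5) and a lenient (>1) criterion.

def a_prime(h, f):
    if h == f:
        return 0.5
    if h > f:
        return 0.5 + ((h - f) * (1 + h - f)) / (4 * h * (1 - f))
    return 0.5 - ((f - h) * (1 + f - h)) / (4 * f * (1 - h))

def rate_above(counts, criterion):
    """counts[r] = number of items given rating r (1-6); returns P(rating > criterion)."""
    total = sum(counts.values())
    return sum(n for r, n in counts.items() if r > criterion) / total

# Hypothetical rating distributions (25 items per cell):
studied = {"weak":   {6: 12, 5: 5, 4: 3, 3: 2, 2: 2, 1: 1},
           "strong": {6: 8,  5: 7, 4: 4, 3: 3, 2: 2, 1: 1}}   # fewer confident hits
sp_lure = {"weak":   {6: 1, 5: 2, 4: 3, 3: 4, 2: 5, 1: 10},
           "strong": {6: 1, 5: 3, 4: 4, 3: 5, 2: 6, 1: 6}}    # fewer confident rejections

for criterion in (5, 1):
    a = {cond: a_prime(rate_above(studied[cond], criterion),
                       rate_above(sp_lure[cond], criterion))
         for cond in ("weak", "strong")}
    print(f">{criterion}: LSE for A' = {a['weak'] - a['strong']:+.3f}")
```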
Because there are effects of list strength at both ends of the confidence spectrum, and both effects are deleterious to recognition performance, it should be possible to detect a list strength effect regardless of criterion placement -this is what we found. This dissociation, whereby the LSE for A' is positive regardless of criterion placement when lures are related to studied items, but the LSE for A' shrinks (and becomes negative) as the confidence criterion is lowered when lures are unrelated to studied items, supports the CLS model's prediction that lure relatedness will interact with list strength. General Discussion The Complementary Learning Systems (CLS) recognition memory model predicts a dissociation whereby list strength effects should be obtained for recognition driven by hippocampal recollection, but not for recognition driven by cortical familiarity (Norman & O'Reilly, 2001). This implies that an LSE for recognition sensitivity should be present when recollection is driving recognition performance. Conversely, the LSE for List Strength Effects 28 recognition sensitivity should be null or negative to the extent that participants are relying on familiarity. We confirmed this prediction using three separate experiments, which used three distinct means of isolating the contribution of recollection to recognition performance. In all three experiments, there was a significant LSE for sensitivity measures that load heavily on recollection; in contrast, the LSE was not significant (and sometimes negative) for sensitivity measures that load more heavily on familiarity. In Experiments 1 and 2, we used self-report measures of recollection and familiarity to isolate the respective contributions of these processes. The LSE for discrimination based on recollection was highly significant in both experiments; in contrast, the LSE for familiarity-based discrimination (assuming that recollection and familiarity are independent) was not significant, and numerically negative. The LSE for discrimination based on “old” responses (which can be driven by recollection or familiarity) was also not significant in either experiment. In Experiments 2 and 3, we used confidence ratings to isolate the contribution of recollection. The CLS model suggests that recollection should trigger high confidence responses, more so than familiarity; therefore, measures of sensitivity that are computed based on high-confidence responses should show an LSE, more so than measures of sensitivity that are computed based on lower confidence criteria. In keeping with this prediction, we found a significant LSE for A' in both experiments when sensitivity was computed using a high (>5) confidence criterion for saying “old”, and the size of the LSE for A' decreased monotonically as the confidence criterion was lowered. Finally, in Experiment 3, we relied on the CLS model's prediction that recollection should be especially important when related lures (i.e., lures that are similar to specific studied items) are used at test. If recognition tests with related lures depend on recollection (more so than tests with nominally unrelated lures), then the LSE for recognition sensitivity should be larger on tests with related lures than on tests with unrelated lures. In Experiment 3, we used related, switched-plurality lures as well as unrelated lures -the LSE for recognition sensitivity (indexed using Ag) was significant with related lures but not with unrelated lures. 
Furthermore, we computed recognition sensitivity (A') based on different confidence criteria; with unrelated lures, the LSE for A' decreased significantly as the confidence criterion was lowered; but when lures were very similar to studied items, the LSE for A' stayed relatively constant as the confidence criterion was lowered. As a result of these contrasting trends, the LSE for A' was significantly larger when related (vs. unrelated) lures were used, for three of the five confidence criteria (>1, >2, and >3). Relation to other list strength studies In the experiments reported here, we replicated the null list strength effect that other studies have found, when recognition sensitivity is computed based on whether participants think the item is “old” or “new”, and lures are not strongly related to studied items; in all three of our experiments, the LSE was not significant when A' was computed based on the probability of calling an item “old” (in Experiment 1) or the probability of giving an item a confidence rating >3 (in Experiments 2 and 3). The fact that we replicated the null LSE for old-new recognition sensitivity, despite our use of a paradigm that differs in several ways from the paradigm used in other list strength studies, attests to the robustness of this finding. List Strength Effects 29 This study is the first to explore the effect of list strength on “remember” and “familiar” responses, and it is also the first list strength study to use extremely similar lures (i.e., the switched-plurality lures we used in Experiment 3) -thus, the fact that we found list strength effects on discrimination computed using “remember” responses, and discrimination using high-similarity lures, does not directly contradict extant results. Other studies have used related lures (e.g., Shiffrin, Huber, & Marinelli, 1995 used nonstudied category exemplars from studied categories as lures) but lures in these studies were not nearly so similar (to studied items) as the lures used here -as such, participants may have been able to rely on familiarity (which, by hypothesis, is unaffected by list strength) in these experiments. However, some of our results do appear to contradict extant list strength data. Specifically: Ratcliff et al. (1994) ran several list strength experiments where they collected confidence data at test; using raw data provided in the Appendix of the Ratcliff et al. article, we were able to compute A' based on different confidence criteria; none of the Ratcliff et al. experiments found an LSE (i.e., better sensitivity for “pure weak” items than “mixed weak” items) when sensitivity was computed based on high-confidence responses --across all of Ratcliff et al.'s experiments, the size of the LSE for A' (computed based on the highest confidence criterion) varied from -.039 (in Experiment 1) to .011 (in Experiment 5, for low frequency word stimuli). This directly contradicts our finding, from Experiment 2 and Experiment 3, of a substantial LSE (> .03, in A' units) when A' was computed based on high-confidence (>5) responses. How can we reconcile these conflicting results? At present, we can only speculate as to what the relevant differences might be. In our experiments, we took several steps to boost the size of the LSE: Our experiments used a powerful “strength” manipulation: Interference items were presented six times in the Strong Interference condition, vs. once in the Weak Interference condition. 
Also, the experiments reported here used an encoding task that was designed to ensure that memory traces would be rich enough to support recollection, and -at the same time -prevent excess elaboration that might lead to ceiling effects on recollection. In contrast, Ratcliff et al. used a less extensive strength manipulation, and their experiments did not use an encoding task (apart from "learn these words"); furthermore, some (but not all) of the Ratcliff experiments used short study durations (on the order of 50-100 ms), which may have led to floor effects on recollection (see Gardiner & Gregg, 1997, for evidence that recollection is very poor following shallow encoding and brief study presentations). Additional research is needed to determine which (if any) of these factors are responsible for the differences between our results and the results of Ratcliff et al. (1994). Finally, we should note that the CLS model clearly predicts an LSE for cued recall (when the task involves forming novel associations between stimuli), for the exact same reason that it predicts an LSE for recollection-based recognition discrimination -increasing list strength impairs retrieval of episodic information from the hippocampal system (see Norman & O'Reilly, 2001, for more details).

Footnote 15: The CLS model predicts that cued recall requires the hippocampus when the task involves learning novel associations between stimuli (and a relatively small number of study presentations). However, it also predicts that cued recall using pre-existing associations (i.e., associations formed prior to the start of the experiment) can be supported to some extent by other systems, e.g., medial temporal lobe neocortex, that are less susceptible to list strength effects. Thus, cued recall using pre-existing associations may not show an LSE. Consistent with this view, Bauml (1997) did not find an LSE for cued recall in an experiment that used well-learned category-exemplar pairs as stimuli (e.g., VEGETABLE-tomato).

However, some studies have failed to find a significant LSE for cued recall with unrelated word-pair stimuli (e.g., Ratcliff et al., 1990, Experiment 3); furthermore, in other studies that have found a significant LSE for cued recall, the size of this effect was quite small (e.g., Ratcliff et al., 1990, Experiment 6). We think that the small size of the LSE for cued recall in published studies may be attributable to ceiling effects on recollection of individual items (i.e., if memory traces are extremely distinctive, they will not suffer interference) and use of a less-than-maximally-powerful list strength manipulation. This hypothesis needs to be tested directly: One prediction is that use of an encoding task like the task used here (i.e., a task designed to minimize floor and ceiling effects on recollection) should bolster the LSE for cued recall, relative to a condition where participants are just told to "learn the words". Another prediction is that increasing the amount of strengthening that occurs at study (i.e., the number of interference-item repetitions) should bolster the LSE for cued recall.

Implications for extant mathematical models of the LSE

Up to this point, almost all theoretical work on the LSE for recognition has been conducted within the framework of single-process, global matching mathematical models (for a review of global matching models, see Clark & Gronlund, 1996); according to these models, recognition decisions are based (in their entirety) on a scalar signal that indexes how well the test probe matches each of the items stored in memory. Ratcliff et al.'s (1990) finding of a null LSE for recognition sensitivity was a watershed event in the development of mathematical models of recognition memory.
Most global matching models in the literature circa 1990 predicted that increasing list strength should impair recognition discrimination (see Shiffrin et al., 1990, for discussion of the issue). SAM (Gillund & Shiffrin, 1984) is typical of these models; in SAM, increasing list strength increases the mean global match signal triggered by both targets and lures, and the variance of the global match signal (intuitively, the consequences of test probe X spuriously matching memory trace Y are larger when memory trace Y is strong than when memory trace Y is weak). This increase in variance leads to decreased discriminability. Researchers have been working from 1990 to the present to modify global matching models so they predict the null LSE for recognition obtained by Ratcliff et al. (1990). Just as the Ratcliff et al. (1990) results pose problems for recognition models that predict that list strength effects will always be obtained, the results reported here pose problems for recognition models that predict list strength effects will never be obtained for item recognition sensitivity (e.g., Murdock & Kahana, 1993; Dennis & Humphreys, 2001). Murdock and Kahana (1993) argue that items from the current study list make a negligibly small contribution to the variance of the global match signal, relative to the contribution of all of the other items that have been studied (over the person's lifetime); thus, strengthening items from the current list will not boost variance enough to hurt recognition. Dennis and Humphreys (2001) argue that, in recognition tests that use single words as stimuli, the primary source of noise when a word is presented at test is exposure to that word outside of the experimental context ("context noise"). According to this model, other words from the study list do not affect the memory signal triggered by a word at test (i.e., there is no "item noise"); as such, strengthening some list items should have no effect on memory for other, non-strengthened list items. These two models, in their present form, cannot accommodate the significant list strength effects for recognition sensitivity reported here, except by arguing that list strength is confounded with some other factor. For example, Dennis and Humphreys (2001) argue that -in retroactive interference designs -list length and strength effects could arise spuriously if participants mentally focus on the latter part of the Strong Interference list (which does not contain target items) when making recognition judgments at test. Another approach to modeling the null LSE for recognition sensitivity is to posit that differentiation occurs as a consequence of strengthening (Shiffrin et al., 1990); the gist of differentiation is that -as participants acquire experience with an item -the item's representation becomes increasingly refined, and increasingly distinct from the representations of other items.
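A toy numerical sketch may help convey both the variance argument and the differentiation idea taken up next; this is our own simplified illustration with arbitrary parameter values, not an implementation of SAM, REM, or any other published model.

```python
import numpy as np

# Toy sketch: without differentiation, strengthening the interference traces inflates
# the variability of spurious matches and lowers d' for non-strengthened targets;
# with differentiation, spurious matches stay small and d' is unaffected.
rng = np.random.default_rng(0)

def match_signal(probe_is_target, interference_strength, differentiation,
                 n_interference=50, n_trials=20000):
    """Global match = own-trace match (targets only) + spurious matches to the
    interference traces (summed over traces)."""
    own_match = 1.0 if probe_is_target else 0.0
    spurious_sd = 0.1 if differentiation else 0.1 * interference_strength
    spurious = rng.normal(0.0, spurious_sd, size=(n_trials, n_interference)).sum(axis=1)
    return own_match + spurious

def d_prime(interference_strength, differentiation):
    targets = match_signal(True, interference_strength, differentiation)
    lures = match_signal(False, interference_strength, differentiation)
    pooled_sd = np.sqrt((targets.var() + lures.var()) / 2.0)
    return (targets.mean() - lures.mean()) / pooled_sd

for differentiation in (False, True):
    label = "with differentiation" if differentiation else "no differentiation"
    d_weak, d_strong = d_prime(1, differentiation), d_prime(6, differentiation)
    print(f"{label}: d' (weak) = {d_weak:.2f}, d' (strong) = {d_strong:.2f}")
```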
In models where differentiation occurs, strengthening a memory trace decreases the odds that it will (spuriously) match a lure at test; therefore, increasing list strength may actually reduce variability, by reducing the number of spurious matches to interference items (both the REM model presented by Shiffrin & Steyvers, 1997, and the model presented by McClelland & Chappell, 1998, have this property). Recognition models like REM can accommodate both the null LSE for recognition sensitivity reported by Ratcliff et al., and the significant LSE reported in this study, depending on parameter settings. However, it remains to be seen whether REM can accommodate the specific pattern of results reported here, e.g., the interaction depicted in Figure 8, whereby using a high confidence criterion for saying “old” boosts the size of the LSE for S vs. U discrimination (indexed using A'), but using a high confidence criterion does not boost the size of the LSE for S vs. SP discrimination. Conclusions The experiments reported here show that list strength effects sometimes are obtained for recognition sensitivity. Furthermore, the dual-process Complementary Learning Systems neural network model appears to be a useful guide as to when list strength effects will be obtained. In three separate experiments, we found a significant LSE for sensitivity measures that (according to the CLS model) should load heavily on recollection, and the LSE was not significant -and often numerically negative -for sensitivity measures that (according to the CLS model) should load more heavily on familiarity. Future research will explore the boundary conditions of the list strength effects reported here (i.e., why were they obtained here but not in other studies such as Ratcliff et al., 1994); also, although the results reported here are certainly consistent with the CLS model's predictions, more work is needed to assess whether the CLS model provides a better account of these results than other extant models (e.g., REM; Shiffrin & Steyvers, 1997). List Strength Effects 32 ReferencesAaronson, D., & Watts, B. (1987). Extensions of Grier's computational formulasfor A' and B'' to below-chance performance. Psychological Bulletin, 102, 439-442. Bauml, K. (1997). The list-strength effect: Strength-dependent competition orsuppression? Psychonomic Bulletin & Review, 4(2), 260-264.Brockdorff, N., & Lamberts, K. (2000) A feature-sampling account of the timecourse of old-new recognition judgments. Journal of Experimental Psychology:Learning, Memory, & Cognition, 26(1), 77-102Clark, S. E., & Gronlund, S. D. (1996). Global matching models of recognitionmemory: How the models match the data. Psychonomic Bulletin & Review, 3(1), 37-60.Cohen, J. D., MacWhinney, B., Flatt, M., & Provost, J. (1993). PsyScope: A newgraphic interactive environment for designing Psychology experiments. BehavioralResearch Methods, Instruments, and Computers, 25(2), 257-271.Coltheart, M. (1981). The MRC psycholinguistic database. Quarterly Journal of Experimental Psychology, 33A, 497-505.Curran, T. (2000). Brain potentials of recollection and familiarity. Memory &Cognition, 28(6), 923-938.Curran, T., & Hintzman, D. L. (1995). Violations of the independenceassumption in process dissociation. Journal of Experimental Psychology: Learning,Memory, & Cognition, 21(3), 531-547.Dennis, S., & Humphreys, M. S. (2001). A context noise model of episodic wordrecognition. Psychological Review, 108(2), 452-478.Donaldson, W. (1992). Measuring recognition memory. 
Journal of ExperimentalPsychology: General, 121, 275-277.Donaldson, W. (1993). Accuracy of d' and A' as estimates of sensitivity. Bulletinof the Psychonomic Society, 31, 271-274.Donaldson, W. (1996). The role of decision processes in remembering andknowing. Memory & Cognition, 24(4), 523-533.Donaldson, W., MacKenzie, T. M., & Underhill, C. F. (1996). A comparison ofrecollective memory and source monitoring. Psychonomic Bulletin & Review, 3(4), 486-490.Dorfman, D. D., & Alf, E., Jr. (1969). Maximum-likelihood estimation ofparameters of signal-detection theory and determination of confidence intervals -rating-method data. Journal of Mathematical Psychology, 6(3), 487-496.Dorfman, D. D., Beavers, L. L., & Saslow, C. (1973). Estimation of signaldetection theory parameters from rating-method data: A comparison of the method ofscoring and direct search. Bulletin of the Psychonomic Society, 1(3), 207-208.Eldridge, L. L., Knowlton, B. J., Furmanski, C. S., Bookheimer, S. Y., & Engel,S. A. (2000). Remembering episodes: A selective role for the hippocampus duringretrieval. Nature Neuroscience, 3(11), 1149-52.Gardiner, J. M. (1988). Functional aspects of recollective experience. Memory &Cognition, 16(4), 309-313.Gardiner, J. M., & Gregg, V. H. (1997). Recognition memory with little or noremembering: Implications for a detection model. Psychonomic Bulletin & Review, 4(4),474-479. List Strength Effects 33 Gardiner, J. M., & Java, R. I. (1990). Recollective experience in word andnonword recognition. Memory & Cognition, 18(1), 23-30.Gillund, G., & Shiffrin, R. M. (1984). A retrieval model for both recognition andrecall. Psychological Review, 91, 1-67. Gronlund, S. D., & Ratcliff, R. (1989). Time course of item and associativeinformation: Implications for global memory models. Journal of ExperimentalPsychology: Learning, Memory, & Cognition, 15, 846-858.Hanley, J. R., Davies, A. D., Downes, J. J., Roberts, J. N., Gong, Q. Y., & Mayes,A. R. (2001). Remembering and knowing in a patient with preserved recognition andimpaired recall. Neuropsychologia, 39(9), 1003-1010.Harvey, L. O., Jr. (2001). Parameter Estimation of Signal Detection Models:RSCORE+ User's Manual [Computer software manual]. Boulder, CO: Author.Hintzman, D. L., & Curran, T. (1994). Retrieval dynamics of recognition andfrequency judgments: Evidence for separate processes of familiarity and recall. Journal of Memory & Language, 33(1), 1-18.Hintzman, D. L., Curran, T., & Oppy, B. (1992). Effects of similarity andrepetition on memory: Registration without learning? Journal of ExperimentalPsychology: Learning, Memory, & Cognition, 18(4), 667-680.Hintzman, D. L., & Caulton, D. A. (1997). Recognition memory and modalityjudgments: A comparison of retrieval dynamics. Journal of Memory & Language, 37, 1-23.Hintzman, D. L., Caulton, D. A., & Levitin, D. J. (1998). Retrieval dynamics inrecognition and list discrimination: Further evidence of separate processes of familiarityand recall. Memory & Cognition, 26, 449-462.Hirshman, E. (1995). Decision processes in recognition memory: Criterion shiftsand the list-strength paradigm. Journal of Experimental Psychology: Learning, Memory,& Cognition, 21(2), 302-313.Hirshman, E., & Henzler, A. (1998). The role of decision processes in consciousrecollection. Psychological Science, 9(1), 61-65.Hirshman, E., & Master, S. (1997). Modeling the conscious correlates ofrecognition memory: Reflections on the remember-know paradigm. Memory &Cognition, 25(3), 345-351.Holdstock, J. S., Mayes, A. 
R., Roberts, N., Cezayirli, E., Isaac, C. L., O'Reilly,R. C., & Norman, K. A. (in press). Under what conditions is recognition relatively sparedfollowing selective hippocampal lesions? Hippocampus.Jacoby, L. L., Yonelinas, A. P., & Jennings, J. M. (1997). The relation betweenconscious and unconscious (automatic) influences: A declaration of independence. In J.D. Cohen & J. W. Schooler (Eds.), Scientific approaches to consciousness (pp. 13-47).Mahwah, NJ: Erlbaum.Kahana, M. J., & Rizzuto, D. (submitted). An analysis of the recognition-recallrelation in four distributed memory models.Landauer, T. K., Foltz, P. W., & Laham, D. (1998) An introduction to LatentSemantic Analysis. Discourse Processes, 25, 259-284.Macmillan, N. A., & Creelman, C. D. (1991). Detection theory: A user's guide.New York: Cambridge University Press. List Strength Effects 34 Mandler, G. (1980). Recognizing: The judgment of previous occurrence.Psychological Review, 8, 252-271.McClelland, J. L., & Chappell, M. (1998). Familiarity breeds differentiation: Asubjective-likelihood approach to the effects of experience in recognition memory.Psychological Review, 105(4), 724-760.Murdock, B. B., & Kahana, M. J. (1993). Analysis of the list-strength effect.Journal of Experimental Psychology: Learning, Memory, & Cognition, 19(3), 689-697.Murnane, K., & Shiffrin, R. M. (1991a). Interference and the representation ofevents in memory. Journal of Experimental Psychology: Learning, Memory, &Cognition, 17(5), 855-874.Murnane, K., & Shiffrin, R. M. (1991b). Word repetitions in sentence recognition.Memory & Cognition, 19(2), 119-130.Norman, K. A., & O'Reilly, R. C. (2001). Modeling hippocampal and neocorticalcontributions to recognition memory: A complementary learning systems approach.(ICS Technical Report 01-02). Boulder, CO: University of Colorado, Institute ofCognitive Science.Rajaram, S. (1993). Remembering and knowing: Two means of access to thepersonal past. Memory & Cognition, 21(1), 89-102.Ratcliff, R., Clark, S. E., & Shiffrin, R. M. (1990). List-strength effect: I. Dataand discussion. Journal of Experimental Psychology: Learning, Memory, & Cognition,16(2), 163-178.Ratcliff, R., McKoon, G., & Tindall, M. (1994). Empirical generality of data fromrecognition memory receiver-operating characteristic functions and implications for theglobal memory models. Journal of Experimental Psychology: Learning, Memory, &Cognition, 20(4), 763-785.Ratcliff, R., Sheu, C.-F., & Gronlund, S. D. (1992). Testing global memorymodels using ROC curves. Psychological Review, 99(3), 518-535.Rotello, C., Macmillan, N. A., & Reeder, J. A. (submitted). A two-dimensionalsignal detection model of remember-know judgments.Rotello, C., Macmillan, N. A., & Van Tassel, G. (2000). Recall-to-reject inrecognition: Evidence from ROC curves. Journal of Memory & Language, 43(1), 67-88.Rotello, C. M., & Heit, E. (1999). Two-process models of recognition memory:Evidence for recall-to-reject? Journal of Memory & Language, 40, 432-453.Shiffrin, R. M., Huber, D. E., & Marinelli, K. (1995). Effects of category lengthand strength on familiarity in recognition. Journal of Experimental Psychology:Learning, Memory, and Cognition, 21(2), 267-287.Shiffrin, R. M., Ratcliff, R., & Clark, S. E. (1990). List-strength effect: II.Theoretical mechanisms. Journal of Experimental Psychology: Learning, Memory, &Cognition, 16(2), 179-195.Shiffrin, R. M., & Steyvers, M. (1997). A model for recognition memory: REM -retrieving effectively from memory. 
Psychonomic Bulletin & Review, 4(2), 145-166.Snodgrass, J. G., & Corwin, J. (1988). Pragmatics of measuring recognitionmemory: Applications to dementia and amnesia. Journal of Experimental Psychology:General, 117, 34-50.Tulving, E. (1985). Memory and consciousness. Canadian Psychology, 26(1), 1-12 List Strength Effects 35 Tulving, E., & Hastie, R. (1972). Inhibition effects of intralist repetition in freerecall. Journal of Experimental Psychology, 92, 297-304.Yonelinas, A. P. (1994). Receiver-operating characteristics in recognitionmemory: Evidence for a dual-process model. Journal of Experimental Psychology: Learning, Memory, & Cognition, 20(6), 1341-1354.Yonelinas, A. P. (2001). Consciousness, control, and confidence: The three Csof recognition memory. Journal of Experimental Psychology: General, 130(3), 361-379.Yonelinas, A. P., Dobbins, I., Szymanski, M. D., Dhaliwal, H. S., & King, L.(1996). Signal-detection, threshold, and dual-process models of recognition memory;ROCs and conscious recollection. Consciousness and Cognition, 5, 418-441.Yonelinas, A. P., Hockley, W. E., & Murdock, B. B. (1992). Tests of the list-strength effect in recognition memory. Journal of Experimental Psychology: Learning,Memory, & Cognition, 18(2), 345-355.Yonelinas, A. P., Kroll, N. E. A., Dobbins, I., Lazzara, M., & Knight, R. T.(1998). Recollection and familiarity deficits in amnesia: Convergence of rememberknow, process dissociation, and receiver operating characteristic data. Neuropsychology,12(3), 323-339.Yonelinas, A. P., & Quamme, J. R. (submitted). Linear and curvilinear receiveroperating characteristics in relational recognition tests. List Strength Effects 36 Author NoteThis research was supported by NIH NRSA grant MH12582-01, a grant from theSackler Scholar Programme in Psychobiology at Harvard University, and by NationalInstitute on Aging grant AG0-8441. Experiment 1 was conducted in partial fulfillment ofthe Ph.D. requirement, Department of Psychology, Harvard University. I thank TimCurran and David Huber for commenting on this manuscript. I also thank My Nguyenand Mani Nadjmi for running participants in Experiments 2 and 3.Correspondence concerning this article should be sent to Dr. Kenneth A. Norman,Department of Psychology, University of Colorado at Boulder, Campus Box 345,Boulder, CO, 80309. Electronic mail may be sent to [email protected].