Conserving Fuel in Statistical Language Learning: Predicting Data Requirements
Author
Abstract
The paradigm for nlp known as statistical language learning (sll) has flourished in recent times, being seen as a quick and easy way to get off the ground. Research systems have been launched at many nlp problems including sense disambiguation (Yarowsky, 1992), anaphora resolution (Dagan and Itai, 1990), prepositional phrase attachment (Hindle and Rooth, 1993) and lexical acquisition (Brent, 1993). This has all been fueled by the large text corpora which are increasingly available (Marcus et al., 1993). Since these systems learn to navigate language by consuming text, they are critically dependent on the data that drives them. In this paper I address the practical concern of predicting how much training data is sufficient for a given system. First, I briefly review earlier results and show how these can be combined to bound the expected accuracy of a mode-based learner as a function of the volume of training data. I then develop a more accurate estimate of the expected accuracy function under the assumption that inputs are uniformly distributed. Since this estimate is expensive to compute, I also give a close but cheaply computable approximation to it. Finally, I report on a series of simulations exploring the effects of inputs that are not uniformly distributed.

1 Background

1.1 Do We Need To Know?

Even though text is becoming increasingly available, it is often expensive, especially if it must be annotated. Consider the decisions facing the sll technology consumer, that is, the architect of a planned commercial nlp system. For each module which is to employ sll, an appropriate technique must be selected. If different techniques require different amounts of data to achieve a given accuracy, the architect would like to know what these requirements are in advance in order to make an informed choice. Further, once the technique is chosen, she must decide how much data to collect or purchase for training.
Because this data can be expensive, foreknowledge of data requirements is highly valuable. Thus, in order to make statistical nlp technology practical, a predictive theory of data requirements is needed. Despite this need, very little attention has been paid to the problem.[1]

∗ This paper has been accepted for publication at the Eighth Australian Joint Conference on Artificial Intelligence, Canberra, 1995.
[1] See de Haan (1992) for an investigation of sample sizes for linguistic studies.

1.2 Foundations For A Theory

All the sll systems mentioned above employ knowledge gained from a corpus to make decisions. Abstractly, this knowledge can be represented as a mapping from observable features (inputs) to decision outcomes (outputs). Following Lauer (1995) I will call each distinguished input a bin and each possible output a value. There is a probability distribution across the bins representing how instances fall into bins. Also, for each bin, there is a probability distribution across the set of values representing how instances in that bin take on values. For the system to perform accurately, most (but not necessarily all) of the instances falling in a particular bin must have the same value.

In what follows I will make several assumptions: training and test data are drawn from the same distributions; the set of possible values is binary (examples include Hindle and Rooth, 1993 and Lauer, 1994); and the probability of the most likely value in each bin is constant.[2] Finally, I will only consider a simple learning algorithm: collect the training instances falling into each bin and then select the most frequent value for each. This mode-based learner is employed directly in the unigram tagger of Charniak (1993, p. 49) and is at the heart of many systems.

1.3 Optimal Accuracy

There are two sources of error in statistical language learners of the kind we are considering.
First, since the values are not necessarily fully determined by the bins, no matter what value the learner assigns to a bin there will always be errors (the optimal error rate). Second, since training data is limited, the learner may not have sufficient data available to acquire accurate rules. The combination of these sources of error results in some degree of inaccuracy for the system. We are interested in estimating the accuracy for various volumes of training data. Since the optimal error rate is independent of the amount of training data, it will always exist no matter how much data is used. As the amount of training data increases we expect the accuracy to get closer to this optimal.

Let B be the set of bins, V the set of values, Pr(b) the probability that an instance falls into the bin b and Pr(v | b) the probability of the value v given the bin b. If we denote the most likely value in each bin as v_b = argmax_{v∈V} Pr(v | b), then the expected value of the optimal accuracy is determined by the likelihood of this value occurring in each bin.

    OA = ∑_{b∈B} Pr(b) Pr(v_b | b)    (1)

If we know the probability that an algorithm will learn the value v for the bin b (denote this Pr(learn(b) = v)), then we can also calculate the expected accuracy rate:

    EA = ∑_{b∈B} Pr(b) ∑_{v∈V} Pr(learn(b) = v) Pr(v | b)    (2)
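The mode-based learner and the two accuracy measures can be made concrete in code. The following Python sketch is illustrative only and is not from the paper: it assumes binary values, and for the expected-accuracy estimate it makes the simplifying assumption that exactly m training instances fall into each bin, so that Pr(learn(b) = v_b) can be computed from the binomial distribution, with ties broken at random. All names (train_mode_learner, optimal_accuracy, expected_accuracy) are invented for this example.

```python
import math
from collections import Counter, defaultdict

def train_mode_learner(pairs):
    """Mode-based learner: map each bin to its most frequent value
    in the training data (pairs of (bin, value)); ties broken arbitrarily."""
    counts = defaultdict(Counter)
    for b, v in pairs:
        counts[b][v] += 1
    return {b: c.most_common(1)[0][0] for b, c in counts.items()}

def optimal_accuracy(p_bin, p_val_given_bin):
    """Equation (1): OA = sum_b Pr(b) * Pr(v_b | b),
    where v_b is the most likely value in bin b."""
    return sum(p_bin[b] * max(p_val_given_bin[b].values()) for b in p_bin)

def expected_accuracy(p_bin, p_val_given_bin, m):
    """Equation (2) for the mode-based learner with binary values,
    under the simplifying assumption of exactly m instances per bin.
    Pr(learn(b) = v_b) is the probability that v_b is the sample mode
    of m binomial draws; an exact tie is broken at random."""
    total = 0.0
    for b, pb in p_bin.items():
        p = max(p_val_given_bin[b].values())      # Pr(v_b | b)
        p_learn = 0.0                             # Pr(learn(b) = v_b)
        for k in range(m + 1):
            binom = math.comb(m, k) * p**k * (1 - p)**(m - k)
            if 2 * k > m:                         # v_b is the majority
                p_learn += binom
            elif 2 * k == m:                      # tie: random choice
                p_learn += 0.5 * binom
        # Pr(b) * [Pr(learn = v_b) Pr(v_b | b) + Pr(learn != v_b) Pr(other | b)]
        total += pb * (p_learn * p + (1 - p_learn) * (1 - p))
    return total
```

For instance, with two equiprobable bins whose most likely values have probabilities 0.8 and 0.7, OA is 0.75; with m = 1 the expected accuracy is only 0.63, and it climbs towards the optimal 0.75 as m grows. This accuracy-versus-data curve is exactly the kind of relationship the paper sets out to predict.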
Similar Articles
Language Learning Strategy Use and Prediction of Foreign Language Proficiency Among Iranian EFL Learners
The purpose of this study was twofold. Firstly, it attempted to investigate whether language learning strategies (LLSs) can predict foreign language (FL) proficiency. Secondly, it examined what kind of LLSs Iranian learners of English use more frequently in FL institutes. To do so, 112 intermediate Iranian EFL learners participated in the study. Oxford’s Strategy Inventory for Language Learning...
The Effect of Applying Color and Light Training Materials on the Female First Grade Students’ Learning Outcome of Persian Language Lessons in Sharoud
Color images raise the educational level by providing more detailed information; therefore, these images are believed to be effective in gaining a deeper understanding of the lessons. Moreover, proper lighting enhances students’ learning and performance. This study aimed at assessing the impact of training color and light materials on elementary school girls’ attention and learning in Pe...
Quantifying Investment in Language Learning: Model and Questionnaire Development and Validation in the Iranian Context
The present exploratory study aimed to provide a more tangible and comprehensive picture of the construct of investment in language learning through investigating the issue from a quantitative perspective. To this end, the present researchers followed three main phases. First, a hypothesized model of investment in language learning with six components was developed for the Iranian English as a ...
Measuring Attitudinal Disposition of Undergraduate Students to English Language Learning: The Nigerian University Experience
The purpose of this study was to investigate the undergraduate students’ attitudinal disposition towards English language learning owing to their scholastic disposition to English language in the course of their studying in a Nigerian university. The study adopted descriptive survey research design. The sample consisted of an intact class of 332 Part 3 undergraduate students who registere...
Journal: CoRR
Volume: abs/cmp-lg/9509002
Issue: -
Pages: -
Publication date: 1995