Learning Local Error Bars for Nonlinear Regression

Authors

  • David A. Nix
  • Andreas S. Weigend
Abstract

We present a new method for obtaining local error bars for nonlinear regression, i.e., estimates of the confidence in predicted values that depend on the input. We approach this problem by applying a maximum-likelihood framework to an assumed distribution of errors. We demonstrate our method first on computer-generated data with locally varying, normally distributed target noise. We then apply it to laser data from the Santa Fe Time Series Competition, where the underlying system noise is known quantization error and the error bars give local estimates of model misspecification. In both cases, the method also provides a weighted-regression effect that improves generalization performance.

* http://www.cs.colorado.edu/~andreas/Home.html. This paper is available with figures in color as ftp://ftp.cs.colorado.edu/pub/Time-Series/MyPapers/nix.weigenCLnips7.ps.Z .

1 Learning Local Error Bars Using a Maximum Likelihood Framework: Motivation, Concept, and Mechanics

Feed-forward artificial neural networks used for nonlinear regression can be interpreted as predicting the mean of the target distribution as a function of (conditioned on) the input pattern (e.g., Buntine & Weigend, 1991; Bishop, 1994), typically using one linear output unit per output variable. If parameterized, this conditional target distribution (CTD) may also be viewed as an error model (Rumelhart et al., 1995). Here, we present a simple method that provides higher-order information about the CTD than simply the mean. Such additional information could come from attempting to estimate the entire CTD with connectionist methods (e.g., "Mixture Density Networks," Bishop, 1994; "fractional binning," Srivastava & Weigend, 1994) or with non-connectionist methods such as a Monte Carlo on a hidden Markov model (Fraser & Dimitriadis, 1994). While non-parametric estimates of the shape of a CTD require large quantities of data, our less data-hungry method (Weigend & Nix, 1994) assumes a specific parameterized form of the CTD (e.g., Gaussian) and gives us the value of the error bar (e.g., the width of the Gaussian) by finding those parameters which maximize the likelihood that the target data was generated by a particular network model. In this paper we derive the specific update rules for the Gaussian case. We would like to emphasize, however, that any parameterized unimodal distribution can be used for the CTD in the method presented here.¹

Figure 1: Architecture of the network for estimating error bars using an auxiliary output unit. All weight layers have full connectivity. This architecture allows the conditional-variance $\hat{\sigma}^2$-unit access both to information in the input pattern itself and to the hidden-unit representation formed while learning the conditional mean, $\hat{y}(x)$.

We model the observed target value $d$ as $d(x) = y(x) + n(x)$, where $y(x)$ is the underlying function we wish to approximate and $n(x)$ is noise drawn from the assumed CTD. Just as the conditional mean of this CTD, $y(x)$, is a function of the input, the variance $\sigma^2$ of the CTD, the noise level, may also vary as a function of the input $x$ (noise heterogeneity). Therefore, not only do we want the network to learn a function $\hat{y}(x)$ that estimates the conditional mean $y(x)$ of the CTD, but we also want it to learn a function $\hat{\sigma}^2(x)$ that estimates the conditional variance $\sigma^2(x)$.
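As a concrete illustration of this data model, the following minimal NumPy sketch generates targets $d(x) = y(x) + n(x)$ with input-dependent Gaussian noise. The mean function $\sin(3x)$ and the noise profile $0.05 + 0.2x^2$ are arbitrary illustrative choices, not the data used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_mean(x):
    # assumed underlying function y(x) (illustrative choice only)
    return np.sin(3.0 * x)

def true_std(x):
    # assumed input-dependent noise level sigma(x) (illustrative choice only)
    return 0.05 + 0.2 * x**2

x = rng.uniform(-1.0, 1.0, size=1000)            # input patterns
d = true_mean(x) + rng.normal(0.0, true_std(x))  # observed targets d(x) = y(x) + n(x)
```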
We simply add an auxiliary output unit, the $\hat{\sigma}^2$-unit, to compute our estimate of $\sigma^2(x)$. Since $\sigma^2(x)$ must be positive, we choose an exponential activation function to impose this bound naturally: $\hat{\sigma}^2(x) = \exp\left[\sum_k w_{\sigma^2 k}\, h_k(x) + \beta\right]$, where $\beta$ is the offset (or "bias") and $w_{\sigma^2 k}$ is the weight between hidden unit $k$ and the $\hat{\sigma}^2$-unit. The particular connectivity of our architecture (Figure 1), in which the $\hat{\sigma}^2$-unit has a hidden layer of its own that receives connections from both the $\hat{y}$-unit's hidden layer and the input pattern itself, allows great flexibility in learning $\hat{\sigma}^2(x)$. In contrast, if the $\hat{\sigma}^2$-unit has no hidden layer of its own, it is constrained to approximate $\sigma^2(x)$ using only the exponential of a linear combination of basis functions (hidden units) already tailored to represent $\hat{y}(x)$ (since learning the conditional variance $\hat{\sigma}^2(x)$ before learning the conditional mean $\hat{y}(x)$ is troublesome at best). Such limited connectivity can be too constraining on the functional forms for $\hat{\sigma}^2(x)$ and, in our experience, produce inferior results. This is a significant difference compared to Bishop's (1994) Gaussian mixture approach, in which all output units are directly connected to one set of hidden units. The other extreme would be not to share any hidden units at all, i.e., to employ two completely separate sets of hidden units, one for the $\hat{y}(x)$-unit and the other for the $\hat{\sigma}^2(x)$-unit. This is the right thing to do if there is indeed no overlap between the mapping from the inputs to $y$ and the mapping from the inputs to $\sigma^2$. The two examples discussed in this paper lie between these two extremes; this justifies the mixed architecture we use. Further discussion of shared vs. separate hidden units for the second example, the laser data, is given by Kazlas & Weigend (1995, this volume).

¹ The case of a single Gaussian representing a unimodal distribution can also be generalized to a mixture of several Gaussians, which allows the modeling of multimodal distributions (Bishop, 1994).

For one of our network outputs, the $\hat{y}$-unit, the target is easily available: it is simply given by $d$. But what is the target for the $\hat{\sigma}^2$-unit? By maximizing the likelihood of our network model $\mathcal{N}$ given the data, $P(\mathcal{N}\,|\,x, d)$, a target is "invented" as follows. Applying Bayes' rule and assuming statistical independence of the errors, we equivalently do gradient descent in the negative log likelihood of the targets $d$ given the inputs and the network model, summed over all patterns $i$ (see Rumelhart et al., 1995): $C = -\sum_i \ln P(d_i\,|\,x_i, \mathcal{N})$. Traditionally, the resulting form of this cost function involves only the estimate $\hat{y}(x_i)$ of the conditional mean; the variance of the CTD is assumed to be constant for all $x_i$, and the constant terms drop out after differentiation. In contrast, we allow the conditional variance to depend on $x$ and explicitly keep these terms in $C$, approximating the conditional variance for $x_i$ by $\hat{\sigma}^2(x_i)$. Given any network architecture and any parametric form for the CTD (i.e., any error model), the appropriate weight-update equations for gradient-descent learning can be straightforwardly derived. Assuming normally distributed errors around $y(x)$ corresponds to a CTD density function of $P(d_i\,|\,x_i) = \left[2\pi\sigma^2(x_i)\right]^{-1/2} \exp\left\{-\frac{[d_i - y(x_i)]^2}{2\sigma^2(x_i)}\right\}$.
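A minimal NumPy sketch of the forward pass for this architecture, under assumed layer sizes and weight names (not the authors' code): the $\hat{\sigma}^2$-branch has its own tanh hidden layer fed by both the input and the $\hat{y}$-branch's hidden units, and its output unit is exponential so the variance estimate stays positive.

```python
import numpy as np

def forward(x, p):
    """One input pattern x (1-D array); p holds the weight matrices and offsets."""
    # y-branch: tanh hidden units h_j(x), linear output = estimate of the conditional mean
    h = np.tanh(p["W_h"] @ x + p["b_h"])
    y_hat = p["w_y"] @ h + p["b_y"]
    # sigma^2-branch: its own tanh hidden layer h_k(x) fed by [x, h], exponential output unit
    g = np.tanh(p["W_g"] @ np.concatenate([x, h]) + p["b_g"])
    var_hat = np.exp(p["w_s"] @ g + p["beta"])   # sigma_hat^2(x) > 0 by construction
    return y_hat, var_hat, h, g

# Example parameter shapes for a 1-input network with 8 hidden units per branch (assumed sizes)
rng = np.random.default_rng(1)
n_in, n_h, n_g = 1, 8, 8
p = {"W_h": rng.normal(0, 0.5, (n_h, n_in)),       "b_h": np.zeros(n_h),
     "w_y": rng.normal(0, 0.5, n_h),                "b_y": 0.0,
     "W_g": rng.normal(0, 0.5, (n_g, n_in + n_h)),  "b_g": np.zeros(n_g),
     "w_s": rng.normal(0, 0.5, n_g),                "beta": 0.0}
y_hat, var_hat, h, g = forward(np.array([0.3]), p)
```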
Using the network output $\hat{y}(x_i) \approx y(x_i)$ to estimate the conditional mean and using the auxiliary output $\hat{\sigma}^2(x_i) \approx \sigma^2(x_i)$ to estimate the conditional variance, we obtain the monotonically related negative log likelihood, $-\ln P(d_i\,|\,x_i, \mathcal{N}) = \frac{1}{2}\ln 2\pi\hat{\sigma}^2(x_i) + \frac{[d_i - \hat{y}(x_i)]^2}{2\hat{\sigma}^2(x_i)}$. Summation over patterns gives the total cost:

$$ C = \frac{1}{2} \sum_i \left\{ \frac{[d_i - \hat{y}(x_i)]^2}{\hat{\sigma}^2(x_i)} + \ln \hat{\sigma}^2(x_i) + \ln 2\pi \right\} \qquad (1) $$

To write explicit weight-update equations, we must specify the network unit transfer functions. Here we choose a linear activation function for the $\hat{y}$-unit, tanh functions for the hidden units, and an exponential function for the $\hat{\sigma}^2$-unit. We can then take derivatives of the cost $C$ with respect to the network weights. To update the weights connected to the $\hat{y}$- and $\hat{\sigma}^2$-units we have

$$ \Delta w_{yj} = \eta \, \frac{1}{\hat{\sigma}^2(x_i)} \left[ d_i - \hat{y}(x_i) \right] h_j(x_i) \qquad (2) $$

$$ \Delta w_{\sigma^2 k} = \eta \, \frac{1}{2\hat{\sigma}^2(x_i)} \left\{ \left[ d_i - \hat{y}(x_i) \right]^2 - \hat{\sigma}^2(x_i) \right\} h_k(x_i) \qquad (3) $$

where $\eta$ is the learning rate. For weights not connected to the output units, the weight-update equations are derived using the chain rule in the same way as in standard backpropagation. Note that Eq. (3) is equivalent to training a separate function-approximation network for $\hat{\sigma}^2(x)$ where the targets are the squared errors $[d_i - \hat{y}(x_i)]^2$. Note also that if $\hat{\sigma}^2(x)$ is constant, Eqs. (1)-(2) reduce to their familiar forms for standard backpropagation with a sum-squared-error cost function. (A code sketch of this cost and these updates follows the description of the three training phases below.)

The $1/\hat{\sigma}^2(x)$ term in Eqs. (2)-(3) can be interpreted as a form of "weighted regression," increasing the effective learning rate in low-noise regions and reducing it in high-noise regions. As a result, the network emphasizes obtaining small errors on those patterns where it can (low $\hat{\sigma}^2$); it discounts learning patterns for which the expected error is going to be large anyway (large $\hat{\sigma}^2$). This weighted-regression term can itself be highly beneficial where outliers (i.e., samples from high-noise regions) would ordinarily pull network resources away from fitting low-noise regions that would otherwise be well approximated.

For simplicity, we use simple gradient-descent learning for training. Other nonlinear minimization techniques could be applied, however, but only if the following problem is avoided. If the weighted-regression term described above is allowed a significant influence early in learning, local minima frequently result. This is because input patterns for which low errors are initially obtained are interpreted as "low noise" in Eqs. (2)-(3) and overemphasized in learning. Conversely, patterns for which large errors are initially obtained (because significant learning of $\hat{y}$ has not yet taken place) are erroneously discounted as being in "high-noise" regions, and little subsequent learning takes place for these patterns, leading to highly suboptimal solutions. This problem can be avoided if we separate training into the following three phases:

Phase I (Initial estimate of the conditional mean): Randomly split the available data into equal halves, sets A and B. Assuming $\sigma^2(x)$ is constant, learn the estimate of the conditional mean $\hat{y}(x)$ using set A as the training set. This corresponds to "traditional" training using gradient descent on a simple squared-error cost function, i.e., Eqs. (1)-(2) without the $1/\hat{\sigma}^2(x)$ terms. To reduce overfitting, training is considered complete at the minimum of the squared error on the cross-validation set B, monitored at the end of each complete pass through the training data.
Phase II (Initial estimate of the conditional variance): Attach a layer of hidden units connected to both the inputs and the hidden units of the network from Phase I (see Figure 1). Freeze the weights trained in Phase I, and train the $\hat{\sigma}^2$-unit to predict the squared errors (see Eq. (3)), again using simple gradient descent as in Phase I. The training set for this phase is set B, with set A used for cross-validation. If set A were used as the training set in this phase as well, any overfitting in Phase I could result in seriously underestimating $\sigma^2(x)$. To avoid this risk, we interchange the data sets. The initial value for the offset $\beta$ of the $\hat{\sigma}^2$-unit is the natural logarithm of the mean squared error (from Phase I) on set B. Phase II stops when the squared error on set A levels off or starts to increase.

Phase III (Weighted regression): Re-split the available data into two new halves, A' and B'. Unfreeze all weights and train all network parameters to minimize the full cost function $C$ on set A'. Training is considered complete when $C$ has reached its minimum on set B'.
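The per-pattern cost of Eq. (1) and the output-weight updates of Eqs. (2)-(3) are compact enough to state directly in code. The following NumPy sketch assumes the forward pass shown earlier, with h and g the hidden activations feeding the $\hat{y}$- and $\hat{\sigma}^2$-units and eta the learning rate; it illustrates the update rules and is not the authors' implementation.

```python
import numpy as np

def per_pattern_cost(d, y_hat, var_hat):
    # Eq. (1) for one pattern: negative log likelihood of a Gaussian CTD
    return 0.5 * ((d - y_hat) ** 2 / var_hat + np.log(var_hat) + np.log(2 * np.pi))

def output_weight_updates(d, y_hat, var_hat, h, g, eta):
    err = d - y_hat
    dw_y = eta * err / var_hat * h                         # Eq. (2): weights into the y-unit
    dw_s = eta * (err ** 2 - var_hat) / (2 * var_hat) * g  # Eq. (3): weights into the sigma^2-unit
    return dw_y, dw_s
```

Holding var_hat constant recovers the ordinary squared-error delta rule for dw_y, which is the reduction to standard backpropagation noted after Eq. (3).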
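A skeleton of the three training phases might look as follows. The run_gradient_descent helper is hypothetical (it stands for gradient descent with early stopping on the given validation half, as described above); only the data splits and the role swap between Phases I and II follow the text directly.

```python
import numpy as np

def three_phase_training(x, d, run_gradient_descent, rng):
    # `run_gradient_descent` is a hypothetical helper: gradient descent on the named cost,
    # stopping when the cost on the validation half stops improving.

    # Phase I: fit y_hat on half A with a plain squared-error cost (sigma^2 held constant),
    # stopping at the minimum of the squared error on half B.
    idx = rng.permutation(len(x))
    A, B = idx[: len(x) // 2], idx[len(x) // 2:]
    params = run_gradient_descent(x[A], d[A], x[B], d[B],
                                  cost="squared_error", trainable="y_branch", params=None)

    # Phase II: freeze the Phase-I weights and train only the sigma^2-branch to predict
    # the squared errors, with the roles of A and B interchanged; beta starts at
    # log(mean squared error of the Phase-I model on B).
    params = run_gradient_descent(x[B], d[B], x[A], d[A],
                                  cost="squared_error_of_variance",
                                  trainable="sigma2_branch", params=params)

    # Phase III: re-split the data, unfreeze everything, and minimize the full cost C
    # of Eq. (1), stopping at its minimum on the new validation half.
    idx = rng.permutation(len(x))
    A2, B2 = idx[: len(x) // 2], idx[len(x) // 2:]
    return run_gradient_descent(x[A2], d[A2], x[B2], d[B2],
                                cost="full_nll", trainable="all", params=params)
```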
