Linear regression for numeric symbolic variables: an ordinary least squares approach based on Wasserstein Distance
Authors
Abstract
In this paper we present a linear regression model for modal symbolic data. The observed variables are histogram variables, as defined in Bock and Diday [1], and the parameters of the model are estimated using the classic Least Squares method. An appropriate metric is needed to measure the error between the observed and the predicted distributions; the Wasserstein distance is proposed for this purpose. Some properties of this metric are exploited to predict the response variable as a direct linear combination of the independent histogram variables. Measures of goodness of fit are discussed. An application to real data corroborates the proposed method.
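To make the error metric concrete, the following is a minimal sketch of the squared L2 Wasserstein distance between two one-dimensional histograms, computed as the integral over t in [0, 1] of the squared difference of their quantile functions. It is an illustration under stated assumptions, not the estimation procedure of the paper: each histogram is assumed to be given as bin edges plus strictly positive bin weights, values are assumed uniform within each bin, and the function names `quantile_function` and `wasserstein2_squared` are hypothetical.

```python
import numpy as np

def quantile_function(edges, weights, t):
    """Evaluate the piecewise-linear quantile function of a histogram at levels t.

    Assumes values are uniformly spread inside each bin and all weights are > 0.
    """
    edges = np.asarray(edges, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Cumulative distribution at the bin edges, starting at 0 and ending at 1.
    cdf = np.concatenate(([0.0], np.cumsum(weights / weights.sum())))
    # Inverting the piecewise-linear CDF is a linear interpolation of edges over cdf.
    return np.interp(t, cdf, edges)

def wasserstein2_squared(hist_a, hist_b, n_grid=1000):
    """Approximate the squared L2 Wasserstein distance between two histograms.

    Each histogram is an (edges, weights) pair. The integral over [0, 1] is
    approximated with a midpoint rule on n_grid points.
    """
    t = (np.arange(n_grid) + 0.5) / n_grid
    qa = quantile_function(hist_a[0], hist_a[1], t)
    qb = quantile_function(hist_b[0], hist_b[1], t)
    return float(np.mean((qa - qb) ** 2))

# Illustrative data: a uniform histogram on [0, 1] and a copy shifted by 1.
h1 = ([0.0, 1.0], [1.0])
h2 = ([1.0, 2.0], [1.0])
d2 = wasserstein2_squared(h1, h2)
```

In a least squares setting of the kind the abstract describes, a quantity like this would play the role of the squared residual between an observed and a predicted distribution; how the predicted histogram is built from the regressors is specified in the paper and is not reproduced here.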
Similar resources
Multiple Linear Regression for Histogram Data using Least Squares of Quantile Functions: a Two-components model
Abstract. Histograms are commonly used to represent summaries of observed data, and they can be considered nonparametric estimates of probability distributions. Symbolic Data Analysis formalized the concept of a histogram symbolic variable, that is, a variable that describes statistical units by histograms instead of single values. In this paper we present a linear regression model for mu...
A robust least squares fuzzy regression model based on kernel function
In this paper, a new approach is presented to fit a robust fuzzy regression model based on some fuzzy quantities. In this approach, we first introduce a new distance between two fuzzy numbers using the kernel function, and then, based on the least squares method, the parameters of the fuzzy regression model are estimated. The proposed approach has a suitable performance to...
Fuzzy Hybrid least-Squares Regression Approach to Estimating the amount of Extra Cellular Recombinant Protein A from Escherichia coli BL21
Introduction: Immune Protein A is a component with a vast spectrum of biochemical, biological and medical uses. The coding gene of this protein was extracted from Staphylococcus aureus and was cloned and expressed in Escherichia coli bacteria. Suitable statistical methods are utilized to optimize expression conditions for evaluating experiment accuracy, guarantee the accuracy of subsequent ...
Robust high-dimensional semiparametric regression using optimized differencing method applied to the vitamin B2 production data
Background and purpose: With advances in science, knowledge, and technology, we deal with high-dimensional data in which the number of predictors may considerably exceed the sample size. The main problems with high-dimensional data are the estimation of the coefficients and interpretation. For high-dimensional problems, classical methods are not reliable because of a large number of predictor variable...
Exact and approximate solutions of fuzzy LR linear systems: New algorithms using a least squares model and the ABS approach
We present a methodology for characterization and an approach for computing the solutions of fuzzy linear systems with LR fuzzy variables. As solutions, notions of exact and approximate solutions are considered. We transform the fuzzy linear system into a corresponding linear crisp system and a constrained least squares problem. If the corresponding crisp system is incompatible, then the fuzzy ...