Technical Report for Collaborative Research Center SFB 876 Providing Information by Resource-Constrained Data Analysis, Subproject A1 Data Mining for Ubiquitous System Software: A New Look at Regularization for Probabilistic Graphical Models

Author

  • Katharina Morik
Abstract

Probabilistic graphical models can simulate and classify high-dimensional, heterogeneous data and serve as the underlying formalism for data analysis in biology, physics, computer vision, natural language processing, and other fields. Their parameter dimension is a function of the treewidth of the data's conditional independence graph and of the data domain. Even if most dependencies are ignored and only a pairwise approximation is considered, the dimension remains large, e.g., it exceeds millions. Hence, the sample complexity is large, models tend to overfit the data, and the learned models are hard to communicate. Classic $\ell_1$-regularization approaches cannot be applied without changing the underlying graphical structure, which is not an option if certain dependencies should be handled explicitly by the model. A recent advance in this area has shown that a combination of regularization and reparametrization can lead to sparse models for spatio-temporal data. This is achieved by explicitly encoding knowledge about the spatio-temporal nature of the data into the model parameters. We discuss related approaches and how these techniques can be generalized to other data domains. Furthermore, we present a first promising experimental result.

The exponential family of densities arises naturally as the maximum entropy distribution that reproduces the data's empirical marginals. We consider a discrete (multivariate) ran[…] d maps realizations $x$ to a $d$-dimensional feature vector. The map $\phi_G$ is fully determined by a graph $G = (V, E)$ that encodes conditional independence between the components of $X$ (see [8]).

Footnote 1: Restricting ourselves to discrete models is just for notational convenience. The ideas presented here apply to models with continuous variables as well.
Footnote 2: The ×-symbol denotes the Cartesian product.
In the most general case, the dimension $d$ is a multivariate polynomial in the variables' state space sizes, i.e., $d = \sum_{C \in \mathcal{C}^*(G)} \prod_{v \in C} |\mathcal{X}_v|$, where $\mathcal{C}^*(G)$ is the set of maximal cliques (fully connected subgraphs). In practice, often only the vertices and edges of $G$ are modeled. Nevertheless, large state spaces or large graphs will still lead to millions of parameters. Classic $\ell_1$-regularization [4] cannot be applied directly, since the estimation of the graph and the estimation of the parameters are coupled: regularization-based graph estimation approaches (e.g., [3]) perform maximum a posteriori parameter estimation, where a Laplace prior pushes small parameter values towards 0. The remaining non-zero weights represent the estimated graph $G$. Obviously, a subsequent estimation of model parameters with $\ell_1$-regularization can …
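The parameter count described above can be illustrated with a minimal sketch. It assumes the standard reading of the formula (a sum over cliques of the product of the state space sizes of their vertices); the function names, the 4-variable chain, and the 100 × 100 grid with 10 states per variable are hypothetical examples, not taken from the report.

```python
from math import prod


def parameter_dimension(cliques, state_sizes):
    """Sum over cliques C of the product over v in C of |X_v|."""
    return sum(prod(state_sizes[v] for v in clique) for clique in cliques)


def grid_vertices_and_edges(rows, cols):
    """Vertices (as 1-tuples) and edges (as 2-tuples) of a rows x cols grid graph."""
    vertices = [((r, c),) for r in range(rows) for c in range(cols)]
    edges = []
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:
                edges.append(((r, c), (r, c + 1)))  # horizontal neighbour
            if r + 1 < rows:
                edges.append(((r, c), (r + 1, c)))  # vertical neighbour
    return vertices, edges


# Hypothetical example 1: a chain of 4 binary variables.
chain_sizes = {0: 2, 1: 2, 2: 2, 3: 2}
chain_edges = [(0, 1), (1, 2), (2, 3)]  # the maximal cliques of a chain are its edges
d_chain_maximal = parameter_dimension(chain_edges, chain_sizes)

# Modelling vertices and edges, as is common in practice.
chain_vertices = [(v,) for v in chain_sizes]
d_chain_pairwise = parameter_dimension(chain_vertices + chain_edges, chain_sizes)

# Hypothetical example 2: a 100 x 100 grid with 10 states per variable.
grid_vertices, grid_edges = grid_vertices_and_edges(100, 100)
grid_sizes = {v[0]: 10 for v in grid_vertices}
d_grid_pairwise = parameter_dimension(grid_vertices + grid_edges, grid_sizes)

print(d_chain_maximal, d_chain_pairwise, d_grid_pairwise)
```

Under these assumptions, even the pairwise grid model has more than two million parameters, which matches the order of magnitude mentioned in the text.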

Related resources

Technische Universität Dortmund Subproject A1 Data Mining for Ubiquitous System Software: Information Extraction in RapidMiner

This paper describes the Information Extraction Plugin [3], which allows the use of Information Extraction mechanisms in RapidMiner.

A data mining approach to employee turnover prediction (case study: Arak automotive parts manufacturing)

Training and adaptation of employees consume time and money. Employee turnover can be predicted from organizational and personal historical data in order to reduce probable losses for organizations. Prediction methods are closely tied to human resource management and extract patterns from historical data. This article implements knowledge discovery steps on real data of a manufacturing pla...

Providing a model for predicting blood pressure fluctuations after induction of general anesthesia with data mining: a brief report

Background: Fluctuations in blood pressure after induction of general anesthesia play a significant role in complications of surgery. Therefore, the present study aimed to identify the causes of blood pressure fluctuations after induction of anesthesia and to predict and prevent them. Methods: For this study, which is a retrospective cohort, data mining methods in the data set...

Data Envelopment Analysis with LINGO Modeling for Technical Educational Group of an Organization

Data Envelopment Analysis (DEA) was developed to help compare the relative performance of decision-making units. It is a non-parametric method for performing frontier analysis. It uses linear programming to estimate the efficiency of multiple decision-making units and it is commonly used in production, management and economics [3]. DEA generates an efficiency score between 0 and 1 for each unit...

Publication date: 2015