MDL4BMF: Minimum Description Length for Boolean Matrix Factorization

Authors

  • PAULI MIETTINEN
  • JILLES VREEKEN
Abstract

Table I. Dataset dimensions, density (% of 1s), and model orders returned by each method. A dash marks a run that did not finish.

Dataset          Rows  Columns  Density (%)  T-Cost  PANDA  ASSO
Abstracts         859     3933          1.2       –    168    19
DBLP             6980       19         13.0      19     15     4
Dialect          1334      506         16.1     389     56    37
DNA Amp.         4590      392          1.5     365     39    54
Mammals          2183      124         20.5     122     50    13
Mushroom         8192      112         19.3     112    175    59
Newsgroups        400      800          3.5     398     17    17
NSF Abstracts   12841     4894          0.9       –   1835  2098
Paleo             501      139          5.1      46     96    19

Newsgroups is a subset of the well-known 20Newsgroups dataset [7], containing the usage of 800 words in 400 posts from 4 newsgroups [8]. NSF Abstracts contains the occurrence of terms in abstracts of successful NSF grant applications [9]. The pre-processing follows Miettinen [2009]. The resulting data is extremely sparse (0.9% of elements are 1s). Last, Paleo consists of fossil records per location [10].

We ran each method on these datasets, and give the returned model orders in Table I. Transfer cost is computationally rather expensive, in particular for the larger and sparser datasets, and did not finish within reasonable time for Abstracts and NSF.

When we investigate the model orders, we see an interesting reversal compared to the synthetic data. Here, PANDA does not strongly overestimate, but T-Cost does: many of the estimates obtained by T-Cost are full-rank, or close to it. While there is no ground truth for these datasets, estimates this high are beyond what can be inspected by hand. The estimates provided by PANDA, on the other hand, seem more realistic. For Abstracts, Mammals, and Mushroom, however, PANDA estimates the model order much higher than ASSO does. For ASSO, with the exception of NSF, the identified model orders are all small enough for a data analyst to inspect the factorisations by hand. We discuss the results of ASSO in closer detail below.

First, however, we investigate how sensitive the model order estimates on real data are to ASSO's parameter τ. In Figure 8, we plot two typical examples of total encoded lengths, L(A, H), for ASSO with TXD, using different values of k and τ.
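The total encoded length L(A, H) plotted in Figure 8 can be illustrated with a deliberately simplified two-part score. The sketch below is not the TXD encoding used with ASSO: it assumes a naive code that sends each binary matrix as its number of 1s followed by an index into all placements of that many 1s. The names `boolean_product`, `naive_bits`, `total_length`, and `best_model` are hypothetical, and the candidate factorizations are written by hand rather than produced by ASSO.

```python
from math import comb, log2

def boolean_product(B, C):
    """Boolean matrix product: entry (i, j) is OR over l of B[i][l] AND C[l][j]."""
    k, m = len(C), len(C[0])
    return [[int(any(row[l] and C[l][j] for l in range(k))) for j in range(m)]
            for row in B]

def naive_bits(M):
    """Assumed naive code: send the count of 1s, then which cells hold them."""
    cells = sum(len(r) for r in M)
    ones = sum(map(sum, M))
    return log2(cells + 1) + log2(comb(cells, ones))

def total_length(A, B, C):
    """Two-part score: bits for the factors plus bits for the XOR error matrix."""
    P = boolean_product(B, C)
    E = [[a ^ p for a, p in zip(ra, rp)] for ra, rp in zip(A, P)]
    return naive_bits(B) + naive_bits(C) + naive_bits(E)

def best_model(A, candidates):
    """candidates maps a model order k to factors (B, C); return the cheapest k."""
    return min(candidates, key=lambda k: total_length(A, *candidates[k]))

# Toy data: a 6 x 6 matrix made of two 3 x 3 blocks of 1s.
left, right = [1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]
A = [left] * 3 + [right] * 3

candidates = {
    # k = 1 covers only the first block (too simple, large error matrix).
    1: ([[1]] * 3 + [[0]] * 3, [left]),
    # k = 2 covers both blocks exactly.
    2: ([[1, 0]] * 3 + [[0, 1]] * 3, [left, right]),
    # k = 3 repeats the first factor (exact, but needlessly complex).
    3: ([[1, 0, 1]] * 3 + [[0, 1, 0]] * 3, [left, right, left]),
}

for k in sorted(candidates):
    print(k, round(total_length(A, *candidates[k]), 2))
print("selected k:", best_model(A, candidates))
```

On this toy input, both the too-simple and the too-complex candidate cost more bits than the exact two-factor model, which is the valley-shaped behaviour the selection relies on. Keying `candidates` by a (k, τ) pair instead of k alone would work the same way.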
In the left plot, for the DNA data, the landscape forms a valley, with extremely high values for overly complex and overly simplistic models. The value of k at which ASSO minimizes L(A, H) lies in this distinct valley, at k = 54. The shape of the valley

[7] http://people.csail.mit.edu/jrennie/20Newsgroups/
[8] The authors are grateful to Ata Kabán for pre-processing the data. The exact pre-processing is detailed in [Miettinen 2009].
[9] http://kdd.ics.uci.edu/databases/nsfabs/nsfawards.html
[10] NOW public release 030717, available from http://www.helsinki.fi/science/now/ [Fortelius et al. 2003].

ACM Transactions on Knowledge Discovery from Data, Vol. V, No. N, Article A, Publication date: January 2014.

Similar resources

Getting to Know the Unknown Unknowns: Destructive-Noise Resistant Boolean Matrix Factorization

Finding patterns in binary data is a classical problem in data mining, dating back to at least frequent itemset mining. More recently, approaches such as tiling and Boolean matrix factorization (BMF) have been proposed to find sets of patterns that aim to explain the full data well. These methods, however, are not robust against non-trivial destructive noise, i.e. when relatively many 1s are r...

NASSAU: Description Length Minimization for Boolean Matrix Factorization

Boolean Matrix Factorization (BMF) is an important tool in data mining that in many cases makes binary data more interpretable. In BMF one decomposes a given binary matrix into a Boolean product of binary factors such that some cost function is minimized. In this work we consider the description length of the data as the cost, which has been proven effective in uncovering true st...

Inference of Gene Regulatory Networks Based on a Universal Minimum Description Length

The Boolean network paradigm is a simple and effective way to interpret genomic systems, but discovering the structure of these networks remains a difficult task. The minimum description length (MDL) principle has already been used for inferring genetic regulatory networks from time-series expression data and has proven useful for recovering the directed connections in Boolean networks. However...

A new approach for building recommender system using non negative matrix factorization method

Nonnegative matrix factorization is a recent approach to reducing data dimensionality. By exploiting the nonnegativity of the data, the method decomposes the matrix into more closely interrelated components and divides the data into sections within which the entries share a specific relationship. In this paper, we use the nonnegative matrix factorization to decompose the user ratin...

A Projected Alternating Least square Approach for Computation of Nonnegative Matrix Factorization

Nonnegative matrix factorization (NMF) is a common method in data mining that has been used in different applications as a dimension reduction, classification, or clustering method. Methods based on the alternating least squares (ALS) approach are usually used to solve this non-convex minimization problem. At each step of ALS algorithms two convex least squares problems must be solved, which causes high com...

Publication date: 2014