COPA: Constrained PARAFAC2 for Sparse & Large Datasets
Authors
Abstract
PARAFAC2 has demonstrated success in modeling irregular tensors, where the tensor dimensions vary across one of the modes. An example scenario is jointly modeling treatments across a set of patients with a varying number of medical encounters, where the alignment of events in time bears no clinical meaning and may even be impossible because the records differ in length. Despite recent improvements in scaling up unconstrained PARAFAC2, its model factors are usually dense and sensitive to noise, which limits their interpretability. As a result, the following open challenges remain: a) various modeling constraints, such as temporal smoothness, sparsity, and non-negativity, need to be imposed for interpretable temporal modeling, and b) a scalable approach is required to support those constraints efficiently for large datasets. To tackle these challenges, we propose a COnstrained PARAFAC2 (COPA) method, which carefully incorporates optimization constraints such as temporal smoothness, sparsity, and non-negativity in the resulting factors. To efficiently support all those constraints, COPA adopts a hybrid optimization framework using alternating optimization and the alternating direction method of multipliers (AO-ADMM). As evaluated on large electronic health record (EHR) datasets with hundreds of thousands of patients, COPA achieves significant speedups (up to 36× faster) over prior PARAFAC2 approaches that only attempt to handle a subset of the constraints that COPA enables, while achieving the same level of accuracy. Through a case study on temporal phenotyping of medically complex children, we demonstrate how the constraints imposed by COPA reveal concise phenotypes and meaningful temporal profiles of patients. The clinical interpretation of both the phenotypes and the temporal profiles was confirmed by a medical expert.
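To make the role of ADMM in the abstract more concrete, the following is a minimal sketch of how ADMM can enforce a non-negativity constraint on a single least-squares subproblem. It is not the COPA algorithm itself: the function name `admm_nonneg_least_squares`, the penalty parameter `rho`, and the fixed iteration count `n_iter` are illustrative assumptions, and a method of this kind would apply updates like this to each factor matrix inside an outer alternating-optimization loop.

```python
import numpy as np

def admm_nonneg_least_squares(A, b, rho=1.0, n_iter=200):
    """Illustrative ADMM for: minimise 0.5 * ||A x - b||^2 subject to x >= 0.

    Hypothetical helper for exposition only; rho and n_iter are assumed defaults.
    """
    n = A.shape[1]
    # Factor (A^T A + rho I) once; the same factor is reused in every iteration.
    L = np.linalg.cholesky(A.T @ A + rho * np.eye(n))
    Atb = A.T @ b
    x = np.zeros(n)
    z = np.zeros(n)  # auxiliary copy of x that carries the constraint
    u = np.zeros(n)  # scaled dual variable
    for _ in range(n_iter):
        # x-update: unconstrained ridge-like least-squares solve.
        rhs = Atb + rho * (z - u)
        x = np.linalg.solve(L.T, np.linalg.solve(L, rhs))
        # z-update: proximal step, here a projection onto the non-negative orthant.
        z = np.maximum(x + u, 0.0)
        # Dual update: accumulate the disagreement between x and z.
        u = u + x - z
    return z
```

Swapping the projection in the z-update for a different proximal operator (for example soft-thresholding for an l1 sparsity penalty) changes the constraint without touching the least-squares solve, which is the general appeal of an ADMM-based framework for supporting several constraints.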
Similar resources
Mammalian Eye Gene Expression Using Support Vector Regression to Evaluate a Strategy for Detecting Human Eye Disease
Background and purpose: Machine learning is a class of modern and strong tools that can solve many important problems that nowadays humans may be faced with. Support vector regression (SVR) is a way to build a regression model which is an incredible member of the machine learning family. SVR has been proven to be an effective tool in real-value function estimation. As a supervised-learning appr...
Face Recognition in Thermal Images based on Sparse Classifier
Despite recent advances in face recognition systems, they suffer from serious problems because of the extensive types of changes in human face (changes like light, glasses, head tilt, different emotional modes). Each one of these factors can significantly reduce the face recognition accuracy. Several methods have been proposed by researchers to overcome these problems. Nonetheless, in recent ye...
Face Recognition using an Affine Sparse Coding approach
Sparse coding is an unsupervised method which learns a set of over-complete bases to represent data such as image and video. Sparse coding has increasing attraction for image classification applications in recent years. But in the cases where we have some similar images from different classes, such as face recognition applications, different images may be classified into the same class, and hen...
Application to Fault Detection and Diagnosis in Semiconductor Etch
Monitoring and fault detection of batch chemical processes is complicated by stretching of the time axis, resulting in batches of different length. This paper offers an approach to the unequal time axis problem using the Parallel Factor Analysis 2 (PARAFAC2) model. In part I of this series an algorithm for PARAFAC2 was developed and extended to N-way arrays. Unlike PARAFAC, the PARAFAC2 model d...
Sparse Density Representations for Simultaneous Inference on Large Spatial Datasets
Large spatial datasets often represent a number of spatial point processes generated by distinct entities or classes of events. When crossed with covariates, such as discrete time buckets, this can quickly result in a data set with millions of individual density estimates. Applications that require simultaneous access to a substantial subset of these estimates become resource constrained when d...