Improved Cluster Partition in Principal Component Analysis Guided Clustering
Abstract
The principal component analysis (PCA) guided clustering approach is widely used on high-dimensional data to improve the efficiency of K-means cluster solutions. Typically, Pearson correlation is used in PCA to provide an eigenanalysis that yields the components accounting for most of the variation in the data. However, PCA based on Pearson correlation can be sensitive to non-Gaussian data that involve skewed observations such as outlying values. Applying PCA based on Pearson correlation to such data can therefore affect the cluster partitions and generate extremely imbalanced clusters in a high-dimensional space. In this study, Tukey's biweight correlation, based on the M-estimate approach, is used in PCA as an alternative to Pearson correlation. This approach is more resistant to outlying values because it examines each observation and down-weights those that lie far from the center of the data. In particular, two major features are highlighted: (1) fewer components are retained and imbalanced clusters at the recommended cumulative percentage of variation threshold are avoided; (2) the cluster quality with respect to external, internal and relative criteria, as measured by the Rand, Silhouette and Davies-Bouldin indices, exceeds that of the clusters from PCA based on Pearson correlation. General Terms: Data Structures and Algorithms.
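The down-weighting idea in the abstract can be illustrated with a small sketch. The abstract gives no formula, so the code below is an assumption rather than the authors' estimator: it uses the biweight midcorrelation, a common pairwise form built on Tukey's biweight weights, with the conventional tuning constant c = 9. The function names are hypothetical and chosen only for this illustration.

```python
import math
from statistics import median


def pearson(x, y):
    """Ordinary Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    dx = [a - mx for a in x]
    dy = [b - my for b in y]
    num = sum(a * b for a, b in zip(dx, dy))
    den = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return num / den


def _biweight_terms(x, c=9.0):
    """Median-centred deviations multiplied by Tukey biweight weights.

    Assumption: c = 9 is a conventional tuning constant, not a value
    taken from the abstract.
    """
    med = median(x)
    mad = median([abs(a - med) for a in x])  # median absolute deviation
    if mad == 0:
        raise ValueError("MAD is zero; biweight weights are undefined")
    terms = []
    for a in x:
        u = (a - med) / (c * mad)
        # Weight shrinks towards 0 as a point moves away from the median
        # and is exactly 0 once |u| >= 1.
        w = (1 - u * u) ** 2 if abs(u) < 1 else 0.0
        terms.append((a - med) * w)
    return terms


def biweight_midcorrelation(x, y, c=9.0):
    """Pairwise robust correlation (a stand-in for the paper's M-estimate)."""
    tx, ty = _biweight_terms(x, c), _biweight_terms(y, c)
    num = sum(a * b for a, b in zip(tx, ty))
    den = math.sqrt(sum(a * a for a in tx) * sum(b * b for b in ty))
    return num / den


# Illustrative data: y follows x exactly except for one outlying value.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [1, 2, 3, 4, 5, 6, 7, 8, 9, -50]

print("Pearson: ", pearson(x, y))
print("Biweight:", biweight_midcorrelation(x, y))
```

On this toy data the single outlier drives the Pearson coefficient negative, while the biweight version gives that point zero weight and stays strongly positive. In a PCA-guided pipeline, a matrix of such pairwise values would take the place of the Pearson correlation matrix before the eigenanalysis.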
Similar resources
Principal component methods - hierarchical clustering - partitional clustering : why would we need to choose for visualizing data ?
This paper combines three exploratory data analysis methods, principal component methods, hierarchical clustering and partitioning, to enrich the description of the data. Principal component methods are used as a preprocessing step for the clustering in order to denoise the data, transform categorical data into continuous data, or balance groups of variables. The principal component representation ...
Cluster Ensembles for High Dimensional Clustering: An Empirical Study
This paper studies cluster ensembles for high dimensional data clustering. We examine three different approaches to constructing cluster ensembles. To address high dimensionality, we focus on ensemble construction methods that build on two popular dimension reduction techniques, random projection and principal component analysis (PCA). We present evidence showing that ensembles generated by ran...
Using Clustering and Factor Analysis in Cross Section Analysis Based on Economic-Environment Factors
Homogeneity of groups is important in studies that use cross-section and multi-level data. Most studies in economics, especially panel data analyses, need some kind of homogeneity to ensure the validity of their results. This paper presents methods known as clustering and homogenization of groups in cross-section studies based on enviro-economic components. For this, a sample of 92 countries which...
Validity-guided (re)clustering with applications to image segmentation
When clustering algorithms are applied to image segmentation, the goal is to solve a classification problem. However, these algorithms do not directly optimize classification quality. As a result, they are susceptible to two problems: P1) the criterion they optimize may not be a good estimator of "true" classification quality, and P2) they often admit many (suboptimal) solutions. This paper i...