Clustering by Maximizing Sum-of-Squared Separation Distance
Authors
Abstract
Maximizing the separating margin is crucial for the good generalization performance of Support Vector Machines (SVMs). By analogy with the separation distance, or separating margin, in SVMs, we propose a definition of separation distance for clustering tasks in which a hyperplane is used to separate clusters. Given training data and a distance metric, our clustering algorithm maximizes the proposed separation distance to construct an “optimal” hyperplane that can later be applied to unseen data. Through an appropriate distance feature mapping, the resulting hyperplane corresponds to a nonlinear decision boundary in the input feature space. A graph-theoretic perspective of the proposed method is also discussed. In particular, we show that, under certain conditions, the proposed clustering algorithm is equivalent to a spectral relaxed graph cut. Extensive experimental results are provided to validate the method.
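To make the phrase "spectral relaxed graph cut" concrete, the following is a minimal generic sketch, not the algorithm proposed in the paper. It assumes one common relaxation (the ratio cut, solved by thresholding the second-smallest eigenvector of the unnormalized graph Laplacian) and a Gaussian affinity with a hypothetical bandwidth `sigma`; the function names and the toy data are illustrative only.

```python
import math
import random


def gaussian_affinity(points, sigma=1.0):
    """Symmetric affinity matrix W with W[i][j] = exp(-||xi - xj||^2 / (2 sigma^2))."""
    n = len(points)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            W[i][j] = W[j][i] = math.exp(-d2 / (2.0 * sigma ** 2))
    return W


def fiedler_vector(W, iters=500, seed=0):
    """Approximate the eigenvector of L = D - W with the second-smallest eigenvalue.

    Uses power iteration on (c*I - L) while projecting out the constant
    vector, which is the eigenvector of L for eigenvalue 0.
    """
    n = len(W)
    deg = [sum(row) for row in W]
    c = 2.0 * max(deg)  # upper bound on the eigenvalues of L
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    for _ in range(iters):
        mean = sum(v) / n
        v = [x - mean for x in v]  # stay orthogonal to the constant vector
        y = [(c - deg[i]) * v[i] + sum(W[i][j] * v[j] for j in range(n))
             for i in range(n)]
        norm = math.sqrt(sum(x * x for x in y))
        v = [x / norm for x in y]
    mean = sum(v) / n
    return [x - mean for x in v]


def spectral_bipartition(points, sigma=1.0):
    """Split points into two clusters by the sign of the Fiedler vector."""
    v = fiedler_vector(gaussian_affinity(points, sigma))
    return [1 if x > 0 else 0 for x in v]


# Toy data (assumed for illustration): two well-separated groups in the plane.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
toy_labels = spectral_bipartition(toy_points)
```

Thresholding the eigenvector at zero is the simplest rounding step. Which cluster receives label 0 or 1 depends on the arbitrary sign of the eigenvector, so only the grouping is meaningful.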
Similar resources
A survey on exact methods for minimum sum-of-squares clustering
Minimum sum-of-squares clustering (MSSC) consists in partitioning a given set of n entities into k clusters in order to minimize the sum of squared distances from the entities to the centroid of their cluster. Among many criteria used for cluster analysis, the minimum sum-of-squares is one of the most popular since it expresses both homogeneity and separation. A mathematical programming formula...
Semi-supervised composite kernel learning using distance metric learning techniques
The distance metric plays a key role in many machine learning and computer vision algorithms, so choosing an appropriate distance metric directly affects the performance of such algorithms. Recently, distance metric learning using labeled data or other available supervisory information has become a very active research area in machine learning applications. Studies in this area have shown t...
An Efficient Unified K-Means Clustering Technique for Microarray Gene Expression Data
Problem statement: Using microarray techniques, one can monitor the expression levels of thousands of genes simultaneously. One challenge is how to derive meaningful insights from the expressed data. This can be done with clustering techniques such as hierarchical and k-means clustering, but most clustering techniques are largely heuristic in nature and are associated with some unresolved is...
Repeated Record Ordering for Constrained Size Clustering
One of the main techniques used in data mining is data clustering, which has many applications in computer science, biology, and the social sciences. Constrained clustering is a type of clustering in which side information provided by the user is incorporated into existing clustering algorithms. One well-researched constrained clustering algorithm is microaggregation. In a microaggreg...
T-test distance and clustering criterion for speaker diarization
In this paper, we present an application of Student’s t-test to measure the similarity between two speaker models. The measure is evaluated by comparing it with other distance metrics (the Generalized Likelihood Ratio, the Cross Likelihood Ratio, and the Normalized Cross Likelihood Ratio) in a speaker detection task. We also propose an objective criterion for speaker clustering. The criterion deduces ...