Beyond Hartigan Consistency: Merge Distortion Metric for Hierarchical Clustering
Abstract
Hierarchical clustering is a popular method for analyzing data which associates a tree with a dataset. Hartigan consistency has been used extensively as a framework to analyze such clustering algorithms from a statistical point of view. Still, as we show in the paper, a tree which is Hartigan consistent with a given density can look very different from the correct limit tree. Specifically, Hartigan consistency permits two types of undesirable configurations which we term over-segmentation and improper nesting. Moreover, Hartigan consistency is a limit property and does not directly quantify differences between trees. In this paper we identify two limit properties, separation and minimality, which address both over-segmentation and improper nesting and together imply (but are not implied by) Hartigan consistency. We proceed to introduce a merge distortion metric between hierarchical clusterings and show that convergence in our distance implies both separation and minimality. We also prove that uniform separation and minimality imply convergence in the merge distortion metric. Furthermore, we show that our merge distortion metric is stable under perturbations of the density. Finally, we demonstrate the applicability of these concepts by proving convergence results for two clustering algorithms. First, we show convergence (and hence separation and minimality) of the recent robust single linkage algorithm of Chaudhuri and Dasgupta (2010). Second, we provide convergence results on manifolds for topological split tree clustering.
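The metric described above compares two hierarchical clusterings through their merge heights: for each pair of points, the level at which the pair first belongs to a common cluster, with the distance taken as the largest disagreement over all pairs. A minimal sketch, assuming each tree is given as a SciPy-style merge list (leaves numbered 0..n-1, the i-th merge creating cluster n+i); `merge_heights` and `merge_distortion` are illustrative names, not the paper's implementation, and the sup-norm form below elides the paper's treatment of density cluster trees:

```python
from itertools import combinations

def merge_heights(n_points, merges):
    """Pairwise merge heights from a sequence of (cluster_a, cluster_b, height)
    merges. Leaves are 0..n-1; the i-th merge creates cluster n_points + i."""
    members = {i: {i} for i in range(n_points)}
    heights = {}
    for i, (a, b, h) in enumerate(merges):
        # every cross-pair between the two clusters merges at this height
        for x in members[a]:
            for y in members[b]:
                heights[frozenset((x, y))] = h
        members[n_points + i] = members.pop(a) | members.pop(b)
    return heights

def merge_distortion(n_points, merges_f, merges_g):
    """Sup-norm difference between the two merge-height functions."""
    hf = merge_heights(n_points, merges_f)
    hg = merge_heights(n_points, merges_g)
    return max(abs(hf[p] - hg[p])
               for p in (frozenset(pair)
                         for pair in combinations(range(n_points), 2)))
```

For example, two trees on three points that merge the pair {0, 1} at heights 0.9 versus 0.8, and attach point 2 at 0.4 versus 0.5, are at merge distortion 0.1: the worst-case disagreement over the three pairs.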
Similar resources
Improved Error Bounds for Tree Representations of Metric Spaces
Estimating optimal phylogenetic trees or hierarchical clustering trees from metric data is an important problem in evolutionary biology and data analysis. Intuitively, the goodness-of-fit of a metric space to a tree depends on its inherent treeness, as well as other metric properties such as intrinsic dimension. Existing algorithms for embedding metric spaces into tree metrics provide distortio...
Clustering of Musical Sounds using
This paper describes a hierarchical clustering of musical signals based on information derived from spectral and bispectral acoustic distortion measures. This clustering reveals the ultrametric structure that exists in the set of sounds, with a clear interpretation of the distances between the sounds as the statistical divergence between the sound models. Spectral, bispectral and combined clus...
Temporal Hierarchical Clustering
We study hierarchical clusterings of metric spaces that change over time. This is a natural geometric primitive for the analysis of dynamic data sets. Specifically, we introduce and study the problem of finding a temporally coherent sequence of hierarchical clusterings from a sequence of unlabeled point sets. We encode the clustering objective by embedding each point set into an ultrametric spa...
Olli Virmajoki Pairwise Nearest Neighbor Method Revisited
The pairwise nearest neighbor (PNN) method, also known as Ward's method, belongs to the class of agglomerative clustering methods. The PNN method generates a hierarchical clustering using a sequence of merge operations until the desired number of clusters is obtained. This method selects the cluster pair to be merged so that it increases the given objective function value least. The main drawback ...
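The greedy merge rule described in this teaser can be sketched as follows, assuming the Ward objective (total within-cluster squared error) on 1-D points; `ward_cost` and `pnn` are illustrative names, and the brute-force pair scan per step is for clarity, not the paper's optimized method:

```python
def ward_cost(ca, cb):
    """Increase in total squared error when merging 1-D point lists ca and cb:
    |a||b| / (|a|+|b|) * (mean_a - mean_b)^2."""
    na, nb = len(ca), len(cb)
    ma, mb = sum(ca) / na, sum(cb) / nb
    return na * nb / (na + nb) * (ma - mb) ** 2

def pnn(points, k):
    """Agglomerate 1-D points down to k clusters by the PNN rule:
    always merge the pair whose merge increases the objective least."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Running `pnn([0.0, 0.1, 5.0, 5.2], 2)` merges the two cheap nearby pairs first and returns the two well-separated groups; recording the merge order instead of stopping at k would yield the full hierarchy.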
Package 'BHC'
December 21, 2016. Type: Package. Title: Bayesian Hierarchical Clustering. Version: 1.26.0. Date: 2011-09-07. Authors: Rich Savage, Emma Cooke, Robert Darkins, Yang Xu. Maintainer: Rich Savage. Description: The method performs bottom-up hierarchical clustering, using a Dirichlet Process (infinite mixture) to model uncertainty in the data and Bayesian model selection to decide at eac...