Cluster Stability Analysis using Sub-sampling

Authors

  • Reda Alhajj
  • Osman Abul
  • Faruk Polat
Abstract

Cluster stability research is concerned with the validity of the clusters generated by a clustering algorithm. In other words, it asks whether the generated clusters are true clusters or arose by chance. Estimating the true number of clusters is a related problem, since cluster validity is often based on this estimate. The literature offers a number of methods for both purposes. In most cases, assessing validity amounts to determining the best parameter for the clustering algorithm. Confidence estimation is addressed in comparatively few papers, and in those, confidence is given in terms of the proportion of cases that cluster together. Our motivation is to estimate confidence in the clusters themselves, rather than in specific cases. From this perspective, we propose three meta-methods for the cluster stability problem. To the best of our knowledge, these methods are novel. All three are based on sub-sampling of the dataset, and they are general: they can be used to evaluate clusterings generated by a wide range of clustering algorithms.

The first method begins by clustering the data with the given clustering algorithm and cluster count. It then randomly samples from the labelled clusters and builds a supervised classifier on the selected subset; the induced classifier is evaluated on the non-selected portion. The random sub-sampling and evaluation steps are repeated many times, and the overall accuracy gives the stability of the clustering. To find the most stable clustering for the given algorithm, these steps are repeated for every candidate number of clusters, and the most stable clustering is chosen for confidence estimation. As an alternative to random sub-sampling, 10-fold cross-validation is also employed.

The second method is based on selecting subsets of the original clusters. First, the given clustering algorithm finds the clusters. Then, for each subset of these clusters, an algorithm that estimates the true number of clusters is applied. The argument is that if the initial clustering is stable, the estimated number of clusters for each subset should equal the cardinality of that subset. This single pass assesses the reliability of the clustering itself. If the reliability of a randomized algorithm such as k-means is the concern, the whole procedure is repeated and the results are averaged. Confidence is computed as the ratio of correct estimations. When the clustering yields a large number of clusters (say, 20), trying all subsets becomes computationally intractable, so we resort to sampling subsets instead.

The third method relies on the idea that if a cluster is stable, further clustering of the cases in it will reveal a single cluster. For each cluster, an estimator algorithm is run and is expected to report exactly one cluster. The whole step is repeated many times on sub-samples of the dataset, i.e. a bootstrapping approach, and confidence is computed as in the second method. The second and third methods can also be used to select the best number of clusters, in the sense of the number that gives the highest confidence.
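The first method can be illustrated with a short sketch. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not fix the clustering algorithm, the classifier, or the sampling fraction, so the plain k-means routine, the nearest-centroid classifier, and the names `kmeans`, `stability`, `make_blobs`, `rounds`, and `train_frac` are all hypothetical choices made here to keep the example dependent on the standard library only.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points (tuples of floats)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(pts):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(pts) for c in zip(*pts))

def nearest(p, centers):
    """Index of the center closest to p."""
    return min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))

def kmeans(points, k, rng, iters=50):
    """Plain Lloyd's k-means (assumed stand-in for 'the given algorithm')."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [nearest(p, centers) for p in points]
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Keep the old center if a cluster ends up empty.
            new_centers.append(mean(members) if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return labels

def stability(points, labels, rng, rounds=30, train_frac=0.5):
    """Mean held-out accuracy over repeated random sub-samples.

    Each round trains a nearest-centroid classifier (an assumed choice)
    on a random subset of the labelled points and scores it on the rest.
    Assumes 0 < train_frac < 1 so both portions are non-empty.
    """
    n = len(points)
    scores = []
    for _ in range(rounds):
        idx = list(range(n))
        rng.shuffle(idx)
        cut = int(n * train_frac)
        train, test = idx[:cut], idx[cut:]
        by_label = {}
        for i in train:
            by_label.setdefault(labels[i], []).append(points[i])
        classes = list(by_label)
        cents = [mean(by_label[c]) for c in classes]
        hits = sum(classes[nearest(points[i], cents)] == labels[i] for i in test)
        scores.append(hits / len(test))
    return sum(scores) / len(scores)

def make_blobs(rng, per_blob=40):
    """Two well-separated 2-D Gaussian blobs (synthetic demo data)."""
    pts = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(per_blob)]
    pts += [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(per_blob)]
    return pts

rng = random.Random(0)
points = make_blobs(rng)
for k in (2, 3, 4):
    labels = kmeans(points, k, rng)
    print(k, round(stability(points, labels, rng), 3))
```

The loop at the end mirrors the step of repeating the procedure for each candidate cluster count and comparing the resulting scores. One caveat of this particular sketch: a nearest-centroid classifier shares its geometry with k-means, so the scores may stay high even for a poor cluster count; a different classifier, or the 10-fold cross-validation variant mentioned in the abstract, may separate the candidates more clearly.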

Similar resources

Genetic Contribution of Grapevine (Vitis Vinifera L.) Main Yield Components in Final Yield

Objective: Yield components and genetic contribution have the most important in final yield and breeding programs of crop plants. For this purpose, 20 varieties of grapevines with Russia origin were evaluated in Urmia and Takestan research station (under full irrigation and drought stress). Methods: Twenty grapevine genotypes were evaluated in Urmia and Takestan research station (under full irr...

Analysis of the Role of Cultural and Social Capital on Family Stability among Couples in Chahardangeh

The purpose of this study is to investigate the effect of cultural capital and social capital on family stability among couples in Chahardangeh. The research method is descriptive-correlational done by survey method. The statistical population was couples living in Chahardangeh in the second half of 2000. According to Cochran's formula, 200 people were selected as the sample size and the sampli...

Supervised sampling for clustering large data sets

The problem of clustering large data sets has attracted a lot of current research. The approaches taken are mainly based either on the more efficient implementation or modification of existing methods or/and on the construction of clusters from a small sub-sample of the data and then the assignment of all observations in those clusters. The current paper focuses on the latter direction. An alte...

When Less is More: Improvements in Medical Image Segmentation through Spatial Sub-Sampling

Segmentation is a common task in medical image analysis. It is frequently solved by fitting an intensity model, consisting of distributions for each pure tissue and each partial volume tissue combination, to the intensity histogram of the image data. However, this approach discards any spatial information present in the data. We present a method that recovers some of this information via region...

Publication date: 2003