Semi-supervised incremental clustering of categorical data

Authors

  • Dan A. Simovici
  • Namita Singla
Abstract

Semi-supervised clustering combines supervised and unsupervised learning to produce better clusterings. In the initial, supervised phase of the algorithm, a training sample is produced by random selection. The examples in the training sample are assumed to be labeled by a class attribute. An incremental algorithm developed for categorical data is then used to produce a set of pure clusters (such that the examples in each cluster have the same label), which serve as “seeding clusters” for the second, unsupervised phase of the algorithm. In this phase, the incremental algorithm is applied to the unlabeled data. The quality of the clustering is evaluated by the average Gini index of the clusters. Experiments show that very good clusterings can be obtained with small training samples.
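
The evaluation measure named in the abstract can be illustrated with a minimal sketch. It assumes the Gini index of a cluster is the usual impurity of its class labels, 1 minus the sum of squared class proportions, so that a pure cluster scores 0. The abstract does not say whether the average is weighted by cluster size, so both variants are shown. The function names `gini_index` and `average_gini` are hypothetical and do not come from the paper.

```python
from collections import Counter


def gini_index(labels):
    """Gini impurity of one cluster: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        raise ValueError("cluster must contain at least one example")
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def average_gini(clusters, weighted=False):
    """Mean Gini index over clusters.

    Assumption: the unweighted mean is the default; weighted=True
    weights each cluster by its size instead.
    """
    ginis = [gini_index(c) for c in clusters]
    if weighted:
        total = sum(len(c) for c in clusters)
        return sum(g * len(c) for g, c in zip(ginis, clusters)) / total
    return sum(ginis) / len(ginis)


# Each inner list holds the class labels of the examples in one cluster.
clusters = [["a", "a", "a"], ["a", "b"], ["b", "b", "b", "b"]]
print(average_gini(clusters))
print(average_gini(clusters, weighted=True))
```

Under this reading, lower values indicate clusters that are closer to pure, and the seeding clusters produced in the supervised phase would each have a Gini index of 0.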

Similar resources

A Semi-supervised Learning Framework to Cluster Mixed Data Types

We propose a semi-supervised framework to handle diverse data formats or data with mixed-type attributes. Our preliminary results in clustering data with mixed numerical and categorical attributes show that the proposed semi-supervised framework gives better clustering results in the categorical domain. Thus the seeds obtained from clustering the numerical domain give additional knowledge to ...

Extracting Prior Knowledge from Data Distribution to Migrate from Blind to Semi-Supervised Clustering

Although many studies have been conducted to improve clustering efficiency, most state-of-the-art schemes suffer from a lack of robustness and stability. This paper proposes an efficient approach to elicit prior knowledge, in the form of must-link and cannot-link constraints, from the estimated distribution of raw data in order to convert a blind clustering problem into a semi-supervised o...

An Improved Semi-Supervised Clustering Algorithm Based on Active Learning

Semi-supervised clustering is one of the major tasks and aims at grouping the data objects into meaningful classes (clusters) such that the similarity of objects within clusters is maximized and the similarity of objects between clusters is minimized. The dataset may sometimes be mixed in nature, that is, it may consist of both numeric and categorical data. Naturally these two types of...

Active Learning of constraints using incremental approach in semi-supervised clustering

Semi-supervised clustering aims to improve clustering performance by considering user-provided side information in the form of pairwise constraints. We study the active learning problem of selecting must-link and cannot-link pairwise constraints for semi-supervised clustering. We consider active learning in an iterative framework; in each iteration, queries are selected based on the current cluster...

Wised Semi-Supervised Cluster Ensemble Selection: A New Framework for Selecting and Combing Multiple Partitions Based on Prior knowledge

The Wisdom of Crowds, an innovative theory described in social science, claims that the aggregate decisions made by a group will often be better than those of its individual members if the four fundamental criteria of this theory are satisfied. This theory has been used in clustering problems. Previous research showed that this theory can significantly increase the stability and performance of...

Publication date: 2005