Unsupervised Learning with Mixed Numeric and Nominal Data

Authors

  • Cen Li
  • Gautam Biswas
Abstract

This paper presents a Similarity-Based Agglomerative Clustering (SBAC) algorithm that works well for data with mixed numeric and nominal features. A similarity measure, proposed by Goodall for biological taxonomy [15], that gives greater weight to uncommon feature value matches in similarity computations and makes no assumptions of the underlying distributions of the feature values, is adopted to define the similarity measure between pairs of objects. An agglomerative algorithm is employed to construct a dendrogram and a simple distinctness heuristic is used to extract a partition of the data. The performance of SBAC has been studied on real and artificially generated data sets. Results demonstrate the effectiveness of this algorithm in unsupervised discovery tasks. Comparisons with other clustering schemes illustrate the superior performance of this approach.

Index Terms: Agglomerative clustering, conceptual clustering, feature weighting, interpretation, knowledge discovery, mixed numeric and nominal data, similarity measures, χ² aggregation.
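The idea in the abstract (matches on uncommon values count for more, followed by agglomerative merging) can be illustrated with a minimal sketch. This is not the SBAC algorithm or Goodall's exact measure: the nominal score `1 - f(f-1)/(n(n-1))`, the range-normalized numeric score, the plain averaging across features, and the average-linkage merge rule are all simplifying assumptions, and every function name here is hypothetical. The distinctness heuristic for cutting the dendrogram is not covered.

```python
from collections import Counter
from itertools import combinations


def nominal_similarity(column):
    """Score pairs in one nominal feature; rarer matching values score higher."""
    n = len(column)
    counts = Counter(column)

    def score(a, b):
        if a != b:
            return 0.0
        f = counts[a]
        # Chance that two objects drawn at random both carry this value.
        p2 = f * (f - 1) / (n * (n - 1))
        return 1.0 - p2

    return score


def numeric_similarity(column):
    """Assumed stand-in: 1 minus the range-normalized absolute difference."""
    lo, hi = min(column), max(column)
    span = (hi - lo) or 1.0
    return lambda a, b: 1.0 - abs(a - b) / span


def pairwise_distance(rows, nominal_idx):
    """Return {(i, j): distance} for i < j, averaging per-feature similarities."""
    n_feat = len(rows[0])
    scorers = []
    for k in range(n_feat):
        col = [r[k] for r in rows]
        make = nominal_similarity if k in nominal_idx else numeric_similarity
        scorers.append(make(col))
    dist = {}
    for i, j in combinations(range(len(rows)), 2):
        sim = sum(s(rows[i][k], rows[j][k]) for k, s in enumerate(scorers)) / n_feat
        dist[(i, j)] = 1.0 - sim
    return dist


def agglomerate(n, dist):
    """Naive average-linkage merging; returns a list of (cluster_a, cluster_b, distance)."""
    clusters = [(i,) for i in range(n)]
    merges = []

    def d(a, b):
        total = sum(dist[(min(i, j), max(i, j))] for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: d(*pair))
        merges.append((a, b, d(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [tuple(sorted(a + b))]
    return merges


# Toy data: feature 0 is nominal, feature 1 is numeric.
rows = [("a", 0.0), ("a", 0.1), ("b", 10.0), ("b", 9.9)]
merges = agglomerate(len(rows), pairwise_distance(rows, nominal_idx={0}))
for left, right, distance in merges:
    print(left, right, round(distance, 3))
```

The merge list plays the role of a dendrogram: each entry records which two clusters were joined and at what distance. The pairwise search is recomputed at every step, which is fine for a toy example but too slow for real data sets.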

Similar resources

Clustering Mixed Data via Diffusion Maps

Data clustering is a common technique for statistical data analysis, which is used in many fields, including machine learning, data mining, customer segmentation, trend analysis, pattern recognition and image analysis. Although many clustering algorithms have been proposed, most of them deal with clustering of numerical data. Finding the similarity between numeric objects usually relies on a com...

Towards a General Technique for Transformation of Nominal Features into Numeric Features in Supervised Learning

Almost all machine learning problems require data preprocessing. This stage is especially important for problems where the datasets contain features of mixed types (i.e. nominal and numeric). A common practice in such cases is to transform each nominal feature into many dummy (i.e. binary) features. Also many classification algorithms have preference of numeric attributes over nominal a...

The Mining and Analysis of Data with Mixed Attribute Types

Mining and analysis of large data sets has become a major contributor to the exploitation of Artificial Intelligence in a wide range of real life challenges, including education, business intelligence and research. In the field of education, the mining, extraction and exploitation of useful information and patterns from student data provides lecturers, trainers and organisations with the potent...

Conceptual Clustering with Numeric-and-Nominal Mixed Data | A New Similarity Based System

This paper presents a new Similarity Based Agglomerative Clustering (SBAC) algorithm that works well for data with mixed numeric and nominal features. A similarity measure, proposed by Goodall for biological taxonomy [13], that gives greater weight to uncommon feature-value matches in similarity computations and makes no assumptions of the underlying distributions of the feature-values, is adopte...

Knowledge Discovery from Mixed Data by Artificial Neural Network with Unsupervised Learning

Knowledge discovery or data mining from massive data has been a hot issue in business and academia in recent years. Real-world data are usually of mixed type, consisting of categorical and numeric attributes. Mining knowledge from massive, mixed data is a challenge. To explore unknown data, visualized analysis allows users to gain some initial understanding regarding the data and to prepare for further...

Journal:
  • IEEE Trans. Knowl. Data Eng.

Volume 14  Issue 

Pages  -

Publication date: 2002