Clustering Categorical Sequences with Variable-Length Tuples Representation

نویسندگان

  • Liang Yuan
  • Zhiling Hong
  • Lifei Chen
  • Qiang Cai
چکیده

Clustering categorical sequences is currently a difficult problem due to the lack of an efficient representation model for sequences. Unlike the existing models, which mainly focus on the fixed-length tuples representation, in this paper, a new representation model on the variablelength tuples is proposed. The variable-length tuples are obtained using a pruning method applied to delete the redundant tuples from the suffix tree, which is created for the fixed-length tuples with a large memorylength of sequences, in terms of the entropy-based measure evaluating the redundancy of tuples. A partitioning algorithm for clustering categorical sequences is then defined based on the normalized representation using tuples collected from the pruned tree. Experimental studies on six real-world sequence sets show the effectiveness and suitability of the proposed method for subsequence-based clustering.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

LIMBO: Scalable Clustering of Categorical Data

Clustering is a problem of great practical importance in numerous applications. The problem of clustering becomes more challenging when the data is categorical, that is, when there is no inherent distance measure between data values. We introduce LIMBO, a scalable hierarchical categorical clustering algorithm that builds on the Information Bottleneck (IB) framework for quantifying the relevant ...

متن کامل

LIMBO: A Scalable Algorithm to Cluster Categorical Data

Clustering is a problem of great practical importance in numerous applications. The problem of clustering becomes more challenging when the data is categorical, that is, when there is no inherent distance measure between data values. In this work, we introduce LIMBO, a scalable hierarchical categorical clustering algorithm that builds on the Information Bottleneck (IB) framework for quantifying...

متن کامل

Scalable Hierarchical Clustering Method for Sequences of Categorical Values

Data clustering methods have many applications in the area of data mining. Traditional clustering algorithms deal with quantitative or categorical data points. However, there exist many important databases that store categorical data sequences, where significant knowledge is hidden behind sequential dependencies between the data. In this paper we introduce a problem of clustering categorical da...

متن کامل

Clustering From Categorical Data Sequences

The three-parameter cluster model is a combinatorial stochastic process that generates categorical response sequences by randomly perturbing a fixed clustering parameter. This clear relationship between the observed data and the underlying clustering is particularly attractive in cluster analysis, in which supervised learning is a common goal and missing data is a familiar issue. The model is w...

متن کامل

Categorical Data Clustering based on an Alternative Data Representation Technique

Clustering categorical data is relatively difficult than clustering numeric data. In numeric data the inherent geometric properties can be used in defining distance functions between data points. In case of categorical data, a distance or dissimilarity function can’t be defined directly. An extension of the classical k-means algorithm for categorical data has been done in [1], where a method of...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2016