Exact clustering in linear time

Authors

  • Jonathan A. Marshall
  • Lawrence C. Rafsky
Abstract

The time complexity of data clustering has been viewed as fundamentally quadratic, slowing with the number of data items, as each item is compared for similarity to preceding items. Clustering of large data sets has been infeasible without resorting to probabilistic methods or to capping the number of clusters. Here we introduce MIMOSA, a novel class of algorithms that achieve linear-time computational complexity on clustering tasks. MIMOSA algorithms mark and match partial-signature keys in a hash table to obtain exact, error-free cluster retrieval. Benchmark measurements, on clustering a data set of 10,000,000 news articles by news topic, found that a MIMOSA implementation finished more than four orders of magnitude faster than a standard centroid implementation.

Summary: Big-data computations that would have taken years can now be done in minutes.

INTRODUCTION

Data clustering techniques are widely used in computational data science. With increasing data capacities and speeds in computing, scientists in many domains (1, 2, 3, 4, 5, 6) seek to perform clustering on ever-larger “big data” sets. Clustering may entail comparing data items to one another along several dimensions and assigning similar data items to the same group. With large data sets, similarity computation becomes slow and expensive, as each data item is compared to a large number of other data items. The time complexity of similarity clustering has been viewed as fundamentally O(n²): quadratic in the number of data items. Even with aggressive techniques such as probabilistic algorithms (3, 4, 5, 6, 7, 8, 9, 10), partitioning (3, 4, 10, 11, 12), and parallelization, comparing similarity between the items in a large data set can require a prohibitive amount of computation (2, 3, 5, 7, 8, 9), or can yield generally inferior clustering (13, 14).

The k-means algorithm limits the comparisons of each item to k cluster centroids, resulting in O(nk) time complexity for a small, fixed value of k determined in advance. Large data sets may have many clusters; limiting the number to a fixed k may result in inadequate cluster quality for certain applications.

Similarity clustering in linear or near-linear time can be obtained via probabilistic clustering algorithms, such as MinHash methods (4, 7, 8, 9) – but at the cost of admitting errors in retrieval, such as false negatives, in which the algorithm may (with small probability) erroneously omit certain cluster members during cluster retrieval. An omission may be tolerable in some application domains (e.g., document deduplication, advertisement targeting), but is unacceptable in others (e.g., medical diagnosis, scientific analysis, engineering designs) – which may require or prefer an error-free, or exact, clustering method rather than a probabilistic, or approximate, one.

Unlike earlier methods, MIMOSA (Mark-In, Match-Out Similarity Analysis) algorithms perform exact similarity clustering in O(n) time (linear time complexity in the number of data items), neither suffering probabilistic errors nor capping the number of clusters. Because a MIMOSA algorithm takes about the same time to process the millionth data item as it takes to process the first, it performs clustering faster than other exact methods when the number of data items is large.
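The quadratic baseline described in the introduction can be made concrete with a short sketch. The following Python snippet is purely illustrative (the function names, the union-find helper, and the θ = 0.5 default are assumptions made for this example, not anything specified by the paper): it groups items whose signatures meet a Jaccard-similarity threshold, and the doubly nested loop is exactly where the O(n²) cost comes from.

```python
# Illustrative sketch of the naive, quadratic pairwise-comparison baseline
# described in the introduction. This is NOT the MIMOSA algorithm.

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a ∩ b| / |a ∪ b| of two signatures."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def naive_threshold_clustering(signatures, theta=0.5):
    """Group items whose signatures have Jaccard similarity >= theta.

    Every item is compared with every other item, so the running time
    grows quadratically with the number of items.
    """
    n = len(signatures)
    parent = list(range(n))                  # simple union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]    # path halving
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):            # the O(n^2) pairwise loop
            if jaccard(signatures[i], signatures[j]) >= theta:
                parent[find(i)] = find(j)    # merge the two clusters

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())
```

Even with a cheap per-pair similarity test, the sheer number of pairs makes this approach impractical at the ten-million-item scale used in the paper's benchmark.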
Data item signatures

MIMOSA algorithms are signature-based. Each data item has a signature $S_i = \{S_{i1}, \ldots, S_{in_i}\}$: a limited-size set of elements that describe the data item, so that the signatures of similar data items may have one or more elements in common. MIMOSA finds data items whose signatures are similar, and clusters them accordingly.

For example, in a news analysis application where each data item is a news article, a signature might be a set of keywords denoting the most important people, companies, and events in the article. An article of 700 words, entitled “School, infrastructure bond measures fill U.S. ballots,” might have signature BALLOT-BOND-BORROW-CALIFORNIA-INFRASTRUCTURE-MEASURE-MUNICIPAL-SCHOOL-TAX-TRANSIT-VOTE-YIELD. Each element is chosen or derived for high informational value. Terms of lower value, such as common stopwords (“the”) or words appearing infrequently in the article (“airport”), are typically omitted from a news article signature. Articles whose signatures share several elements – i.e., cover the same news topic – can belong to the same cluster.

In other example applications, a signature can describe the expressed proteins from a gene, chemical activity measurements, a user’s web browsing behavior, a mailing address for marketing, patient medical symptoms, a business’s credit history, psychological, demographic, or census survey entries, or sensor readings from a scientific apparatus or industrial machine.

Structure of MIMOSA algorithms

A MIMOSA run is preconfigured by specifying a similarity measure $s$, a minimum similarity threshold value θ, and a list A of the size values that are allowed or expected for signatures. When $s(X, Y)$ meets or exceeds θ, then X and Y are said to be similar to each other. One popular and useful similarity measure is Jaccard similarity, $s(X, Y) \equiv |X \cap Y| \,/\, |X \cup Y|$, in which the pairwise similarity score depends on the sizes of both the intersection (overlap) and the union of the two signatures X and Y.

A partial signature is a subset of the elements of a signature. For the news article example above, one of the partial signatures is BOND-CALIFORNIA-SCHOOL-TAX-VOTE. This example partial signature has size 5, the number of its elements. A signature of size n has at most $2^n - 1$ partial signatures. From each signature $S_i$, a MIMOSA algorithm derives a set of partial signatures of certain sizes:
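Since this excerpt ends before the formal construction, the following Python sketch only illustrates the vocabulary introduced above: partial signatures of chosen sizes used as hash-table keys, with a “mark” step that records an item under each of its keys and a “match” step that looks keys up. The key sizes, function names, and keying scheme are assumptions made for illustration; this is not the paper's actual MIMOSA procedure, and a real exact method would still verify retrieved candidates against the similarity threshold θ.

```python
# Hypothetical illustration of partial-signature keys in a hash table.
# The keying scheme and the choice of sizes are assumptions for this sketch,
# not the MIMOSA construction itself.
from collections import defaultdict
from itertools import combinations

def partial_signatures(signature, sizes):
    """Yield the partial signatures (subsets) of the given sizes."""
    for k in sizes:
        for subset in combinations(sorted(signature), k):
            yield subset                      # a tuple, usable as a dict key

table = defaultdict(set)                      # partial-signature key -> item ids

def mark_in(item_id, signature, sizes):
    """'Mark in': record the item under each of its partial-signature keys."""
    for key in partial_signatures(signature, sizes):
        table[key].add(item_id)

def match_out(signature, sizes):
    """'Match out': retrieve items that share at least one key."""
    hits = set()
    for key in partial_signatures(signature, sizes):
        hits |= table[key]
    return hits

# The news-article signature from the text, marked with size-5 keys
# (an arbitrary size chosen for this example).
article = {"BALLOT", "BOND", "BORROW", "CALIFORNIA", "INFRASTRUCTURE", "MEASURE",
           "MUNICIPAL", "SCHOOL", "TAX", "TRANSIT", "VOTE", "YIELD"}
mark_in("article-1", article, sizes=[5])

# The example partial signature from the text retrieves the article.
print(match_out({"BOND", "CALIFORNIA", "SCHOOL", "TAX", "VOTE"}, sizes=[5]))
# -> {'article-1'}
```

Intuitively, because signatures are limited-size sets and the allowed sizes come from the preconfigured list A, each item generates only a bounded number of keys, which is consistent with the per-item cost staying roughly constant as the data set grows.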


Related articles

SAHN Clustering in Arbitrary Metric Spaces Using Heuristic Nearest Neighbor Search

Sequential agglomerative hierarchical non-overlapping (SAHN) clustering techniques [10] belong to the classical clustering methods that are applied heavily in many application domains, e.g., in cheminformatics [4]. Asymptotically optimal SAHN clustering algorithms are known for arbitrary dissimilarity measures, but their quadratic time and space complexity even in the best case still limits the...


Exact Subspace Clustering in Linear Time

Subspace clustering is an important unsupervised learning problem with wide applications in computer vision and data analysis. However, the state-of-the-art methods for this problem suffer from high time complexity—quadratic or cubic in n (the number of data instances). In this paper we exploit a data selection algorithm to speedup computation and the robust principal component analysis to stre...


Coreference Clustering using Column Generation

In this paper we describe a novel way of generating an optimal clustering for coreference resolution. Where usually heuristics are used to generate a document-level clustering, based on the output of local pairwise classifiers, we propose a method that calculates an exact solution. We cast the clustering problem as an Integer Linear Programming (ILP) problem, and solve this by using a column ge...


Graph Clustering with Surprise: Complexity and Exact Solutions

Clustering graphs based on a comparison of the number of links within clusters and the expected value of this quantity in a random graph has gained a lot of attention and popularity in the last decade. Recently, Aldecoa and Marín proposed a related, but slightly different approach leading to the quality measure surprise, and reported good behavior in the context of synthetic and real world benc...


Experimental Evaluation of Algorithmic Effort Estimation Models using Projects Clustering

One of the most important aspects of software project management is the estimation of the cost and time required for running an information system. Therefore, software managers try to carry out estimation based on behavior, properties, and project restrictions. Software cost estimation refers to the process of predicting the development requirements of a software system. Various kinds of effort estimation patter...


The Exact Solution of Min-Time Optimal Control Problem in Constrained LTI Systems: A State Transition Matrix Approach

In this paper, the min-time optimal control problem is mainly investigated in the linear time invariant (LTI) continuous-time control system with a constrained input. A high order dynamical LTI system is firstly considered for this purpose. Then the Pontryagin principle and some necessary optimality conditions have been simultaneously used to solve the optimal control problem. These optimality ...



Journal:
  • CoRR

Volume: abs/1702.05425

Publication date: 2017