K-means clustering

A Bad Instance for k-Means++

Journal: Theor. Comput. Sci. 2011

Tobias Brunsch Heiko Röglin

k-means++ is a seeding technique for the k-means method with an expected approximation ratio of O(log k), where k denotes the number of clusters. Examples are known on which the expected approximation ratio of k-means++ is Ω(log k), showing that the upper bound is asymptotically tight. However, it remained open whether k-means++ yields an O(1)-approximation with probability 1/poly(k) or even wi...


Persistent K-Means: Stable Data Clustering Algorithm Based on K-Means Algorithm

Journal: Journal of Computer and Robotics

Rasool Azimi (Faculty of Computer and Information Technology Engineering, Qazvin Branch, Islamic Azad University, Qazvin, Iran); Hedieh Sajedi (Department of Computer Science, College of Science, University of Tehran, Tehran, Iran)

Identifying clusters, or clustering, is an important aspect of data analysis. It is the task of grouping a set of objects in such a way that objects in the same group/cluster are more similar to each other in some sense or another. It is a main task of exploratory data mining and a common technique for statistical data analysis. This paper proposes an improved version of the k-means algorithm, namely persistent k...


An efficient approximation to the K-means clustering for massive data

Journal: Knowl.-Based Syst. 2017

Marco Capó Aritz Pérez Martínez José Antonio Lozano

Due to the progressive growth of the amount of data available in a wide variety of scientific fields, it has become more difficult to manipulate and analyze such information. In spite of its dependency on the initial settings and the large number of distance computations that it can require to converge, the K-means algorithm remains one of the most popular clustering methods for massive data...


Distributed and Provably Good Seedings for k-Means in Constant Rounds

2017

Olivier Bachem Mario Lucic Andreas Krause

The k-means++ algorithm is the state-of-the-art algorithm for solving k-means clustering problems, as the computed clusterings are O(log k)-competitive in expectation. However, its seeding step requires k inherently sequential passes through the full data set, making it hard to scale to massive data sets. The standard remedy is to use the k-means‖ algorithm, which reduces the number of sequential rou...


Validity Index for Fuzzy K-Means Clustering Using the Gap Statistic Method

2005

Chinatsu Arima Kazumi Hakamada Masahiro Okamoto Taizo Hanai


Development of the Clustering Algorithms Ensemble Based on Varying Distances Metrics (Разработка ансамбля алгоритмов кластеризации на основе изменяющихся метрик расстояний)

2016

Pyotr Bochkaryov Vasiliy Kireev

Large volumes of data are currently being actively accumulated in various information environments, such as social, corporate, scientific, and others. The intensive use of big data in various fields stimulates increased interest among researchers in developing methods and tools for processing and analyzing massive data of enormous volume and considerable variety. One of the...


Retrospective Analysis of Software Projects using k-Means Clustering

Journal: Softwaretechnik-Trends 2010

Steffen Herbold Jens Grabowski Helmut Neukirchen Stephan Waack

Software projects are usually analyzed by experts based on their previous experience, their intuition and data they gather about the project. In this work, we show an approach for a purely data-driven retrospective project analysis. We plan to build on this work to make predictions about the evolution of software projects.


Obtaining Findings on Software Usage Using Text Mining (Metin Madenciliği Kullanılarak Yazılım Kullanımına Dair Bulguların Elde Edilmesi)

2015

Deniz Kilinç Fatma Bozyigit Akin Özçift Fatih Yücalar Emin Borandag

Abstract. Software technologies are advancing rapidly and, in parallel, the number of software projects carried out in both the public and private sectors is increasing. One of the most important outputs of software automation projects is undoubtedly the data they produce. Processing this high-dimensional, hard-to-understand data and transforming it into more meaningful and actionable data is an important...


ISCAS at Subtopic Mining Task in NTCIR9

2011

Xue Jiang Xianpei Han Le Sun

In this paper, we describe our work on the subtopic mining subtask of NTCIR-9 in simplified Chinese. To find possible subtopics of a specific query, we select related queries recorded in the query log, titles of search results provided by Google and Baidu, or the catalog of the corresponding entry in Baidu encyclopedia that are lexically similar to the original query; then we apply k-means algorith...


Identification Of Diverse Database Subsets Using Property-Based And Fragment-Based Molecular Descriptions

2008

Mark Ashton John Barnard Florence Casset Michael Charlton Geoffrey Downs Dominique Gorse John Holliday Roger Lahana Peter Willett

This paper reports a comparison of calculated molecular properties and of 2D fragment bit-strings when used for the selection of structurally diverse subsets of a file of 44295 compounds. MaxMin dissimilarity-based selection and k-means cluster-based selection are used to select subsets containing between 1% and 20% of the file. Investigation of the numbers of bioactive molecules in the selected...
