Efficient Big Text Data Clustering Algorithms using Hadoop and Spark

Abstract

Document clustering is a traditional, efficient, and yet quite effective text mining technique when we need better insight into the documents of a collection that could be grouped together. The K-Means algorithm and the Hierarchical Agglomerative Clustering (HAC) algorithm are two of the best-known and most commonly used clustering algorithms; the former due to its low time cost and the latter due to its accuracy. However, even the use of K-Means in text clustering over large-scale collections can lead to unacceptable time costs. In this paper we first address some of the most valuable approaches for document clustering over such 'big data' (large-scale) collections. We then present two very promising alternatives: (a) a variation of an existing K-Means-based fast clustering technique (known as BigKClustering - BKC) so that it can be applied to document clustering, and (b) a hybrid clustering approach based on a customized version of the Buckshot algorithm, which first applies a hierarchical clustering procedure to a sample of the input dataset and then uses the results as the initial centers for a K-Means-based assignment of the rest of the documents, with very few iterations. We also give highly efficient adaptations of the proposed techniques in the MapReduce model, experimentally tested using Apache Hadoop and Spark over a real cluster environment. As it comes out of the experiments, they both lead to acceptable clustering quality as well as significant time improvements (compared to K-Means, especially the Buckshot-based algorithm), thus constituting very promising alternatives for big document collections.
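The hybrid idea in (b) can be illustrated with a minimal single-machine sketch of the classic Buckshot scheme. This is not the authors' customized MapReduce version: the sample size of about sqrt(k*n), the centroid-linkage merge rule, the Euclidean distance on dense vectors, and all function names (buckshot, hac_centroids, assign) are assumptions made only for illustration. Real document collections would more likely use sparse TF-IDF vectors and cosine similarity.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def hac_centroids(sample, k):
    """Naive centroid-linkage agglomerative clustering (assumed linkage):
    merge the two closest clusters until k remain, return their centroids."""
    clusters = [[p] for p in sample]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = sq_dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))  # j > i, so index i stays valid
    return [centroid(c) for c in clusters]


def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
            for p in points]


def buckshot(points, k, iterations=2, seed=0):
    """HAC on a random sample seeds a short K-Means pass over all points."""
    rng = random.Random(seed)
    n = len(points)
    # Classic Buckshot sample size (an assumption here): about sqrt(k * n).
    size = min(n, max(k, math.ceil(math.sqrt(k * n))))
    centers = hac_centroids(rng.sample(points, size), k)
    labels = assign(points, centers)
    for _ in range(iterations):
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = centroid(members)
        labels = assign(points, centers)
    return labels, centers


if __name__ == "__main__":
    docs = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
            [10.0, 10.0], [10.0, 11.0], [11.0, 10.0], [11.0, 11.0]]
    labels, centers = buckshot(docs, k=2, iterations=5)
    print(labels, centers)
```

The quadratic HAC step runs only on the sample, so its cost grows roughly linearly with the collection size, while the K-Means pass starts from informed centers and therefore needs only a few iterations.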

Similar resources

Big Data using Hadoop

“Big Data” is data that becomes large enough that it cannot be processed using conventional methods. The term Big Data refers to huge-volume, complex, and rapidly growing data sets with multiple, independent sources. Due to the fast development of networking, data storage, and data collection capacity, the concept of big data is now rapidly expanding in all science and engineering domains in...

Big Text Data Clustering using Class Labels and Semantic Feature Based on Hadoop of Cloud Computing

Class labels for clustering can be generated automatically, but they are of much lower quality than labels specified by humans. If the class labels for clustering are provided, the clustering is more effective. In classic document clustering based on the vector model, documents are represented by term frequencies without considering the semantic information of each document. The property of vector model may be incorrect...

Big Data Using Hadoop

17ANSP-BD-001 Hadoop Performance Modeling for Job Estimation and Resource Provisioning. MapReduce has become a major computing model for data-intensive applications. Hadoop, an open source implementation of MapReduce, has been adopted by an increasingly growing user community. Cloud computing service providers such as Amazon EC2 Cloud offer the opportunities for Hadoop users to lease a certain amou...

Hadoop Based Big Data Clustering using Genetic & K-Means Algorithm

This is the era of huge data sets, or so-called Big Data. Clustering of Big Data plays several important roles in Big Data analytics. In this paper, we introduce a Big Data clustering algorithm that combines the Genetic and K-Means algorithms using the Hadoop framework. The major aim of this hybrid algorithm is to make the clustering process faster and also raise the accuracy of the resultant clus...

High Performance clustering for Big Data Mining using Hadoop

Nowadays, organizations across the public and private sectors have made a deliberate decision to turn big data into competitive advantage. The motivation and challenge of extracting value from big data is similar in many ways to the age-old problem of distilling business intelligence from transactional data. Hadoop is a rapidly growing ecosystem of components based on the big data MapReduce algorithm a...

Journal

Journal title: International Journal of Computer Applications

Year: 2021

ISSN: 0975-8887

DOI: https://doi.org/10.5120/ijca2021921030