Map Reduce Text Clustering Using Vector Space Model

Authors

  • R. C. Saritha
  • Usha Rani
Abstract

Information retrieval is the task of finding particular web pages through a query to an internet search engine. Although traditional techniques use sophisticated algorithms and data structures to build indexes for efficiently organizing and retrieving information, data mining techniques such as clustering are now used to make the retrieval process more efficient. Because most of the data on the internet is unstructured, text clustering becomes a mandatory step for search engines, which group similar text documents for faster information retrieval. To accommodate elastically growing stores of unstructured data, Hadoop was created to store and process data in a parallel and distributed environment. The well-known traditional approach to clustering text documents is the vector space model combined with the k-means algorithm. This paper presents a map reduce approach to clustering documents using the vector space model. The experimental study shows that this approach is efficient as both the text corpus and the number of nodes in the cluster increase.

Keywords: vector space model, map reduce, text clustering, map reduce k-means, Hadoop
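The abstract does not show the authors' implementation, so the following is only a minimal single-process sketch of the general idea: documents become TF-IDF vectors (the vector space model), a map step assigns each vector to its nearest centroid, and a reduce step averages each group into a new centroid. The TF-IDF weighting, the cosine similarity measure, and all function names (`tfidf_vectors`, `map_assign`, `reduce_mean`, `kmeans_iteration`) are assumptions made for illustration; a real Hadoop job would distribute the map and reduce calls across nodes.

```python
import math
from collections import Counter, defaultdict


def tfidf_vectors(docs):
    """Turn tokenized documents into sparse TF-IDF vectors (term -> weight)."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def map_assign(vector, centroids):
    """Map step: emit (index of the most similar centroid, vector)."""
    best = max(range(len(centroids)), key=lambda i: cosine(vector, centroids[i]))
    return best, vector


def reduce_mean(vectors):
    """Reduce step: average all vectors that share one centroid key."""
    total = defaultdict(float)
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}


def kmeans_iteration(vectors, centroids):
    """One k-means pass: map every vector, group by key, reduce each group."""
    groups = defaultdict(list)  # stands in for the shuffle phase
    assignments = []
    for v in vectors:
        key, value = map_assign(v, centroids)
        groups[key].append(value)
        assignments.append(key)
    new_centroids = [
        reduce_mean(groups[i]) if i in groups else centroids[i]
        for i in range(len(centroids))
    ]
    return new_centroids, assignments


# Toy corpus: two documents about Hadoop, two about pets.
docs = [
    ["hadoop", "map", "reduce"],
    ["hadoop", "map", "cluster"],
    ["cat", "dog", "pet"],
    ["cat", "dog", "food"],
]
vectors = tfidf_vectors(docs)
centroids, assignments = kmeans_iteration(vectors, [vectors[0], vectors[2]])
print(assignments)
```

In a full run, `kmeans_iteration` would be repeated until the assignments stop changing, with each iteration corresponding to one map reduce job.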

Similar resources

A New Method of Hierarchical Text Clustering Based on Lsa-Hgsom

Text clustering has been recognized as an important component in data mining. Self-Organizing Map (SOM) based models have been found to have certain advantages for clustering sizeable text data. However, existing approaches do not provide an adaptive hierarchical structure within a single model. This paper presents a new method of hierarchical text clustering based on combination ...

Neural Network Based Document Clustering Using WordNet Ontologies

Three novel text vector representation approaches for neural network based document clustering are proposed. The first is the extended significance vector model (ESVM), the second is the hypernym significance vector model (HSVM) and the last is the hybrid vector space model (HyM). ESVM extracts the relationship between words and their preferred classified labels. HSVM exploits a semantic relati...

A Joint Semantic Vector Representation Model for Text Clustering and Classification

Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcomings in capturing the semantic concepts of text motivated researchers to use...

A New Agglomerative Hierarchical Clustering Algorithm Implementation based on the Map Reduce Framework

Text clustering is one of the difficult and active research fields in text mining. Combining the Map Reduce framework and the neuron initialization method of the VPSOM (vector pressing Self-Organizing Model) algorithm, a new text clustering algorithm is presented. It divides the large text vector dataset into data blocks, each of which is then processed in a different distributed data node of Map Red...

Integrating contextual information to enhance SOM-based text document clustering

Exploration of text corpora using self-organizing maps has shown promising results in recent years. Topographic map approaches usually use the original vector space model known from Information Retrieval for text document representation. In this paper I present a two-stage model using features based on sentence categories as an alternative approach which includes contextual information. Algorithmi...

Publication date: 2014