Instant Message Clustering Based on Extended Vector Space Model

Authors

  • Le Wang
  • Yan Jia
  • Weihong Han
Abstract

Instant communication technologies such as Instant Messaging (IM) are now in widespread use. For mass-communication media at this scale, clustering the text content is a practical way to analyze the characteristics of instant messages and to find or track hot social topics. However, a single instant message usually contains few keywords, and those keywords may be only implicit; moreover, a single message cannot describe its conversational context. This makes instant messages very different from general documents and makes common clustering algorithms unsuitable. A novel method called WR-KMeans is proposed, which merges related instant messages into a conversation and enriches the conversation's vector with words that do not appear in the conversation but are closely related to words that do. WR-KMeans then clusters conversations in this extended vector space in the same manner as k-means. Experiments on public datasets show that WR-KMeans outperforms the traditional k-means and bisecting k-means algorithms.
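The pipeline described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how word relatedness is computed or how messages are grouped, so the co-occurrence ratio in `relatedness`, the damping factor `alpha`, the pre-grouped `conversations` list, and the deterministic seeding are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations
import math

def tokenize(text):
    return [w for w in text.lower().split() if w.isalpha()]

def conversation_vectors(conversations):
    """One term-frequency vector per conversation (a list of messages)."""
    return [Counter(w for msg in conv for w in tokenize(msg))
            for conv in conversations]

def relatedness(vectors):
    """Assumed measure: share of conversations containing a that also contain b."""
    df, co = Counter(), Counter()
    for vec in vectors:
        words = sorted(vec)
        df.update(words)
        for a, b in combinations(words, 2):
            co[(a, b)] += 1
            co[(b, a)] += 1
    return {(a, b): n / df[a] for (a, b), n in co.items()}

def extend(vec, rel, alpha=0.5):
    """Add absent words, weighted by relatedness to words that are present."""
    extended = dict(vec)
    for (a, b), r in rel.items():
        if a in vec and b not in vec:
            extended[b] = extended.get(b, 0.0) + alpha * r * vec[a]
    return extended

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def kmeans(vectors, k, iterations=10):
    """Plain k-means with cosine similarity, seeded with the first k vectors."""
    centroids = [dict(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                  for v in vectors]
        for c in range(k):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:
                mean = Counter()
                for m in members:
                    for t, w in m.items():
                        mean[t] += w / len(members)
                centroids[c] = dict(mean)
    return labels

# Toy data: each inner list is one conversation (grouping assumed done upstream).
conversations = [
    ["football match tonight", "great goal"],
    ["new phone camera", "battery life"],
    ["football final", "late goal"],
    ["phone battery", "camera zoom"],
]
vectors = conversation_vectors(conversations)
rel = relatedness(vectors)
extended = [extend(v, rel) for v in vectors]
labels = kmeans(extended, k=2)
print(labels)
```

In this toy data the first conversation never mentions "final", yet its extended vector gains a weight for it because "final" co-occurs with "football" and "goal" elsewhere. That is the effect the abstract attributes to enrichment: short conversations on the same topic end up closer together than their raw term vectors would suggest.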

Similar resources

Double Clustering in Latent Semantic Indexing

Document clustering is a widely researched area of information retrieval. The large number of documents that must be handled calls for automatic organization. A popular approach to clustering documents and messages is the vector space model, which represents texts with feature vectors, usually generated from the set of terms contained in the message. The clustering based on the document-term frequen...

Toward Public Opinions Detection: Measuring the Similarity between Instant Messages

Text clustering can be adopted to detect the public opinions in instant messages, which offer the greatest potential for social applications. However, existing approaches for text clustering perform poorly when the instant messages contain many fresh words and sparse key words. To improve the efficiency of the text clustering, this paper proposes a new similarity measure model called PoDeM. In ...

A Joint Semantic Vector Representation Model for Text Clustering and Classification

Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researchers to use...

Clustering blog entries based on the hybrid document model enhanced by the extended anchor texts and co-referencing links

In this paper, we propose a document vector space model where weights of noun terms vary depending on positions within the texts of blog entries as search results. We extend “extended anchor texts” (i.e., extra texts surrounding anchor texts) with the exponential potential such that the weight of a noun term decreases exponentially as the distance between the term and link increases. In order t...

A Novel Design of Instant Messaging Service Extended from Short Message Service with XMPP

Instant messaging (IM) services have grown dramatically in recent years. In telephone networks, short message service continues to dominate mobile data services. With the rise of nomadic needs, bridging the two messaging services is becoming a promising niche. In this paper we propose an infrastructure, based on the XML-based protocol, XMPP, to simplify the interconnections and enable the legacy handsets...

Publication date: 2007