Is the contextual information relevant in text clustering by compression?
Abstract
Usually, when analyzing data that have not yet been processed or filtered, one observes that not all the data are equally important. It is therefore common to find relevant data surrounded by non-relevant data. This occurs in textual information because of its intrinsic nature: texts contain words that provide a lot of information about the subject matter, alongside other words with little meaning or relevance. We believe that although in principle the non-relevant words are not as important as the relevant ones, the former constitute the substrate that supports the latter. Since this substrate is the context that surrounds the relevant information, we call it the contextual information. In this paper, we analyze the relevance of the contextual information in textual data, in a clustering by compression scenario. We generate the contextual information by applying a distortion technique previously developed by the authors. One of the main characteristics of this technique is that it maintains the contextual information. We compare this technique with three new distortion techniques that destroy the contextual information in different ways. The experimental results support our hypothesis that the contextual information is relevant, at least in the area of text clustering by compression. © 2012 Elsevier Ltd. All rights reserved.
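As a rough illustration of what a clustering by compression scenario involves, the sketch below computes a compression-based distance between texts. The abstract does not name the compressor or the distance measure, so the use of `zlib` and of the normalized compression distance (NCD) here is an assumption, and the helper names `compressed_size`, `ncd`, and `distance_matrix` are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (compressor choice is an assumption)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


def distance_matrix(texts: list[str]) -> list[list[float]]:
    """Pairwise NCD values; the diagonal is set to 0.0 by convention."""
    n = len(texts)
    return [
        [0.0 if i == j else ncd(texts[i], texts[j]) for j in range(n)]
        for i in range(n)
    ]
```

A matrix like this could then be passed to any clustering method that accepts pairwise distances. Distorting the texts before computing the distances, as the abstract describes, would be a separate preprocessing step that is not sketched here.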
Similar resources
Relevance of Contextual Information in Compression-Based Text Clustering
Despite the wide use of compression distances in knowledge discovery and data mining, little has been done to interpret their results or to explain their behavior. In this paper we take a step towards understanding compression distances by analyzing the relevance of contextual information in compression-based text clustering. In order to do so, two kinds of word removal are explored, one that m...
The Impact of Contextual Clue Selection on Inference
Linguistic information can be conveyed in the form of speech and written text, but it is the content of the message that is ultimately essential for higher-level processes in language comprehension, such as making inferences and associations between text information and knowledge about the world. Linguistically, inference is the shovel that allows receivers to dig meaning out from the text with...
A Joint Semantic Vector Representation Model for Text Clustering and Classification
Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researchers to use...
A survey on Automatic Text Summarization
Text summarization endeavors to produce a summary version of a text while maintaining the original ideas. The textual content on the web, in particular, is growing at an exponential rate. The ability to sift through such a massive amount of data in order to extract the useful information is a major undertaking and requires an automatic mechanism to aid with the extant repository of informa...
MEDICAL IMAGE COMPRESSION: A REVIEW
In recent years the use of medical images for diagnostic purposes has become a necessity. Limits on transmission and storage space, together with the growing size of medical images, have created the need for efficient methods; image compression is required as an efficient way to reduce irrelevance and redundancy in the image data so that the data can be stored or transmitted. It also reduces...
Journal title:
- Expert Syst. Appl.
Volume 39
Publication date: 2012