TOSOM: A Topic-Oriented Self-Organizing Map for Text Organization
Authors
Abstract
The self-organizing map (SOM) is a well-known neural network model with a wide range of applications. Its two main characteristics are dimension reduction and topology preservation: a SOM maps a high-dimensional data space to a low-dimensional space while preserving the topological relations among the data. Because of these characteristics, the SOM has usually been applied to data clustering and visualization tasks. Its main disadvantage, however, is that the number and structure of neurons must be known prior to training, and both are difficult to determine. Several schemes have been proposed to address this deficiency, for example the growing (expandable) SOM, the hierarchical SOM, and the growing hierarchical SOM. These schemes can dynamically expand the map, and even generate hierarchical maps, during training, and encouraging results have been reported. Basically, these schemes adapt the size and structure of the map according to the distribution of the training data; that is, they are data-driven, or data-oriented, SOM schemes. In this work, a topic-oriented SOM scheme suitable for document clustering and organization is developed. The proposed SOM automatically adapts the number of neurons as well as the structure of the map according to the identified topics. Unlike data-oriented SOMs, our approach expands the map and generates the hierarchies according to both the topics and the characteristics of the neurons. Preliminary experiments give promising results and demonstrate the plausibility of the method.
Keywords: Self-Organizing Map, Topic Identification, Learning Algorithm, Text Clustering.
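For readers unfamiliar with the baseline, the following is a minimal sketch of a conventional fixed-grid SOM, not the topic-oriented scheme proposed in the paper. It illustrates the two properties named in the abstract (mapping high-dimensional vectors onto a low-dimensional grid, with neighbouring neurons pulled toward similar inputs) and the limitation that `rows` and `cols` must be chosen before training. The function names, the linear decay schedules, and the toy term-weight vectors are all assumptions made for illustration.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid position whose weight vector is closest to x."""
    return min(
        weights,
        key=lambda pos: sum((wi - xi) ** 2 for wi, xi in zip(weights[pos], x)),
    )


def train_som(data, rows, cols, epochs=50, lr0=0.5, sigma0=None, seed=0):
    """Train a fixed-size SOM; returns weights keyed by (row, col)."""
    rng = random.Random(seed)
    dim = len(data[0])
    sigma0 = sigma0 or max(rows, cols) / 2.0
    # The map size is fixed here, before any data is seen.
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        samples = list(data)
        rng.shuffle(samples)
        for x in samples:
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                  # assumed linear decay
            sigma = max(sigma0 * (1.0 - frac), 0.5)  # assumed linear decay with a floor
            bmu = best_matching_unit(weights, x)
            for pos, w in weights.items():
                grid_d2 = (pos[0] - bmu[0]) ** 2 + (pos[1] - bmu[1]) ** 2
                # Gaussian neighbourhood: neurons near the BMU on the grid move more.
                h = math.exp(-grid_d2 / (2.0 * sigma * sigma))
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
            step += 1
    return weights


if __name__ == "__main__":
    # Toy 4-dimensional "document" vectors (hypothetical term weights).
    docs = [
        [1.0, 0.9, 0.0, 0.1],
        [0.9, 1.0, 0.1, 0.0],
        [0.0, 0.1, 1.0, 0.9],
        [0.1, 0.0, 0.9, 1.0],
    ]
    som = train_som(docs, rows=2, cols=3)
    for d in docs:
        print(d, "->", best_matching_unit(som, d))
```

Each document is assigned to the grid cell of its best-matching unit, so the 4-dimensional vectors end up on a 2-dimensional map. The growing and hierarchical variants mentioned in the abstract differ in that they add neurons or child maps during training rather than fixing the grid in advance.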
Related articles
Document Clustering and Visualization with Latent Dirichlet Allocation and Self-Organizing Maps
Clustering and visualization of large text document collections aids in browsing, navigation, and information retrieval. We present a document clustering and visualization method based on Latent Dirichlet Allocation and self-organizing maps (LDA-SOM). LDA-SOM clusters documents based on topical content and renders clusters in an intuitive two-dimensional format. Document topics are inferred usin...
Text Classification and Labelling of Document Clusters with Self-Organising Maps
The freely available law on the Internet could be one of the best application areas of text classification and labelling. This paper explores the high potential of the self-organising map for information reconnaissance by classifying and describing unknown legal text collections. The maps can be seen as topic-oriented libraries that are automatically created without intellectual input. The clus...
Word Category Maps based on Emergent Features Created by ICA
In this paper, we assume that word co-occurrence statistics can be used to extract meaningful features, exhibiting syntactic and semantic behavior, from text data. Independent component analysis (ICA), an unsupervised statistical method, is applied to word usage statistics, calculated from a natural language corpus, to extract a number of features. With a self-organizing map (SOM), we will dem...
Language segmentation for Optical Character Recognition using Self Organizing Maps
Modern optical character recognition (OCR) systems perform optimally on single-font monolingual texts, and have lower performance on bilingual and multilingual texts. For many OCR tasks it is necessary to accurately recognize characters from bilingual texts such as dictionaries or grammar books. We present a novel approach to segmenting bilingual text, easily extensible to more than two languag...
Self-Organizing-Map-Based Metamodeling for Massive Text Data Exploration
In this study, we describe the use of the self-organizing map (SOM) as a metamodeling technique to design a parallel text data exploration system. First, the large textual collections are divided into various small data subsets. Based on the different subsets, different unitary SOM models, i.e., base models, are then trained to produce word clustering maps. In this phase, different SOM models are imp...