Employing Latent Dirichlet Allocation Model for Topic Extraction of Chinese Text
Abstract
A hidden topic model for Chinese text, whose semantics are complicated, is urgently needed, since China has taken on an increasingly significant role amid the rapid globalization of recent years. This paper details the basic process of extracting latent Chinese topics by demonstrating a Chinese topic extraction schema based on the Latent Dirichlet Allocation (LDA) model. The schema was then applied to CCL, an authoritative Chinese corpus, to extract topics for its nine categories. Empirical analysis shows that LDA achieves a considerably higher average precision rate than the three other comparable Chinese topic extraction techniques; however, its average recall rate is worse than that of KNN and almost the same as that of the PLSI model. Moreover, both the recall rate and the precision rate of LDA-CH are worse than those of LDA-EH. Therefore, the LDA model should be adapted to the distinctive features of Chinese words to make it better suited to Chinese topic extraction.
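The abstract does not spell out the extraction schema, so the following is only a minimal sketch of how LDA topic extraction over already word-segmented Chinese text might look. It uses a generic collapsed Gibbs sampler, which is an assumption and not necessarily the inference method used in the paper. The function names `lda_gibbs` and `top_words`, the hyperparameter values, and the toy documents are all hypothetical and chosen for illustration.

```python
import random


def lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA (illustrative, not the paper's code).

    docs: list of token lists; Chinese text is assumed to be word-segmented
    beforehand. Returns (vocab, topic-word counts, document-topic counts).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    n_dk = [[0] * num_topics for _ in docs]               # document-topic counts
    n_kw = [[0] * vocab_size for _ in range(num_topics)]  # topic-word counts
    n_k = [0] * num_topics                                # tokens per topic
    z = []                                                # topic of every token

    # Start from a random topic assignment for each token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][word_id[w]] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = z[d][i]
                # Take the token out of the counts before resampling it.
                n_dk[d][k] -= 1
                n_kw[k][v] -= 1
                n_k[k] -= 1
                # Unnormalised conditional probability of each topic.
                weights = [
                    (n_dk[d][t] + alpha)
                    * (n_kw[t][v] + beta) / (n_k[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][v] += 1
                n_k[k] += 1
    return vocab, n_kw, n_dk


def top_words(vocab, n_kw, n=3):
    """Return the n most frequent words assigned to each topic."""
    ranked = []
    for row in n_kw:
        order = sorted(range(len(vocab)), key=lambda v: -row[v])
        ranked.append([vocab[v] for v in order[:n]])
    return ranked


# Toy, hand-segmented documents (made up for this sketch, not from CCL).
docs = [
    ["股票", "市场", "银行", "股票"],
    ["银行", "贷款", "市场"],
    ["足球", "比赛", "球队", "足球"],
    ["球队", "比赛", "教练"],
]
vocab, n_kw, n_dk = lda_gibbs(docs, num_topics=2)
for topic, words in enumerate(top_words(vocab, n_kw)):
    print(topic, words)
```

Because sampling is random, the printed topics depend on the seed and on the number of iterations; on a real corpus the quality of the word segmentation would also affect the result, which is in line with the abstract's remark that LDA needs adapting to the features of Chinese words.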
Related resources
Automatic keyword extraction using Latent Dirichlet Allocation topic modeling: Similarity with golden standard and users' evaluation
Purpose: This study investigates the automatic keyword extraction from the table of contents of Persian e-books in the field of science using LDA topic modeling, evaluating their similarity with golden standard, and users' viewpoints of the model keywords. Methodology: This is a mixed text-mining research in which LDA topic modeling is used to extract keywords from the table of contents of sci...
Chinese Short-Text Classification Based on Topic Model with High-Frequency Feature Expansion
Short text differs from traditional documents in its shortness and sparseness. Feature extension can ease the problem of high sparseness in the vector space model, but it inevitably introduces noise. To resolve this problem, this paper proposes a high-frequency feature expansion method based on a latent Dirichlet allocation (LDA) topic model. High-frequency features are extracted from each cate...
A Joint Semantic Vector Representation Model for Text Clustering and Classification
Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researchers to use...
Language model adaptation using latent dirichlet allocation and an efficient topic inference algorithm
We present an effort to perform topic mixture-based language model adaptation using latent Dirichlet allocation (LDA). We use probabilistic latent semantic analysis (PLSA) to automatically cluster a heterogeneous training corpus, and train an LDA model using the resultant topic-document assignments. Using this LDA model, we then construct topic-specific corpora at the utterance level for interpol...
Topic Model for Person Identification using Gait Sequence Analysis
Gait sequence analysis from input binary silhouettes has various applications, such as person identification, human action recognition, event recognition and classification. Gait feature extraction is a key step in gait analysis. The 'Topic Model', used for text classification, is one of the potential semantic approaches to studying gait sequence analysis. The proposed algorithm uses Late...