Chinese-mining Preprocessing Technology Based on Text Trait Optimizing
Abstract
With the growth of massive data, retrieving target text quickly has become a technical bottleneck. Research indicates that sentence segmentation is the key step in obtaining Chinese target information. Segmenting English text is relatively simple because spaces separate the words, whereas Chinese segmentation is much more difficult. This paper therefore designs a reciprocal crossing segmentation algorithm and a trait-optimizing vector model to improve the efficiency of mining Chinese information. An improved dictionary-based segmentation algorithm is adopted in text preprocessing, which is built on the vector space model; experiments are run on the segmentation algorithm and the segmentation results are analyzed. The segmentation algorithm proves to be very effective for text mining with the text trait vector model.
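The abstract does not spell out the reciprocal crossing segmentation algorithm, so the following is only a minimal sketch of a common dictionary-based approach, bidirectional maximum matching, which may or may not match the paper's method. The function names, the `max_len` parameter, the tie-breaking heuristic, and the tiny `DEMO_DICT` are all assumptions made for illustration.

```python
# Hypothetical toy dictionary; a real system would load a large lexicon.
DEMO_DICT = {"研究", "研究生", "生命", "起源"}


def forward_max_match(text, dictionary, max_len=4):
    """Scan left to right, always taking the longest dictionary word."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def backward_max_match(text, dictionary, max_len=4):
    """Scan right to left, always taking the longest dictionary word."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                tokens.insert(0, piece)
                j -= size
                break
    return tokens


def bidirectional_segment(text, dictionary, max_len=4):
    """Run both scans and keep the result that looks less fragmented.

    Assumed heuristic: prefer fewer tokens, then fewer single-character
    tokens, and fall back to the backward result on a tie.
    """
    fwd = forward_max_match(text, dictionary, max_len)
    bwd = backward_max_match(text, dictionary, max_len)
    if len(fwd) != len(bwd):
        return fwd if len(fwd) < len(bwd) else bwd

    def singles(tokens):
        return sum(1 for t in tokens if len(t) == 1)

    return fwd if singles(fwd) < singles(bwd) else bwd


if __name__ == "__main__":
    print(forward_max_match("研究生命起源", DEMO_DICT))
    print(bidirectional_segment("研究生命起源", DEMO_DICT))
```

On the sample string, the forward scan greedily takes 研究生 and strands 命 as a single character, while the backward scan yields 研究 / 生命 / 起源. Comparing the two directions is what lets a dictionary method recover from this kind of ambiguity.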
Similar resources
Chinese Novelty Mining
Automated mining of novel documents or sentences from chronologically ordered documents or sentences is an open challenge in text mining. In this paper, we describe the preprocessing techniques for detecting novel Chinese text and discuss the influence of different Part of Speech (POS) filtering rules on the detection performance. Experimental results on APWSJ and TREC 2004 Novelty Track data s...
Document Analysis And Classification Based On Passing Window
In this paper we present Document analysis and classification system to segment and classify contents of Arabic document images. This system includes preprocessing, document segmentation, feature extraction and document classification. A document image is enhanced in the preprocessing by removing noise, binarization, and detecting and correcting image skew. In document segmentation, an algorith...
A Statistical Text Mining Method for Patent Analysis
Most text data from diverse document databases are unsuitable for analytical methods based on statistics and machine learning algorithms. Patent documents are also compiled into text datasets. Similar to other document datasets, we therefore need to transform patent documents into structured data for a statistical analysis. This transformation is performed using the preprocessing of text mining...
Orthogonal Processing for Measuring the Tonality of Egyptian Microblogs
Subjectivity and Sentiment Analysis (SSA) research in Arabic is still in its beginning phases regarding the research done in English on different granularities (sentence and document levels). In this paper, a simple system is proposed to perform sentiment analysis (or polarity detection) using an aggressive stemmer in the preprocessing phase followed by a Fuzzy classifier. The main focus of thi...
Sentiment Analisis on Web-based Reviews using Data Mining and Support Vector Machine
This work aims to use sentiment analysis techniques, data mining, text mining and natural language processing to indicate the polarity of texts using support vector machine. Weka software and a movie review database from Internet Movie Database IMDb were used. This work uses preprocessing filters and WRAPPER techniques and Support Vector Machine (SVM) for classification. It presents better resu...