Chinese-mining Preprocessing Technology Based on Text Trait Optimizing

Author

  • XIAOYONG WANG
Abstract

With the growth of massive data, retrieving target text quickly has become a technical bottleneck. Research indicates that sentence segmentation is the key step in extracting target information from Chinese text. Segmenting English text is relatively simple because words are separated by spaces, whereas Chinese segmentation is much more difficult. This paper therefore designs a reciprocal crossing segmentation algorithm and a trait-optimizing vector model to improve the efficiency of Chinese information mining. An improved dictionary-based segmentation algorithm, built on the vector space model, is adopted in text preprocessing; experiments are run on the segmentation algorithm and the segmentation results are analyzed. The segmentation algorithm proves to be very effective for text mining with the text trait vector model.
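The abstract does not define the reciprocal crossing segmentation algorithm, so the following is only a minimal sketch of one widely used dictionary-based technique, bidirectional maximum matching, which may differ from the paper's method. It scans the sentence forward and backward, taking the longest dictionary word at each step, and keeps the result with fewer tokens (on a tie, the one with fewer single-character tokens). The function names and the toy dictionary are hypothetical and chosen for illustration.

```python
# Toy dictionary (an assumption for illustration, not taken from the paper).
DICTIONARY = {"研究", "研究生", "生命", "的", "起源"}


def forward_max_match(text, dictionary, max_len):
    """Scan left to right, taking the longest dictionary word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            word = text[i:i + size]
            if size == 1 or word in dictionary:
                tokens.append(word)
                i += size
                break
    return tokens


def backward_max_match(text, dictionary, max_len):
    """Scan right to left, taking the longest dictionary word ending at each position."""
    tokens = []
    j = len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            word = text[j - size:j]
            if size == 1 or word in dictionary:
                tokens.append(word)
                j -= size
                break
    tokens.reverse()  # tokens were collected from the end of the sentence
    return tokens


def count_single_chars(tokens):
    """Count tokens that are a single character long."""
    return sum(1 for token in tokens if len(token) == 1)


def bidirectional_max_match(text, dictionary):
    """Run both scans and keep the segmentation that looks less fragmented."""
    max_len = max((len(word) for word in dictionary), default=1)
    forward = forward_max_match(text, dictionary, max_len)
    backward = backward_max_match(text, dictionary, max_len)
    if len(forward) != len(backward):
        return forward if len(forward) < len(backward) else backward
    # Same token count: prefer fewer single-character tokens, else the backward scan.
    if count_single_chars(forward) < count_single_chars(backward):
        return forward
    return backward
```

On the sentence 研究生命的起源, the forward scan greedily takes 研究生 and leaves 命 stranded as a single character, while the backward scan recovers 研究 / 生命 / 的 / 起源, which is why comparing the two directions helps with this kind of ambiguity.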


Similar resources

Chinese Novelty Mining

Automated mining of novel documents or sentences from chronologically ordered documents or sentences is an open challenge in text mining. In this paper, we describe the preprocessing techniques for detecting novel Chinese text and discuss the influence of different Part of Speech (POS) filtering rules on the detection performance. Experimental results on APWSJ and TREC 2004 Novelty Track data s...


Document Analysis And Classification Based On Passing Window

In this paper we present Document analysis and classification system to segment and classify contents of Arabic document images. This system includes preprocessing, document segmentation, feature extraction and document classification. A document image is enhanced in the preprocessing by removing noise, binarization, and detecting and correcting image skew. In document segmentation, an algorith...


A Statistical Text Mining Method for Patent Analysis

Most text data from diverse document databases are unsuitable for analytical methods based on statistics and machine learning algorithms. Patent documents are also compiled into text datasets. Similar to other document datasets, we therefore need to transform patent documents into structured data for a statistical analysis. This transformation is performed using the preprocessing of text mining...


Orthogonal Processing for Measuring the Tonality of Egyptian Microblogs

Subjectivity and Sentiment Analysis (SSA) research in Arabic is still in its beginning phases regarding the research done in English on different granularities (sentence and document levels). In this paper, a simple system is proposed to perform sentiment analysis (or polarity detection) using an aggressive stemmer in the preprocessing phase followed by a Fuzzy classifier. The main focus of thi...


Sentiment Analysis on Web-based Reviews using Data Mining and Support Vector Machine

This work aims to use sentiment analysis techniques, data mining, text mining and natural language processing to indicate the polarity of texts using support vector machine. Weka software and a movie review database from Internet Movie Database IMDb were used. This work uses preprocessing filters and WRAPPER techniques and Support Vector Machine (SVM) for classification. It presents better resu...




Publication date: 2013