The Recognition Method of Unknown Chinese Words in Fragments Based on Mutual Information

Authors

  • Qian Zhu
  • Xian-Yi Cheng
  • Zi-juan Gao
Abstract

This paper presents a method that uses mutual information to improve the recognition of unknown Chinese words. It addresses two problems in the method of [7] that reduce the efficiency of unknown-word recognition: the complexity of weight settings, and the growing number of garbage strings produced by omni-segmentation of fragments. The method proceeds as follows: first, the text is segmented; next, the fragments obtained in the first step are segmented to generate a temporary dictionary; then rules and frequency information are used to calculate the mutual information of every string in the temporary dictionary. Finally, a greedy algorithm obtains the longest path through each fragment, thereby extracting the unknown Chinese words in the fragments.
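The abstract does not give the scoring formula or the exact greedy rule, so the following Python sketch is only one plausible reading of the pipeline, not the authors' implementation. It assumes that the temporary dictionary is the set of all substrings of the fragments up to a hypothetical `max_len`, that the mutual information of a string is the minimum pointwise mutual information over its binary splits, and that the greedy step picks the best-scoring candidate at each position. The function names, `max_len`, and `threshold` are all illustrative; the rules mentioned in the abstract are not modelled.

```python
import math
from collections import Counter


def count_substrings(fragments, max_len=4):
    """Build a temporary dictionary: counts of every substring up to max_len."""
    counts = Counter()
    for frag in fragments:
        for i in range(len(frag)):
            for n in range(1, max_len + 1):
                if i + n <= len(frag):
                    counts[frag[i:i + n]] += 1
    return counts


def mutual_information(s, counts, total_chars):
    """Minimum pointwise MI over all binary splits of s (an assumed scoring)."""
    if len(s) < 2 or counts[s] == 0:
        return float("-inf")
    p_s = counts[s] / total_chars
    scores = []
    for k in range(1, len(s)):
        p_left = counts[s[:k]] / total_chars
        p_right = counts[s[k:]] / total_chars
        scores.append(math.log2(p_s / (p_left * p_right)))
    return min(scores)


def greedy_segment(fragment, counts, total_chars, max_len=4, threshold=0.0):
    """At each position, take the candidate whose MI most exceeds the threshold;
    fall back to a single character when no candidate qualifies."""
    words, i = [], 0
    while i < len(fragment):
        best, best_score = fragment[i], threshold
        for n in range(2, max_len + 1):
            cand = fragment[i:i + n]
            if len(cand) < n:
                break
            score = mutual_information(cand, counts, total_chars)
            if score > best_score:
                best, best_score = cand, score
        words.append(best)
        i += len(best)
    return words


# Toy fragments in Latin letters, standing in for unsegmented Chinese fragments.
fragments = ["abab", "cdab"]
counts = count_substrings(fragments)
total_chars = sum(len(f) for f in fragments)
segments = greedy_segment("cdab", counts, total_chars)
```

Multi-character items in `segments` are the candidate unknown words. A locally greedy choice like this is cheap but does not guarantee the globally best path; a dynamic-programming search over the same scores would be the usual alternative.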

Similar resources

A heuristic method based on a statistical approach for Chinese text segmentation

The authors propose a heuristic method for Chinese automatic text segmentation based on a statistical approach. This method is developed based on statistical information about the association among adjacent characters in Chinese text. Mutual information of bi-grams and significant estimation of tri-grams are utilized. A heuristic method with six rules is then proposed to determine the segmentat...

The Identification and Classification of Unknown Words in Chinese An N-Grams-Based Approach

In this paper, we propose a new approach to identify unknown words in Chinese. This approach adopts an n-grams program to sort out the collocating word / character sequences which are possible words and phrases in Chinese. In addition to proposing the criteria for identifying Chinese new words, we also classify these new words according to their structural and semantic characteristics. The cor...

A Novel Subsampling Method for 3D Multimodality Medical Image Registration Based on Mutual Information

Mutual information (MI) is a widely used similarity metric for multimodality image registration. However, it involves an extremely high computational time especially when it is applied to volume images. Moreover, its robustness is affected by existence of local maxima. The multi-resolution pyramid approaches have been proposed to speed up the registration process and increase the accuracy of th...

An Information-Theoretic Discussion of Convolutional Bottleneck Features for Robust Speech Recognition

Convolutional Neural Networks (CNNs) have shown their performance in speech recognition systems for feature extraction and acoustic modeling. In addition, CNNs have been used for robust speech recognition, and competitive results have been reported. A Convolutive Bottleneck Network (CBN) is a kind of CNN which has a bottleneck layer among its fully connected layers. The bottleneck fea...

Extraction Of Chinese Compound Words - An Experimental Study On A Very Large Corpus

This paper introduces a statistical method to extract Chinese compound words from a very large corpus. This method is based on mutual information and context dependency. Experimental results show that this method is efficient and robust compared with other approaches. We also examined the impact of different parameter settings, corpus size and heterogeneousness on the extraction results. ...

Journal title:
  • JCIT

Volume 5, Issue

Pages -

Publication date 2010