A Boundary-Oriented Chinese Segmentation Method Using N-Gram Mutual Information
Abstract
This paper describes our participation in the Chinese word segmentation task of CIPS-SIGHAN 2010. We implemented an n-gram mutual information (NGMI) based segmentation algorithm that mixes features from unsupervised, supervised, and dictionary-based segmentation methods. The algorithm is combined with a simple strategy for out-of-vocabulary (OOV) word recognition. The evaluation results for both open and closed training are encouraging for our system. The results for OOV word recognition in the closed training evaluation, however, were unsatisfactory.
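The boundary-oriented idea can be illustrated with a minimal sketch. This is not the NGMI algorithm from the paper, whose details the abstract does not give. It assumes plain pointwise mutual information over adjacent character pairs (bigrams only) and a hand-picked threshold. The names `train_counts`, `pmi`, `segment`, and `threshold` are hypothetical and chosen for illustration.

```python
import math
from collections import Counter


def train_counts(corpus):
    """Count characters and adjacent character pairs in unsegmented lines."""
    unigrams, bigrams = Counter(), Counter()
    for line in corpus:
        unigrams.update(line)  # one count per character
        bigrams.update(line[i:i + 2] for i in range(len(line) - 1))
    return unigrams, bigrams


def pmi(a, b, unigrams, bigrams):
    """Pointwise mutual information of the adjacent pair (a, b).

    Returns -inf for pairs never seen together (assumption: no smoothing).
    """
    pair = bigrams[a + b]
    if pair == 0 or unigrams[a] == 0 or unigrams[b] == 0:
        return float("-inf")
    p_ab = pair / sum(bigrams.values())
    p_a = unigrams[a] / sum(unigrams.values())
    p_b = unigrams[b] / sum(unigrams.values())
    return math.log2(p_ab / (p_a * p_b))


def segment(text, unigrams, bigrams, threshold=0.0):
    """Place a word boundary between two characters whose PMI is below threshold."""
    if not text:
        return []
    words = [text[0]]
    for prev, cur in zip(text, text[1:]):
        if pmi(prev, cur, unigrams, bigrams) < threshold:
            words.append(cur)   # weak association: start a new word
        else:
            words[-1] += cur    # strong association: extend the current word
    return words
```

For example, after `unigrams, bigrams = train_counts(lines)` on a list of unsegmented strings, `segment(sentence, unigrams, bigrams, threshold=2.0)` returns a list of candidate words. The threshold value is an assumption and would need tuning on held-out data.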
Similar resources
Word Segmentation for Chinese Wikipedia Using N-Gram Mutual Information
In this paper, we propose an unsupervised segmentation approach, named "n-gram mutual information", or NGMI, which is used to segment Chinese documents into n-character words or phrases, using language statistics drawn from the Chinese Wikipedia corpus. The approach alleviates the tremendous effort that is required in preparing and maintaining the manually segmented Chinese text for training pur...
Object-Oriented Method for Automatic Extraction of Road from High Resolution Satellite Images
As the information carried in a high spatial resolution image is not represented by single pixels but by meaningful image objects, which include the association of multiple pixels and their mutual relations, the object based method has become one of the most commonly used strategies for the processing of high resolution imagery. This processing comprises two fundamental and critical steps towar...
Unsupervised Word Segmentation Without Dictionary
This prototype system demonstrates a novel method of word segmentation based on corpus statistics. Since the central technique we used is unsupervised training based on a large corpus, we refer to this approach as unsupervised word segmentation. The unsupervised approach is general in scope and can be applied to both Mandarin Chinese and Taiwanese. In this prototype, we illustrate its use in wo...
A heuristic method based on a statistical approach for Chinese text segmentation
The authors propose a heuristic method for Chinese automatic text segmentation based on a statistical approach. This method is developed based on statistical information about the association among adjacent characters in Chinese text. Mutual information of bi-grams and significant estimation of tri-grams are utilized. A heuristic method with six rules is then proposed to determine the segmentat...
Plant Classification in Images of Natural Scenes Using Segmentations Fusion
This paper presents a novel approach to automatically classifying and identifying tree leaves using image segmentation fusion. With the development of mobile devices and remote access, automatic plant identification in images taken in natural scenes has received much attention. Image segmentation plays a key role in most plant identification methods, especially in complex background images. Wher...