Chinese Word Segmentation in FTRD Beijing
Abstract
This paper presents a word segmentation system developed at France Telecom R&D Beijing, which uses a unified approach to word breaking and OOV identification. The output can be customized to meet different segmentation standards through the application of an ordered list of transformations. The system participated in all the tracks of the segmentation bakeoff (PK-open, PK-closed, AS-open, AS-closed, HK-open, HK-closed, MSR-open and MSR-closed) and achieved state-of-the-art performance in the MSR-open, MSR-closed and PK-open tracks. Analysis of the results shows that each component of the system contributed to the scores.
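The idea of customizing output through an ordered list of transformations can be illustrated with a minimal sketch. The abstract does not specify what the system's transformations look like, so the rule types below (`merge_pair`, `split_token`) and the function `apply_transformations` are hypothetical names chosen for illustration only. The sketch shows just the general mechanism: a base segmentation is rewritten by applying rules one after another, and the order of the rules affects the result.

```python
from typing import Callable, List

Tokens = List[str]
Rule = Callable[[Tokens], Tokens]


def merge_pair(left: str, right: str) -> Rule:
    """Hypothetical rule: join each adjacent (left, right) pair into one token."""
    def rule(tokens: Tokens) -> Tokens:
        out: Tokens = []
        i = 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                out.append(left + right)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        return out
    return rule


def split_token(word: str, at: int) -> Rule:
    """Hypothetical rule: split every occurrence of `word` at character index `at`."""
    def rule(tokens: Tokens) -> Tokens:
        out: Tokens = []
        for token in tokens:
            if token == word:
                out.extend([token[:at], token[at:]])
            else:
                out.append(token)
        return out
    return rule


def apply_transformations(tokens: Tokens, rules: List[Rule]) -> Tokens:
    """Apply the rules in list order; each rule sees the previous rule's output."""
    for rule in rules:
        tokens = rule(tokens)
    return tokens


# Base segmentation keeps the university name as a single word.
base = ["北京大学", "的", "学生"]

# An assumed target standard that prefers the finer-grained split.
fine_grained = apply_transformations(base, [split_token("北京大学", 2)])
print(fine_grained)
```

Because the list is ordered, the same two rules can give different results: merging and then splitting leaves `["北京", "大学"]` unchanged, while splitting and then merging turns it into `["北京大学"]`.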
Similar resources
Incorporating New Words Detection with Chinese Word Segmentation
With developments in Chinese word segmentation, in-vocabulary word segmentation and named entity recognition have achieved state-of-the-art performance. However, new words have become a bottleneck for Chinese word segmentation. This paper presents the results from Beijing Institute of Technology (BIT) in the Sixth International Chinese Word Segmentation Bakeoff in 2010. Firstly, the author reviewed the problem c...
Identification of Chinese Personal Names in Unrestricted Texts
Automatic identification of Chinese personal names in unrestricted texts is a key task in Chinese word segmentation, and can affect other NLP tasks such as word segmentation and information retrieval, if it is not properly addressed. This paper (1) demonstrates the problems of Chinese personal name identification in some IT applications, (2) analyzes the structure of Chinese personal names, and...
Introduction to CKIP Chinese Word Segmentation System for the First International Chinese Word Segmentation Bakeoff
In this paper, we roughly described the procedures of our segmentation system, including the methods for resolving segmentation ambiguities and identifying unknown words. The CKIP group of Academia Sinica participated in testing on open and closed tracks of Beijing University (PK) and Hong Kong Cityu (HK). The evaluation results show our system performs very well in either HK open track or HK c...
A Hybrid Approach to Chinese Word Segmentation around CRFs
In this paper, we present a Chinese word segmentation system which consists of four components, i.e. basic segmentation, named entity recognition, error-driven learner and new word detector. The basic segmentation and named entity recognition, implemented based on conditional random fields, are used to generate initial segmentation results. The other two components are used to refine the re...
Design of CKIP Chinese Word Segmentation System
In this paper, we describe the design of the CKIP Chinese word segmentation system and analyse its performance. The system utilizes a modularized approach. Independent modules were designed to solve the problems of segmentation ambiguities and identifying unknown words. Segmentation ambiguities are resolved by a hybrid method of using heuristic and statistical rules. Regular-type unknown words ar...