Which is More Suitable for Chinese Word Segmentation , the Generative Model or the Discriminative One ? F ∗
نویسندگان
چکیده
Since the traditional word-based n-gram model, a generative approach, cannot handle those out-of-vocabulary (OOV) words in the testing-set, the character-based discriminative approach has been widely adopted recently. However, this discriminative model, though is more robust to OOV words, fails to deliver satisfactory performance for those in-vocabulary (IV) words that have been observed before. Having analyzed the wordbased approach, its capability to handle the dependency between adjacent characters within a word, which is believed that the human adopts for doing segmentation, is found to account for its excellent performance for those IV words. To incorporate the intra-word characters dependency, a character-based approach with a generative model is thus proposed in this paper. The experiments conducted on the second SIGHAN Bakeoffs have shown that the proposed model not only achieves a good balance between those IV words and OOV words, but also outperforms the above-mentioned well-known approaches under the similar conditions.
منابع مشابه
Which is More Suitable for Chinese Word Segmentation, the Generative Model or the Discriminative One?
Since the traditional word-based n-gram model, a generative approach, cannot handle those out-of-vocabulary (OOV) words in the testing-set, the character-based discriminative approach has been widely adopted recently. However, this discriminative model, though is more robust to OOV words, fails to deliver satisfactory performance for those in-vocabulary (IV) words that have been observed before...
متن کاملA Character-Based Joint Model for Chinese Word Segmentation
The character-based tagging approach is a dominant technique for Chinese word segmentation, and both discriminative and generative models can be adopted in that framework. However, generative and discriminative character-based approaches are significantly different and complement each other. A simple joint model combining the character-based generative model and the discriminative one is thus p...
متن کاملIntegrating Surface and Abstract Features for Robust Cross-Domain Chinese Word Segmentation
Current character-based approaches are not robust for cross domain Chinese word segmentation. In this paper, we alleviate this problem by deriving a novel enhanced character-based generative model with a new abstract aggregate candidate-feature, which indicates if the given candidate prefers the corresponding position-tag of the longest dictionary matching word. Since the distribution of the pr...
متن کاملWord Segmenter for Chinese Micro-blogging Text Segmentation - Report for CIPS-SIGHAN'2014 Bakeoff
This paper presents our system for the CIPSSIGHAN-2014 bakeoff task of Chinese word segmentation. This system adopts a characterbased joint approach, which combines a character-based generative model and a character-based discriminative model. To further improve the performance in cross-domain, an external dictionary is employed. In addition, pre-processing and post-processing rules are utilize...
متن کاملA Character-Based Joint Model for CIPS-SIGHAN Word Segmentation Bakeoff 2010
This paper presents a Chinese Word Segmentation system for the closed track of CIPS-SIGHAN Word Segmentation Bakeoff 2010. This system adopts a character-based joint approach, which combines a character-based generative model and a character-based discriminative model. To further improve the crossdomain performance, we use an additional semi-supervised learning procedure to incorporate the unla...
متن کامل