Which is More Suitable for Chinese Word Segmentation , the Generative Model or the Discriminative One ? F ∗

نویسندگان

  • Kun Wang
  • Chengqing Zong
  • Keh-Yih Su
چکیده

Since the traditional word-based n-gram model, a generative approach, cannot handle those out-of-vocabulary (OOV) words in the testing-set, the character-based discriminative approach has been widely adopted recently. However, this discriminative model, though is more robust to OOV words, fails to deliver satisfactory performance for those in-vocabulary (IV) words that have been observed before. Having analyzed the wordbased approach, its capability to handle the dependency between adjacent characters within a word, which is believed that the human adopts for doing segmentation, is found to account for its excellent performance for those IV words. To incorporate the intra-word characters dependency, a character-based approach with a generative model is thus proposed in this paper. The experiments conducted on the second SIGHAN Bakeoffs have shown that the proposed model not only achieves a good balance between those IV words and OOV words, but also outperforms the above-mentioned well-known approaches under the similar conditions.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Which is More Suitable for Chinese Word Segmentation, the Generative Model or the Discriminative One?

Since the traditional word-based n-gram model, a generative approach, cannot handle those out-of-vocabulary (OOV) words in the testing-set, the character-based discriminative approach has been widely adopted recently. However, this discriminative model, though is more robust to OOV words, fails to deliver satisfactory performance for those in-vocabulary (IV) words that have been observed before...

متن کامل

A Character-Based Joint Model for Chinese Word Segmentation

The character-based tagging approach is a dominant technique for Chinese word segmentation, and both discriminative and generative models can be adopted in that framework. However, generative and discriminative character-based approaches are significantly different and complement each other. A simple joint model combining the character-based generative model and the discriminative one is thus p...

متن کامل

Integrating Surface and Abstract Features for Robust Cross-Domain Chinese Word Segmentation

Current character-based approaches are not robust for cross domain Chinese word segmentation. In this paper, we alleviate this problem by deriving a novel enhanced character-based generative model with a new abstract aggregate candidate-feature, which indicates if the given candidate prefers the corresponding position-tag of the longest dictionary matching word. Since the distribution of the pr...

متن کامل

Word Segmenter for Chinese Micro-blogging Text Segmentation - Report for CIPS-SIGHAN'2014 Bakeoff

This paper presents our system for the CIPSSIGHAN-2014 bakeoff task of Chinese word segmentation. This system adopts a characterbased joint approach, which combines a character-based generative model and a character-based discriminative model. To further improve the performance in cross-domain, an external dictionary is employed. In addition, pre-processing and post-processing rules are utilize...

متن کامل

A Character-Based Joint Model for CIPS-SIGHAN Word Segmentation Bakeoff 2010

This paper presents a Chinese Word Segmentation system for the closed track of CIPS-SIGHAN Word Segmentation Bakeoff 2010. This system adopts a character-based joint approach, which combines a character-based generative model and a character-based discriminative model. To further improve the crossdomain performance, we use an additional semi-supervised learning procedure to incorporate the unla...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2012