Discriminative Boosting from Dictionary and Raw Text - A Novel Approach to Build A Chinese Word Segmenter

Authors

  • Fandong Meng
  • Wenbin Jiang
  • Hao Xiong
  • Qun Liu
Abstract

Chinese word segmentation (CWS) is a basic and important task in Chinese information processing. Standard approaches to CWS treat it as a sequence labelling task. Without manually annotated corpora, these approaches are ineffective. When a dictionary is available, dictionary maximum matching (DMM) is a good alternative. However, its performance is far from perfect because it recognizes out-of-vocabulary (OOV) words poorly. In this paper, we propose a novel approach that integrates the advantages of discriminative training and DMM to build a high-quality word segmenter with only a dictionary and raw text. Experiments on CWS in different domains show that, compared with DMM, our approach brings significant improvements in both the news domain and the Chinese medicine patent domain, with error reductions of 21.50% and 13.66%, respectively. Furthermore, our approach increases the recall of OOV words by 42.54% and 23.72% in the two domains, respectively.
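
For readers unfamiliar with the DMM baseline mentioned in the abstract, the following is a minimal sketch of forward maximum matching, one common variant of dictionary maximum matching. It is not the authors' implementation: the function name `forward_max_match`, the `max_len` parameter, and the single-character fallback for unmatched text are assumptions made for illustration.

```python
def forward_max_match(text, dictionary, max_len=None):
    """Greedy left-to-right segmentation (illustrative sketch, not the paper's code).

    At each position, take the longest substring found in `dictionary`.
    If nothing matches, emit a single character (assumed OOV fallback).
    """
    if max_len is None:
        # Longest dictionary entry bounds the candidate window.
        max_len = max((len(w) for w in dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a dictionary hit.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                break
        else:
            j = i + 1  # no dictionary word starts here: emit one character
        words.append(text[i:j])
        i = j
    return words


if __name__ == "__main__":
    toy_dictionary = {"研究", "研究生", "生命", "命", "的", "起源"}
    print(forward_max_match("研究生命的起源", toy_dictionary))
```

On this toy dictionary the greedy match picks "研究生" first and so splits the sentence as 研究生 / 命 / 的 / 起源 rather than 研究 / 生命 / 的 / 起源. The fallback branch also shows why words missing from the dictionary end up broken into single characters, which is the OOV weakness the abstract describes.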

Similar resources

A Web-based Approach To Chinese Word Segmentation

Chinese text processing requires the detection of word boundaries. This is a non-trivial step because Chinese does not contain explicit whitespace between words. Existing word segmentation techniques make use of precompiled dictionaries and treebanks. The creation of dictionaries and treebanks is a labor-intensive process and consequently they are updated infrequently. Furthermore, due to their...

Word Segmenter for Chinese Micro-blogging Text Segmentation - Report for CIPS-SIGHAN'2014 Bakeoff

This paper presents our system for the CIPS-SIGHAN-2014 bakeoff task of Chinese word segmentation. This system adopts a character-based joint approach, which combines a character-based generative model and a character-based discriminative model. To further improve the performance in cross-domain, an external dictionary is employed. In addition, pre-processing and post-processing rules are utilize...

A Maximum Entropy Approach to Chinese Word Segmentation

We participated in the Second International Chinese Word Segmentation Bakeoff. Specifically, we evaluated our Chinese word segmenter in the open track, on all four corpora, namely Academia Sinica (AS), City University of Hong Kong (CITYU), Microsoft Research (MSR), and Peking University (PKU). Based on a maximum entropy approach, our word segmenter achieved the highest F measure for AS, CITYU, ...

Exploiting Shared Chinese Characters in Chinese Word Segmentation Optimization for Chinese-Japanese Machine Translation

Unknown words and word segmentation granularity are two main problems in Chinese word segmentation for Chinese-Japanese Machine Translation (MT). In this paper, we propose an approach of exploiting common Chinese characters shared between Chinese and Japanese in Chinese word segmentation optimization for MT aiming to solve these problems. We augment the system dictionary of a Chinese segmenter b...

Chinese Word Segmentation for Terrorism-Related Contents

In order to analyze security and terrorism related content in Chinese, it is important to perform word segmentation on Chinese documents. There are many previous studies on Chinese word segmentation. The two major approaches are statistic-based and dictionary-based approaches. The pure statistic methods have lower precision, while the pure dictionary-based method cannot deal with new words and ...

Publication date: 2012